In 2019, Digitalis Commons began exploring the mining of public data sets to identify patterns of value in biotech and related domains. Starting with the OpenFuego open source project that came out of the Nieman Journalism Lab, the Commons team moved on to create an entirely new application that pulls public data from Twitter APIs and scores the URLs mentioned there to identify resources of interest to a given community.

That work had been running at synthesis.bio, an automatically curated list of the most interesting web pages among a community of Twitter users with a strong interest in biotech. We've recently expanded the Synthesis project to support multiple channels around different topics, including Covid-19. That work is now here at synthesis.digitaliscommons.org.

How we built this

The project is maturing rapidly and Digitalis Commons plans to release it as an open source project later this year, after adding further capabilities. The project is written in Python and makes use of the rich open source ecosystem for data projects, including Pandas, the ubiquitous toolset for manipulating columnar data in memory at high speed. The project runs in Docker containers deployed on Amazon's serverless cloud computing infrastructure, and operates at remarkably low cost, illustrating the extraordinary potential for the creation and operation of next-generation data analytics platforms in the cloud: cheap, fast... and good.

Technology Stack

The Synthesis technology stack includes:
  • Python services listening to Twitter feeds
  • Docker containers running the Python code
  • AWS serverless infrastructure running the Docker containers
  • S3 buckets collecting the data in JSON files
  • GatsbyJS generating a static site
  • React running dynamic web pages
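
The path from collected tweets to the JSON files that the static site reads can be sketched roughly as follows. This is only an illustration, not the project's actual code: the record layout, the function names (score_urls, to_channel_json), and the scoring rule (one point per distinct user who shared a URL) are all assumptions made for the example, and it uses only the Python standard library for brevity where the project itself relies on Pandas.

```python
import json
from collections import defaultdict


def score_urls(tweets):
    """Score each URL by the number of distinct users who shared it.

    Hypothetical scoring rule, chosen only for illustration.
    """
    users_by_url = defaultdict(set)
    for tweet in tweets:
        for url in tweet["urls"]:
            users_by_url[url].add(tweet["user"])
    scores = {url: len(users) for url, users in users_by_url.items()}
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def to_channel_json(channel, ranked, top_n=10):
    """Serialize the top-ranked URLs for one channel as a JSON document."""
    payload = {
        "channel": channel,
        "items": [{"url": url, "score": score} for url, score in ranked[:top_n]],
    }
    return json.dumps(payload, indent=2)


# Made-up sample records; real input would come from the Twitter APIs.
sample_tweets = [
    {"user": "alice", "urls": ["https://example.org/a"]},
    {"user": "bob", "urls": ["https://example.org/a", "https://example.org/b"]},
    {"user": "alice", "urls": ["https://example.org/a"]},  # repeat share, counted once
]

ranked = score_urls(sample_tweets)
print(to_channel_json("biotech", ranked))
```

Counting distinct users rather than raw mentions keeps a single prolific account from dominating a channel; a production scorer would likely also weigh recency, which this sketch leaves out.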