In 2019, Digitalis Commons began exploring the mining of public data sets
to identify patterns of value in biotech and related domains. Starting with
the OpenFuego open source project that came out of the Nieman Journalism
Lab, the Commons team moved on to create an entirely new application that
pulls public data from Twitter APIs and scores mentioned URLs to identify
resources of interest to a given community.
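
As a rough illustration of the scoring idea, the sketch below ranks URLs by how many distinct community members mentioned them. The tweet structure, the `score_urls` name, and the distinct-user rule are assumptions made for illustration only; the actual Synthesis scoring method is not described here.

```python
from collections import defaultdict

def score_urls(tweets):
    """Rank URLs by the number of distinct users who mentioned them.

    `tweets` is an iterable of dicts with hypothetical keys
    "user" (str) and "urls" (list of str).
    """
    sharers = defaultdict(set)
    for tweet in tweets:
        for url in tweet["urls"]:
            sharers[url].add(tweet["user"])
    scores = {url: len(users) for url, users in sharers.items()}
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

sample = [
    {"user": "alice", "urls": ["https://example.org/a"]},
    {"user": "bob", "urls": ["https://example.org/a", "https://example.org/b"]},
    {"user": "alice", "urls": ["https://example.org/a"]},  # repeat share, counted once
]
ranked = score_urls(sample)
```

Counting distinct users rather than raw mentions keeps one prolific account from dominating the list, which is one plausible choice for a community-curated ranking.
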
That work originally ran at synthesis.bio, an automatically curated
list of the most interesting web pages among a community of Twitter users
with a strong interest in biotech. We've recently expanded the Synthesis
project to support multiple channels around different topics, including
Covid-19. That work is now here at synthesis.digitaliscommons.org.
How we built this
The project is maturing rapidly, and Digitalis Commons plans to release it
as an open source project later this year, after adding further
capabilities. The project is written in Python and makes use of the rich
open source ecosystem for data projects, including Pandas, the
ubiquitous toolset for manipulating columnar data in memory at high speed.
The project runs in Docker containers deployed on Amazon's serverless cloud
computing infrastructure and operates at remarkably low cost, illustrating
the extraordinary potential for creating and operating next-generation
data analytics platforms in the cloud: cheap, fast... and good.
The Synthesis technology stack includes:
- Python services listening to Twitter feeds
- Docker containers running the Python code
- AWS serverless compute running the Docker containers
- S3 buckets collecting the data in JSON files
- GatsbyJS generating a static site
- React running dynamic web pages
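
To make the data flow in this list a little more concrete, here is a minimal sketch of how a listener might serialize scored URLs into a JSON document of the kind a static site generator could read. The field names, the `biotech` channel value, the `build_channel_document` helper, and the object key layout are all hypothetical, and the upload to S3 itself is omitted.

```python
import json
from datetime import date

def build_channel_document(channel, ranked_urls, day):
    """Return (object_key, json_text) for one channel's daily results.

    The key layout and field names are illustrative assumptions,
    not the format Synthesis actually uses.
    """
    key = f"{channel}/{day.isoformat()}.json"
    body = {
        "channel": channel,
        "date": day.isoformat(),
        "items": [{"url": url, "score": score} for url, score in ranked_urls],
    }
    return key, json.dumps(body, indent=2)

key, text = build_channel_document(
    "biotech",
    [("https://example.org/a", 2), ("https://example.org/b", 1)],
    date(2020, 5, 1),
)
```

Writing one self-describing JSON file per channel and day would let the site build read plain files without querying a database, which fits a low-cost static deployment.
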