DataPub #3: Solving Urban Mysteries with Open Data

Learn how this month's guest speakers solve "urban mysteries" - like subway ridership trends and regional property history - and make public data accessible and usable for citizens, organizations, and policymakers.

We just wrapped up DataPub #3, where our guest speakers, Chris Whong and Jonathan Leek, shared how they use open datasets to solve “urban mysteries” in their respective locales, including: how many people ride the NYC subway system and how does ridership vary across burroughs (and before and after COVID-19)? How do we make St. Louis property data accessible and usable for citizens and policy makers?

If you were able to join us for our May meetup, thank you! If you missed it or want to re-watch, check out the recording and detailed recap below.

Recording of the May DataPub meetup

We hope to see you at this month’s event: DataPub #4: Cataloguing the World: Medicine, Tweets, and Beyond.

RSVP to join us Tuesday, June 16th (1pm PT / 4pm ET / 8pm GMT)
If you can’t attend live, register anyway, and I’ll send you the recording and resources shortly following the session.

Guest Speakers and Session Summary

Prashant Sridharan, Timescale VP Marketing and open data enthusiast, spent a few minutes welcoming everyone and sharing his own personal experiences exploring open datasets. Over the past few months, he’s become fascinated with public data and what it makes possible (he likens combining public datasets to the way he felt as a kid playing with LEGOS: turning a jumble of blocks into a village, ship, or other imaginative creation).

You can see a bit of his work on this topic in Charting the spread of COVID-19 using data and Public Dataset Tips & Tricks: How to weave together public datasets to make sense of the world.

Speaker #1 Chris Whong, Outreach engineer at Qri: “Taming the MTA’s Unruly Data”

Chris is a long-time urbanist, mapmaker, and data junkie, and he’s spent much of his career in data-related roles and working with various public datasets. He’s based in New York City, arguably one of the most progressive cities for open data (New York City introduced the Open Data Law in 2012, mandating that all city agency data be fully accessible to the public) – and his locale forms the basis of his talk.

The New York City subway system is an iconic and integral part of the city, and Chris was keen to understand ridership trends across burroughs and the possible impacts of demand on an aging system. While the New York Metropolitan Transportation Authority (MTA) publishes turnstile data that details weekly NYC subway station entry and exit data dating back to 2010, these are massive 27MB CSV files with 200K+ rows and millions of values.

In other words, when you download the weekly datasets, they’re messy and hard to read; the turnstile counters show the reading at a moment in time and singular rows aren't valuable by themselves. Sussing out insights requires multiple calculations (using multiple readings) to make sense of the data before you can start to run your analysis.

Chris shares the state of the NY MTA raw data files (and why they’re so tricky to parse)

In his talk, Chris walks us through how he’s built a PostgreSQL pipeline (you can find the GitHub repository here) to clean and aggregate the data. He shares techniques and best practices for dumping new weekly data into a database (14M data entries) to running calculations to clean it up (e.g., SQL queries to join time and date data and creating unique identifiers for multiple variables).

Example of raw MTA turnstile data CSV file‌‌

MTA turnstile data after Chris wrangles and cleans it

After generating a clean table, his pipeline runs aggregation queries to sum entries and exit counts for turnstiles and stations on a daily basis – reducing 14M records to 2M. From there, he can start to analyze the data, ask meaningful questions, and identify trends.

Chris works as an outreach engineer at Qri - an open source project for versioning and sharing datasets (aka “NPM for datasets”) - and, each Saturday, he pushes each week’s CSV update to Qri Cloud as a new commit, where anyone can discover and use it.

Follow Chris on Twitter for his weekly updates and to see how New York City subway ridership trends continue to evolve over the coming months.

Chris’ weekly Twitter updates always feature a quick trend graph for quick reference

Find more information on Chris’ blog: Taming the MTA's Unruly Turnstile Data

Link to Chris' slides

Speaker #2, Jonathan Leek, Data Architect at the St Louis Regional Data Alliance, “Building a Regional Analytics Database”

Jonathan is a Data Architect with the St. Louis Regional Data Alliance, a non-profit organization that’s focused on making data more accessible and usable in the St. Louis region.

His team is working on the solution to a big urban conundrum: “Why can we easily find the history of a car by its VIN number, but huge purchases like homes or commercial buildings are a blackbox? Where can we find the history of inspections, construction, and other relevant lot and structure information? Why can’t we have a CarFax for housing?”

Echoing Chris’ talk about the messiness of NYC’s transit data, Jonathan kicks off explaining - using St. Louis as a case example - the reality of open data. In municipalities, each entity or department typically uses its own database, identifiers, and technologies, making it difficult to consume and correlate public data in a useful or comprehensive way.

The Reality of Open Data

To solve this problem, his team is building “The Regional Entity Database,” an open source ETL infrastructure and analytical database to give everyone the ability to access public data in a human-readable format. Right now, his team is focused on residential and commercial property data, but the Regional Entity Database framework and schema can be extended to all types of public data and to all municipalities.

The St. Louis Regional Entity Database

Jonathan details how they’re using PostgreSQL, Apache Airflow, AWS (VM, S3), HashiCorp Terraform, and various other technologies to set up the project, store data, schedule jobs, handle errors, and continually update the database.

St. Louis Regional Data Alliance reference architecture example (Terraform module for creating an Apache Airflow cluster in AWS).

To explore the project, try it on your own, or get involved in their work, check out the St. Louis Regional Data Alliance data pipeline and architecture on GitHub.

Link to Jonathan’s slides

In closing

Thank you to everyone who helped make this meetup fun and engaging, from our speakers Chris Whong and Jonathan Leek, to all the people that registered (and/or attended live), and the Timescale team members that played a role in bringing DataPub to life.

Our next session is on Tuesday, June 16th (1pm PT/4pm ET/ 8pm GMT). Join us if you’d like to learn more about open data, are already part of the community, or are simply looking for a way to connect with new technical folks - everyone’s welcome :).

June featured guests:

Tudor Oprea, Professor of Medicine for the Department of Internal Medicine at University of New Mexico School of Medicine, will tell us about DrugCentral: a fully annotated resource that catalogs pharmaceutical drug data from the US FDA and other regulatory agencies and maps active ingredients, side effects, genomic signatures, and more.
Suhem Parack, Developer Relations for Twitter Academic Research, will dive into Twitter Developer Labs and its applicability to the open and public data community.

RSVP here.

We can’t wait to see you all in June!

If you have an open data project that you would like to share with the rest of the community, please reach out to events@timescale.com