Crowdsourcing Civic Data — A Playbook of Tools and Techniques

Chris Whong
Published in qri.io
Dec 17, 2020

At Qri, we’ve been organizing a volunteer effort to manually geocode records in a 5,000-row dataset. Each record is a mini research project: some have a discrete address, while others are found through local knowledge or Google sleuthing. In this post, we’ll cover why we had to call on the crowd to accomplish this task, and share some of the tools and techniques we use to keep things organized.

A call for volunteers to help geocode capital projects in our newly-scraped dataset

The Data White Whale: Mapping New York City Capital Projects

I recently had the opportunity to scrape and do some simple analysis of a fascinating dataset, NYC Capital Project Detail Data. In this dataset are details of the billions of dollars my city is spending on capital improvements. This includes water and sewer mains, park improvements, building construction and renovation, etc. For a city this large, it’s a lot to take in (there are over 5000 projects happening simultaneously at the moment and the source PDFs are published as 5 volumes), but this dataset, and the PDFs it was scraped from, represent the best comprehensive project-level data available to the public.

One big problem: The projects aren’t mappable. Not yet anyway. The information system these projects are tracked in does not keep track of spatial data (a point, line, or polygon geometry). Rather, projects are associated with one or more of New York’s 59 community districts, giving some idea of the applicable geographic benefits, but not allowing for precise “mapping of the budget”. Furthermore, some projects are deemed to have borough-wide or citywide geographic impact (think computer systems, fleets of vehicles, or other less construction-y investments).

“Everything happens somewhere” is an old saying in GIS circles, along with “all data are spatial”. In the prior blog post about scraping the data and the possibilities it could unlock, I attempted to manually assign point geometries to every project I could research in a single community district. To make this dataset spatial, I added two more columns, latitude and longitude, and then googled keywords in the project descriptions to see if I could figure out where a project was happening. This is easy for libraries and public housing projects with discrete names, but not so much for sewer projects that may have obscure acronyms or other jargon-y descriptions. (But now I know what “guniting” is.)
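As a rough illustration of the "two more columns" step, here is a minimal Python sketch that appends empty latitude and longitude columns to a CSV so they can be filled in by hand later. The sample column names (`project_id`, `description`) are made up for the example and are not the real dataset's schema.

```python
import csv
import io


def add_coordinate_columns(csv_text: str) -> str:
    """Return the CSV text with empty latitude and longitude columns appended."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or []) + ["latitude", "longitude"]

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Leave both values blank; a volunteer fills them in after research.
        row["latitude"] = ""
        row["longitude"] = ""
        writer.writerow(row)
    return out.getvalue()


# Hypothetical sample row, not taken from the real dataset.
sample = "project_id,description\nP-001,Library roof replacement\n"
print(add_coordinate_columns(sample))
```

Keeping the coordinates as two plain numeric columns means the result stays an ordinary CSV that any spreadsheet or mapping tool can read.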

The result is this bubble map showing the location and approximate total cost (dollars spent + next 5 years of planned spend) for a few dozen projects:

A map of capital projects in Brooklyn Community District 6

This is a fascinating overview of projects that the system of record says will benefit my local community. All it took was a couple of hours of research, looking for clues in the project descriptions to figure out where they were located. Google, Google Maps, Community News Sites, Blogs, and whatever else I could pull up on the internet were the tools of the trade.

So… what would it take to do this for the whole city? We’re not talking millions of rows. There are only 5,000 projects, and only 4,000 or so of them are non-citywide. Once we have the entire dataset geocoded, the door opens to more thorough analysis, along with a proper web application for anyone to search and explore every capital project. Someday…

Divide & C̶o̶n̶q̶u̶e̶r̶ Geocode

The new dataset contains a column community_boards_served, making it easy to filter for each of the 59 districts. Community districts are a known geography in NYC, and are a simple way to assign a subset of projects to a volunteer. I did two myself, so now all I need is 57 more civic data nerds to repeat the process!
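The split by district could be sketched in Python as below. The `community_boards_served` column name comes from the dataset, but the value format is an assumption: the sketch supposes a comma-separated list of district codes such as `BK06, BK07`, and the codes and project IDs shown are hypothetical.

```python
from collections import defaultdict


def split_by_district(rows, column="community_boards_served"):
    """Group project rows by district; a project may land in several groups."""
    groups = defaultdict(list)
    for row in rows:
        # Assumed format: comma-separated district codes in a single cell.
        for code in row[column].split(","):
            code = code.strip()
            if code:
                groups[code].append(row)
    return dict(groups)


# Hypothetical rows for illustration only.
projects = [
    {"project_id": "P-001", "community_boards_served": "BK06"},
    {"project_id": "P-002", "community_boards_served": "BK06, BK07"},
    {"project_id": "P-003", "community_boards_served": ""},
]

for district, rows in sorted(split_by_district(projects).items()):
    print(district, [row["project_id"] for row in rows])
```

A project that serves several districts shows up in each of their lists, so two volunteers may see the same row. That is harmless here because the coordinates only need to be entered once.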

Airtable — Data Editing

Airtable is an extremely useful way to make little filtered views of a larger dataset. It takes some clicking, but after uploading the original dataset and adding two numeric columns for the latitude and longitude, I just need to make a view for each volunteer to work through.

Airtable views are a great way to divide up this large dataset into digestible lists for each volunteer to work on

On the left we have a big list of views, one for each community district, and a few more to track overall progress. When a volunteer claims a district, it’s easy for them to drop into the Airtable, find their view, and get to work!

One caveat: everyone working in this dataset has “Editor” access, which could be chaotic. It’s not really a controlled environment, and in theory users could mess up other people’s records, modify fields from the source data, etc. It hasn’t been a problem at all yet, as the task at hand is pretty simple and people are only really editing two fields for each row. I don’t share the link for this Airtable publicly; it is only shared with people who have committed to working on the project.

Google Docs — Instructions

We use a google doc for instructions. This effort doesn’t have a website (yet!), so a google doc serves as easy-to-edit and easy-to-share content. I have it sitting ready in my bookmarks bar, and the instant someone expresses interest in helping out, I fire off this URL.

Inside is a quickly digestible background of the project, along with a step-by-step guide to geocoding and to communicating with the organizer (me). We even have a notes section with tips from other volunteers that may come in handy.

A google doc serves as “home base” for people who want to contribute to the project
A tweet with tips for looking up NYC capital projects using third-party tools

Google Sheets — Assignment Tracking

I could have just as easily used another Airtable for this (and probably will). For now, I use a sheet to keep track of who has claimed or been assigned to each community district. It’s a bit chaotic, but I use background color to track what is in progress/done, and can get a good idea of how many districts are yet to be claimed. This sheet is also easily embeddable into the instructions doc, and lets new volunteers know what is still available and how much is left to go.

A google sheet helps keep track of who has claimed which district to work on, and how much progress they have made. Keeping this up to date takes meticulous attention!
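The same tally could be computed rather than read off background colors. This is a small Python sketch, assuming each claimed district is recorded with a volunteer name and a status of either `in progress` or `done`; those labels, the district codes, and the volunteer names are all hypothetical.

```python
from collections import Counter

TOTAL_DISTRICTS = 59  # New York City's community districts


def summarize_assignments(assignments):
    """Count districts that are done, in progress, or not yet claimed."""
    counts = Counter(status for _volunteer, status in assignments.values())
    return {
        "done": counts["done"],
        "in progress": counts["in progress"],
        "unclaimed": TOTAL_DISTRICTS - len(assignments),
    }


# Hypothetical claims: district code -> (volunteer, status).
assignments = {
    "BK06": ("volunteer_a", "done"),
    "QN01": ("volunteer_b", "in progress"),
}
print(summarize_assignments(assignments))
```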

Twitter, LinkedIn, Facebook, Medium — Recruiting

The original blog post about the data scraping included a call to action at the end that dreamed big about a crowdsourced effort to map the whole dataset. This one will include the same. Aside from that, we regularly post on social media asking for volunteers to take on a bite-size chunk of the dataset. (We even discussed this effort in a recent community call and put it on YouTube.)

I like to frame it as a great way to get involved in a community data project without much skill or time commitment. Literally anyone can look up these projects. A bit of local knowledge of NYC helps, but isn’t required.

Ask for help with small, manageable tasks, performed in the open… the community will show up!

It’s also a lot of fun to look up projects. You can learn a lot about the various entities that improve the city’s infrastructure, such as Business Improvement Districts (BIDs) and Local Development Corporations (LDCs), as well as the myriad cultural institutions that get capital funding.

In-progress mapping of capital projects in Queens, NY

qri.cloud — Archival and Publishing

Managing data in Airtable can be chaotic, especially with many cooks in the kitchen. Qri is a version control system for datasets, and is well-suited for creating snapshots of the data from time to time.

I’ve created a Node.js script that pulls the data from the Airtable API, transforms it into a CSV and saves it to disk, then executes a commit on the qri dataset chriswhong/nyc_capital_project_data_geom. If the commit is successful (meaning there was some change from the last version), it then pushes the dataset to qri.cloud, which you can see here. The dataset, along with its version history, is published and available to all.
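The actual script is written in Node.js and isn't reproduced here. As a rough illustration of the same flow, here is a Python sketch under stated assumptions: records are read from Airtable's REST API by following its `offset` paging cursor, and the commit and push are done by shelling out to the qri CLI as `qri save --body <file> <dataset>` and `qri push <dataset>`. The function names are hypothetical, and the sketch assumes `qri save` exits with a non-zero status when nothing has changed.

```python
import csv
import json
import subprocess
import urllib.parse
import urllib.request

AIRTABLE_URL = "https://api.airtable.com/v0/{base_id}/{table}"
DATASET = "chriswhong/nyc_capital_project_data_geom"


def fetch_page(base_id, table, token, offset=None):
    """Fetch one page of records from the Airtable REST API."""
    url = AIRTABLE_URL.format(base_id=base_id, table=urllib.parse.quote(table))
    if offset:
        url += "?" + urllib.parse.urlencode({"offset": offset})
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def fetch_all_records(get_page):
    """Follow the `offset` cursor until every record's fields are collected."""
    records, offset = [], None
    while True:
        page = get_page(offset)
        records.extend(record["fields"] for record in page["records"])
        offset = page.get("offset")
        if not offset:
            return records


def write_csv(records, path):
    """Write records to disk as CSV, leaving missing fields blank."""
    # Airtable omits empty fields, so use the union of all field names.
    fieldnames = sorted({key for record in records for key in record})
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)


def commit_and_push(path, run=subprocess.run):
    """Commit the CSV with the qri CLI; push only if the commit succeeded."""
    saved = run(["qri", "save", "--body", path, DATASET])
    if saved.returncode != 0:
        # Assumed to mean "nothing changed since the last version".
        return False
    run(["qri", "push", DATASET], check=True)
    return True
```

To run it end to end, you would pass `fetch_all_records` a callable such as `lambda offset: fetch_page(base_id, table, token, offset)` with your own base ID, table name, and token, then call `write_csv` and `commit_and_push` on the result. Passing the page fetcher and the command runner in as arguments keeps the paging and commit logic testable without network access or a qri install.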

qri.cloud profile page for the in-progress capital project geometry edits

You can download it, pull it if you have Qri Desktop, or connect to it via our API. You can also explore older versions and will soon be able to get a report of what changed from version to version. It’s easy and free to start publishing your own data on qri.cloud. If you have a CSV that’s tired of being hidden from the world on your computer, consider publishing it with Qri.

Next Steps

At the time of writing, we have just over 1,000 projects geocoded! This is tremendous progress for this low-tech distributed effort we’ve thrown together over the past week. The recipe we’ve outlined here appears to be working out, and we’re always looking for additional hands. If you’d like to help out, take a look at the instructions and reach out!

Chris Whong: Urbanist, Mapmaker, & Data Junkie. Outreach Engineer at Qri.io