Open Data: Messy and Great

This morning, Walk Score launched some exciting new features. The new Transit Score algorithm measures how well a location is served by public transportation. Nicely coupled with Transit Score are commute reports, which make it a snap to investigate daily commute options and costs.

What excites me most about these new features is that they all make heavy use of open data.

Transit Score and commute reports are built on top of GTFS, the General Transit Feed Specification. GTFS is a straightforward file format that transit agencies can use to surface their schedules. Today, over 100 agencies provide their data to the public in GTFS format. We make use of a community-maintained clearinghouse, the GTFS Data Exchange, to automatically acquire newly released transit feeds and keep our copies up to date. Without these formats and services in place, Transit Score simply wouldn’t be possible.

The GTFS specification is tightly written, with little room for interpretation. With such careful thought behind the specification, one might expect that working with actual GTFS data is relatively straightforward. Alas, nothing could be further from the truth. While the GTFS specification is clean, the data surfaced by public transit agencies is anything but. For our Public Transit API, we’ve done the hard work of dealing with the myriad ways that good feeds can go bad.

Examples abound of how GTFS can go wrong:

Many feeds contain stops serviced by no routes, or (more baffling) routes that service no stops.
Agencies keep “test” routes in their publicly distributed feeds. St. Louis’ (older) data contains a bus that takes a safari through Gashaka-Gumti National Park in Nigeria.
Pittsburgh and other agencies abuse GTFS’s multiple mechanisms for specifying calendar dates. This leads to bizarre circumstances where, for example, a route might run ten days per week!
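
To make the first item concrete, here is a minimal sketch of how orphaned stops and routes could be detected. It is not our production code. It assumes the feed's stops.txt, routes.txt, trips.txt, and stop_times.txt tables have already been read as CSV text; the helper names `read_table` and `find_orphans` and the tiny sample feed are made up for illustration.

```python
import csv
import io


def read_table(text):
    """Parse one GTFS table (CSV text with a header row) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def find_orphans(stops, routes, trips, stop_times):
    """Return (stop_ids served by no route, route_ids that serve no stop)."""
    # trips.txt links each trip to its route.
    route_of_trip = {t["trip_id"]: t["route_id"] for t in trips}

    served_stops = set()
    routes_with_stops = set()
    for st in stop_times:
        route_id = route_of_trip.get(st["trip_id"])
        if route_id is None:
            continue  # stop_time points at an unknown trip; skipped in this sketch
        served_stops.add(st["stop_id"])
        routes_with_stops.add(route_id)

    orphan_stops = {s["stop_id"] for s in stops} - served_stops
    orphan_routes = {r["route_id"] for r in routes} - routes_with_stops
    return orphan_stops, orphan_routes


# Hypothetical sample feed: stop S3 is never visited, and route R2 has a
# trip with no stop_times at all.
stops = read_table("stop_id,stop_name\nS1,Main St\nS2,Oak Ave\nS3,Nowhere\n")
routes = read_table("route_id,route_short_name\nR1,10\nR2,99\n")
trips = read_table("trip_id,route_id,service_id\nT1,R1,WK\nT2,R2,WK\n")
stop_times = read_table("trip_id,stop_id,stop_sequence\nT1,S1,1\nT1,S2,2\n")

orphan_stops, orphan_routes = find_orphans(stops, routes, trips, stop_times)
print(orphan_stops, orphan_routes)
```

Whether to drop the orphans or merely flag them is a judgment call that depends on the feed; the sketch only finds them.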

These examples are easy to explain and, once identified, have straightforward (if delicate) fixes; we dealt with many more technically challenging issues along the way, too.
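
As one illustration of such a fix, the calendar problem can be tamed by expanding a service into a set of concrete dates, so that a day declared twice still counts once. The sketch below assumes the standard calendar.txt columns (weekday flags, `start_date`, `end_date`) and calendar_dates.txt columns (`date`, `exception_type`, where 1 adds service and 2 removes it). The function name `service_dates` and the sample rows are hypothetical, and this is not necessarily how our API handles it.

```python
from datetime import datetime, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def parse_date(text):
    """GTFS dates are written as YYYYMMDD."""
    return datetime.strptime(text, "%Y%m%d").date()


def service_dates(service_id, calendar, calendar_dates):
    """Return the set of dates on which the given service actually runs."""
    active = set()

    # Regular weekly pattern from calendar.txt.
    for row in calendar:
        if row["service_id"] != service_id:
            continue
        day = parse_date(row["start_date"])
        end = parse_date(row["end_date"])
        while day <= end:
            if row[WEEKDAYS[day.weekday()]] == "1":  # weekday(): Monday is 0
                active.add(day)
            day += timedelta(days=1)

    # Exceptions from calendar_dates.txt: 1 adds a date, 2 removes it.
    for row in calendar_dates:
        if row["service_id"] != service_id:
            continue
        day = parse_date(row["date"])
        if row["exception_type"] == "1":
            active.add(day)  # adding an already-active date is a no-op
        elif row["exception_type"] == "2":
            active.discard(day)

    return active


# Hypothetical feed that declares Monday through Friday twice: once as a
# weekly pattern and again as five explicit "added" dates.
calendar = [{
    "service_id": "WK", "monday": "1", "tuesday": "1", "wednesday": "1",
    "thursday": "1", "friday": "1", "saturday": "0", "sunday": "0",
    "start_date": "20110103", "end_date": "20110109",
}]
calendar_dates = [
    {"service_id": "WK", "date": d, "exception_type": "1"}
    for d in ["20110103", "20110104", "20110105", "20110106", "20110107"]
]

week = service_dates("WK", calendar, calendar_dates)
print(len(week))  # 5 distinct days, not 10
```

Counting rows instead of distinct dates is what produces the ten-day week; the set makes the double declaration harmless.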

Of course, it’s not just the data that can go bad; mechanisms for acquiring data can go bad, too. The GTFS Data Exchange is a wonderful resource that has changed the nature of internet transit geekery, but it is not without its problems:

GTFSDE’s data model assumes that each agency has a single “latest” feed file. However, agencies in Philadelphia, New York, and Boston actually split their files between bus and rail.
GTFSDE’s data model also assumes that GTFS agency names are globally unique. This was recently proven false: Seattle’s Metro Transit collided with St. Paul’s, causing one city to overwrite the other’s transit data.
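
Both assumptions can be sidestepped on the consumer side by keying feeds on more than the agency name and by allowing several files per key. The following is only a sketch of that idea, not GTFSDE's data model or API. It assumes the `agency_name` and `agency_url` fields from each feed's agency.txt are available; `FeedRegistry`, `feed_key`, the file names, and the example.org URLs are all invented for illustration.

```python
def feed_key(agency_row):
    """Build a composite key so that same-named agencies do not collide."""
    name = agency_row["agency_name"].strip().lower()
    url = agency_row["agency_url"].strip().lower().rstrip("/")
    return (name, url)


class FeedRegistry:
    """Maps each agency key to a list of feed files (for example bus and rail)."""

    def __init__(self):
        self._feeds = {}

    def register(self, agency_row, feed_file):
        # A list per key means one agency may publish several "latest" files.
        files = self._feeds.setdefault(feed_key(agency_row), [])
        if feed_file not in files:
            files.append(feed_file)

    def feeds_for(self, agency_row):
        return list(self._feeds.get(feed_key(agency_row), []))

    def __len__(self):
        return len(self._feeds)


# Placeholder agencies: two share a name but not a URL.
seattle = {"agency_name": "Metro Transit", "agency_url": "http://seattle.example.org/"}
st_paul = {"agency_name": "Metro Transit", "agency_url": "http://stpaul.example.org"}

registry = FeedRegistry()
registry.register(seattle, "seattle_bus.zip")
registry.register(st_paul, "stpaul_bus.zip")
registry.register(st_paul, "stpaul_rail.zip")

print(len(registry))                # two agencies, no overwrite
print(registry.feeds_for(st_paul))  # both of the second agency's files
```

The URL is just one plausible disambiguator; a timezone or a bounding box derived from the stops would serve the same purpose.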

As consumers of the GTFSDE APIs, we’ve had to write fragile workarounds for these and other difficult issues. Luckily, the developer who built GTFSDE is awesome, aware of these issues, and near to open-sourcing the site. We’ll have some patches ready to go shortly thereafter.

Despite the wide range of problems we encountered with both GTFS and GTFSDE, open data was simply too valuable and rich to pass up. On a foundation of messy, overlapping, and broken data we successfully built a robust API that not only calculates transit scores, but also provides reliable query mechanisms against public transit stops and routes.

Walk Score isn’t solely interested in GTFS. The upcoming “Street Smart” Walk Score algorithm, currently in preview mode, is built on OpenStreetMap. OSM is an ambitious collaborative project that aims to collect the world’s geographic information into a single unified store. Similar to GTFS, OSM data is rich and valuable but fraught with problems that take time and diligence to uncover… and that take time and cleverness to work around.

One of the thorniest issues with OSM data is duplicate networks. At the moment, if you point the OSM map editor at Topeka, Kansas, you’ll see a fairly sane-looking street grid. Move a street out of the way, however, and you just might discover that there is another street in the exact same space. In fact, Topeka has one “full” network that is well connected, and one “ghost” network of streets that duplicates perhaps 30% of the full network. As you might imagine, this makes it quite challenging to decide which streets to route against. Street duplication also complicates the calculation of key urban planning metrics such as intersection density.
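
To show why ghost streets distort intersection counts, here is a simplified sketch. It assumes streets have already been broken into segments given as pairs of (latitude, longitude) endpoints, and that a ghost segment's endpoints match the real ones once rounded to five decimal places (roughly a meter). Real duplicates are rarely that tidy, so treat this as an illustration rather than the Street Smart algorithm; the function names and coordinates are made up.

```python
from collections import defaultdict


def segment_key(a, b, precision=5):
    """Direction-independent key for a segment, with rounded endpoints."""
    ra = (round(a[0], precision), round(a[1], precision))
    rb = (round(b[0], precision), round(b[1], precision))
    return tuple(sorted((ra, rb)))


def split_duplicates(segments, precision=5):
    """Return (unique segments, segments that repeat an earlier one)."""
    seen = {}
    duplicates = []
    for seg in segments:
        key = segment_key(seg[0], seg[1], precision)
        if key in seen:
            duplicates.append(seg)
        else:
            seen[key] = seg
    return list(seen.values()), duplicates


def count_intersections(segments, precision=5):
    """Count nodes where three or more segment ends meet."""
    degree = defaultdict(int)
    for a, b in segments:
        for node in segment_key(a, b, precision):
            degree[node] += 1
    return sum(1 for d in degree.values() if d >= 3)


# One straight street A-B-C, plus a ghost copy drawn in the other direction
# with slightly different coordinates.
A, B, C = (39.0500, -95.6800), (39.0500, -95.6790), (39.0500, -95.6780)
segments = [
    (A, B),
    (B, C),
    ((39.0500001, -95.6790001), (39.0500001, -95.6800001)),  # ghost of A-B
    ((39.0500001, -95.6780001), (39.0500001, -95.6790001)),  # ghost of B-C
]

unique, ghosts = split_duplicates(segments)
print(count_intersections(segments))  # B looks like a four-way intersection
print(count_intersections(unique))    # after deduplication, B is mid-street
```

Rounding is a blunt instrument: two endpoints that straddle a rounding boundary will not match, so a real pipeline would more likely compare distances against a tolerance.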

At scale, open data is not for the faint of heart or the impatient. But, with proper care and feeding, open data can very well be the foundation of exciting new technologies. GTFS and OSM represent the beginning. As governments make their data sets available, and nonprofits sunlight difficult-to-acquire information, I expect to see radical new innovation. What can you do with open data today?