The problem of unstable route ids in GTFS

by multimob — written on 2024-04-11


We really need to be able to run tests that match GTFS and OSM data as closely as possible, this time for bus route relations.

When we can achieve that, it will be much faster to download existing routes and check whether the tags and order of stops match existing GTFS data.

We wrote DLrelimport for that purpose: you feed it a master route relation and it inspects the local database. Remember, De Lijn provides about 1,280 routes in the GTFS with no fewer than 635 route numbers. There is no 1:1 match from the route number alone: there are 18 routes named "Bus 1" and 17 routes named "Bus 2". So we must find additional criteria to match on, such as the subnetwork, the destination or the colour. If it has a match, we can assume this is the same route and perform the tests. In most cases, tags are insufficient and the database finds multiple matches for the same relation.
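As a rough illustration of why the route number alone is not enough, here is a minimal sketch in Python. It is not the DLrelimport code: the rows, the route ids and the function name find_candidates are invented for the example, and mapping the OSM tags ref, colour and to onto the GTFS fields route_short_name, route_color and route_long_name is an assumption.

```python
# Hypothetical rows shaped like GTFS routes.txt (ids and names are invented).
GTFS_ROUTES = [
    {"route_id": "r-101", "route_short_name": "1",
     "route_long_name": "Station - Noord", "route_color": "991199"},
    {"route_id": "r-102", "route_short_name": "1",
     "route_long_name": "Station - Zuid", "route_color": "0044BB"},
    {"route_id": "r-103", "route_short_name": "2",
     "route_long_name": "Station - Oost", "route_color": "991199"},
]


def find_candidates(osm_tags, routes):
    """Narrow the GTFS routes step by step using the tags of an OSM relation."""
    # Step 1: the route number, which is rarely unique.
    found = [r for r in routes if r["route_short_name"] == osm_tags.get("ref")]

    # Step 2: the colour, if the relation has one (OSM uses "#RRGGBB").
    colour = osm_tags.get("colour", "").lstrip("#").upper()
    if colour:
        found = [r for r in found if r["route_color"].upper() == colour]

    # Step 3: the destination, assumed here to appear in the long name.
    destination = osm_tags.get("to", "").lower()
    if destination:
        found = [r for r in found if destination in r["route_long_name"].lower()]

    return found


by_number = find_candidates({"ref": "1"}, GTFS_ROUTES)
by_tags = find_candidates(
    {"ref": "1", "colour": "#0044BB", "to": "Zuid"}, GTFS_ROUTES
)
print(len(by_number), [r["route_id"] for r in by_tags])
```

With only the number, two candidates remain; the colour and destination reduce them to one. On real data the extra tags often still leave several candidates.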

This is where digging deeper into how GTFS files are published is interesting.

If anything changes with the structure of a route, GTFS data will be duplicated. The feed runs over a period of, say, 2 weeks. But the first week is the old version of the route, and the second week will have the new version. Both will be present. Same route number, same network, same colour tags, same destinations, only a tiny difference, maybe just one extra stop in the new version, or the same stop but with a different stop id.
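The duplication described above can be spotted by comparing stop sequences. The following is only a sketch under assumptions: the route ids, stop ids and the 0.8 threshold are invented, and it uses SequenceMatcher from the Python standard library as a similarity measure, which is not necessarily what our tools do.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical route versions: (route_id, route number, ordered stop ids).
VERSIONS = [
    ("old-17", "17", ["s1", "s2", "s3", "s4"]),
    ("new-17", "17", ["s1", "s2", "s3", "s4", "s5"]),  # one extra stop
    ("x-40", "40", ["s9", "s8", "s7"]),
]


def near_duplicates(versions, threshold=0.8):
    """List pairs of route ids with the same number and almost the same stops."""
    pairs = []
    for (id_a, num_a, stops_a), (id_b, num_b, stops_b) in combinations(versions, 2):
        if num_a != num_b:
            continue
        # ratio() is 1.0 for identical sequences and drops as they diverge.
        ratio = SequenceMatcher(None, stops_a, stops_b).ratio()
        if ratio >= threshold:
            pairs.append((id_a, id_b, round(ratio, 2)))
    return pairs


print(near_duplicates(VERSIONS))
```

Here the two versions of route 17 are reported as a near-duplicate pair, while route 40 is left alone.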

List of bus relations taken from a GTFS feed: several routes show as two almost identical copies

This is one of the cases where we want to perform a manual validation. Human review is core here: there is nothing worse than letting an automated script run wild and massively erase "bad data". How to do this: run the tests for every route relation, try to find the closest match, then submit it for human validation. When this is done, we write down the "pair" of items (the route id in the GTFS and the id of the OSM master relation) so that we can start running tests.
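The workflow above (rank the candidates, let a human confirm, then store the pair) could look roughly like this. It is a sketch, not the actual DLrelimport implementation: the function names, the CSV layout, the stop ids and the relation id 1234567 are all hypothetical.

```python
import csv
import io
from difflib import SequenceMatcher


def rank_candidates(osm_stops, gtfs_candidates):
    """Sort GTFS candidates by stop-sequence similarity to the OSM relation, best first."""
    scored = [
        (SequenceMatcher(None, osm_stops, stops).ratio(), route_id)
        for route_id, stops in gtfs_candidates.items()
    ]
    return sorted(scored, reverse=True)


def write_pairs(confirmed, handle):
    """Write human-confirmed (gtfs_route_id, osm_relation_id) pairs as CSV."""
    writer = csv.writer(handle)
    writer.writerow(["gtfs_route_id", "osm_relation_id"])
    writer.writerows(confirmed)


# Hypothetical data: stops of the OSM relation and of two GTFS candidates.
osm_stops = ["s1", "s2", "s3", "s4", "s5"]
candidates = {
    "old-17": ["s1", "s2", "s3", "s4"],
    "new-17": ["s1", "s2", "s3", "s4", "s5"],
}

ranking = rank_candidates(osm_stops, candidates)
best_score, best_id = ranking[0]

# The script only proposes; a reviewer confirms before the pair is stored.
out = io.StringIO()
write_pairs([(best_id, 1234567)], out)
print(out.getvalue())
```

The script never changes anything by itself: it produces a ranked proposal, and only the confirmed pair ends up in the file.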

And here comes one more problem: route ids in GTFS change all the time. Normally, this should never happen if nothing has changed. But this is really how it goes. The GTFS feed from Monday can be very different from the one published on Tuesday, and so forth. First, this completely defeats the idea of adding gtfs:route_id to OSM objects: it will be impossible to maintain or verify. But it also keeps breaking the pairing obtained with the DLrelimport validation. The current rate of data refresh on our server is twice a week, but we plan to gradually increase this. Already, this is much faster than the time it would take for someone to perform all the validation tasks. We will need to find a better system, perhaps concentrating only on the main routes, or reviewing one region at a time.
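To see how quickly the pairing breaks, the stored pairs can be checked against each new feed. The sketch below assumes the pairs are kept as a simple mapping from GTFS route id to OSM relation id; the ids, the feeds and the function name broken_pairs are invented.

```python
def broken_pairs(pairs, feed_route_ids):
    """Return the stored pairs whose GTFS route id is missing from the latest feed."""
    current = set(feed_route_ids)
    return {
        route_id: relation_id
        for route_id, relation_id in pairs.items()
        if route_id not in current
    }


# Hypothetical validated pairs: GTFS route id -> OSM master relation id.
validated = {"r-101": 1111111, "r-102": 2222222}

# Hypothetical route ids from two consecutive feeds.
monday_feed = ["r-101", "r-102", "r-103"]
tuesday_feed = ["r-101", "r-888", "r-103"]  # r-102 came back under a new id

to_review = broken_pairs(validated, tuesday_feed)
print(to_review)
```

Every pair reported here has to go back to human validation, which is exactly the workload problem described above.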

August 2024 update: a new version of DLrelimport is being tested locally to address this problem. It adds the ability to match several routes and an automated check that could be run every day. Getting this module to work correctly would be a major step towards efficient monitoring of route relations.


Permalink: https://blog.multimob.be/zzmphnsr0y.htm

Screenshots with maps are © OpenStreetMap contributors