A permanent quest for missing stops on the map

by multimob — written on 2024-07-15


Getting stops onto the correct location on the map is a challenge, and so is getting their names right. The same goes for the unique identifiers assigned by operators.

The unique identifier (ref:De_Lijn for De Lijn) is how we can check for discrepancies between official data and OSM. This is also how we build new relations or check existing ones. In a nutshell, this is about getting all the stop codes for a relation in the right order.

However, those unique identifiers change with time. They may even change very frequently, unlike what is expected of a "primary key" in a relational database.

Operators may have many reasons to renumber an existing stop. For instance, they have internal databases which match a ref code with a stop name. Once in a while, they want to rename a stop… and decide that changing its ref code is the best way to keep old data clean, as they wouldn’t want to see the new name on old data. Similarly, STIB/MIVB in Brussels seems to hard-code distances between pairs of stops; as a consequence, if a platform is rebuilt and the stop is relocated 10 or 20 metres further away, they will change the id of the stop. Until recently, they issued a completely different code; nowadays they use letter suffixes, so that 3310 becomes 3310A, then 3310B and so on.
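To illustrate the suffix scheme, here is a minimal sketch that checks whether two codes share the same numeric base. It assumes a code is a run of digits optionally followed by one uppercase letter, which is only a guess based on the 3310 / 3310A / 3310B example; the function names are hypothetical and a match is merely a hint for a human to review.

```python
import re

# Assumption: a ref is digits optionally followed by a single uppercase letter.
REF_PATTERN = re.compile(r"(\d+)([A-Z]?)")

def split_ref(ref):
    """Split '3310B' into ('3310', 'B'); return None if the ref does not fit the assumed pattern."""
    match = REF_PATTERN.fullmatch(ref)
    return match.groups() if match else None

def same_base(old_ref, new_ref):
    """True if both refs fit the pattern, share the numeric base and differ in suffix."""
    old, new = split_ref(old_ref), split_ref(new_ref)
    return old is not None and new is not None and old[0] == new[0] and old != new

print(same_base("3310", "3310A"))
print(same_base("3310", "3311"))
```

This only helps with the newer suffix-style renumbering; a stop that received a completely different code cannot be recognised this way.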

The problem with this is that correct data in OSM can age fairly quickly. After every wave of network changes, we see a lot of discrepancies for the ref:De_Lijn tag.

We have a script (DLcompare) for this: it compares OSM data with GTFS data for a given area and prints the differences, using this unique identifier to match stops.
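The core idea of such a comparison can be sketched with two sets of identifiers. This is not the actual DLcompare code, only a minimal illustration; the function name and the sample codes are made up, and loading the OSM and GTFS data is left out.

```python
def compare_refs(osm_refs, gtfs_refs):
    """Return (refs only in GTFS, refs only in OSM), each as a sorted list."""
    osm, gtfs = set(osm_refs), set(gtfs_refs)
    missing_in_osm = sorted(gtfs - osm)   # stops the operator publishes but OSM lacks
    unknown_to_gtfs = sorted(osm - gtfs)  # stops in OSM whose code the operator no longer uses
    return missing_in_osm, unknown_to_gtfs

# Hypothetical sample codes, for illustration only
missing_in_osm, unknown_to_gtfs = compare_refs(
    osm_refs=["101234", "101235", "101240"],
    gtfs_refs=["101234", "101236", "101240"],
)
print("Only in GTFS:", missing_in_osm)
print("Only in OSM:", unknown_to_gtfs)
```

Note that a renumbered stop shows up in both lists at once, as one code that vanished and one that appeared, which is why the output needs to be read by a human rather than applied blindly.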

It is very tempting to run it automatically: add all the missing stops, remove all the old stops, and you are set. Big mistake! This would completely miss existing stops that were renumbered for whatever reason… And it would cause unnecessary version bumps in route relations, since you would remove some nodes from them while adding new ones. It is better to keep the same nodes in OSM, so that relations are not affected.

Here is how we proceed: we take every new stop and correct its location if it is outrageously wrong (in the middle of the road, inside buildings, sometimes in the middle of a large field several dozen metres away from the road). Then we look for neighbouring stops. In many cases, the new and old stops are at the same location, only a few metres apart, so the story is easy to reconstruct: the stop was renamed or renumbered. Copy the tags from the new stop to the old one and it should work. If the location is almost identical, you can keep non-GTFS data such as the bench, bin, covered, lit, mapillary, shelter and tactile_paving tags. If, on the other hand, the stop was relocated substantially further away, it is safe to assume that those tags no longer apply. As usual, this should always be a decision made by a human. Don't fall into the convenience trap of automating it.
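The tag-copying step can be sketched as follows. This is an illustration under assumptions, not our actual tooling: tags are plain dictionaries, the function name is hypothetical, and the set of non-GTFS tags is just the examples listed above, not an exhaustive list. The decision whether the location is "almost identical" is passed in by the mapper on purpose.

```python
# Non-GTFS tags named in the text; they describe what a mapper saw at the stop.
SURVEY_TAGS = {"bench", "bin", "covered", "lit", "mapillary", "shelter", "tactile_paving"}

def merge_tags(old_tags, new_tags, same_location):
    """Tags for the existing OSM node after copying the new stop's tags onto it.

    same_location is decided by a human: True keeps the surveyed tags,
    False drops them because the stop moved too far for them to be trusted.
    """
    merged = {
        key: value
        for key, value in old_tags.items()
        if same_location or key not in SURVEY_TAGS
    }
    merged.update(new_tags)  # new name, new ref and so on overwrite the old values
    return merged

# Hypothetical example values
old = {"name": "Old name", "ref:De_Lijn": "101235", "bench": "yes", "shelter": "yes"}
new = {"name": "New name", "ref:De_Lijn": "101236"}
print(merge_tags(old, new, same_location=True))
print(merge_tags(old, new, same_location=False))
```

Because the existing node is kept and only its tags change, the route relations that contain it are left untouched.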

View of about 100 changes around Brussels, Antwerp and Ghent on a single day solely to update stop codes on existing data

A common trap, then, is to believe that we can solve the problem algorithmically by matching old and new stops based on the distance between them. This idea may work for a few stops, and maybe even a large number of them, but it will unfortunately fail in many cases, mostly because the location of stops is too inaccurate. We have already spent a lot of time figuring out which stop was on which side of the road when nodes were located in the middle of the road, not to mention the large number of stops which GTFS locates on the wrong side of the road. More than ever, building a correct map means investing proper resources in human review.


Permalink: https://blog.multimob.be/zzaoghumo7.htm


Screenshots with maps are © OpenStreetMap contributors