Street address interpolation

Today we’re announcing a new address interpolation system. It was designed for the Pelias geocoder and is now available in Mapzen Search, but can also be used as a stand-alone application or be included with any other geographic software or search engine.

The term ‘interpolation’ refers to a method of creating new data points from a set of known data points. For geographic search, this is important for street address data where some house numbers may be missing from our data providers.

The OpenStreetMap and OpenAddresses projects provide a huge cache of street address information; between them over 500 million geographic address points are freely available to download and use for routing people and cars from A to B.

The issue is that these datasets are rarely complete, they contain 'holes’ in the data where house numbers or geographic coordinates are not available.

Consider a driver asking for directions to 45 Shortland Street, Auckland, NZ:

screenshot

Red and blue dots are known address points from OpenStreetMap and OpenAddresses.

The 'holes’ in the data result in our search engine returning a less granular result for queries which request the location of one of these missing addresses.

By filling in these gaps in the data, we can intelligently estimate where the missing house numbers would lie on each road:

screenshot

The system is also aware of the geometric shape of each street and uses this information, along with the point data, to ensure that the location it returns lies on the road network and not somewhere which could be a potential driving hazard.

Interpolated data points are never as precise as the rooftop accuracy data we get from our providers. However, we aim to provide a location which is not only safe to drive to but also close enough for a driver to visually identify the letter box or driveway of their destination.

Improving address discovery

Over the last few months we have been rolling out improvements to Mapzen Search on indexing and retrieving postal addresses.

Back in September we finished a project which imports all of the OpenStreetMap road network in to Pelias. This first step allowed us to return the mid-point of each road geometry in lieu of having the coordinates for an exact house number.

The next step was to produce an interpolation engine which could go one step further and return an interpolated address location based off the other data points we have for that street.

Design goals

We set out with the following strategic goals in mind:

  • Ensuring every street in OpenStreetMap is indexed and retrievable.
  • Supporting address ranges as provided by OpenStreetMap, TIGER, et al.
  • Combining and de-duplicating distinct address point data sets.
  • Designing the system to scale beyond 1 billion address points.
  • Allow room for future extension / improvements.

These changes will improve the user experience by:

  • Providing house number interpolation where address range data exists.
  • Falling back to a street centroid in lieu of a satisfactory house number.
  • Reducing noise by only showing a maximum of one result per street.

How it works

Road network

First we need a good source of open road geometry data. We can get this from OpenStreetMap.

The data is provided as 'ways’ which first need to be joined together in order to form the longest contiguous line string. This will allow us to not only interpolate over 'ranges’ of data (such as blocks) but also along the whole length of the street, even if it runs between different countries!

To accomplish this we use the Valhalla routing engine to export the routing graph into an encoded line geometry format called Polyline.

polyline The polyline for Berlin’s Grolmanstraße

Once we have these polylines for each street in the world, we can go ahead and compute a point halfway along the path to use as a mid-point value, then those streets can be imported directly in to Pelias for use in a 'fallback’ scenario when we cannot find the exact address the user is searching for.

For the purposes of interpolation, we use the same polyline dump to import those geometries in to sqlite3. We use the rtree module to create a fast spatial index of the street bounds, then we store the geometry along with the bounds and the street name(s) in a database called street.db.

bbox Putting Grolmanstraße in a bounding box.

In order to normalize the street names we run the street names through the libpostal address expander. This allows us to do things like remove diacritics, handle unicode and perform synonym substitutions.

The importer takes a few hours to run and produces a single street.db file (~5.3G) which can be shared with any other applications which need a conflated road network data set. Since the file is sqlite3 it can be accessed from any programming language or copied to a embedded device such as a mobile phone or even your car!

Addresses Points

Next we download >400 million OpenAddresses (OA) and perform street name normalization which allows us to use text matching techniques to match them with the road network.

For each data point in the OA files we attempt to find a corresponding street in the street.db using the rtree index. This results in a >80% success rate for matching one data set with the other.

Now we have the address points and the street geometry we can then 'project’ each point on to the shape of the street. To do this we use a linear algebra function to find its orthogonal projection; this is basically the point on the road closest to the front door of the house.

projection

Click for an interactive demo.

We then store each address in a database called address.db: each row contains both the original position and the projected position, along with id of the corresponding street in street.db.

But it’s not only OpenAddresses! We also import OpenStreetMap entities tagged with both addr:street and addr:housenumber.

We then finish off by importing all the block ranges from the US Census Dept 2016 TIGER files, giving us a very high level of coverage in the United States.

Following the contour of the street

The final part of the build is to perform some pre-computation on the street geometry itself; we want to build an index which is highly performant at search time.

In order to avoid loading the street geometry every time we want to perform a search, we iterate over each vertex (corner) of the street geometry and compute a 'fractional house number’.

Using the data we already have about where each house number lies on the path, we can now ask the system “if this corner of the road was a house, what house number would it have?”

table

This step is very important because it allows us to preserve the shape of the street, meaning we will never return a point which does not lie on the path.

We designed the system in such a way that a vehicle driver would never be given a destination which lies in a hazard or is not routable using turn-by-turn navigation systems.

Contributing & Feedback

As always, this project is open-source and available on Github. Please support the project by opening descriptive issues when you find them. (Note that when Mapzen Search returns an interpolated address, you’ll see interpolated for the match_type in the results.)

},
"properties": {
  [...]
  "name": "207 Spear Street",
  "housenumber": "207",
  "street": "Spear Street",
  "confidence": 0.8,
  "match_type": "interpolated",
  "accuracy": "point",
  [...]
}

We will be soon be offering database downloads of the interpolation index and we’ll continue to improve the data quality and interpolation algorithms as we work towards having over a billion address points available for search worldwide.

As always, we would love to get your feedback on Mapzen Search and our open source geocoding engine, Pelias. You can find us on here, there, and way over there