The Next Chapter of Search

We’re shaking things up a bit here in Search land! We recently released a version of Pelias that uses libpostal to parse /v1/search input into its constituent address parts, and that takes steps to ensure only the most accurate results are returned. This is the first of many releases toward our goal of turning Pelias (and thereby Mapzen Search) into a top-tier geocoder and POI search engine.

As Mapzen Search is the service and Pelias is the actual software that powers it, any changes made to Pelias are reflected in the user experience of the Mapzen Search API.

[Image: French village]

image via Cooper Hewitt, Sidewall, French Village, 1950–52; Manufactured by Katzenbach and Warren, Inc.

Libpostal

As described in this blog post by Al Barrentine, libpostal is an international address parser trained on OpenStreetMap, one of the primary sources for venue and address data in Mapzen Search. By incorporating libpostal, we’re moving away from addressit, a module that specializes in identifying the street tokens of an input while leaving the remaining administrative area and query tokens loosely identified. This has worked pretty well for Mapzen Search so far but we’ve had to jump through some hoops to incorporate the output intelligently into subsequent steps of the API pipeline. Now, as we transition to more of a true geocoder with business rules beyond term matching, it makes sense to switch to libpostal.

As libpostal is a full address parser, its usage only applies to input into the /search endpoint. libpostal is not used in the /autocomplete, /place, and /nearby endpoints.

Only Return What You Asked For

Evolving how a production geocoder operates is tricky business. There are lots of edge cases to consider but fortunately we can change incrementally. This will be the first step in a long journey towards geocoding enlightenment that we’re confident will make a better experience for our users.

As the title of this section suggests, our driving philosophy going forward will be to only return results that apply to what you asked for. For example, consider a search for 30 W 26th Street, New York, NY. Until the introduction of libpostal, this is what Mapzen Search returned:

[Image: matches some tokens]

The results consisted of a very precise result first, since it matched the most tokens, followed by results that matched most but not all of the tokens. Note that in the 2nd-4th results the house number and street matched, while in the 7th-10th results the street, city, and state matched but the house number was different. This is because the geocoder gave a lot of weight to the number of tokens matched without determining whether those results made sense.
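To make the problem concrete, here is a toy illustration of ranking by token overlap alone. It is a deliberately simplified sketch, not how Elasticsearch or Pelias actually scores documents; the function name and two of the three candidate strings are made up for the example.

```python
def token_overlap(query, candidate):
    """Count the distinct tokens shared by the query and a candidate label."""
    query_tokens = set(query.lower().replace(",", " ").split())
    candidate_tokens = set(candidate.lower().replace(",", " ").split())
    return len(query_tokens & candidate_tokens)

query = "30 W 26th Street, New York, NY"

# Hypothetical candidate labels, invented for illustration only.
candidates = [
    "30 W 26th Street, New York, NY",        # everything matches
    "30 W 26th Street, Somewhere Else, NJ",  # right number and street, wrong place
    "45 W 26th Street, New York, NY",        # right street and city, wrong number
]

scores = [token_overlap(query, c) for c in candidates]
ranked = sorted(candidates, key=lambda c: token_overlap(query, c), reverse=True)

for label in ranked:
    print(token_overlap(query, label), label)
```

Every candidate shares some tokens with the query, so all of them come back, even though only the first one is the address that was asked for.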

With the libpostal integration and introduction of geocoding rules, the geocoder will now return only the most accurate result:

[Image: matches all tokens]

How It Works

The input is first passed to libpostal for text analysis. If the parsed input has a street, Elasticsearch is queried for the parsed fields; otherwise, the geocoder uses a geographic search engine approach to resolve the input. Take, for example, 30 West 26th Street, New York, NY. Libpostal parses this as:

{
  "house_number": "30",
  "road": "west 26th street",
  "city": "new york",
  "state": "ny"
}

Much like before, the geocoder will query Elasticsearch for the parsed input fields, though libpostal more accurately identifies the non-street portions of the input. The difference now is that the geocoder will only return results that match all of the parsed fields that are queried for. As illustrated above, this is a departure from subjecting the input to a geographic search engine approach.
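The "match all of the parsed fields" rule can be sketched as a simple filter. This is a minimal illustration in Python rather than the actual Pelias query logic (which runs inside Elasticsearch); the helper name and the candidate records are hypothetical.

```python
def matches_all_fields(parsed, candidate):
    """True when the candidate has the same value for every parsed field."""
    return all(candidate.get(field) == value for field, value in parsed.items())

# The libpostal parse shown above.
parsed = {
    "house_number": "30",
    "road": "west 26th street",
    "city": "new york",
    "state": "ny",
}

# Hypothetical candidate records, invented for illustration only.
candidates = [
    {"house_number": "30", "road": "west 26th street", "city": "new york", "state": "ny"},
    {"house_number": "30", "road": "west 26th street", "city": "elsewhere", "state": "nj"},
    {"house_number": "45", "road": "west 26th street", "city": "new york", "state": "ny"},
]

strict_results = [c for c in candidates if matches_all_fields(parsed, c)]
print(strict_results)
```

Only the first record survives, because the other two disagree with the parsed input on at least one field.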

What happens when user intent fails?

In an ideal world, the data would be perfect and users would only enter inputs that are perfectly formed, spelled correctly, and geographically make sense. Here are some scenarios in which an input can be incorrect:

  • misspelling, e.g., Appalloosa Way instead of Appaloosa Way
  • street name changes between towns, e.g., West Broadway turns into East Main Street
  • a street name may not exist in a city, e.g., 1220 Calle de Lago, New York, NY

Libpostal will happily parse these inputs as containing a street because they are all technically valid, just not geographically consistent:

{
  "house_number": "1220",
  "road": "calle de lago",
  "city": "new york",
  "state": "ny"
}

1220 Calle de Lago is a valid address in Socorro, NM (among other cities) but not in New York City.

To deal with these problematic inputs, the query logic now discards the most granular parts of the parsed input to attempt to match something less granular. In this case, 1220 Calle de Lago isn’t in New York, NY, so it will discard the house number and street and attempt to just match New York, NY.

First, let’s look at an input that matches everything: 30 West 26th Street, Manhattan, New York, NY, USA, which libpostal would interpret as:

{
  "house_number": "30",
  "road": "west 26th street",
  "city_district": "manhattan",
  "city": "new york",
  "state": "ny",
  "country": "usa"
}

The geocoder would query Elasticsearch with the following combinations of fields identified by libpostal:

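| House Number | Street           | Borough   | City     | State | Country | Layer   |
|--------------|------------------|-----------|----------|-------|---------|---------|
| 30           | west 26th street | manhattan | new york | ny    | usa     | address |
|              | west 26th street | manhattan | new york | ny    | usa     | street  |
|              |                  | manhattan | new york | ny    | usa     | borough |
|              |                  |           | new york | ny    | usa     | city    |
|              |                  |           |          | ny    | usa     | region  |
|              |                  |           |          |       | usa     | country |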

This all happens in one Elasticsearch query to reduce the number of roundtrips between the API and Elasticsearch. Since all the parsed fields match (including house number and street), the result that matched everything is returned and everything else is thrown away.
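The combinations in that table follow a simple pattern: drop the most granular remaining field at each step. Here is a rough sketch of how they could be generated from a libpostal parse. The LEVELS list and the function name are assumptions made for this example; the layer names are taken from the table, and Pelias itself builds these as clauses of a single Elasticsearch query rather than as a Python list.

```python
# libpostal field -> layer, ordered from most to least granular.
# This pairing is inferred from the table above, not taken from Pelias code.
LEVELS = [
    ("house_number", "address"),
    ("road", "street"),
    ("city_district", "borough"),
    ("city", "city"),
    ("state", "region"),
    ("country", "country"),
]

def fallback_combinations(parsed):
    """Return (layer, fields) pairs, most specific first."""
    order = [field for field, _ in LEVELS]
    combinations = []
    for i, (field, layer) in enumerate(LEVELS):
        if field not in parsed:
            continue  # nothing to query at this level
        # Keep this field plus every less granular field that was parsed.
        fields = {f: parsed[f] for f in order[i:] if f in parsed}
        combinations.append((layer, fields))
    return combinations

parsed = {
    "house_number": "30",
    "road": "west 26th street",
    "city_district": "manhattan",
    "city": "new york",
    "state": "ny",
    "country": "usa",
}

combos = fallback_combinations(parsed)
for layer, fields in combos:
    print(layer, fields)
```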

Fallback to Street

Our engine currently only supports address points and not house number interpolation (which is under active development), so when a house number isn’t an address point in our data, our logic will fall back to trying to find the street instead. Much like a street that doesn’t exist in a city, 32 W 26th Street, New York, NY isn’t an address point in our data, but W 26th Street, New York, NY does exist as a street, so a street result is returned instead.

Fallback to City

Another likely scenario happens when an input contains a street, city, and state but the street is either misspelled or doesn’t exist in the city, for example, Calle de Lago, New York, NY. The street Calle de Lago has been identified by libpostal as a street but just does not exist in New York City. When the street lookup fails, New York City will be the fallback.

Fallback to State

Quick question (no cheating): What US state is Hilton Head Island in? If you said North Carolina, that’s understandable, but it’s actually in South Carolina. That’s one of the most common city/state mismatches that users make. We’ll soon handle corrections like this, but until then the fallback logic is to match the state only. That is, the input Hilton Head Island, North Carolina will fall back to North Carolina.

You can see where this is going

Listing every possible fallback and failure scenario would be too exhausting for a blog post, but the approach Pelias takes with an input is to first try the most specific combination of analyzed fields, then fall back to less granular combinations until a result is returned.
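Here is a minimal sketch of that "most specific first" selection, using the 1220 Calle de Lago, New York, NY example. The toy index and function names are hypothetical, and the loop is only for readability: as noted earlier, Pelias sends all combinations in one Elasticsearch query and keeps the most granular match.

```python
def first_match(combinations, lookup):
    """Return the first (layer, hits) whose lookup finds anything."""
    for layer, fields in combinations:
        hits = lookup(layer, fields)
        if hits:
            return layer, hits
    return None, []

# Combinations for "1220 Calle de Lago, New York, NY", most specific first.
combinations = [
    ("address", {"house_number": "1220", "road": "calle de lago", "city": "new york", "state": "ny"}),
    ("street", {"road": "calle de lago", "city": "new york", "state": "ny"}),
    ("city", {"city": "new york", "state": "ny"}),
    ("region", {"state": "ny"}),
]

# Toy stand-in for the search index: it knows the city and the state,
# but has no Calle de Lago in New York.
TOY_INDEX = {
    ("city", frozenset({"city": "new york", "state": "ny"}.items())): ["New York, NY"],
    ("region", frozenset({"state": "ny"}.items())): ["New York (state)"],
}

def toy_lookup(layer, fields):
    return TOY_INDEX.get((layer, frozenset(fields.items())), [])

layer, hits = first_match(combinations, toy_lookup)
print(layer, hits)
```

The address and street lookups come back empty, so the city is the most granular thing that matches and the state-level result is never needed.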

Catastrophic Failure

If the engine is unable to come up with any results based on the libpostal interpretation, it will revert to the old behavior. For example, the input 10 Main Street, United States of America is parsed as a street/country, but our data only supports United States and USA (support for alternate names will be added in the near future), so the new logic would return no results. In this case, the engine falls back to the old behavior of matching some of the tokens, returning results that range from 10 Main Street, Fair Haven, VT, USA to 10 Main Street, Swanland, England, United Kingdom.

Incremental Improvements

We’re starting our transition incrementally. It’s certainly possible to release an epic update all at once but we’d risk alienating users whose applications may not be ready for such a monumental change and, as an open source project, we value the input of our users.

For our first iteration, we’ll only be modifying behavior for inputs that libpostal says contain a street. That is, the new rules will apply for inputs like:

  • 30 West 26th Street, New York, NY
  • West 26th Street, New York, NY
  • 30 West 26th Street

but not:

  • New York
  • pizza in Lancaster, PA

Inputs like the latter set are parsed correctly but involve introducing advanced business rules for ranking. By concentrating our efforts on the former, we see what works and what doesn’t for our users. When an input does not contain a street then the geocoder will resort to the original behavior.
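Put together, the routing for this first iteration looks roughly like the sketch below: use the new logic only when the parse contains a street, and revert to the original behavior when there is no street or when the new logic finds nothing. All function names and the canned parses are stand-ins invented for this example, not Pelias internals.

```python
def geocode(text, parse, structured_search, legacy_search):
    """Return (path_taken, results) for an input string."""
    parsed = parse(text)
    if "road" in parsed:  # libpostal labels the street as "road"
        results = structured_search(parsed)
        if results:
            return "structured", results
    # No street in the input, or the new logic came up empty.
    return "legacy", legacy_search(text)

# Stand-in for libpostal with canned, illustrative parses.
CANNED_PARSES = {
    "30 West 26th Street, New York, NY": {
        "house_number": "30", "road": "west 26th street", "city": "new york", "state": "ny",
    },
    "10 Main Street, United States of America": {
        "house_number": "10", "road": "main street", "country": "united states of america",
    },
    "New York": {"city": "new york"},
}

def fake_parse(text):
    return CANNED_PARSES[text]

def fake_structured_search(parsed):
    # Pretend the data does not know this spelling of the country name.
    if parsed.get("country") == "united states of america":
        return []
    return ["structured hit"]

def fake_legacy_search(text):
    return ["legacy hit for " + text]

for text in CANNED_PARSES:
    print(text, "->", geocode(text, fake_parse, fake_structured_search, fake_legacy_search)[0])
```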

Accuracy flags

With the introduction of the new fallback behavior, it became important to signal to the user when a fallback occurred. To accomplish this, we’ve added two properties to the results of all queries processed using libpostal: accuracy and match_type. These fields, in combination with the confidence score, give a sense of how relevant and accurate the results are.

The accuracy property indicates what type of geometry the result contains. When accuracy is set to point, the result is either a POI or address that was represented with a single point in the source data. When accuracy is set to centroid, however, the point representing the location of the record is a centroid of what was once a polygon in the source data. The centroid accuracy type is common across streets and all administrative areas, such as cities, counties, regions, etc.

The match_type property has been added to indicate how precisely we’ve managed to match the user’s requested address. In the case where everything lines up perfectly, the match_type property will be set to exact. However, if for some reason the requested house number isn’t found, the result will be a fallback to the centroid of the matching street, and the match_type will be set to fallback.

Example time! Let’s first see what happens when everything goes smoothly.

30 West 26th St, New York, NY will result in:

[...]
  "confidence": 1,
  "match_type": "exact",
  "accuracy": "point",
[...]

Now if we look for a house number that doesn’t exist on that street, like 1 West 26th St, New York, NY, you will see that the result is the centroid of West 26th Street and has the following properties:

[...]
  "confidence": 0.8,
  "match_type": "fallback",
  "accuracy": "centroid",
[...]

Well, what if we also screwed up the street name so that it doesn’t exist in New York, as in 30 West 2006th St, New York NY? Search will have no choice but to fall back to the city level and return the centroid of New York City. Note that the confidence score is significantly lower in this case because the result is less granular than a street centroid. The result has the following properties:

[...]
  "confidence": 0.6,
  "match_type": "fallback",
  "accuracy": "centroid",
[...]

Keep in mind that the new match_type property will only be returned in cases where libpostal and the new query logic are employed. In cases where the engine still relies on textual matching, the property will be omitted.
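On the client side, these properties can drive how much you trust a result. The sketch below is one possible way to read them, using the three property sets shown above; the function names are made up for this example, and it treats a missing match_type as the textual-matching case.

```python
def summarize(properties):
    """Describe a result's properties in one line."""
    match_type = properties.get("match_type")
    if match_type is None:
        # Property omitted: the engine relied on textual matching.
        return "textual match (match_type omitted)"
    return "{} match, {} geometry, confidence {}".format(
        match_type, properties.get("accuracy"), properties.get("confidence")
    )

def is_precise(properties):
    """True only for an exact match on a point geometry."""
    return properties.get("match_type") == "exact" and properties.get("accuracy") == "point"

examples = [
    {"confidence": 1, "match_type": "exact", "accuracy": "point"},
    {"confidence": 0.8, "match_type": "fallback", "accuracy": "centroid"},
    {"confidence": 0.6, "match_type": "fallback", "accuracy": "centroid"},
    {"confidence": 0.9},  # hypothetical result from the textual-matching path
]

for properties in examples:
    print(summarize(properties), "| precise:", is_precise(properties))
```

An application that needs rooftop-level results could keep only the results where is_precise returns True and treat everything else as approximate.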

Get in touch!

Let us know what you think and what you see! Libpostal is constantly improving, but if you see results you don’t expect, please file an issue on GitHub or chat with us on Gitter.