Meaningful Geocoding - Geocoding Places

In the first post of our series, Meaningful Geocoding, we covered geocoding for addresses. In this post we’ll discuss scenarios that cover users who aren’t looking for something exact. That is, how do we handle users who are just looking for a neighborhood, city, region, country, or postal code?

Searching for neighbourhoods in London

To clear up some terminology, we’ll use the term “coarse geocoding” (for lack of something more descriptive) to refer to geocoding that doesn’t involve streets. That is, inputs like:

  • Vancouver, BC
  • Juba, South Sudan
  • Auteuil Nord (a neighborhood in Paris)
  • 90210 (the postal code in Beverly Hills, not the TV show)

For clarity and when available, place names in this article will link to Who’s on First, the gazetteer being developed by Mapzen.

Coarse geocoding isn’t quite as fraught with peril as address geocoding but it does present some interesting challenges. Let’s get started!

Unambiguous Input

If you’re lucky, the input supplied by the user can be geocoded to exactly one place.

Example: Truth or Consequences, NM

There is unambiguously a single place in the state of New Mexico named Truth or Consequences, and it’s a city.

Example: Bangkok, Thailand

There is a single city named Bangkok in Thailand.

Example: Maui, HI

Maui is a county, not a city, and no other place named ‘Maui’ in Hawaii exists.

Example: Prince Edward Island

Most, but not all, region names are unique. There is a single place named Prince Edward Island and it’s a province in Canada.

In all four cases above, a geocoder should return the single unambiguous result.

Canadian postal codes follow a distinctive alphanumeric format, so they can be treated as unambiguous, too.

Example: K1A 0B1

A geocoder should return:

Ambiguous Inputs

More often than not, user inputs will be ambiguous, so business rules dictating result ordering should be applied. For instance, while not common, there are instances where a city name appears multiple times within the same state.

Example: Tuckahoe, NY

A geocoder should return:

  • Tuckahoe, Suffolk County, NY
  • Tuckahoe, Westchester County, NY

Ordering of the two should be driven by an application’s business rules. For example, the latter Tuckahoe is just to the north of New York City, so if the user has their map centered on New York City, it’s entirely appropriate to return that one first. Alternatively, business rules may call for results in descending population order if the data is available.

Chance ordering should be avoided if possible since the user may not realize that an ambiguity exists.
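The two ordering rules described above can be sketched in a few lines of Python. This is only an illustration, not how any particular geocoder does it: the `Candidate` class and `rank` function are hypothetical names, the coordinates are approximate, and the population figures are rough values included only to make the example run.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass
class Candidate:
    label: str
    lat: float
    lon: float
    population: int  # rough, illustrative figure; not authoritative


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def rank(candidates, focus=None):
    """Order candidates deterministically, never by chance.

    With a focus point (lat, lon), such as the map center: nearest first.
    Without one: largest population first. The label breaks ties.
    """
    if focus is not None:
        def key(c):
            return (haversine_km(focus[0], focus[1], c.lat, c.lon), c.label)
    else:
        def key(c):
            return (-c.population, c.label)
    return sorted(candidates, key=key)


# Approximate coordinates and populations, for illustration only.
tuckahoes = [
    Candidate("Tuckahoe, Suffolk County, NY", 40.90, -72.43, 1_400),
    Candidate("Tuckahoe, Westchester County, NY", 40.95, -73.82, 6_500),
]
new_york_city = (40.71, -74.01)

for candidate in rank(tuckahoes, focus=new_york_city):
    print(candidate.label)
```

Because both sort keys end with the label, two candidates that tie on distance or population still come back in the same order on every request.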

Example: Las Vegas

Two US states can claim a place named Las Vegas, though most users are referring to the one in Nevada, not New Mexico, as the latter is much smaller. A geocoder should return:

  • Las Vegas, NV
  • Las Vegas, NM

Without a state in the input or business rules to fall back on, users could be confused if a Las Vegas they had never heard of was returned before the Las Vegas they were most likely expecting.

Some inputs are ambiguous because they could refer to either a city or a county.

Example: Lancaster, PA

Lancaster is both a city and a county in the state of Pennsylvania. A geocoder should return:

  • Lancaster, PA (the city)
  • Lancaster County, PA

While counties are important governmentally, users are normally looking for the city when the name could be interpreted either way. In this case, it’s safe to return the city first and the county second, but your business rules may dictate otherwise.
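One simple way to express "city first, county second, unless business rules say otherwise" is a placetype ranking that callers can override. The sketch below is an assumption-laden illustration: the placetype names, their default order, and the `order_by_placetype` function are all hypothetical.

```python
# Lower rank sorts first. The placetype names and their order are assumptions;
# swap in whatever your gazetteer and business rules actually use.
PLACETYPE_RANK = {"city": 0, "county": 1, "region": 2, "country": 3}


def order_by_placetype(results, ranking=PLACETYPE_RANK):
    """Sort results by placetype rank; unknown placetypes go last.

    sorted() is stable, so results sharing a placetype keep their
    incoming order.
    """
    unknown = len(ranking)
    return sorted(results, key=lambda r: ranking.get(r["placetype"], unknown))


lancasters = [
    {"name": "Lancaster County, PA", "placetype": "county"},
    {"name": "Lancaster, PA", "placetype": "city"},
]

for result in order_by_placetype(lancasters):
    print(result["name"])
```

An application that cares more about counties can pass its own ranking, for example `order_by_placetype(lancasters, {"county": 0, "city": 1})`.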

Some inputs are wildly popular place names and may appear anywhere in the administrative hierarchy.

Example: Luxembourg

A geocoder should return:

Countries with less-than-imaginative postal code formats can result in ambiguities, too. For instance, the US 5-digit postal code format is used by a number of countries. Postal codes may not be included in the gazetteer serving as the basis for your source data and may need to be augmented from additional sources.

Example: 90210

This specific postal code appears in a number of countries, such as the United States, Mexico, and Thailand. A geocoder should return results based on what data is available.
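To see why format alone can't settle which country a postal code belongs to, here is a minimal Python sketch that maps an input to every country whose format it matches. The `POSTAL_FORMATS` table is a tiny hand-written sample covering only the countries mentioned in this post plus Canada for contrast, and `candidate_countries` is a hypothetical helper; a match on format says nothing about whether the code is actually in use.

```python
import re

# Sample formats only; a real table would cover many more countries.
# Canadian codes never use D, F, I, O, Q or U, and never start with W or Z.
POSTAL_FORMATS = {
    "CA": re.compile(r"[ABCEGHJ-NPRSTVXY]\d[ABCEGHJ-NPRSTV-Z] ?\d[ABCEGHJ-NPRSTV-Z]\d"),
    "US": re.compile(r"\d{5}(-\d{4})?"),
    "MX": re.compile(r"\d{5}"),
    "TH": re.compile(r"\d{5}"),
}


def candidate_countries(text):
    """Return the country codes whose postal format the input matches."""
    code = text.strip().upper()
    return sorted(
        country
        for country, pattern in POSTAL_FORMATS.items()
        if pattern.fullmatch(code)
    )


print(candidate_countries("90210"))    # matches every 5-digit format in the table
print(candidate_countries("K1A 0B1"))  # matches only the Canadian format
```

When more than one country comes back, the same business rules used for ambiguous city names (map focus, popularity, and so on) can decide the ordering.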

It’s even possible for what appears to be a region and country to be interpreted as a city and state.

Example: Ontario, CA

In this case, both Ontario, California and Ontario, Canada are legitimate interpretations of the user’s input.
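A geocoder can keep both readings alive by expanding the trailing qualifier into every meaning it might have before matching. The following Python sketch assumes a hypothetical `QUALIFIERS` lookup and a tiny hand-written `GAZETTEER`; neither reflects a real data source.

```python
# Hypothetical lookup: everything a trailing qualifier might mean.
QUALIFIERS = {
    "CA": [("region", "California"), ("country", "Canada")],
    "CANADA": [("country", "Canada")],
}

# Tiny hand-written gazetteer: (name, placetype, parent placetype, parent name).
GAZETTEER = [
    ("Ontario", "city", "region", "California"),
    ("Ontario", "region", "country", "Canada"),
    ("Vancouver", "city", "region", "British Columbia"),
]


def interpret(query):
    """Return every gazetteer entry consistent with 'name, qualifier'."""
    name, _, qualifier = (part.strip() for part in query.partition(","))
    meanings = QUALIFIERS.get(qualifier.upper(), [])
    return [
        f"{entry_name} ({placetype}), {parent}"
        for entry_name, placetype, parent_type, parent in GAZETTEER
        if entry_name.lower() == name.lower() and (parent_type, parent) in meanings
    ]


for interpretation in interpret("Ontario, CA"):
    print(interpretation)
```

Spelling out the qualifier narrows things down: `interpret("Ontario, Canada")` only matches the Canadian province in this sample data.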

A geocoder should return:

  • Ontario, California
  • Ontario, Canada

Anomalous Inputs

In a perfect world, users would only enter inputs that make sense. We don’t live in a perfect world so geocoders have to deal with inputs that make little (if any) sense.

Mismatched cities, states, and countries

One common type of anomaly is a mismatched city and state combination.

Example: Hilton Head, North Carolina

The city of Hilton Head is actually in South Carolina but users seem to confuse the states quite a bit.

A geocoder should return:

  • Hilton Head, SC

In the same vein, cities in regions that border other countries can be confused by users.

Example: Strasbourg, Germany

Strasbourg is in the Alsace-Champagne-Ardenne-Lorraine region of France but within mere kilometers of Germany. A glance at a map in that general area of France shows many town and city names that could easily be mistaken for German by users unfamiliar with the geographical nuances.

A geocoder should return:

  • Strasbourg, France

Changing country borders can lead to users entering city and country combinations that aren’t correct.

Example: Juba, Sudan

South Sudan officially split from Sudan in 2011, naming Juba its capital.

A geocoder should return:

  • Juba, SS

When the parts of an input don’t agree with each other, the input isn’t easy to geocode, so it’s time to apply editorial and/or algorithmic business rules. Editorial mechanisms can be a quick fix but require constant maintenance and log file analysis. If your gazetteer has adjacency information or bounding polygons, it’s possible to correct mistakes such as pairing a city with a state right next to the correct one, as in the ‘Hilton Head, NC’ example.
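As a rough illustration of the adjacency idea, the Python sketch below checks whether a city that doesn’t exist in the requested state does exist in a neighboring one. The two lookup tables are tiny hand-written samples and `resolve_state` is a hypothetical helper; a real gazetteer would supply this data.

```python
# Hand-written sample data; a real gazetteer would supply both tables.
ADJACENT_STATES = {
    "NC": {"GA", "SC", "TN", "VA"},
}
CITY_STATES = {
    "hilton head": {"SC"},
}


def resolve_state(city, state):
    """Return (state, corrected) for a city/state pair, or None.

    corrected is True when the requested state was wrong but a
    neighboring state contains a city with that name.
    """
    known = CITY_STATES.get(city.lower(), set())
    if state in known:
        return state, False
    nearby = sorted(known & ADJACENT_STATES.get(state, set()))
    if nearby:
        return nearby[0], True
    return None


print(resolve_state("Hilton Head", "NC"))
```

Returning the `corrected` flag alongside the state lets the calling application tell the user that the result differs from what was typed.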

Inputs that just don’t make sense

Not all anomalous inputs can be solved editorially or algorithmically.

Example: Las Vegas, Maine

There’s no neighborhood, city, or county named ‘Las Vegas’ in the state of Maine.

Regardless of how incongruous an input is, it’s what the user entered, so a geocoder should try to make some basic sense of it. The input may not even be a mismatch but something misspelled so badly that no algorithmic spell checker in the world can fix it. In this case, there are two general approaches a geocoder can take:

  • Return ‘Maine’ (the state). The user is clearly looking for a place in Maine; our geocoder just isn’t sure which one, so it’s safe to return ‘Maine’.
  • Return ‘Las Vegas, NV’. Las Vegas is a pretty big city, so it’s possible that the user was just confused about which state it is in.

A combination of the two approaches is achievable with available population or popularity data to classify the city portion as big enough to pass a scoring threshold. In either approach, efforts should be made to inform the calling application that the granularities of information entered versus returned were different so that actions can be taken to correct the error.
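Here is one way the combined approach might look in Python. Everything in it is an assumption made for illustration: the `BIG_CITY_THRESHOLD` value, the approximate populations, the shape of the returned dictionary, and the `geocode_with_fallback` name. The `fallback` field is how this sketch tells the caller that what was returned differs from what was entered.

```python
BIG_CITY_THRESHOLD = 250_000  # assumed cut-off; tune to your own data

# Approximate populations, for illustration only.
CITIES = [
    {"name": "Las Vegas", "state": "NV", "population": 640_000},
    {"name": "Las Vegas", "state": "NM", "population": 13_000},
]
STATES = {"ME": "Maine", "NV": "Nevada", "NM": "New Mexico"}


def geocode_with_fallback(city, state, threshold=BIG_CITY_THRESHOLD):
    """Resolve a city/state pair, falling back when the two don't agree."""
    matches = [c for c in CITIES if c["name"].lower() == city.lower()]

    # 1. The city exists in the requested state: nothing to fix.
    for c in matches:
        if c["state"] == state:
            return {"label": f"{c['name']}, {c['state']}",
                    "granularity": "city", "fallback": None}

    # 2. A well-known city of that name exists elsewhere: trust the city.
    if matches:
        biggest = max(matches, key=lambda c: c["population"])
        if biggest["population"] >= threshold:
            return {"label": f"{biggest['name']}, {biggest['state']}",
                    "granularity": "city", "fallback": "state_ignored"}

    # 3. Otherwise trust the state and drop the city.
    if state in STATES:
        return {"label": STATES[state],
                "granularity": "state", "fallback": "city_ignored"}

    return None


print(geocode_with_fallback("Las Vegas", "ME"))
```

With a stricter threshold, say `threshold=1_000_000`, the same input falls through to the state and comes back as ‘Maine’ with `fallback` set to `"city_ignored"`.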

Diacritical Marks

Efforts should be made to accommodate users who may not be aware of place names containing diacritical marks.

Example: Munster, Germany

A geocoder should return:

  • Münster, Germany

While certainly important to pronunciation, it’s unreasonable to expect users unfamiliar with a language to enter diacritical marks correctly.
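A common way to accommodate this is to fold both the indexed names and the user’s input to a diacritic-free form before comparing them. The Python sketch below uses the standard library’s `unicodedata` module; the `fold` and `lookup` helpers and the one-entry index are hypothetical, and the approach only handles marks that Unicode can decompose (it won’t map the alternative German spelling ‘Muenster’, for instance).

```python
import unicodedata


def fold(text):
    """Strip combining marks and case so 'Münster' and 'munster' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


# Index place names by their folded form but keep the proper spelling.
PLACES = ["Münster"]
INDEX = {fold(name): name for name in PLACES}


def lookup(user_input):
    """Return the properly spelled place name, or None if nothing matches."""
    return INDEX.get(fold(user_input))


print(lookup("Munster"))
```

Keeping the original spelling as the index value means the result shown to the user still carries the correct diacritical marks.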

Misspellings

Some place names, like “New York” and “London”, are arguably easy to spell (at least if you’re a native speaker), but others aren’t. I consider myself an excellent speller in English but struggle to remember how many P’s and S’s are in Parsippany or which Pittsburghs end in an H. Woe be to the non-native English speaker trying to remember the rules (and exceptions to those rules). Autocomplete engines can help a bit for patient users, but a spelling-correction library should be included in any geocoder. There are a number of open-source libraries for this task, from the simple n-gram approach up to more complex approaches like Hunspell.
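To give a feel for the simple end of that spectrum, here is a minimal character-trigram matcher in Python. It is a sketch rather than a replacement for a real spelling library: the Jaccard scoring, the `min_score` cut-off, and the small `PLACES` list are all assumptions chosen for illustration.

```python
def trigrams(word):
    """Return the set of 3-character substrings of a lowercased word."""
    word = word.lower()
    return {word[i:i + 3] for i in range(len(word) - 2)}


def similarity(a, b):
    """Jaccard similarity of two words' trigram sets, from 0.0 to 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def suggest(query, names, min_score=0.3):
    """Return known names similar enough to the query, best match first."""
    scored = [(similarity(query, name), name) for name in names]
    return [name for score, name in sorted(scored, reverse=True)
            if score >= min_score]


PLACES = ["Parsippany", "Pittsburgh", "Paris", "Passaic"]

print(suggest("Parsipany", PLACES))
print(suggest("Pitsburgh", PLACES))
```

Words shorter than three characters produce no trigrams and never match, which is one of several reasons a production geocoder would want something more robust.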

What’s Next

Parts 1 and 2 of this series covered how to geocode streets and administrative regions such as cities, states, countries, and postal codes. It’s not enough to just return results; whatever calls the geocoder must also be given information on how the results should be interpreted. In part 3, we’ll discuss the different types of information that applications would be interested in.

If cracking the code of how we organize the world sounds interesting to you, we’d love to work with you. Our project is 100% open, so it would be great to have you as an open source contributor, not to mention we’re hiring for another person to join our Search team to work on geocoding full time.