Statistical NLP on OpenStreetMap, Part 2
Data scientist Al Barrentine has been working with Mapzen to crack one of the hardest problems in geocoding and place search: international address parsing. He continues to improve libpostal, a state-of-the-art, lightning-fast C library and statistical model for parsing and normalizing addresses around the world. We use it in Mapzen Search, and are happy to see that over the past year the parser has improved to 99.45% full-parse accuracy! The full post is on Medium.
A while back I wrote a piece publicly introducing libpostal, an open-source, open-data-trained C library and companion NLP model for parsing and normalizing international street addresses. Since then, libpostal’s user base has grown to include governments, startups, large companies, researchers, and data journalists from over a dozen countries around the world.
Today I’m very excited to announce the release of libpostal 1.0, featuring a new address parser trained on over 1 billion examples in every inhabited country on Earth from great open data sets like OpenStreetMap and OpenAddresses. There are several new tags for complex sub-building structures (unit, floor, staircase, etc.), P.O. boxes, and categorical queries, each handling around 30 different languages. Best of all, the address parser now features a much more powerful machine learning model (a Conditional Random Field or CRF), which can infer a globally optimal tag sequence instead of making local decisions at each word.
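To make the difference between local decisions and a globally optimal tag sequence concrete, here is a minimal sketch in Python. It is not libpostal's implementation (which is written in C and uses learned weights); the two labels, the token scores, and the transition scores below are made-up numbers chosen only to show a case where the greedy choice at the first word leads to a worse overall sequence than Viterbi-style exact inference.

```python
from typing import Dict, List, Tuple

Emissions = List[Dict[str, float]]            # per-token score for each label
Transitions = Dict[Tuple[str, str], float]    # score for a (previous, current) label pair


def greedy_decode(emissions: Emissions, transitions: Transitions, labels: List[str]) -> List[str]:
    """Pick the best label at each token given only the previously chosen label."""
    path: List[str] = []
    for scores in emissions:
        prev = path[-1] if path else None
        best = max(
            labels,
            key=lambda lab: scores[lab]
            + (transitions.get((prev, lab), 0.0) if prev is not None else 0.0),
        )
        path.append(best)
    return path


def viterbi_decode(emissions: Emissions, transitions: Transitions, labels: List[str]) -> List[str]:
    """Find the highest-scoring label sequence with dynamic programming."""
    # best[i][lab] = score of the best path that ends at token i with label lab
    best = [{lab: emissions[0][lab] for lab in labels}]
    back: List[Dict[str, str]] = []
    for scores in emissions[1:]:
        row: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for lab in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions.get((p, lab), 0.0))
            row[lab] = best[-1][prev] + transitions.get((prev, lab), 0.0) + scores[lab]
            pointers[lab] = prev
        best.append(row)
        back.append(pointers)
    # Follow the back-pointers from the best final label
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def path_score(path: List[str], emissions: Emissions, transitions: Transitions) -> float:
    """Total score of a label sequence: token scores plus transition scores."""
    total = sum(emissions[i][lab] for i, lab in enumerate(path))
    total += sum(transitions.get(pair, 0.0) for pair in zip(path, path[1:]))
    return total


# Hypothetical scores for the two tokens "Washington Ave"
LABELS = ["road", "city"]
EMISSIONS = [
    {"city": 2.0, "road": 1.5},   # "Washington" looks slightly more like a city on its own
    {"city": 0.0, "road": 3.0},   # "Ave" strongly suggests a road
]
TRANSITIONS = {
    ("road", "road"): 1.0,
    ("city", "city"): 0.5,
    ("road", "city"): 0.0,
    ("city", "road"): -2.0,
}

greedy_path = greedy_decode(EMISSIONS, TRANSITIONS, LABELS)
viterbi_path = viterbi_decode(EMISSIONS, TRANSITIONS, LABELS)
print(greedy_path, path_score(greedy_path, EMISSIONS, TRANSITIONS))
print(viterbi_path, path_score(viterbi_path, EMISSIONS, TRANSITIONS))
```

With these toy scores the greedy decoder commits to "city" for the first word and cannot undo it, while the dynamic-programming decoder labels both words "road" and reaches a higher total score.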
The new model achieves 99.45% full-parse accuracy on held-out addresses (i.e. addresses from the training set that were purposefully removed so we could evaluate the parser on addresses it hasn’t seen before). That is more significant than the roughly half-percent gain over the original parser’s 98.9% result suggests, because the new parser has to deal with more labels and more languages/countries.
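As a rough illustration of why full-parse accuracy is a demanding metric, the sketch below compares it with per-token accuracy. It assumes that "full-parse accuracy" counts an address as correct only when every one of its tokens receives the right label; the function names and the sample labels are hypothetical and are not taken from libpostal's evaluation code.

```python
from typing import List, Sequence


def full_parse_accuracy(predicted: Sequence[Sequence[str]], gold: Sequence[Sequence[str]]) -> float:
    """Fraction of addresses whose whole label sequence matches the gold labels."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and the same length")
    correct = sum(1 for p, g in zip(predicted, gold) if list(p) == list(g))
    return correct / len(gold)


def token_accuracy(predicted: Sequence[Sequence[str]], gold: Sequence[Sequence[str]]) -> float:
    """Fraction of individual tokens labeled correctly, across all addresses."""
    pairs = [(p, g) for ps, gs in zip(predicted, gold) for p, g in zip(ps, gs)]
    if not pairs:
        raise ValueError("no tokens to score")
    return sum(1 for p, g in pairs if p == g) / len(pairs)


# Two hypothetical three-token addresses; the second has one mislabeled token
GOLD: List[List[str]] = [
    ["house_number", "road", "road"],
    ["house_number", "road", "city"],
]
PREDICTED: List[List[str]] = [
    ["house_number", "road", "road"],
    ["house_number", "road", "road"],
]

full = full_parse_accuracy(PREDICTED, GOLD)
per_token = token_accuracy(PREDICTED, GOLD)
print(full, per_token)
```

A single wrong token makes the whole address count as a miss, so under this definition full-parse accuracy can never exceed per-token accuracy when every address has the same number of tokens, as in this example.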
The new parser in GIF form
You can read the full post on Medium, but here’s a quick list of the topics Al covers:
East Asian system addresses, randomizing address formats, admin boundary mappings, no umlaut left behind, reverse geocoding to buildings, point-based places and neighborhoods, a places-only data set, a streets-only data set, OpenAddresses, GeoPlanet postal codes, apartments/sub-building components, units, levels & floors, staircases, entrances, PO boxes, simple place name queries, consistency, abbreviated toponyms, category queries, chain stores, feature extraction, N-grams for rare/unknown words, hyphenated sub-words, right-context phrases, postcode contexts, Conditional Random Fields FTW, exact inference, greedy parsing (YOLO method), CRF parsing (Spock method), a fast, scalable CRF implementation, state-transition features, sparser perceptrons, 100GB of public training data.