Saturday, June 6, 2015
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Government & OpenStreetMap: Landscapes, Perspectives and the Horizon
(Panel)
Bibiana McHugh - IT and GIS Manager at TriMet, Portland, OR
Colin Reilly - Director of GIS of New York City
Barbara Poore - former research geographer at USGS
Holly Krambeck - Senior Transportation Economist, World Bank
ALYSSA: Are we good to go? Hi, we've got government here. OpenStreetMap? Bureaucracy's kind of down with that! So I'm going to do some talking and some walking -- or working. So, yeah, we decided to put this panel together. Bibiana, she's coming from Portland. So I'll just pass it over to her, and we'll start it off.
BIBIANA: In working on this session, I thought I would just share a five-minute lightning talk, giving you some background on who we are and what we're doing, and then we're going to go into some questions and get some feedback and discussion from all of you. I think this is a really important subject to be talking and working on. But first, a little bit about me. I'm Bibiana McHugh. And let's see, I -- technical difficulties.
ALYSSA: Luckily I'm here.
BIBIANA: I'm Bibiana McHugh. I started my career in GIS in transportation about 30 years ago. For the past 20 years, I've been working with GIS as an IT manager at TriMet, and for the past ten years, I've been a strong advocate and adopter of open data and open-source software in government. Let's see... back in 2009, we really started taking a look at OpenStreetMap, and it was really due to the development and implementation of OpenTripPlanner. Since then, we've replaced all of the jurisdictional and proprietary basemaps in our agency with OpenStreetMap, including the dispatch and scheduling systems, at a significant cost savings. And this decision was largely based on the requirements of the pilot study: we needed seamless coverage, one that supported multimodal routing, and one that was cost-effective -- and OSM wasn't just the obvious solution; it was the only solution for us, and, you know, the one that made the most sense.
So traditional street centerline files -- for government, they've really been the basemap, but they're designed for geocoding, and the format and the data just don't support multimodal routing with regards to connectivity, multiuse paths, and attributes at intersections like turn restrictions.
So basically, in 2010-2011, we hired four college students and used jurisdictional data and aerial photography to improve the accuracy of the geometry. That was really important for the OpenStreetMap community. And all of our work is documented in an online document, the OpenTripPlanner final report. And this is my student; he is a GIS technician, and his work is dedicated to OpenStreetMap data maintenance and continued work in the tri-county area. This is actually subsidized by external agencies, and he uses SpiderOSM -- it's a free open-source software package that basically compares any centerline file with OpenStreetMap and teases out differences in geometry. And so in the Portland region, we really see OpenStreetMap as the larger, more comprehensive dataset, with other data as subsets of OpenStreetMap that live in harmony and in parallel with each other.
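The comparison workflow described here -- aligning an agency centerline file with OSM geometry and teasing out differences -- can be sketched in miniature. This is not SpiderOSM's actual algorithm; the matching by shared segment id, the vertex-distance measure, and the 5-metre tolerance are all illustrative assumptions:

```python
import math

def max_deviation(line_a, line_b):
    """Greatest distance from any vertex of line_a to its nearest vertex of
    line_b. Lines are lists of (x, y) tuples in a projected CRS (metres)."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    return max(min(dist(p, q) for q in line_b) for p in line_a)

def flag_differences(agency_segments, osm_segments, tolerance_m=5.0):
    """Pair segments by a shared id (an assumption for this sketch) and
    report those whose geometry disagrees by more than the tolerance --
    candidates for manual review by a mapper."""
    flagged = []
    for seg_id, agency_geom in agency_segments.items():
        osm_geom = osm_segments.get(seg_id)
        if osm_geom is None or max_deviation(agency_geom, osm_geom) > tolerance_m:
            flagged.append(seg_id)
    return flagged

# Toy data: segment "a" matches within tolerance, "b" deviates by ~20 m.
agency = {"a": [(0, 0), (100, 0)], "b": [(0, 0), (100, 20)]}
osm    = {"a": [(0, 1), (100, 1)], "b": [(0, 0), (100, 0)]}
print(flag_differences(agency, osm))  # ['b']
```

A real conflation tool matches features spatially rather than by a shared id and compares full line geometry (for example, with a Hausdorff distance), but the review step is the same: flag disagreements for a human rather than overwriting either dataset.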
I hear a lot of concerns about adopting OpenStreetMap in government agencies. We've found that it really is much more cost-effective to improve the data: hiring the students and spending that first year improving it was less expensive for us than licensing proprietary data for just a one-year period. This is something that the public can assist with, and something that the public can benefit from. With many eyes, very few things fall through the cracks -- and we found that the open-source community is very protective of their data. There are other considerations, with regards to compromise, learning new things, and licensing issues. Does anyone know what this is? It's an adapter. That's one of my concerns. So, keys to success for government and OpenStreetMap. You've heard this before in Mikel's presentation, but the role for government is as a contributing member of the OSM community; not a controlling member. We have to start aligning our data, sorting out licensing issues, and, again, providing more information and education.
This is something else that we've talked about: what we need is a data donation center, like a Red Cross for data -- a place where government agencies can literally sign a license agreement to allow their data to be used for OpenStreetMap contributions. Many of these agencies just can't afford to hire legal counsel; that's really what it comes down to, it's very complicated. And a place where the OpenStreetMap community can go to find the datasets that they want to use that are authorized for contributions.
And I've always said that 99.9% of the world's problems can be solved with data, and the better the data, the better the solutions we have for everyone. Thank you.
[ Applause ]
COLIN: Okay. While Alyssa gets the slides up. I'm Colin Reilly. I guess I'm representing the local rung of the government ladder -- or, as some of you might refer to it, hyperlocal -- as I manage GIS here for the City of New York. And I've been around enough... let's go over here. Yeah, that's fine. And I've been around enough in government to have seen the data pendulum swing from a culture of largely not sharing data to more of a culture of open data sharing, although many, probably here in the City of New York, would say there's room for improvement. I wouldn't disagree with that. But I think we're certainly heading in the right direction. What many don't realize, though, is that releasing data can be as liberating for those in government as it can be for those who directly benefit from the release. And that's certainly in line with my personal opinions.
So to me, you know, OSM is sort of open data on steroids, right? And I've always been attracted to it, and I've been glad that over the last couple of years I could actually participate in it. And just, really quickly, by show of hands, who out there works for a government entity? Okay. So I would say at least 50%. So I'm not just preaching to the choir here; I'm actually talking to people who have direct input into these sorts of things. So, a bit about me. I do manage GIS for the City of New York -- managing the city's data repository, and largely sharing most of that. And I try to get out there and be engaged with the community, not hide behind the government firewall, and be accessible with the software and data that we use. So I'll cover some of the OSM projects that we've worked on; my role has largely been as an instigator and a contributor to these projects. The one I presented last year at the OpenStreetMap conference in D.C. was the building import for the City of New York, and I worked with Alex and his team at Mapbox to do that. More recently, the bike lanes project with the local New York City OSM community -- Eric Lawson will be giving a lightning talk about that. Eric, give a wave out there. That one used MapRoulette to review the data as it went in, with the potential of improving OSM data. I'll also cover where we're going in the future. We hope to use OSM for routing and turn-by-turn directions. I was hoping to announce a couple of data releases, but not having gotten all the approvals for those yet, I'm hoping to make some big announcements in the foreseeable future. We're piloting GeoGig, so if people want to put that data into OSM, or whatever else, they'll be able to track those diffs and do it more easily, and we'll use MapRoulette for additional challenges. And so, back to audience participation: does anybody know The Unicorn Song from growing up?
ALYSSA: I have a follow-up question.
COLIN: So it's cats, and maps, and elephants. And so it's cats, and maps, and -- "Meine Welt" is "my world," but it should be titled "unsere Welt," "our world." Because to me, data makes the world a more helpful and wonderful place. So I'll hand it off to the next presenter.
BARBARA: I'm Barbara Poore and I've worked at the U.S. Geological Survey for many years. And I've been working on data-sharing with government since 1963. But I've just now retired, so now, I can really speak truth to power about governments.
And I'm really happy to follow some excellent presentations this morning, particularly by Tom and John, because they laid out the framework that we have to operate in within government. The U.S. Geological Survey made topographical maps on paper for over a hundred years -- tens of thousands of maps all over the country -- and it's been quite a long turnaround to get them all, you know, electronic. One of the issues has been this fragmentation of different agencies having different responsibilities for different things. Nobody had the responsibility -- or the responsibility was very fragmented -- for what they call "structures." And what we mean by "structures" are buildings -- anything manmade: buildings, bridges, et cetera. However, after 2001, the Department of Homeland Security declared certain structures critical, and the U.S. Geological Survey has been trying to get those into the maps ever since. And as we know, the funding for every organization except the Defense Department is going that way (gesturing down). So in about 2010, a bunch of us thought, "Well, we could use crowdsourcing to do this." And we had a little meeting. Steve Coast was there. Andrew Turner. A bunch of people in the open data community. And we proposed that they try to crowdsource this structures data. I can give you more information about that if you want to talk to me; there are some publications that we've done. So we started this experimental project mapping structures around Denver. And as soon as it became successful, and we proved that it could be done, and done really accurately, using volunteers, they took it away from us. So I don't run this program now. But it's called The National Map Corps. It's online; you can Google it. Basically it's been a huge success, but they've had to accommodate themselves in certain ways to OpenStreetMap. We could not directly use the data because of the licensing issues; we had to put out our data in the public domain.
So we can't do any kind of share-alike -- you know, if we take OpenStreetMap data, we can't then put it under that license. In the research project that we did, we had about 25 different types of structures that we were collecting, and we found that it was just too complicated for people. So they simplified it to ten structures that mostly have to do with public safety.
They use the Geographic Names Information System, if you guys are familiar with it -- here's the editor, by the way. I'll go back into that in a minute.
If you're familiar with the Geographic Names Information System, it's a big database of names -- official names for structures in the country -- and we've brought those points in. Part of that goes into the GeoNames database as well. We have a tiered editing system, and so we have volunteers. And so for the whole question of what's authoritative, instead of relying on the crowd for that, they're actually making it much more structured. In other words, not everybody can edit and check other people's work; you have to reach a certain level.
It's been very successful. They're feeding it back into the GNIS system. So I think, via GeoLinks, all that stuff will eventually come into OpenStreetMap. They have a lot of volunteers, and they use some of the techniques we're all familiar with: virtual recognition badges for mapping a certain number of points. However, it is not a community, in that it does not self-identify, talk among themselves, et cetera, et cetera.
We've used map challenges, social media, et cetera. 110,000 points have been mapped in the last two years, and I think there are about 300,000 mappers right now. In the future -- one thing I forgot: we couldn't use our own software, so we based our editor on Potlatch 2. However, I've learned they're moving away from the OpenStreetMap software and building their own editor now, which is disturbing. They're also developing a mobile app. And there are no known plans to contribute directly back to OpenStreetMap; I think people at a much higher level than me need to be persuaded. I just want to shout out to Sophia Liu; she's worked with me on OpenStreetMap projects. And also, some of you might know Eric Wolf -- he helped me with the first iteration. Thanks.
[ Applause ]
HOLLY: Hello, my name is Holly Krambeck, and I'm here to talk about how the World Bank uses OSM to support transport practice. Okay, the World Bank is part of the United Nations system that supports global economic development that's sustainable, equitable, and just. The transport team runs a capacity-building and training program for government transport agencies. That's the intro. Now, suppose you had a city of 12 million people where 70% of all trips are made on transit. Let's call it Manila. And you want to get somewhere on a bus, or plan a trip -- what is something you might need? A map! Right. You might be surprised to know that 32% of the hundred largest cities in the world do not have maps of their transit systems, and among low-income cities, 92% do not have maps of their transit systems. As an organization that supports transport, this is a problem. And it's not only maps: our counterparts don't usually have the technical or financial capacity to collect most of the data we need to support transport infrastructure planning and investment -- GIS road maps, traffic speed information, accident records. So, why does the World Bank transport team support OSM? Two reasons. The first: OSM provides a very low-barrier gateway to a city's, or even a country's, first road GIS map. For example, in the Philippines we were approached by the Department of Agriculture to produce GIS shapefiles for the rural road network. Rather than compile these shapefiles, which would sit on someone's desktop in Manila, we treated it as an enterprise solution: a repository that all local and federal agencies could access, covering not just the rural roads but the whole road network. So we've been working with Demsey on some of the existing network tools for road asset management. The second reason is that OSM provides an opportunity to help us scale open-source software that helps traffic management and planning agencies do their jobs better.
For example, in Manila we worked with Conveyal, and they developed an Android app that makes it much easier to map public transit systems. Also working with them, we developed an open-source tool that transit agencies can use to maintain these data. This effort was extremely successful: in addition to producing that map of Manila, these software tools have been used in at least eight different countries to our knowledge, and this is largely made possible because they're based on OSM. Now, something we learned is that if a transit network hasn't been mapped for many decades, it's going to look something like this. Which is a mess. And our counterparts don't always know what to do with that. So we've also supported tools for transit agencies to be able to make sense of these data. For example, we've been working with Azavea on Open Transit Indicators. This pulls in transit data, GIS census data, and OSM to generate comparable benchmarks across cities. What this means is, if you take a place like Nairobi, where the indicators tell us that 30% of the population don't live within walking distance of a transit stop, we can simply create a scenario -- say, what if we add a bus route in this low-income area; by how much can we improve that metric? Similarly, my colleagues in Latin America have been working with Conveyal on scenario analysis, so that transit authorities can create safe and accessible routes to jobs for lower-income populations in that city. We also work with accident data. In most of Southeast Asia, most accidents are recorded in logbooks -- there are thousands of logbooks on accidents in Southeast Asia -- so we worked on an open-source platform that these agencies can use to record their accidents on an OpenStreetMap basemap. Now, this application is largely just points on a map, but what was surprising to us was how rapidly it scaled.
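The accessibility indicator described here -- the share of residents within walking distance of a transit stop, re-computed under a what-if scenario -- reduces to a simple calculation. A toy sketch of the idea (the projected coordinates, straight-line distance, and 500 m threshold are simplifying assumptions; production indicators use street-network walking distance and census population weights):

```python
import math

def share_within_walking_distance(homes, stops, max_m=500.0):
    """Fraction of home locations within max_m metres of any transit stop.
    Coordinates are (x, y) tuples in a projected CRS, in metres."""
    def near(home):
        return any(math.hypot(home[0] - s[0], home[1] - s[1]) <= max_m
                   for s in stops)
    return sum(near(h) for h in homes) / len(homes)

# Toy scenario: two of four homes start within reach of the single stop;
# adding a stop in the underserved area raises the metric to 100%.
homes = [(0, 0), (300, 0), (2000, 0), (2500, 0)]
stops = [(100, 0)]
before = share_within_walking_distance(homes, stops)              # 0.5
after = share_within_walking_distance(homes, stops + [(2200, 0)])  # 1.0
print(before, after)
```

The scenario step is just re-running the same metric against an edited stop list, which is why a complete, editable basemap (OSM plus GTFS) makes this kind of planning feedback loop cheap.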
This program is now used by the traffic police of Manila, and now by the Philippine National Police as their official crime map, and because of this demonstrated scale, we were able to get additional funding to develop, with Azavea, a more robust platform for accident reporting and analysis, which other countries hope to adopt as well -- and that scale, in part, is possible because of OSM. Finally, we are working on a platform for traffic management agencies that takes in GPS data generated by taxis and helps them conduct travel-time surveys and look at real-time congestion; we're rolling this out now in six cities with data from a competitor to Uber in Southeast Asia. To make these applications work, we need to support the completion of OpenStreetMap, so we work with ICP, ITP, and HOT to make these maps. All of these tools can be found on the World Bank GitHub site. In conclusion, that's how our team uses OSM to support our work. Thank you.
ALYSSA: Great. Thank you, everyone. I'm skipping my presentation; I'm just moderating. We do have some prepared questions that we'll ask everybody to respond to, but I thought we could start by opening it up to the audience, to see if there are any specific questions or experiences you'd like the panel to respond to or talk about. Anyone? Somebody has raised their hand. Great. I think you press a button.
AUDIENCE MEMBER: I didn't know which button it is. I'm just going to shout.
ALYSSA: Okay.
AUDIENCE MEMBER: With the project that you mentioned for the World Bank, where you facilitated data collection through Conveyal and TransitWand -- how did you get people on the ground? Did you recruit students from the university, or how did you go about that?
BARBARA: So.
HOLLY: So it was actually built by the transport agency staff. First we worked with students from the University of the Philippines, and the challenge of that project is that it's remarkably difficult to build a transit map when you don't have a base map -- if you think about that for a second. So the students basically hailed any transit vehicle they saw and rode all of them over a period of eight months, and were able to map 800 routes. What we found was that this method was completely unsustainable for keeping that map up to date, and that's why we developed and supported TransitWand. What it does is allow the transit agency to record a route, upload it to our server in GTFS format, and then edit it in a user-friendly GTFS editing interface. Good question.
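GTFS, the format the collected routes are uploaded in, is just a zip of plain CSV text files (stops, routes, trips, stop_times, and so on). A minimal sketch of reading one of them, stops.txt -- the stop ids and names below are invented for illustration:

```python
import csv
import io

# A minimal GTFS stops.txt. In a real feed this file sits alongside
# routes.txt, trips.txt, stop_times.txt, etc., zipped together.
stops_txt = """stop_id,stop_name,stop_lat,stop_lon
1001,Quezon Ave,14.6420,121.0385
1002,Araneta Center,14.6219,121.0530
"""

def load_stops(text):
    """Parse GTFS stops.txt into a dict keyed by stop_id."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["stop_id"]: (row["stop_name"],
                             float(row["stop_lat"]),
                             float(row["stop_lon"]))
            for row in reader}

stops = load_stops(stops_txt)
print(stops["1001"][0])  # Quezon Ave
```

Because the format is this simple and fully specified, an agency-facing editor only has to round-trip these CSV tables, and any downstream trip planner (such as OpenTripPlanner) can consume the result unchanged.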
ALYSSA: I think we had another question on the floor.
AUDIENCE MEMBER: Which button was that? Does that work? Yeah. Green. Hi. Thank you very much for all your presentations. My name is Heather McCodd. I also lived in Calcutta, and I was in a meeting where people were trying to figure out how to start using OSM for transportation. And I saw during the Nepal response that Chad created a guide for how to map IDP camps. So I was wondering, based on your experiences, have you started to think about guides for what you learned? What you just imparted is really important for my work professionally, but I think for a lot of people too -- has the World Bank written, for example, a guide on how to implement this kind of work? Have you started to document those processes so that others can replicate them?
ALYSSA: I would say, I think that's a really good question to kind of work through at all different levels of government so I'm going to hand it over to Collin first and just go down the line.
COLIN: Do you just want me to take the mic? Yeah, that's a very good point. I mean, you're probably fairly familiar with imports, which is kind of why we used MapRoulette: you're sort of attempting an import, but you're looking at each geometry as it comes through, with a set of eyes on it -- it's sort of the intermediate approach, right? So that was a lesson learned. But in terms of documentation, I've written a blog post about it; certainly a more in-depth write-up is probably warranted. That's a very good point.
BIBIANA: I spent an entire summer writing the trip planner report. It's about 80 pages, and that's exactly why we wrote it: to help other government agencies, universities, everyone, get started. There's an appendix that has everything, all of our procedures. So again, you can just do a search on "Open Trip Planner."
ALYSSA: Yeah, I'm going to say something. I think everybody in this room, especially the people on this panel -- we're early adopters. Trailblazers; we could go on and on. And in order to set that example and inspire others, recording that process and sharing it with others is really part of the work that we do as well. So I commend everybody on the panel and everybody here as well. And it says three minutes, and I'm going to try to catch your eye -- since we started ten minutes late, I'm just going to go over. Executive decision. Awesome. Those other guys, they talked a lot. A lot. So don't take it out of our time.
>> I guess I'll just say, in terms of documentation, you know, we wanted to build some successes first and make sure these attempts worked, because -- and I think this has been a theme through the sessions in this room -- these initiatives, which may seem straightforward to this audience, are not straightforward to the bureaucracy. So we needed to actually prove that things work. But I guess we're now at the stage where we should be doing a better job documenting and sharing knowledge. So I think that's good.
BARBARA: We did a report on the experimental phase of our project. We didn't go through every single procedure, but I think you can get an idea from it. I don't know what to tell you to search on -- talk to me if you're interested. What occurred to me, though, is that there's now this federal community of practice for citizen science, which has representation from all the agencies. And that might be a really good vehicle to start to compile some of these resources. And, um, I know Sophia is involved with them. So if you want to talk about that with her, that would be awesome.
ALYSSA: Sorry, I have one other statement. You know, Kate was saying government likes to talk, talk, talk, and sometimes they like to get work done. In my experience, government is very good with reports as well. So I would encourage anybody here that's part of government: when you're not talking or working, actually, you know, write and provide resources. We have a question in the back here. Their button is being pressed.
AUDIENCE MEMBER: Hi, I'm Patrick. I just had a quick comment about lessons learned and tracing guides. One of the things that we're trying to do with the GitHub repo is modularize guides, so you can really get lessons learned in small chunks at a time and you don't have to read an 80-page report. And I think that's really the direction we need to go: hopefully, at the end of projects, the outcome isn't an 80-page report but a guide that people can consume very easily -- they don't need to sit down for, like, two hours to read it.
BIBIANA: It has lots of diagrams and it's double spaced.
ALYSSA: I guess we have a question in the back here.
AUDIENCE MEMBER: One of the things that would be great -- I think that storytelling and sharing experience is very, very important, because anecdotal stories like what you all do are pretty amazing and inspiring for people who are working in government, and similarly for institutions working on similar programs. Work with the community and the forums around here to get your story out. You know, we have an OpenStreetMap blog, we talk to people -- basically, just amplify the stories, to really make sure that people get to hear about them.
ALYSSA: Yeah, I mean, having talked to everybody -- I feel like the relationships with the community, and how each of the initiatives represented here works with the OpenStreetMap community, influence the successes and challenges of the work... I wonder if it would be useful to go down the panel again and talk about the relationship with the community -- "community" being us.
COLIN: Sure. Yeah, I mean, that's a very important point. I try to get out there, as I said, to sort of break down the government firewall. There are plenty of geo-related events, and not just the ones here in New York City that Alyssa has her hand in -- you know, GeoNYC, even BetaNYC -- and I try to attend these, present at these, and hear what these communities are interested in. Just because the public is who we in government serve; it's not just other city agencies, but the public at large. So if you're not out there, how do you know what's being said, or what's being asked for?
BIBIANA: I'm going to give you two really quick examples of our work with the community. The first one: we did a lot of research up front and we were very involved online, and we also held open houses and met a lot of the local users. But early on, we made a very common mistake: we started the usernames with "TriMet," and edits really need to be attributed to people, not organizations. So that was one example. Another one is compromise with regards to data attributes -- really knowing what the community's needs are and being respectful of that. For example, in our community, people tagged things differently than we expected; rather than go down that road, we simply changed our code. So it really is a relationship, and you act as one community.
HOLLY: I talked about how we engage the OSM community. If there's an existing community, we'll hire them. Um, and if it's a new community where we work, we'll support training, so that when we leave there'll be a critical mass of people -- students and government staff -- who can help keep the map up to date.
BARBARA: The person who really worked most closely with the OpenStreetMap community when we started was Eric Wolf, who knew a bunch of the people in this room, probably. But I don't think there's an ongoing relationship anymore, which is kind of sad.
ALYSSA: And as a follow-up -- I mean, of the projects presented, I feel yours has perhaps encountered some of the biggest challenges. Do you think the investment in the community relationship also plays into the effectiveness of the program within the agency?
BARBARA: There's a huge investment in the current structure that exists within the USGS, of having this kind of hierarchy of people. And they don't fully realize that OpenStreetMap is a different kind of community that might require a different model. I think it goes back to what John was saying: these institutions have -- he called it "code" -- something that restricts how they think about things. It was much easier for Sophia and me to actually talk to scientists about using crowdsourcing; they bought into it immediately. But the mapping division did not. But there's hope. I mean, there are people, you know, young people, coming up through the ranks. So there is hope.
ALYSSA: I mean, are there any other questions, or don't you want to see more of my slides with cats on them?
COLIN: I think Philipp's got his hand up.
ALYSSA: Are there any more questions? Phil?
AUDIENCE MEMBER: Hi, so I'm Phil. I talked with David Deka (phonetic), and my general understanding is that two of the biggest impediments to collaboration between government and the OpenStreetMap community are licensing compatibility, which you talked about earlier, and the idea that bulk contributions are kind of antithetical to the process of individual contributions. Some of this might be more specific to the current government, but I guess: do you see any initiatives addressing those head-on? Or is there any consensus, any agreement on the problems -- even just admitting what they are, making it clear that those are problems?
BIBIANA: I would really like to see you come up.
ALYSSA: All in this room.
BIBIANA: And I really think that for governments, it's too costly to hire legal counsel. In our area, we spent so much time trying to figure it out; finally, we slapped the same OSM license on our data to make sure that there weren't any issues. And I think there really could be a hub where government agencies can sign terms of use -- something really simple that lays everything out -- and, you know, we can go forward with that. But until all government data is public domain, you know, it's going to be an issue.
ALYSSA: And I would also say, I think businesses also have a responsibility, and potentially the means, to help with that legal work.
ALEX: Public service announcement. We don't want to cut this conversation short.
ALYSSA: It's green. So I think it needs to be red.
ALEX: Yeah, quick public service announcement. I don't want to cut the discussion short. We're going into lunchtime, which is fine. But when you go to lunch, the way you get there is: when you leave this conference room, you'll turn right and walk past the exhibitor space. You keep going straight, walk out of the building, and you'll walk into the next building. There'll be volunteers posted to guide you along the way.
ALYSSA: And we are together? Like, are you going to stay for? We'll close up in two minutes. So you can also lead us, right? 'Cause I don't know how to...
ALEX: I will have to skip right now.
ALYSSA: I'm going to follow Alex. So yeah, we're going to close up in just a little bit. Are there any other questions, or I can show you another cat? On the left.
AUDIENCE MEMBER: A quick question about government engagement -- government engagement to make the most of the applications that you're talking about. So I come from Nigeria, where it's very difficult to get the government to use what you build. So what happens is, you put a lot of effort into these things, and then the government doesn't use them. So in what ways do you ensure that these things get used -- especially in developing countries? What ways will you use to, first of all, get the government interested in using these applications, and then to get continuous use by the government to make decisions?
HOLLY: Yes, I guess a few things. First, when it comes to developing open-source software, that never comes out of a loan. We don't want our counterparts to pay for things that are risky, that may not work, and that we intend to use in lots of places -- so all of that work is grant-funded. Second, we only work with existing counterparts where we already have projects and where those tools will fulfill a need in our project development. So, for example, in the Philippines, we have bus rapid transit projects in Manila and Cebu. Because the government didn't know where the existing transit network was, it was very difficult to plan a BRT -- we didn't know how it would connect to the existing system. To solve that problem, we could have hired an international consultant to make a map and say, "Here you go." But what we wanted was something a little more sustainable, so that after the BRT is built, the local government can continue to update that map over time. So that's the approach: one, grant funding; and two, we leverage existing projects and use the input of the counterparts in the design.
AUDIENCE MEMBER: And from your experience, when you hand it over to the government -- especially with the tools in the Philippines, for example -- how long ago did you hand it over, and what is the current state since? Is it improving? Are they actually doing what they should be doing to keep it updated, or have they just dropped it?
HOLLY: So of the initiatives I've presented, so far we've only finished one; the rest are still in progress, so I can't speak to them. But the one we finished was the transit mapping in Manila, and with that one, yes. If you go to the Department of Transportation website, you can download an updated transit map. We've been tracking it, and there are updates that have been made to the maps since we finished the project, by the franchise bureau. But more importantly, they used that map to create a route rationalization plan. They spent many years creating that plan, and they're reducing the number of jeepney routes and minibus routes over a 20-year period: as the route franchises expire, they just let them go and introduce new routes. So it's a good example of how they made the map, they're able to update the map, they made it publicly available, and they were able to use it for a good development outcome.
AUDIENCE MEMBER: Thank you.
HOLLY: Thank you.
ALYSSA: I mean, I think we are good for lunch. I don't know if anybody else is hungry but I think this is a really great conversation to have started and I encourage us to do a breakout session later. We'll bring Heather and everyone to that. And so we'll put that on the board and hopefully we'll be talking about that soon. So thank you.
[ Applause ]
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Panel: Peripheral Data in OpenStreetMap
Panelists: Ian Dees, Diana (?), Jan Erik Solem, Jereme Monteau, Drew Dara-Abrams
Saturday, June 6, 2015
>> OK, if people want to take their seats, we'll get started in just a minute. We have not one, not two, but one, two, three, four, five speakers up here. Don't worry, we'll keep it short. But thank you all for coming to our panel on peripheral data projects -- data that is in, outside, near, or alongside OpenStreetMap. We're all involved in a number of different open-source and open-data projects that are connected with OpenStreetMap, and we thought it would be a good idea to chat about some of the similarities and differences among our projects, and the reasoning behind why we're working on projects that aren't part of OSM proper but are very tightly connected to it, both technically speaking and in terms of community involvement. So I'm just going to set the stage with a couple of remarks, then each one of the panelists will give a little introduction to their project and speak a bit about how it relates to OSM technically and community-wise, and then we'll have some time at the end for discussion, and we'll have time for questions from y'all as well.
>> So just to set the stage: isn't OSM big enough for everything under the sun? It has a very large data model, very flexible. It has an even longer API doc, and to go with each of these pieces of technology there's a mailing list or three. So shouldn't there be room for everything within OSM proper? Some of the problems and topics that we and some of our colleagues and collaborators have been tackling don't exactly fit this model. We have been doing bulk imports -- sometimes that's a four-letter word, but bulk import can also mean federating authoritative data, and while OSM has a real emphasis on community-validated contribution, there's also value in these datasets not just as a one-time import but as a continual process and a continual aggregation. There's also ephemeral data, like traffic or the use of infrastructure -- things that are not base map but are relevant to how that infrastructure is used. The OpenStreetMap data model is very flexible, but there are other types of objects it doesn't represent very well, like 3D models. Out in the lobby area, at the Mapzen table, we've got some nicely printed-out 3D models of Manhattan, and you can see both what the OpenStreetMap data model allows for in terms of extruded polygons and the things it doesn't allow for, like curves and roofs and things that can't be represented as a polygon. Photographs, imagery, and temporal data are all things that we are interested in that don't fit exactly within the OpenStreetMap universe. So among the five of us up here, we've got a number of projects we're going to speak about. I'll let each one of the panelists introduce themselves and tell you a little bit more about their background. And these are the questions we're going to consider. We're going to think about why, in the case of these projects, it was relevant to create something new, separate from OSM proper.
How do these projects integrate technically with OSM in terms of ID schemes and APIs, and how are these projects connected with the OpenStreetMap community, both in terms of people, but also in terms of things like user accounts or legal issues? And I'll be curious to see how many of these things are patterns among all our projects -- and maybe patterns that we'd like to share with other projects in the future.
So with that as background, let me hand the mic over to Ian. He's going to talk about the OpenAddresses project.
>> Ian: Hi, everybody, my name is Ian. I started -- I suppose, I helped start -- OpenAddresses.io, which is a project designed to collect and store open address data from around the world. My background is that I graduated from college and was in need of something to do with my spare time, so I went to OpenStreetMap, and 10 years later I am still working on it.
[laughter]
I started OpenAddresses because I spent a lot of my time trying to import data into OpenStreetMap and met resistance -- rightfully, in some cases -- and I needed a spot to keep track of where I was making progress in collecting open data that didn't involve causing ripples in the fabric of the OpenStreetMap community, so I started this other project to keep track of that.
>> Diana: Hello, I'm Diana. I work for Mapzen, and I'm relatively new to the geo space -- I started in maps about six months ago; before that I was in software in various industries. When I started at Mapzen, one of my projects was to evaluate the quality of the administrative data within OpenStreetMap. Part of that meant that we were going to extract that data on a regular basis and make it publicly available for download, and that just launched today, actually. We're excited to make that available. Basically, the idea is to isolate these subsets of data, which makes it easier for people to evaluate the quality of that data, to fix it where it needs to be fixed, to fill in the gaps, and then feed that back into OSM. Every time we do the extract, which will be monthly, those changes will be reflected in the dataset. Doing that kind of evaluation and quantifying those subsets of data is very difficult when you're looking at the large planet file, so creating these extracts is something we think is going to be helpful for the community, and this is just the beginning -- we're looking at other datasets that could be interesting to extract in the same fashion, on a regular schedule and separately. We also do Metro Extracts, which interest part of the community: that's just slicing the planet file into the cities or countries people have been interested in, and that's been one of our more popular offerings. It's out there if you want to check it out. So I'll pass it on.
>> Jan Erik: Hi, I'm Jan Erik from Mapillary. We are a small startup. Our users, most of them use smartphones; they walk around, bike, or drive, and use our app to take photos. The photos are combined between users and across time and stitched together to essentially form street view. The whole reason for this is to be an independent source outside of the bigger mapping silos, and to fix the crappy coverage that you see today from mapping vans, so that anybody who wants better coverage can go out and map the areas they care about. So, yeah, I started in my home town -- it doesn't have street view, and I mapped it in two weekends with a bike; it's very easy.
And so we're not from the mapping space initially; we're computer vision people, so extracting information from the images, stitching them together, and creating 3D data is what we do. This screenshot here shows our traffic sign recognition and some samples. I guess we'll talk more about those later.
>> Jereme: Hey, everyone. I'm Jereme Monteau. I'm a cofounder of the team behind OpenTrails, a data spec for parks and trails information. I talked about it a little bit earlier today -- did anybody in here catch that? Cool. I won't go into the same detail, but basically, we've been working with parks and recreation agencies for a number of years on helping them publish their data, put their data out in the world in a way that's useful to park visitors. OpenTrails really addresses for them the need to have an already-defined format that agencies can agree upon if they want to collaborate with each other or with the developer community. So we've been doing as much as we can to push that forward, let people know about it, and work with the various communities around parks and recreation data, because our mission is to get more people outside and get new people outside -- and, yup, it's a lot of fun.
>> Drew: My name is Drew Dara-Abrams, and at Mapzen we've been building a service called Transitland for aggregating and bringing together public transit-related data. Public transit involves both geographic and temporal data -- knowing where the routes and stops are, but also when they're served. With this project, we found that OSM handles the geographic component of transit very well, but it doesn't handle the temporal component so well: timetables for bus schedules, calendars, knowing that Monday the whatever is a holiday schedule. These are the kinds of things that don't fit very nicely into an OSM tag, so that's why we've created this separate project that connects together with OSM data -- so it's not its own silo, but also not trying to cram the temporal data of transit into an OSM tag.
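As an illustration of the temporal side described here, below is a minimal sketch of deciding whether a transit service runs on a given date. The data is made up; the field names loosely follow the GTFS calendar.txt and calendar_dates.txt conventions, but this is not Transitland's implementation:

```python
from datetime import date

# Hypothetical GTFS-style calendar row: service runs on the listed
# weekdays between start_date and end_date.
calendar = {
    "service_id": "WEEKDAY",
    "days": {0, 1, 2, 3, 4},          # Mon-Fri, using date.weekday() numbering
    "start_date": date(2015, 1, 1),
    "end_date": date(2015, 12, 31),
}

# Exception dates, as in GTFS calendar_dates.txt (1 = added, 2 = removed).
exceptions = {date(2015, 5, 25): 2}   # Memorial Day: weekday service removed

def service_runs_on(cal, exc, day):
    """Return True if the service operates on the given date."""
    if exc.get(day) == 2:
        return False
    if exc.get(day) == 1:
        return True
    return (cal["start_date"] <= day <= cal["end_date"]
            and day.weekday() in cal["days"])

print(service_runs_on(calendar, exceptions, date(2015, 6, 8)))   # a Monday -> True
print(service_runs_on(calendar, exceptions, date(2015, 5, 25)))  # holiday -> False
```

The point is that "when" lives in calendars and exception lists, a shape of data that a flat key=value OSM tag doesn't capture well.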
So that's the project I'm representing on this panel. Now that we've done brief introductions, I'd be curious to hear from everyone a little bit more about the nitty-gritty of how your projects connect into OSM -- whether that's touching an API, using an ID scheme, or just making use of OSM tiles as the touch point. Curious to hear both the good and the bad.
>> Diana: We plan to do a monthly extract that runs automatically, so the data represents what is in OSM, except it's in JSON format. That's how we're tied to it.
>> Ian: So OpenAddresses does not really connect too much with OSM right now. I think where we touch is where the community kind of comes together around open data. A lot of the people who are interested in the same sort of open-data things are in both OpenAddresses and OpenStreetMap. I think our commonality is trying to make open data a thing with normal people, and making it accessible and interesting. On a more practical level, OpenAddresses did come out of the desire to import data into OSM, and at some point in the future I hope somebody spends the time to take the data that we have and put it into OSM. It's not happening right now, but it would be a very easy thing to do if somebody were to take that -- hint, hint.
>> Jan Erik: For us, we're not using any OpenStreetMap data yet; that may change. But it goes the other way: we allow OpenStreetMap to use any of our data for editing maps -- our images, all the data we extract from images, like traffic signs and things like that, connections between images. We're integrated in the iD editor, and there's also a project over the summer, so hopefully everyone can get access. The street-level imagery layer in iD comes from Mapillary. So we're very much connected in the sense that we want to push as much as we can to make it available for OpenStreetMap, but we're not consuming any OpenStreetMap data yet -- and of course there's an overlap of the user base, too.
>> Jereme: So we do a lot of stuff with OpenStreetMap as a company. OpenTrails specifically was designed as a standard from the very beginning -- thanks to who started it and where it came from -- to be interoperable with OSM in a variety of ways, which I think has really helped OpenTrails become adopted and grow. In our products, we're trying to come up with ways to make it really easy for people to become OSM editors, or just become more engaged with OSM, so we connect to OSM and are starting to do some publishing that way. OpenTrails is exciting for us because it represented an opportunity to help improve something we had been using for a long time -- we were always kind of looking for a way to give back, and this seemed like a way we could do that. But it also really helped us, honestly, as a business, to tell park agencies: hey, there's an option now for getting your data out there that could be public domain, and it also benefits from everything that comes with that, and guess what, it's kind of designed to be interoperable with a lot of other stuff, including OSM, so this could be a way to get your data out. It's very early, but signs are good on that working out, potentially. Those are exciting opportunities where a commercial product can help improve OSM, and we're doing that by supporting OpenTrails and talking about OpenTrails.
>> Drew: In terms of technical integration with OpenStreetMap, we effectively do a crosswalk. We have our own ID scheme for stops, for routes, for transit agencies, and then OSM has its IDs for nodes and ways. We will take a stop on our side and conflate it with the OSM way -- say, what's the closest way to a stop -- and we store that in our data store and serve it out on our API. And licensing-wise, our IDs could fit very nicely into OSM tags. So our goal is to have loose connections: have a crosswalk so that users can move back and forth between datasets, but effectively never have to raise the question of a bulk import one way or the other. So, speaking of bulk imports -- I heard one wish from Ian. I'm curious to hear if other panelists have a wish-list item, either technical or community-wise, that could maybe help your project be better integrated with OSM or maybe open up some more possibilities?
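A minimal sketch of the conflation step described above -- picking the closest OSM way to a transit stop. The way IDs and coordinates here are invented, and distance-to-nearest-node stands in for a proper point-to-segment distance, so this is the general idea rather than Transitland's actual code:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical OSM ways, each a list of (lat, lon) node coordinates.
ways = {
    "way/100": [(45.5231, -122.6765), (45.5235, -122.6750)],
    "way/200": [(45.5300, -122.6600), (45.5310, -122.6590)],
}

def closest_way(stop_lat, stop_lon, ways):
    """Crosswalk step: choose the way whose nearest node is closest to the stop."""
    def dist(way):
        return min(haversine_m(stop_lat, stop_lon, lat, lon) for lat, lon in way)
    return min(ways, key=lambda wid: dist(ways[wid]))

print(closest_way(45.5233, -122.6760, ways))  # -> way/100
```

Storing only the resulting `way/…` ID alongside the stop keeps the coupling loose: either dataset can change without a bulk import in either direction.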
>> Jereme: It's funny, because we've been sort of circling around the OSM community just as a consumer of the data, and trying to start understanding the community and the issues around licensing and everything. We're really not trying to come in and say this is how it's got to be, or prescribe something -- same thing on the park side. We work with a lot of park agencies that have existing policies, and we're excited because it looks like OpenTrails could be a way to not necessarily put those kinds of demands on the community, and to be a bridge between those things. I think a lot of the discussions around licensing are great, but we're still pretty early as an involved part of this community, so we're excited to just be an intermediary to a certain degree and understand what the issues are on the OSM side -- or any other community, for that matter -- and what those issues are with park agencies, and how various technologies or standards can be a sort of go-between. So we don't have any specific wishes, except that the community continue to be as receptive as it has been to hearing other communities' issues or challenges with working with OSM. Just keep on talking and having an open mind for this stuff, because I think some really exciting things are about to happen in a lot of different ways.
>> Jan Erik: Like I said, we're not using OpenStreetMap today, but we're starting a project now over the summer on routing using photos -- meaning that you can ask for a route that has, say, good photos along it, or, as may be the case for us and our users, a route from A to B through places that don't have photos yet, where we need to go and capture. That would be a place where we would do a little map matching between the two. After the summer we'll know, but I think that's an exciting project where the two datasets will blend.
>> Diana: I don't think we have any wish list right now for OSM itself. I think we're just wishing that the community starts using these extracts, and that that leads to finding holes in the data and invalid data, and people just kind of feed that right back into the OSM dataset, and it will help fix everything.
>> Ian: I think I've already answered this, but my wish would be that OSM continues being awesome and kind of spreads that awesomeness around to these other organizations. When you're bored tracing outlines of buildings, consider also searching for open data in your area, or mapping out a trail instead of a building, or something like that. For me, the interesting part is that there are all these other projects out there that are very similar to what OSM is already doing and are very compatible with it, and the OSM community should use them.
>> Drew: So, not to go into this with too much depth, but I'm curious to hear if any of you all have had successes or failures with licensing issues?
>> Ian: So one of the sticky points with OpenAddresses is that we leave the worry about licensing up to the person -- the consumer of the license, or of the data. That means it's kind of annoying as a data consumer of OpenAddresses, because you have to go in and read about what the licensing is and understand the risk, if you feel there is any. I think for our project, licensing in general is an annoyingly complex thing, and if you were ever to create open data or to ask your local government official for open data -- like you should be doing, always and forever -- you should be asking for data to be released with a license that's specified and very open, preferably even just CC BY or CC0 or something like that. Something that is specified, so that there is no ambiguity about what is going on with the licensing.
>> Jereme: Yeah, we're in kind of a funny situation with licensing, since a large component of what we do is open data and encouraging open licenses -- trying to remove as much of that question mark as possible, so that when people want to come and use data, they don't necessarily have to worry about licensing; they can just know that they're going to be free and clear to do what they want with it. That's one aspect. But our customers are governments a lot of the time -- we work with nonprofits as well -- and governments specifically are going to have some very specific requirements. Sometimes we can talk through parts of it, but sometimes it's just: hey, stuff's got to be public domain. So we're trying to understand as best we can what the issues are and trying to educate both sides, and so far so good. I anticipate it's just going to be a thing we have to worry about. We're trying to build a business, too, so we're always trying to figure out what parts of the data that aren't in the OpenTrails set we can lay a license on. We're not coming down hard yet, anyway, but it's something we spend time thinking about -- and the less time we have to think about it, the better. The other side of that is that we feel we can be pretty helpful with navigating those issues as much as we can.
>> Drew: I'll add that we're in a similar situation to OpenAddresses, aggregating transit data that's provided by many different sources, each attached to its own license and terms. So each one is its own license, each one has its quirks, and when you multiply them out -- the pros and cons of each license -- there are a lot of combinatorial hostilities, let's just say. So OpenAddresses and some of these efforts to aggregate authoritative public sources do run into some issues and some challenges. Is that a question?
AUDIENCE MEMBER: Yes, I have one.
AUDIENCE MEMBER: Hi, my name is Gavin Trecko (ph). My point is specifically about OpenAddresses. In New Zealand, the government has been trying for the past four years or so to make a framework for the release of government information. That's the NZGOAL project, a government agreement on licensing, which basically releases things that were private under Creative Commons licensing. A number of those services, including LINZ, are marking up in the metadata the quality and the licensing of the data. Is that being used, or is it pretty much just thrown at the user to sift through all the different aspects of it?
>> Drew: Correct me if I'm wrong, but New Zealand is making a national effort to standardize on Creative Commons licensing, which may make address data available in a standardized way. And the question is whether these types of projects are including metadata that can actually help users make sense of licensing constraints, but also the accuracy and the uncertainty of those datasets?
>> Ian: So for OpenAddresses, in the whole source-file JSON blob thing that we run off of, one of the keys is the license, and it's usually a link to a document on some page somewhere. Sometimes it's just a word; sometimes it's not there. So we do keep track of that metadata -- sorry, it's almost always correct when it's there, but sometimes it's not there. And then same thing with accuracy: we source our data from different places. Some of it is parcels that we take centroids of. Some of it is data that came from a place that came from another place. So we have different levels of accuracy that we keep track of, with the goal being that if we have overlapping data from one provider and another provider, we could pick the one that's more accurate, in theory. So we do keep track of that kind of stuff, and it's all human-entered, so of course there's error -- but it's crowdsourced, so eventually it will be right.
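A sketch of what checking that license metadata might look like. The source file below is a simplified, hypothetical OpenAddresses-style blob -- the field names and URL are illustrative, not the project's exact schema:

```python
import json

# Hypothetical OpenAddresses-style source file: the "license" and
# "attribution" keys are the metadata consumers must check themselves.
source = json.loads("""
{
    "coverage": {"country": "us", "state": "or"},
    "data": "https://example.com/addresses.csv",
    "license": "https://example.com/open-data-license",
    "attribution": "Example County GIS"
}
""")

def license_status(src):
    """Report whether a source declares its license, per the discussion above."""
    lic = src.get("license")
    if not lic:
        return "no license recorded -- consumer beware"
    return "license recorded: " + lic

print(license_status(source))
```

Because the value can be a link, a single word, or missing entirely, any downstream consumer has to handle all three cases rather than assume a uniform license.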
>> Jereme: Yeah, so with OpenTrails, part of me wishes there was public domain built into the spec, but the reality is you can't do that, because then it won't be adopted. The whole point of OpenTrails is authoritative data from agencies, and we aggregate it together, and it's actually built into the spec that there's a license that goes with that. It's by organization: you can load up like ten organizations into an OpenTrails download, and each one of them can specify a license. We encourage public domain. I wish it could just be public domain and be done with it, but for adoption we're dealing with potentially thousands and thousands of agencies, so you've gotta give them that choice. But it's built in, so the metadata is there, at least in OpenTrails, even if it's nonspecific, so --
>> Drew: Are there any other questions?
AUDIENCE MEMBER: As far as OpenTrails and Transitland, were there any technical reasons for not using the OpenStreetMap stack? And then, from a licensing perspective, is there any conflation with existing trail and transit information, and how do you deal with that?
>> Jereme: So the first question I heard was basically: are there any technical reasons for not using the OpenStreetMap stack. And just to clarify, do you mean using OpenStreetMap as a platform, or using OpenStreetMap software?
AUDIENCE MEMBER: Just using OpenStreetMap proper, to store the data.
>> Jereme: Cool, and then the second question was -- what was that question? Oh, conflation -- can you just repeat that?
AUDIENCE MEMBER: So, for example, did you import any existing trail information from OpenStreetMap, and are you sending any information back and forth between the two datasets?
>> Jereme: The answer to that is no, we are not pulling data over from OSM, because there are issues with that for a lot of the customers we work with. The first part, about the stack, is kind of the same thing. We wanted to create another location that was just potentially -- well, OpenTrails isn't really a location. OpenTrails is just a spec that gives agencies the ability to publish this data in a certain way, but the key is that that way was designed specifically to potentially get closer to OSM in the cases where that's going to be possible. I was kind of balancing this: for our software, OuterSpatial, the platform, we're doing a bunch of other stuff in addition to the raw locations and the relationships, so we wanted to have a place for all that stuff without cramming it into OSM, or figuring out how to get it into OSM, potentially. That's kind of why we did that. But like I said, we're more than happy to try and facilitate that, use it as much as we can, and figure out creative ways to do that as much as possible. So --
>> Drew: Yeah, so in terms of the Transitland project, we have a loose coupling with OpenStreetMap data. We do conflate with OSM: in our data service, our records will say the closest OSM way to a stop, and in the future we'll also do that matchup if there is a stop on the OSM side. The point isn't to duplicate data but to allow cross-talk back and forth between services. And the reason for building up Transitland as a slightly separate data service is that to really capture transit data properly, you need to be able to represent both temporal data and the network of a transit system, and the OpenStreetMap data model doesn't provide everything that's needed there. Part of what Transitland serves as is a data source for routing engines that are going to route people on transit networks, and the OpenStreetMap model wouldn't give you everything you need to represent a transit schedule and a network in a way that would be useful to a routing engine.
>> Jereme: We at OpenTrails are actually really excited about some of the work that Transitland does. We got really excited when we saw the first pieces of Transitland coming out, and some of the ways they're tackling it, because I think there's a lot of applicability to a lot of different kinds of government data that are basically the same thing: some geographies, metadata, maybe some schedules. We're really excited to support that stuff and see where it goes, because even if just pieces of it work out, it would be good for a lot of domains.
>> Drew: Other questions from the audience?
>> Drew: All right, then I will pull a question back out of my head. If you were to draw a Venn diagram with one circle for OSM users and the OSM community, and another circle for your own project and its users and contributors, what would the overlap be between those two circles?
>> Jan Erik: So I can answer that from my side. About 25 to 30% of our users have some sort of overlap with OpenStreetMap, as far as we can measure.
>> Diana: It's hard to say right now because the data is new, but I assume people will find the data because they know of OSM. Once it gets more popular, I can see people who wouldn't be able to wrap their heads around the planet file using this data for their projects, because it's in a more approachable format. So I could see that being a relatively large percentage once the dataset is filled out.
>> Ian: The overlap directly is probably not very big -- oh, sorry, with OpenAddresses -- but in the future, I think, and probably right now, the users of OpenAddresses data are probably folks who are looking for this sort of metadata in OpenStreetMap and not finding what they're looking for, and I imagine if they're going as far as to find OpenAddresses, they're probably somebody who would be really excited about seeing that data in OpenStreetMap, in OpenAddresses, in some website somewhere. So I think our users in OpenAddresses are pretty technical people who are pretty excited about data, and I think that's a pretty good chunk of the people in OpenStreetMap as well, so I think those are very similar people.
>> Jereme: So I asked this earlier: how many people here like to go outside and do edits on OSM? Raise your hand. Right? So that's basically the Venn diagram for us. My sense is that there's a lot of overlap -- I don't know what the exact numbers are, but parks and recreation just seems inherently intertwined with what's going on in OSM, so I hope there's more all the time.
>> Drew: I'll borrow Jereme's technique and say: how many of you use public transit? I would say we're also in a situation with Transitland of having a decent amount of overlap in terms of audience with OpenStreetMap. At the same time, transit data has its own unique history. About ten years ago, the Portland-area transit agency worked together with Google to create a spec called GTFS for transit data, and kind of laid down some tracks pretty early on in the open data world of standardizing around a spec for sharing transit data. So it's been about ten years of this transit data community coming together, a bit in parallel with OSM. So definitely overlaps, but they have their own mailing lists, their own get-togethers, and hopefully with Transitland we'll be able to draw some more connections between those two communities. A question in the back?
AUDIENCE MEMBER: Portland, Oregon, I guess. I have a question about Transitland, and also about OpenAddresses. Now, the OpenTripPlanner uses three separate datasets: it uses OpenStreetMap, it uses the national -- dataset, and it uses GTFS, which has all the transit information. It does this to create a network, like you said, because OpenStreetMap by itself does not function as one. So I'm interested in whether that's what you're talking about with Transitland, or whether you're talking about a more permanent integration of the transit data into OSM.
>> Drew: Sure, OK. So the question is about -- and please correct me if I'm not paraphrasing well -- the OpenTripPlanner, a nicely well-established piece of open-source software that can do multimodal routing. It draws in data from OpenStreetMap to do the pedestrian component, it draws in GTFS feeds from agencies, and it also draws in elevation datasets, brings all of this together, and then lets users plan a journey against that. And the question is: is Transitland trying to do a similar type of integration on kind of an ad hoc basis, or in a more permanent way? I'll just say we're really attempting to split the difference here: doing that type of integration work in a way that can be shared across projects and across developers, so that it's possible to, say, move from the IDs in one dataset to another easily, but not unnecessarily duplicate data. We're not trying to bring GTFS feeds into OpenStreetMap proper, where they would not be represented fully and would get out of date; likewise, we are also not trying to bring OpenStreetMap data directly into a transit feed or another data source, where it also would get out of date and would raise licensing issues. So I'd say in this respect Transitland is part of a pattern I'm hearing from a lot of these projects up here: looking for ways to bring data together from OpenStreetMap, from some authoritative government source, or directly from other user communities using different tools and different mobile apps, and to bring those data together in a loosely coupled way -- a way that allows some new applications while maybe not creating new silos of data platforms. And with that, we just got the end sign. So I'd like to thank all the panelists -- please thank them as well. I appreciate their time here, and thank you all for coming.
[applause]
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Quality Route Guidance
Duane Gearhart
Saturday, June 6
DUANE: I am dedicated to improving maps for the open community, and this passion has led me to the OpenStreetMap team. Today I will be discussing quality route guidance. Mapzen started a few years ago, from a handful of visionaries here in New York City, as a mobile-focused mapping company. The Mapzen team works to prove the hypothesis that open maps are the future of maps. Mapzen is founded on two principles: build on open-source software and data, and start where you are. Applying this philosophy of starting where you are, you overcome the shortcomings of the past and present, but you still look to the future. Now I want to talk about the Valhalla project. Taken from Norse mythology, Valhalla is a great hall for valiant warriors. So our warriors have strategically set out to mount an attack on route guidance. Here are the Twitter handles of our warriors; we're all here today. Please check out our table -- we would love to talk about Valhalla with you. And please find us on GitHub so you can learn more about the Valhalla organization and what we've developed in the last few months. Valhalla's name was inspired by Norse mythology, and the key components of this project also have names from Norse mythology. Loki, a play on words, is the module that finds a location on the graph. THOR stands for tiled hierarchical open routing; using tiled routing allows us to have a smaller memory footprint and thus the possibility of offline routing, and dynamic costing can help with alternate routes. ODIN stands for open directions and improved narrative -- we'll get to this in a bit. TYR, which stands for "take your route," is our open service component. Now I want to take a high-level look at how information flows within our service. Requests come in to our server, and then the HTTP request is validated and sent on to Loki. Loki takes the requested location and associates it with a graph edge.
That information is passed on to THOR, where the trip path is actually calculated. THOR sends that to Odin, and Odin transforms the trip path into trip directions -- we'll get a little more into that later. Trip directions are forwarded on to Tyr, where the response is formatted and then sent back to the user.
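The request flow just described can be sketched as a simple pipeline. This is an illustrative sketch only: the stage names mirror Valhalla's modules, but the function signatures and data shapes here are invented for clarity.

```python
# Illustrative sketch of the request flow: the stage names mirror Valhalla's
# modules, but the signatures and data shapes here are invented for clarity.

def loki(request):
    """Associate the requested locations with edges in the routing graph."""
    return {"origin_edge": 101, "dest_edge": 202}

def thor(correlated):
    """Compute the trip path over the tiled, hierarchical graph."""
    return {"path": [correlated["origin_edge"], 150, correlated["dest_edge"]]}

def odin(trip_path):
    """Transform the trip path into trip directions (maneuvers plus narrative)."""
    return [{"instruction": "Drive northeast.", "edges": trip_path["path"]}]

def tyr(directions):
    """Format the response to send back to the user."""
    return {"maneuvers": directions}

def handle(request):
    # validate the request, then pass it through each stage in order
    return tyr(odin(thor(loki(request))))

print(handle({"locations": [(40.75, -73.99), (40.66, -73.94)]}))
```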
So this is Odin, the one-eyed god. He's focused on improving route guidance. The outline for the Odin discussion: we'll define some terms, talk about the process of going from trip path to trip directions, step through some examples, touch on some testing methods, and then discuss the future of Odin. At the bottom there is our GitHub link so you can check out Odin. First, the goals. First and foremost, generate quality route guidance; then, transform path information into directions. This is more than simple turns with road names. The improved narrative shall be succinct, useful, and easy to follow. We'll have collapsed maneuvers and simplified transitions at complex intersections. Exit and directional information on highways removes ambiguity at these key decision points. It's tailored to different locations and languages, so extensibility and community are key. And all of this lends itself to a calm, confident user.
I'd like to define some routing terms that I'll reference in the examples. The first one: doubly-digitized roads. A doubly-digitized road is a two-way road represented as two separate edges, and in this example, Jonestown Road is a doubly-digitized road. An internal intersection edge occurs at the intersection of one or more doubly-digitized roads, and you can see that four internal intersection edges are highlighted here. Typically these edges are marked in commercial data; Valhalla derives these based on the intersection attributes. Turn channels: these are at-grade turn lanes, typically marked in commercial datasets. Valhalla derives these based on the link tag attribution. And common base names. This comes up when we're trying to combine two edges, or combine two maneuvers: we're comparing two lists of street names, and we're looking for a common base name -- for example, one that ignores the prefix and suffix directionals. So for example, here in New York City, if you're on West 26th Street and you cross 5th Avenue, you go onto East 26th Street. By concentrating on the common base name of 26th Street, we combine those, and we don't call anything out. Another example: East Chocolate Avenue and U.S. 422 -- we find the common base name and we combine. We'll have other examples, and I'll illustrate this further. So, the actual process: Odin receives the trip path from THOR, which is a list of nodes and edges and shape.
So we call our maneuver builder, and it will start walking the edges, deciding where to break the maneuvers. It will be looking at the common base street names between the edges. It will be looking at the geometry of the edges on the trip path, and also the edges that intersect with the trip path, to determine if we should call something out, or how we should call something out. We're also inspecting other attributes -- internal intersections, turn channels, is it a ferry? is it a roundabout? -- all of that will come into play for our maneuvers. Once we have our initial maneuver list created, then we're going to combine maneuvers to further simplify. An example of this would be: if there's a turn-channel maneuver, we want to collapse that with the subsequent maneuver to have an easy call-out of "turn right onto Main Street." After we have the final list of maneuvers, then we generate a narrative, which we assign to each maneuver, and each maneuver is passed on to Tyr. Our first example here is a simple left turn. Looking at the map on the left, you can see the highlighted edges' names as noted in the data. If we were going from the start pin to the end pin, it would call out: go northeast on Broken Land Parkway, turn left onto Snowden River Parkway, continue west onto Patuxent Woods Drive. However, checking out the picture on the right, you can see that Snowden River Parkway is just the cross street on the left. So we have Snowden River Parkway as an internal intersection edge, and in our combine logic we combine that edge with the subsequent maneuver to get the simplified maneuver: turn left onto Patuxent Woods Drive. A similar example here for a doubly-digitized U-turn. Going by the map on the left, from the start pin to the end pin, it looks like we go northeast on Georgetown, and then make a left onto Devonshire Road. But noticing the picture on the right, that's not really what the user would expect.
You can see that the opposing lanes of Jonestown Road really aren't as separated as they appear on the map; you're really just making a U-turn. So again, using the Devonshire link as an internal intersection edge, we make the simplified call-out: make a U-turn onto Jonestown Road. Now a street name example. This one is from Harrisburg, Pennsylvania; in the real world, it's U.S. 322 West. What I did was turn off the common base street name logic and instead use a simple equals to compare the two street name lists, and this is the result on the top right: there are 12 maneuvers for road name changes over that 77-mile stretch of road -- street name changes at 0.1, 0.2, and 0.7 miles in. Clearly not what we want to do. With the common base street name logic turned on, we collapse those 12 maneuvers down to one maneuver.
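The common base name matching described above can be sketched roughly like this. This is a hypothetical illustration: the directional list and function names are invented, not Valhalla's actual code.

```python
# Hypothetical sketch of "common base name" matching: strip directional
# prefixes/suffixes, then compare what remains. All names are invented;
# this is not Valhalla's actual code.

DIRECTIONALS = {"north", "south", "east", "west"}

def base_name(street):
    words = street.lower().split()
    if words and words[0] in DIRECTIONALS:   # drop a prefix directional
        words = words[1:]
    if words and words[-1] in DIRECTIONALS:  # drop a suffix directional
        words = words[:-1]
    return " ".join(words)

def common_base_names(names_a, names_b):
    """Base names shared by two street-name lists; an empty set means call it out."""
    return {base_name(a) for a in names_a} & {base_name(b) for b in names_b}

# West 26th Street and East 26th Street share the base name "26th street",
# so the two maneuvers can be combined without a call-out.
print(common_base_names(["West 26th Street"], ["East 26th Street"]))
```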
This is a directional exit info example. In this one, the user is on Interstate 83 North, and the interstate actually comes to an end at a fork in the road. Looking at the picture on the right, you can see that the user has to go to the left for Interstate 81 South, or right for Interstate 81 North. At the point where Interstate 83 ends, if we gave the instruction "continue onto Interstate 81" with no highway directional, the user would freak out, because they would have no idea which way to go here. So looking at narrative line two, we call out the proper exit number, 51B, with the proper relative direction (on the right), the branch road that's going onto Interstate 81 North, and additional toward information: I-78, Hazleton/Allentown. We pull the highway directional from the relation and use that in the graph to make sure that key piece of information is supplied to the user.
So this is definitely a case where we want to keep the user calm. Sometimes an A/B exit departs the highway at one location and then, farther down the ramp, it splits at a fork -- A going one way and B going another. There's usually exit sign information where both A and B depart the highway, and more exit sign information where the fork is in the ramp; you can see that from the photograph. If we provided all of it to the user, as the base narrative is showing, that would clearly be information overload. So as Odin is walking the edges, it uses the sign information at consecutive decision points to rank, sort, and collapse the information, and you can see in the improved narrative that it removes all the clutter and keeps the user focused on the path they need to traverse.
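One simple way to rank and collapse sign information across consecutive decision points, in the spirit of what Duane describes: signs that keep appearing along the chosen path rank highest, and the rest are dropped. This is a guessed illustration, not Valhalla's actual algorithm; the sign lists are invented.

```python
from collections import Counter

# Guessed illustration of ranking exit-sign text across consecutive decision
# points: signs repeated along the chosen path rank highest; the rest are
# dropped to avoid information overload. Not Valhalla's actual code.

def rank_signs(sign_sets, keep=2):
    """Keep the most frequently repeated signs across consecutive sign sets."""
    counts = Counter()
    for signs in sign_sets:
        counts.update(signs)
    return [sign for sign, _ in counts.most_common(keep)]

# Signs where the A/B exit leaves the highway, then where the ramp forks:
at_exit = ["I-81 North", "I-78", "Hazleton", "Allentown", "I-81 South"]
at_fork = ["I-81 North", "I-78", "Hazleton"]
print(rank_signs([at_exit, at_fork]))  # the clutter is dropped
```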
This is an example of narratives using different tagging methods. There are multiple methods to supply exit information in OpenStreetMap. In the left box there's the node-and-way combination method: for that one, we would pull from the ref tag of the node, and then the destination tags of the way, to get the exit narrative at the bottom. And in the box on the right is the node-only technique: we pull from the exit_to tag and the ref tag to form the exit narrative at the bottom. So whether it's the node-and-way method or the node-only method, we pull the information into one of the exit narratives. There's a wiki link at the bottom if you want to see more about exit info tagging.
Testing methods could be a whole 20-minute discussion on their own, so I'm just going to skim the surface here. We use unit testing -- great for specific parts of the code. But you can sit around all day thinking up test cases for driving directions, and then on real-world paths you run the routes and find the few issues that you never thought of. So we use RAD testing, which stands for "real-world analyzed delta." We create test files based on real-world routes that you're familiar with. Then we run the test file, analyze the output, and form a baseline. Any time the software or the data changes, we rerun it and compare the results to the baseline results; if everything's fine, we start a new baseline. If there are issues, typically the problems are not unique to the test cases -- usually there's an underlying pattern, and that's what we try to focus on when we're fixing the problems that we find: we make sure that the scope is right, that it's not just this specific test case but a general pattern, and once we break it down, we'll insert those into our unit test cases. Usually we advocate automated tests for everything, but this is probably the one time when we don't, for two reasons: it's very difficult to implement effectively for this RAD testing, and also there's a positive side effect that the developer actually understands the problem space better and thus provides a better solution.
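The RAD flow amounts to baseline-diff testing: run known routes, store the output as a baseline, and on every software or data change rerun and report only what differs. A minimal sketch, assuming a routing function that returns some serializable result -- all names here are invented:

```python
import json
import pathlib
import tempfile

# Hypothetical sketch of RAD ("real-world analyzed delta") testing.
# All function and file names are invented for illustration.

def run_routes(route_fn, requests):
    """Run each named request through the router and collect results."""
    return {name: route_fn(req) for name, req in requests.items()}

def save_baseline(results, path):
    pathlib.Path(path).write_text(json.dumps(results, sort_keys=True))

def deltas(results, path):
    """Names of routes whose output changed since the stored baseline."""
    baseline = json.loads(pathlib.Path(path).read_text())
    return sorted(name for name in results if results.get(name) != baseline.get(name))

# Establish a baseline with a stand-in routing function...
requests = {"commute": ("A", "B"), "errand": ("C", "D")}
baseline_path = pathlib.Path(tempfile.mkdtemp()) / "baseline.json"
save_baseline(run_routes(lambda req: f"route{req}", requests), baseline_path)

# ...then, after a software or data change, rerun and inspect only the deltas.
changed = run_routes(lambda req: "NEW" if req[0] == "C" else f"route{req}", requests)
print(deltas(changed, baseline_path))  # only "errand" changed
```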
Looking at the future of Odin: we want to improve the mobile experience by returning verbal strings that would help with the state abbreviations we send back. For "Merge onto PA-283," the mobile app could call it out and say "Merge onto Pennsylvania 283" -- we want it to be able to say "Pennsylvania." The same goes for road prefixes and names in different countries. We also want to handle street name inconsistencies. There could be a case where you're on road name A, and it says you want to go onto road name B for a little bit and then continue onto road name A again. This is a signature that we want to find and collapse in the near term for the user, but hopefully, for the long term, we can put those onto issues on MapRoulette. We want to support additional languages. And we want to do landmark routing. For the U.S., this could be, "Turn right on Main Street; Starbucks is on the right." In other locations around the world, they use landmarks or reference points for orientation, so landmarks could be key in those locations.
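The verbal-string idea can be illustrated with a small abbreviation-expansion pass. This is a hypothetical sketch; the expansion table and pattern are invented, not Odin's actual rules.

```python
import re

# Hypothetical sketch of expanding abbreviations for a spoken ("verbal")
# instruction string; the expansion table and pattern are invented.
EXPANSIONS = {"PA": "Pennsylvania", "I": "Interstate", "US": "U.S."}

def verbal(instruction):
    """Rewrite 'XX-123' tokens in spoken form, e.g. PA-283 -> Pennsylvania 283."""
    def spoken(match):
        abbrev, number = match.group(1), match.group(2)
        return f"{EXPANSIONS.get(abbrev, abbrev)} {number}"
    return re.sub(r"\b([A-Z]+)-(\d+)", spoken, instruction)

print(verbal("Merge onto PA-283."))  # -> Merge onto Pennsylvania 283.
```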
Again, thank you, everyone, for coming to the talk. Start where you are, and continue to improve: with Valhalla, Loki, THOR, Odin, and Tyr, we're working to improve open routing. If you have any questions about our routing, the route team is going to be out at the table for the next two hours. For right now, does anybody have any questions about quality route guidance?
[ Applause ]
AUDIENCE MEMBER: Very interesting stuff that you're doing there, and very much needed for OpenStreetMap to have directions. I'm intrigued to know: to what extent are a route's directions preprocessed, and to what extent are they post-processed once you've got the actual route out of the database? Is it preprocessed before you do anything?
DUANE: We do a data conversion, and when the data conversion happens, we mark things like the internal intersections. But it's just the attributes coming through into Odin, and we transform the trip path into the trip directions.
AUDIENCE MEMBER: Okay. One example that I was particularly interested in was when you were talking about the common base street names and showed something that says, you know, continue on U.S. 322. Given that the first part of your route was on U.S. 322 as well, if halfway through the route you swerved onto another road and then back onto U.S. 322, would it say carry on from the start?
DUANE: So, no, that's not preprocessed.
AUDIENCE MEMBER: Oh, excellent.
DUANE: Yeah, thanks.
AUDIENCE MEMBER: Hi. Have you faced any issues using the OSM data where intersections are not joined at nodes? The big ones -- on the routing side you might have found some issues with the data itself. Do you have anything in your testing to recognize that?
DUANE: Yeah, we were under a pretty tight schedule to get to where we are right now, but absolutely, we're always finding issues with the data, and we're building more and more test cases for coverage. We're hopefully going to track that and get more statistics on it, but we're always going to be finding issues with the data.
AUDIENCE MEMBER: Are there any... Android apps... that we can try this with?
DUANE: Not yet.
AUDIENCE MEMBER: Just wait for it.
AUDIENCE MEMBER: I think it's the closest I think you have on Android, right? Open the Mapzen?
DUANE: Correct, you have to use OS --
AUDIENCE MEMBER: The closest thing -- not probably...
DUANE: Any more questions? We are going to be out at the table for the next two hours if you have any questions. Please stop by. Thank you.
[ Applause ]
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Panel: Vector Rendering.
Panelists: Matt Blair, Konstantin Kafer, Steve Gifford, Hannes Janetzek, Moderator: Mike Migurski
Saturday, June 6, 2015
Mike: Knock-knock. Oh, good.
Hi, everybody. Welcome to the vector rendering panel. This is I think the last panel of the day, so after this will be the beer-drinking panel.
AUDIENCE: Woohoo!!
>> I'm up here with four experts in vector rendering. I've got Matt Blair from Mapzen; Konstantin Kafer; Steve Gifford on the far side there, working on the WhirlyGlobe and Maply rendering library; and Hannes from OpenScienceMap. My name is Mike Migurski. I used to do maps; now I do panels. We're going to talk about a number of different topics, including server- and client-side processing, vector data tiles, and rendering technologies -- kind of all of the stuff that has to do with getting data out of a database and onto a screen or a desktop. I'm going to do a round of quick introductions; I'm going to ask each panelist to say a little bit more about themselves with the few pictures that we have up on the big screen, and then I've asked each panelist to define vector rendering, because it seems like a useful thing to define. So I'm going to start with Matt Blair. He works on the Tangram project at Mapzen. Mapzen is a company here in New York with offices in San Francisco and a few folks scattered around the globe. They produce beautiful, interactive, sometimes three-dimensional, and often very cool-looking renderings of OpenStreetMap. Matt, would you like to tell us more about the screenshots here?
>> Yeah, thanks, Mike. So as Mike said, Tangram is based on OpenGL -- in the form of WebGL in browsers and OpenGL ES on mobile devices -- and what you're looking at now is a pretty conventional view of 3D buildings in New York. This is Tangram ES; we call it the mobile-device-centered version. And what you're looking at here is the WebGL version. They're very similar in architecture and style sheets, but this is a more artistically oriented map. As you can see, it's geared towards really allowing you to be expressive in your map design, more than anything we've seen before as far as I can tell, and this is sort of what vector rendering means to me: I would say it's freedom. Vector rendering is about getting your data in a form that you can transform yourself in any way you can imagine. I think there's one more shot, is there?
>> Oh -- I think that's for Hannes, actually. Including things like this: this is visualizing OpenStreetMap data in ways that you normally would not think about, but that are entirely possible when you get your data as vectors. So that's what I like about vector rendering.
Hey. I'm Konstantin. I work on MapBox GL, and MapBox GL is a mobile SDK that you can embed in your iOS application and your Android application to render beautiful maps. It's based on the MapBox vector tiles, which are based on OpenStreetMap, and what you can do with MapBox GL as well is create completely custom styles. You can base your work on one of the predefined styles and improve it, or you can create a new style from scratch, and creating a style means you can define anything, including the font -- you can use custom fonts to render maps, you can use any sort of colors, you can add any of the data that you want. We also have a mobile analytics dashboard: when you add MapBox GL to your iOS app or Android app, you can view these analytics and see where people are using your map and how they're interacting with it. You can get lots of statistics -- operating systems, screen sizes, where people are looking -- which just helps you to understand the users of your app a lot better. And finally we have this shot that shows you how you can smoothly animate MapBox GL from one place to another. It has seamless zooming, obviously, so you can show any zoom level you want -- you don't have any predefined zoom levels -- you can rotate the map as well, and you see the labels popping in when you get closer.
>> And what does vector rendering meaning to you?
>> Vector rendering means rendering the data directly on the device and it allows you to change the view in response to the actions of the user so you don't have to go back to the server to render a new image, but rather you can adapt the view, you can add your own data locally without having to submit it to the server, and you can add and remove it, you can change the style instantly without having to re-render the map on the server, so it gives you a lot more flexibility in terms of the design and the interaction with the map.
>> Mike: Thank you. Steve?
Steve: Yeah, so my toolkit is a little different in that it's not tied to any specific data format; it does a lot of different ones. So yeah, go on to the next slide. It's known for doing raster and vector data, so this first slide is from an app called Dark Sky, a very popular weather app on iOS devices. On to the next one: this example is National Geographic World Atlas, and this is an interesting one because it uses a hybrid of image and vector data together to get those fancy backgrounds but very sharp symbols and text and things like that -- it's just the standard MapBox data format, even if the data contained within it is not OpenStreetMap. And this one is an OpenStreetMap example. Again, it's vector data, but in this case it's being rendered on device and the routing is being done on device -- that's my toolkit interfacing to another toolkit, which is doing the routing and such. So two of those examples are using vectors in kind of interesting ways. In terms of what vector rendering means to me, it means sort of half-digested data. These renderers, whether they're based on JavaScript or they're native renderers for mobile devices, can't process an entire planet file from OpenStreetMap -- they would just choke on that -- so you've got to chop it into little bits and hand it to them in a way that they can process. It's not like displaying image tiles, where all the selections are already made; you're kind of predigesting the data and handing it to them in that form, and they can do interesting things with it. You can't do quite as much as you could if you got the whole planet file, but you can do way more than if you just got image files.
Hannes: Our project is slightly different. It started as a university project to do studies, also in the cognitive science group, on user interaction with maps, and for this we required continuous zooming and fast rendering, which at that time was just not available. If the options that were just mentioned had been available, we probably would have chosen one of them. So our renderer is an OpenGL vector tile renderer, primarily for Android. We started with Mapsforge, a client-side bitmap tile renderer for Android, which is quite popular and which already does styling of OpenStreetMap tags and preprocessed geometry, and then the part that we developed was a tile server which sends basically precomputed and simplified OpenStreetMap geometries to the client. Here you can see our map application, showing the university building in Bremen, where we developed this. We did some research with it, implemented a few papers, and some students in the project wrote their master's theses based on it, in this field. But it also became popular with people who really wanted to make map applications -- for example, Mapzen's map prototype chose our renderer, and many small companies asked how to integrate it, so it was quite amazing how this project evolved in this regard.
>> Mike: Yeah, and how would you describe vector rendering for yourself?
Hannes: So client-side vector rendering is the option to style the map interactively, and to choose which features you want to show or highlight, and to get more information about a particular feature, instead of making the round trip through the server. And of course there's the smooth interaction of scaling and rotation, which is not possible with precomputed map tiles, or where the vector rendering is done on the server.
>> Cool.
>> Mike: Thank you so one of the themes that I've been hearing in kind of what everybody's description of what vector rendering is for them, is there's this theme of moving things to the client and I think Matt, you used the term freedom in your description. I think it's pretty interesting, because I'm hearing a little bit of that in Steve's description, as well, this kind of developer choice. Matt, can you talk a bit more about that freedom and that developer choice?
>> Matt: Yeah, I'd keep it a pretty high-level description of what that means. The nuts and bolts of it is that, in this context, when we talk about vector rendering, what you get from a server that's not something you own -- what you get on your device -- is not a pre-drawn map, and it's not even instructions on how to draw a map; it's just things that you can draw, and how you draw them is up to you. That's the difference between client-side vector rendering and what's been done before, in terms of what we've called raster renderers, or bitmaps, you might say. So to me it's a way of democratizing style. It's a lot easier to design something on the client side, get immediate feedback, and experiment that way, and that's one of the cooler things that I see about it.
Mike: Konstantin, it seems like performance and rendering speed are very important to you. I know from reading a lot of your MapBox blog posts over the years that you have focused on things like correct line rendering and glyph sets and just generally high-performance situations. That seems to be something really important on mobile, as well -- can you talk a little bit about the implications for performance and speed that you've seen from vector rendering?
Konstantin: Basically, the need for performance arises because nobody wants to interact with a map that is slow and stuttering, so the need for speed is actually mostly driven by the user. Just using raster images on the client is very, very fast, because mobile devices are super good at displaying textures; the difficult part is when you try to render a map dynamically for every frame. You have to render everything from scratch, basically, for every frame, because when you rotate the map, you cannot reuse any of the previous frame's images -- the rotation is just different. So when we started working on MapBox GL, one of our main priorities was to have very high visual quality, comparable to software-based rendering like rendering with Mapnik, so we did a lot of research into ways of rendering lines that look as good as possible. That need arises from the fact that OpenGL is not focused on rendering high-visual-quality images. For example, OpenGL allows you to draw lines, but these lines are very, very limited: you don't have any features in OpenGL that allow you to draw round joins when the line makes a turn, and the antialiasing is very device- and implementation-dependent, so you can't really rely on those features. Basically, what you have to do for rendering lines in OpenGL is convert them to polygons and then use a shader that creates the pixels in the polygons in the correct way. Same thing for text rendering: there's no facility for rendering text in OpenGL at all, so you have to create everything from scratch. You have to load your own glyphs, place them in the correct position, and make sure they look nice. We use a technique called signed distance fields, which encodes the distance from the font outline into a texture, because rendering the actual vector outlines of fonts doesn't really work in OpenGL -- it only supports drawing triangles, not curved objects.
These are just two of the examples that we focused on when making MapBox GL and the reason why we focused on that is because we wanted to achieve a very high visual quality to just get nice-looking maps.
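The signed-distance-field approach can be illustrated in miniature: the texture stores distance to the glyph outline, and the shader thresholds that distance smoothly per fragment. Below is a plain-Python sketch of that thresholding step; in the real renderer this runs in a GLSL fragment shader, and the parameter values here are illustrative.

```python
# Plain-Python sketch of signed-distance-field thresholding. A texel value
# of 0.5 sits exactly on the glyph outline; larger values are inside.
# The buffer/gamma parameter values are illustrative.

def smoothstep(edge0, edge1, x):
    """GLSL-style smooth Hermite interpolation between 0 and 1."""
    t = max(0.0, min(1.0, (x - edge0) / (edge1 - edge0)))
    return t * t * (3.0 - 2.0 * t)

def glyph_alpha(distance, buffer=0.5, gamma=0.1):
    """Map a sampled SDF value (0..1) to an opacity with a soft edge."""
    return smoothstep(buffer - gamma, buffer + gamma, distance)

print(glyph_alpha(0.9))   # 1.0 -- well inside the glyph: opaque
print(glyph_alpha(0.1))   # 0.0 -- well outside: transparent
print(glyph_alpha(0.52))  # between 0 and 1 -- antialiased edge near the outline
```

Because the threshold is applied at sample time, the same distance texture stays crisp at any scale, which is why the technique suits maps with continuous zooming.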
>> Mike: Steve, I think your project is one of the ones that most takes advantage of the full interactivity and the curvature of the earth -- you really do the zoomed-out view, in contrast to a lot of the others. I also know from your own personal history that you've been doing this for many, many years and have a deep background in computer graphics and rendering technologies. It seems, from what Konstantin has been saying, that you have to be up on the academic papers in order to take advantage of this. Can you talk about your own journey working through this stuff and how you got to where you are, with an eye towards computer graphics as a technique, as compared to just pixels on a screen?
Steve: Oh, yeah, I've been doing OpenGL back before it was OpenGL, so I've followed it through the years, and it's actually gotten a lot simpler on mobile devices: OpenGL got very bloated and complicated for a while, and then they chopped most of that out for the embedded-systems version, so they started with a clean slate. It can still be very confusing, but at least it's simpler than it used to be. I used to think rendering was really the main issue -- actually getting everything set up for OpenGL and feeding it through the pipeline on these devices -- but as I've been working on the toolkit now for more than four years, and as customers come to me with different requirements, I'm finding that it's really the data fetching and the data management that's the hard part. For example, if you're drawing vector tiles, it's really getting them all together in a coherent fashion, getting the lines to look right, building up the texture atlases for the font glyphs and stuff like that, that tends to dominate more than the actual rendering -- just getting all the data together. And I have to handle more of that on the toolkit side than maybe MapBox or Mapzen or anybody else does, because the data is often coming to me in a less predefined way. It's really nice to have the MapBox vector tiles come down the pipe, because they're well defined and you can anticipate what you're going to get and reuse resources. When you get something less well defined, you have to put it all together yourself, so you build up data structures for doing that, and if customers have data that's changing -- like they want to display 5,000 plane positions moving in real time -- then you have to structure that in different ways. So I guess what I've found is that it's really the data organization that's the tricky part.
Which you know, you can do on the server side, makes it a little bit easier but when you have to do more of it on the client side, it gets tricky. It gets very, very tricky.
>> Mike: So you've just mentioned data and data formats, and it seems like the way the data gets to the client in order to do this client-side rendering forms a very important part of the pipeline. We've seen a lot of this in the OpenStreetMap ecosystem in the last several years, from different providers, and we're starting to see a lot of protocol-buffer-based binary formats now. Hannes, I think you actually developed a format using protocol buffers, drawing on ideas from different sources. Can you talk about that a little bit?
>> Hannes: Sure. As I said, back then, when we started, there was no vector tile rendering for mobile devices, as far as I know. And we looked into the different approaches. We first wanted to start with GeoJSON, and then I found ... did the rendering in the HTML canvas, and they slightly modified the GeoJSON format to preprocess the data into the Mercator projection and scale it into a fixed tile-relative coordinate system. We took this idea, as it was beneficial for compressing the data: when you have a fixed grid, you can just encode the difference between the points. And then I found that the protocol buffers format has a good encoding for small signed numbers, so we picked protocol buffers, also because it seemed good for prototyping -- it gives you tools to quickly create the data and experiment with it, and if you need faster parsing, you can write quite simple parsers.
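The delta-plus-zigzag idea Hannes describes can be shown in miniature: tile-relative coordinates are stored as deltas from the previous point, then zigzag-encoded so small signed numbers become small unsigned numbers, which protocol buffers can pack into short varints (this sketch is illustrative, not OpenScienceMap's actual encoder).

```python
# Illustrative sketch of delta + zigzag encoding for tile-relative points.
# Protocol buffers' sint types apply the same zigzag mapping before varint
# encoding; this is not OpenScienceMap's actual encoder.

def zigzag(n):
    """Map signed to unsigned: 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ..."""
    return (n << 1) ^ (n >> 63)

def delta_encode(points):
    """Encode (x, y) points as zigzagged deltas from the previous point."""
    out, px, py = [], 0, 0
    for x, y in points:
        out.extend([zigzag(x - px), zigzag(y - py)])
        px, py = x, y
    return out

points = [(100, 100), (103, 98), (105, 98)]
print(delta_encode(points))  # [200, 200, 6, 3, 4, 0] -- small even for the negative delta
```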
Matt: I had something to contribute on the data-format aspect of this rendering stuff. One of the consequences of getting vector data on your client, as opposed to getting an image or something closer to an image, is that this data is not particularly well organized for drawing. The fact that it's not telling you how to draw something means that you often need to look at all of the data to figure out how to draw it -- you can't just look at part of it and draw part of it. You can sometimes do that. But contrast vector tile formats with something like JPEG data: JPEGs are images, and you'll probably be familiar with this phenomenon from the 90s, with 56k internet -- you'd get a partially loaded image, so you were looking at the data progressively, and that's often useful. You can do that with a JPEG, but vector tile data is not organized in that way. It might be useful if you could come up with something where the less necessary details can come in later; that would be a real boon to client-side rendering.
>> Mike: So it seems like between the cartographic intent and the rendering, there are a lot of concerns you guys are raising that aren't really graphics concerns. Do I actually have two minutes left? OK, I'm getting the two-minute marker. I was hoping that you guys could wind up by talking about what part you thought would be easy that turned out not to be easy. Maybe start with you, Matt, and move in this direction.
>> Matt: Text. That's easy: text rendering is really a big issue. As Konstantin pointed out, rendering fonts the way that an operating system might, where you actually render the entire vector outline of each letter, is just not at all practical in OpenGL. You'd have to generate tons of geometry for each individual letter, and you have hundreds of thousands of these things in a given view, so it's not at all feasible with current technology. And it's further complicated by international, Unicode sets of characters -- Arabic characters, Hindi characters, Chinese characters -- these are even more difficult because they can be contextually dependent: the glyphs that you need to draw depend on the glyphs that are next to them. This is called text shaping, and it's a huge, huge field, so we've had a lot of fun getting text shaping to work correctly in Tangram, and we're still not all the way there. So that was harder than I expected.
>> Konstantin: All of what Matt just said is exactly true. In addition to that, I guess the second-hardest thing we've encountered is label placement, which is different from actual text rendering. The reason why text labeling is so complicated is that there's no right solution. There's literally an unlimited number of ways you can label a map, and how you want to label a map really depends on the context. For some use cases, certain types of labels may be more important than other labels. For example, if you want to catch a metro, the labels for the metro station may be a lot more important than the labels for a coffee shop. And I guess vector rendering on the mobile device helps us in that case, because we can dynamically change what's displayed on the map.
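Konstantin's metro-versus-coffee-shop example can be made concrete with a toy priority scheme. The label kinds, contexts, and weights below are invented purely for illustration of the dynamic-reranking idea:

```python
labels = [
    {"name": "Union Square Station", "kind": "metro"},
    {"name": "Joe's Coffee", "kind": "cafe"},
]

def pick_labels(labels, context, limit=1):
    """Rank candidate labels by a context-dependent weight and keep
    the top `limit`. Weights here are made-up illustrative values."""
    weight = {"transit": {"metro": 10, "cafe": 1},
              "leisure": {"metro": 1, "cafe": 10}}[context]
    ranked = sorted(labels, key=lambda l: weight[l["kind"]], reverse=True)
    return [l["name"] for l in ranked[:limit]]

print(pick_labels(labels, "transit"))  # → ['Union Square Station']
print(pick_labels(labels, "leisure"))  # → ["Joe's Coffee"]
```

With raster tiles this choice is baked in server-side; with vector data on the client it can be recomputed per frame.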
Mike: Hannes, what's some hard stuff that you thought would be easy?
Hannes: I'm continuing with the label placement. In itself it's hard to get a good placement; just making aesthetic maps is a whole research project, and it takes cartographers to lay out the labels so that the most important features are shown. And with continuous zooming it gets even harder, because you want consistency between the frames. You don't want to base the label placement only on the data available in the current frame, because then it can depend on the order in which labels are chosen, and they start to flicker -- you still see it even in some commercial maps. And getting really continuous zooming, where labels don't jump around, or get selected at a lower zoom level and then suddenly disappear when you zoom in -- this is still a challenging aspect.
>> Steve: Yeah, I've actually got a different one, just based on the nature of my toolkit. The thing that I spent the most time on that I never expected to is tile fetching. Say you've got a tile source which is coming from one of these standard providers -- it's in spherical Mercator, it starts at level 0. Now imagine you're warping that onto a globe, or you're running it in a different projection. And you have to do this on a mobile device in a tiny fraction of a second, because nobody really wants to wait for it, and it gets more complicated in 3D versus 2D. Users want to see something quickly, so you have to decide whether to fetch a lower zoom level first before going to the higher level. It's much more complicated than I thought it would be.
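For reference, the standard spherical-Mercator tile addressing Steve mentions (zoom starting at 0) maps a latitude/longitude to a tile index like this; warping to a globe or another projection means running this kind of lookup, and its inverse, constantly:

```python
import math

def tile_for(lat, lon, zoom):
    """Standard slippy-map tile addressing in spherical Mercator:
    longitude maps linearly to x, latitude via the Mercator formula to y."""
    n = 2 ** zoom                       # tiles per axis at this zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_r = math.radians(lat)
    y = int((1.0 - math.log(math.tan(lat_r) + 1 / math.cos(lat_r)) / math.pi)
            / 2.0 * n)
    return zoom, x, y

print(tile_for(40.7, -74.0, 12))  # Manhattan-ish → (12, 1206, 1540)
```

A renderer in a different projection has to decide, per screen region, which of these tiles to request -- and usually which coarser parent tile to show while waiting.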
>> Mike: Cool. So I was actually expecting 45 minutes instead of half an hour and was hoping for audience question time.
>> Let's hear some audience question time. Right here in the front?
AUDIENCE MEMBER: Yuri from Wikimedia. There were some conversations in the hallways about internationalization -- i18n -- so the question is, with vector tiles, how do you see the 900 or so languages that Wikipedia is currently semi-supporting, and where do we go from there?
>> Mike: So the question was how vector rendering can support the 900 languages of Wikipedia. Konstantin, I think you mentioned a couple of things about that.
Konstantin: Yeah, sure. Encoding all of these languages into vector tiles is definitely possible, and when you look at location names, most of the names are the same across most languages. For example, look at a city name like Berlin: it's Berlin in almost all languages. There are some languages that add an O at the end, and some languages that use different scripts, but these are typically very, very few. So you basically just encode the main name, like Berlin, and then you encode the various languages that have different names. In terms of file size, this typically is not very big. For most cities you have a couple of names, and street names don't typically have local translations -- when you look at the number of streets in OpenStreetMap that have more than two names, that number is very low. There was another question. Oh, here, Clifford?
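Konstantin's scheme -- store the default name once and only the languages that differ -- is easy to sketch. Tag keys follow OSM's `name` / `name:xx` convention; the Berlin tags below are illustrative:

```python
def compact_names(tags):
    """Keep the default name plus only those localized names
    that actually differ from it."""
    default = tags["name"]
    extra = {k: v for k, v in tags.items()
             if k.startswith("name:") and v != default}
    return {"name": default, **extra}

berlin = {
    "name": "Berlin",
    "name:de": "Berlin",     # same as the default -> dropped
    "name:ru": "Берлин",     # different script -> kept
    "name:eo": "Berlino",    # different spelling -> kept
}
print(compact_names(berlin))
```

Since most `name:xx` values collapse into the default, the per-feature overhead for hundreds of languages stays small.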
AUDIENCE MEMBER: So how stable are vector tiles, and what's keeping OSM from using them?
>> Mike: So the question is how stable are vector tiles and what's keeping OSM from using them? Matt, do you want to take that?
Matt: I'm not sure I understand the question. What kind of stability?
Mike: What kind of stability is the question.
AUDIENCE MEMBER: I'm not seeing OSM rush out and use them, so I'm thinking there's something unstable about them -- maybe it's not a mature technology?
Mike: So the follow-up question was sort of why isn't OSM using them -- is there some sort of maturity issue there? I guess there's a bigger question there, which is why not, and why not yet, at the same time.
Hannes: So ours was a university-backed project, but there are now vector tile services, and the format that was developed by MapBox is now supported by many renderers. Mapzen also has vector tile services providing these formats, so you can use them in your applications already.
>> Mike: So Matt, you guys are fairly new to this. Since you're developing this a little more recently, you probably have a better view of all the stuff that's been tried, and where it's gone and where it hasn't. Maybe you can talk about the current state of vector data formats that might be useful to OpenStreetMap as an organization?
>> Matt: The format issue is, I guess you could say, not yet totally mature. There are a few emerging formats that are very commonly used and widely accepted as being pretty good; however, there are probably at least four formats that are substantially popular, and that's maybe a big number for people who are looking at it in terms of a unified platform. As far as I know, Mapzen's services are pretty open in terms of the formats we choose to support: we support JSON tiles, GeoJSON tiles, MapBox tiles, and OpenStreetMap tiles. And we're open to future formats as well, but in a way I see it more as a server issue, in that the server has to manage these datasets and the client can adapt to a lot of things. I don't know if that answers the question, but that's my take on it.
Mike: Did you have something to say?
>> Steve: No.
>> Mike: Question right there? Andy?
AUDIENCE MEMBER: It's not a question, it's an answer. The question was why isn't OpenStreetMap doing it yet, and 80% of it is manpower. We don't have spare manpower to do these things, so it happens slower than you would otherwise expect. The other 20% is that the server software for generating vector tiles isn't mature. There are still a lot of different options, and there's no obvious update path from our current version, so changing the entire software stack is again a problem that we need to solve. So I would say 80% manpower, 20% the server-side software. All of the rendering and stuff like that -- that's all code. That will work when it needs to work.
>> Mike: Thank you. Other questions? Yes right here.
AUDIENCE MEMBER: We need Mapnik3 hopefully, hopefully this week.
>> Mike: So Mapnik 3 is a possible answer. I think you had something to add as well, Hannes.
Hannes: For our render infrastructure we use Tirex to update the tiles, and that works quite well, so if someone wants to experiment with this, it's on the OpenScienceMap GitHub account. And maybe one thing that's also stopping it, or is really hard, is the generalization of the data for lower zoom levels. Doing this in a cartographically acceptable way is quite hard -- just simplifying geometries can destroy the topology or make the map hard to read -- so this should be improved to make really good maps.
>> Mike: Thank you. There was a question back here, with the hat.
AUDIENCE MEMBER: Hi, my name is Derek from GC Trails, and I was just wondering your opinion about the offline download capabilities of vector maps compared to raster maps.
>> Mike: who's interested in answering this one?
>> Oh, there we go -- Steve?
>> Steve: Yeah, we did just that with the National Geographic World Atlas, and it was really nice. It's a hybrid image/vector map, but it's much smaller than the equivalent raster map would be, so the download time went down to just a few minutes. I'm a big fan of shoving everything into a single database with just one blob per tile and putting that on the device -- it's really an improvement in most cases.
Konstantin: In terms of file sizes for vector tiles, they really heavily depend on the amount of simplification and generalization you apply. We found at MapBox that most of the vector tiles are similar to the image tiles in terms of file size. However, there's one important thing to consider, and that's the number of zoom levels you support. With vector tiles, we only go down to zoom level 14 or 15, but to show street level with raster tiles you'd have to store tiles up to zoom level 18 or 19 on the device, which obviously takes up a lot more space. So in general, storing vector tiles on the device takes up way less space, and it has the added advantage of being able to change the design, so you don't have to store separate raster tiles for every style that you want to support in your app.
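The zoom-level argument is mostly arithmetic: each extra zoom level quadruples the tile count, so stopping at z14 instead of z19 saves roughly a factor of 4^5 ≈ 1000 in the number of tiles (ignoring per-tile size differences):

```python
def tiles_up_to(max_zoom):
    """Total number of tiles in a full pyramid from zoom 0 to max_zoom;
    a square tile grid has 4**z tiles at zoom z."""
    return sum(4 ** z for z in range(max_zoom + 1))

print(tiles_up_to(14))                      # vector cutoff: 357,913,941 tiles
print(tiles_up_to(19))                      # street-level raster pyramid
print(tiles_up_to(19) // tiles_up_to(14))   # → 1024, i.e. ~1000x more tiles
```

This is why a vector data set that stops at z14-15 and gets over-zoomed client-side can cover street level in a fraction of the storage.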
More questions? Yes, right here.
AUDIENCE MEMBER: Yes, a little more comment on your font approach, actually. Generally, end users are not going to care as much about a pretty little drawn polygon. They care about the readability of the map itself, and that always comes from text and fonts. Just a word of advice: I spent about six months of my life getting vectorization of fonts working correctly. It is possible to do, but you have to actually go to the operating system and have it handle those specialized characters and stuff for you, because if you don't, there's no way that vector maps are going to be usable without first handling the font issues. So I don't want to understate how difficult that is.
>> Mike: So, an interesting comment there about needing to go back to the operating system to get the fonts right. You were talking about Arabic labels, and it seems like there's a lot of work that's already happening at the OS level to solve that to some degree. Is there a way that you guys are bringing in that lower-level stuff without reimplementing it?
Konstantin: So what you're talking about, and what Matt also mentioned, is text shaping and contextual replacements, and there are a couple of libraries, like HarfBuzz, which already exist -- there's no need to develop all of that. But to use HarfBuzz, you need to access the actual font files, and these font files are typically very large because they contain glyphs for many, many thousands of code points. To give you an example, Arial Unicode is about 20 megabytes, and Google's Noto fonts, which cover almost all of Unicode, are in a similar size range, so shipping that amount of data with your app may not be a good idea. So in many cases, going to the actual operating system to make HarfBuzz use the shaping information from the fonts that are on the device is a good idea. For MapBox GL, we aren't doing that yet. We are working on that.
Matt: Yeah, I was going to mention HarfBuzz. We've absolutely learned that it is not a problem we want to re-solve. HarfBuzz is the text shaping engine used in Chrome and probably other places, and because of that it's a really reliable, battle-hardened solution, so there's no need for us to reinvent all of that. That takes care of a lot of layout issues, kerning and text shaping. As far as the actual readability quality, there's a technique that was first established in video games called signed distance field rendering, which lets you get visually crisp outlines on glyphs at a wide range of scales, and we're applying that in our renderer, as well as using HarfBuzz for the layout.
>> Hannes: I actually wrote about vector tiles in my thesis, and a large part I haven't talked about because it was too technical, at a lower level. It had to do with text rendering: FreeType for reading the fonts, HarfBuzz, and using script recognition so you know which runs are right-to-left or left-to-right, and then sorting them in the right order. For this I took parts from different open-source projects and put them in a library which is tailored to the specific task of drawing text with OpenGL, including updating the paths for curved labels in each frame. If you can show the screenshot -- there, you see it. This is created each frame, and the label moves smoothly around the corners. I'm going to release it soon; I just need to clean it up.
>> Steve: So actually I take a different approach than most of these guys: I just use the OS. The glyph rendering and layout on iOS and Android is pretty good, and since those are the systems I target, I just use what they give me, and then I use a dynamic texture atlas to stick the rendered glyphs in, and I get rid of them when I'm done. So that tends to solve those problems.
Hannes: For complete labels? Or each letter, if you move them?
>> Steve: The question was how do you do curved labels, and you'd have to lay those out yourself, that's true -- that's a much harder problem. If you're doing just simple labels, you can ask the layout engine on iOS, for example, to do it for you, and then reuse the glyphs, which is pretty fast.
>> Mike: So that turned into a really big answer.
AUDIENCE MEMBER: Briefly, in terms of the actual rendering, what languages would suffer if you use GL rendering of fonts versus the existing infrastructure?
Mike: Let's get back to this in just a second. I just want to finish up with this question here.
AUDIENCE MEMBER: Thank you. I'm curious what kind of problems and solutions you've encountered to do with tiles and tile boundaries -- label placement and consistency and 3D buildings come to mind, but maybe there are other things.
>> Mike: The biggest question for last is about tile boundaries and buildings and labels.
[laughter]
I don't know, Matt, do you want to take this one?
Matt: Yeah, sure.
Mike: I saw a cool 3D print of a neat tile boundary thing at your office yesterday.
>> Matt: Keeping tiles updated is really useful, but tiling also causes a lot of problems. One of the things that we're really trying to be general about in Tangram is supporting arbitrary views into the map space -- looking at an angle across the world, looking up at the horizon and beyond -- and this is really not what tiles were built for. Tiles are generally intended for fetching a rectangle of space, and that's it. However, if you're looking at some tilted view into the world, you're actually looking at what projects down to a trapezoid, and you have to figure out how to map that back onto the tile set. And not only that: if you're looking at a large angle, you have a lot of area covered by the far end of your trapezoid that covers a lot of tiles and very, very little screen space. So we put a lot of work into figuring out what size tile is appropriate to fetch based on where you're looking in the world. It's very easy when you're looking top-down, but figuring out how to wrangle the tile world into your needs is a big problem, and there's no easy answer there.
Mike: So we're a little bit overtime here, so let's move this end bit to the beer panel, which is after this one. In conclusion, vector rendering is a land of contrasts. Thank you, everyone.
[applause]
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Saturday, June 6, 2015
Mapzen is building Fences… and knocking down barriers
Diana Shkolnikov
Hello. I'm Diana. I started at Mapzen about five months ago, and that was my introduction to the geoworld. Before I had never touched OSM and I hadn't really worked with maps. I'm a software engineer. So that's my background. And so my first project, when I started, was to look into the quality of the administrative boundaries data in OSM, which I thought was interesting, and because I had never seen the file, I downloaded it -- I'm just going to go look at this data. Oh, wait, there's so much other data. How do I figure out what I'm looking at? And so we all know that OSM is massively unapologetically hugely awesome. And we all want this. Everybody is here because we want to encourage more edits to the data. But as it continues to grow, we consequently have less and less idea of what's actually in there. It's really hard to gauge the quality of any one subset of data. Like the admin boundaries. And so it becomes more and more important to separate that data out. A lot of times I feel like the OSM community -- we want all these edits, and it kind of reminds me of the underpants gnomes, if anybody is familiar with this reference. We want all of the data, we want to collect it. We don't have step two yet. But we know that at the end, when you collect all this data, it's going to be amazing, and everyone is going to win, right? But we have to figure out what to do in the interim, until it becomes awesome. And so what I would propose in the interim is that we do the four Is. We identify these interesting subsets of data, we isolate them, we inspect them, and then we improve them, which feeds data back into OSM. And so what we've done with the administrative boundaries -- we call it borders, fences, kind of the same thing. 
Basically we've identified that data set as anything that's tagged with boundary=administrative and an admin_level that's not null. We isolate that data: we extracted it, constructed the polygons using Osmium, created GeoJSON files, and made them available for download on the Mapzen website so that anyone can grab these files. We created the planet extract and also country subsets, so that if you're only interested in your country -- because that's what you know well, or that's what your use case entails -- you can grab that. It's hosted at mapzen.com/data/borders. And so the next step is to inspect this data. We've included the errors JSON file that was generated during the extraction process, and there are errors: there are missing nodes and ways. We hope that the community looks at those errors, goes back into the data set, and fixes them in OSM, so that the next time we extract the data they get picked up. And we want to create tools to help us put these things on a map in an automated way, so you can take a country and say: I want to see all the different layers and see holes in the map, and I'm going to fill those holes in. Or you can see where the data is sparse. Also, as people start using the data, they're going to notice the errors in it, and where things are missing, and that's going to generate discussion around the shortcomings of the data. This is just an image of the USA data, and you can see that it's really dense on the East side and relatively sparse on the West. Just visualizing it like this and having it isolated will help drive people to fix the data. And again, improve it: once you've realized what's missing, fix the errors, fill in the gaps, keep track of the progress. As we isolate this data set, we can also keep track of how many edits were made specifically in that subset of data, as opposed to overall in OSM -- because OSM is so large, that's hard to do.
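The selection rule Diana describes amounts to a simple tag filter. A minimal sketch of the predicate (the real pipeline assembles the polygons with Osmium; this only shows the tag test):

```python
def is_admin_boundary(tags):
    """Keep features tagged boundary=administrative whose
    admin_level tag is present and non-empty."""
    return (tags.get("boundary") == "administrative"
            and tags.get("admin_level") not in (None, ""))

# a state-level boundary relation passes, other features do not
assert is_admin_boundary({"boundary": "administrative", "admin_level": "4"})
assert not is_admin_boundary({"boundary": "administrative"})  # no admin_level
assert not is_admin_boundary({"leisure": "park"})
print("filter ok")
```

Everything that passes the predicate goes into the extract; everything that fails silently identifies the "so much other data" the talk opens with.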
And because we're going to build these extracts monthly, they're going to stay fresh. Any edits that you or the community make in OSM will get picked up and keep improving this data set. We're also open source, so if you have any contributions to the source code, the tools are available on our GitHub. You can also create your own extracts and add regions that aren't necessarily in there now. So we welcome the whole community to contribute back. That's all. Thank you!
(applause)
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Sunday, June 7, 2015
State of the Geocoder
Harish Krishna, Rahul Maddimsetty
HARISH: Can we start? All right. Good morning, everybody. We're going to talk about the state of the geocoder. My name is Harish; I'm a software engineer at Mapzen. Mapzen is a company that focuses on building open-source tools using open data. One such tool is called Pelias; it's an open-source geocoder. I've worked on Pelias, and it's data-agnostic in that it supports OpenStreetMap. I'm pretty sure that everybody in this room has built a geocoder at some point -- I know, even if you're in denial, you've thought about it. And if you did, you know that you're jumping through the same kinds of hoops dealing with data inconsistencies. Today I'm going to talk through some of these hoops. Specifically, I want to talk about the state of geocoding with OSM data, the issues it faces, the workarounds we have to implement, and what can be done for the data quality itself, for geocoding purposes. Number one: is your data complete? Any node can have tags, and these are the tags that can be used to specify a house number, a street name, a city, and a state. Oftentimes you'll see a node like this, which might have a house number and a street name, which is good -- it's the basic, bare minimum that you would need. But sometimes it won't have a house number or street name, or city or state information, which by itself isn't complete, and you can't always use such a node as-is, because if you look for 612A East Ninth Street, that might show up in more zip codes than just the one in New York.
A common workaround is reverse geo lookup: for every node, based on its coordinates, you look up and fill in the missing information. The city and state information can be derived from open data sets like Quattroshapes -- it's got boundaries, rich neighborhood information, states, provinces, and so on -- and Mapzen Borders, which provides those polygons and is refreshed monthly. But moving forward, this is the kind of thing that all geocoders need, so maybe a better solution is to improve the data quality itself: make it more complete. Maybe in the editor, while a person adds a node and fills out the address, you can make suggestions on the fly and ask them to fill out the city and state. You could also publish a weekly data set that post-processes the data and fills in the gaps, so geocoders can use that and not worry about incomplete data. Another problem is nodes without names or addresses. This is common as well, unfortunately. Take a node like this one, for instance: it's tagged as a park, but it has no name. There's no workaround -- we don't know what's missing. If it's park benches, trees, or lampposts, that's okay, because nobody's searching for those on a map, but if it's a building name or a landmark, that's not okay. Possible solutions: again, at node-creation time, check -- just ask the user, there's a park here, do you know the name? Or we can identify all the nodes that don't have names. If you just want to solve the park problem, maybe you could create a game called Guess The Park, where you show the park and ask users to guess it and, you know, figure out the name for it. Uniqueness is another problem: besides nodes without names, having multiple nodes with the same name is also a problem.
This one especially is a problem, because in the geocoding context, if you search for -- let's say you search for a park and you get two nodes, two good nodes, then it's not a good experience. This could happen when you have similar nodes with the same name, or a node that shares names and tags with its enclosing area. A common solution to this problem is deduping: the indexer figures out, wait a minute, this node has the same name as the other one, so it unions them, combines all the information, and makes sure that there's only one record for that name. It's expensive, but it's necessary.
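A hedged sketch of that dedupe step: merge records whose normalized names match within a small distance, unioning their tags. The distance threshold, the name normalization, and the record shape are all arbitrary illustrative choices, not Pelias internals:

```python
def dedupe(records, threshold=0.001):  # ~100m in degrees, roughly
    """Union records that share a (case-folded) name and sit close
    together, keeping one merged record per place."""
    merged = []
    for rec in records:
        for m in merged:
            if (m["name"].lower() == rec["name"].lower()
                    and abs(m["lat"] - rec["lat"]) < threshold
                    and abs(m["lon"] - rec["lon"]) < threshold):
                m["tags"].update(rec["tags"])  # union the information
                break
        else:
            merged.append(rec)
    return merged

parks = [
    {"name": "Elm Park", "lat": 40.70, "lon": -74.00,
     "tags": {"leisure": "park"}},
    {"name": "elm park", "lat": 40.7001, "lon": -74.0001,
     "tags": {"name": "Elm Park"}},
]
print(len(dedupe(parks)))  # → 1
```

The quadratic scan is what makes this expensive at planet scale; real indexers use spatial bucketing to keep the comparisons local.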
Moving forward, though, we should really try to clean up the data and maybe perform some sanity checks; keepright.at is a good tool for flagging various of these things. Next, let's talk about consistency. Most geocoders are interested in a finite number of tags. A geocoder probably doesn't care about nodes tagged as lampposts or park benches, but we would be interested in parks, buildings, address points, and things like that. So if there are misspelled tags, we simply miss those nodes in the geocoding indexing process. Similarly, inconsistent use of tags is a problem as well: there are multiple ways to tag a park. Leisure is one way to do it, but you also see landuse, and amenity can also be used for a park. So you can fuzzy-match all the tags -- if there are any spelling mistakes, you still catch those nodes -- and you can also treat those tag combinations as ways to express the same kind of feature, and that way you capture all the important nodes that you want the geocoder to find.
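Fuzzy tag matching can be as simple as allowing a small Levenshtein distance against the tag keys you care about. The tag list and the distance cutoff below are assumptions for illustration, not what any particular geocoder ships:

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

WANTED = {"leisure", "landuse", "amenity"}  # tag keys a geocoder indexes

def match_tag(key, max_dist=1):
    """Accept a key if it is within max_dist edits of a wanted key."""
    return any(edit_distance(key, w) <= max_dist for w in WANTED)

print(match_tag("amenty"))   # → True (one edit away from "amenity")
print(match_tag("highway"))  # → False
```

A tighter cutoff misses more typos; a looser one starts conflating unrelated keys, so the threshold is a quality trade-off.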
Some of the things that we can do as a community are to, you know, fix the data issues -- maybe warn the user at node-creation time, or maybe add a lint task where you go in and check for typos and things. Lots of beer, pizza, and community events can help as well. Address tags are what we as geocoders really consume; it's a structure that most geocoders rely on. These are some of the address tags formally used by a geocoder. As you can see, ideally, in an ideal world, all the numbers on the left should be the same, meaning that all nodes should have complete address tagging, but we don't live in an ideal world and, you know, it's fine. House number, street, postcode, and country are pretty straightforward, but the street could be further broken down into street number, street name, type, and prefix, and that's the kind of rich data that a geocoder would be able to take advantage of. Unfortunately, the coverage is really poor; we should do something about that. Another important tag is addr:interpolation. This tag is instrumental for street interpolation -- it helps you define address ranges along a street, in a way. Let's say there's a Main Street and there are five houses on it, numbered from 100 to 105, and not all of them have an individual node on that way. If the geocoder knows the way covers the range 100 to 105, then when someone searches for 103 Main Street and that node doesn't exist, we still have the Main Street way, we know that 103 is within the range, and so we can interpolate on the fly and give an estimate of where it would be.
Again, there are only about 1.8 million ways with this tag. Fortunately, there are multiple workarounds that geocoders implement -- for example, snapping to the nearest house number: if 103 doesn't exist but 101 does, you return 101 and ask the user if they want to go with it. If there are no house-number nodes on that street at all, then it would return the street itself. So if there's no 103 Main Street or anything close, you return Main Street, and then the client decides what they want to display.
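The interpolation itself is a one-liner once the range is known. A sketch under the assumption of a straight two-node way (real ways have many vertices, and real logic walks the way geometry):

```python
def interpolate(start, end, lo, hi, number):
    """Estimate a position for house `number` on a way whose endpoint
    nodes carry house numbers `lo` and `hi`, by linear interpolation.
    start/end are (lat, lon) pairs; returns None outside the range,
    so callers can fall back to returning the street itself."""
    if not lo <= number <= hi:
        return None
    t = (number - lo) / (hi - lo)
    return (start[0] + t * (end[0] - start[0]),
            start[1] + t * (end[1] - start[1]))

# 103 sits 3/5 of the way along a range from 100 to 105
print(interpolate((40.0, -74.0), (40.0, -73.99), 100, 105, 103))
```

Tagged parity (`addr:interpolation=even`/`odd`) would further constrain which numbers the range can produce; that is omitted here for brevity.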
Street interpolation has another solution: at index time, we could go over all of the nodes and all of the ways, split the ways into smaller segments, and index each of them. This is extremely expensive and non-trivial; you have to be careful doing it. Other solutions would include manually creating interpolation ways. These solutions really relieve geocoders from some of the work that we keep doing around data. Instead, geocoders can focus on far more important problems like natural-language processing and smarter search-quality ranking. Also, these solutions ensure better data quality, and with better data quality, search quality improves too. With that, I would like to hand it over to Rahul.
RAHUL: Good morning, everyone. I work on geo at Foursquare here in New York. Before that, I worked three years at Bing Maps, so having been on both sides of the open-source/commercial divide, hopefully I know what it would take for OpenStreetMap to be more competitive with commercial data providers -- and geocoding specifically offers some unique challenges. But to begin with, I would like to acknowledge, and in a sense get out of the way, the most obvious challenge inherent to any community effort, which is global disparity and inequality. I could refer to South Park season four here, but the literature that's come out about this says it better: OSM far better addresses quality in the UK and the U.S. than almost anywhere else in the world. I'm not going to belabor the point about coverage, I promise. Instead, I would like to examine what it is that commercial geocoders are able to accomplish, what it is about their data that enables it, and to reflect upon why doing the same things is difficult with OSM data. As I've come to appreciate in the last four and a half years, anyone can build a geocoder that does 80% of the job: perfectly spelled, unambiguous, fully-qualified addresses. The last 20% is where commercial geocoders really try to differentiate themselves, and if we want to talk about competing with them, I argue that the last 20% is all we should be talking about. To be fair, though, some of the last 20% is stuff that you build into the geocoding engine -- fuzzy input handling, or a miscellany of regional special cases if I'm being kind, idiosyncrasies otherwise -- and the code bases of geocoders tend to be littered with these hacks. But this is where the hard cases that matter to real users live, and that is when you need the data model to back you up.
In my time I've worked with data that modeled distinct address ranges for each alternative name of a street, and that modeled at least three different postal code and city contexts for each side of the street, you know? I know some of this is available in the TIGER import, although I would argue not in the best possible way. I could take the rest of my assigned time going over these corner cases -- stuff that makes geocoder developers weep on the inside -- but I want to make a bigger point about how critical structure is to geocoding. It is this structure that enables commercial geocoders to do what they do, and it is precisely this structure that is hard to replicate with OSM data. OSM's guiding philosophy pragmatically favors ease of mapping over everything else, including, say, ease of data retrieval in software. Which is fine -- I wouldn't change it for anything. But as someone building geocoders, I can't help but think about how that short-changes geocoding as an application. Take, for example, two other applications: rendering and routing. They rely on data that is surveyable, or otherwise observable from the ground or directly from the sky -- things like names, classifications, turn restrictions; things that you can see on the streets and update based on what you see. Regardless of the motivations of individual mappers -- and I imagine there are as many motivations as there are mappers -- those incentives line up perfectly. That is not the case, however, with geocoding, because it's not just about observation. Address ranges are most often not seen; postal codes and cities are unseen; gazetteer data is unseen. And the problem with unseen data is authority: you primarily have to work with authoritative sources, each with their own data formats, and import their data. I know we've imported TIGER, and people bash on it, but to work with these providers we need to create an appetite for that adventure -- or maybe be gluttons for punishment, whatever.
Right now we have guidelines actively discouraging importing anything that's not visible on the ground. I'm not particularly privy to the reasoning, but the way the argument is made against importing parcel data is that parcels are not useful on a map, and that is telling, in my opinion. They're not useful rendered on a map, but ask anybody building a geocoder and they'll tell you how useful they are. Then I come to structure itself. For rendering, structure is implicit, and for the most part we're putting things in the right place. Again, that is not the case with geocoding. One of the most common causes of ambiguity in geocoding is just resolving and parsing the various street name components. Commercial data providers have highly structured street name data that separates the base name from street types, prefixes, suffixes, and a hundred other things, actually. But in OSM we have this proliferation of unstructured tags. You have separate tags for house numbers, for streets, for everything. And as Harish mentioned, there is a structured street name schema in OSM, but the coverage is low. What I'm suggesting is that the structured tags that do exist tend to be drowned out by the sea of unstructured tags, and that might be the problem for us to fix. This does not affect rendering and routing, because for those the names are just strings that you draw on a map or insert into directions text, but it makes all the difference to a geocoder to know which part is the base name and which part is the metadata.
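The base-name/metadata distinction he is describing can be sketched in a few lines. This is a minimal illustration, not any real geocoder's code; the suffix table and function names here are invented:

```python
# Minimal sketch: split a street name into base name + type metadata,
# then generate equivalent spellings for index/query matching.
# The suffix table is illustrative, not a real geocoder's data.

SUFFIXES = {"st": "street", "street": "street",
            "ave": "avenue", "avenue": "avenue",
            "blvd": "boulevard", "boulevard": "boulevard"}

def parse_street(name):
    """Split 'W 34th St' into ('w 34th', 'street')."""
    tokens = name.lower().split()
    if tokens and tokens[-1] in SUFFIXES:
        return " ".join(tokens[:-1]), SUFFIXES[tokens[-1]]
    return " ".join(tokens), None

def variants(name):
    """All equivalent spellings: the base alone, plus the base with
    every abbreviation of its street type."""
    base, stype = parse_street(name)
    if stype is None:
        return {base}
    out = {base}
    for abbrev, full in SUFFIXES.items():
        if full == stype:
            out.add(f"{base} {abbrev}")
    return out
```

Indexing every variant of each street name (and expanding queries the same way) is what lets "34th St" and "34th Street" resolve to the same record; with only an opaque string, none of this expansion is possible.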
Among other things, it helps you generate better name variants at index time, and it allows you to perform more fuzzy matching in realtime. Furthermore, geocoding is all about relations; it is obviously the application that benefits the most from them. From the point of view of software, working with string-valued tags instead of ID-valued relations is extremely wasteful and inefficient. Every OSM geocoder wastes a bunch of time resolving these tag references by either searching by name or by containment. This goes particularly for matching a house number to a street, or a street to a city, or grouping multiple segments into a street, and it is extremely wasteful. Not only does it slow down index builds in the present; in the future it is going to stand in the way of parallelizing them. Now, I know that not everybody cares about geocoding as much as I do, but imagine if you're really going up against commercial geocoding: this is going to really help in the long term. And this brings me back full circle to how relations are being used in OSM. We have admin polygons right now that are expressed as relations. Yes, I get it, they make editing harder as well. But given the quality of these admin shapes, it's clear that mappers are willing to go through the trouble to add those relations, right? On the other hand, there are questionable returns for a mapper in adding geocoding-specific relations.
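The string-versus-ID point can be illustrated with a toy example. This is a sketch with invented data shapes, not OSM's actual model:

```python
# Sketch: matching house-number records to streets by string name
# (what OSM geocoders must do today) versus by an explicit ID
# reference (what a relation would give you). Data shapes invented.

streets = [
    {"id": 1, "name": "Main Street"},
    {"id": 2, "name": "Main Street"},   # same name, different place
]
houses = [
    {"number": "12", "addr_street": "Main Street"},  # string join
    {"number": "12", "street_id": 1},                # ID join
]

def match_by_name(house, streets):
    """String join: every street with the same name matches, so a
    second disambiguation pass (containment, distance) is needed."""
    return [s for s in streets if s["name"] == house.get("addr_street")]

def match_by_id(house, streets_by_id):
    """ID join: a single dictionary lookup, nothing to disambiguate."""
    return streets_by_id.get(house.get("street_id"))

streets_by_id = {s["id"]: s for s in streets}
```

With the string join, both streets named "Main Street" match and the geocoder has to run geometry checks to pick one; with an explicit ID the resolution is one constant-time lookup, which is the efficiency gap he is pointing at.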
So I personally think that there is an absence of incentive, and no amount of pleading or pressuring the community is going to compensate for that. So we shouldn't even try. Instead, I think this is something we could easily do: compute the relations and add them back to OSM, or add them to a different data set, because it doesn't make sense for individual geocoders to derive them independently every single time they build an index. Finally, changing tracks. Last year, David Blackman attempted to label streets using Foursquare venue data; it was a really hard problem and it didn't quite work out. This year, I came up with something similar. As Harish mentioned, address coverage is really spotty in OSM, and we could relatively easily create interpolation data from source data. I would be happy to if you would let us, and if a system existed that would accept this data. And finally, again, while I have your attention and while I'm speaking on behalf of Foursquare, I just want to mention ODbL and how it kind of hinders the options. David Blackman talked about it last year in the context of not being able to use ODbL-licensed data, and Gary touched upon, among other things, how geocoders tend to have it harder because they're directly exposed to the underlying data. The basic nature of this argument hasn't changed, and there are people more qualified to make it than I am, so I'm just going to mention it and move on. But if anybody's more capable of making it, I'll welcome that. That is everything I have. Thank you. I'll take questions.
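The address-range interpolation he proposes is, at its core, simple arithmetic: place an unknown house number proportionally between two known ones. A hedged sketch follows; the coordinates and numbering scheme are invented, and real interpolation also has to handle odd/even sides of the street, curved ways, and irregular numbering:

```python
# Sketch of address-range interpolation: given the endpoints of a
# block with known house numbers, estimate where an intermediate
# number falls. Coordinates and numbering are invented for illustration.

def interpolate(number, lo_num, lo_pt, hi_num, hi_pt):
    """Linearly place `number` between (lo_num at lo_pt) and
    (hi_num at hi_pt) along a straight segment."""
    if not lo_num <= number <= hi_num:
        raise ValueError("house number outside this range")
    t = (number - lo_num) / (hi_num - lo_num)
    return (lo_pt[0] + t * (hi_pt[0] - lo_pt[0]),
            lo_pt[1] + t * (hi_pt[1] - lo_pt[1]))
```

For example, with 100 Main St at (0, 0) and 200 Main St at (1, 0), `interpolate(150, 100, (0, 0), 200, (1, 0))` places number 150 at the midpoint (0.5, 0.0), which is why even sparse range data from a source like venue check-ins can fill large gaps in address coverage.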
AUDIENCE MEMBER: I was most interested in the last thing you said, about how Foursquare could provide the interpolation information you mentioned. As we know, that gets us into the usual import conversation and things like that, but earlier you mentioned separate data sets; having one data set that solves this problem for a lot of people would be a nice way around it. Is that a good way to start having that conversation? If there were a data set that isn't constrained the way OSM is, it would help demonstrate the value, and we've already seen, within the world of other geocoders, people looking at data sets beyond OSM itself.
RAHUL: I'll repeat the question. So the question was about when I was speaking on behalf of Foursquare, wishing there was a way to give you interpolation data: instead of contributing it directly, is publishing it as a separate data set a way to start the conversation? Absolutely it is. I don't know how to start it. I hope this talk is one way of getting that conversation started, but -- I'm sorry? And the birds of a feather session. Yeah.
HARISH: Right, so there's going to be a birds of a feather session on geocoding, which I encourage you all to come to; it's going to be around 2:00 or 3:00 today. Please look at the whiteboards, and we'll discuss things like that for sure.
AUDIENCE MEMBER: I have a question about New York City. If you have a map, and you want to look for a place to order food -- if there were a database for eating, and for getting home automatically, it could help.
HARISH: Yeah, we do use -- did you mention Foursquare, or do you just mean a local search perspective? So geocoding is one part of any local search fulfillment. Typically, search happens by parsing out a "what" and a "where": the "what" is the intent, and the "where" is for the geocoder.
HARISH: Your question revolves around how we find hotels and churches in New York City. And like I said, some of those tags a geocoder is interested in. If you have the geocode for an area, and it gives you all the restaurants, then you could essentially do what you were asking for. Any other questions? Yes?
AUDIENCE MEMBER: What would you say about the possibility of moving beyond theoretical address coding and interpolation, and eventually, with technology, having point data that's accessible through APIs, given the need for much more accurate addresses? What would you say to that?
RAHUL: Well, that's kind of like a unicorn: it's what we want, but I don't know how we're going to get there. An address has only so much structure, and if everything were perfectly mapped, then we wouldn't have to do interpolation, or any of the workarounds I mentioned. It would just work. So we can try to educate people on how to add an address node and hope for the best at this point, but hopefully in the future -- you know, hopefully in each geography, from the get-go, everyone knows how to tag these things, and eventually we'll have a perfect data set. Does that answer your question?
AUDIENCE MEMBER: Kind of. You can talk, though.
HARISH: Yeah, I don't have much to add, really. It's just like Harish shared: if it's mapped, you can just search and return it, although parcels would be nice for reverse geocoding. What we would do with interpolation is fill in partial data -- but I'm being told I have to stop now. Thank you so much, everyone.
[ Applause ]
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Improving Diversity in OpenStreetMap
Kathleen Danielson
Sunday, June 7, 2015
(tapping mic)
>> Is this... Oh, it is on. Okay. Great! Can everybody hear me? It seems like the mic is working. Awesome. I am Kathleen Danielson, and I'm going to talk about improving diversity in OpenStreetMap. A quick introduction -- I am a developer advocate at Mapzen. I am on the board of directors of the OpenStreetMap Foundation. And I am on the advisory board of the Ada Initiative, which I will talk about later. So what am I talking about? What do I mean when I say I want to talk about diversity? I want to talk about things like gender diversity, racial diversity, sexual orientation and gender identity, disability, socio-economic status. I am primarily going to focus on gender diversity today. That's what I know the most about. It's what I'm most qualified to speak on. Many, many, many things that I talk about today will be applicable to all these other categories. And the way I kind of will frame this is from a feminist perspective -- but particularly intersectional feminism, meaning that you can be in favor of equality for women while not at the expense of other marginalized groups. If you feel like learning more about this, which you should, this book, Feminism is For Everybody, by bell hooks, is the primer on the subject. So Open Source has a problem. Actually, it has a couple problems. The first one that I want to talk about is the problem of implicit power structures. Open Source is a very unstructured thing, and it is that way by design. The problem comes when that lack of structure comes into the social workings. Any group of people ever has a power structure. That's just how people work. In Open Source, what you end up doing usually is you replace that power structure that normally is stated out loud -- this is who is managing what -- and you replace it with an implicit power structure, where nobody knows who's managing what. Unless you're already a part of the cool club. Unless you're already on the inside. If you're not, good luck getting in. 
Suggested reading for this is a wonderful essay called the Tyranny of Structurelessness, by Jo Freeman. It was published in the early '70s, talking about the feminist movement at the time. It's wonderful. What happens with that structurelessness is that privileged groups end up rising to the top. Privilege can be a trigger word for some people. It makes some people feel defensive, and I get that. Open Source has a lot of interplay between privileged and marginalized groups, and if you want to learn more about this, which you should, John Scalzi has a wonderful series of blog posts explaining what privilege is and what it means and what it doesn't mean. And then Open Source also has an unequal distribution of funds. I think there's a story we tell ourselves about how Open Source is a cash-neutral game. These are volunteer projects. Everyone is just doing them for fun. It's free. And therefore there's no money in it. That's not true. Look at the conference around you. Look at all sorts of other Open Source, open communities -- this is not a cash-neutral economy. So OpenStreetMap is not immune to these problems. OpenStreetMap absolutely falls under the umbrella of tech and Open Source, even if it's more of an open data project, but we'll actually talk about that a little bit too. So one more suggested reading -- my colleague, Alyssa Wright gave a talk two years ago at this conference called the threads of OSM discussion, are the doors really open, a really great discussion of diversity in tech and what that looks like through the lens of OpenStreetMap. So what I'm not discussing today. I'm not going to prove to you there's a problem. If you don't believe me, you're probably not going to like the rest of the talk anyway. I'm not going to talk about the pipeline. Pipeline issues are important, but they are something I think... 
Sometimes they are used as a way to derail conversations about toxic environments, and it doesn't matter if we have an amazing pipeline if the environment that we're dumping people into is unhealthy. And I'm also not going to talk about geographic or linguistic diversity. These are things that are really important to OpenStreetMap. They fall into some slightly different categories than some of the subjects I mentioned already. So I'm not going to talk about those. The other thing is that I often find them coming up as derailing tactics. Even though they are very important, people like to bring them up when you're talking about diversity, because it is less uncomfortable to talk about something like geographic diversity than to talk about racial diversity or gender diversity. So I'm not talking about that. So there's a lot that is happening in the tech industry. Diversity in tech is something that's on everybody's lips. Everybody is talking about it. A lot is going on, and there's a lot we can learn. There are a couple different types of organizations and communities, and programs doing things in different ways. One of those ways is by building new communities. People -- there are communities that are created that -- I'm sure there's probably an actual word for this, but I decided to just call it communities of identity, where you're creating -- these are going to be like Women Who Code, or Latinas in Tech. Those both might be things. The value in having these communities of identity is that you're building communication channels, you're providing access to networks, for people who have been traditionally left out of those. The other thing that's really valuable in those communities is that you enable people to spot and combat problems that are much, much harder to notice and much harder to recognize as problematic when you're in isolation. So you can see things like -- where maybe one incident isn't sexist or isn't racist, but you know what? 
This sort of thing happens in all... Like, this sort of thing happens over and over. It's recurring. Maybe it's only happened to me once, but it's also happened to eight of my friends. That's a really, really powerful thing as well. So communities that are essentially adjacent to larger ones that are creating this community of identity -- these are groups and organizations like Black Girls Code, Rails Girls, Django Girls, PyLadies, TransH4CK. These are providing networking and community to groups that have traditionally been left out of that. So in addition to creating new communities, other groups work on improving existing communities. Probably my favorite example of this is the Boston Python Workshop. This was created by people in the Boston Python user group. They created a workshop for women to learn Python, and then specifically designed it to integrate those new Pythonistas into the local user group. So they were very specifically working on improving the diversity of the existing community, rather than creating an auxiliary community. There are pros and cons to doing things both ways, but they were very specific about what they wanted to do, and what they wanted to do was improve the Python community. Another thing that was really neat about the Boston Python Workshop was that the Python Software Foundation provided that initial group with grants to go and implement this workshop in other communities around -- I think around the US. It could have been international as well. I'm not positive. Another group that works on improving these existing communities is Outreachy. It used to be the GNOME Outreach Program for Women. They recently rebranded. They changed. They're not just focused on women. They support trans and cis women, trans men, and LGBT people -- but essentially it's Google Summer of Code but for diversity. 
And it pairs up interns with Open Source projects, pairs them with a mentor, provides funding for them, which is the really important part, and Outreachy -- OpenStreetMap, we're actually working with them. HOT is now in their second round with Outreachy, and OSM now has their first interns from Outreachy. And the final group of... The final type of project is one that isn't really for building a community, but is instead providing resources. One of my favorite examples is The Ada Initiative, that I'm on the advisory board for. The Ada Initiative has done a lot of work supporting the implementation of -- encouraging the adoption of codes of conduct at tech conferences. They also provide impostor syndrome trainings and ally skills workshops, which are really neat ways for people who want to be good allies, but don't know how. How do you do things like recognize bad behavior? How do you call that out? Especially because you're usually in a better position to do that than whatever marginalized person is there. And then they also provide AdaCamps, which are small unconferences for women in open tech and culture that they hold around the world, and those provide some more of those communities of identity. Another example -- and this is closely related to The Ada Initiative -- is the Geek Feminism wiki. It's exactly what it sounds like. It's a wiki of Geek Feminism. One really important thing it has is a timeline of incidents. When you talk about sexism or racism in tech, you get a lot of pushback on -- well, this isn't really a problem. Or our community doesn't really have this problem. And by starting to write them down and say -- you know what? Yes, there are incidents every day. Every month. Every conference. Starting to write them down across the industry, you see these trends, and it's a really powerful thing. And it's a very powerful thing for people who are using it. 
The Geek Feminism wiki also provides great feminism 101, essentially, which you can send people to, or use yourself. For myself, I think, as a baby feminist, I felt like the Geek Feminism wiki was really useful, because it was an accessible place for me to learn all sorts of things. Things I didn't know had names. It gave me vocabulary for things that I was experiencing that I didn't know were weird. And it does this by outlining patterns of harassers or abusers, things like silencing techniques, derailing tactics, explaining things like -- what is gaslighting? Things that are very subtle, and are intentionally very subtle, and when you're in isolation, you don't know how to spot them. So a resource like the Geek Feminism wiki is pooling these resources for groups that normally wouldn't have access to them. Wikimedia is one more that I wanted to mention, and Wikimedia is really important, because there's no other project that parallels OpenStreetMap socially the way that Wikimedia does. No Open Source software project can ever have the same parallels that we have with Wikimedia. Wikimedia is doing interesting diversity work. They have certainly found that among their editors, there is a lack of diversity that they really need to do something about, so they've built up a lot of resources on -- shocking -- wikis. So you can go in and see what their gender gap task force is working on. They have different surveys, especially to see what -- how successful their initiatives have been. There's lots of academic scholarship about Wikipedia, and within that, lots about diversity within Wikipedia, and it's all there. You can find all of that. Okay. So I sort of promised a very action-oriented talk, and so far, I have provided zero actions for all of you. I'm going to change that. So what about OSM? What's going on with the project we're working on? We're already actually doing quite a bit. There are codes of conduct at several conferences. 
So the one you're at right now has a very good code of conduct. Last year's State of the Map in DC was the first State of the Map US with a code of conduct. I was involved in planning that. Helped put that code of conduct in place. Travel grants. There have been travel grants for the last... This is the fourth conference to have -- fourth State of the Map US to have travel grants, and increasingly, those are earmarked for marginalized people. There's a diversity mailing list, and I don't know how many of you know this, but actually last fall, the OpenStreetMap community raised over $2,000 for The Ada Initiative in 36 hours. Because I was like -- hey, we should do this, and it was a day and a half before their fundraising drive closed. So maybe I'll do that earlier than 36 hours this year. Okay. So that's sort of the stuff we're already doing. But what next? I want to provide a quick caveat that what I don't want to be advocating for is more invisible labor. I think that one of the worst things we do is we expect marginalized people to demarginalize themselves. And by... And we also tend to discount the work that goes into diversity efforts. So I tried, as I was pulling these together, to not be suggesting that someone take on any more unpaid labor. Which is also an interesting thing within a volunteer context. But we can talk about that later. So this one seems really obvious. And I wish this was just my whole slide, and I would mic drop and leave, but we need to make these spaces healthier. And what do I mean by that? Stop tolerating bad behavior. Stop tolerating bad behavior. Stop tolerating bad behavior. Codes of conduct are in place and are implemented, because you need to have structure. You need to eliminate these implicit rules, and provide guidance around what's acceptable. What I would like to see is State of the Map -- the OpenStreetMap Foundation that I'm on the board of -- we have the trademark for the State of the Map. 
This is something I've talked about with Kate, with some others. I think it would be great if we were creating a trademark policy that said basically -- you cannot use State of the Map if you do not have a solid code of conduct. I would like to see the same for all mailing lists that are hosted by OpenStreetMap. We've talked about this. Some do, some sort of do -- it sort of starts and sort of peters out. I would like anything that has a mail server to have a code of conduct. And beyond that, just having a code of conduct is a nice gesture, but it doesn't matter if you can't enforce them, if you can't implement them. And this is not something that just we struggle with. I think across the tech industry, it's really a challenge to deal with -- how do you implement these things? How do you enforce them? I think that we need better plans for that. I think there need to be better resources about it. So we need to stop tolerating bad behavior. We need to increase the access to funding. Like I said before, this is not a cash-neutral project. Look at the conference you're at. There is money in this economy. So who is making a living off of this project? Who is doing work that is based upon OpenStreetMap? That is working towards or with OpenStreetMap? And what does that distribution of funding look like? I would love to see -- I would actually love to see research into this. Someone doing a proper survey of where is the money in OpenStreetMap. Who has it. Who has access to it. Because I think the most obvious ones are going to be the larger companies, like Mapzen, who I work for, Mapbox. But there are lots of other people who are paid to work somewhere within the OSM ecosystem, or in a way that is very related to OSM, and when you end up having one person, two people, at an organization, it's much harder to spot patterns. 
So if we were to look industry-wide, we would see how that money is distributed, and I'm very confident that it's not distributed in an equitable way. And the last thing I want to recommend that we do is I think we need to capture our experiences. And this is something that I think we can really learn from Wikipedia and Wikimedia, that they are writing down what they are experiencing, what they are working on. And I think that there are a lot of ways that we can do something really similar. So we have a wiki. And why don't we use it? I checked. There is no diversity page on the wiki. So is there OSM diversity research? I think so. I don't really know. If we had a wiki page that just said -- here. We've done this research. This is a wiki page about this diversity -- that would be really straightforward and we would know. I think best practices and resources -- it doesn't make sense for us to try to do something like rebuild the Geek Feminism wiki, but there are going to be things that are specific to what we're doing and resources that are helpful to us, and I think specifically we can find them through Wikimedia. I want to learn a lot more from them. There's so many lessons there that we just don't take advantage of. I think also something that would be really easy is just do some greatest hits from the mailing list. It's not that we don't ever talk about diversity. We talk about it a lot. But then it goes out to the mailing list and it's out in the ether and no one sees it again. Put them in a list. Say -- here are conversations we've had about diversity. Look at trends. What's successful? Maybe there are things we assume everybody disagrees on that not a lot of people do. That's a really straightforward thing that we can do. And then another one that I think would make a lot of sense is simply to track ongoing diversity projects. Things like code of conduct implementation. 
Because I think there are a lot of things where we started something, it petered out, we started it, it petered out. And I don't think that tracking it is going to make it not happen, but it'll show trends. We started this, it failed every time, we probably need to reassess how we're approaching it. So the point I'm really trying to get across here today is that this problem is not too big. Yes, diversity is a challenge. It's complex. There are a lot of moving pieces. But it's not too big, and we can do something about it. Talk about it. If this is something you care about, tell your boards. Are you a member of the OpenStreetMap Foundation? You should be. And tell your boards. Tell me. Tell the rest of your fellow board members. When the elections happen, make sure people know that that's something you care about, and that you want people to be working to improve diversity in OpenStreetMap. Better yet, run for the board yourself. There's also -- in the US, there's the OSM US board. Same goes for them. Make sure they know that's something you care about. Some other countries also have boards for their local chapters. Let people know that it's something you care about. Educate yourself. This is something I need to keep doing as well. But again, one of the things that we do is we say -- it is the responsibility of the marginalized groups to explain to me how they're marginalized. If you think about that for a second, it's ridiculous. It doesn't make any sense. So educate yourself. Learn about this stuff. And don't be so oblivious. Because that lets you then spot bad behavior and say something. We spend a lot of time tolerating bad behavior, and it's by being very passive -- we are making our priorities super, super clear. So let's change those priorities and make that super clear. Stop relying on invisible labor. 
Stop relying on people who are already -- already don't have access to the same resources to build up an understanding of why they don't have access to these resources. Money talks. This is not a cash-neutral economy. There are companies in this economy, there are organizations, there are foundations that are very invested in this project. It's a valuable project. It is... And if that is something that you're involved in, if your company has money, put your money where your mouth is, and work on improving diversity in OSM. Because it matters. If you want this project to be successful, it needs to not be a homogenous group of mappers. It needs to not be a homogenous group of developers who are working on it. And finally, pay the experts. I thought a lot about this as I was getting this talk ready. Did I want to suggest that we can put our own funding structures in place for things like microgrants or non-microgrants? And I think that there's a lot that can be done right now, by supporting the organizations that already do this. There are people out there who are experts, who do this across Open Source projects. They know what they're doing. They're very good at it. And I think if we say -- we can run this internally -- what we're going to do is we're going to take on this invisible emotional labor of running something, where we're not necessarily the best equipped to do that. There are people out there that we can pay. Let's pay The Ada Initiative, get them to do ally training skills at a conference. Let's pay Outreachy to bring junior developers into the ecosystem. So diversity matters. Thanks.
(applause)
>> Before Q and A starts, I'm just going to do one quick comment, which is... We're going to restrict it to questions and not comments. I'll be around after this, and I can talk to you about comments. And there's Twitter. You can say all sorts of things to me on Twitter. Ideally nicer things. But it's Twitter. Okay.
>> First of all, thank you. Also, how do you think this compares in the OSM community in general versus the general tech? Is it better? Worse? How have you seen it so far?
>> Um... I think there are some ways in which... There are some ways in which OSM I think is behind tech in general. I tend to find that the conversations around diversity are... They're delayed, basically. They're things that other communities were talking about one or two or three years ago. So in that way, I think the rhetoric is behind the times. I think in terms of numbers, I think to some extent, because the rhetoric is behind the times, there hasn't been as much improvement in numbers. I don't have good research at my fingertips. Right there.
>> I was just curious -- if you see some of the efforts that you're proposing with OSM as part of a more general framework for other communities? Because OSM... I mean, it's not unique in the sense that there are many Open Source communities, and then there's even closed source communities that this stuff applies to too. Do you think it's more effective that people within the organizations find their own way to do this? Or do you think it is more effective on a general scale? If we can all agree across the industry, on a set of guidelines?
>> I'm not sure if we can agree across the industry on how to improve diversity. That would be lovely, though. A lot of what I'm saying is applicable to different projects. Also, different projects have their own special issues. So I think what I'm saying around -- we need to start capturing our experiences. I think that for some communities, they wouldn't necessarily need that. They're kind of in a better place now, or... And for some communities, they might... Like, that might be too ambitious. But certainly much of what I'm saying is very applicable to different projects. Any other questions?
>> I don't know how to... Why do you think that geographic diversity... Why don't you think that's (inaudible)...
>> I'm going to defer that question, just because I left -- I intentionally left geographic diversity out. Basically I think it's important. We can talk about it after. But I want to set that aside. Because it's an entire masters thesis unto itself. All right. One last question?
>> What do you think about just users that are anonymous? And you don't necessarily know who's doing what? Or what they are? Or even if they're human. Or what-not.
>> I assume they're all cats. I think that's great. I think that there's no reason that people need to specify gender. I don't even think there is any way to specify gender with OSM. I think it's perfectly fine. And excellent. But they may be cats. All right. Thank you very much, everyone.
(applause)
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
O.S.M.B.A. The history and future of companies in OpenStreetMap
Randy Meech
Sunday, June 7, 2015
>> Hey, everyone. That's really loud. This is a full room. But business is so boring. I'm surprised you guys are all here. That's funny.
>> (inaudible)
>> Exactly. So thank you for coming. My name is Randy Meech, and this talk is OSMBA. And it's the history and future of companies in OpenStreetMap. And I do not have an MBA. I actually studied literature and religious studies. And I'll just go over my background a little bit so it makes sense that I'm talking about this at all. So I started in tech at the end of the '90s tech bubble, and then as things kind of fell apart, I luckily went to Google and spent five years there, which was amazing, and I did startups after that, one of which was acquired by AOL, where I became head of engineering for MapQuest, where we launched all the OpenStreetMap tools that MapQuest supported back then, around the 2010 time frame. Now I'm CEO of Mapzen, which is based out of the Samsung accelerator. The closing party is tomorrow night, so I hope to see everybody there. This has happened a few times in a few different talks. I think Michel did this in the last session as well. How many people here, show of hands, work on OSM professionally? I think it would be more in this talk, probably. Actually, not so much. Yeah. I just think it's an interesting question. So let's get started. So the origin of OpenStreetMap -- I think it's pretty well known -- it was founded by Steve Coast in 2004. And he wanted to make a map -- he was in the UK. He had a very practical need. He wanted to make a map, and it was very difficult to do. You may be familiar with the Ordnance Survey, which holds all the official map data in the UK. But this was not accessible to individual developers. Bigger companies would have had this data, but it was not an option back then. Looking at the success of Wikipedia and crowdsourcing, he started this project with a bunch of early contributors, many of whom are here, which is great. It was a good time to start, because there was the arrival of good, affordable handheld GPS technology, as well as a proven crowdsourcing model with Wikipedia. So it was a good confluence of events. 
And Steve was kind enough to let me interview him for this talk. And one of the questions I asked him was if they thought at all in the early days about enabling future businesses when they started. And his answer was pretty quick. They were thinking more about disrupting existing models. For example, it was difficult to get data from the Ordnance Survey. So this was much more of a disruptive crowdsourced effort, rather than a blueprint for future businesses. You know, FYI, there are currently four global map data sets. There's Nokia's NAVTEQ, TomTom, OpenStreetMap, and Google. So two years after OpenStreetMap was founded, Steve started the OSMF. This is a UK not-for-profit company. We've had some talks about it. Kathleen talked about it earlier today. Kate Chapman talked about it. So there are folks here from the OSMF, but I want to think back to that time and what an interesting move that was. There are a lot of things that could have been done instead of starting the OSMF. Which is a group, a not-for-profit, that's charged with growing and taking care of the project. Steve could have made this a company. It could have been dual licensed. Or it could have been something more like the Wikimedia Foundation, which is a strong, centrally controlled project. Instead, it's a database that people can use and share. So I just think it's a really interesting model that provides a lot of opportunity for us now. But there's tension. You just need to look at the mailing lists or look at blog posts. There are arguments over the license, and whether it's fit for business. And we'll get into this a little bit. So we'll start with the license. Which I'm not going to argue about, personally. I'm just going to describe it. So at the time when it was founded, it was CC-BY-SA, which was chosen -- it's similar to Wikipedia. With OpenStreetMap, it's since been replaced by the ODbL, which is more or less the same thing. I won't get super into that. 
But the thing about the license is it allows you to use it for any purpose, commercial or not. All you have to do is two things. You need to give attribution to the creators of the data and you need to share back any improvements that you make. So the license was chosen to benefit the project and to ensure its growth, and there's a slight fear that the data would be taken and improved maybe by a company and not shared back, and there is a tinge -- sometimes -- of maybe there's a resentment of companies using my data or the data that I created. Why should someone profit off this? But it's important to note -- free software culture is strongly against limiting use. So it's okay to require attribution and require share-alike and still be free. And OpenStreetMap is rooted in free software culture. So you can read about the GPL's four freedoms or the open definition. I actually have a syllabus at the end of this. If you go to my Twitter page, I have a pinned tweet that gives some articles about this, if you want to read anything about this. So these arguments are very similar. But there are arguments about the license. Clearly something works. We argue about how well it works for businesses or if it could be better. But there's a huge number of people at the conference. We're at the UN, the data sets are amazing. Two days of amazing conversations. Humanitarian, commercial -- all kinds of stuff. So I just want to pause and say that something's working. Sorry. Just going to grab some water. And there's also a number of businesses here. You know, we have folks from Apple, Facebook, Twitter, ESRI, Microsoft, Amazon, all in attendance, I believe. And not all of them are using the data, but the fact that everyone's here -- I think it's very interesting. So something is working, even if there are some problems. So there's been a lot of writing on this topic lately. 
Actually, since I got this talk approved, so much has happened and so much is about to happen that it's kind of tough to plan it. I was kind of worried there would be last minute announcements of things today, but that didn't happen. One article I linked to in the syllabus at the very end was something by -- a piece that Gary Gale wrote in Geohipster. And now that I'm reading it in my notes, I'm not sure it's appropriate to say. But the article argues against CC-BY-SA and claims it's holding the project back. I'm not going to quote it here, but it quotes from Simon Poole's diary entry. And he points out, among other things, that because there's no strong central organization like Wikimedia, the OSMF leaves space for other companies to build on top of it. It's very permissive. This hasn't happened with Wikipedia. There's no opportunity for a Mapbox to be created off of Wikipedia, because it's strong and centrally controlled. So from this angle, you could argue that OSM is incredibly business friendly and very generous. So if we stop and think about all the freedoms we have with this data, I think it's kind of amazing. But... You know, there's tension. Anyone who's been around OSM for a while knows this. Visible companies come and go, build on top of OpenStreetMap, but really one of my main points here today is that the same freedom given to businesses by the OSMF, and the fact that there's no Wikipedia.org controlling things, creates a vacuum for companies to come along and productize OpenStreetMap, and this productization is, I think, a core tension here. So Cloudmade did this once. I did it at MapQuest, now again at Mapzen, Mapbox does it as well, and it's the biggest example of this, but the vacuum created by this provides opportunity for companies to come along and do this stuff really well. So this is an anonymous email I got. 
And it says basically -- with some services from Open MapQuest and investments we did there in Nominatim and geocoding and the vector tiles -- I don't think it's true that I've done the most. I think other people have. But the contention is that OSM should be running these services, which requires budget and staffing, and you would need to fundraise. Should the OSMF or some central group do this? I'm not going to say that. It's something for discussion. I think that we as a community have made choices over the years, and what we have here is something where businesses will come in and do this. So how many people remember Cloudmade? Can we see a show of hands? Actually... That's less than half. That's really interesting. So you might find their business familiar. Right? So one thing that they did was they allowed users to style their own maps. They hosted tiles for a fee. They invested heavily in Mapnik, which is map rendering technology. They launched Leaflet, and an iOS SDK to work with OpenStreetMap data. And most of these things have been done several times since then by different companies. At Open MapQuest, we did a lot of this stuff, Mapbox does this very well, and we're doing some of the same stuff at Mapzen. So the market will fill the void. This, again, going back to the Wikipedia example -- if Wikipedia had just been a database with no service offerings, I think the same things would have happened. So Cloudmade had some problems. Like everything in business, some of it was timing. Some of it was execution. I think it might have been a bit too early for the market. The data might not have been good enough yet for a lot of use cases. Again, this was like... I think the 2008, 2009 time frame. And Google, as a map API, was new and popular and free, and there were also some community struggles. Again, about the productizing of OpenStreetMap and how controversial that might be. 
And again, about blurring lines between what OpenStreetMap is, the community, and what the business is. That's also a tricky thing. And they were a VC funded startup, which is great, unless things go wrong. They pivoted a number of times and now they're in a completely different field. Although they still do exist. And this is a TechCrunch article I linked to in the syllabus. But this shows -- Cloudmade's OpenStreetMap. There's always the tension with -- businesses -- are they claiming ownership of the project? Are they blurring the lines too much? This is something that comes up a little bit. It came up quite a bit with TechCrunch. Did they give a sense that the business can drive the community? It's hard to do that, especially in OpenStreetMap. This is another anonymous quote from a community member who let me interview him for this, which is super useful. It sums up, again, one of the key struggles between a charismatic business and the OpenStreetMap community. One thing about doing something as a business is that it's your full-time job. You get to really focus on it, you can hire other people, you can do marketing, design, teaching, and this attracts a much bigger community with more novices and more people. But my contention is that with the Foundation and OpenStreetMap, this will happen inevitably. Businesses will always do this. It's too strong a vacuum and too strong an opportunity. I think if the companies doing it now were to vanish, others would appear immediately. It's kind of like drug dealers. One leaves the street, and another one comes back up. Wow. We're already 15 minutes in. So I'm going to talk about MapQuest. So I mentioned that I was there, and I can talk a lot about this, because I was there. I love MapQuest. You know they started in the 1960s? It was the advent of commercial computing with maps. It's a really interesting company. And in 1996, they put a map on the web. And I want to pause and think about this and acknowledge it. 
Because I think it's really amazing. It's been forgotten. MapQuest was really the first web application that most people used. Back in the mid-'90s, there was a lot of static content. You would read text. But with MapQuest, you would press a button and things would change. People could drive somewhere they couldn't have gotten to before. So it was a very revolutionary site, frankly. Also the first billion dollar mapping acquisition. Waze was the most recent. Right around that time, Excite@Home bought Blue Mountain Arts, which is a greeting card company, for about $780 million. So maybe there was a little bit of a bubble. But still... So what went wrong with MapQuest?
(laughter)
(applause)
>> Wow, wasn't expecting that. So this is Gerald Levin and Steve Case from Time Warner and AOL. My opinion -- I was there way after this, but MapQuest was acquired by AOL right around the time of the AOL Time Warner merger. This is regarded as the worst business deal in history. The internet folks came in strong and cocky, the internet bubble collapsed, AOL took a one-time writeoff of $99 billion, that's a B, and the stock value went from $226 billion to $20 billion. So Time Warner got the upper hand, treated AOL like a cash cow, including MapQuest, and when I showed up, there was a team that wanted to innovate, but they weren't able to do much. When you searched for an address, they had a search disambiguation page that was like -- did you mean -- even if there was only one result. Because they wanted to show an ad. So it wasn't great when I was there. So what kind of strategy did we implement? Looking at OpenStreetMap and having come from a startup that had done a lot of work with OpenStreetMap, we just decided to hopefully invigorate the developer community by supporting some of this stuff. And here's a Wall Street Journal article from -- it's actually right after State of the Map in Spain. And you can see we made that very clear line between what the community is and what the company was. And yeah. So this is basically the announcement for that. So there wasn't really much of a turnaround strategy, but this was kind of something that we thought might be interesting. We pledged $1 million to help support OpenStreetMap, and I wasn't sure what that meant at the time. If anyone saw Kate Chapman's talk -- having talked to Kate, organizations need to be ready to receive that. There wasn't really an organization ready to receive it, so we did a lot of internal development and outreach and things like that. Here's an example of the site we launched. You can see the Nominatim geocoder. As we relaunched MapQuest, it was the same kind of interface there. It was good. 
A lot of work on Mapnik, Nominatim, routing, imagery. So let's talk about disruption, because this gets into the second part of the talk. I'm going to go a little faster. I think this is an hour and a half talk, now that I think of it. Disruption is a very overused word with startups, but this guy -- Clayton Christensen -- wrote this book, The Innovator's Dilemma, called one of the most important business books of all time. It describes the problem of a successful business like MapQuest that's unable to compete with certain competitors because competing would cannibalize the core business. It's easy to dismiss the competitors because they look cheap and not super professional, but the quality gap closes before the incumbent can really respond, and they go out of business. I think this is something that happened there. MapQuest made a lot of money selling store locators, so when the Google Maps API came out, which was free, they had a sales team. Our customers need support. They need more professional services. Not so true, after a while. This is a great book if you're interested in these topics. But this comes back to us in the mapping industry and with OpenStreetMap. So there's this year, 2007, which I like to call peak proprietary map data. I'm not sure. We'll see what happens. But it's an amazing year. So in July, TomTom offers $2.8 billion for Tele Atlas; in October, Nokia acquires NAVTEQ for $8.1 billion. Think about that when you see what Nokia is going to sell for shortly. In October, Garmin bids $3.3 billion for Tele Atlas, and in December, TomTom ups its bid to $4.3 billion. So all this frenzy of activity happens around there. And then what happens? Another great article. It's called "Less Than Free." Which I link to later on. Which you should read. Google does a lot of really interesting moves. It drops Nokia as a map provider. The rumor was that Nokia didn't want to give them a very good deal for Android because it would have been competitive to Nokia at the time. 
Google starts driving their own cars. This was probably cheaper than the Nokia data deal. This is kind of hearsay. In October 2009, Google drops TomTom and now Google is running 100% on its own data. So 2009 -- this is a really key moment -- Google offers free turn-by-turn nav on Android, which causes Garmin and TomTom's stocks to fall. And it's an amazing thing about Google. The bigger they get, they can disrupt whole markets like this. Usually it's a smaller upstart. They're amazing. So this article is from October of last year. And you need to read the news about Nokia here to see what happened. This author looked up the financial data for Nokia and TomTom, and the entire industry is basically gutted. He concludes there's not much money to be made in mobile and internet maps, he looks at the Apple deal with TomTom and says there's no significant revenue coming out of this deal, despite the number of users they have, and the bottom line I find fascinating. Google has been sucking all the value out of the market. And if that is not enough, OpenStreetMap is finishing the job.
(applause)
>> Yeah, wasn't sure if that was a good thing or a bad thing. But to give more context to why this is important, mobile is a huge market. So you have Google and Apple. You know, Google invests way more money in mapping than I believe they can possibly make back. I'm not sure, but I would imagine that. And maps are key to their mobile strategy, right? They're willing to invest profits from other business lines to control this market. Look what happened to Apple with their maps. So the iPhone launches in 2007, with Google Maps; Google buys Android Incorporated a year later; a year after that, Apple gets uncomfortable and acquires Placebase, which is the start of their map project, I believe. I wasn't there, but it seems like that from the outside. And we know what happened after this, but this stuff is really key for mobile, which is a huge market. So what we're talking about here is commoditized data. There are two disruptive forces in mapping. There's Google and there's OpenStreetMap. We're seeing all the value getting sucked out of proprietary data vendors. So I personally would like to see the value of that go to zero and for us to do something a little more interesting, like product development. When I say commoditize, I mean remove the value that one provider might provide over another. These are commodities. So the market doesn't care who is selling wheat and where it's coming from. Wheat is just wheat and corn is just corn. Data can be the same thing. We can decide it's just a commodity and build interesting stuff on top of it. So I think we can have the best map data as part of the commons, and developers can work on new products that are hard to build now. There aren't enough products in the hands of users with OpenStreetMap data. It's getting better, but it's kind of inaccessible. So Mapbox is an example of a company doing great work. 
I think the most visible, doing this stuff, getting it into the hands of new users and new developers, working on data in the commons. Alex gave a great talk yesterday about helping to fix the map. Users of their different services can submit things that they might not want to fix themselves. So I think there are some tensions with the community around that, because it's direct investment in the data -- unless you think of businesses as part of the community, which I do. I do see businesses as part of the community. Oh, that's a weird print. That's supposed to be the Skobbler icon. So Skobbler, who are here, is another example of a company doing great things. They've had really great talks, helping to make sure the data is navigable. Working on this data in the commons. They're a professional company with a strong sales team, the ability to get this stuff distributed in big ways, and I think they're interacting really well with the community, with the potential to make a big difference here. And so... You know, finally... I think that working together -- and I love the fact that we're at the United Nations, because you can have a slide like this, and it seems a little corny, but hopefully it's true. I would like for all of us to work together. A community of individuals and businesses, to make this the best data set in the world, using our shared resources, which then moves competition away from the data. Think about it. There are all these companies driving around, collecting the fact that the UN is here, or the pizza place is there. It's such a waste of effort. Like, let's just... Let's just share this stuff.
(applause)
>> There's so much more interesting stuff to do. It's hard to build a company like Foursquare or Waze, because with Foursquare, you need really good location data. It's hard to get. With Waze, you need to build a big team of navigation experts. You need to build data. Let's... Let's share in that stuff and make it much easier. Let's have a dozen companies like that pop up. It would be so much more interesting. So yeah. Companies are still trying to productize OpenStreetMap, but there's a lot more work to get to the point where we're creating more interesting products on top of the data. Geocoding isn't great. Navigation needs improvement. There are a lot of things that we need to work on. So let's focus on this. Yeah. So... That's it. Any questions? Comments? Questions veiled as comments?
>> So it seems like a lot of the acquisitions right now is built around driverless cars? What do you think is going to drive consolidation? Do you think the factors in the market are going to drive this convergence to data as a commodity or fracture it further?
>> I would hope that some of this could happen in the commons. The question for me is -- where does the value actually go? If the data is commoditized, if the services are to some extent commoditized, where does the value go? And it goes to user interactions. It goes to products. It goes to interesting social things that happen over that. So in terms of a lot of the value that you're seeing now, it's the on-demand economy. You need navigation services. You need things like that. I would love to see that get involved in the commons and have apps on top of it, that can provide value. Like Waze, for example. What was the value of Waze? Is it the social activity on top? Or is it the data and the technology? I don't know. It's hard to talk about. But I would like to see that value kind of step up. Does that answer your question?
>> I love your idea of data as a commodity, but two things that are different between a real commodity and the OSM data is -- you actually pay for a commodity. The oil isn't free. And number two, once you have it, you can do whatever you want with it. You can burn it. Whatever. How do we get the OSM model from here to there, or do we get the OSM model from here to there?
>> In terms of paying for a commodity -- there is still a cost in this. It's just a cost that kind of gets driven to -- you're not competing for value-add on top of it. I'm not going to offer different OSM data from someone else and be able to charge much more. So in this case, it would be like data costs and things like that. And I'm sorry...
>> Once you buy it, you can do what you want with it.
>> So I personally think that in terms of the license there are some clarifications that should be made. We're moving in that direction. Geocoding is one of the hot button issues. But I think it is pretty permissive. You can do anything you want with this data, with two things. You need to give attribution, and you need to share any improvements that you make to it. And having been at MapQuest and done deals with the global data set providers, that's actually extremely permissive. If you do a contract with someone that you're buying data from, the contract is this big about what you can and can't do. It's tougher than this. Some things need to be clarified. I think we can definitely do that. Yes? It turns red when it goes on.
>> Hello? Really curious about any thoughts on how governments can be involved in this idea of a data commons and how they might cooperate with private entities to make that easier to do, or provide some leverage there. Just basic thoughts.
>> Sure. We've been doing a lot of work. There's a lot of interesting work happening with governments and government data. I think just opening the data and providing it in a way that's accessible and liberally licensed is big. Again, like I said before, there are businesses ready to fill vacuums and help out. There are plenty of businesses that would and are helping out in that case. You know, Mapbox, for example, did a lot of work with New York City -- the city opened its PLUTO data set, and they did a lot of work with that. As more businesses are around, they'll help. I think making it accessible and licensing it right are the two big things. Do I have... Just maybe one more question? Yes, sir?
>> Okay. Do you have perspective on what the hope was when Steve Coast went to Microsoft? Funnily, I was at Microsoft and Bing Maps then. I knew nothing about what was going on.
>> I have hearsay. So I was at MapQuest, and Steve Coast went to Microsoft at the same time. This was 2010. I was like -- this is it. This is businesses that are going to do this. Because both companies spent tons of money on proprietary data sets. With Steve there and me at AOL, at MapQuest, I was like -- we'll definitely do something interesting here. I think a lot of businesses look at OpenStreetMap and are maybe not all-in strategy-wise, but interested enough to do some investment there. And I think Steve was part of some interest from Microsoft. But obviously that didn't turn out exactly like that. But maybe next year. All right. Thank you.
(applause)
>> Oh, wait. So I have a pinned tweet with the syllabus. I promised this. I meant to end on this. So I'm @randyme on Twitter. I have a pinned tweet. If you want to read some of those articles, check it out. Thank you!
>> So for those who are more interested in this topic about OpenStreetMap and businesses, there's going to be a birds of a feather session now. You cannot hear me? It's going to be a birds of a feather session now. 3:30. About how businesses can support OpenStreetMap better. Now, 3:30.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
>> Hello? Oh, great. Hi, everyone! Welcome. I guess this is the midafternoon session. Anyway, we're here for the lightning talks. And here's our illustrious panel. We're going to kick it off and get started. Five minutes each. First one -- he's going to be talking about some berries. Here we go. Patricio.
>> Hello. My name is Patricio Gonzalez Vivo. A year ago, I was scraping Google's data and creating landscapes that don't exist anywhere else, except in the data. But this was problematic, because this data is private. So luckily, I was very happy to join the Mapzen team in August last year, and since then, I've gotten to know more about OpenStreetMap, and I get to play with the data, and do different types of experiments. This is one with labels. And also to redefine or push forward the boundaries of what maps can look like, taking some tricks from the computer graphics world. And game engines. But today, actually, I don't want to talk about beautiful maps. Even... I think that I would really enjoy that. It's more kind of the opposite. It's about slow devices. I think true openness -- it means lowering the bar. The technology bar. To make this more accessible for everybody. So I decided to make a personal project inside Mapzen with a super slow, inexpensive computer called the Raspberry Pi. That probably everybody knows about. Everybody knows about the Raspberry Pi? Yes. All of you are excited about the Raspberry Pi, no? I am. It runs a full Linux distribution. It's less than $35. And it's a project that came out of education. So I thought it would be a really good idea to put together some open data and Open Source, and make my own -- our own -- GPS. Like a DIY GPS device. So I hooked a couple of things together with a Raspberry Pi. A GPS, a battery with a charger, and a touch screen, which is only a single-touch screen. It's not multi-touch. It's cheap. So I pulled this all together with a case of plastic that I 3D printed in the office. This is a picture of that. So with it, I managed to make it portable. Whoa. Everybody is thinking that this is going to explode. But I'm not going to say the word with a B, because I'm very afraid. There's a guy there. Watching me. Okay. The second problem was the interface. Because it's only... It's only a single-touch device. I have to find some ways around it. 
>> So I made two sliders. One to rotate it, and one to zoom in. And a button. And you can see that it works pretty fast. Thanks to our brilliant team on Tangram ES. I think they are there. So this is working. This is something that you can download. And I'm really hoping that people get excited about this. And start building their own devices. I think everybody here enjoys making their own tools. So build something cool with it. Let me know. All the information to do it and the schematics to 3D print the case and everything else are on our GitHub, and you can also see a nice post about it on our blog. Thank you so much.
(applause)
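[Editor's note: for readers curious about a build like the one above -- a serial GPS module on a Raspberry Pi emits NMEA 0183 sentences that the software has to parse before it can place you on a map. The following is a minimal illustrative sketch, not Patricio's actual code (which is in Mapzen's GitHub repo he mentions); it parses latitude and longitude out of a `$GPGGA` sentence, and the example sentence is made up.]

```python
# Minimal sketch: extract latitude/longitude from a NMEA $GPGGA sentence,
# the kind of line a serial GPS module on a Raspberry Pi emits.
# Field layout follows the NMEA 0183 GGA format; illustrative only.

def parse_gga(sentence):
    """Return (lat, lon) in decimal degrees from a $GPGGA sentence."""
    fields = sentence.split(",")
    if not fields[0].endswith("GGA"):
        raise ValueError("not a GGA sentence")

    def to_degrees(value, hemisphere):
        # NMEA packs coordinates as ddmm.mmmm (lat) / dddmm.mmmm (lon):
        # the two digits before the decimal point, plus everything after,
        # are minutes; the leading digits are whole degrees.
        dot = value.index(".")
        degrees = float(value[: dot - 2])
        minutes = float(value[dot - 2 :])
        decimal = degrees + minutes / 60.0
        return -decimal if hemisphere in ("S", "W") else decimal

    lat = to_degrees(fields[2], fields[3])
    lon = to_degrees(fields[4], fields[5])
    return lat, lon

# Hypothetical sentence, roughly midtown Manhattan:
line = "$GPGGA,123519,4045.0000,N,07358.0000,W,1,08,0.9,10.0,M,,M,,"
print(parse_gga(line))  # -> (40.75, -73.96666666666667)
```

In a real build you would read lines like this from the serial device (e.g. via `pyserial` or `gpsd`) instead of a hardcoded string, and hand the coordinates to the renderer.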
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Extracting interesting data from OSM
Diana Shkolnikov and Indy Hurt
Monday, June 8, 2015
>> Hey, folks. Welcome to the final day of the State of the Map conference. Super fun day. Workshops all day in this room, there's the hack day for free form hacking and getting together, there's Missing Maps also, and there's a mapping party. So feel free to explore throughout the day what you want to do. You're here now, so that means you're here for a workshop. Indy and Diana from Mapzen are going to talk about getting interesting data from OSM. So let's get started!
>> Thank you so much for the introduction. I am super excited to be here today. I want to help you get started with OSM. It's kind of this enormous dataset, with all kinds of amazing things. And how do you actually extract information, how do you find information, how do you work with the dataset -- those are what we want to cover today. So I have this really exciting overview slide that has a lot of what we're going to cover today. And I'm going to make these slides available to you as well, so you can click on these links and get back to these sites. So I put the URLs up there in tiny print so you'll be able to get back to them. But we're going to cover OSM map features. This is in the wiki system of OpenStreetMap, and this is the starting place. This is an amazing resource of information. And once you drill down through the wiki pages, we will also show you taginfo. Beyond the official information and wiki pages that you would find on the OSM map features page, you might find additional information on taginfo. It could even be information about tags that we don't actually have wiki pages for -- there is so much information. So we'll cover taginfo. We're also going to touch a little bit on Overpass Turbo. So here's that question: I found something interesting. How do I actually get this out of OpenStreetMap and create a map with it? And Overpass Turbo is just one way to do that. It's also a way to create queries so you can look for things, and there's a workshop on Overpass Turbo today as well. I'm actually going to attend. I learn things about these tools all the time. If you need more than just some little subset within OpenStreetMap, then you might want to check out our metro extracts. This is exciting. You can download all of the OSM data for a particular location. So you might pick New York City if you want everything that's available, and if you don't want to download the entire planet file and think about -- how do I deal with 65 gigabytes of data? 
Uncompressed, it suddenly becomes enormous. So you might just want this small subset. We've cut it up for you. And finally, we have a tool named Fences. And Diana is going to talk about this. She's actually going to demo it. And I'm excited about this particular tool, because what it provides you with are all the administrative boundaries for a particular location. And this is something that you can use to make (inaudible) maps. Everybody is excited about administrative areas. We all know about cities and states and areas, but what about what you might find in another country? You might find hamlets, you might find provinces. You might find all kinds of interesting things. And we're going to show you what those interesting things look like in Japan. And finally, once you get to that point, you've drilled down to the data, this is where you find all the information, and down in small print right here, I put the public mailing lists. So the public mailing lists are kind of a mixed bag. You will find all kinds of discussions happening there. This is where you find the most passionate people in the OSM community. And I hang out in there. I haven't quite yet gotten up the nerve to actually jump in and post anything in there. But I actually learn a lot from this community. Because I tend to like these really random datasets, and this is where I find people talking about them. So I'm really excited to have all this information for you, and all these links. So I'm going to make sure that you get these, so that you can find this information. So... The first thing is the map features website. And who wants to look at a slide that features a website with all this text? So I put these little icons on here to entertain you while I tell you a little bit about this website, because you're not going to be able to read all these details, but you have these links. 
So the map features website is really exciting, because it has links to all of the primary features, and if you want to know how to actually search for features, it's really helpful to understand how they are represented within the dataset. So you'll find things like aeroway, and so if you're thinking like -- how do I find all of the airports? Then it's helpful to know this particular word. Aeroway. That's how all the features that are related to airports are sort of connected together. And then if you're looking for restaurants and bars and pubs, you will find that in the amenities section, and it's under sustenance, which is perfect. Because it's like... I could use some sustenance right now. I could use some orange juice. There are schools. There are all kinds of information about transit. You might have information related to bicycling. You might have information related to streets. There are even banks. Financial institutions. And I find that really interesting. I want to know where all the ATMs are. So we're going to look at where the ATMs are. Because they're the magical money dispensaries. Thank you. So... If I click on the financial section, it takes you on the page to a section that has the financial information. And we see amenity=atm, amenity=bank, amenity=bureau_de_change. I'm not pronouncing it correctly. I think it's pronounced with a French flair. And I'm really interested in the ATMs. So I want to understand how they're represented in OSM. So if I click on that ATM in that table, it takes me to a wiki page that's all about ATMs. And you have this tagging information here that tells you what kind of tags are often associated with ATMs. So I might have name. I might have operator. I might have fee. Like, I might want to go to an ATM that doesn't have a fee. I don't think that one exists, but I can try to look inside the OSM data and see if anyone has tagged the ATMs that are no-fee ATMs. Cash in. 
If you're one of these really crazy brave people and you actually like to make a deposit into the ATM in cash, you will find where you can actually do that. I might do a check, but cash is... That's crazy talk. Cash comes out of the magic money dispenser, not in. So there's also drivethrough. All these amazing things. The only thing missing here that I'm actually really disappointed in is there isn't pneumatic tube as a feature. I'm so happy some of you have laughed. That's dated me. I'm not the only person who remembers the pneumatic tube that sucks the money in. So awesome. So enhance -- if we zoom into this sidebar, this is where you'll also find a lot of interesting information. So this here is the elements. So it tells you -- it gives you an idea of how this information should be stored. It says -- it should be stored as a node, which is right here, and then there's also ways, either as a line feature, or you have an area or a polygon. And then this last one is a relation. And it's saying that these features are typically not stored as relations, areas, or polygons, or linear features. It's an ATM. You would expect it to be a point. So a relation is kind of interesting. Because a relation might be a bunch of features that are connected. And I was thinking... Well, for a bank, I might want to have all of the Chase ATMs be a relation, but it's not a suggested thing. And suggestion is really the key thing here. Because you can tag things any way that you like. But if you want to have a common scheme that makes it easy for people to find your features, you might want to look at these suggestions here. There's also some useful combinations. Operator, the drive_through, and then status: approved. Which means that the community has voted and proposed and spent quite a bit of time working on how to represent these features. So you can have confidence that you will find a lot of features that follow this schema. 
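[Editor's sketch] To make the elements-and-tags discussion concrete, here is a minimal sketch of how an ATM might look pulled out of the raw data: a node (a point with an id and coordinates) plus key=value tags following the wiki's suggested schema. All ids, coordinates, and tag values below are invented for illustration.

```python
# A hypothetical OSM node for an ATM, following the wiki's suggestions.
atm = {
    "type": "node",   # the wiki suggests ATMs be nodes, not ways or relations
    "id": 123456789,  # invented id
    "lat": 40.7421,
    "lon": -73.9874,
    "tags": {
        "amenity": "atm",            # the primary tag from the map features page
        "operator": "Example Bank",  # hypothetical operator
        "fee": "no",                 # the mythical no-fee ATM
        "drive_through": "no",
        "cash_in": "yes",
    },
}

def is_atm(element):
    """True if an OSM element carries the amenity=atm tag."""
    return element.get("tags", {}).get("amenity") == "atm"
```

Because tagging is only a suggestion, real data will also contain the odd ATM stored as a way or relation, which is exactly what the statistics below show.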
And then at the bottom here, it tells you some statistics. So we have here 83 -- a little over 83,000 ATMs in the entire OSM dataset. That's quite a few magic money dispensaries. And you also see that 279 are represented as ways, and 7 as relations. This is a very small percentage. So that's going back to that whole thing that I mentioned about suggestion. There's a suggestion of how these features should be represented, but you have the power to represent them any way that you might like. And you also see at the bottom... Tools. So let's enhance a little bit more. This is this exciting part of that sidebar. Where you can actually click on these features and figure out a way to extract them. So what we're going to do first is we're going to look at taginfo. Because I want to understand more information about these ATMs. So if I click on taginfo at the top, it's like click heaven. You can easily go down the rabbit hole, because everything is clickable, but I encourage you to click away so you understand what information is available. So quick segue here. We're actually leaving the wiki page with all those map features. It's a good use of your time. You know, "Rome"-ing around on the wiki pages. But I actually wanted to click on that taginfo and see what I could find out about these ATMs. So when I click on that link, you get this page here. And first you see the overview page. Same thing we saw before. 83,000... You know what's interesting? I made these slides over a few days. And the first slide was made first, and it showed 83,500, and then I made this a few days later, and it says 83,800. More magic money dispensaries! Okay. So nodes, ways, all those relations -- they're there. And if you go to combinations, I find this super valuable. I'm a data scientist with Mapzen, and I need to extract all of the features that are represented as ATMs or have some relationship as ATMs. 
And so if I look at these combinations, I can get some information about, like, how many have additional bits of information, like how rich are these ATM features. I can find that there. And then the map tab -- what do you think is on the map tab? A map! Oh my goodness. The map. So here are all the ATMs. So if you're taking a trip down the Amazon, you might not want to go without cash. Same with the Sahara. There aren't a lot of ATMs there. Or this could be an opportunity for Missing Maps. Maybe we should all take a trip right now to the Amazon and try to find some ATMs. Because they appear to be missing. Well, we just want to confirm that they're missing. If anybody wants to tell me that they really aren't there, I won't believe you until I know for sure. Do you have a question?
>> Sorry, stupid question, probably. But -- if it's in a different language, how do you know -- how do you cover that as well, if you look for worldwide tags? The same thing that might be called something else in a different language?
>> This is a very good question. So did you all hear that? How do you find information if it's tagged in a different language? So the exciting thing is that a lot of things are tagged in multiple languages. So there are specific tags that will have the two-letter language code. So what you want to do is you want to look for the ISO standard two-letter language codes, and then you can use that to search the data, and you can find tags that are in different languages. Also, this next tab here -- great segue -- is the wiki page. So all of these wikis here are in different languages. And you might find the different features here that are represented in different ways by going to these wiki pages. In most cases, it's just a straight translation of the English wiki to the Dutch or the Polish or the French wiki. But some things kind of get lost in translation. And so if you're working in a particular location, and you know what the official languages for that country are, I highly recommend that you take a moment to view the wiki page in that official language. Because there just might be some additional information there that would be super valuable. Think of, like, Wikipedia. If you've ever spent any time looking at Wikipedia pages in English, and it's for a place in another country, and you click the link that takes you to that country's page and it's completely different, you won't necessarily find that with the wikis for the OSM, but you will find slight differences. So this is -- there are a couple ways. One is to use the tag search with the language. The other is to find the wiki for these different languages, and just check to see if there's anything different, or find how things are... Like, the words ATM in different languages. Thanks for that question. And finally, there's this project page. And these are so interesting. 
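[Editor's sketch] The language-tag trick described in the answer above can be sketched in a few lines: localized names live in name:&lt;language code&gt; tags (ISO 639-1 codes like fr or ja). The tag values below are invented examples.

```python
# Pull every localized name out of a feature's tags.
tags = {
    "amenity": "atm",
    "name": "Geldautomat",                         # invented example values
    "name:en": "ATM",
    "name:fr": "Distributeur automatique de billets",
}

def localized_names(tags):
    """Map each two-letter language code to its localized name."""
    return {key.split(":", 1)[1]: value
            for key, value in tags.items()
            if key.startswith("name:")}
```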
There's all these different projects, and every feature that you look for might have a different set of projects associated with it. So Nominatim helps with search, JOSM is another type of editor, Vespucci is another editor, Wikidata, you can click on those, and find out about additional tools. There's just so much information here. So... Going back to the combinations, you can find all of these interesting bits of information here. I forgot what I was going to say about this slide. But... What did I have here?
>> The second page has drivethrough.
>> Oh, it does have drivethrough. Yeah, it has drivethrough. Which is cool. It still doesn't have pneumatic tube. But it has brand. And if I click on brand, I can see the distribution of all the features that have a brand tag. And this is also clickable. So I can click these things here to drill down even further. So the point I'm trying to make with taginfo is that there's a lot of information, and you can drill down pretty easily. Similar brands here. And this is where I find interesting features related to, like, slight typos. Brand is a really easy word, but sometimes if you have something that might be spelled like color with a U or without a U, then you can go to this page and figure out, like, okay, I might need to combine a few things to get all the features that I'm interested in. So taginfo -- this is the homepage. It's a click Wonderland. There's all kinds of information. You can come here. I definitely recommend that you check it out. There's also a set of reports. There's information about the relations. You can find the most popular tags. The most popular values associated with those tags. It's all here. And it's all for you.
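[Editor's sketch] The key-counting and similar-spellings point above boils down to this: taginfo is essentially counting how often each key appears, and variant spellings count separately. The features below are invented.

```python
from collections import Counter

# Tag dictionaries from a handful of invented features.
features = [
    {"amenity": "atm", "brand": "Example Bank"},
    {"amenity": "atm", "brand": "Example Bank", "fee": "yes"},
    {"amenity": "atm", "colour": "blue"},
    {"amenity": "atm", "color": "blue"},  # same idea, variant spelling
]

# Count how often each key appears across features -- taginfo in miniature.
key_counts = Counter(key for feature in features for key in feature)
# colour and color are counted separately, which is why checking the
# similar-keys view matters before assuming one key captures everything.
```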
>> Here's a chair. If anyone wants to claim it, I'll just move it here, and then you guys can just walk in, so you don't have to look from the corner. If you just want to spread out on the floor. Sorry we didn't prepare.
>> So there's a project happening now in the city that is mapping the trees. And so I thought -- okay. Let me just do a quick search of the trees in OpenStreetMap. And it turns out that there are 36,000 trees represented in OpenStreetMap. And hardly any of them are here in New York City. So if you want to contribute to this project, you would definitely be helping quite a bit to get more trees into the dataset. But you can see from the map, if you go to taginfo, it shows that the highest concentration of the types of trees are olive, and it's southern Europe, as you might expect, but there's also apples and oranges, and they're a little bit wider distribution. There's also a huge clump of oil palms, and they're all in Malaysia. So if you want to fry it up, that might be interesting. So you can see interesting features within the data. I like random datasets, so I looked into abandoned. And I just looked for abandoned by itself. As soon as I started typing this into taginfo, it gave me a list of things that might be similar. So I picked just abandoned by itself. So you can see here there are 21,000 features that are tagged abandoned in OSM data. 998, almost a thousand different keys used with this tag. And there's a lot of similar keys. And values used with this key -- almost 150. So as I look through here, there's 84 pages. I looked at every single page, because I just do this, and I see on here abandoned:tourism. I have to know what is an abandoned tourism. This is right up my alley. So I go here, I click on that, and then I get this page. I'm looking -- what's here? I see hotel. Oh my gosh. The Bates Motel could be in here. Guest house -- these scary haunted houses are what's coming in my mind. But the thing that stuck out to me is theme park. So I clicked on theme park, and there are 13 of them. And this is when I was like -- okay. This is what I want. I want this data. I have to have this. 
And this is where you find that there's a bunch of tools here at the top. And one of them is Overpass Turbo. So I click on query, and it's going to query these 13 features. Abandoned tourism amusement parks. I'm so excited. It automatically brings up the Overpass Turbo. And look at this amazingly complex query that I did not write! I clicked. And it just perfectly formatted it for me. So you can do this. There's nothing at all to be intimidated about. It will just generate this query for you. And it displays the features on the map. It looks like there's a collection of nodes, it looks like there's maybe some ways, some polygons. I can click on them and it gives me information. Looks like there's Wonderland in China, 60 miles outside of Beijing, 140 acres of craziness. If you've ever been to -- I'm from California. Disneyland is about 40 acres, I think. So can you imagine 140 acre fake Disneyland that's abandoned outside of Beijing? I can. I'd like to go. There's also Jazzland in New Orleans. What could possibly go wrong? Maybe Katrina just decided to swallow this entire amusement park. You have to Google Image this. It's amazing. Other places. Holy Land. How did Holy Land get abandoned? I can't understand that, other than the fact that it's not in the Bible belt. I could totally make this place rock. Can you imagine Noah's Ark water rides and burning bush barbecue?
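[Editor's sketch] The "amazingly complex query" that Overpass Turbo generates has a regular shape. Here is roughly what it looks like for the abandoned:tourism=theme_park features: grab every node, way, and relation carrying the tag. We only build the string here; the exact query Overpass Turbo emits may differ slightly, and you would send it to an Overpass API endpoint to run it.

```python
# Build an Overpass QL query for a given tag, the way the taginfo
# "query" button does it.
tag_key, tag_value = "abandoned:tourism", "theme_park"

query = f"""
[out:json][timeout:25];
(
  node["{tag_key}"="{tag_value}"];
  way["{tag_key}"="{tag_value}"];
  relation["{tag_key}"="{tag_value}"];
);
out body;
>;
out skel qt;
"""
# POSTing this to an Overpass API endpoint returns the matching
# features as JSON (not done here).
```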
>> So the Overpass Turbo service is querying the layers of OpenStreetMap --
>> Yeah. So this is the actual data. I've figured out -- I used taginfo to figure out what features I'm interested in. I drilled down, I figured out that there's 8 of this particular feature. I think it was... Oh, 18. And then I clicked on the Overpass Turbo, and it's actually querying that data out for me.
>> And from there, can you extract it into other formats? Into JOSM?
>> From there, can you extract this data? Yes!!! You totally can extract this data. Export. So at the top of the screen, you see all of these buttons at the top. Export is the one that I'm interested in. I promise I did not plant this person in the back of the room. You guys are like the perfect audience. So you click on this. You get all these options. I'm a huge fan of GeoJSON, because it just makes life easy, and I have all these reasons why. I'll show you in just a second. So you can click on this, and it's going to give you the GeoJSON, you can copy it into a text file, save it with a .geojson extension instead of .txt, and from there, you can drag and drop it into CartoDB. Literally drag it from your desktop onto your browser, and it will generate a map for you. So here is Holy Land. You can also use Mapbox. Equally easy. Drag and drop. And you can pick all these different themes. I figured this theme might be appropriate for abandoned amusement parks. This is another option. It's called uMap. It's also fairly easy. It's not drag and drop, but you'll see this little button up here. It looks like an arrow pointing up. You just click on that and it asks you for the path of where you want to find your GeoJSON, and then you can view that in your map. If you're a little bit more advanced, and you're going all the way to QGIS, you can drag the GeoJSON from your desktop and drop it on the QGIS frame, and it will render it. I'm pretty sure ArcGIS will be able to do this. It's pretty crazy that we can use all these cool Open Source options to create these maps. I have one more map here to show you. Because all I do is look at data all day. I mean, that's actually my job. And I made this map. Does anybody want to guess what I queried?
>> NYU?
>> Well, NYU is probably... Actually, I'm not actually sure where NYU is on this. I'm from California. But all the little red spots. What do you think these red spots represent? Not just the pin, but all of the little dots?
>> It can't be Starbucks. Because there's not enough of them.
>> Public bathrooms.
>> Elevated railway?
>> Trees?
>> So I queried stairs. And I don't know how well this represents reality. But this tells me that if I want to do a stair hike, I need to go to the East side. Or the West side, sorry. Cardinal directions. And I might want to... And for those of you that are from New York City, if you feel that this map isn't representative of where I might find urban stairs, then we should get together and add more to the map. Sound good? Okay. So from here, you've gone to the wiki. You found out information about these different features. You've gone to taginfo. You've dived deep, and you've gone to the Overpass Turbo, you've pulled out some information, but what if you need more? What if you need an entire city? You want everything. You don't just want the stairs. You want all the roads. You want all the polygons. You want everything that's available. So then I highly recommend that you go to our website, and you can get a metro extract. So here if you go to the link that's data, you'll see a metro extract web page that has metro extract, it has borders. Diana is going to talk about this. She's going to demo it to you. And if you were to scroll down on this page, you would see that we even have transit data available. You can download transit for different cities, which is really exciting. So here is an example of a metro extract. I decided I would be curious about Trinidad. Trinidad and Tobago is an island republic off the coast of Venezuela. It's known for having one of the greatest parties in the world. They do a Carnival that rivals Brazil, in my opinion. My mom is from there. So you can grab the GeoJSON and drag and drop it into these websites. But these are really large datasets. This one is kind of small. But if we look at that list of all the features that are available, you'll see that there are some differences. And you can see there's kind of a range from the raw to the processed. And so if you really want the raw, you might go with the OSM PBF file. 
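[Editor's sketch] The drag-and-drop GeoJSON mentioned above is just a text file with a particular shape. Here is a minimal FeatureCollection; the coordinates and name are invented placeholders.

```python
import json

# A minimal GeoJSON FeatureCollection -- the kind of text file you save
# with a .geojson extension and drag onto CartoDB, Mapbox, or QGIS.
collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-61.5, 10.65]},
            "properties": {"name": "example point near Trinidad"},
        }
    ],
}

geojson_text = json.dumps(collection, indent=2)  # the literal file contents
```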
And then if you're a little bit more advanced, and you want to load this into a postgres database, and you want to do all kinds of queries and pull out interesting things and you want as much control as possible, then you might want to start there. But if you don't want to do any processing, then you might want to take one of the shapefile formats. And you're probably still wondering -- well, what is the difference between all of these formats? Well, this slide is really a graphic representation of an entire blog post that we have on our website, that describes what the data is, what's available, what's the difference between all of them. I just pulled out the things that I felt like -- oh, these are things that I really want to drive home, that you'll have to look out for. One of them has to do with the projections and coordinate systems. Projections make geographers sad. You know about that? Because they distort shape, area, distance, direction. And if you try to bring in features that are in different projections or coordinate systems, they might not overlay properly. So it's helpful to know that one of the datasets, the imposm shapefile, has a different way of representing the data than the others. So if you are mixing and matching, it's helpful to know that ahead of time. It's also helpful to know that if you grab the osm2pgsql shapefile, you're maybe going to get four different shapefiles. One representing the ways, the lines, one representing polygons, one representing nodes, which are points, and you might have a whole separate one for roads. But if you go with the imposm, you might get, like, 18 files. So just test it out. Take a smaller metro extract and download it, and see what you get. And now I'm going to turn it over to Diana, so she can tell you how to use fences.
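[Editor's sketch] On the projection point above: assuming the differently-projected dataset is in spherical (Web) Mercator, EPSG:3857, while the others are plain longitude/latitude, EPSG:4326, the conversion between them is short enough to write by hand.

```python
import math

# Web Mercator (EPSG:3857) stores coordinates in meters; lon/lat
# (EPSG:4326) stores degrees. Mixing them without converting is the
# overlay problem described above.
R = 6378137.0  # radius used by spherical Mercator

def mercator_to_lonlat(x, y):
    """Convert EPSG:3857 meters to EPSG:4326 degrees."""
    lon = math.degrees(x / R)
    lat = math.degrees(math.atan(math.sinh(y / R)))
    return lon, lat
```

In practice a library such as pyproj or the reprojection built into QGIS would do this for you; the point is only that the two systems are not interchangeable.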
>> So I'm going to start off with just talking about -- once you guys downloaded one of these extracts from the city that you're interested in, from our metro extracts, then you can do some additional filtering, because sometimes you want to write scripts to do these things. You want to automate your extraction process for whatever app you're doing. You don't always have the luxury of going to the taginfo website and Overpass Turbo and all those things. And so there are other really awesome tools you can check out. I have links to them at the bottom, so when you guys get your hands on the slides, you can check them out. What I've used for the fences project, for example, is osmconvert and osmfilter. Those are really convenient tools. osmconvert will basically let you take any format, like PBF, and convert to OSM, and go between those different formats that Indy showed you. And so we were going to walk through this, and I think we can still do that once we finish the slides. We can just actually run through some of these examples from the command line. But you can take osmconvert, feed in -- I fed in the Disney World extract. Because I was like -- if I want to plan my trip, I want to know where all the parking is. So I downloaded the Walt Disney World extract. Converted it to .o5m, because it's more convenient -- it basically allows it to process it faster. osmfilter is really fast, by the way. It will go through the planet in 8 minutes, depending on what you want to extract. And then run osmfilter, and you have these awesome tags. I encourage you to go and check out their wiki, which again is linked here, and it will tell you all the different things you can do with it. You can keep things, you can throw things away, you can specifically say ignore certain things. You can say leave only the relations. 
And the great thing about this tool is that it will know all of the members of those relations, and it will keep them for you, even though they don't have the tags that you're interested in, and that's a difficult thing to get from other tools. So this one is great for that. So I said keep amenity=parking. And again, I went through taginfo and I looked at -- what are these tags that I want to find, that are going to help me locate all the parking spaces? And then I wanted it to output into parking.osm. So that's what that -o is for. From there, I have this OSM file, which is XML, and I could display that in places where it's supported, but a lot of times you want to convert it to GeoJSON, because it's a well supported format and it's easy to read through. So if you wanted to convert it, there's a really good tool. ogr2ogr. And the link for it is there. These are all Open Source tools that you can install easily. You can build them on your own machine or you can download already built installs for various OSes. So if you wanted to convert it, you would basically do ogr2ogr -f and give the format. You would have to go to the wiki. I don't want to sell it short by just giving a few here. You have to check it out to see what they can do. And this will take the XML and convert it to GeoJSON. The Disney GeoJSON is what I'm going to get, and I told it that I want multi-polygon. There's a caveat to this, when you're converting to GeoJSON. You can only convert one type of layer. So nodes would be points, your ways would be lines or multilinestrings, and then multi-polygons are the relations. And because I wanted to see parking lots, I picked out multi-polygons. You can also visit GeoJSON.io, which is another tool that you can drag and drop your GeoJSON into, and it would look like this. So as I'm planning my trip, I'm going to be like -- oh, we're going to go to this park and that park. Here are all the parking lots that I can go to. 
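[Editor's sketch] The keep-by-tag idea behind the osmfilter step can be illustrated in pure Python. This is only an analogue of the simplest case -- it does not replicate osmfilter's handling of relation members described above -- and the XML snippet standing in for the Disney World extract is invented.

```python
import xml.etree.ElementTree as ET

# An invented OSM XML snippet standing in for the Disney World extract.
osm_xml = """<osm version="0.6">
  <node id="1" lat="28.41" lon="-81.58">
    <tag k="amenity" v="parking"/>
  </node>
  <node id="2" lat="28.42" lon="-81.57">
    <tag k="amenity" v="restaurant"/>
  </node>
</osm>"""

def keep(xml_text, key, value):
    """Return the elements whose tags match key=value."""
    root = ET.fromstring(xml_text)
    kept = []
    for element in root:
        tags = {t.get("k"): t.get("v") for t in element.findall("tag")}
        if tags.get(key) == value:
            kept.append(element)
    return kept

parking = keep(osm_xml, "amenity", "parking")
```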
Or if you're doing urban planning and seeing if there's enough parking in the city, these would be all the public parking lots. You can see some of them -- are points, actually -- so this is not the GeoJSON that I extracted. This is the OSM XML. And this also supports importing that. If you go to that open -- it will give you a list of all the options, and you can import XML files, or PBF or GeoJSON. So this is a great tool for visualization as well. Yeah, so this will show you all of the parking lots. And then from there, we also have another really helpful extract. And so we've done the borders extract. And there's a lot of talk about admin boundaries. People want to get the polygons for various -- country, city, state, whatever you might have. And so what we've done is we've made these publicly available on our website. And you can go to mapzen.com/data/borders. This just came out two days ago. It's hot off the press. And basically we've extracted every country that we've found within OSM. We've extracted it and made it available for download separately. Or you can get the planet file, which is larger, obviously. And if you're dealing with all the countries, then that's what you want. All of the different extracts at country level will have their specific country border, and then any borders that have been found within. So to basically... There are different admin levels. The tags that we extracted for borders were boundary=administrative and admin_level not null. And admin_level determines the level -- basically what type of border it is. So you can see for different countries -- this is the wiki page, by the way. The OpenStreetMap wiki. If you look at admin_level, it tells you -- these are the admin level numbers. And for each of the different countries, they mean something entirely different, except for level 2. Which is just national border across the board. 
But if you look across the world, it means different things, and some of them are not applicable to certain countries. So this is another good resource if you're looking at these admin level files and you're like -- what's in this one? I don't understand. So from here, we look at Japan, and you can see what each of them mean. And this is one of the few countries where every level is filled out. And if you took the files, you basically could download that zip file. When you extract it, this is what you're going to see. You get a separate GeoJSON file for every level. And so you can look at just counties or just cities or just the country border because that's what you're interested in, and you'll get the same list of files, no matter what country you download. And also you can take note that where it just says 42 bytes -- some of them are just empty feature collections in GeoJSON, because there's nothing for that level in that country. But the file is still there, just so that you guys can see -- hey, there's nothing there. If you think something should be, you can go add it to OSM, and then next month, when we do the extract again, it'll show up in the data. And so if you look at Japan, just to visualize what you actually see, I just did this in TileMill, but basically it's just another visualization tool. This is the country level. This is the level 3. This is level 4. I think this one is 6. 3? Sorry. I didn't put all of them on there, because there's just too much detail. But these are the different levels in Japan. So this is what the data looks like when you actually extract it. And throw it into a visualization tool. And then sometimes you just want the entire planet, right? And so that's when you go and you grab it from the planet OSM. And you get planet-latest. And it is huge, if you look. It's 29 to 42 gigs. And growing. That's good. That's what we want.
>> You know you want it.
>> Yeah. So that's all we have in terms of slides. And then... Do we have time to... I guess we should take some questions, and we can walk through some of the tools, if we want to do that. Questions? Yes?
>> Why wouldn't you expose your tools as a service, by chance? Because that makes it a little bit more interoperable. Kind of like the previous service where you can actually query it.
>> So the fences tool -- the tool that's used under the hood to extract that data -- is available. It's Open Source. It's not a service, but it's something that you can easily fork and edit and do whatever you want with it. Or you can run it locally. But as a service, we're just making them available for download. You know what I mean?
>> So the website is kind of like a service.
>> Yeah. All of our code is a product and a service at the same time. So everything that we do, we have instances running, we support downloads of these things publicly, but then if you wanted to do something similar on your own and tweak it a little bit, or just have a private instance on top of your own data source that looks like OSM... You can do that too.
>> So if I download fences, will it show examples of how you got to the border projects?
>> Yes, yeah. You can do -- basically it's npm install fences, and it's like fences create, and you give it the planet OSM, and you will be able to get the administrative boundaries. I'm working to make it slightly more generic so you can specify tags. Right now it will only support administrative boundary tags, but in the future it's going to take in any number of tags that you would like to extract and give you polygons and be able to slice them up into countries or any region that you want. So one of the reasons you might want to run fences now is if you didn't find the region that you wanted. Let's say you wanted several countries in one GeoJSON. You can make that region outline in the GeoJSON spec, and then you can provide that to the fences slice command. And it'll create an extract of the planet GeoJSON with only the subset that you're looking for. So that would be one use case.
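[Editor's sketch] The slicing idea just described -- keep only the features whose location falls inside a region outline -- comes down to a point-in-polygon test. The square region and the points below are invented; real border polygons are far more detailed, and the real fences slice command does considerably more.

```python
# A simple square region outline, as a list of (x, y) vertices.
region = [(0, 0), (10, 0), (10, 10), (0, 10)]

def inside(point, polygon):
    """Ray-casting point-in-polygon test."""
    x, y = point
    hit = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count edge crossings of a ray going right from the point.
        if (y1 > y) != (y2 > y) and x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
            hit = not hit
    return hit

# Keep only the points that fall inside the region.
kept = [p for p in [(5, 5), (15, 5)] if inside(p, region)]
```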
>> So are those admin boundaries localized per country? What I mean is -- depending on which country you're sending your product to, the boundaries could be different. There are some disputed areas where one country says -- we follow this boundary. Where the other one -- and international guys will have another boundary.
>> So right now, it's purely the data that's in OSM. So we're just extracting anything tagged boundary=administrative that has an admin_level. So we're putting it out to the community to say -- what are the issues with these files? And hopefully drive that back to the OSM editing tools, so you can make it better or change it if there is something that doesn't fit your needs.
>> Please be careful. Don't start an international war.
>> Your product can get banned pretty fast if you show something which that country doesn't really like.
>> There is a tagging scheme in OSM for marking disputed borders. It is possible to mark that in the data. I think people do. So you just have to go on the wiki for admin boundaries, and you'll be able to see how people mark disputed borders as well.
>> Yeah.
>> And then you have lots of options. You can represent them as dotted lines, you can mask them.
>> So with some modifying, we can use this data.
>> And also within the data, you might find all these tags. Because we basically take whatever the admin boundary says in OSM, and as-is, we take the tags, and it becomes part of that extract in GeoJSON format. Those might already have the tags you're looking for, so you can do additional filtering on your own in the GeoJSON.
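The additional filtering described here takes only a few lines with the standard library, since the OSM tags travel with each feature as GeoJSON properties. The sample features below are made up for illustration:

```python
import json

# A made-up GeoJSON extract: each feature carries its OSM tags
# as GeoJSON properties, as described above.
extract = json.loads("""{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"boundary": "administrative", "admin_level": "4", "name": "State A"},
     "geometry": {"type": "Polygon", "coordinates": [[[0,0],[1,0],[1,1],[0,0]]]}},
    {"type": "Feature",
     "properties": {"boundary": "administrative", "admin_level": "8", "name": "City B"},
     "geometry": {"type": "Polygon", "coordinates": [[[2,2],[3,2],[3,3],[2,2]]]}}
  ]
}""")

# Keep only admin_level=4 features (state-level boundaries).
states = [f for f in extract["features"]
          if f["properties"].get("admin_level") == "4"]
```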
>> Can you provide us with a link to that table you showed that defines the different administrative boundary levels? I've been trying to search for it.
>> If you go to the wiki and you just search for admin_level, it should take you to that. And I'll add it to that slide, so when we make the slides available, you guys can definitely check that out. Did you find it? If you don't find it...
>> I did.
>> Okay, cool. Yeah, that's a really good resource. And again, the wiki is such a plethora of information. So you can search for that, and make your own map porn. Yay!
>> You want to do a demo?
>> Are there any more questions? We can go to taginfo and we can collectively search for something awesome. And you guys can just call things out. It's hard to do this when you're looking sideways. Sorry. It's just hard to do this. So... Yeah. This is the metro extracts page. And you can see all the different countries. GeoJSON.io. This is that page. Yeah.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Workshop: A hands-on tour through open transit data: OpenStreetMap, GTFS, Transitland, and Onestop IDs
Drew Dara-Abrams, Meghan Hade, Ian Rees
Monday, June 8, 2015
Okay, folks. Two more workshops this afternoon. First we're going to talk about transit. Drew and Meghan are going to take us through transit data. And then finally at 4:00, we're going to talk about geocoding as well.
>> Thanks. So my name is Drew Dara-Abrams. I'm up here with Meghan. We work for Mapzen on the transit urban design team, and we're going to show our work. We've been collaborating with others in Mapzen and some other community leaders with this project called Transitland. So who here has worked with transit data in the past? Okay, some hands. The acronym GTFS -- does that mean something to some of you in the audience? Who here has ridden transit? Okay. End on a high note. So the setting for our work is that collecting together transit data -- so timetables, route information -- is a pretty well established practice now. We can pull out our phones, we can use Google services. As of today, we can use Apple services. We can use a wide range of different providers to route from A to B, using transit services. But the Open Source and open data landscape has been a bit more fractured. There have been lots of opportunities to work with these data. There have been lots of ways to create mobile apps and visualizations, but it's taken both a lot of experience with a big toolchain and some nose-holding ability to work through a lot of the quirks that are in these data. So what we're trying to do with this Transitland project is bring a lot of partners to the table, a lot of folks who work with transit data, and pool our patterns, and come up with some shared infrastructure, so that all our projects can move towards more advanced questions. And more advanced problems that we ask about transit data. Doot doot doot. So... What we're going to do today... We're going to provide a little bit of background information on GTFS as it currently exists, a very well established specification for transit data. We're going to introduce a little bit of this Transitland service that we've released in a developer preview, that we're happy to share with you all, and that we're looking for feedback on.
We're going to show off a simple and easy to use tool that we've created, so that transportation enthusiasts, urban planners, transportation planners, and folks who don't necessarily speak code can also make use of these data. And then we're going to show the API behind the scenes, so that those who actually want to touch code can create their own applications and visualizations. If you want to look up in the URL bar, you're welcome to follow along on your own laptop and run some of these queries. But we're not going to assume that you have to open a terminal window at all today. If you have any clarification questions, please feel free to raise your hand and chime in at any time, but we'll also have some time at the end for discussion if you have a larger, wide ranging question. So some context. GTFS -- it started out as the Google Transit Feed Specification. And today it's known as the General Transit Feed Specification. It was created by folks at TriMet, the Portland-area transit agency, as well as some folks within Google, and it's used by a number of industry players now. It's also used by hobbyists and developers and startups as well. Part of its appeal is -- it's simple. You can create it using Excel. It's a zip file containing a bunch of comma-separated value files. Here are a few of them. There's a file that brings together all the stops in an agency. Columns for the stop ID, the stop name, latitude and longitude. And this can be created in Excel. It can be consumed by just parsing simple text files. So this gets everyone immediately on the same page, in terms of being able to use a common dataset, coming out of a transit agency. What we're doing with Transitland is we're trying to reference all of these feeds that exist and are shared publicly with a feed registry. This lives as a Github repo. It's a series of JSON files. It's open for anyone to use, anyone to contribute to. It's a similar pattern to the Open Addresses Project.
Anyone here familiar with Open Addresses? A couple? So a directory of authoritative sources, pointing people to say -- if you would like to go download this feed, here's the URL. Here are its license terms. Here is some additional descriptive information, so that if you would like to go out and fetch any number of feeds, we'll provide you with that recipe. Like I was saying, we're creating our own API on top of that, that is also open. We call this the data store. Where we go out with Transitland's infrastructure, and we fetch those feeds. We go to those URLs, and we bring in a copy. And then we make that accessible to any user with a simple query through an API interface. So this is relevant to people who would like to, say, write out a URL-style query. No intense coding required, but you'll need to understand the basics of an API. And then also what we're going to demonstrate today is on top of that, we've built an attractive simple interface that's more relevant to beginners, to people who have questions to ask, but maybe don't know how to turn that question into code. We call that the playground, and this is just one of many applications that could be built on top of this API. So the main impression I would like to leave you all with, with Transitland, is that what we're trying to build out is a modular architecture. It's all Open Source. We have plugin points for different kinds of users. Beginners who might want to just run a little query, more advanced folks, who have very specific queries that they want to type in, and then also organizations and companies that would like to create their own infrastructure, but would like to share some of these patterns with others. So to make some of this more concrete, I'm going to hand it over to Meghan, to show off this playground and to show off some of the data that we've collected together.
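Since a GTFS feed is just zipped CSV, the stops file Drew described, with its stop ID, name, latitude, and longitude columns, can be read with nothing but the standard library. The sample rows here are made up:

```python
import csv
import io

# A made-up fragment of a GTFS stops.txt file: one header row,
# then one comma-separated record per stop.
stops_txt = """stop_id,stop_name,stop_lat,stop_lon
1001,Main St & 1st Ave,45.5231,-122.6765
1002,Main St & 2nd Ave,45.5238,-122.6751
"""

# csv.DictReader gives one dict per stop, keyed by the header row.
stops = list(csv.DictReader(io.StringIO(stops_txt)))
first = stops[0]
```

In a real feed you would open `stops.txt` from inside the zip archive (for example with `zipfile.ZipFile`), but the parsing is the same.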
>> Hi, I'm Meghan. I'm going to start with a question as well. How many people in this room would not consider themselves a developer? Awesome. So I am a developer for Mapzen, but I do not have a background in computer science. I actually studied urban planning, and in doing GIS work for urban planning, realized that I needed this set of skills in order to be able to access a lot of the data that I wanted to use. But I don't think that it's necessary to have to have those skills. I think everyone should have access to the data to use in their own way, without necessarily being a developer. And that... Those types of people, like advocacy groups, planners, community groups -- that's who the playground is for. People who want to work with transit data, but don't necessarily program or write code. So the transitland playground accesses the data store, and you can query for the data and visualize it and download it. So I'll show you a couple of examples of this. You can find out what operators are in the San Francisco Bay Area. Right now we're working with San Francisco and New York data, but more is on the way. So a quick query will show you that we have these operators. These transit providers in our data store. And here's some information about those operators. You can link to the different websites. But you immediately get this impression of the different overlapping systems without having to go to each individual operator website to figure out what their service area is. You can also... You could... Start here. You could zoom in on the map and find out in a particular neighborhood where are there transit stops. And then you can also just find out the stops here. And grab some information about the stops. And then finally, you can query for routes. You can also do all of these by operator name if you wanted a specific operator. So I'll show you an example of that. Let's do one in New York. Let's do the MTA. And see what their routes look like. This is a big system.
>> It's not giving it up.
>> Well, luckily, I've pre-loaded this query. This is all of the routes in New York.
>> The Wi-Fi can be a little flaky in this room.
>> It does usually work a little bit faster than that. So these are all the routes in New York. And you can zoom in and check out what they are. Click on a route and see what line that is. And then you can also download these to formats that you can work with in other things, if you work with Arc or ESRI software, you can use CSV or JSON. And if you download these, you also have a bookmarkable endpoint for this data. So you can come back to this dataset again. If you just want to save this bookmark. And that's the playground. So I'll hand it over to Ian to talk a little bit more about how the API works and how you can get a little bit more sophisticated with your queries if you just go to the API directly.
>> So as Meghan showed, the playground -- it's powered by the data store API. All of the different transit feeds from all of the different agencies plus additional data we've added in together -- it ends up in our data store, which is a Rails app. It can be queried in various ways. Stops, operators, bounding boxes, type of service, and so on. And it's actually a fairly simple API, and you can see here when we downloaded the data from that previous playground query, it created a URL. This URL could be bookmarked. It could be saved. It could be queried again and again over time to look at how things change, and so on.
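URL-style queries like the one the playground generated can be assembled with standard tooling. A sketch follows; the base URL, endpoint names, and parameter names are assumptions based on the talk and should be checked against the actual API documentation:

```python
from urllib.parse import urlencode

BASE = "https://transit.land/api/v1"  # illustrative base URL, not verified

def datastore_query(endpoint, **params):
    """Build a bookmarkable data store query URL.

    Endpoint and parameter names here are assumptions based on the talk,
    not a definitive description of the Transitland API."""
    return "{}/{}?{}".format(BASE, endpoint, urlencode(sorted(params.items())))

# Stops inside a bounding box (min_lon, min_lat, max_lon, max_lat):
stops_url = datastore_query("stops", bbox="-122.52,37.70,-122.35,37.83")

# Routes served by a particular operator (Onestop ID shown is illustrative):
routes_url = datastore_query("routes", operated_by="o-9q9-bart")
```

Because the query is just a URL, it can be bookmarked and re-run over time to watch how the data changes, exactly as described above.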
>> That's big data right there. When it's crashing in Chrome.
>> Yeah. Beach ball means big data. Whoops. So you can see here -- this is one of the queries that was generated by the data store. It returns a JSON, and this JSON has a number of different operators. Each of these operators has a geometry, and so on. So if we want to look at what's actually in the API, we have a couple... One second. We have a bunch of different API endpoints. You can look at this readme file that has examples of stuff you can look at. So, for instance, we can look at all of the stops. We can look at all of the stops that have a particular identifier, or belong to a particular transit agency feed. Or are within a certain bounding box, or are served by a particular operator. And this is where it gets a little bit more interesting -- where we can look at the particular attributes associated with the particular stop or route. Same thing with operators. And routes, as well as stops. And then this also -- we have all this rich metadata about where the feeds came from, when we imported these feeds, what was in that feed at that time that we imported it. And so on. So if we want to look at one of these API endpoints... Let me just pull up a little bit more. So... As we kind of alluded to at the beginning, the GTFS format is simple, but it also has a lot of internal logic, and a lot of little implementation details that go into a GTFS file. It's basically a dump of a relational database, and it's kind of interrelated using IDs across all these different feeds. Sorry, files inside of the feed. And then there's hundreds of these GTFS files from transit agencies all over the world, and more coming every day, and they're updated constantly. There's different versions, and so on. And part of the task that the feed registry is designed to solve is tagging all of these and importing them into the system, but once you have all these GTFS files, working with them in a systematic way across multiple agencies is still kind of difficult.
So importing them into the data store allows users to do queries not just on one GTFS feed but all of the GTFS feeds we've pulled into the system. So for instance, one query that might be interesting -- say you wanted to say what are all the subways in the database that you have? All the subways in North America? All the light rail systems in North America? All of the bus systems in North America? Say you wanted to be able to compare economic or social characteristics between these different types of public transit services. That's possibly a very interesting question. And gathering all of that data yourself and processing the data can be pretty tricky. But this is why we developed this kind of service. So one of the tags that we have associated with routes in our data model is vehicle type. And internally, in the GTFS, it's an integer. And the integer is kind of opaque. It maps to subway, bus, light rail, and so on. But because we're building on top of the GTFS attributes, we're also kind of creating, like, a tag system. Kind of like an OSM tag, where we can take the data model a little bit beyond what's specified in the GTFS. And kind of leverage this open collaboration in what should be in these tags. So in this case, we've taken that integer and turned it into "metro". We can search by this string. And this returns a JSON response that shows us all of the subway or metro routes in the system. And similarly, say we wanted to find out -- of these -- of the stops in the system, which ones are tagged as being wheelchair accessible? That's another type of query we could do. We could look at the stops, look at the tag, wheelchair boarding, look at the value of the tag, and in this case, it's directly imported from the GTFS. And we think this is going to be a really interesting API to allow people to develop a lot of really interesting applications on top of this data. Finally, you'll see that there's all of these different IDs. These one stop IDs.
These are the unique identifiers for every operator, every route, every stop in the system. And then we can use these as stable identifiers that can be linked to over time and referenced independently. And so you can pull up -- here is the one stop ID for a particular BART stop in El Cerrito, North of Oakland in the East Bay. And one of the things we're also doing in the system is we want to interact with the OpenStreetMap community, and one of the things we do right now is kind of a loose coupling, where we have all of this data that we've gotten from the operator -- in this case, Bay Area Rapid Transit, but we've also taken this stop and associated it with an OSM way ID. And this allows us to create a loose coupling between the data we get from the agencies back to the data we get from OSM. This data is used in our routing service to be able to connect transit stops back to the OSM network, and also to provide a way to kind of talk between these two different datasets that come from two very different worlds.
>> So Ian -- so why that particular way? It doesn't seem to represent the station itself.
>> So the way it works right now is... Our routing team has developed an API called Tyr, and one of the endpoints -- one of the methods of this API allows us to find all of the OSM ways that are closest to a particular point. And in this case, it was just kind of a simple matching -- whatever way was closest to the point that was imported from the GTFS feed. And we're continuing to kind of fine-tune this, and be more selective. Not just distance as a criterion, but other criteria that could pick a more representative OSM way to create the link. But this is ongoing.
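The distance-only matching described here can be sketched very simply: measure the distance from the stop to each candidate way and take the closest. This toy version uses point-to-vertex haversine distance; the real service is more sophisticated, and the data below are invented:

```python
import math

def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance between two lon/lat points, in meters."""
    r = 6371000.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_way(stop_lon, stop_lat, ways):
    """Pick the way whose closest vertex is nearest the stop --
    a toy version of the distance-only matching described above."""
    def dist(way):
        return min(haversine_m(stop_lon, stop_lat, lon, lat)
                   for lon, lat in way["nodes"])
    return min(ways, key=dist)

# Two candidate ways: one near the stop, one far away (made-up coordinates).
ways = [
    {"way_id": 100, "nodes": [(-122.30, 37.90), (-122.31, 37.91)]},
    {"way_id": 200, "nodes": [(-122.50, 37.70), (-122.51, 37.71)]},
]
match = nearest_way(-122.305, 37.905, ways)
```

A production matcher would measure distance to the way's line segments rather than its vertices, and weigh criteria beyond distance, which is exactly the refinement described above.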
>> So this association can change with the import that you have?
>> Yeah. So any time the stop geometry is updated, it will reassociate with a different OSM way. And we're considering how we're going to keep the updates going in the other direction, when OSM is updated. Maybe this way is dropped or modified or something. And we're still deciding exactly how we're going to synchronize in the other direction.
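On the vehicle-type tags mentioned a few minutes ago: the GTFS `route_type` field is an integer defined by the spec (0 = tram/light rail, 1 = subway/metro, 2 = rail, 3 = bus, and so on), and turning it into a searchable string is a small lookup. The exact strings Transitland uses should be checked against its documentation; these are illustrative:

```python
# GTFS route_type integers (from the GTFS spec) mapped to human-readable
# vehicle-type strings. The string vocabulary here is illustrative,
# not necessarily the exact Transitland tag values.
ROUTE_TYPES = {
    0: "tram",
    1: "metro",
    2: "rail",
    3: "bus",
    4: "ferry",
    5: "cablecar",
    6: "gondola",
    7: "funicular",
}

def vehicle_type(route_type):
    """Translate the opaque GTFS integer into a searchable string."""
    return ROUTE_TYPES.get(route_type, "unknown")

# Find all metro routes in a made-up set of imported routes:
routes = [{"route_id": "A", "route_type": 1},
          {"route_id": "B", "route_type": 3}]
metros = [r["route_id"] for r in routes if vehicle_type(r["route_type"]) == "metro"]
```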
>> Can you include multiple tag queries? You know how you had tag name, tag value -- can you do that with multiple tags at the same time?
>> Right now, we only let you specify one tag name and one tag value. You can also, if you want, just send in a tag name, and see the full range of, like, any stop that has a wheelchair accessible tag, say. But we are open to feature requests.
>> Yes.
>> Why is it ways and not node edits?
>> So the reason for that right now is that we're trying to provide enough information for a routing engine to be able to do multimodal routing. So to say... You know, walk these five blocks, then board transit, then go back. We do have the foundation here with those one stop IDs, that are in our system, to be able to put that in OSM as a tag on a node. And vice versa. So I think over time, as we get to, say, being able to model stop versus station relationships, and also egress points of stations, where you have to start getting more precise, being able to connect to nodes as well is going to be very important. But at least for a starting point, to be able to power a routing engine, a way ID seemed sufficient.
>> Can I add my own tags to a stop or a route?
>> So we have everything in place right now technically speaking. We haven't yet opened up the data store to public edits. More because of trying to figure out what sort of user model we'd like. But technically speaking, yes. Yes, it's possible.
>> Do you model transit stop pairs at the transit stop locations? Or do you have just one where the transit stops are? I'm asking because if you're doing pedestrian routing, it's important to catch that crossing time, if people are traversing the street to get to the other side.
>> So one of the interesting things that we're doing is in GTFS, it's stop, stop, stop, stop, stop. And the stop can have a parent station. A lot of this is like -- we want to be able to model hierarchical parent-child stops a little bit differently. Like, for instance, a stop that has -- a bus stop that has flags on two different sides of the street -- it's important for pedestrian routing, but conceptually, it's also kind of one stop. And we want to be able to make that one stop with two different access points. Two different points of ingress and egress. You see this commonly for rail systems, where, for instance, in the New York City subway, most of the stops in the GTFS feed are modeled as parent stations, with multiple platforms. We want to kind of extend this, and kind of automatically see how far we can get with doing this for bus stops as well, where we have stops on both sides of the street, or on different street corners. Basically have a parent stop that has multiple different access points. And each of these can have some of their own properties and their own tags. Yes?
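The parent-stop-with-access-points idea can be sketched as a small data structure. The class and field names below are illustrative only, not the actual Transitland data model:

```python
from dataclasses import dataclass, field

@dataclass
class AccessPoint:
    """One point of ingress/egress -- e.g. a bus flag on one side of
    the street, or one entrance of a rail station."""
    id: str
    lon: float
    lat: float

@dataclass
class Stop:
    """A conceptual 'one stop' that may expose several access points,
    each with its own location (and potentially its own tags)."""
    onestop_id: str
    name: str
    access_points: list = field(default_factory=list)

# A bus stop with flags on both sides of the street, modeled as one
# parent stop with two access points (made-up IDs and coordinates):
stop = Stop("s-example-mainst", "Main St")
stop.access_points.append(AccessPoint("ap-nb", -122.6765, 45.5231))  # northbound flag
stop.access_points.append(AccessPoint("ap-sb", -122.6767, 45.5229))  # southbound flag
```

For pedestrian routing, a router would pick the access point on the correct side of the street while timetable logic still treats the pair as one stop, which is the trade-off described above.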
>> And that's something that you'll be able to accept improvement on?
>> Yes. We still envision this as a very collaborative project, and we're looking for any suggestions and possible improvements in the future. And also, as part of our collaborative user model, we would be looking forward to user contributions filling in gaps in the data. One of the things you'll find in the wild is that the quality of GTFS feeds varies wildly from agency to agency. Some agencies are extremely accurate: they model every little detail of their transit feed and they follow the spec perfectly. Some other transit operators... Not so great.
(laughter)
>> And they may follow the syntactical specification of the GTFS feed, but not fully following the spirit of what should go into a GTFS feed. So you'll see this a lot with different operators, and part of the idea behind our collaborative model of user edits in the future is to be able to annotate and improve these feeds that come from authoritative resources.
>> Can you speak to how you see this engagement with the agencies happening, going forwards? I know there have been attempts in the past to kind of create an authoritative registry of GTFS feeds, and getting all the agencies to buy into that... Especially if you're talking about this collaborative model, which I think is great -- but if the agency publishes their feed and someone makes changes, how do you get the agency on board with those changes and reincorporate that into their feed? I wonder if you have any thoughts on what you see the future there being?
>> I think that's... Well, that is a very big and important question, right there. And it's one that's half about technical functionality and scope, and half about the way we all are working together. I think technically speaking... Let me see if I can... One thing that we're trying to do in our architecture for this system is have... I'm afraid I can't pull it all out. It's a workshop, so I can open the developer tools.
(laughter)
>> Okay. So... And I will say -- in all honesty, the technical piece is less a challenge than the relationships. But in terms of setting up the right kind of architecture to show that we want to aggregate data without assuming too much responsibility and without... Well, having the right sort of relationship with authoritative data providers. One thing you'll see down here is that we're trying to draw a distinction between having a registry of feeds, where the source of truth lives with the agency. It's at their URL, maintained fully by them and under their own terms, and within transitland, we're providing a certain amount of aggregation and a certain amount of community overlay, but we want to be very clear that we're not looking to become a source of authority. Of original authority. Or... Also, you'll notice the arrows are only pointing right. We're also not looking to be in a position to say -- report errors back upstream. I find it very interesting to hear how OpenStreetMap is getting used within local governments. Not as a direct feedback mechanism, but as a data source that they can check against, and that they can then use as a way to say -- a change has occurred in this building. Maybe we'll take that as a signal, but... As a loose signal, rather than a tight technical coupling. So in terms of technical architecture, we're trying to be very clear about providing a community resource on top of the authoritative data provided by agencies. In terms of the way we're building these relationships, we're very much trying to -- just like a number of other projects sponsored by Mapzen -- we're very clearly trying to make this an Open Source and an open data project. 
Where the goal is to bring together organizations and companies and non-profits and individuals who have a common interest in seeing an open data solution here, and hopefully that spirit will feed into both the technical architectures and the way that we actually build relationships with the agencies who are on the ground, really making all of this work, and that we're just making a little more possible on top of all that. Did that get to your big and deep question?
>> Yes, thanks. It's a big challenge, but it sounds like there's progress being made. Thanks.
>> In the way back.
>> How does this compare to or how is it different from Open Trip Planner?
>> So the question is how is this similar or different to Open Trip Planner. Correct me if I'm wrong. Open Trip Planner is a routing engine that brings in data from OpenStreetMap to get at pedestrian and auto routing, and brings in data from a GTFS feed from an agency, and then also optionally elevation datasets and some other sources, and it builds up the graph that you need to route people, to plan a journey, to get from point A to point B. Transitland isn't a routing engine. We're bringing together the data that are important ingredients for that. Mapzen also has a separate routing engine project that actually draws in data from this service. We're looking to be very loosely coupled with it. We hope these data can be drawn into Open Trip Planner as well, and a number of other routing engines. Because the type of data here, and its representation scheme, don't need to be vertically integrated with the routing engine. There was another question in the back.
>> So how is it different from transitfeeds? I hope you have seen transitfeeds. And what about validation support for GTFS -- at least some inbuilt support? And what about GTFS-RT?
>> Okay. So a couple questions. And if I forget to address them, please jump in. Transitfeeds.com is a site created by an Australian named Quentin Zervaas, who has done a lot of really good work creating this resource. It's nice. Have a look. He's doing this aggregation component of going out and fetching feeds, and displaying an attractive page that shows the status of feeds over time. And how they've changed. It's a nicely done project. And it's a good source if what you are looking for is a copy of a feed provided by an agency. It effectively serves as a mirror to transit agencies and their GTFS feeds. The differences in terms of our approach here are... We're not actually mirroring feeds. We keep the source of truth at the agency. That's why we only have this registry. We don't actually create copies of the files and serve them out. But what we're looking to do is build layers on top, and provide much more incremental... Sorry, much more fine-grained API queries, so that if what you're looking for is the GTFS feed, you should turn to the agency themselves. That's the source of truth. But with our model, if what you're interested in is, say, querying one or two stops, the Transitland infrastructure can provide that. So I see these as complementary services.
>> Validations?
>> Yes, so validations. One of the pluses and the minuses of GTFS as a format, and being an accessible, simple format, based on CSV files, is that you can go wrong in the construction of a feed. You could, say, have a row in one file that's not represented in another. You could have an extra comma. There are small, minute errors that crop up. Google provides some tooling. There is a validation library that agencies can run on their own computers. Also, the process of submitting a feed to Google, when an agency goes to Google to send in their feed -- that's run through validation steps as well. In our own system, before this line here, before trying to aggregate a GTFS feed, we run some validation steps of our own. And I'll turn it over to Ian to answer any particular questions there. Because there are a couple different concerns. First the concern of... If an aggregator script touches this GTFS feed, is it going to have a hard failure? So we've checked for those types of things. And then we also run the existing Google feed validator, and produce that output. Or just to say... Here is the state of the feed right now. And GTFS-RT was your third question, right? Okay. So this is realtime data. So there's also a separate spec called GTFS-RT. For transit agencies sending out realtime updates, like a service advisory -- saying this route is under construction, or has a ten-minute delay. Also, getting down to, say, vehicle-level delays or positions. And while GTFS has good adoption, for providing the static information of a transit timetable and network, the situation with realtime data is a little more in flux. There are a couple competing standards. Vendors support different specs at this point. Let's just say realtime data is less mature than static data. Transitland doesn't involve any realtime data at this point.
But what we're hoping to do, both with our own efforts and with partners, is to build a good foundation, to build an architecture and an identifier scheme that allows people to overlay realtime data. But without a solid foundation, having realtime information is of only so much use and relevance. Okay. Three questions, three of three? Three of three. Okay. So any other questions?
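One of the simplest validation checks mentioned above -- a row in one file referencing an ID that doesn't exist in another -- can be sketched in a few lines. This is toy data and a toy check, not the Google validator:

```python
import csv
import io

def missing_stop_refs(stops_csv, stop_times_csv):
    """Return stop_ids referenced in stop_times.txt that are absent
    from stops.txt -- a basic cross-file referential check."""
    known = {row["stop_id"] for row in csv.DictReader(io.StringIO(stops_csv))}
    refs = [row["stop_id"] for row in csv.DictReader(io.StringIO(stop_times_csv))]
    return sorted({s for s in refs if s not in known})

# Made-up feed fragments: stop_times.txt references a stop "3"
# that stops.txt never defines.
stops_txt = "stop_id,stop_name\n1,First\n2,Second\n"
stop_times_txt = "trip_id,stop_id,stop_sequence\nT1,1,1\nT1,3,2\n"

bad = missing_stop_refs(stops_txt, stop_times_txt)
```

A full validator layers many such checks (extra commas, malformed rows, spec violations) on top of this same cross-referencing idea.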
>> I have a question about the GTFS format. I looked at the format, and there is only the possibility of one stop name. Multi-language is not possible. For example, specifying the stop name in English or French or German -- for example, the city of Geneva: in French it's Genève, in English it's Geneva, in German it's Genf. Written and pronounced differently. And it's also a train stop. What to do in this case?
>> I'm not sure. Are you familiar with any GTFS extensions there?
>> I'm actually not aware of international...
>> One stop name is not enough.
>> So GTFS does allow for Unicode support. But there is no internationalization built into the GTFS spec. So you can specify a feed in an alternate language -- in whatever language you want. The stop name can be written in whatever language, but I'm not aware, personally, of any multi-language international support inside a single feed.
>> You should specify a name which is understandable for the local population and for tourists. And, for example, if it's a name in Cyrillic, it's not understandable to tourists at all. It should be written in both Latin and Cyrillic.
>> I would love to talk to you about how to build in some level of international support into our data model, where we can have multiple names for routes and stops, if you would like to talk online sometime.
>> I think a workaround could be to do it at the API level. The application layer could just choose the language of the user and just change the path to the API. Like a prefix. FR, DE, US. Right?
>> So support for internationalization -- we did build it for Gujarati and English in India. I think there is some international support for GTFS.
>> Inside a single feed?
>> I'm sorry?
>> Inside of a single feed?
>> Yeah. I mean, you just have to include all the languages, I guess. I will look into that. But it was done previously.
>> Okay.
>> Yes?
>> The feed registry -- is this something that would run in parallel with GTFS Data Exchange? Like, a similar service, pulling the addresses of the GTFS files? Or are they complementary to each other?
>> So a question about the feed registry and the GTFS Data Exchange. That is an earlier, almost equivalent service to that transitfeeds.com site that we were talking about earlier. So GTFS Data Exchange also takes this approach of mirroring GTFS feeds. So it's a site where a user would go and upload a feed. Oftentimes enthusiasts and developers have gone there and uploaded their own copies, so in many cases, it's not actually even provided by the agency themselves. But this site has lasted a long time. It topped maybe 700 feeds a while ago, which triggered some server errors, and they had to raise the limit on the number of feeds. So that's an embarrassing problem that's good to have, right? It's a sign of success. I think we also see this feed registry as complementary to that service, because the goal is not to mirror feeds, but just to reference feeds where they already exist and where they're managed by the organization that's responsible for them. We do include a variety of IDs in the registry, so it can be used as a crosswalk. And if we're aware that a feed is mirrored in GTFS Data Exchange, we include the ID from their API. So it's possible to use the registry as a crosswalk to say -- if I would rather go to the GTFS Data Exchange, when the ID is listed there, hit that URL instead. So again, I see this as a complementary effort. Yes?
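The crosswalk idea might be sketched like this. The entries, field names, and URLs below are hypothetical stand-ins for illustration, not the actual registry schema.

```python
# Hypothetical registry entries acting as a crosswalk between
# services: each feed keeps its agency URL plus, where known, the
# ID it has on the GTFS Data Exchange mirror. All values invented.
REGISTRY = [
    {
        "onestop_id": "f-xxx-exampletransit",
        "url": "http://transit.example.gov/gtfs.zip",
        "gtfs_data_exchange_id": "example-transit",
    },
    {
        "onestop_id": "f-yyy-othertransit",
        "url": "http://other.example.gov/google_transit.zip",
        "gtfs_data_exchange_id": None,  # not mirrored there
    },
]

def feed_url(onestop_id, prefer_mirror=False):
    """Return the agency URL, or the mirror URL when asked for
    and the feed is known to be mirrored."""
    for entry in REGISTRY:
        if entry["onestop_id"] == onestop_id:
            mirror = entry.get("gtfs_data_exchange_id")
            if prefer_mirror and mirror:
                return "http://mirror.example.com/agency/%s/" % mirror
            return entry["url"]
    raise KeyError(onestop_id)
```

The point of the design: the registry only references feeds, so consumers choose whether to hit the agency directly or a mirror.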
>> What you said before about not wanting to have arrows going the other direction quite so much, because you don't want to necessarily be telling transit agencies how to do their work or be responsible for their feeds -- but I feel like there's an opportunity here, because you're aggregating all these different feeds, to take lessons learned and be able to describe to a transit agency that may have limited technical expertise: this is a best practice, or this is better for what we're trying to do than others. Have you done that informally? Or do you have any thoughts on doing that formally, like a blog post? I imagine a headline: what we've learned from 100 different transit GTFS feeds. That could inform the practice around GTFS.
>> Yeah, yeah.
>> You've already got a wish list going?
>> Published.
>> You're hitting this right on the head. That this is about... Bringing together common patterns and common experience, but then framing it in a way that is positive and helpful to the agencies. So yes, we actually have been aggregating quite a lot. We have a list of quirks. We also have been doing a similar effort of... We've been reviewing the licenses attached to these feeds. We've been finding... Let's just say we have the raw data and we're bringing together the raw data, and we've been chatting with lots of others who have their own lists of pluses and minuses and quirks in these feeds and the systems built around them. But this is both a practical question, but also just true of building a good Open Source and open data project in general, is -- how do we bring these together in a way that it really does help the agencies themselves and the staff there, and isn't a report card on performance, but saying... We all share common interests, and we know you have constraints within your own organizational environment, and we'd like to be able to say... Not a Buzzfeed headline of 52 issues with GTFS feeds, but instead more... Good patterns.
>> Going off of that, do you find that third-party providers of GTFS software are helpful or not? Are they actually promulgating bad practices to the transit communities? Or is it something like -- oh my God, thank God they used this software to generate this GTFS feed, this is beautiful? Or the exact opposite of that?
>> Oh, um... I'm going to say that I think within the community of GTFS data producers and consumers, everyone has really good intentions. Everyone is doing what they can to work towards some nice commonalities. I will add the asterisk that GTFS only captures one aspect of transit agencies. It only captures the rider experience. A transit agency also involves operations. HR. Scheduling. Vehicle maintenance. And a whole lot of concerns outside of the context of taking a rider from A to B. And there are different... There are a lot of big IT systems. There are a lot of vendors that are important in that larger universe. I'm not sure if they share as much of the same interest and priorities with GTFS. But I will say that it's a good community. Everyone is working towards good ends. And... Yeah. Yes?
>> Is it possible to use a set of (inaudible)
>> Sorry? Is it possible to use a map of what?
>> In OpenStreetMap, there are public transport relations. Public transport, I guess.
>> Oh, okay, okay, so OpenStreetMap has...
>> Can OpenStreetMap use a resource for (inaudible)?
>> Okay. So the question is: can OpenStreetMap be used as a source for routes and stops? So OpenStreetMap has a wide range of data about the geographic aspects of a transit network -- about stop locations and routes -- and that data has come from user contributors. And this is in contrast with, say, a GTFS feed that's coming from an authoritative source, and includes the temporal information: the schedule and the calendar attached to a route. This is part of why we're trying to aim for a loose coupling of Transitland with OpenStreetMap. And I think this is also true of a number of other projects, like OpenAddresses, say, where the goal is to be able to take advantage of OpenStreetMap's user community and user-contributed data, and loosely connect it to authoritative data sources, and that's why we're doing this conflation against OSM way IDs. I think, as we move forward, what we'll be looking for is more slight little connections we can put in place -- like, say, putting some of Transitland's Onestop IDs into OpenStreetMap -- so that there can be a loose connection to existing OpenStreetMap data. Let's just say bulk imports of data are contentious enough. A bulk export of data that effectively creates two copies of stops and routes would probably be less effective than just sharing IDs across systems.
>> I mean... You assume there's always authoritative data provider. And it's not mostly true for... For example, Third World countries like Africa or Russia or... Eastern Europe. So, for example, in Russia, there are thousands of towns, but, like, two publicly available GTFS feeds. St. Petersburg and some other town. So I'm wondering if people can create their own datasets for public transport without authoritative support. In this service.
>> Yeah. So a question about user-contributed data in places where there's no authoritative feed. Yes, I think it would be great to see that. We have all the mechanisms in place within this data store. That's why we're trying to say that there are -- there is the possibility for user contributions on the right side of that diagram. The data store actually has the mechanisms in place with change sets to be able to support that. And so I think some of this is less a technical question than a question of what should live in OSM versus what should live over here, and how to make that clear to users. And some easy ways that they can contribute. Yes?
>> Have you thought about including scheduler or frequency information?
>> Yes, we have. That's the next step.
>> If you can't get to the trip-level schedule, frequency over a period of time, or at least the span where there is service at a particular stop, is really important, because -- I work directly with this. We had a lot of trouble where stops are peak-hour only, and they're not represented as that anywhere. It's really confusing to users.
>> Yes. That is in the plan. If you have specific requirements and requests, we would love some guidance on the specific type of output that would be helpful.
>> (inaudible)
>> Including schedule or frequency information.
>> Yes? A question?
>> I'm just wondering from a research perspective if you provide or could provide any historical GTFS or GTFS realtime data, if that's something that you're capable of?
>> Yes.
>> You want to address the historical feeds?
>> Yeah. So just briefly... Every time a new feed appears, we pick up that feed from the agency website. We record the time we discovered it and the checksum of that feed, and we keep a copy of it. Now, we're not publishing these copies publicly. But for research purposes, we will be keeping an archive of all of the feeds that we import, as well as metadata about these feeds. Now, I don't know if they're going to be accessible directly from the API. That's not currently part of the plan. But we probably will be keeping an internal archive, at least, of these feeds.
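The archiving scheme described -- record the fetch time and checksum of each copy, and keep the copy -- can be sketched in a few lines. The in-memory dict here is a stand-in for whatever a real archive would write to.

```python
import hashlib

# Sketch of the feed-archiving idea: every fetched copy of a feed is
# stored with its checksum and fetch time, and an unchanged feed
# (same checksum) is not archived twice. The dict is an in-memory
# stand-in for real storage (disk, object store, ...).
class FeedArchive:
    def __init__(self):
        self.versions = {}  # checksum -> (fetched_at, feed bytes)

    def ingest(self, content, fetched_at):
        """Archive one fetched copy; return (checksum, is_new)."""
        checksum = hashlib.sha1(content).hexdigest()
        if checksum in self.versions:
            return checksum, False        # feed unchanged, skip
        self.versions[checksum] = (fetched_at, content)
        return checksum, True

archive = FeedArchive()
_, first = archive.ingest(b"feed-v1", "2015-06-06")   # new
_, again = archive.ingest(b"feed-v1", "2015-06-07")   # unchanged
_, update = archive.ingest(b"feed-v2", "2015-06-08")  # changed
```

Keying by checksum rather than fetch date means a daily crawl of a feed that changes monthly stores only the distinct versions.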
>> And this is another bit of technical functionality that... That just touches on keeping the agency as a source of truth.
>> Yeah. From a planning perspective, how this could be used in policy is sort of knowing how these things change over time. So to have a way to access that at some point is valuable. Yeah.
>> So we've done various experiments internally on comparing changes over time. So part of the technical side of the system is that when a new feed is ingested, it's not a blank slate. It's actually merged into the existing data that's already in the database, through the creation of a set of changesets. And so we have tools that we developed to compare historical versions of different GTFS feeds and create changelogs -- change history, revision history, expressed basically as diffs. And we actually have some internal projects -- there's a visualization brewing -- exploring how transit feeds change over time. And if you would like to talk about that, maybe later this afternoon, that would be great.
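The merge-and-diff idea can be sketched as comparing two versions of a feed's stops and expressing the difference as a list of changes. The action names here are invented for illustration, not the actual changeset vocabulary.

```python
# Sketch of diffing two versions of a feed's stops into a changelog:
# new stops, changed stops, and removed stops. Action names invented.
def diff_stops(old, new):
    """old and new map stop_id -> attribute dicts."""
    changes = []
    for stop_id, attrs in new.items():
        if stop_id not in old:
            changes.append(("createStop", stop_id))
        elif attrs != old[stop_id]:
            changes.append(("changeStop", stop_id))
    for stop_id in old:
        if stop_id not in new:
            changes.append(("destroyStop", stop_id))
    return changes

# Two hypothetical versions of the same feed's stops.
v1 = {"s1": {"name": "5th Ave"}, "s2": {"name": "Main St"}}
v2 = {"s1": {"name": "Fifth Ave"}, "s3": {"name": "Oak St"}}
changes = diff_stops(v1, v2)
```

Accumulating these diffs over every ingested version is what yields the revision history described above.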
>> So... Do the GTFS feeds suffer from the same licensing woes as address sources, like Open Addresses? Is there some common ground in this? Are you able to republish this and redistribute it without limitations and use it without limitations?
>> So a question about licensing terms attached to each of these feeds, and if there's some commonalities between transit data and address and parcel datasets, like in Open Addresses. And yes, there are very strong parallels. Many of these feeds have custom terms attached to them. There is no equivalent to, say, Apache license, MIT license. Kind of these standard go-tos within these public datasets. The only standard that some government agencies adopt is a public domain dedication. So part of what Mapzen has been doing as part of sponsoring this project is putting in the legal review to figure out what exactly can be done with each feed. And that's also part of our efforts to provide some resources to the agencies, to be able to say -- here are common patterns across all of these licenses. Here are some of the better parts of the licenses, and here are some patterns that are actually realistic. That do give agencies the protections they need, while also making it more clear to developers what they can do.
>> So did you just say that the only licensing you've seen is a public domain license?
>> No, no, no, no, no. Specifically that in the world of software, there are a number of well established...
>> Got that. But I'm asking -- on the GTFS feed, for data, you've seen... You observed many different licenses on it?
>> Yes, correct. Correct. Yeah. So... We're getting close to the end. Any final questions? If not, one thing I'd like to do is maybe we can just pull up the playground again, and the API again, because this isn't just handwaving stuff. We'd actually like to point you all to it, and please go home and give it a try. We'd really love your feedback, both of what looks promising, and then also where you see room for improvement and opportunities to work together with others to build this into something that's relevant to a lot of different developers, a lot of different providers and consumers of data.
>> Okay. So we have... Well, it's not... transit.land/playground. And on GitHub, under the transitland organization, there is a readme for the data store with API endpoints, and it's open for anyone to hit, and we look forward to your feedback now in person or online. And thank you for your time today. And all the questions.
(applause)
>> Okay. Thanks a bunch, Ian, Meghan, and Drew. That was really fantastic. I learned a lot. So we have one more coming -- a geocoder. So stick around for that if you want. We have a few minutes' break.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Workshop: Geocoder-in-the-box
Peter Johnson
Sunday, June 8, 2015
>> Okay, folks. So... I see a lot of the same faces that I saw earlier this morning, so a lot of people have stuck with these workshops. That's great. To close out this pretty good day, we have Peter, also from Mapzen, come to talk about how to build a geocoder, which is arguably pretty tough.
>> Basically, I can explain why I'm with you today. So first of all, all these slides are here, hosted on the web. So if I go too fast and you miss a link, you can just go and grab them from there. And then also... What was written about this talk was actually about helping everybody to install a geocoder on their computer. And I just think that due to the internet and all of the complexity of different people's computers, we probably won't get to that. But if somebody wants to do that afterwards, please come and see me, and we'll get it installed on your laptop. I also had it installed on the Raspberry Pis. We had a geocoder on a Raspberry Pi, which is cool. A $25 computer, and it's powerful enough to do all of this stuff. So instead, what we're gonna talk about is just a high-level overview of geocoding. And then at the end we'll have a chance to just talk about any individual parts that anybody wants to talk about in more detail, or if you want to talk about something, you can stop me and we can talk about it for a little bit, and that's fine as well. So what is geocoding? The basics of it is... We've all seen it in Google Maps and stuff like that. So as you type a place in the world, it figures out what you meant, hopefully. And it presents it here, and you can make a selection. There's sorting and there's filtering, and potentially there's hundreds of millions of places in the world that it's searching, and it does it as quickly as possible. So here you can see the autocomplete is happening in sort of sub-hundred milliseconds. So even if you're a super fast typist, you should be able to get ten of them in a second. And then the opposite of that is reverse geocoding, which is just -- you give a lat/long, and it tells you the nearest place, the nearest ten places. What we're doing now is the nearest ten Chinese restaurants and stuff like that. And that's all based on OpenStreetMap data, which is really cool.
So the project I'm working on is called Pelias, an MIT-licensed Open Source, open data geocoder. A geocoder for everybody. We're trying to provide the tools that are required to build your own kind of geocoder. And we also provide a more opinionated distribution. Like Ubuntu provides a distribution of Linux, we provide a distribution you can install on your server. But it's fully customizable. And we have an API you can consume with very generous rate limits, and it's all free. I'll get into the data in the later part of this talk. This is the metarepo. This is the organization that everything sits under. So you'll see, it's quite split up. There's maybe about 30 to 50 different repositories there. And this is the quick start. It's a Vagrant instance, and if you want to get up and running really quickly without having to worry about configuring everything individually, you can follow the guide here on the Vagrant install, and that will get it set up on your laptop, and you can start playing with it and hacking on it. But more generally -- build your own geocoder. I can talk about what we did and the steps you need if you want to build one yourself. If you want to build your own geocoder, what sort of features would you be looking for? You want it to be Open Source, because it's so complex. There are so many different edge cases and different countries in the world, different address schemas, all that sort of stuff, that you can't be there on the ground to observe the truth and to ensure the accuracy of your product. So it has to be Open Source, unless you have a hundred million... Billions of dollars that you can go and spend and put people on the ground to ensure its quality. Nowadays, people expect the fast autocomplete, especially in the mobile experience. And that's something that Nominatim doesn't offer us at the moment and probably never will, due to the fact that it's based off Postgres, and just -- the technology will probably never get us there.
We expect location bias. So if you type pizza here -- I live in Berlin, so pizza in Berlin -- you should get different pizza restaurants. Filtering and sorting. If you type New York, you probably only want to get New York results, and sorting would depend on your geographic bias and the words that you type. So this stuff is really subjective and really difficult to test, but we can talk about it a little bit as well. This thing is kind of complicated. It's called term frequency/inverse document frequency -- TF-IDF -- and it's like -- the word "the" or "street" and "road" appear a lot in our dataset, so they're less important. So we can figure out how important a word is, or how unique that word is, in the corpus of our data. Fuzziness is spelling mistakes. Getting the order wrong. Missing chunks of text. That sort of stuff. We want to be able to run it for the whole planet and also want to be able to run it just for our city. So we want to scale the architecture out to 20 servers, but also run it on a Raspberry Pi. Batch geocoding is common for businesses. That's just taking a spreadsheet and geocoding the addresses. Address interpolation is difficult. That's the idea that we have some addresses in the system but not all of them. And we can try to infer the ones that we don't have from the ones that we do have. Cross streets -- very common in New York. It's like an intersection. And we want to support that, and we want it to be easy to set up and to test. Any questions about that?
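The TF-IDF weighting just mentioned can be sketched in a few lines: the more documents a token appears in, the lower its inverse-document-frequency weight, so a common token like "street" scores low. This is a toy illustration, not the scoring a real search engine uses.

```python
import math

# Minimal IDF sketch: a token that appears in many documents
# ("street") gets a low weight; a rarer token gets a high one.
docs = [
    "broadway street",
    "main street",
    "wall street",
    "5th avenue",
]

def idf(token, docs):
    """Inverse document frequency: log(N / document frequency)."""
    df = sum(1 for doc in docs if token in doc.split())
    return math.log(len(docs) / df) if df else 0.0

print(idf("street", docs))    # common token -> low weight
print(idf("broadway", docs))  # rare token -> high weight
```

So in a query like "broadway street", the unique token dominates the score and "street" mostly serves as a tie-breaker.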
>> Isn't cross street common outside of New York?
>> Not generally. It's common in, for instance, Melbourne, which is a city that's also built on a grid. But in most other places in the world, they don't say that. When I come here, the taxi driver is like -- what's the cross street? They don't do that elsewhere.
>> Do you have the equivalent thing in Berlin? Do you say between this and this?
>> The other thing they do is they talk about nearby landmarks, or they just give a house number. You'll find it's not as common as it is in Melbourne and New York. So there's a few different types of geocoding. And this is called coarse geocoding. And it only returns administrative areas. So countries, cities, states, villages, towns, municipalities, whatever you want to call them. It's really interesting. It's really nice for something like weather.com or Tinder. If you want to find somebody new -- Airbnb. Heaps of apps are using this basic thing. It's like a way of localizing your audience very, very quickly. You can say -- do you want something in Berlin? Yeah, I do. And you only get events in Berlin. It's very common. And we have place geocoding. Which is like -- you see it on Facebook. You see it on Foursquare. Foursquare also has the reverse of this. Like which places are near me now. All the services, all the main social networks are using some form of place geocoding. And then the last one is address geocoding, which is... Like, it may autocomplete the form for you, when you're posting something. It's really helpful to avoid errors in the postal system. And the inverse of this is -- like, Uber, actually, you can drag the pin. And it says -- are you at 350 Fifth Avenue? Yeah, I am. And that's very difficult with the OpenStreetMap data we have at the moment. So to build a geocoder, you kind of have to wear many hats. A lot of them are technical. Computer sciency. Information retrieval, programming and testing, all the data you have to consume -- hundreds of millions of things. Linguistics is the other major part of it. Like looking at languages, foreign languages, looking at analysis of text. And I'll get into that in more detail. And obviously geography. Though geography doesn't play as big a part in it as you might think, it is really important for filtering and sorting.
Community management is something to think about if you want to actually have a geocoder that's going to get to a certain scale, like where people can contribute to it, and it's not going to be a short-term project. It's going to be a long-term project that sticks around for a few years, maybe 10 years, 15 years. And the last -- User Experience, UI, UX. And if you can control the front-end, as well as controlling the API layer, it gives you much more flexibility in a feedback loop for your product, where potentially you can log user activity and use that to improve the product, which we don't do at the moment. So as far as, like, where to start -- you have to kind of choose a database to get started with. And there's lots of options out there. You have stores -- I've got the slide here. So you have relational databases, document stores, key-value stores. Actually, I'll get into this in a little bit more detail in the next slide. But there's certain characteristics of your database that you need to also look out for. You're going to be in a heavy read environment. Especially with the autocomplete. You're going to get a lot of demand for reads from the database. Whereas the writes to your database will only happen probably during index time, which may only happen weekly or monthly, or it might happen daily. You need eventual consistency in your system, not transactional security. What I mean by that is -- you don't need to ensure that when two people insert a record at the same time that one wins. An example of that is -- two people buy the last product in a store. You need to make sure that only one person buys it. Otherwise, a lot of confusion. We don't really care with this database. We're more concerned with having the ability to shard and replicate. To be able to scale it across lots of servers, than to ensure we have transaction security. We need to be able to support large indices. 
So our planet-wide geocoder at the moment is roughly 60 gigabytes of RAM -- obviously not something you can run on your laptop, and not something that all software can deal with. It should have a feature-rich geographic API built in. Which is something that PostGIS has. But also Elasticsearch does, and MongoDB and CouchDB do, and more and more databases are adding that functionality. It should support multi-lingual language analysis. I'll get onto that a little bit later. How -- instead of using regular expressions or really basic search queries, you can dive deeper into looking at the language structure. And the last one is capable of performing a full text search. Give it really complicated input and it will figure it out from a huge amount of information. So I'll get into these very quickly. So it's a bit techie. So if you want to stop and talk about it more, just let me know. This is a classic relational database everybody is familiar with. Like an SQL database or spreadsheets. This structure is a document store that people are becoming a lot more familiar with these days. Where you store a structured document into the document store, and when you retrieve it, it doesn't have to do any CPU work to get that document out. And nowadays, it's even... They're combining that with the relational databases, so you can have two documents, and you can actually join those documents at retrieval time. That's quite interesting. There's one called RethinkDB that does that. This is a key-value store. It's pretty basic. But you can build pretty interesting things on top of key-value stores. This is a graph database. They're really interesting as well. They're not so much about modeling the individual entities, but modeling the relationships between entities.
So in this example here, it could be like -- the blue ones could be like movies, and the red ones could be like actors and directors, and you may act and direct in a movie, and so through that, you can infer relationships between objects, and this is the sort of structure you would use to do Skyscanner, any flight software, where you need to know that you're going to hop from New York to London to Berlin or whatever like that. Because you can walk the graph. But the two that are really important for search are -- this thing is called the inverted index. And so the idea is that like -- here is a bit of your text. You tokenize it, which means basically you split it on the whitespace, creating one token per word, and then you keep the tokens here as the keys, and in the values, you have a list of the document numbers -- one, two, or three -- where those tokens appear. And I'll get into that a little bit more. So this is the example here. Welcome to State of the Map. Mappers meet up in the United States. So you have here the token -- welcome appears only in the first document. State, however, appears in the first and the second. And you'll see here it says states, not state, but I'll get onto that in a second. And map as well. Like, it says map here, with an exclamation mark, and this says mappers, but we want to try and reduce down the inverted index, so we can do a bit of fuzziness, like we were talking about before. Any questions on this? No? So the other option you have for autocomplete is this thing called an FST, a finite state transducer. It sounds really complex, but the way that I think about this is like with traffic lights. So a traffic light goes from green to yellow, and then yellow to red, and then red to green -- it has a fixed set of states that it can move through. And this works in the same way.
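Stepping back to the inverted index for a moment: the structure on the slide can be reproduced in a few lines of Python, using the same two example documents.

```python
import re

# Rebuild the inverted index from the example: tokenize each
# document on punctuation and whitespace, lowercase it, and map
# each token to the set of document ids it appears in.
docs = {
    1: "Welcome to State of the Map!",
    2: "Mappers meet up in the United States.",
}

index = {}
for doc_id, text in docs.items():
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        index.setdefault(token, set()).add(doc_id)

print(index["welcome"])  # appears only in document 1
print(index["the"])      # appears in both documents
```

Note that without further analysis, "state" and "states" stay separate postings; a stemmer would reduce both to "state" and merge them, which is the point the speaker comes back to.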
So you can imagine as you type the first letter of your autocomplete, and then as you're typing here, it's giving you a list of the possible places that you can get for your result. The problem with this structure is obviously that it's anchored on the left. So if you don't start with the correct letters, like if you start with a P instead of a J, you will never get what you're looking for. And here is an example of this. So the states here would be like... It goes from P to I. On the first one and the second one. And it goes from I to Z only on the first document and I to S on... And this a good structure because it's really cheap to walk the graph and get the results that you want, but it uses a lot of memory. This is also really interesting. I think probably a lot of you know what this is, and if you don't, you should check this out. I've got a nice demo here. It's a way of creating spatial keys in your database to do really quick spatial lookups. And if you use this key algorithm, it means that when you sort the documents in your index, all the ones that are geographically close will actually be next to each other in your index, which makes them heaps easier to sort on. I won't go into that in detail. We can talk about it later. So I'll go quickly into linguistics. Time? Okay. So you have to consider -- like, you've got different languages to deal with. And then you have a lot of different ways of dealing with those languages, to try and get a structure that, like, lets you query it efficiently. And fast. Right? So here's a good example of the standard tokenizer I was talking about before. What we've done here is we've just split the sentence by punctuation and whitespace, and we've lowercased the w. That's all we've done. And this is the resulting token that we'd enter into our inverted index. You have something called stop words. So stop words are really common, where it's like -- the, of, to. In our case, we use street and road in some cases. 
But they're a little bit problematic, because, for instance, the word die is not that common in English, but in German it means the -- a very common stop word. So you need to be careful about which stop words you're using on which indexes. And that can help to reduce the index size, which can get to 60 gigs in memory. Something called stemming. So this is the idea that you can stem a word to its shortest part. Like a parent form. Like mappers will turn into map. Walked will turn into walk. And walking will turn into walk as well. So if you search on a search engine and miss the suffix, or get the suffix wrong, it still figures out what you're talking about, because it's been stemmed. In English, you can figure out how to do it, but there are stemming algorithms for other languages, including Russian. You don't want to get into writing that yourself. It's very academic. Synonyms are pretty obvious. With walk, we say we want walk and stroll; with streets, we want street, road, and avenue. Adding new tokens in, where they weren't there before. Or maybe expanding a number from 1 to one. The number 1 to the word one. This is a structure used for autocomplete. It's called n-grams, and it's the idea that each word can be split up into these little grams, and you can define the size that they are, and you can imagine that we can use these very efficiently to do the autocomplete as well, because we look at how many of those grams you've typed and how many we have in the index, and we can reduce down that big index very quickly to find only the documents in which these appear. This is one we're looking at at the moment. It's called a prefix gram. And it's just... It only takes the ones at the front. So it's like th, ma, map, like that. And it's really nice, because generally you get the start of a word right, but you don't really get the end of the word right, is what we find.
And so it gives you some of the benefits of the FST structure that I was talking about before. It's anchored on the left, but only anchored on the left for each individual token, each individual word, rather than for the whole phrase. That one is quite nice.
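The n-gram and prefix-gram analyses just described can be sketched like this. It's a toy illustration of the output shapes, not the actual analyzers Pelias configures.

```python
# Sketch of the two token analyses: n-grams take every substring
# window of the given sizes, while prefix grams keep only the ones
# anchored at the start of the token.
def ngrams(token, lo=2, hi=3):
    return [token[i:i + n]
            for n in range(lo, hi + 1)
            for i in range(len(token) - n + 1)]

def prefix_grams(token, lo=1, hi=None):
    hi = hi or len(token)
    return [token[:n] for n in range(lo, hi + 1)]

print(ngrams("map"))        # ['ma', 'ap', 'map']
print(prefix_grams("map"))  # ['m', 'ma', 'map']
```

Indexing prefix grams per token is what gives the left-anchored matching of an FST, but per word rather than per phrase.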
>> So why the variance in two to three?
>> You can choose how many you want. The problem with a one gram is...
>> Why is it fixed?
>> One isn't useful, and larger than three, you almost never have enough data.
>> It's common to have a range from two to four. To use...
>> You can choose whatever you want. That's why they're called n-grams. Because n is like... Any number. But if you do one, something to consider is if you had the letter S -- there's lots of streets in there. So in your inverted index, the key is S and the value is this massive array with every single ID in your whole index. And when you try to reduce that down quickly, it takes a lot of CPU time. So I wouldn't advise using 1-grams. We use them for countries. So if you type United States but you want a street address, for instance. This is something called shingles. I didn't make up the name. It's the idea of combining two adjacent words together. And that's really nice for disambiguating street addresses. Otherwise 101 and 110 of the streets would match the same tokens and wouldn't score correctly. Something called ASCII folding. It's really nice for European languages. These, again -- I didn't pick the names. Fuzz and slop. Fuzz is like this. Like... Mapzen. Mapzeen. And using what we looked at before, the FST, and analyzing a graph, or using the n-grams as well, you can provide some level of fuzz. And slop is about putting them the wrong way around. I don't really have a good example here. But if it was place Mapzen, for instance, that's an example of -- they're sloppy. They're put in the wrong way round. In Germany, in street addresses, the number comes at the end. Whereas here it comes at the start. You have to deal with that case as well. As far as, like, choosing a programming language, we use NodeJS, and there is really no... I hate to be like... Saying one language is the best. I think that's a really bad attitude. But there are some attributes of a language that you should look for. First, we want it to be easy to accept contributions and pull requests from other people. JavaScript is like a lingua franca and easy to contribute to. It should have strong geolibraries so we can do polygon intersections and stuff like that.
It should have the linguistic algorithms and multi-byte support that we talked about before, and have IO performance, so when we're doing autocomplete we're not held back by its ability to talk to the network. Preferably multi-core architecture and parallel processing for the data imports. They can take a long time. At the moment it's two days for us, to import the planet. We're hoping to improve that some more with parallel processing. It has to have good unit and functional testing libraries, for obvious reasons. We need to make sure we're not regressing and that quality is being improved rather than worsened. And the last one is one of the reasons that we picked Node -- it has good support for data streams. And the reason for that is -- when you're dealing with huge amounts of data, the planet file is 25 gigs, compressed. So uncompressed, it's large. If you try to use, for instance, a job queue -- you take it in, put it into a job queue, and another process comes along and takes things out of the job queue -- you have the ability to flood your queue. It'll just fill up. You'll fill up the memory or crash the server. So using data streams, the opposite of video streaming, it's putting stuff into the database using streams -- it's a really nice way of keeping all your machines from crashing, essentially.
>> Geolibraries are really not all that important, are they? I mean... Like, the operations that you're doing... Geospatial operations are not that complex. Like distance...
>> You're right. I have some slides on that.
>> But you could get into some issues where you're doing address interpolation. If you want to get more complex. So you want to have support for it.
>> But realistically, you want to be using geohashes the whole time. Doing string prefix matching, because it's just much more efficient. Anyway, let's not geek out too bad. These are our data sources. It's all open data. So it's all available for free and you can download it and import it straight away. The planet file, released from OpenStreetMap, I've got some stats later. Roughly 50 million, something like that. And we provide metro extracts of your city, if you just want to do New York, for instance. OpenAddresses is a really important project for us, that fills in the gaps and the holes in the addresses that we don't get from OpenStreetMap. Quattroshapes is a really cool polygon project that came out of Foursquare and provides us with a lot of the neighborhood polygons that we need. GeoNames is a great project as well, and there's no reason you couldn't put anything else in there if you wanted. Proprietary data, if you wish. I'm not going to write the code for that, but you could, if you wanted to. So in the data, you've got points. Lat/longs. They're like places or addresses, and they're kind of interesting. But what's really interesting is polygons. And the main thing there is boundaries. Administrative boundaries are really, really interesting. That is a screenshot of the Quattroshapes project, at admin 1. The second level division. Admin 0 would be the countries, and you can see here the child relationship to a country -- in the United States it's a state, and in other places, it might be -- for instance, in the UK, it's Scotland, England, Wales, and Ireland. In Australia, it's states, and in New Zealand, I think it's the islands. I don't know. So here's like... This is the Quattroshapes again. This is a locality? We have several layers of these. And then we can see here like... This one is neighborhoods. Which is really nice. We've simplified them a little bit. But you can see it's like -- it's pretty good. 
So what that means is that with OpenStreetMap data, which doesn't have a hierarchy, a place hierarchy, we can drill down -- say this is the point here, drill down through here, through all those other layers as well, and you'll be able to establish which country, state, neighborhood you're in, everything like that, and we can add that to the end of the string, when you get it back, so you can figure out if that's the point you were talking about. This is from OpenStreetMap. This is a London boundary that you can see in the middle there. For political reasons, this is the City of London, which is a different thing. For a lot of reasons which other people can talk about. So what we're looking at doing is extracting that border information from... We've got it here live, actually, if anybody wants it. You can extract the borders from OpenStreetMap in the same way we saw with the Quattroshapes project, so when mappers are out there, mapping these areas, we can pull those borders out and use those borders to mark up the places of interest. And it's really important here, because like the Quattroshapes project, there are social boundaries, where people think that they are. Because there's kind of no such thing as neighborhoods. Like, neighborhoods don't really have fixed borders. They kind of... They can move, actually. And in some cases, the government decides, or the civic planners decide where those borders are. In other cases, it's a little bit more blurred. So just check that out. They're GeoJSON extracts of those borders.
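The drill-down works by testing a record's centroid against each admin layer's polygons in turn. A self-contained sketch with toy data -- a real system would first narrow candidates with a spatial index (the tree mentioned later), and these function names are invented for the example:

```javascript
// Classic ray-casting point-in-polygon test: count how many edges a
// horizontal ray from the point crosses; odd means inside.
function pointInPolygon([x, y], ring) {
  let inside = false;
  for (let i = 0, j = ring.length - 1; i < ring.length; j = i++) {
    const [xi, yi] = ring[i], [xj, yj] = ring[j];
    if ((yi > y) !== (yj > y) &&
        x < ((xj - xi) * (y - yi)) / (yj - yi) + xi) {
      inside = !inside;
    }
  }
  return inside;
}

// Drill a centroid down through ordered admin layers (country, state,
// neighborhood, ...) to build the place hierarchy for that point.
function hierarchy(point, layers) {
  const result = {};
  for (const { level, polygons } of layers) {
    const hit = polygons.find(p => pointInPolygon(point, p.ring));
    if (hit) result[level] = hit.name;
  }
  return result;
}
```

With layers ordered admin 0 downwards, the returned object is exactly the "country, state, neighborhood" suffix that gets appended to the result string.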
Yeah. I mean, all the data is dirty. Especially... Well, it's all dirty. So it's all got errors, and there are a lot of duplicates in the datasets. Between datasets and in the same dataset, that need to be accounted for. There are weird character encoding issues. Not in OpenStreetMap, but in the other datasets. And there's just, like, heaps of errors. Like people writing all in caps lock. It bugs the hell out of me. Incorrect name tags. Like... Name... Lamp post. Do I need to know it's a lamp post? It's marked as a lamp post and named as a lamp post as well. Suffixes and prefixes. You know what I'm talking about. Not very fun. Just some formats that you want to import. OpenStreetMap, obviously. Shapefiles. Just random TSV files that you get, et cetera. So we're providing adapters for all those formats. Just briefly on import pipelines, they can run for hours or days, so you need to test your code pretty well for those things. They're potentially going to use a large amount of RAM. Polygon intersections -- you might not want to load 8 gigs of polygons into RAM, and you need to prevent flooding and crashing, like I said before, for obvious reasons. So these are our stats at the moment. 166.4 million places in the database. And you can see kind of where they come from. These are nodes, like places. And these are ways that are places as well. And then some OpenStreetMap entities have, like, address, house number, and address street in the tags as well. So we pull those out and make them a separate entity, so you can search by address. And there's 44.9 million of them in there. Compared to OpenAddresses, which gives us 96.9. So you can see already that OpenAddresses, though it's a young project, has twice as many addresses as OSM. Although I think we could probably get a little bit more, because we're not really parsing the ranges of those completely. If it's 3-5, we're not parsing that at the moment. Any questions?
>> How many of those addresses from OpenAddresses are unique, or do you have to conflate and dedupe between it and OSM?
>> We have a deduplicator for that one data pipeline, but we're not using dedupe across data imports at the moment, so we're not actually sure of the amount which are duplicated there. That's the honest truth. It depends. OpenAddresses is really interesting, because in some countries you have great coverage. New Zealand is totally covered. But then in the UK, there's no coverage. And there's actually an Open Addresses UK project ongoing. There's a lot of politics and stuff around that, and so for that reason, all the OSM addresses in the UK are unique, but in the States, you might find there's more of an overlap. Especially in New York. And New Zealand, you would find there would probably be overlap as well. So we really... That's one of the areas we need to work on. It's actually a more difficult problem than it sounds like.
>> How comprehensive? What percent of the planet is that? What needs to be done, do you think?
>> How many addresses are there in the world?
>> What do you think the coverage is?
>> Guessing? If you ever look at openaddresses.io, they have a map.
>> You have six billion people, and assume three people per address...
>> Honestly, less than 10%. And OpenStreetMap is less than 5%.
>> In many cities, most houses do not have addresses. There are no street names, no house numbers. Just coordinates.
>> It's a big problem. That's one thing for open addresses.
>> And one other thing, Peter, about that import pipeline, do you use any delta or incremental system for that, or is it just --
>> At the moment, we don't do partial updates, we don't do minutely, hourly updates. We could do that with OpenStreetMap fairly easily, because the tooling is there. With the other datasets, it's not that easy. Until the deduplicator has matured a little more, we're probably going to continue to do full reindexes. Which we can do every couple of days.
>> Have you considered providing the Elasticsearch index as a download?
>> Sure. To prevent people from having to wait the two days?
>> Do it once, provide many, kind of scenario...
>> We've talked about different scenarios. We've talked about doing a pre-processing step and providing a dump that can be done very quickly. I kind of like that, but I also like the idea that you just import whatever data you want. I'm not going to tell you what data to import.
>> I'm seeing it as a parallel to your project of your extracts. There's what you draw, there's your data, but here's your search.
>> If you're doing a smaller subset, you could easily do that with Vagrant, and we're going to continue to keep the Vagrant install, so you can specify the regions you want to download. You can do that now. It's not as obvious. So you can say -- spin me up a Vagrant with just London or New York or US. And that shouldn't take long. So the indexing isn't really an issue for something like that. If you want the planet, realistically, very few people can actually accommodate that sort of... So if you can... Then you might not have an issue with running it on your own. So we just haven't really found a good place to store something like that, to make it publicly available, because it's also just a large --
>> Nobody else is running the planet, except for us. It's quite costly. It's 10 large servers on Amazon. Probably thousands a month to run this thing. And the cost is coming down all the time, as we develop new stuff and we change the indexes. But then actually the cost will go up again, as we get all these addresses that we're looking for. The other 90%. The cost is going to go up.
>> And then we do have the instance that runs, so you can always hit our API for now.
>> Yeah, just use us. This is what I was talking about, for the geography. What you actually need from your libraries. Point in polygon, polygon in polygon, address ranges, filtering, sorting distances by boundaries. This is not that much. The point intersection I talked about before -- you have these layers of polygons, you drill down through them with your centroid, you have to figure out quickly which polygons they intersect. It's quite a complicated problem, using a tree and then some point in polygon to finish it. These are some examples of the polygon in polygon. It's not just simply is it inside. It's like -- does it overlap? Is it completely inside it? Is it like near it? Is it like... You know? So there's actually different algorithms that exist for talking about whether things overlap or intersect each other. And these are squares. Squares are easy. This is address interpolation. Whenever we talked to people about address interpolation, it was like -- oh, it's so easy. But there's a line like... 1, 100, 50... You want 50? 50 is there. And then you look at that. And there's a tree there and there's a stadium there. And you're like... Oh, okay. So you can provide interpolations, but it's never going to be perfect. And you just have to accept that. And I would like the user interface, the UI for the user, to show that... Like, hey, this is a less accurate result. This is not a definitive result. This is a computed result. So we talked about doing that at indexing time, doing that at query time, and I think you probably have to do it at query time. But until we have a certain -- like, 10% of the world's addresses... How well are we going to do interpolation? Probably not very well.
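The naive linear scheme the speaker is warning about looks something like this -- a sketch with a made-up function name, including the accuracy flag so a UI can mark the result as computed rather than definitive:

```javascript
// Estimate where a house number falls on a street segment whose two
// endpoints have known numbers and coordinates. As noted above, the
// real house at that number may sit somewhere else entirely (a tree,
// a stadium...), so the result is flagged as interpolated.
function interpolate(start, end, number) {
  // start/end: { number, lat, lon } at each end of the segment
  if (number < Math.min(start.number, end.number) ||
      number > Math.max(start.number, end.number)) {
    return null; // number is outside this segment's range
  }
  const t = (number - start.number) / (end.number - start.number);
  return {
    lat: start.lat + t * (end.lat - start.lat),
    lon: start.lon + t * (end.lon - start.lon),
    accuracy: 'interpolated' // so the UI can say "computed result"
  };
}
```

So asking for number 51 on a segment numbered 1 to 101 lands halfway along it, which is exactly the guess that can be wrong in the real world.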
>> Do you do reverse interpolation?
>> How does that work?
>> When you do reverse geocoding, you want to get the exact street number where that point is, but you don't have the number. You just have a range. So you kind of guess... You see if your point is between two points, and you guess what number that point would be.
>> We don't do that. It's such a hard problem. Yeah. I mean... We could probably work on that for months. But we'll tackle that at some point. And if anybody has any ideas, please... It's kind of been solved before by other people. Nominatim does that. So we can look at what they've done there. And the other is distance sorting. If I want pizza places near me, maybe I want the nearest pizza place, so that should affect the sorting of the results. That's geographic as well, and it can be more difficult around the world, with the arc of the world. So just kind of finishing up, there's a few other things as well, if you want to build your own geocoder: running it as a service. We have a lot of servers. We have an operations team, deployment automation, testing, testing, testing. Writing the APIs. Getting the feedback. And actioning the feedback -- a lot of the feedback is -- hey, I searched a thing and it didn't work. Thank you, but where are you in the world? What were you searching, what were you expecting? I don't live there. And rate limiting. We run generous limits because we're paying for everybody to use it, but at some point, if somebody hooks their app up to this, there's a potential that they'll do a denial of service to other users on your system. That's something you have to think about if you want to get to a service scale. And then also releasing it as a product, which is something we're looking at as well. So you can install your own servers. That means a lot more testing, extensibility -- where does core end and plugins begin -- versioning it, so we can change stuff, and we can accept that things are wrong and fix them. Issue tracking. You know, how people report a problem. Especially if they have different datasets. I don't want to share my dataset with you, but it's not working. That's kind of hard to debug. And feedback, again. Feedback, feedback. 
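The reverse interpolation described in the question -- guessing a house number for a point that sits between two known numbered points -- could be sketched like this. This is a hypothetical helper, not something Pelias implements (as the answer above says):

```javascript
// Project a point onto the segment between two known numbered points
// and guess its house number proportionally. Uses flat lat/lon math,
// which is only a reasonable approximation over short street segments.
function reverseInterpolate(point, a, b) {
  // a, b: { number, lat, lon }; point: { lat, lon }
  const dx = b.lon - a.lon, dy = b.lat - a.lat;
  const len2 = dx * dx + dy * dy;
  if (len2 === 0) return a.number; // degenerate segment
  // fraction of the way along the segment, clamped to [0, 1]
  let t = ((point.lon - a.lon) * dx + (point.lat - a.lat) * dy) / len2;
  t = Math.max(0, Math.min(1, t));
  return Math.round(a.number + t * (b.number - a.number));
}
```

A point halfway between endpoints numbered 1 and 101 would come back as 51 -- a guess, with all the caveats of forward interpolation plus projection error.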
You have to create a user interface. I won't go into that very much, but it's really nice if you can control the front end, for a lot of reasons. And this is a big one, I guess, as well: build a community around your product. So you want to be open source. Everything is open. Open data, open ticket tracking. We're looking at an open roadmap, open chat rooms. Everything is completely open. You can see on my GitHub when I went on holiday, because there's no green for three weeks. Everything is open. Support. We need to provide support for people. We need to be nice and friendly to people. When they have a problem, we need to say thank you for reporting the problem. Education, training, workshops. And then collaborate. If anybody is building an open-source geocoder, I would be more than happy to talk with them about it. If they're building a proprietary one, probably not so much. Still nice to chat. There are people I know who are building smaller regional geocoders, and I'll happily chat with them on IRC or whatever about the stuff that I covered earlier in the slides. And that's it. Any questions?
(applause)
>> Okay. I'm working for the city parks department. But I'm a big fan of the (inaudible). If I'm working in a spreadsheet with latitude, longitude, it can be exported into (inaudible), changed in the (inaudible), and plugged into OpenStreetMap. Is that possible?
>> You could import it into Pelias. If you have a traditional SQL or Oracle database, where the columns are latitude, longitude, and name, for instance -- can you import that into OpenStreetMap? I believe not, depending on licensing. Into Pelias, you can. It takes care of the tokenization, the analysis, and all the stuff I covered for you. You can export it to a CSV file and stream it into the database.
>> Okay.
>> Can you go to the second slide so I can take a picture?
>> The second one?
>> I wrote this today. There's like 52 of them. Sorry.
>> There's a link to the slides.
>> That one?
>> Yeah, that one. That's the presentation?
>> That's the presentation, yeah.
>> All but one slide.
>> Which one is missing?
>> I'll tell you later.
>> Should have been the last slide.
>> What did you add?
>> Are you working mostly with databases? Working mostly with databases, geocoding -- with joins, or without joins?
>> Yeah, we don't use a relational database. This is all... The indices are all -- so I didn't actually cover that, sorry. We use Lucene, which is a linguistics library, and a library for building the inverted indices and the finite state transducers, and there was a product built on top of that called Solr, which has been around for a long time, and Solr provided a service on top of the library, and more recently something called Elasticsearch, which is what we use, and Elasticsearch gives you all this stuff, plus a RESTful API, plus sharding, and gives you the ability to elastically horizontally scale your stack. But you lose a lot of the benefits of a relational database. With a relational database, you can query arbitrary things from the database, do statistical analysis, do aggregations with groups and joins and stuff like that. You can't do that with this structure. So... But also we didn't want to -- with the Photon project, which is a similar sort of project, they installed Nominatim, PostGIS, and Postgres, and then imported into Elasticsearch. We wanted to remove that so it was easier to install. Imagine installing all of that on a Raspberry Pi. So it's not a relational database, but you can import relational data in there if you like. OpenStreetMap is relational data as well.
>> Speaking of the Raspberry Pi -- you said early on you want it to be able to run on a Raspberry Pi.
>> It does, yeah.
>> Including Elasticsearch?
>> Yeah, yeah. The new Raspberry Pi 2. It's quad-core.
>> Oh, the 2. Well oh.
>> Oh.
>> Still.
>> It's got a gig of RAM. It's pretty good. It could run New York in 100 milliseconds.
>> Are you able to filter on what you expect from Overpass? Amenity, restaurant, that sort of thing?
>> So there was a feature that we did a while back, taxonomy and categorization. And when I was looking at that, I spent quite a long time on it. I decided that I didn't want to use the OpenStreetMap taxonomy, because it's very OpenStreetMap. And we have other datasets as well. So what we needed to do was find a middle ground. So I went and looked at the Foursquare one. If you're interested, go and look at the Foursquare taxonomy. It's like 3,000 different categories for venues. It's amazing. But again, it was too much, and it belonged to them. I looked at the Google one, and the Google one is very interesting. There's 100 of them. But one of them is like RV parks. And there's like Christian churches and mosques but not other churches. And it's like... How did you pick these things? So I looked at all of those, and then mapped OSM to a kind of... Our own sort of category structure, which you can go and edit if you don't like it, and we would like feedback as well if people don't like it. And then we mapped GeoNames into that categorization as well, and that means you can say -- I only want religious places in my results, or I only want restaurants, or I only want Chinese restaurants in my results or Mexican or something, depending on what categories have been mapped.
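Mapping per-source tags into one shared category scheme might look like this. The tag keys and category names below are invented for the illustration; the real mapping lives in the Pelias source:

```javascript
// A (made-up) mapping from source-specific "key:value" tags to the
// geocoder's own shared category vocabulary. Each source (OSM,
// GeoNames, ...) would get its own table like this.
const OSM_CATEGORIES = {
  'amenity:restaurant': ['food', 'restaurant'],
  'amenity:place_of_worship': ['religion'],
  'cuisine:chinese': ['food', 'restaurant', 'chinese']
};

// Collect the shared categories for a record's tags; unknown tags are
// simply ignored rather than leaking source-specific taxonomy through.
function categorise(tags) {
  const cats = new Set();
  for (const [k, v] of Object.entries(tags)) {
    for (const c of OSM_CATEGORIES[`${k}:${v}`] || []) cats.add(c);
  }
  return [...cats];
}
```

A record tagged `amenity=restaurant` and `cuisine=chinese` then answers filters like "only Chinese restaurants" regardless of which dataset it came from.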
>> You can sort by layers. Search by layers. So you can say -- I only want OSM data or I only want GeoNames data.
>> I was thinking of filtering within each of those.
>> Yeah. You can do --
>> And it's kind of like a freestyle thing. It's called categories internally, so you could potentially --
>> Are you monitoring other category taxonomies for changes over time?
>> No.
>> Is yours changing?
>> Well, it was only done six weeks ago. It's probably the most current one out there. There is a page on the wiki -- if you look on the Pelias wiki, you'll see all my notes about doing that taxonomy. There's actually an in-depth analysis of all the taxonomies, with links to all of them. If you're interested in that, we can talk about it later.
>> You can see the mapping in our code from OSM to ours.
>> It's really astonishing how different these things are.
>> Taxonomies? Yeah, categorization is hard. How do you categorize stuff? It's quite subjective.
>> And even the length -- you said Foursquare is 3,000?
>> Something like that, and the other day he said he wanted to add more?
>> And Yelp is included?
>> No, I think I did five different ones. But the FourSquare one is very impressive, if you're interested.
>> How many folks are working on this? Except for yourself?
>> Reesh, myself, Julian just joined the team.
>> So four... And a half.
>> Four now. And a half. We had an intern who just left our team.
>> Is he here? He just did amazing work for us. Now he's going to MIT.
>> And (inaudible) from the New York Public Library is joining us.
>> He's right there.
>> Yeah, small team.
>> Yeah, it's cool. We're really interested in getting external feedback. Do people like it, do they not, and there are definitely some changes that can be made. I think some of the search objects could be improved, and some of the data and the deduplication could be improved. The other thing that's interesting from the OpenStreetMap community is we have a list of tags that we import. So we actually discard a lot of tags -- for instance, if it doesn't have a name, it doesn't need to be in a geocoder. If it doesn't have a name, what are you going to search for? But there are some things that end up being runway A of JFK, runway B, runway C -- kind of annoying. That's another thing I've been working on for a long time. Which is like -- doing the analysis again, a second go at that, and figuring out which ones we want to import and which ones we want to leave out. It's always going to annoy somebody.
>> I was just saying... I think if you put a house name, it doesn't import it.
>> A house name?
>> It doesn't understand that its name can be searched on as geocoded.
>> Really? House name like...
>> You were talking about that.
>> That's so annoying. What do people use house names for?
>> Like apartment house names.
>> It's British. The British have them a lot. Like old colonial houses.
>> The Irish like to use it too. Rural areas, instead of addresses, they sometimes use house names instead of numbers.
>> My landlord has one too.
>> Or The Dakota. Are you familiar with it?
>> Another good example is the Gherkin in London is the mayor's office. I'm not sure if that's what it's called. It's a colloquial name. Another example of alternative names. We import a lot of them, but when I did an analysis of name tags in OpenStreetMap, it might be on the wiki as well -- it's horrific. Just look at it. It's nasty.
>> We started a Slack channel, just out of the geocoders birds-of-a-feather session that happened on Saturday, and it's geocoders.slack.com. If you send an email to us at Pelias, I'll add you guys to the conversation. It's just to get the community discussing these kinds of issues related to geocoding. So we'd love to have your feedback.
>> It would be great to work on it together. You can see the complexity involved with it. If you want to make your own geocoder, I encourage you to work with us. To make one for everybody, rather than making your own one.
>> And we talked about building a comprehensive acceptance test suite -- so if you have things that you want to throw in there, that are important to you, as validation for any geocoder, not just ours, we'd love to hear about that as well.
>> Cool, thank you.
(applause)
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=