A guide to parsing and analysing OSM data using Python and Osmium.

Introduction

OpenStreetMap (OSM) is a collaborative project to create a free and editable map of the world. OSM data is available in various formats, such as XML, JSON, and PBF. PBF is a binary format that is more compact and faster to process than XML or JSON. In this post, we will show you how to get statistics from OSM data files using Python and Osmium, a library for working with OSM data in various languages.

The main goal of the project is to extract meaningful statistics from OSM data files, such as the number and types of nodes, ways, and relations, the distribution of tags, the length and area of ways, and so on. These statistics can help us understand the characteristics and quality of OSM data, as well as identify potential issues or errors.

Process & Code Walkthrough

The first step is to obtain the PBF file, which can be downloaded from various sources such as Geofabrik or BBBike, or acquired using the following steps:

  1. Get the GeoJSON first from geojson.io (powered by Mapbox).
  2. Next, generate the PBF file using any available service or an open-source tool such as Osmium or Tippecanoe. Gaussian Solutions also has utilities for cutting PBF extracts.
  3. Save the PBF and run it through the analyser code below.

We will use the PBF file for the city of Seattle, Washington State, USA, as an example. The file contains over 8 million elements.

Pre-requisites

  1. Python environment
  2. Pip packages:
    • Pyosmium
    • networkx
    • FastAPI and others, if you want to serve the data via API.
    • Matplotlib for graph rendering if required.

We first count the total number of elements in the OSM file using a customized Osmium Simple Handler class called ElementCounter. The osmium.SimpleHandler is a class that defines methods for handling different types of OSM elements, such as nodes, ways, and relations. We can override these methods to implement our own logic for processing the elements. In this case, we simply increment a counter for each element type.

We then initialize another customized Osmium Simple Handler class called StatsAnalyzer with the total number of elements from above and run it on the same PBF file again. (Running two passes is more efficient than calculating everything in one go.) This class is responsible for parsing the PBF file and gathering statistics about the data. We also declare a couple of Enum classes to define the types of ways and nodes in the OSM file for which we need statistics. For example, we define a WayType Enum class with values such as HIGHWAY, FOOTWAY, SIDEWALK, etc. We use these Enums to classify the ways and nodes based on their tags and attributes.
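The Enum classes could look roughly like the sketch below. HIGHWAY, FOOTWAY, and SIDEWALK come from the description above; the remaining members, the NodeType class, and the matching rule in classify_way (a member matches when its value appears as a tag key or a tag value) are assumptions made for illustration.

```python
from enum import Enum


class WayType(Enum):
    HIGHWAY = "highway"
    FOOTWAY = "footway"
    SIDEWALK = "sidewalk"
    RESIDENTIAL = "residential"  # assumed extra member
    SERVICE = "service"          # assumed extra member


class NodeType(Enum):
    # Assumed members, based on the kerb and crossing counters in this post.
    KERB = "kerb"
    CROSSING = "crossing"


# Lookup table from the tag text to the Enum member.
_WAY_VALUES = {member.value: member for member in WayType}


def classify_way(tags):
    """Return the WayType members whose value appears as a tag key or value."""
    found = set()
    for key, value in tags.items():
        for text in (key, value):
            if text in _WAY_VALUES:
                found.add(_WAY_VALUES[text])
    return found
```

For example, a way tagged highway=footway and footway=sidewalk would match HIGHWAY, FOOTWAY, and SIDEWALK under this rule.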

Code Walkthrough

We override the way handler from osmium.SimpleHandler as explained below:

  1. Calculate the distance of the way using the osmium.geom.haversine_distance function and add it to the total distance of ways.
  2. Iterate over the tags of the way. If a tag key is HIGHWAY, calculate the distance of the way and add it to the total distance of highways.
  3. Depending on the type of the way (sidewalk, footway, or pedestrian), add the nodes of the way to either the sidewalk graph or the highway graph.
  4. We use the networkx Python library for adding the nodes to the graph.
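The steps above can be sketched in plain Python as follows. This is a stand-in under stated assumptions, not the handler used for this post: a way is represented as a list of (node_id, lat, lon) tuples, the haversine formula is written out by hand with an assumed mean Earth radius instead of calling osmium.geom.haversine_distance, and the graphs are adjacency dictionaries instead of networkx graphs. The WayStats class and the PEDESTRIAN_VALUES set are hypothetical names.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6_371_000  # assumed mean Earth radius in metres
PEDESTRIAN_VALUES = {"sidewalk", "footway", "pedestrian"}


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def way_length_m(points):
    """Sum the segment lengths of a way given as [(node_id, lat, lon), ...]."""
    return sum(haversine_m(a[1], a[2], b[1], b[2]) for a, b in zip(points, points[1:]))


class WayStats:
    def __init__(self):
        self.total_m = 0.0
        self.highway_m = 0.0
        self.sidewalk_graph = defaultdict(set)  # node id -> neighbouring node ids
        self.highway_graph = defaultdict(set)

    def add_way(self, points, tags):
        length = way_length_m(points)
        self.total_m += length                  # step 1: every way counts
        if "highway" not in tags:
            return
        self.highway_m += length                # step 2: ways with a highway key
        is_pedestrian = (tags["highway"] in PEDESTRIAN_VALUES
                         or tags.get("footway") == "sidewalk")
        graph = self.sidewalk_graph if is_pedestrian else self.highway_graph
        for a, b in zip(points, points[1:]):    # step 3: connect consecutive nodes
            graph[a[0]].add(b[0])
            graph[b[0]].add(a[0])
```

Inside a pyosmium way callback, the length would instead come from osmium.geom.haversine_distance(w.nodes), which needs node locations to be available (for example by passing locations=True to apply_file), and the edges would go into a networkx Graph via add_edge.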

To calculate the intersections, we use the graph from above:

  1. From this graph, we calculate the intersections for highways and sidewalks/footways.
  2. For each node, we check its degree in the graph; a degree greater than 2 marks an intersection, which we record in our stats.
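The degree check can be sketched as below. The graph is assumed to be an adjacency dictionary mapping each node id to the set of its neighbours; with networkx, G.degree() yields the same (node, degree) pairs. The function name and the min_degree parameter are illustrative.

```python
from collections import Counter


def intersection_histogram(adjacency, min_degree=3):
    """Count intersection nodes grouped by degree.

    A node is treated as an intersection when its degree is greater than 2,
    i.e. at least min_degree=3.
    """
    histogram = Counter()
    for node, neighbours in adjacency.items():
        degree = len(neighbours)
        if degree >= min_degree:
            histogram[degree] += 1
    return dict(sorted(histogram.items()))


# A four-way crossing at node 0 plus a T-junction at node 5.
graph = {
    0: {1, 2, 3, 4},
    1: {0}, 2: {0}, 3: {0},
    4: {0, 5},
    5: {4, 6, 7},
    6: {5}, 7: {5},
}
print(intersection_histogram(graph))
```

Nodes of degree 1 (dead ends) and degree 2 (points along a way) are skipped, so only genuine junctions end up in the histogram.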

We also need to keep track of new tag values that are not in the Enum. When one is encountered, the code prints a console message indicating that the Way Types need to be updated. We can then update the Enum as needed to get statistics for the new tag.
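One way to detect values missing from the Enum is to rely on the ValueError that Enum lookup raises for an unknown value. The sketch below assumes a small WayType Enum and a hypothetical UnknownTypeTracker class; the message wording is illustrative.

```python
from enum import Enum


class WayType(Enum):
    RESIDENTIAL = "residential"
    SERVICE = "service"
    FOOTWAY = "footway"


class UnknownTypeTracker:
    """Resolves tag values to WayType and remembers the ones it cannot resolve."""

    def __init__(self):
        self.unknown = set()

    def resolve(self, value):
        try:
            return WayType(value)
        except ValueError:
            # Report each unknown value only once to keep the console readable.
            if value not in self.unknown:
                self.unknown.add(value)
                print(f"Unknown way type {value!r}: consider adding it to WayType")
            return None
```

After a run, the unknown set lists every value that should be considered for the Enum.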

Apart from the above, we also maintain counters for nodes, ways, relations, kerbs, and crossings.

Once the processing is complete, we return a JSON file containing the following statistics:

  1. The number of nodes in the OSM file.
  2. The number of ways in the OSM file.
  3. The total distance for each way type in the OSM file, in metres. For example, the total length of highways in the map area, or the total length of sidewalks/footways.
  4. The total kerbs count in the file.
  5. The total crossings count in the file.
  6. The intersections with highways. The intersections are classified based on the number of highways at the intersection.

Below is a sample analytics response for the Albany County, NY area (the source GeoJSON polygon defining the region is omitted here for brevity):

{
  "overall_stats": {
    "title": "Overall PBF Stats",
    "data": [
      {"label": "Total Elements", "value": 1341306},
      {"label": "Node Count", "value": 1216247},
      {"label": "Relations Count", "value": 1444},
      {"label": "Kerb Count", "value": 22},
      {"label": "Crossings Count", "value": 1804}
    ]
  },
  "way_distances": {
    "title": "Way Distances",
    "data": [
      {"label": "overall", "value": 19449212.58},
      {"label": "residential", "value": 1699453.75},
      {"label": "highway", "value": 7165006.08},
      {"label": "tertiary", "value": 755785.85},
      {"label": "service", "value": 2300961.02},
      {"label": "footway", "value": 323957.4},
      {"label": "path", "value": 485078.91},
      {"label": "sidewalk", "value": 87058.51},
      {"label": "cycleway", "value": 60966.37}
    ]
  },
  "way_counts": {
    "title": "Way Count",
    "data": [
      {"label": "overall", "value": 123615},
      {"label": "residential", "value": 5650},
      {"label": "highway", "value": 35457},
      {"label": "service", "value": 20222},
      {"label": "footway", "value": 3542},
      {"label": "path", "value": 1303},
      {"label": "sidewalk", "value": 656}
    ]
  },
  "highway_intersections": {
    "title": "Highway Intersections",
    "data": [
      {"label": "4", "value": 5572},
      {"label": "5", "value": 111},
      {"label": "6", "value": 6},
      {"label": "7", "value": 2}
    ]
  }
}
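Every section of the response above has the same shape: a title plus a list of label/value pairs. A minimal sketch of assembling it from plain dictionaries follows; the section and build_report function names are hypothetical, and the sample values are taken from the response above.

```python
import json


def section(title, values):
    """Turn a {label: value} mapping into one titled section of the report."""
    return {
        "title": title,
        "data": [{"label": str(label), "value": value} for label, value in values.items()],
    }


def build_report(overall, way_distances, way_counts, highway_intersections):
    """Assemble the four sections into a single JSON-serialisable dictionary."""
    return {
        "overall_stats": section("Overall PBF Stats", overall),
        "way_distances": section("Way Distances", way_distances),
        "way_counts": section("Way Count", way_counts),
        "highway_intersections": section("Highway Intersections", highway_intersections),
    }


report = build_report(
    overall={"Total Elements": 1341306, "Node Count": 1216247},
    way_distances={"overall": 19449212.58, "sidewalk": 87058.51},
    way_counts={"overall": 123615, "sidewalk": 656},
    highway_intersections={4: 5572, 5: 111},
)
print(json.dumps(report, indent=2))
```

Converting the labels with str keeps the intersection degrees as strings, matching the sample response.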

The performance of the code for the above is approximately 50k elements per second. It takes around 197 seconds, or approximately 3 minutes and 17 seconds.

Analysis of Albany County, NY

Using the above data we now have the details for Albany County, NY on the road and sidewalk network. The highway network is around 4,452 miles (or 7,165,006 metres), whereas the sidewalk network is only around 54 miles (or 87,058 metres).
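The mile figures follow directly from the metre totals in the response. A small sketch of the conversion, using the international mile of 1609.344 metres (the function name is illustrative):

```python
METRES_PER_MILE = 1609.344  # international mile, exact by definition


def metres_to_miles(metres):
    return metres / METRES_PER_MILE


highway_miles = metres_to_miles(7165006.08)   # "highway" way distance
sidewalk_miles = metres_to_miles(87058.51)    # "sidewalk" way distance

print(f"Highways:  {highway_miles:,.0f} miles")
print(f"Sidewalks: {sidewalk_miles:,.0f} miles")
```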

When we plot both networks on a map, the contrast is clear: a sea of blue (the highway network) against the red (the sidewalk network).

Future Upgrades

  1. Reduce time taken to process the PBF file.
  2. Optimize memory usage.
  3. Improve performance using multi-threading or parallel processing.

Conclusion

This post has demonstrated how to use Python and the Osmium library to extract statistics from OSM data files. We created a custom osmium.SimpleHandler class to process and examine the data, and produced a JSON file with the results. We can also visualize the results as charts and plot them on a map, as demonstrated above. This project is a part of the complete suite of solutions that Gaussian Solutions offers for modernizing geospatial analysis and transit data pipelines.
