osm-tag-extract is a utility to extract selected tag metadata from an OpenStreetMap (OSM) history data extract. The output data is written to a JSONL file (see below).
- Relies heavily on node-osmium for reading and parsing OSM input files. For a list of supported input formats, see the node-osmium documentation.
- Files are processed by streaming, so the entire input file is never loaded into memory at once. For example, the full OpenStreetMap dump with historical revisions is large (~65GB as of 7/2018), but it can be processed on a low-end cloud instance with only 4GB of memory in ~30 hours. See below for further details.
- osm-tag-extract can also be used to extract tag metadata from latest snapshots (data exports without historical revisions).
- Requires Node.js version 8.
One line is written to the output JSONL file for each map element that has at some point been tagged with any of the tags selected for extraction. Below is an example output line (pretty-printed for readability):
[
"N40",[
[1276372410, 1, 1, ["1:v2"]],
[1277305733, 4, 0, []],
[1277305733, 5, 1, []],
[1279547722, 9, 1, ["0:v2", "1:v2", "2:v3"]]
]
]
- N40 indicates that this is the history for node 40. Similarly, W40 and R40 would refer to way 40 and relation 40.
- The second component is an array that lists changes in 1) the selected tags and/or 2) the visible status.
- For example, for [1276372410, 1, 1, ["1:v2"]]:
  - the first component 1276372410 is the Unix epoch timestamp (in seconds) for this version.
  - the second component 1 is the version number.
  - the third component 1 indicates visible status (0=not visible/deleted, 1=visible).
  - the last array lists values for the selected tags. E.g., if the list of selected tags is ["amenity", "shop", "man_made"], then 1:v2 indicates that shop is set to v2. The indices are hexadecimal, so a:v2 would refer to the 11th tag.
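As an illustration of the format described above, here is a minimal sketch of decoding one output line in Node.js. The function names (decodeTagEntry, decodeLine) are hypothetical and not part of osm-tag-extract. The sketch assumes that the tag value is everything after the first colon, and that selectedTags lists the tags in the same order as they were given to --tags.

```javascript
// Hypothetical helpers for illustration; not part of osm-tag-extract.
const ELEMENT_TYPES = { N: "node", W: "way", R: "relation" };

// Decode e.g. "1:v2" into ["shop", "v2"]. The index is hexadecimal.
// Assumption: the value is everything after the first colon.
function decodeTagEntry(entry, selectedTags) {
  const sep = entry.indexOf(":");
  const index = parseInt(entry.slice(0, sep), 16);
  return [selectedTags[index], entry.slice(sep + 1)];
}

// Decode one JSONL line into a plain object.
function decodeLine(line, selectedTags) {
  const [elementId, revisions] = JSON.parse(line);
  return {
    type: ELEMENT_TYPES[elementId[0]],
    id: Number(elementId.slice(1)),
    revisions: revisions.map(([timestamp, version, visible, entries]) => {
      const tags = {};
      for (const entry of entries) {
        const [key, value] = decodeTagEntry(entry, selectedTags);
        tags[key] = value;
      }
      return { timestamp, version, visible: visible === 1, tags };
    }),
  };
}

const selectedTags = ["amenity", "shop", "man_made"];
const line = '["N40",[[1276372410,1,1,["1:v2"]],[1277305733,4,0,[]]]]';
console.log(JSON.stringify(decodeLine(line, selectedTags), null, 2));
```

The tag list is not stored in the output file, so whoever consumes the JSONL needs to keep track of the exact list (and order) used for extraction.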
The raw data for OpenStreetMap is a database containing three types of map elements: nodes, ways, and relations (see the OSM Wiki):
- A node (or a point) describes a (latitude, longitude) point on the map. This can, e.g., be used for a lamppost.
- A way is a list of nodes forming a line or a closed polygon (area). It is used, e.g., to model a road or a building outline.
- A relation describes a relationship between objects (nodes, ways, relations). A relation can, e.g., be used to describe that turning from one road to another is not possible.
To view the history for a map element, one can open https://www.openstreetmap.org/[node|way|relation]/<number>/history and click "Download XML". (For testing, most numbers below, say, 10000 seem to work.)
The meaning of a map element is described using key-value tags. For example, the tags {"amenity": "bench", "backrest": "yes"} indicate that a map element is a park bench.
Map elements are versioned and timestamped: editing an object creates a new version of the object with updated data. Similarly, an object is deleted by creating a new version with the visible flag set to false. Thus, any change or deletion can be reverted.
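Given this versioning model, the state of an element at a given time is the last revision at or before that time. The following is a minimal sketch that operates on the revision arrays of the output format described earlier. The names revisionAt and isDeletedAt are hypothetical, and the sketch assumes that revisions are listed in ascending version and timestamp order, as in the example output.

```javascript
// Hypothetical helpers for illustration; not part of osm-tag-extract.
// revisions: array of [timestamp, version, visible, tagEntries].
// Assumption: revisions are ordered by ascending version and timestamp.
function revisionAt(revisions, epochSeconds) {
  let found = null;
  for (const revision of revisions) {
    if (revision[0] > epochSeconds) break;
    found = revision;
  }
  return found; // null if the element did not exist yet
}

// True if the element existed at the given time but was marked as not visible.
function isDeletedAt(revisions, epochSeconds) {
  const revision = revisionAt(revisions, epochSeconds);
  return revision !== null && revision[2] === 0;
}

const history = [
  [1276372410, 1, 1, ["1:v2"]],
  [1277305733, 4, 0, []],
  [1277305733, 5, 1, []],
  [1279547722, 9, 1, ["0:v2", "1:v2", "2:v3"]],
];
// Versions 4 and 5 share a timestamp; the later one (version 5) is selected.
console.log(revisionAt(history, 1277305733));
console.log(isDeletedAt(history, 1277305733));
```

A linear scan keeps the sketch short; for elements with very long histories a binary search over the timestamps would be a natural alternative.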
Various exports of the OSM data are available:
- the latest data (with no edit histories): updated daily (~40GB as of 7/2018).
- the full history: updated weekly (~65GB as of 7/2018).
For fast processing, one should use the pbf input files (protobuf based).
The tests rely on synthetic test data which (due to its somewhat large size) is stored in a separate testdata git submodule. Clone the repo with --recurse-submodules to also fetch the test data.
git clone --recurse-submodules git@github.com:tagdynamics-org/osm-extract-tags.git
Fetching the test data is not needed to just run osm-tag-extract.
npm install
npm run test
npm run test:watch
npm install
npm run tag-extract --tags=tags,to,extract --input-file=<input osm file> --output-file=<output.jsonl>
Suitable tags to extract could be:
export TAGS=amenity,barrier,building,highway,landuse,leisure,man_made,natural,railway,shop,sport,surface,tourism
The below numbers and runtimes are based on these tags.
Note:
- Colons (or commas) are not allowed in the selected tags, see Output format above.
- See the taginfo website to explore tags and for more details about tags.
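To make the restriction on colons concrete, the following minimal sketch splits a comma-separated tag list, rejects tags containing a colon, and shows the hexadecimal index that each tag would get in the output. The function name buildTagIndex is hypothetical, the duplicate check is an added assumption, and osm-tag-extract itself may validate its --tags argument differently.

```javascript
// Hypothetical helper for illustration; not part of osm-tag-extract.
function buildTagIndex(tagsArgument) {
  const tags = tagsArgument.split(",").map((tag) => tag.trim());
  const index = new Map();
  tags.forEach((tag, position) => {
    if (tag.length === 0) throw new Error("Empty tag name");
    // A colon would be ambiguous with the "index:value" separator in the output.
    if (tag.includes(":")) throw new Error(`Colon not allowed in tag: ${tag}`);
    // Assumption: duplicates are treated as an error in this sketch.
    if (index.has(tag)) throw new Error(`Duplicate tag: ${tag}`);
    index.set(tag, position.toString(16)); // hexadecimal, as used in the output
  });
  return index;
}

const TAGS =
  "amenity,barrier,building,highway,landuse,leisure,man_made,natural,railway,shop,sport,surface,tourism";
for (const [tag, hexIndex] of buildTagIndex(TAGS)) {
  console.log(`${hexIndex} -> ${tag}`);
}
```

With the 13 tags suggested above, the indices run from 0 to c, so an output entry starting with a: would refer to sport.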
Extracting data from large OSM extracts takes a long time and requires a lot of temporary disk space. A cloud instance can be used for this purpose.
The below instructions describe how this can be done on AWS using:
- 64 bit Ubuntu Server 14.04 (Trusty) LTS.
- AWS t2.medium (2vCPU, 4G memory). Pricing: current gen, past generations
- 175G of "General purpose SSD" (TODO: check how much is really needed)
# log into instance and install dependencies
sudo apt-get -y update && sudo apt-get -y upgrade
sudo apt-get -y install git zip mg tmux
git clone --recurse-submodules https://github.com/tagdynamics-org/osm-extract-tags.git
# Install docker as described here
# https://docs.docker.com/install/linux/docker-ce/ubuntu/#set-up-the-repository
sudo apt-get install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update && sudo apt-get -y install docker-ce
# start shell in node8 container
tmux
sudo docker pull node:8.11.3
sudo docker run -v `pwd`/data/:/data -v `pwd`/osm-extract-tags:/code -it --rm node:8.11.3 /bin/bash
# run inside docker (as a script)
set -eux
mkdir -p /data/osm-input # files downloaded directly from OSM
mkdir -p /data/tag-metadata # everything we compute
# The below takes ~30 minutes (35.1MB/s), size: ~67G.
date -I > /data/osm-input/download-date
wget -O /data/osm-input/history.osm.pbf https://planet.openstreetmap.org/pbf/full-history/history-latest.osm.pbf
wget -O /data/osm-input/history.osm.pbf.md5 https://planet.openstreetmap.org/pbf/full-history/history-latest.osm.pbf.md5
# TODO: check that checksum is correct
cd /code
npm install
npm run test
# select tags to extract (set eg. $TAGS as above)
# The below step will take ~33.5 hours. Output JSONL size: ~29G
export TAGS=<see above>
time npm run tag-extract --tags=$TAGS --input-file=/data/osm-input/history.osm.pbf --output-file=/data/tag-metadata/tag-history.jsonl
## Packaging and uploading data to s3
set -eux
# Before running set output bucket name
#
# export S3_OUTPUT_BUCKET_NAME=
#
# See
# https://stackoverflow.com/questions/23929235/multi-line-string-with-extra-space-preserved-indentation
LICENSE_NOTICE=`cat << END
The data files in this zip-file are extracted from the full
OpenStreetMap data export that includes all edit histories.
This data is (c) OpenStreetMap contributors and distributed
under the Open Database License (ODbL), see:
https://www.openstreetmap.org/copyright
The download date and md5 checksum of the original .osm.pbf
data export are included in this zip file. These can be used to
determine the exact data dump that was used.
For further details on how the data was extracted and
processed, please see the source repositories
- https://github.com/tagdynamics-org/osm-extract-tags
- https://github.com/tagdynamics-org/osm-tag-aggregator
[email protected]
END
`
# ubuntu should have access to data directory (created by root running docker)
sudo chgrp -R ubuntu data
sudo chown -R ubuntu data
echo "$LICENSE_NOTICE" > ./data/OSM_LICENSE.txt
export DATA_DIR=./data/
export DOWNLOAD_DATE=`cat data/osm-input/download-date`
zip osm-transitions-${DOWNLOAD_DATE}.zip \
$DATA_DIR/OSM_LICENSE.txt \
$DATA_DIR/osm-input/download-date \
$DATA_DIR/osm-input/history.osm.pbf.md5 \
$DATA_DIR/tag-metadata/tag-history.jsonl
# TODO: install s3cmd v1.5+ to support AWS IAM roles. Ensure instance has a role attached with s3 access
s3cmd put osm-transitions-${DOWNLOAD_DATE}.zip s3://$S3_OUTPUT_BUCKET_NAME/
Ideas, questions or contributions are welcome.
Copyright 2018 Matias Dahl. Released under the MIT license.
Please note that osm-tag-extract is designed to process OpenStreetMap data. This data is available under the Open Database License. See also the OSMF wiki regarding OpenStreetMap data and the GDPR.