Skip to content
David Megginson edited this page Sep 1, 2015 · 2 revisions

TODO: update to include Python and the HXL Proxy as well.

The hxlselect command-line tool creates a new copy of a HXL dataset including only the rows that match certain criteria. You can use this tool to create (for example) a dataset containing only information for a specific geographical region, sector, beneficiary group, etc.

This command is especially useful in pipelines: you can use it to select only certain rows for other commands to act on. If you provide multiple filters, they act as a logical OR; use hxlselect again in a pipeline to get a logical AND.

For example, to create a GeoJSON map layer of only rows related to the WASH sector, you can use a pipeline like

hxlselect --query sector=WASH my-data.csv | \
  hxl2geojson > map-layer.json

There are many more similar examples in the HXL cookbook.

Usage

usage: hxlselect [-h] [-q tag=value] [-v] [infile] [outfile]

Filter rows in a HXL dataset.

positional arguments:
  infile                HXL file to read (if omitted, use standard input).
  outfile               HXL file to write (if omitted, use standard output).

optional arguments:
  -h, --help            show this help message and exit
  -q tag=value, --query tag=value
                        expression for filtering (use multiple times for
                        logical OR): <hashtag><op><value>, where <op> is =,
                        !=, <, <=, >, >=, ~, or !~
  -v, --invert          Show only lines *not* matching criteria

Note that empty fields always fail a comparison. That means that --query adm1!=XXX will not match an empty field, only one that has an explicit value.

Comparison operators

hxlselect supports the following comparison operators:

= The hashtag value is identical to the value provided.
!= The hashtag value is not identical to the value provided.
< The hashtag value is less than the value provided (may be unpredictable for non-numeric values).
<= The hashtag value is less than or equal to the value provided (may be unpredictable for non-numeric values).
> The hashtag value is greater than the value provided (may be unpredictable for non-numeric values).
>= The hashtag value is greater than or equal to the value provided (may be unpredictable for non-numeric values).
~ The hashtag value matches the [regular-expression pattern](https://docs.python.org/2/library/re.html#regular-expression-syntax) provided.
!~ The hashtag value does not match the regular-expression pattern provided.

Examples

Starting dataset:

Implementing organisation Cluster or sector District Beneficiaries
#org #sector #adm1 #targeted_num
Org1 Health Coast 10,000
Org1 Education Coast 25,000
Org2 Health Mountains 3,500

Example 1: restricting to a single #org

Generate a new dataset, including only the rows where the #org is "Org1":

hxlselect --query org=Org1 \
  <dataset-in >dataset-out

Result:

Implementing organisation Cluster or sector District Beneficiaries
#org #sector #adm1 #targeted_num
Org1 Health Coast 10,000
Org1 Education Coast 25,000

Example 2: excluding a #sector

Generate a new dataset where the rows are only those that don't contain a #sector equal to "Education":

hxlselect --invert --query sector=Education \
  <dataset-in.csv >dataset-out.csv

Result:

Implementing organisation Cluster or sector District Beneficiaries
#org #sector #adm1 #targeted_num
Org1 Health Coast 10,000
Org2 Health Mountains 3,500

Example 3: limiting by count

Include only the rows where #target_num is greater than or equal to 10,000:

hxlselect --query 'targeted_num>=10000' \
  <dataset-in.csv >dataset-out.csv
Implementing organisation Cluster or sector District Beneficiaries
#org #sector #adm1 #targeted_num
Org1 Health Coast 10,000
Org1 Education Coast 25,000