-
Notifications
You must be signed in to change notification settings - Fork 11
Select rows filter
TODO: update to include Python and the HXL Proxy as well.
The hxlselect command-line tool creates a new copy of a HXL dataset including only the rows that match certain criteria. You can use this tool to create (for example) a dataset containing only information for a specific geographical region, sector, beneficiary group, etc.
This command is especially useful in pipelines: you can use it to select only certain rows for other commands to act on. If you provide multiple filters, they act as a logical OR; use hxlselect again in a pipeline to get a logical AND.
For example, to create a GeoJSON map layer of only rows related to the WASH sector, you can use a pipeline like
hxlselect --query sector=WASH my-data.csv | \
hxl2geojson > map-layer.json
There are many more similar examples in the HXL cookbook.
usage: hxlselect [-h] [-q tag=value] [-v] [infile] [outfile]
Filter rows in a HXL dataset.
positional arguments:
infile HXL file to read (if omitted, use standard input).
outfile HXL file to write (if omitted, use standard output).
optional arguments:
-h, --help show this help message and exit
-q tag=value, --query tag=value
expression for filtering (use multiple times for
logical OR): <hashtag><op><value>, where <op> is =,
!=, <, <=, >, >=, ~, or !~
-v, --invert Show only lines *not* matching criteria
Note that empty fields always fail a comparison. That means that --query adm1!=XXX
will not match an empty field, only one that has an explicit value.
hxlselect supports the following comparison operators:
= |
The hashtag value is identical to the value provided. |
---|---|
!= |
The hashtag value is not identical to the value provided. |
< |
The hashtag value is less than the value provided (may be unpredictable for non-numeric values). |
<= |
The hashtag value is less than or equal to the value provided (may be unpredictable for non-numeric values). |
> |
The hashtag value is greater than the value provided (may be unpredictable for non-numeric values). |
>= |
The hashtag value is greater than or equal to the value provided (may be unpredictable for non-numeric values). |
~ |
The hashtag value matches the [regular-expression pattern](https://docs.python.org/2/library/re.html#regular-expression-syntax) provided. |
!~ |
The hashtag value does not match the regular-expression pattern provided. |
Starting dataset:
Implementing organisation | Cluster or sector | District | Beneficiaries |
---|---|---|---|
#org | #sector | #adm1 | #targeted_num |
Org1 | Health | Coast | 10,000 |
Org1 | Education | Coast | 25,000 |
Org2 | Health | Mountains | 3,500 |
Generate a new dataset, including only the rows where the #org is "Org1":
hxlselect --query org=Org1 \
<dataset-in >dataset-out
Result:
Implementing organisation | Cluster or sector | District | Beneficiaries |
---|---|---|---|
#org | #sector | #adm1 | #targeted_num |
Org1 | Health | Coast | 10,000 |
Org1 | Education | Coast | 25,000 |
Generate a new dataset where the rows are only those that don't contain a #sector equal to "Education":
hxlselect --invert --query sector=Education \
<dataset-in.csv >dataset-out.csv
Result:
Implementing organisation | Cluster or sector | District | Beneficiaries |
---|---|---|---|
#org | #sector | #adm1 | #targeted_num |
Org1 | Health | Coast | 10,000 |
Org2 | Health | Mountains | 3,500 |
Include only the rows where #target_num is greater than or equal to 10,000:
hxlselect --query 'targeted_num>=10000' \
<dataset-in.csv >dataset-out.csv
Implementing organisation | Cluster or sector | District | Beneficiaries |
---|---|---|---|
#org | #sector | #adm1 | #targeted_num |
Org1 | Health | Coast | 10,000 |
Org1 | Education | Coast | 25,000 |
Standard: http://hxlstandard.org | Mailing list: [email protected]