Cut columns filter
TODO: update to include Python and the HXL Proxy as well.
The hxlcut command-line tool creates a new copy of a HXL dataset with some of the columns removed. For example, you can use it in a batch script to remove columns containing personally identifiable information (such as #email) before each public release of a dataset.
There are two ways to use this command:
- Provide a whitelist of HXL hashtags for columns to include — only the listed columns will appear in the output.
- Provide a blacklist of HXL hashtags for columns to exclude — everything except the listed columns will appear in the output.
If security is a major concern, the whitelist approach ensures that any new columns you add to your source dataset won't accidentally leak out, because you have to add them to the whitelist explicitly. If robustness is a major concern, the blacklist approach ensures that any new columns you add to your source dataset won't be omitted from the output.
Note that if there are multiple columns with the same HXL hashtag, this command operates on all of them.
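The include and exclude behaviour described above can be illustrated with a minimal pure-Python sketch. This is not the hxlcut implementation: `normalize` and `cut_columns` are hypothetical helpers written for this page, the email addresses are placeholders, and the sketch matches on the exact hashtag only (it ignores HXL attributes).

```python
def normalize(tag):
    """Return a hashtag in lower case with a leading '#' (hypothetical helper)."""
    tag = tag.strip().lower()
    return tag if tag.startswith("#") else "#" + tag


def cut_columns(hashtags, rows, include=None, exclude=None):
    """Filter columns by hashtag; a sketch, not part of the HXL tools.

    include: if given, only columns with these hashtags are kept.
    exclude: columns with these hashtags are dropped.
    Every column carrying a matching hashtag is affected, not just the first.
    """
    include = {normalize(t) for t in include} if include else None
    exclude = {normalize(t) for t in exclude} if exclude else set()
    keep = [
        i for i, tag in enumerate(hashtags)
        if (include is None or normalize(tag) in include)
        and normalize(tag) not in exclude
    ]
    new_tags = [hashtags[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_tags, new_rows


# Placeholder data; the addresses are invented for illustration only.
hashtags = ["#org", "#email", "#sector", "#adm1"]
rows = [
    ["Org1", "contact1@example.org", "Health", "Coast"],
    ["Org2", "contact2@example.org", "Health", "Mountains"],
]

tags, _ = cut_columns(hashtags, rows, include=["sector", "adm1"])
print(tags)  # ['#sector', '#adm1']

tags, _ = cut_columns(hashtags, rows, exclude=["email"])
print(tags)  # ['#org', '#sector', '#adm1']
```

With the include list, a column added to the source later stays out of the output until it is listed explicitly. With the exclude list, the new column passes through automatically.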
usage: hxlcut [-h] [-i tag,tag...] [-x tag,tag...] [infile] [outfile]
Cut columns from a HXL dataset.
positional arguments:
infile HXL file to read (if omitted, use standard input).
outfile HXL file to write (if omitted, use standard output).
optional arguments:
-h, --help show this help message and exit
-i tag,tag..., --include tag,tag...
Comma-separated list of column tags to include
-x tag,tag..., --exclude tag,tag...
Comma-separated list of column tags to exclude
Starting dataset:
Implementing organisation | Contact | Cluster or sector | District |
---|---|---|---|
#org | #email | #sector | #adm1 |
Org1 | [email protected] | Health | Coast |
Org1 | [email protected] | Education | Coast |
Org2 | [email protected] | Health | Mountains |
You want to produce a dataset containing only #sector and #adm1, no matter what additional columns appear:
hxlcut --include sector,adm1 MyData.csv
Result:
#sector | #adm1 |
---|---|
Health | Coast |
Education | Coast |
Health | Mountains |
You want to remove the #email column for privacy reasons, but retain any other columns in the source dataset:
hxlcut --exclude email MyData.csv
Result:
#org | #sector | #adm1 |
---|---|---|
Org1 | Health | Coast |
Org1 | Education | Coast |
Org2 | Health | Mountains |
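The pre-release use case mentioned at the top of this page could be scripted roughly as follows. This is a sketch that assumes hxlcut is installed and on the PATH. The function names `build_hxlcut_command` and `publish` and the file names are hypothetical, and only the `--exclude` option and the positional infile/outfile arguments documented above are used.

```python
import subprocess


def build_hxlcut_command(infile, outfile, exclude):
    """Build the argument list for an hxlcut call that drops the given hashtags."""
    return ["hxlcut", "--exclude", ",".join(exclude), infile, outfile]


def publish(infile, outfile, exclude):
    """Run hxlcut and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_hxlcut_command(infile, outfile, exclude), check=True)
```

Calling `publish("MyData.csv", "MyData-public.csv", ["email"])` would then run the same command as the example above, writing to a file instead of standard output.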
Standard: http://hxlstandard.org | Mailing list: [email protected]