David Megginson edited this page Sep 1, 2015 · 2 revisions

TODO: update to include Python and the HXL Proxy as well.

The hxlcut command-line tool creates a new copy of a HXL dataset with some of the columns removed. For example, you can use it in a batch script to remove columns containing personally identifiable information (such as #email) before each public release of a dataset.

There are two ways to use this command:

  1. Provide a whitelist of HXL hashtags for columns to include — only the listed columns will appear in the output.
  2. Provide a blacklist of HXL hashtags for columns to exclude — everything except the listed columns will appear in the output.

If security is a major concern, the whitelist approach ensures that any new columns you add to your source dataset won't accidentally leak out, because you have to add them to the whitelist explicitly. If robustness is a major concern, the blacklist approach ensures that any new columns you add to your source dataset won't be omitted from the output.

Note that if there are multiple columns with the same HXL hashtag, this command operates on all of them.
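
To make the difference between the two approaches concrete, here is a minimal Python sketch of the filtering rule. It is not how hxlcut is implemented: the `cut_columns` function, the `#phone` column, and the sample values are all hypothetical, the first row is assumed to be the hashtag row, and tags are matched by exact name only.

```python
def cut_columns(rows, include=None, exclude=None):
    """Return a copy of rows keeping only the selected columns.

    rows    -- list of rows; rows[0] is assumed to be the HXL hashtag row
    include -- whitelist of tag names (without '#'), or None
    exclude -- blacklist of tag names (without '#'), or None
    """
    tags = [cell.lstrip("#") for cell in rows[0]]
    if include is not None:
        # Whitelist: keep a column only if its tag is listed.
        keep = [i for i, tag in enumerate(tags) if tag in include]
    elif exclude is not None:
        # Blacklist: keep a column unless its tag is listed.
        keep = [i for i, tag in enumerate(tags) if tag not in exclude]
    else:
        keep = list(range(len(tags)))
    return [[row[i] for i in keep] for row in rows]


# Hypothetical source data after someone adds a new #phone column.
extended = [
    ["#org", "#email", "#sector", "#adm1", "#phone"],
    ["Org1", "contact@example.org", "Health", "Coast", "555-0100"],
]

# Whitelist: the new #phone column stays out unless it is listed.
whitelisted = cut_columns(extended, include={"sector", "adm1"})

# Blacklist: the new #phone column passes through automatically.
blacklisted = cut_columns(extended, exclude={"email"})

# Repeated hashtags: every column carrying the tag is affected.
dupes = [["#org", "#sector", "#sector"], ["Org1", "Health", "Education"]]
without_sector = cut_columns(dupes, exclude={"sector"})
```

In this sketch the whitelist output never gains the unexpected column, while the blacklist output gains it without any change to the call, which is the trade-off described above.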

Usage

usage: hxlcut [-h] [-i tag,tag...] [-x tag,tag...] [infile] [outfile]

Cut columns from a HXL dataset.

positional arguments:
  infile                HXL file to read (if omitted, use standard input).
  outfile               HXL file to write (if omitted, use standard output).

optional arguments:
  -h, --help            show this help message and exit
  -i tag,tag..., --include tag,tag...
                        Comma-separated list of column tags to include
  -x tag,tag..., --exclude tag,tag...
                        Comma-separated list of column tags to exclude

Examples

Starting dataset:

Implementing organisation  Contact            Cluster or sector  District
#org                       #email             #sector            #adm1
Org1                       [email protected]  Health             Coast
Org1                       [email protected]  Education          Coast
Org2                       [email protected]  Health             Mountains

Example 1: using a whitelist

You want to produce a dataset containing only #sector and #adm1, no matter what additional columns appear:

hxlcut --include sector,adm1 MyData.csv

Result:

#sector #adm1
Health Coast
Education Coast
Health Mountains

Example 2: using a blacklist

You want to remove the #email column for privacy reasons, but retain any other columns in the source dataset:

hxlcut --exclude email MyData.csv

Result:

#org #sector #adm1
Org1 Health Coast
Org1 Education Coast
Org2 Health Mountains
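
For the batch-script use case mentioned at the top of this page, one option is to wrap the command in a small Python script. The sketch below is only an illustration: the function names and the `Public.csv` output file are hypothetical, it uses only the options listed under Usage, it assumes hxlcut is installed and on the PATH, and the rule that exactly one of the two lists must be given is a choice made for this sketch, not a documented restriction of the tool.

```python
import subprocess


def build_hxlcut_command(infile, outfile, include=None, exclude=None):
    """Build the argument list for one hxlcut run.

    Pass exactly one of include (whitelist) or exclude (blacklist),
    each a list of tag names without the leading '#'.
    """
    if (include is None) == (exclude is None):
        raise ValueError("pass exactly one of include or exclude")
    if include is not None:
        flag, tags = "--include", include
    else:
        flag, tags = "--exclude", exclude
    return ["hxlcut", flag, ",".join(tags), infile, outfile]


def publish(infile, outfile, include=None, exclude=None):
    """Run hxlcut and raise CalledProcessError if it fails."""
    command = build_hxlcut_command(infile, outfile, include, exclude)
    subprocess.run(command, check=True)


# Argument list for Example 2 (nothing is executed here).
command = build_hxlcut_command("MyData.csv", "Public.csv", exclude=["email"])
```

Calling `publish("MyData.csv", "Public.csv", exclude=["email"])` before each release would then run the same command as Example 2, writing the result to a file instead of standard output.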