libhxl code walkthrough

This is a high-level walkthrough of the libhxl-python code base, to help new contributors find their way around. For library users (rather than contributors), a high-level introduction, a coding quick start, and detailed API documentation are also available.

Top level

Directories:

hxl/ — root of main library source code.
tests/ — PyUnit unit tests and input files (run with python setup.py test).

Files:

CHANGELOG — list of feature-level changes for each release in GNU Change Log format. These changes should be much higher-level than the output of git log, focusing on what users will or should notice between releases. Keep it up to date whenever you add, remove, or change a feature.
MANIFEST.in — a Python Manifest file, used only to ensure that CSV files (such as the default HXL schema) are included in the distribution. Edit this only if you need to include other types of non-Python files as well.
README.md — a short overview of the package in Markdown format. Keep this up to date for major changes in usage or installation (even higher level than in the CHANGELOG).
requirements.txt — list of other Python packages required by libhxl-python.
setup.py — a standard Python setup file for installing the library, building and uploading distributions, running tests, etc.

Main library code

The library has three primary modules:

Classes in hxl.model define the representation of HXL data.
Classes in hxl.filters define operations on the data defined by hxl.model.
Classes and functions in hxl.io define how to load and save the data defined by hxl.model.

There are other modules under hxl, but these three are the heart of the library.

Design caveat:

The hxl.model.Dataset class is aware of (some of) the operations in hxl.filters: for example, the hxl.model.Dataset.count method creates and returns a hxl.filters.CountFilter embedding the original dataset. That creates a two-way dependency between hxl.model.Dataset and hxl.filters.CountFilter, which is usually a bad choice in software design; however, doing so allows libhxl to use the jQuery coding style of chained filters as methods, which is both popular and elegant for the code user. That means that instead of writing

AddColumnsFilter(CountFilter(data(url), '#org'), "#country=Malaysia")

users can write:

hxl.data(url).count('#org').add_columns("#country=Malaysia")

Making the library itself less elegant so that users can make their code more elegant is a fair trade-off.
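The trade-off can be illustrated with a minimal, self-contained sketch. It does not use libhxl at all: ToyDataset, ToySelectFilter, ToyUpperFilter, and the methods with_rows and upper are hypothetical names invented for this example. It only shows why the base class has to know about its filters for chaining to work.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset: an iterable of rows (lists of strings)."""

    def __init__(self, rows):
        self._rows = rows

    def __iter__(self):
        return iter(self._rows)

    # Convenience methods: the model refers to the filter classes below,
    # and the filters derive from the model. That is the two-way dependency.
    def with_rows(self, predicate):
        return ToySelectFilter(self, predicate)

    def upper(self):
        return ToyUpperFilter(self)


class ToySelectFilter(ToyDataset):
    """Keep only the rows for which predicate(row) is true."""

    def __init__(self, source, predicate):
        self.source = source
        self.predicate = predicate

    def __iter__(self):
        return (row for row in self.source if self.predicate(row))


class ToyUpperFilter(ToyDataset):
    """Upper-case every value in every row."""

    def __init__(self, source):
        self.source = source

    def __iter__(self):
        return ([value.upper() for value in row] for row in self.source)


data = ToyDataset([["unicef", "Mali"], ["who", "Chad"]])


def in_mali(row):
    return row[1] == "Mali"


# Nested constructor style: reads inside-out.
nested = list(ToyUpperFilter(ToySelectFilter(data, in_mali)))

# Chained style: reads left to right, because each filter is also a dataset.
chained = list(data.with_rows(in_mali).upper())
```

Both styles build the same chain of objects; the chained form simply moves the constructor calls into methods on the base class.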

`hxl/__init__.py`

This is the top-level file that identifies hxl/ as a Python package. It imports several names to the top level for convenience, including (most importantly) hxl.io.data, so that someone can load a HXL dataset using simple calls like this:

hxl.data('http://example.org/data.csv')

`hxl/model.py`

The hxl.model module defines the classes that make up the main data model for HXL, independent of filters, I/O, and so on. It is comparable to the Model part of a web-development 'Model-View-Controller' (MVC) design.

A library user always starts working with a dataset (often created using the hxl.data function), then moves down to specific columns and rows as needed. There are also classes that let users construct patterns for matching hashtags and queries for matching rows.

Main classes

hxl.model.Dataset — abstract base class for any HXL dataset, defining the basic operations that are available (hxl.data(url) returns an object implementing this class or a subclass of it). A dataset contains multiple column definitions and row objects.
hxl.model.Column — definition of a single column in a dataset, including a text header and a HXL hashtag (possibly with attributes). There is a static method for parsing a HXL hashtag from a string.
hxl.model.Row — a row of data in a dataset, with references to the columns for convenience, and methods for finding information in the row.

Helper classes

hxl.model.TagPattern — a class representing a tag pattern for matching a column by HXL hashtag. You can use tag pattern objects to search or query columns to find matches, much like a regular expression pattern with text. There are static methods for parsing a single tag pattern or a list of tag patterns from strings.
hxl.model.RowQuery — a class representing a row query for matching a row based on its contents. You can use row query objects to search or query rows of data, much like a regular expression pattern with text. There are static methods for parsing a single row query or a list of queries from strings, and for matching a list of queries against a row.
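To make the idea of a tag pattern concrete, here is a minimal sketch of the matching logic, assuming the usual HXL pattern syntax in which +attr means the attribute must be present and -attr means it must be absent. The functions parse_hashtag, parse_pattern, and matches are hypothetical helpers written for this example; they are not the hxl.model.TagPattern implementation, which handles more cases.

```python
import re


def parse_hashtag(spec):
    """Split a hashtag such as '#org+funder' into (tag, set of attributes)."""
    tag, *attributes = spec.strip().lower().split("+")
    return tag, set(attributes)


def parse_pattern(spec):
    """Split a pattern such as '#org+funder-gov' into (tag, required, excluded)."""
    match = re.fullmatch(r"(#\w+)((?:[+-]\w+)*)", spec.strip().lower())
    if match is None:
        raise ValueError(f"Malformed tag pattern: {spec!r}")
    modifiers = re.findall(r"([+-])(\w+)", match.group(2))
    required = {name for sign, name in modifiers if sign == "+"}
    excluded = {name for sign, name in modifiers if sign == "-"}
    return match.group(1), required, excluded


def matches(pattern_spec, hashtag_spec):
    """True if the column hashtag satisfies the pattern."""
    tag, required, excluded = parse_pattern(pattern_spec)
    column_tag, column_attributes = parse_hashtag(hashtag_spec)
    return (
        tag == column_tag
        and required <= column_attributes        # every required attribute is present
        and not (excluded & column_attributes)   # no excluded attribute is present
    )


# A bare tag matches any column with that tag, whatever its attributes.
print(matches("#org", "#org+funder"))
# An excluded attribute that is present makes the match fail.
print(matches("#org+funder-gov", "#org+funder+gov"))
```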

`hxl/io.py`

The hxl.io module defines how to load HXL data from, and write it to, external representations such as CSV or Excel files. It is comparable to the Controller part of a web-development 'Model-View-Controller' (MVC) design.

The library can read data over the Internet or from a local file system, and can also parse data structures like Python arrays. It has special support for extracting data from Excel files, Google Sheets, Dropbox resources, or HDX resources—for example, given the URL of a Google Sheet tab, the module will automatically construct the correct URL for generating a CSV version of that tab.
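As an illustration of the kind of URL rewriting described above, the following sketch turns a Google Sheets editing URL into the corresponding CSV export URL. It is a simplified stand-in, not the code in hxl.io.munge_url: the function name to_csv_export_url and the regular expression are assumptions made for this example, and the real module covers more URL shapes and other services.

```python
import re

# Matches the spreadsheet ID and, if present, the tab's gid in the URL fragment.
SHEET_PATTERN = re.compile(
    r"^https?://docs\.google\.com/spreadsheets/d/([^/?#]+)(?:[^#]*#gid=(\d+))?"
)


def to_csv_export_url(url):
    """Return a CSV export URL for a Google Sheets URL; pass other URLs through."""
    match = SHEET_PATTERN.match(url)
    if match is None:
        return url
    sheet_id, gid = match.groups()
    export = f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv"
    if gid is not None:
        export += f"&gid={gid}"  # select a specific tab
    return export


# 'abc123' is a made-up spreadsheet ID.
print(to_csv_export_url("https://docs.google.com/spreadsheets/d/abc123/edit#gid=42"))
```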

Warning: this module contains many kludges to help users extract data from different kinds of sources (see, for example, hxl.io.munge_url and hxl.io.make_input). Revise with care: it is the most complex and obfuscated part of the library.

This section gives an overview of only the most-important functions and classes. See the API docs for full details.

Exceptions

hxl.io.HXLParseException — specialisation of hxl.common.HXLException for catching problems reading HXL data.
hxl.io.HXLTagsNotFoundException — specialisation of hxl.io.HXLParseException for the specific case where the data is readable, but does not contain HXL hashtags. This exception is useful for error reporting, because it allows a client application to let the user know specifically what is wrong. The HXL Proxy catches it to trigger its own Tagger page, giving the user a chance to define hashtags manually.

Raw input

These items help extract raw input from various types of sources for HXL parsing (or tagging). A library user will likely never need to use or know about the items here. Note that none of these items deals directly with HXL hashtags; they're designed just to prepare the data for higher-level HXL processing.

If you want to add a new kind of source (e.g. reading attributes from a GIS shape file), you need to define a new subclass of hxl.io.AbstractInput to extract tabular-like data from the source, then edit hxl.io.make_input to detect your new type of data source and return an instance of your new class.

Function: hxl.io.make_input — this function takes a URL (or other input source), tries to figure out what kind of input it represents (e.g. CSV? an Excel sheet? a Python array?), and then creates and returns the right kind of input handler derived from hxl.io.AbstractInput. It is one of the trickiest and most obfuscated parts of the library, and needs constant attention during testing.
Class: hxl.io.AbstractInput — base class for different types of input.
Class: hxl.io.CSVInput — raw input handler for basic CSV data (including that extracted from a Google Sheet).
Class: hxl.io.ExcelInput — raw input handler for Excel .xls or .xlsx data.
Class: hxl.io.ArrayInput — raw input handler for data that is already parsed into a Python array of arrays.
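The relationship between the factory function and the input classes can be sketched as follows. This is a deliberately tiny model of the design, not the library's code: ToyInput, ToyCSVInput, ToyArrayInput, and make_toy_input are hypothetical names, and the detection logic here looks only at the Python type of the source, whereas the real hxl.io.make_input also inspects URLs, file contents, and so on.

```python
import abc
import csv
import io


class ToyInput(abc.ABC):
    """Hypothetical base class: every input handler yields rows as lists of strings."""

    @abc.abstractmethod
    def __iter__(self):
        ...


class ToyCSVInput(ToyInput):
    """Raw rows from CSV text."""

    def __init__(self, text):
        self.text = text

    def __iter__(self):
        return iter(csv.reader(io.StringIO(self.text)))


class ToyArrayInput(ToyInput):
    """Raw rows from data already parsed into a list of lists."""

    def __init__(self, rows):
        self.rows = rows

    def __iter__(self):
        return ([str(value) for value in row] for row in self.rows)


def make_toy_input(source):
    """Guess the kind of source and return a matching input handler."""
    if isinstance(source, str):
        return ToyCSVInput(source)
    if isinstance(source, (list, tuple)):
        return ToyArrayInput(source)
    # A new kind of source needs a new ToyInput subclass plus a new branch here.
    raise TypeError(f"Unsupported input source: {type(source).__name__}")


csv_rows = list(make_toy_input("Org,Country\n#org,#country\nUNICEF,Mali\n"))
array_rows = list(make_toy_input([["#org", "#affected"], ["UNICEF", 100]]))
```

Note that neither handler knows anything about hashtags; both simply deliver rows of strings for a higher layer to interpret.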

HXL-level input

These items use the raw-input functionality to read data from a source, then parse the data for HXL hashtags and create a hxl.model.Dataset object from it.

Function: hxl.io.data (usually accessed as hxl.data) — this is the main entry point for HXL parsing, allowing the user to write hxl.data(url) to get started processing a HXL dataset. This function can recognise and intercept a JSON recipe; otherwise, it creates a new hxl.io.HXLReader.
Class: hxl.io.HXLReader — this class contains the high-level intelligence for detecting HXL hashtags in a tabular data source. It is a subclass of hxl.model.Dataset; to the user, it is simply a dataset that will return rows and columns as needed.
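A rough sketch of what detecting the hashtag row involves is shown below. It assumes a much stricter rule than a real reader would need (every non-empty cell in the row must look like a hashtag), and the names HASHTAG, find_hashtag_row, split_dataset, and the max_scan parameter are hypothetical. It is meant only to show where the headers, hashtags, and data rows sit relative to each other.

```python
import re

# A hashtag, optionally followed by +attribute parts, e.g. '#country+name'.
HASHTAG = re.compile(r"^\s*#[A-Za-z]\w*(\s*\+\s*[A-Za-z]\w*)*\s*$")


def find_hashtag_row(rows, max_scan=25):
    """Return the index of the first row that looks like a hashtag row, or None."""
    for index, row in enumerate(rows[:max_scan]):
        cells = [cell for cell in row if cell.strip()]
        if cells and all(HASHTAG.match(cell) for cell in cells):
            return index
    return None


def split_dataset(rows):
    """Split raw rows into (text headers, hashtags, data rows)."""
    index = find_hashtag_row(rows)
    if index is None:
        # The library signals this case with hxl.io.HXLTagsNotFoundException.
        raise ValueError("HXL hashtags not found")
    headers = rows[index - 1] if index > 0 else []
    return headers, rows[index], rows[index + 1:]


raw = [
    ["Organisation", "Country"],
    ["#org", "#country+name"],
    ["UNICEF", "Mali"],
]
headers, hashtags, data_rows = split_dataset(raw)
```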

Output functions

These functions serialise a hxl.model.Dataset (or subclass) into formats that other applications can read. They use the generators hxl.model.Dataset.gen_csv and hxl.model.Dataset.gen_json to create their output; because they generate the output line by line, they do not require holding a lot of data in memory at once, and can work with very large datasets.

Both of the functions have options for showing or removing both the text headers and the HXL hashtags.

hxl.io.write_hxl — generate a CSV representation of a hxl.model.Dataset
hxl.io.write_json — generate a simple JSON representation of a hxl.model.Dataset
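The line-by-line approach can be sketched with a plain Python generator. This is an illustration of the technique only: gen_csv_lines and its show_headers and show_tags parameters are hypothetical and do not reproduce the signatures of hxl.model.Dataset.gen_csv or hxl.io.write_hxl.

```python
import csv
import io


def gen_csv_lines(headers, hashtags, rows, show_headers=True, show_tags=True):
    """Yield CSV text one line at a time, so the full output never sits in memory."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")

    def encode(row):
        # Write one row, hand back its text, then empty the buffer for reuse.
        writer.writerow(row)
        line = buffer.getvalue()
        buffer.seek(0)
        buffer.truncate(0)
        return line

    if show_headers:
        yield encode(headers)
    if show_tags:
        yield encode(hashtags)
    for row in rows:  # rows may itself be a generator over a very large source
        yield encode(row)


source_rows = iter([["UNICEF"], ["WHO, Geneva"]])
lines = list(gen_csv_lines(["Organisation"], ["#org"], source_rows))
```

A caller that writes each yielded line straight to a file or HTTP response keeps memory use roughly constant regardless of dataset size.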

`hxl/filters.py`

The hxl.filters module defines the operations that the user can perform on the data defined in hxl.model. It is comparable to the Controller part of a web-development 'Model-View-Controller' (MVC) design.

You can find documentation on all of the filter classes in the API docs. Each filter class has a corresponding method in the hxl.model.Dataset class, and most users will use those convenience methods rather than creating the classes in this module directly.

To add a new filter class, follow these steps:

Create the class itself, deriving it from one of the appropriate filter base classes (see below), including a from_recipe method to construct a filter object from a JSON recipe.
Add a helper method to hxl.model.Dataset to invoke the filter.
Edit the hxl.filters.from_recipe function to recognise a JSON encoding of the filter and pass it off to your filter class's from_recipe method.
Add a new command-line script in hxl.scripts.
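The first three steps can be sketched in miniature as follows. Everything here is a toy: ToyDataset, ToyTruncateFilter, TOY_FILTERS, toy_from_recipe, and the recipe keys "filter" and "limit" are assumptions made for the example, so check the existing filters in hxl.filters for the real conventions before writing your own.

```python
class ToyDataset:
    """Hypothetical minimal dataset: an iterable of rows."""

    def __init__(self, rows):
        self.rows = rows

    def __iter__(self):
        return iter(self.rows)

    # Step 2: a helper method on the dataset class that invokes the filter.
    def truncate(self, limit):
        return ToyTruncateFilter(self, limit)


# Step 1: the filter class itself, including a from_recipe method.
class ToyTruncateFilter(ToyDataset):
    """Pass through at most `limit` rows from the source."""

    def __init__(self, source, limit):
        self.source = source
        self.limit = limit

    def __iter__(self):
        for index, row in enumerate(self.source):
            if index >= self.limit:
                break
            yield row

    @classmethod
    def from_recipe(cls, source, recipe):
        return cls(source, int(recipe["limit"]))


# Step 3: teach the module-level recipe function about the new filter.
TOY_FILTERS = {"truncate": ToyTruncateFilter}


def toy_from_recipe(source, recipes):
    """Build a chain of filters from a list of recipe dicts (e.g. parsed JSON)."""
    for recipe in recipes:
        filter_class = TOY_FILTERS[recipe["filter"]]
        source = filter_class.from_recipe(source, recipe)
    return source


data = ToyDataset([[1], [2], [3]])
via_method = list(data.truncate(1))
via_recipe = list(toy_from_recipe(data, [{"filter": "truncate", "limit": 2}]))
```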

Filter base classes

If you plan to add a new filter, then you should derive it from one of these base classes:

hxl.filters.AbstractBaseFilter — this is the lowest-level base class, extending hxl.model.Dataset (so that a filter looks like a dataset to the client). Most new filter classes should extend AbstractStreamingFilter or AbstractCachingFilter to get extra functionality for free, but occasionally, an unusual type of filter class needs to extend this one directly (as is the case for hxl.filters.AppendFilter and hxl.filters.ExplodeFilter).
hxl.filters.AbstractStreamingFilter — base class for filters that modify the contents of rows but do not change the number of rows returned (e.g. hxl.filters.ReplaceDataFilter). These filters tend to be highly efficient, and can handle datasets hundreds of thousands of rows in length without placing heavy demands on a server.
hxl.filters.AbstractCachingFilter — base class for filters that need to read the entire source dataset before producing any results (e.g. hxl.filters.CountFilter). These are less efficient than streaming filters, and you need to use them with care on large datasets.
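The practical difference between the streaming and caching styles can be shown with two plain generator functions. These are not libhxl classes: streaming_strip and caching_count are hypothetical names, and the example ignores columns and hashtags entirely to keep the memory behaviour in focus.

```python
from collections import Counter


def streaming_strip(rows):
    """Streaming style: each output row depends on one input row only."""
    for row in rows:
        yield [value.strip() for value in row]


def caching_count(rows, column_index):
    """Caching style: every input row must be read before the first output row."""
    counts = Counter(row[column_index] for row in rows)  # consumes the whole source
    for value, count in sorted(counts.items()):
        yield [value, count]


rows = [[" UNICEF ", "Mali"], ["WHO", "Chad"], ["UNICEF", "Chad"]]

# The two styles chain freely; only the counting stage holds state for all rows.
result = list(caching_count(streaming_strip(rows), 0))
```

The streaming stage holds one row at a time, while the counting stage's memory use grows with the number of distinct values it has to track.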

Helper items

Function: hxl.filters.from_recipe — construct a chain of filter objects from a JSON recipe, using each filter class's from_recipe method.
Functions: hxl.filters.opt_arg and hxl.filters.req_arg — helpers for checking arguments in the filter classes' from_recipe methods.
Exception: hxl.filters.HXLFilterException — derived from hxl.common.HXLException solely to make it easier to distinguish filter-related exceptions in client code. Your filters should raise this exception when there is a problem with their parameters or input.
Class: hxl.filters.Aggregator — class for tracking an aggregate value, currently used by hxl.filters.CountFilter.
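Argument-checking helpers of this kind are typically very small. The sketch below shows one plausible shape, using the hypothetical names toy_req_arg, toy_opt_arg, and ToyFilterError; the signatures of the real hxl.filters.req_arg and hxl.filters.opt_arg may differ, so treat this as an assumption and consult the API docs.

```python
class ToyFilterError(Exception):
    """Hypothetical stand-in for a filter-specific exception class."""


def toy_req_arg(recipe, name):
    """Return a required recipe value, or raise if it is missing."""
    value = recipe.get(name)
    if value is None:
        raise ToyFilterError(f"Missing required parameter: {name}")
    return value


def toy_opt_arg(recipe, name, default=None):
    """Return an optional recipe value, falling back to a default."""
    value = recipe.get(name)
    return default if value is None else value


recipe = {"filter": "count", "patterns": "#org"}
patterns = toy_req_arg(recipe, "patterns")
limit = toy_opt_arg(recipe, "limit", 10)
```

Raising a dedicated exception type for a missing parameter lets client code report a bad recipe separately from other failures.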

Unit tests

TODO

Standard: http://hxlstandard.org | Mailing list: [email protected]
