libhxl code walkthrough

David Megginson edited this page Dec 7, 2016 · 3 revisions

This is a high-level walkthrough of the libhxl-python code base, to help new contributors find their way around. For library users (rather than contributors), a high-level introduction, a coding quick start, and detailed API documentation are also available.

Top level

Directories:

  • hxl/ — root of main library source code.
  • tests/ — PyUnit unit tests and input files (run with python setup.py test).

Files:

  • CHANGELOG — list of feature-level changes for each release, in GNU ChangeLog format. These changes should be much higher-level than the output of git log, focusing on what users will (or should) notice between releases. Keep it up to date whenever you add, remove, or change a feature.
  • MANIFEST.in — a Python Manifest file, used only to ensure that CSV files (such as the default HXL schema) are included in the distribution. Edit this only if you need to include other types of non-Python files as well.
  • README.md — a short overview of the package in Markdown format. Keep this up to date for major changes in usage or installation (even higher level than in the CHANGELOG).
  • requirements.txt — list of other Python packages required by libhxl-python.
  • setup.py — a standard Python setup file for installing the library, building and uploading distributions, running tests, etc.

Main library code

The library has three primary modules:

  1. Classes in hxl.model define the representation of HXL data.
  2. Classes in hxl.filters define operations on the data defined by hxl.model.
  3. Classes and functions in hxl.io define how to load and save the data defined by hxl.model.

There are other modules under hxl, but these three are the heart of the library.

Design caveat:

The hxl.model.Dataset class is aware of (some of) the operations in hxl.filters: for example, the hxl.model.Dataset.count method creates and returns a hxl.filters.CountFilter wrapping the original dataset. That creates a two-way dependency between hxl.model.Dataset and hxl.filters.CountFilter, which is usually a bad choice in software design; however, it allows libhxl to offer the jQuery coding style of chained filters as methods, which is both popular and elegant for the library user. That means that instead of writing:

AddColumnsFilter(CountFilter(data(url), '#org'), "#country=Malaysia")

users can write:

hxl.data(url).count('#org').add_columns("#country=Malaysia")

Making the library itself less elegant so that users can make their code more elegant is a fair trade-off.
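The two-way dependency described above can be illustrated with a minimal, self-contained sketch. This is not libhxl's code: the class names (MiniDataset, MiniListDataset, MiniCountFilter, MiniAddColumnFilter) and the dict-based rows are hypothetical stand-ins, chosen only to show why the base class has to know about its own filter subclasses for chaining to work.

```python
class MiniDataset:
    """Hypothetical stand-in for a dataset base class (not libhxl's code)."""

    def __iter__(self):
        raise NotImplementedError

    def count(self, key):
        # The base class refers to one of its own subclasses here:
        # this is the two-way dependency that makes chaining possible.
        return MiniCountFilter(self, key)

    def add_column(self, name, value):
        return MiniAddColumnFilter(self, name, value)


class MiniListDataset(MiniDataset):
    """A dataset backed by a list of dict rows."""

    def __init__(self, rows):
        self.rows = rows

    def __iter__(self):
        return iter(self.rows)


class MiniCountFilter(MiniDataset):
    """Count how often each value of `key` appears in the source."""

    def __init__(self, source, key):
        self.source = source
        self.key = key

    def __iter__(self):
        counts = {}
        for row in self.source:
            counts[row[self.key]] = counts.get(row[self.key], 0) + 1
        for value in sorted(counts):
            yield {self.key: value, 'count': counts[value]}


class MiniAddColumnFilter(MiniDataset):
    """Add a column with a fixed value to every row of the source."""

    def __init__(self, source, name, value):
        self.source = source
        self.name = name
        self.value = value

    def __iter__(self):
        for row in self.source:
            yield dict(row, **{self.name: self.value})


rows = [{'org': 'UNICEF'}, {'org': 'WFP'}, {'org': 'UNICEF'}]

# Nested constructor style (what users would write without the helper methods)
nested = MiniAddColumnFilter(
    MiniCountFilter(MiniListDataset(rows), 'org'), 'country', 'Malaysia')

# Chained style (enabled by the helper methods on the base class)
chained = MiniListDataset(rows).count('org').add_column('country', 'Malaysia')
```

Both forms build the same filter chain; the chained form simply reads left to right in the order the operations are applied.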

hxl/__init__.py is the top-level file that identifies hxl/ as a Python package. It imports several names into the top level for convenience, including (most importantly) hxl.io.data, so that someone can load a HXL dataset using simple calls like this:

hxl.data('http://example.org/data.csv')

The hxl.model module defines the classes that make up the main data model for HXL, independent of filters, I/O, and so on. It is comparable to the Model part of a web-development 'Model-View-Controller' (MVC) design.

A library user always starts working with a dataset (often created using the hxl.data function), then moves down to specific columns and rows as needed. There are also classes that let users construct patterns for matching hashtags, and for constructing queries for matching rows.

Main classes

  • hxl.model.Dataset — abstract base class for any HXL dataset, defining the basic operations that are available (hxl.data(url) returns an object implementing this class or a subclass of it). A dataset contains multiple column definitions and row objects.
  • hxl.model.Column — definition of a single column in a dataset, including a text header and a HXL hashtag (possibly with attributes). There is a static method for parsing a HXL hashtag from a string.
  • hxl.model.Row — a row of data in a dataset, with references to the columns for convenience, and methods for finding information in the row.

Helper classes

  • hxl.model.TagPattern — a class representing a tag pattern for matching a column by HXL hashtag. You can use tag pattern objects to search or query columns to find matches, much like a regular expression pattern with text. There are static methods for parsing a single tag pattern or a list of tag patterns from strings.
  • hxl.model.RowQuery — a class representing a row query for matching a row based on its contents. You can use row query objects to search or query rows of data, much like a regular expression pattern with text. There are static methods for parsing a single row query or a list of queries from strings, and for matching a list of queries against a row.
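To make the idea of a tag pattern more concrete, here is a simplified, self-contained sketch of pattern matching against a column hashtag. It is not the implementation of hxl.model.TagPattern: the function names are hypothetical, and it assumes a pattern syntax in which "+attr" means the column must have the attribute and "-attr" means it must not. The real class may support more than this sketch does.

```python
import re


def parse_pattern(spec):
    """Split a spec such as '#org+code-old' into (tag, required, forbidden).

    Hypothetical helper for illustration only.
    """
    match = re.match(r'^\s*(#\w+)((?:\s*[+-]\w+)*)\s*$', spec)
    if not match:
        raise ValueError("Malformed tag pattern: {!r}".format(spec))
    tag = match.group(1).lower()
    required, forbidden = set(), set()
    for sign, name in re.findall(r'([+-])(\w+)', match.group(2)):
        (required if sign == '+' else forbidden).add(name.lower())
    return tag, required, forbidden


def pattern_matches(pattern, column_hashtag):
    """Return True if the column's hashtag spec satisfies the pattern."""
    tag, required, forbidden = parse_pattern(pattern)
    # A column hashtag only has '+' attributes, so reuse the same parser
    col_tag, col_attributes, _ = parse_pattern(column_hashtag)
    return (
        tag == col_tag
        and required <= col_attributes          # all required attributes present
        and not (forbidden & col_attributes)    # no forbidden attribute present
    )
```

Under these assumptions, a bare pattern like '#org' matches any #org column regardless of attributes, while '#org-code' matches only #org columns that lack the code attribute.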

The hxl.io module defines how to load HXL data from, and write it to, external representations such as CSV or Excel files. It is comparable to the Controller part of a web-development 'Model-View-Controller' (MVC) design.

The library can read data over the Internet or from a local file system, and can also parse data structures like Python arrays. It has special support for extracting data from Excel files, Google Sheets, Dropbox resources, or HDX resources—for example, given the URL of a Google Sheet tab, the module will automatically construct the correct URL for generating a CSV version of that tab.

Warning: this module contains many kludges to help users extract data from different kinds of sources (see, for example, hxl.io.munge_url and hxl.io.make_input). Revise with care: it is the most complex and obfuscated part of the library.

This section gives an overview of only the most-important functions and classes. See the API docs for full details.

Exceptions

  • hxl.io.HXLParseException — specialisation of hxl.common.HXLException for catching problems reading HXL data.
  • hxl.io.HXLTagsNotFoundException — specialisation of hxl.io.HXLParseException for the specific case where the data is readable, but does not contain HXL hashtags. This Exception is useful for error reporting, because it allows a client application to let the user know specifically what is wrong. The HXL Proxy catches it to trigger its own Tagger page, giving the user a chance to define hashtags manually.

Raw input

These items help extract raw input from various types of sources for HXL parsing (or tagging). A library user will likely never need to use or know about the items here. Note that none of these items deals directly with HXL hashtags; they're designed just to prepare the data for higher-level HXL processing.

If you want to add a new kind of source (e.g. reading attributes from a GIS shape file), you need to define a new subclass of hxl.io.AbstractInput to extract tabular data from the source, then edit hxl.io.make_input to detect your new type of data source and return an instance of your new class.

  • Function: hxl.io.make_input — this function takes a URL (or other input source), tries to figure out what kind of input it represents (e.g. CSV? an Excel sheet? a Python array?), and then creates and returns the right kind of input handler derived from hxl.io.AbstractInput. It is one of the trickiest and most obfuscated parts of the library, and needs constant attention during testing.
  • Class: hxl.io.AbstractInput — base class for different types of input.
  • Class: hxl.io.CSVInput — raw input handler for basic CSV data (including that extracted from a Google Sheet).
  • Class: hxl.io.ExcelInput — raw input handler for Excel .xls or .xlsx data.
  • Class: hxl.io.ArrayInput — raw input handler for data that is already parsed into a Python array of arrays.
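The general shape of this arrangement (a base input class, one subclass per source type, and a factory function that picks the right subclass) can be sketched as follows. This is only an illustration of the pattern: every name below is hypothetical, the real hxl.io.AbstractInput interface may differ, and for simplicity the sketch treats a string as literal CSV text rather than as a URL.

```python
import csv
import io


class SketchInput:
    """Hypothetical base class: iterating yields rows as lists of strings."""

    def __iter__(self):
        raise NotImplementedError


class SketchCSVInput(SketchInput):
    """Raw input from CSV text."""

    def __init__(self, text):
        self.text = text

    def __iter__(self):
        return iter(csv.reader(io.StringIO(self.text)))


class SketchArrayInput(SketchInput):
    """Raw input from data already parsed into a list of lists."""

    def __init__(self, rows):
        self.rows = rows

    def __iter__(self):
        for row in self.rows:
            # Normalise every cell to a string, as a CSV source would
            yield [str(value) for value in row]


def sketch_make_input(source):
    """Guess the source type and return a matching input handler.

    Supporting a new kind of source means adding a new subclass and a
    new detection branch here.
    """
    if isinstance(source, str):
        # Assumption for this sketch: a string is CSV text, not a URL
        return SketchCSVInput(source)
    if isinstance(source, (list, tuple)):
        return SketchArrayInput(source)
    raise TypeError("Unsupported input source: {!r}".format(type(source)))
```

Whatever the source, callers only see an iterable of rows, which is what lets the HXL-level code stay independent of the input format.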

HXL-level input

These items use the raw-input functionality to read data from a source, then parse the data for HXL hashtags and create a hxl.model.Dataset object from it.

  • Function: hxl.io.data (usually accessed as hxl.data) — this is the main entry point for HXL parsing, allowing the user to write hxl.data(url) to get started processing a HXL dataset. This function can recognise and intercept a JSON recipe; otherwise, it creates a new hxl.io.HXLReader.
  • Class: hxl.io.HXLReader — this class contains the high-level intelligence for detecting HXL hashtags in a tabular data source. It is a subclass of hxl.model.Dataset, so to the user it is simply a dataset that will return rows and columns as needed.
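As a rough illustration of what "detecting HXL hashtags" involves, the sketch below scans the first few rows of raw input for a row whose non-empty cells all start with "#". It is not the algorithm used by hxl.io.HXLReader: the function name, the exception name, the "all non-empty cells" rule, and the 25-row limit are all assumptions made for the example.

```python
class SketchTagsNotFound(Exception):
    """Hypothetical error: the data is readable but has no hashtag row."""


def find_hashtag_row(rows, max_rows=25):
    """Return (index, header_row, tag_row) for the first hashtag row found.

    header_row is the row just before the hashtag row, or None if the
    hashtag row comes first. max_rows is an arbitrary limit for this sketch.
    """
    previous = None
    for index, row in enumerate(rows):
        if index >= max_rows:
            break
        cells = [str(cell).strip() for cell in row if str(cell).strip()]
        # Assumption: a hashtag row is one where every non-empty cell
        # starts with '#'
        if cells and all(cell.startswith('#') for cell in cells):
            return index, previous, row
        previous = row
    raise SketchTagsNotFound(
        "No hashtag row in the first {} rows".format(max_rows))


raw = [
    ['Organisation', 'Country'],
    ['#org', '#country'],
    ['UNICEF', 'Malaysia'],
]
index, headers, tags = find_hashtag_row(raw)
```

Raising a dedicated exception when no hashtag row is found is what lets a client application report that specific problem instead of a generic parse failure.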

Output functions

These functions serialise a hxl.model.Dataset (or subclass) into formats that other applications can read. They use the generators hxl.model.Dataset.gen_csv and hxl.model.Dataset.gen_json to create their output; because they generate the output line by line, they do not need to hold much data in memory at once, and can work with very large datasets.

Both of the functions have options for showing or removing both the text headers and the HXL hashtags.

  • hxl.io.write_hxl — generate a CSV representation of a hxl.model.Dataset
  • hxl.io.write_json — generate a simple JSON representation of a hxl.model.Dataset
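The line-by-line approach can be sketched with a plain Python generator. This is an illustration of the technique only, not the code behind gen_csv or write_hxl: the function names and the show_headers/show_tags parameters are hypothetical, standing in for the options mentioned above.

```python
import csv
import io


def gen_csv_lines(headers, tags, rows, show_headers=True, show_tags=True):
    """Yield CSV text one line at a time, without buffering the dataset."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')

    def encode(values):
        # Write one row into the scratch buffer, grab it, then reset
        writer.writerow(values)
        line = buffer.getvalue()
        buffer.seek(0)
        buffer.truncate(0)
        return line

    if show_headers:
        yield encode(headers)
    if show_tags:
        yield encode(tags)
    for row in rows:  # rows may itself be a lazy iterator
        yield encode(row)


def write_lines(output, lines):
    """Copy generated lines to any file-like object as they are produced."""
    for line in lines:
        output.write(line)


output = io.StringIO()
data_rows = iter([['UNICEF'], ['WFP, Asia']])
write_lines(output, gen_csv_lines(['Organisation'], ['#org'], data_rows))
```

Because only one encoded row exists at a time, memory use stays flat no matter how many rows the source produces.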

The hxl.filters module defines the operations that the user can perform on the data defined in hxl.model. It is comparable to the Controller part of a web-development 'Model-View-Controller' (MVC) design.

You can find documentation on all of the filter classes in the API docs. Each filter class has a corresponding method in the hxl.model.Dataset class, and most users will use those convenience methods rather than creating the classes in this module directly.

To add a new filter class, follow these steps:

  1. Create the class itself, deriving it from one of the appropriate filter base classes (see below), including a from_recipe method to construct a filter object from a JSON recipe.
  2. Add a helper method to hxl.model.Dataset to invoke the filter.
  3. Edit the hxl.filters.from_recipe function to recognise a JSON encoding of the filter and pass it off to your filter class's from_recipe method.
  4. Add a new command-line script in hxl.scripts.
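The first three steps can be sketched in miniature as follows. Nothing here is libhxl's API: the class names, the upper_case filter, the RECIPE_FILTERS table, and the recipe keys ('filter', 'key') are hypothetical, and the real from_recipe signatures and JSON recipe format may differ. Step 4 (the command-line script) is left out.

```python
class SketchDataset:
    """Hypothetical dataset base class; rows are plain dicts."""

    def __iter__(self):
        raise NotImplementedError

    def upper_case(self, key):
        # Step 2: helper method on the dataset class that invokes the filter
        return UpperCaseFilter(self, key)


class SketchListDataset(SketchDataset):
    def __init__(self, rows):
        self.rows = rows

    def __iter__(self):
        return iter(self.rows)


class UpperCaseFilter(SketchDataset):
    """Step 1: the filter class, itself a dataset wrapping a source."""

    def __init__(self, source, key):
        self.source = source
        self.key = key

    def __iter__(self):
        for row in self.source:
            yield dict(row, **{self.key: row[self.key].upper()})

    @staticmethod
    def from_recipe(source, recipe):
        # Build the filter from an already-parsed recipe (a dict)
        return UpperCaseFilter(source, recipe['key'])


# Step 3: a central dispatcher that recognises each filter's recipe
RECIPE_FILTERS = {'upper_case': UpperCaseFilter}


def sketch_from_recipe(source, recipe):
    filter_class = RECIPE_FILTERS[recipe['filter']]
    return filter_class.from_recipe(source, recipe)


source = SketchListDataset([{'org': 'unicef'}, {'org': 'wfp'}])
via_method = list(source.upper_case('org'))
via_recipe = list(sketch_from_recipe(source, {'filter': 'upper_case', 'key': 'org'}))
```

The helper method and the recipe dispatcher are two entry points to the same class, so a filter behaves identically whether it is chained in Python or described in a recipe.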

Filter base classes

If you plan to add a new filter, then you should derive it from one of these base classes:

Helper items

TODO

TODO

TODO

TODO

TODO

TODO

Unit tests

TODO
