libhxl code walkthrough
This is a high-level walkthrough of the libhxl-python code base, to help new contributors find their way around. For library users (rather than contributors), a high-level introduction, a coding quick start, and detailed API documentation are also available.
Directories:
- hxl/ — root of main library source code.
- tests/ — PyUnit unit tests and input files (run with python setup.py test).
Files:
- CHANGELOG — list of feature-level changes for each release, in GNU Change Log format. These changes should be much higher-level than the output of git log, focusing on what users will or should notice between releases. Keep it up to date whenever you add, remove, or change a feature.
- MANIFEST.in — a Python manifest file, used only to ensure that CSV files (such as the default HXL schema) are included in the distribution. Edit this only if you need to include other types of non-Python files as well.
- README.md — a short overview of the package in Markdown format. Keep this up to date for major changes in usage or installation (even higher level than in the CHANGELOG).
- requirements.txt — list of other Python packages required by libhxl-python.
- setup.py — a standard Python setup file for installing the library, building and uploading distributions, running tests, etc.
The library has three primary modules:
- Classes in hxl.model define the representation of HXL data.
- Classes in hxl.filters define operations on the data defined by hxl.model.
- Classes and functions in hxl.io define how to load and save the data defined by hxl.model.
There are other modules under hxl, but these three are the heart of the library.
Design caveat:
The hxl.model.Dataset class is aware of (some of) the operations in hxl.filters: for example, the hxl.model.Dataset.count method creates and returns a hxl.filters.CountFilter embedding the original dataset. That creates a two-way dependency between hxl.model.Dataset and hxl.filters.CountFilter, which is usually a bad choice in software design; however, doing so allows libhxl to use the jQuery coding style of chained filters as methods, which is both popular and elegant for the code user. That means that instead of writing
AddColumnsFilter(CountFilter(data(url), '#org'), "#country=Malaysia")
users can write:
hxl.data(url).count('#org').add_columns("#country=Malaysia")
Making the library itself less elegant so that users can make their code more elegant is a fair trade-off.
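As a rough illustration of that trade-off, here is a toy sketch (not the actual libhxl classes) in which a base class returns filter objects from its own methods. The class bodies, the list-of-lists row storage, and the '#meta+count' output tag are assumptions made for this example only.

```python
# Hypothetical, stripped-down stand-ins for hxl.model.Dataset and two filters.
class Dataset:
    """Toy dataset: a list of column hashtags plus a list of rows."""

    def __init__(self, tags, rows):
        self.tags = tags
        self.rows = rows

    # The base class refers to its own subclasses here:
    # this is the two-way dependency described above.
    def count(self, tag):
        return CountFilter(self, tag)

    def add_columns(self, spec):
        return AddColumnsFilter(self, spec)


class CountFilter(Dataset):
    """Count the rows for each distinct value in one tagged column."""

    def __init__(self, source, tag):
        index = source.tags.index(tag)
        counts = {}
        for row in source.rows:
            counts[row[index]] = counts.get(row[index], 0) + 1
        super().__init__([tag, '#meta+count'], [[k, v] for k, v in counts.items()])


class AddColumnsFilter(Dataset):
    """Append a constant-value column described as '#tag=value'."""

    def __init__(self, source, spec):
        tag, value = spec.split('=', 1)
        super().__init__(source.tags + [tag], [row + [value] for row in source.rows])


source = Dataset(
    ['#org', '#sector'],
    [['UNICEF', 'WASH'], ['UNICEF', 'Health'], ['WFP', 'Food']],
)

# Nested constructor style versus chained method style: same result.
nested = AddColumnsFilter(CountFilter(source, '#org'), '#country=Malaysia')
chained = source.count('#org').add_columns('#country=Malaysia')
```

Because every filter is itself a dataset, each chained call returns something that accepts the next call, which is what makes the method style possible.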
hxl/__init__.py is the top-level file that identifies hxl/ as a Python package. It imports several names to the top level for convenience, including (most importantly) hxl.io.data, so that someone can load a HXL dataset using simple calls like this:
hxl.data('http://example.org/data.csv')
The hxl.model module defines the classes that make up the main data model for HXL, independent of filters, I/O, and so on. It is comparable to the Model part of a web-development 'Model-View-Controller' (MVC) design.
A library user always starts working with a dataset (often created using the hxl.data function), then moves down to specific columns and rows as needed. There are also classes that let users construct patterns for matching hashtags and queries for matching rows.
- hxl.model.Dataset — abstract base class for any HXL dataset, defining the basic operations that are available (hxl.data(url) returns an object implementing this class or a subclass of it). A dataset contains multiple column definitions and row objects.
- hxl.model.Column — definition of a single column in a dataset, including a text header and a HXL hashtag (possibly with attributes). There is a static method for parsing a HXL hashtag from a string.
- hxl.model.Row — a row of data in a dataset, with references to the columns for convenience, and methods for finding information in the row.

- hxl.model.TagPattern — a class representing a tag pattern for matching a column by HXL hashtag. You can use tag pattern objects to search or query columns to find matches, much like a regular expression pattern with text. There are static methods for parsing a single tag pattern or a list of tag patterns from strings.
- hxl.model.RowQuery — a class representing a row query for matching a row based on its contents. You can use row query objects to search or query rows of data, much like a regular expression pattern with text. There are static methods for parsing a single row query or a list of queries from strings, and for matching a list of queries against a row.
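To make the idea of a tag pattern more concrete, here is a simplified, standalone sketch of matching a pattern against a column hashtag. It is not the hxl.model.TagPattern implementation: the function names are hypothetical, and it assumes only the basic pattern syntax, where +attr means the attribute must be present and -attr means it must be absent.

```python
import re


def parse_pattern(spec):
    """Parse '#tag+include-exclude' into (tag, include_set, exclude_set)."""
    match = re.fullmatch(r'#?(\w+)((?:[+-]\w+)*)', spec.strip().lower())
    if not match:
        raise ValueError('Malformed tag pattern: ' + spec)
    tag = '#' + match.group(1)
    modifiers = re.findall(r'([+-])(\w+)', match.group(2))
    include = {name for sign, name in modifiers if sign == '+'}
    exclude = {name for sign, name in modifiers if sign == '-'}
    return tag, include, exclude


def matches(pattern_spec, hashtag_spec):
    """True if a column hashtag (e.g. '#adm1+code') satisfies the pattern."""
    tag, include, exclude = parse_pattern(pattern_spec)
    # A plain hashtag has no '-attr' parts, so its exclude set is ignored.
    column_tag, attributes, _ = parse_pattern(hashtag_spec)
    return (
        tag == column_tag
        and include <= attributes          # every required attribute is present
        and not (exclude & attributes)     # no forbidden attribute is present
    )
```

With this sketch, matches('#adm1', '#adm1+code') is true because the pattern places no constraint on attributes, while matches('#affected-children', '#affected+f+children') is false because the column carries an excluded attribute.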
The hxl.io module defines how to load HXL data from, and write HXL data to, external representations such as CSV or Excel files. It is comparable to the Controller part of a web-development 'Model-View-Controller' (MVC) design.
The library can read data over the Internet or from a local file system, and can also parse data structures like Python arrays. It has special support for extracting data from Excel files, Google Sheets, Dropbox resources, or HDX resources: for example, given the URL of a Google Sheet tab, the module will automatically construct the correct URL for generating a CSV version of that tab.
Warning: this module contains many kludges to help users extract data from different kinds of sources (see, for example, hxl.io.munge_url and hxl.io.make_input). Revise with care: it is the most complex and obfuscated part of the library.
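The Google Sheets case gives a feel for the kind of URL rewriting involved. The sketch below is not the code in hxl.io.munge_url: the function name is hypothetical, and the export URL layout is an assumption based on the commonly used Google Sheets CSV export endpoint.

```python
import re


def google_sheet_csv_url(url):
    """Return a CSV-export URL for a Google Sheets tab URL.

    Any URL that does not look like a Google Sheet is returned unchanged.
    """
    match = re.match(r'https?://docs\.google\.com/spreadsheets/d/([^/?#]+)', url)
    if not match:
        return url
    key = match.group(1)
    # The tab id normally appears as '#gid=123'; fall back to the first tab.
    gid_match = re.search(r'[#&?]gid=(\d+)', url)
    gid = gid_match.group(1) if gid_match else '0'
    return 'https://docs.google.com/spreadsheets/d/{}/export?format=csv&gid={}'.format(key, gid)
```

Returning non-matching URLs untouched keeps a helper like this safe to call on every input, which is the usual shape for this kind of kludge.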
This section gives an overview of only the most important functions and classes. See the API docs for full details.
- hxl.io.HXLParseException — specialisation of hxl.common.HXLException for catching problems reading HXL data.
- hxl.io.HXLTagsNotFoundException — specialisation of hxl.io.HXLParseException for the specific case where the data is readable but does not contain HXL hashtags. This exception is useful for error reporting, because it allows a client application to let the user know specifically what is wrong. The HXL Proxy catches it to trigger its own Tagger page, giving the user a chance to define hashtags manually.
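The value of the more specific exception is easiest to see from the client side. The following sketch uses stand-in classes that mirror only the inheritance chain described above; the constructors, the parse function, and the return strings are all hypothetical.

```python
class HXLException(Exception):
    """Stand-in for hxl.common.HXLException."""


class HXLParseException(HXLException):
    """Stand-in for hxl.io.HXLParseException."""


class HXLTagsNotFoundException(HXLParseException):
    """Stand-in for hxl.io.HXLTagsNotFoundException."""


def parse(rows):
    """Toy parser: fail differently for 'no data' and 'no hashtags'."""
    if not rows:
        raise HXLParseException('No data')
    if not any(str(cell).startswith('#') for row in rows for cell in row):
        raise HXLTagsNotFoundException('No HXL hashtags found')
    return rows


def load_or_explain(rows):
    """Catch the most specific exception first, then the general one."""
    try:
        parse(rows)
        return 'ok'
    except HXLTagsNotFoundException:
        return 'needs-tagging'   # e.g. offer the user a tagging page
    except HXLParseException:
        return 'unreadable'
```

The order of the except clauses matters: because the tags-not-found exception is a subclass, catching the general parse exception first would swallow it.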
These items help extract raw input from various types of sources for HXL parsing (or tagging). A library user will likely never need to use or know about the items here. Note that none of these items deals directly with HXL hashtags; they are designed just to prepare the data for higher-level HXL processing.
If you want to add a new kind of source (e.g. reading attributes from a GIS shape file), you need to define a new subclass of hxl.io.AbstractInput to extract tabular-like data from the source, then edit hxl.io.make_input to detect your new type of data source and return an instance of your new class.
- Function: hxl.io.make_input — this function takes a URL (or other input source), tries to figure out what kind of input it represents (e.g. CSV? an Excel sheet? a Python array?), and then creates and returns the right kind of input handler derived from hxl.io.AbstractInput. It is one of the trickiest and most obfuscated parts of the library, and needs constant attention during testing.
- Class: hxl.io.AbstractInput — base class for different types of input.
- Class: hxl.io.CSVInput — raw input handler for basic CSV data (including that extracted from a Google Sheet).
- Class: hxl.io.ExcelInput — raw input handler for Excel .xls or .xlsx data.
- Class: hxl.io.ArrayInput — raw input handler for data that is already parsed into a Python array of arrays.
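The two-step recipe for adding a new source (a new input subclass plus a new branch in the factory function) might look roughly like this. Everything here is a toy stand-in: the real hxl.io.AbstractInput and hxl.io.make_input have their own signatures and far more detection logic, and the KeyValueInput class is invented for the example.

```python
import csv
import io


class AbstractInput:
    """Stand-in base class: an input is anything that iterates over raw rows."""

    def __iter__(self):
        raise NotImplementedError


class ArrayInput(AbstractInput):
    """Rows that are already a list of lists."""

    def __init__(self, data):
        self.data = data

    def __iter__(self):
        return iter(self.data)


class CSVTextInput(AbstractInput):
    """Rows parsed from a string of CSV text."""

    def __init__(self, text):
        self.text = text

    def __iter__(self):
        return iter(csv.reader(io.StringIO(self.text)))


class KeyValueInput(AbstractInput):
    """Step 1: the new source type, here a list of dicts (one per record)."""

    def __init__(self, records):
        self.records = records

    def __iter__(self):
        keys = list(self.records[0].keys()) if self.records else []
        yield keys  # the dict keys become the first (header) row
        for record in self.records:
            yield [record.get(key, '') for key in keys]


def make_input(source):
    """Guess the source type and return a matching input handler."""
    if isinstance(source, str):
        return CSVTextInput(source)
    # Step 2: the new detection branch, checked before the generic list case.
    if isinstance(source, list) and source and isinstance(source[0], dict):
        return KeyValueInput(source)
    if isinstance(source, list):
        return ArrayInput(source)
    raise TypeError('Unsupported input source: {!r}'.format(type(source)))
```

Note that the new branch has to come before the more general list check, which is the sort of ordering detail that makes a factory like this fragile.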
These items use the raw-input functionality to read data from a source, then parse the data for HXL hashtags and create a hxl.model.Dataset object from it.
- Function: hxl.io.data (usually accessed as hxl.data) — this is the main entry point for HXL parsing, allowing the user to write hxl.data(url) to get started processing a HXL dataset. This function can recognise and intercept a JSON recipe; otherwise, it creates a new hxl.io.HXLReader.
- Class: hxl.io.HXLReader — this class contains the high-level intelligence for detecting HXL hashtags in a tabular data source. It is a subclass of hxl.model.Dataset, so to the user it is simply a dataset that will return rows and columns as needed.
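Hashtag detection essentially means scanning the first few raw rows for one that looks like a hashtag row. The sketch below is a deliberately simplified version of that idea, not the HXLReader algorithm: the function name, the 25-row scan limit, and the rule that every non-empty cell must be a hashtag are all assumptions.

```python
import re

# A cell such as '#org' or '#sector+cluster' (tag plus optional attributes).
HASHTAG = re.compile(r'^\s*#[A-Za-z]\w*(\s*\+[A-Za-z]\w*)*\s*$')


def find_hashtag_row(rows, max_scan=25):
    """Return (index, row) for the first row that looks like HXL hashtags.

    Returns None if no such row appears within the first max_scan rows.
    """
    for index, row in enumerate(rows):
        if index >= max_scan:
            break
        filled = [str(cell) for cell in row if str(cell).strip()]
        if filled and all(HASHTAG.match(cell) for cell in filled):
            return index, row
    return None


raw = [
    ['Organisation', 'Sector', ''],
    ['#org', '#sector+cluster', ''],
    ['WFP', 'Food', ''],
]
found = find_hashtag_row(raw)
```

Rows above the hashtag row can then be treated as text headers and rows below it as data; when nothing is found, a reader would report the tags-not-found condition described earlier.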
These functions serialise a hxl.model.Dataset (or subclass) into formats that other applications can read. They use the generators hxl.model.Dataset.gen_csv and hxl.model.Dataset.gen_json to create their output; because they generate the output line by line, they do not require holding a lot of data in memory at once, and can work with very large datasets.
Both of the functions have options for showing or removing both the text headers and the HXL hashtags.
- hxl.io.write_hxl — generate a CSV representation of a hxl.model.Dataset.
- hxl.io.write_json — generate a simple JSON representation of a hxl.model.Dataset.
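The line-by-line approach can be sketched with an ordinary Python generator. This is an illustration of the technique only: the function name and its parameters are hypothetical and do not reflect the actual signatures of gen_csv or write_hxl.

```python
import csv
import io


def gen_csv_lines(headers, tags, rows, show_headers=True, show_tags=True):
    """Yield one CSV-encoded line at a time instead of building a full string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)

    def encode(row):
        # Write a single row, grab the text, then empty the buffer for reuse.
        writer.writerow(row)
        line = buffer.getvalue()
        buffer.seek(0)
        buffer.truncate(0)
        return line

    if show_headers:
        yield encode(headers)
    if show_tags:
        yield encode(tags)
    for row in rows:  # rows may itself be a generator, so nothing is cached
        yield encode(row)
```

A caller can pass each line straight to a file or HTTP response, for example with `for line in gen_csv_lines(...): output.write(line)`, so memory use stays flat however long the dataset is.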
The hxl.filters module defines the operations that the user can perform on the data defined in hxl.model. It is comparable to the Controller part of a web-development 'Model-View-Controller' (MVC) design.
You can find documentation on all of the filter classes in the API docs. Each filter class has a corresponding method in the hxl.model.Dataset class, and most users will use those convenience methods rather than creating the classes in this module directly.
To add a new filter class, follow these steps:
- Create the class itself, deriving it from one of the appropriate filter base classes (see below), including a from_recipe method to construct a filter object from a JSON recipe.
- Add a helper method to hxl.model.Dataset to invoke the filter.
- Edit the hxl.filters.from_recipe function to recognise a JSON encoding of the filter and pass it off to your filter class's from_recipe method.
- Add a new command-line script in hxl.scripts.
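The recipe-related steps above can be sketched in miniature as follows. None of this is libhxl code: the UppercaseFilter class, the 'filter', 'column', and 'strip' recipe keys, the FILTERS table, and the helper signatures are assumptions chosen to keep the example self-contained, and plain lists of rows stand in for datasets.

```python
def req_arg(spec, name):
    """Return a required recipe property, or fail with a clear message."""
    if name not in spec:
        raise ValueError('Missing required property: ' + name)
    return spec[name]


def opt_arg(spec, name, default=None):
    """Return an optional recipe property, or a default value."""
    return spec.get(name, default)


class UppercaseFilter:
    """Hypothetical filter: upper-case the values in one column (by index)."""

    def __init__(self, source, column, strip=False):
        self.source = source
        self.column = column
        self.strip = strip

    def __iter__(self):
        for row in self.source:
            row = list(row)  # copy, so the source rows are left untouched
            value = row[self.column]
            row[self.column] = (value.strip() if self.strip else value).upper()
            yield row

    @staticmethod
    def from_recipe(source, spec):
        """Build the filter from one decoded JSON recipe entry."""
        return UppercaseFilter(
            source,
            column=req_arg(spec, 'column'),
            strip=opt_arg(spec, 'strip', False),
        )


# The module-level dispatcher has to know the name of every filter.
FILTERS = {'uppercase': UppercaseFilter.from_recipe}


def from_recipe(source, recipe):
    """Wrap the source in one filter per recipe entry, in order."""
    for spec in recipe:
        name = req_arg(spec, 'filter')
        if name not in FILTERS:
            raise ValueError('Unknown filter: ' + name)
        source = FILTERS[name](source, spec)
    return source


rows = [['wfp', ' food '], ['unicef', 'wash']]
recipe = [
    {'filter': 'uppercase', 'column': 0},
    {'filter': 'uppercase', 'column': 1, 'strip': True},
]
result = list(from_recipe(rows, recipe))
```

Keeping the argument checking inside each class's own from_recipe method means the dispatcher only needs to map a name to a class, which is why registering a new filter there is a one-line change.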
If you plan to add a new filter, then you should derive it from one of these base classes:
- hxl.filters.AbstractBaseFilter — this is the lowest-level base class, extending hxl.model.Dataset (so that a filter looks like a dataset to the client). Most new filter classes should extend AbstractStreamingFilter or AbstractCachingFilter to get extra functionality for free, but occasionally an unusual type of filter class needs to extend this one directly (as is the case for hxl.filters.AppendFilter and hxl.filters.ExplodeFilter).
- hxl.filters.AbstractStreamingFilter — base class for filters that modify the contents of rows but do not change the number of rows returned (e.g. hxl.filters.ReplaceDataFilter). These filters tend to be highly efficient, and can handle datasets hundreds of thousands of rows in length without placing heavy demands on a server.
- hxl.filters.AbstractCachingFilter — base class for filters that need to read the entire source dataset before producing any results (e.g. hxl.filters.CountFilter). These are less efficient than streaming filters, and you need to use them with care on large datasets.
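The practical difference between the streaming and caching styles shows up in how many source rows must be read before the first result appears. The two generator functions below are toy equivalents written for this comparison, not the libhxl base classes, and the pulled list exists only to record what the source has handed out.

```python
def replace_values(rows, index, old, new):
    """Streaming style: handle one row at a time and yield it immediately."""
    for row in rows:
        row = list(row)
        if row[index] == old:
            row[index] = new
        yield row


def count_values(rows, index):
    """Caching style: every row must be read before any result is known."""
    counts = {}
    for row in rows:
        counts[row[index]] = counts.get(row[index], 0) + 1
    for value in sorted(counts):
        yield [value, counts[value]]


pulled = []  # records each row as the source hands it out


def source():
    for row in [['WFP'], ['Unicef'], ['WFP']]:
        pulled.append(row)
        yield row


# Asking the streaming filter for one row reads exactly one source row.
first_row = next(replace_values(source(), 0, 'Unicef', 'UNICEF'))
pulled_by_stream = len(pulled)

# Asking the caching filter for one row forces it to read the whole source.
first_count = next(count_values(source(), 0))
pulled_by_cache = len(pulled) - pulled_by_stream
```

Here the streaming function touches one row to produce its first result, while the counting function has to consume all three, which is the behaviour that calls for care with large datasets.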
- Function: hxl.filters.from_recipe — construct a chain of filter objects from a JSON recipe, using each filter class's from_recipe method.
- Functions: hxl.filters.opt_arg and hxl.filters.req_arg — helpers for checking arguments in the filter classes' from_recipe methods.
- Exception: hxl.filters.HXLFilterException — derived from hxl.common.HXLException solely to make it easier to distinguish filter-related exceptions in client code. Your filters should raise this exception when there is a problem with their parameters or input.
- Class: hxl.filters.Aggregator — class for tracking an aggregate value, currently used by hxl.filters.CountFilter.
TODO
TODO
TODO
TODO
TODO
TODO
TODO
Standard: http://hxlstandard.org | Mailing list: [email protected]