mabel.data.readers.reader

Justin Joyce edited this page Jul 23, 2022 · 13 revisions

Reader ()

Reads records from a data store. The reader is opinionated toward Google Cloud Storage, but a filesystem reader is available to assist with local development.

The Reader will iterate over a set of files and return them to the caller as a single stream of records. The files can be read from a single folder or can be matched over a set of date/time formatted folder names. This is useful to read over a set of logs. The date range is provided as part of the call; this is essentially a way to partition the data by date/time.
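The date-partitioned folder matching described above can be pictured with a minimal sketch. It is not the Reader's implementation: the function name `dated_folders` and the folder template `year_{yyyy}/month_{mm}/day_{dd}` are assumptions made for illustration.

```python
from datetime import date, timedelta


def dated_folders(dataset, start_date, end_date,
                  template="year_{yyyy}/month_{mm}/day_{dd}"):
    """Yield one folder path per day in the inclusive date range.

    The template and its placeholder names are hypothetical.
    """
    days = (end_date - start_date).days
    for offset in range(days + 1):
        day = start_date + timedelta(days=offset)
        partition = template.format(
            yyyy=f"{day.year:04d}",
            mm=f"{day.month:02d}",
            dd=f"{day.day:02d}",
        )
        yield f"{dataset.rstrip('/')}/{partition}"


# Three days spanning a month boundary produce three folders to read from.
folders = list(dated_folders("logs/app", date(2022, 7, 30), date(2022, 8, 1)))
for folder in folders:
    print(folder)
```

Files found under each generated folder would then be read in order and presented as one stream of records.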

The reader can filter records to return a subset. For JSON-formatted data, the records can be converted to dictionaries before filtering. JSON data also supports column selection, so not all of the data read is returned.
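As an illustration of filtering and column selection on JSON records, the sketch below parses each line into a dictionary, applies a filter and keeps only the selected fields. The helper names (`matches`, `read`) and the filter shape (an outer list of OR-groups, each a list of `(field, operator, value)` tuples that are ANDed together) are assumptions; the Reader's own syntax may differ.

```python
import json
import operator

# Comparison operators supported by this sketch (an assumed set).
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def matches(record, dnf):
    """Return True if any AND-group has all of its predicates satisfied."""
    return any(
        all(
            field in record and OPERATORS[op](record[field], value)
            for field, op, value in group
        )
        for group in dnf
    )


def read(lines, select=("*",), filters=None):
    """Parse JSON lines, drop rows failing the filter, keep selected fields."""
    for line in lines:
        record = json.loads(line)
        if filters is not None and not matches(record, filters):
            continue
        if "*" in select:
            yield record
        else:
            yield {field: record.get(field) for field in select}


lines = [
    '{"user": "ada", "status": 200, "ms": 12}',
    '{"user": "bob", "status": 500, "ms": 340}',
    '{"user": "cy", "status": 200, "ms": 900}',
]

# Keep rows where status == 500 OR ms > 500, returning two columns only.
filters = [[("status", "==", 500)], [("ms", ">", 500)]]
rows = list(read(lines, select=("user", "ms"), filters=filters))
print(rows)
```

Because `read` is a generator, rows are filtered as they are parsed, so rejected records never accumulate in memory.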

The reader does not support aggregations, calculations or grouping of data; it is a log reader and returns log entries. The reader can convert a set into a Pandas dataframe, or the dictset helper library can perform some activities on the set in a more memory-efficient manner.

Note

  • Different inner_readers may take or require additional parameters. This class has a decorator which helps to ensure it is called correctly.

Parameters

  • select - string (optional)
    A select expression. This is usually a comma-separated list of field names, although it can include predefined functions. The default is "*", which presents all of the fields in the dataset.
  • dataset - string
    The path to the data source (exact syntax differs per inner_reader)
  • filters - string or list/tuple (optional)
    STRING: An expression evaluated for each row; if it evaluates to False, the row is removed from the resultant data set, like the WHERE clause of a SQL statement. LIST/TUPLE: Filter expressed in disjunctive normal form (DNF).
  • inner_reader - BaseReader (optional)
    The reader class to perform the data access operations; the default is GoogleCloudStorageReader.
  • start_date - datetime (optional)
    The starting date of the range to read over, default is today
  • end_date - datetime (optional)
    The end date of the range to read over, default is today
  • freshness_limit - string (optional)
    A time delta string (e.g. 6h30m = 6 hours and 30 minutes) which indicates the maximum age of a dataset before it is no longer considered fresh. Where the 'time' of a dataset cannot be determined, it will be treated as midnight (00:00) for the date.
  • persistence - STORAGE_CLASS (optional)
    How to cache the results. The default is NO_PERSISTANCE, which will almost always return a generator. MEMORY should only be used where the dataset isn't huge, and DISK is many times slower than MEMORY. COMPRESSED_MEMORY sits in between: usually faster than DISK but slower than MEMORY.
  • cursor - dictionary (or string)
    Resume reading from a given point (assumes other parameters are the same). If a JSON string is provided, it will be converted to a dictionary.
  • override_format - string (optional)
    Override the format detection - sometimes users know better.
  • multiprocess - boolean (optional)
    Split the task over multiple CPUs to improve throughput. Note that there are conditions that must be met for multiprocessing to be safe, which may mean that even though this is set, data is accessed serially.
  • valid_dataset_prefixes - list (optional)
    Raises an error if the dataset does not start with one of the prefixes on the list. The intended use is for situations where an external agent can initiate the request (such as the Query application). This allows a whitelist of allowable resources to be defined.
  • partitions - list (optional)
    List of folder names, with datetime placeholders, to use to build a path to the data files.
  • partition_filter - tuple (optional)
    Provide a hint on how to filter the partitions, as a single tuple in DNF notation; this hint may be ignored.
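To make the freshness_limit parameter more concrete, here is a minimal sketch of parsing a time delta string and checking a dataset's age against it. The function names (`parse_delta`, `is_fresh`) and the accepted unit letters (d, h, m, s) are assumptions; the description above only shows hours and minutes, and the Reader's own parser may behave differently.

```python
import re
from datetime import date, datetime, time, timedelta

_PART = re.compile(r"(\d+)([dhms])")
# Unit letters accepted by this sketch (assumed, not confirmed by the source).
_UNITS = {"d": "days", "h": "hours", "m": "minutes", "s": "seconds"}


def parse_delta(text):
    """Convert a string such as '6h30m' into a timedelta."""
    parts = _PART.findall(text)
    # Reject input with characters the pattern did not consume.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unrecognised time delta: {text!r}")
    total = timedelta()
    for number, unit in parts:
        total += timedelta(**{_UNITS[unit]: int(number)})
    return total


def is_fresh(dataset_time, freshness_limit, now):
    """Return True if the dataset is no older than the freshness limit."""
    return now - dataset_time <= parse_delta(freshness_limit)


# A dataset with a date but no known time is treated as midnight on that date.
dataset_time = datetime.combine(date(2022, 7, 23), time.min)

print(parse_delta("6h30m"))
print(is_fresh(dataset_time, "6h30m", now=datetime(2022, 7, 23, 6, 0)))
print(is_fresh(dataset_time, "6h30m", now=datetime(2022, 7, 23, 7, 0)))
```

Passing `now` explicitly rather than reading the clock inside `is_fresh` keeps the check deterministic and easy to test.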

Returns

  • DictSet

_CLASS: LowLevelReader (reader_class, freshness_limit, select, filters, override_format, cursor, multiprocess)


This file has been automatically generated; it is not the truth. If in doubt, the code will tell you unambiguously what it does.