The Data Observation Toolkit (DOT) can be used to monitor data in order to flag problems with data integrity and scenarios that might need attention.

wvelebanks/Data-Observation-Toolkit

 
 

The Data Observation Toolkit (DOT)

In 2019, the United Nations Statistical Commission highlighted the critical role of accurate health data, stating, “Every misclassified or unrecorded death is a lost opportunity to ensure other mothers and babies do not die in the same way. When it comes to health, better data can be a matter of life and death.” In response, DataKind developed DOT to increase trust in public health data, which is essential for equitable, data-driven health service delivery and optimized policy responses. DOT was created in collaboration with our global network of frontline health partners, including Ministries of Health, frontline health workers, and funders, all working together to strengthen health systems worldwide. You can read more about this initiative in the articles below:

The Data Observation Toolkit (DOT) is designed to monitor data and flag potential issues related to data integrity. It can identify problems such as missing or duplicate data, inconsistencies, outliers, and even domain-specific issues like missed follow-up medical treatments after diagnosis. DOT features a user-friendly interface for easily configuring powerful tools like the DBT and Great Expectations libraries, along with a built-in database for storing and classifying monitoring results. The primary goal of DOT is to make data monitoring more accessible, allowing users to ensure high-quality data without requiring extensive technical expertise. Below is a high-level overview of the tool and its architecture:

DOT high-level overview

[Figure: DOT high-level overview diagram]

DOT Architecture

[Figure: DOT architecture diagram]

General Configuration Pre-requisites:

Ensure that your system has a minimum of 8GB of free storage space and at least 8GB of RAM. These requirements are essential to accommodate the necessary packages and ensure optimal performance of the application. To run DOT you will need to:

  1. Install Python 3.8.9.
  2. Install the necessary Python packages by running the following commands in your terminal (additional information is available for the Mac/Linux terminal and for the Windows terminal):
    • pip install gdown
    • pip install python-on-whales
  3. Install Docker Desktop. First make sure you have checked the Docker prerequisites. We recommend allocating at least 4GB of memory, which can be set in the Docker preferences, but the amount needed can vary depending on the volume of data being tested.
  4. If running on a Mac with an M1/M2 chip, install Rosetta and set export DOCKER_DEFAULT_PLATFORM=linux/amd64 in the terminal where you will run the instructions below.
  5. (Windows users only) Install WSL (Windows Subsystem for Linux) on your Windows PC.

Alternatively, you can use the provided environment.yml if you have miniconda installed.
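
As a rough aid, the prerequisites above can be sanity-checked with a short script before going further. The sketch below is not part of DOT: the function name check_prerequisites and the exact checks are assumptions made for illustration. It only looks at the Python version, free disk space, whether a docker executable is on the PATH, and the DOCKER_DEFAULT_PLATFORM variable on Apple Silicon. It does not check RAM, since the standard library offers no portable way to do so.

```python
import os
import platform
import shutil
import sys

# 8GB of free storage, as recommended in the prerequisites.
MIN_FREE_BYTES = 8 * 1024**3


def check_prerequisites(path="."):
    """Return a list of warnings; an empty list means every check passed.

    Hypothetical helper for illustration, not shipped with DOT.
    """
    problems = []

    # The prerequisites ask for Python 3.8.9; this only compares major.minor.
    if sys.version_info[:2] != (3, 8):
        problems.append(
            f"Python 3.8.x expected, found {platform.python_version()}"
        )

    # Free space on the disk that holds `path`.
    free_bytes = shutil.disk_usage(path).free
    if free_bytes < MIN_FREE_BYTES:
        problems.append(
            f"Only {free_bytes / 1024**3:.1f}GB free, 8GB recommended"
        )

    # Docker Desktop normally puts a `docker` executable on the PATH.
    if shutil.which("docker") is None:
        problems.append("docker executable not found on PATH")

    # Apple Silicon usually reports Darwin/arm64 (a Python running under
    # Rosetta may report x86_64 instead, in which case this check is skipped).
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        if os.environ.get("DOCKER_DEFAULT_PLATFORM") != "linux/amd64":
            problems.append(
                "DOCKER_DEFAULT_PLATFORM is not set to linux/amd64"
            )

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Running the script from the folder where you plan to place the repository prints one warning per failed check and nothing if all checks pass.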

After completing the software prerequisites for your operating system, download or clone the DOT repository to your computer. You will need this repository for all the setups listed below.

Configuration

The following sections provide step-by-step instructions for configuring various components of DOT:

Sample data

Explore these comprehensive datasets, including global COVID-19 data, U.S. childhood obesity records, and datasets ranging from 1,000 to over a million patient entries, along with a synthetic dataset demonstrating DOT's capabilities with frontline health data.

Guidelines for adding new tests

  • Existing tests are in the self-tests folder
  • All tests extend the test base class, which
    • facilitates the import of modules under test
    • recreates a directory in the file system for the test outputs
    • provides a number of functions to support tests that access the database: mocking the config files to point to the test dot_config.yml, (re)creating a schema for the DOT configuration and loading it with test data, etc.
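
The pattern described above, a shared base class that prepares a clean output directory for every test, can be sketched roughly as follows. This is an illustration only: HypotheticalBaseTest, its output_dir attribute, and the example test are invented names, and the real base class in the self-tests folder has its own name and additional helpers (database access, config mocking) that are not reproduced here.

```python
import shutil
import tempfile
import unittest
from pathlib import Path


class HypotheticalBaseTest(unittest.TestCase):
    """Stand-in for DOT's test base class; names here are assumptions."""

    # Hypothetical location for test outputs.
    output_dir = Path(tempfile.gettempdir()) / "dot_self_tests_outputs_example"

    def setUp(self):
        # Recreate the output directory so each test starts from a clean state.
        shutil.rmtree(self.output_dir, ignore_errors=True)
        self.output_dir.mkdir(parents=True)

    def tearDown(self):
        # Remove the outputs again once the test has finished.
        shutil.rmtree(self.output_dir, ignore_errors=True)


class ExampleOutputTest(HypotheticalBaseTest):
    """A new test only needs to extend the base class."""

    def test_output_dir_starts_empty(self):
        self.assertEqual(list(self.output_dir.iterdir()), [])

    def test_writes_result_file(self):
        result_file = self.output_dir / "result.txt"
        result_file.write_text("ok")
        self.assertEqual(result_file.read_text(), "ok")
```

Saved as a module, a test like this can be run with python -m unittest; a real DOT test would extend the project's own base class instead of the stand-in.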

Code quality

We have instituted a pair of tools to ensure the code base will remain at an acceptable quality as it is shared and developed in the community.

  1. The formulaic Python formatter “black”. As its authors describe it, black is deterministic and fast, and it can be configured. We use the default settings, most notably a line length limit of 88 characters.
  2. The code linter pylint. This follows the PEP8 style standard. PEP8 formatting is taken care of by black, with the exception that the default pylint line length is 80. Pylint is also configurable, and the set of exclusions from the PEP8 standard that we have chosen can be found here. We chose the default score of 7 as the minimum pylint score for code to be shared. The combination of black and pylint can be incorporated into the git process as a pre-commit hook by running setup_hooks.sh.
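
To run the same two checks by hand, outside the pre-commit hook, something like the sketch below may help. It does not reproduce setup_hooks.sh. The helper names are hypothetical, and it assumes black and pylint are installed in the current environment. It uses black's --check option, which reports files that would be reformatted without changing them, and pylint's --fail-under option, which makes pylint exit with an error when the score falls below the given value. The project's own pylint exclusions are not applied here.

```python
import subprocess
import sys

# Minimum pylint score named in the code quality guidelines.
MIN_PYLINT_SCORE = 7


def build_quality_commands(paths):
    """Return the black and pylint command lines for the given paths.

    Hypothetical helper for illustration, not part of DOT.
    """
    return [
        # --check: report files black would reformat, without modifying them.
        [sys.executable, "-m", "black", "--check", *paths],
        # --fail-under: exit non-zero when the score is below the threshold.
        [
            sys.executable,
            "-m",
            "pylint",
            f"--fail-under={MIN_PYLINT_SCORE}",
            *paths,
        ],
    ]


def run_quality_checks(paths):
    """Run both tools; return True only if each exits with status 0."""
    exit_codes = [
        subprocess.run(command, check=False).returncode
        for command in build_quality_commands(paths)
    ]
    return all(code == 0 for code in exit_codes)
```

For example, run_quality_checks(["my_module.py"]) returns False if black would reformat the file or if its pylint score is below 7; my_module.py is a placeholder path.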

For detailed information on advanced configuration options and guidelines for contributing to the project, please refer to the CONTRIBUTING.md document.


