Hi! Thanks for your interest in contributing to DOT. We're really excited to see you! This document summarizes everything you need to know to do a good job.
To get an overview of the project, please read the README and our Code of Conduct to keep our community approachable and respectable.
If you spot a problem, search if an issue already exists. If a related issue doesn't exist, you can open a new issue using a relevant issue form.
As a general rule, we don’t assign issues to anyone. If you find an issue to work on, you are welcome to open a PR with a fix.
All the configuration files must be located under the config folder of the DOT.
The main config file must be called dot_config.yml and located at the top config folder. Note that this file will be ignored for version control. You may use the example dot_config yaml as a template.
Besides the DOT DB connection in the paragraph above, see below for additional config options.
For each of the projects you would like to run, add a key to the DOT config yaml with the following structure:
<project_name>_db:
  type: connection type, e.g. postgres
  host: host
  user: username
  pass: password
  port: port number, e.g. 5432
  dbname: database name
  schema: schema name, e.g. public
  threads: number of threads for DBT, e.g. 4
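Before running DOT, a project entry can be checked against the keys listed above. The following is a minimal sketch, not part of DOT itself: the helper name `missing_db_keys` and the sample values are hypothetical, the config is shown as an already-parsed dictionary (such as a YAML parser would return), and it assumes every key in the structure is required.

```python
# Keys each <project_name>_db entry carries in the structure above
# (treating all of them as required is an assumption of this sketch).
REQUIRED_DB_KEYS = ("type", "host", "user", "pass", "port", "dbname", "schema", "threads")


def missing_db_keys(config: dict, project_name: str) -> list:
    """Return the expected keys that are absent from the project's entry (hypothetical helper)."""
    entry = config.get(f"{project_name}_db") or {}
    return [key for key in REQUIRED_DB_KEYS if key not in entry]


# Placeholder values only, not real credentials.
example_config = {
    "myproject_db": {
        "type": "postgres",
        "host": "localhost",
        "user": "dot_user",
        "pass": "change_me",
        "port": 5432,
        "dbname": "myproject",
        "schema": "public",
        # "threads" is deliberately left out to show what the check reports
    }
}

print(missing_db_keys(example_config, "myproject"))  # ['threads']
```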
The DOT generates two kinds of database objects:
- Entities of the models that are being tested, e.g. assessments, follow ups, patients
- Results of the failing tests
If nothing is done, these objects are created in the same schema as the original data for the project (thus polluting the DB). If the key output_schema_suffix is added, its value is appended as a suffix; i.e. if the project data is stored in a certain schema, the output objects will go to <project_schema>_<schema_suffix> (e.g. to public_tests if the project schema is public and the suffix is set to tests).
Note that this mechanism uses a DBT feature, and that the same applies to the GE tests.
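The naming rule can be illustrated with a few lines of Python. This is only a sketch of the resulting schema name: in DOT the behavior comes from DBT, and the function name `output_schema` is hypothetical.

```python
from typing import Optional


def output_schema(project_schema: str, schema_suffix: Optional[str] = None) -> str:
    """Illustrative only: mirror the rule <project_schema>_<schema_suffix>."""
    if not schema_suffix:
        # No suffix configured: output objects share the project's own schema.
        return project_schema
    return f"{project_schema}_{schema_suffix}"


print(output_schema("public", "tests"))  # public_tests
print(output_schema("public"))           # public
```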
The key save_passed_tests accepts boolean values. If set to true, the results of the passing tests will also be stored in the DOT DB. If not, only the results of failing tests will be stored.
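The effect of the flag can be pictured as a filter over test results. The sketch below is an assumption-laden illustration, not DOT's implementation: the function name `results_to_store` and the result fields `test` and `passed` are hypothetical.

```python
def results_to_store(results: list, save_passed_tests: bool = False) -> list:
    """Illustrative only: pick which test results would be kept."""
    if save_passed_tests:
        # Flag set to true: passing and failing results are both kept.
        return list(results)
    # Flag absent or false: only failing results are kept.
    return [result for result in results if not result["passed"]]


# Hypothetical result records.
sample_results = [
    {"test": "not_null_patient_id", "passed": True},
    {"test": "unique_assessment_id", "passed": False},
]

print(len(results_to_store(sample_results)))                          # 1
print(len(results_to_store(sample_results, save_passed_tests=True)))  # 2
```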
Optional configuration for DBT and Great Expectations can be added, per project, in a structure as follows.
|____config
| |____<project_name>
| | |____dbt
| | | |____profiles.yml
| | | |____dbt_project.yml
| | |____ge
| | | |____great_expectations.yml
| | | |____config_variables.yml
| | | |____batch_config.json
In general these customizations are not needed; they only apply to scenarios with particular requirements, and they require a deeper knowledge of the DOT and of DBT and/or Great Expectations.
There are examples for all the files above under this folder. For each of the files you want to customize, you may copy and adapt the examples provided following the directory structure above.
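To see which optional files a project already customizes, the directory structure above can be walked with a short script. This is a sketch under stated assumptions: the helper `customized_files` and the project name `myproject` are hypothetical, and the relative paths are copied from the tree above.

```python
from pathlib import Path

# Optional per-project files, relative to config/<project_name>/, as in the tree above.
OPTIONAL_FILES = (
    "dbt/profiles.yml",
    "dbt/dbt_project.yml",
    "ge/great_expectations.yml",
    "ge/config_variables.yml",
    "ge/batch_config.json",
)


def customized_files(config_dir: Path, project_name: str) -> list:
    """Return the optional files that exist for a project (hypothetical helper)."""
    project_dir = Path(config_dir) / project_name
    return [rel for rel in OPTIONAL_FILES if (project_dir / rel).is_file()]


if __name__ == "__main__":
    # Prints an empty list when the project has no customizations.
    print(customized_files(Path("config"), "myproject"))
```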
More details in the config README.
To set up your local development environment for contributing, follow the steps in the paragraphs below.
The easiest way to develop DOT is to use the provided Docker environment; see the README for more details. It includes the user interface and a Postgres database. The self-tests also work there, so we encourage using this environment if you can. The Docker image mounts your filesystem, so changes to files are reflected in the running instance of DOT and its user interface.
If you wish to build locally, then ...
- Install miniconda by selecting the installer that fits your OS version. Once it is installed you may have to restart your terminal (closing your terminal and opening again)
- Open a terminal in this directory and run:
conda env create -f environment.yml
conda activate dot_conda_env
- You will need a postgres database called 'dot_db'. To populate objects run the scripts in ./db/dot sequentially.
- Update your ./dot/config/dot_config.yml to point at your local database
- Create a config file for the database connection details, located at the directory dot/self_tests/data/base_self_test.
Run the following and hopefully you get a successful output.
pytest dot/self_tests/unit
You can also run git commit: because the Code Quality tools referenced in the main README are installed as a pre-commit hook, committing runs them together with the self-tests.
Like many other open source projects, we use the famous gitflow to manage our branches.
Summary of our git branching model:
- Get all the latest work from the upstream datakind/Data-Observation-Toolkit repository (git checkout main)
- Create a new branch with a descriptive name (for example: feature/new-test-macro, bugfix/bug-when-uploading-results). You can do it with git checkout -b <branch name>
- Make your changes and commit them locally (git add <changed files>, git commit -m "Add some change" <changed files>). Whenever you commit, the self-tests and code quality checks will kick in; fix anything that gets broken
- Push to your branch on GitHub (with the same name as your local branch: git push origin <branch name>). This will output a URL for creating a Pull Request (PR)
- Create a pull request by opening the URL in a browser. You can also create PRs in the GitHub interface, choosing your branch to merge into main
- Wait for comments and respond as-needed
- Once PR review is complete, your code will be merged. Thanks!!
- Write helpful commit messages
- Anything in your branch must have no failing tests. You can check by looking at your PR online in GitHub
- Never use git add . because it can add unwanted files;
- Avoid using git commit -a unless you know what you're doing;
- Check every change with git diff before adding it to the index (staging area) and with git diff --cached before committing;
- If you have push access to the main repository, please do not commit directly to dev: your access should be used only to accept pull requests; if you want to make a new feature, you should use the same process as other developers so your code will be reviewed.
- Use PEP8;
- Write tests for your new features (please see "Tests" topic below);
- Always remember that commented code is dead code;
- Name identifiers (variables, classes, functions, module names) with readable names (x is always wrong);
- When manipulating strings, we prefer either f-string formatting (f'{a} = {b}') or new-style formatting ('{} = {}'.format(a, b)), instead of the old-style formatting ('%s = %s' % (a, b));
- You will know if any test breaks when you commit, and the tests will be run again in the continuous integration pipeline (see below);
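The naming and string-formatting guidelines above can be seen side by side in a short snippet. The variable names and values are made up for illustration.

```python
# Readable identifiers instead of single letters such as x.
metric_name = "failed_tests"
metric_value = 3

# Preferred: f-string formatting.
preferred_fstring = f"{metric_name} = {metric_value}"

# Also acceptable: new-style formatting.
preferred_format = "{} = {}".format(metric_name, metric_value)

# Discouraged: old-style formatting.
discouraged_old_style = "%s = %s" % (metric_name, metric_value)

# All three build the same text; only the first two follow the guideline.
print(preferred_fstring)  # failed_tests = 3
```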
You should write tests for every feature you add or bug you solve in the code. Having automated tests for every line of our code lets us make big changes without worry: the tests will show whether a change introduced bugs or broke existing features. Without tests we are blind, and every change comes with some fear of breaking something.
For a better design of your code, we recommend using a technique called test-driven development, where you write your tests before writing the actual code that implements the desired feature.
You can use pytest to run your tests, no matter which type of test it is.
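As a minimal sketch of the test-first workflow: write the tests, run pytest and watch them fail, then add the implementation until they pass. The function `normalize_schema_name`, its behavior, and the file name are hypothetical and exist only for this example.

```python
# test_schema_name.py (hypothetical file name)


def normalize_schema_name(raw_name: str) -> str:
    """Hypothetical feature: trim surrounding whitespace and lowercase a schema name."""
    return raw_name.strip().lower()


# In test-driven development these tests are written before the function above.
def test_normalize_schema_name_strips_and_lowercases():
    assert normalize_schema_name("  Public ") == "public"


def test_normalize_schema_name_keeps_clean_input():
    assert normalize_schema_name("public_tests") == "public_tests"
```

Running pytest on that file collects both functions whose names start with test_ and reports each as passed or failed.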
We use GitHub Actions for continuous integration. See here for GitHub's documentation.
The .github/workflows/lint.yml file configures the CI.