Writing GATK Tools that use Python
Under construction.
Some GATK tools depend on the use of Python for machine learning tasks. Such tools must have a Java front-end that:
- uses standard GATK arguments
- handles reading/writing of user inputs and final outputs to ensure GCS support/consistent authentication
- handles temporary file and resource management
- uses Python only when necessary, as a computational kernel
- documents all dependencies
- minimizes the amount of code written in Python
Additionally, tool authors should:
- declare all dependencies in the Conda environment definition file gatkcondaenv.yml
- not depend on package versions that have Linux or Mac-specific dependencies
- prefer single line commands embedded in Java over multiple, serial commands
- write Python errors to stderr
- raise exceptions in Python for error conditions
- ensure that program correctness does not rely on consumption of Python stdout
- logging: TBD
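The error-handling guidelines above can be illustrated with a minimal Python sketch. This is not taken from any GATK tool: the names `KernelInputError` and `scale_scores`, and the idea of passing an output path chosen by the caller, are hypothetical and exist only for illustration. The sketch writes its error message to stderr, raises an exception on an error condition, and delivers results through a file rather than stdout.

```python
import sys
import tempfile
from pathlib import Path


class KernelInputError(ValueError):
    """Hypothetical exception type for invalid kernel input."""


def scale_scores(scores, factor, output_path):
    """Toy kernel: scale each score and write the results to output_path."""
    if factor <= 0:
        # Report the problem on stderr, then raise so the caller sees a failure.
        print(f"ERROR: factor must be positive, got {factor}", file=sys.stderr)
        raise KernelInputError(f"invalid factor: {factor}")
    scaled = [s * factor for s in scores]
    # Results go to a file chosen by the caller, not to stdout.
    Path(output_path).write_text("\n".join(str(s) for s in scaled) + "\n")
    return scaled


# Example run against a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "scaled.txt"
    scale_scores([0.5, 1.0, 2.0], 2.0, out_file)
    written = out_file.read_text().split()
```

Because nothing meaningful is printed to stdout, a caller can ignore or discard that stream without affecting correctness, and an uncaught exception still surfaces as a failed process.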
GATK relies on a Conda environment to establish the correct version of Python and underlying required dependencies. This environment is defined declaratively in the file gatkcondaenv.yml, and shared by all GATK Python tools and peripheral code. Removing or changing the version of a dependency in this file should be done with care, and by consensus with all teams that are dependent on that package.
There are two methods for integrating Python with a Java front end: PythonScriptExecutor and StreamingPythonScriptExecutor. PythonScriptExecutor is an easy-to-use method for synchronously executing a single Python command, script, or module. StreamingPythonScriptExecutor employs a more complex keep-alive model that allows execution of multiple commands, asynchronous commands, and data transfer through named pipes.
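The difference between the two execution models can be sketched in plain Python with the standard `subprocess` module. This is only a conceptual illustration and does not use or reproduce the GATK executor classes: the variable names are hypothetical, results are exchanged through a temporary file instead of named pipes, and no acknowledgement protocol is shown.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# One-shot model: start Python, run a single command, wait for it to exit.
one_shot = subprocess.run(
    [sys.executable, "-c", "import math; assert math.sqrt(16) == 4"],
    stderr=subprocess.PIPE,
    text=True,
)
one_shot_ok = one_shot.returncode == 0  # a non-zero code would signal failure

# Keep-alive model: one long-lived interpreter receives several commands.
with tempfile.TemporaryDirectory() as tmp:
    result_path = Path(tmp) / "result.txt"
    worker = subprocess.Popen(
        [sys.executable, "-i", "-q", "-u"],  # -i: execute stdin statement by statement
        stdin=subprocess.PIPE,
        stdout=subprocess.DEVNULL,  # correctness must not depend on stdout
        stderr=subprocess.PIPE,     # interactive prompts and errors land here
        text=True,
    )
    commands = [
        "from pathlib import Path\n",
        "x = 21\n",  # state set by one command persists for later commands
        f"n = Path({str(result_path)!r}).write_text(str(x * 2))\n",
    ]
    for command in commands:
        worker.stdin.write(command)
        worker.stdin.flush()
    # communicate() closes stdin, which lets the interpreter exit, and drains stderr.
    _, worker_stderr = worker.communicate(timeout=30)
    keep_alive_result = result_path.read_text()
```

The one-shot call pays the interpreter start-up cost for every command and keeps no state between calls. The keep-alive process starts once and retains imports and variables across commands, at the price of managing the process lifetime and synchronization explicitly.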