diff --git a/docs/source/assets/snakemake/metadata.png b/docs/source/assets/snakemake/metadata.png new file mode 100644 index 00000000..627d0663 Binary files /dev/null and b/docs/source/assets/snakemake/metadata.png differ diff --git a/docs/source/index.md b/docs/source/index.md index 086340e3..86fd9b43 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -147,9 +147,12 @@ cli/mv ```{toctree} :hidden: :maxdepth: 2 -:caption: Manual -manual/snakemake.md -manual/tutorial.md +:caption: Snakemake Integration +snakemake/tutorial.md +snakemake/lifecycle.md +snakemake/metadata.md +snakemake/cloud.md +snakemake/troubleshooting.md ``` ```{toctree} diff --git a/docs/source/manual/snakemake.md b/docs/source/manual/snakemake.md deleted file mode 100644 index 950f0ef5..00000000 --- a/docs/source/manual/snakemake.md +++ /dev/null @@ -1,436 +0,0 @@ -# Snakemake Integration - -## Getting Started - -Latch's snakemake integration allows developers to build graphical interfaces to expose their workflows to wet lab teams. It also provides managed cloud infrastructure for execution of the workflow's jobs. - -A primary design goal for integration is to allow developers to register existing projects with minimal added boilerplate and modifications to code. Here we outline exactly what these changes are and why they are needed. - -Recall a snakemake project consists of a `Snakefile` , which describes workflow -rules in an ["extension"](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html) of Python, and associated python code imported and called by these rules. To make this project compatible with Latch, we need to do the following: - -1. Identify and construct explicit parameters for each file dependency in `latch_metadata.py` -2. Build a container with all runtime dependencies -3. Ensure your `Snakefile` is compatible with cloud execution - -### Step 1: Construct a `latch_metadata.py` file - -The snakemake framework was designed to allow developers to both define and execute their workflows. This often means that the workflow parameters are sometimes ill-defined and scattered throughout the project as configuration values, static values in the `Snakefile` or command line flags. - -To construct a graphical interface from a snakemake workflow, the file parameters need to be explicitly identified and defined so that they can be presented to scientist to be filled out through a web application. The `latch_metadata.py` file holds these parameter definitions, along with any styling or cosmetic modifications the developer wishes to make to each parameter. - -*Currently, only file and directory parameters are supported*. - -To identify the file "dependencies" that should be pulled out as parameters, it -can be useful to start with the `config.yaml` file that is used to configure -many Snakemake projects. Thinking about the minimum set of files needed to run -a successful workflow on fresh machine can also help identify these parameters. 
- -Below is an example of how to create the `latch_metadata.py` file based on the `config.yaml` file: - -Example of `config.yaml` file: - -```yaml -# config.yaml -r1_fastq: "tests/r1.fq.gz" -r2_fastq: "tests/r2.fq.gz" -path: "tests/hs38DH" -``` - -Example of `latch_metadata.py` file: - -```python -# latch_metadata.py - -from pathlib import Path - -from latch.types.directory import LatchDir -from latch.types.file import LatchFile -from latch.types.metadata import LatchAuthor, SnakemakeFileParameter, SnakemakeMetadata - -SnakemakeMetadata( - display_name="fgbio Best Practise FASTQ -> Consensus Pipeline", - author=LatchAuthor( - name="Fulcrum Genomics", - ), - parameters={ - "r1_fastq": SnakemakeFileParameter( - display_name="Read 1 FastQ", - type=LatchFile, - path=Path("tests/r1.fq.gz"), - ), - "r2_fastq": SnakemakeFileParameter( - display_name="Read 2 FastQ", - type=LatchFile, - path=Path("tests/r2.fq.gz"), - ), - "genome": SnakemakeFileParameter( - display_name="Reference Genome", - type=LatchDir, - path=Path("tests/hs38DH"), - ), - }, -) -``` - -### Step 2: Define all dependencies in a container - -When executing Snakemake jobs on Latch, the jobs run within an environment specified by a `Dockerfile` . It is important to ensure that all required dependencies, whether they are third-party binaries, python libraries, or shell scripts, are correctly installed and configured within this `Dockerfile` so the job has access to them. - -**Key Dependencies to Consider**: -* Python Packages: - + Specify these in a `requirements.txt` or `environment.yaml` file. -* Conda Packages: - + List these in an `environment.yaml` file. -* Bioinformatics Tools: - + Often includes third-party binaries. They will need to be manually added to the Dockerfile. -* Snakemake wrappers and containers: - + Note that while many Snakefile rules use singularity or docker containers, Latch doesn't currently support these wrapper or containerized environments. Therefore, all installation codes for these must be manually added into the Dockerfile. - -**Generating a Customizable Dockerfile:** - -To generate a `Dockerfile` that can be modified, use the following command: - - `latch dockerfile ` - -The above command searches for the `environment.yaml` and `requirements.txt` files within your project directory. Based on these, it generates Dockerfile instructions to install the specified Conda and Python dependencies. - -Once the Dockerfile is generated, you can manually append it with third-party Linux installations or source codes related to Snakemake wrappers or containers. - -When you register your snakemake project with Latch, a container is automatically built from the generated Dockerfile. - -### Step 3: Ensure your `Snakefile` is compatible with cloud execution - -When snakemake workflows are executed on Latch, each generated job is run in a separate container on a potentially isolated machine. This means your `Snakefile` might need to be modified to address problems that arise from this type of execution that were not present when executing locally: - -* Add missing rule inputs that are implicitly fulfiled when executing locally. Index files for biological data are commonly expected to always be alongside their matching data. -* Make sure shared code does not rely on input files. 
This is any code that is not under a rule and so gets executed by every task -* Add `resources` directives if tasks run out of memory or disk space -* Optimize data transfer by merging tasks that have 1-to-1 dependencies - -### Step 4: Register your project - -When the above steps have been taken, it is safe to register your project with the Latch CLI. - -Example: `latch register / --snakefile /Snakefile` - -This command will build a container and construct a graphical interface from your `latch_metdata.py` file. When this process has completed, a link to view your workflow on the Latch console will be printed to `stdout` . - ---- - -## Lifecycle of a Snakemake Execution on Latch - -Snakemake support is currently based on JIT (Just-In-Time) registraton. This means that the workflow produced by `latch register` will only register a second workflow, which will run the actual pipeline tasks. This is because the actual structure of the workflow cannot be specified until parameter values are provided. - -### JIT Workflow - -The first ("JIT") workflow does the following: - -1. Download all input files -2. Import the Snakefile, calculate the dependency graph, determine which jobs need to be run -3. Generate a Latch SDK workflow Python script for the second ("runtime") workflow and register it -4. Run the runtime workflow using the same inputs - -Debugging: - -* The generated runtime workflow entrypoint is uploaded to `latch:///.snakemake_latch/workflows//entrypoint.py` -* Internal workflow specifications are uploaded to `latch:///.snakemake_latch/workflows//spec` - -### Runtime Workflow - -The runtime workflow contains a task per each Snakemake job. This means that there will be a separate task per each wildcard instatiation of each rule. This can lead to workflows with hundreds of tasks. Note that the execution graph can be filtered by task status. - -Each task runs a modified Snakemake executable using a script from the Latch SDK which monkey-patches the appropriate parts of the Snakemake package. This executable is different in two ways: - -1. Rules that are not part of the task's target are entirely ignored -2. The target rule has all of its properties (currently inputs, outputs, benchmark, log, shellcode) replaced with the job-specific strings. This is the same as the value of these directives with all wildcards expanded and lazy values evaluated - -Debugging: - -* The Snakemake-compiled tasks are uploaded to `latch:///.snakemake_latch/workflows//compiled_tasks` - -#### Example - -Snakefile rules: - -```Snakemake -rule all: - input: - os.path.join(WORKDIR, "qc", "fastqc", "read1_fastqc.html"), - os.path.join(WORKDIR, "qc", "fastqc", "read2_fastqc.html") - -rule fastqc: - input: os.path.join(WORKDIR, "fastq", "{sample}.fastq") - output: os.path.join(WORKDIR, "qc", "fastqc", "{sample}_fastqc.html") - shellcmd: "fastqc {input} -o {output}" -``` - -Produced jobs: - -1. Rule: `fastqc` Wildcards: `sample=read1` -1. 
Rule: `fastqc` Wildcards: `sample=read2` - -Resulting single-job executable for job 1: - -```py -# @workflow.rule(name='all', lineno=1, snakefile='/root/Snakefile') -# @workflow.input( # os.path.join(WORKDIR, "qc", "fastqc", "read1_fastqc.html"), -# # os.path.join(WORKDIR, "qc", "fastqc", "read2_fastqc.html"), -# ) -# @workflow.norun() -# @workflow.run -# def __rule_all(input, output, ...): -# pass - -@workflow.rule(name='fastqc', lineno=6, snakefile='/root/Snakefile') -@workflow.input("work/fastq/read1.fastq" # os.path.join(WORKDIR, "fastq", "{sample}.fastq") -) -@workflow.shellcmd("fastqc work/fastq/read1.fastq -o work/qc/fastqc/read1_fastqc.html") -@workflow.run -def __rule_fastqc(input, output, ...): - shell("fastqc {input} -o {output}", ...) -``` - -Note: - -* The "all" rule is entirely commented out -* The "fastqc" rule has no wildcards in its decorators - -### Limitations - -1. The workflow will execute the first rule defined in the Snakefile (matching standard Snakemake behavior). There is no way to change the default rule other than by moving the desired rule up in the file -1. The workflow will output files that are not used by downstream tasks. This means that intermediate files cannot be included in the output. The only way to exclude an output is to write a rule that lists it as an input -1. Input files and directories are downloaded fully, even if they are not used to generate the dependency graph. This commonly leads to issues with large directories being downloaded just to list the files contained within, delaying the JIT workflow by a large amount of time and requiring a large amount of disk space -1. Only the JIT workflow downloads input files. Rules only download their individual inputs, which can be a subset of the input files. If the Snakefile tries to read input files outside of rules it will usually fail at runtime -1. Large files that move between tasks will need to be uploaded by the outputting task and downloaded by each consuming task. This can take a large amount of time. Frequently it's possible to merge the producer and the consumer into one task to improve performance -1. Environment dependencies (Conda packages, Python packages, other software) must be well-specified. Missing dependencies will lead to JIT-time or runtime crashes -1. Config files are not supported and must be hard-coded into the workflow Docker image -1. `conda` directives will frequently fail with timeouts/SSL errors because Conda does not react well to dozens of tasks trying to install conda environments over a short timespan. It is recommended that all conda environments are included in the Docker image -1. The JIT workflow hard-codes the latch paths for rule inputs, outputs and other files. If these files are missing when the runtime workflow task runs, it will fail - -## Metadata - -Workflow metadata is read from the Snakefile. For this purpose, `SnakemakeMetadata` should be instantiated at the beginning of the file outside of any rules. - -### Dependency Issues - -Some Snakefiles import third-party dependencies at the beginning. This will cause the metadata extraction to fail if the dependencies are not installed. There are two ways of dealing with this problem: - -1. Install the missing dependencies on the registering computer (the computer running the `latch` command) -2. Use a `latch_metadata.py` file - -If registration fails before metadata can be pulled, the CLI will generate an example `latch_metadata.py` file. 
- -### Input Parameters - -Since there is no explicit entrypoint ( `@workflow` ) function in a Snakemake workflow, parameters are instead specified in the metadata file. - -Currently only `LatchFile` and `LatchDir` parameters are supported. Both directory and file inputs are specified using `SnakemakeFileParameter` and setting the `type` field as appropriate. - -Parameters must include a `path` field which specifies where the data will be downloaded to. This usually matches some file location expected by a Snakemake rule. Frequently, instead of simple paths, a rule with use a `configfile` to dynamically find input paths. In this case the only requiremtn is that the path matches the config file included in the workflow Docker image. - -Example: - -```py -parameters = { - "example": SnakemakeFileParameter( - display_name="Example Parameter", - type=LatchFile, - path=Path("example.txt"), - ) -} -``` - -## Troubleshooting - -| Problem | Common Solution | -| -------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| `The above error occured when reading the Snakefile to extract workflow metadata.` | Snakefile has errors outside of any rules. Frequently caused by missing dependencies (look for `ModuleNotFoundError` ). Either install dependencies or add a `latch_metadata.py` file | -| `snakemake.exceptions.WorkflowError: Workflow defines configfile config.yaml but it is not present or accessible (full checked path: /root/config.yaml)` | Include a `config.yaml` in the workflow Docker image. Currently, config files cannot be generated from workflow parameters. | -| `Command '['/usr/local/bin/python', '-m', 'latch_cli.snakemake.single_task_snakemake', ...]' returned non-zero exit status 1.` | The runtime single-job task failed. Look at logs to find the error. It will be marked with the string `[!] Failed` . | -| Runtime workflow task fails with `FileNotFoundError in file /root/workflow/Snakefile` but the file is specified in workflow parameters | Wrap the code that reads the file in a function. **See section "Input Files Referenced Outside of Rules"** | -| MultiQC `No analysis results found. Cleaning up..` | FastQC outputs two files for every FastQ file: the raw `.zip` data and the HTML report. Include the raw `.zip` outputs of FastQC in the MultiQC rule inputs. **See section "Input Files Not Explicitly Defined in Rules"** " - -## Troubleshooting: Input Files Referenced Outside of Rules - -Only the JIT workflow downloads every input file. Tasks at runtime will only download files their target rules explicitly depend on. This means that Snakefile code that is not under a rule will usually fail if it tries to read input files. - -**Example:** - -```python -# ERROR: this reads a directory, regardless of which rule is executing! -samples = Path("inputs").glob("*.fastq") - -rule all: - input: - expand("fastqc/{sample}.html", sample=samples) - -rule fastqc: - input: - "inputs/{sample}.fastq" - output: - "fastqc/{sample}.html" - shellcmd: - fastqc {input} -o {output} -``` - -Since the `Path("inputs").glob(...)` call is not under any rule, _it runs in all tasks._ Because the `fastqc` rule does not specify `input_dir` as an `input` , it will not be downloaded and the code will throw an error. 
- -### Solution - -Only access files when necessary (i.e. when computing dependencies as in the example, or in a rule body) by placing problematic code within rule definitions. Either directly inline the variable or write a function to use in place of the variable. - -**Example:** - -```python -rule all_inline: - input: - # This code will only run in the JIT step - expand("fastqc/{sample}.html", sample=Path("inputs").glob("*.fastq")) - -def get_samples(): - # This code will only run if the function is called - samples = Path("inputs").glob("*.fastq") - return samples - -rule all_function: - input: - expand("fastqc/{sample}.html", sample=get_samples()) -``` - -This works because the JIT step replaces `input` , `output` , `params` , and other declarations with static strings for the runtime workflow so any function calls within them will be replaced with pre-computed strings and the Snakefile will not attempt to read the files again. - -**Same example at runtime:** - -```python -rule all_inline: - input: - "fastqc/example.html" - -def get_samples(): - # Note: this function is no longer called anywhere in the file - samples = Path("inputs").glob("*.fastq") - return samples - -rule all_function: - input: - "fastqc/example.html" -``` - -**Example using multiple return values:** - -```python -def get_samples_data(): - samples = Path("inputs").glob("*.fastq") - return { - "samples": samples, - "names": [x.name for x in samples] - } - -rule all: - input: - expand("fastqc/{sample}.html", sample=get_samples_data()["samples"]), - expand("reports/{name}.txt", name=get_samples_data()["names"]), -``` - -## Troubleshooting: Input Files Not Explicitly Defined in Rules - -When running the snakemake workflow locally, not all input files must be explicitly defined in every rule because all files are generated on one computer. However, tasks on Latch only download files specified by their target rules. Thus, unspecified input files will cause the Snakefile rule to fail due to missing input files. - -**Example** - -```python -# ERROR: the .zip file produced by the the fastqc rule is not found in the multiqc rule! - -WORKDIR = "/root/" - -rule fastqc: - input: join(WORKDIR, 'fastq', 'raw', "{sample}.fastq") - output: - html = join(WORKDIR, "QC", "fastqc", 'raw', "Sample_{sample}") - params: - join(WORKDIR, "QC","fastqc", 'raw', "Sample_{sample}") - run: - if not os.path.exists(join(WORKDIR, str(params))): - os.makedirs(join(WORKDIR, str(params))) - shell("fastqc -o {params} --noextract -k 5 -t 8 -f fastq {input} 2>{log}") - -rule multiqc: - input: - aligned_sequences = join(WORKDIR, "plasmid_wells_aligned_sequences.csv") - output: directory(join(WORKDIR, "QC", "multiqc_report", 'raw')) - params: - join(WORKDIR, "QC", "fastqc", 'raw') - benchmark: - join(BENCHMARKDIR, "multiqc.txt") - log: - join(LOGDIR, "multiqc.log") - shell: - "multiqc {params} -o {output} --force" -``` - -### Solution - -For programs that produce multiple types of input files (e.g. `.zip` and `.html` in the case of FastQC), explicitly specify these files in the outputs of the previous rule and in the inputs of the subsequent rule. 
- -**Example** - -```python -def get_samples(): - samples = Path("/root").glob("*fastqc.zip") - return samples - -WORKDIR = "/root/" -rule fastqc: - input: join(WORKDIR, 'fastq', 'raw', "{sample}.fastq") - output: - html = join(WORKDIR, "QC", "fastqc", 'raw', "Sample_{sample}", "_{sample}_fastqc.html") - # Specify zip as the output for every sample from fastqc - zip = join(WORKDIR, "QC", "fastqc", 'raw', "Sample_{sample}", "_{sample}_fastqc.zip") - params: - join(WORKDIR, "QC","fastqc", 'raw', "Sample_{sample}") - run: - if not os.path.exists(join(WORKDIR, str(params))): - os.makedirs(join(WORKDIR, str(params))) - shell("fastqc -o {params} --noextract -k 5 -t 8 -f fastq {input} 2>{log}") - -rule multiqc: - input: - aligned_sequences = join(WORKDIR, "plasmid_wells_aligned_sequences.csv") - # Specify zip as the input for every sample from fastqc - zip = expand( - join(WORKDIR, "QC", "fastqc", 'raw', "Sample_{sample}", "_{sample}_fastqc.zip"), sample=get_samples() - ) - output: directory(join(WORKDIR, "QC", "multiqc_report", 'raw')) - params: - join(WORKDIR, "QC", "fastqc", 'raw') - benchmark: - join(BENCHMARKDIR, "multiqc.txt") - log: - join(LOGDIR, "multiqc.log") - shell: - # Explicitly pass the input into the script instead of the Snakefile rule `params` - # Before: "multiqc {params} -o {output} --force" - # After - "multiqc {input.zip} -o {output} --force" -``` - -## Snakemake Roadmap - -### Known Issues - -* Task caching does not work, tasks always re-run when a new version of the workflow is run even if nothing specific has changed -* It is not possible to configure theĀ amount of available ephemeral storage -* Remote registration is not supported -* Snakemake tasks are serialized using a faulty custom implementation which does not support things like caching. Should use actual generated python code instead -* JIT workflow image should run snakemake extraction as a smoketest before being registered as a workflow -* Workflows with no parameters break the workflow params page on console UI -* Cannot set parameter defaults -* Parameter keys are unusued but are required in the metadata -* Log file tailing does not work - -### Future Work - -* Warn when the Snakefile reads files not on the docker image outside of any rules -* FUSE -* File/directory APIs diff --git a/docs/source/snakemake/cloud.md b/docs/source/snakemake/cloud.md new file mode 100644 index 00000000..9d7e9237 --- /dev/null +++ b/docs/source/snakemake/cloud.md @@ -0,0 +1,451 @@ +# Snakemake Workflow Cloud Compatibility + +When Snakemake workflows are executed locally on a single computer or high-performance cluster, all dependencies and input/ output files are on a single machine. + +When a Snakemake workflow is executed on Latch, each generated job is run in a separate container on a potentially isolated machine. + +Therefore, it may be necessary to adapt your Snakefile to address issues arising from this execution method, which were not encountered during local execution: + +* Add missing rule inputs that are implicitly fulfilled when executing locally. +* Make sure shared code does not rely on input files. This is any code that is not under a rule, and so gets executed by every task +* Add `resources` directives if tasks run out of memory or disk space +* Optimize data transfer by merging tasks that have 1-to-1 dependencies + +Here, we will walk through examples of each of the cases outlined above. 
+ +## Add missing rule inputs + +When a Snakemake workflow is executed on Latch, each generated job for the Snakefile rule is run on a separate machine. Only files and directories explicitly specified under the `input` directive of the rule are downloaded in the task. + +A typical example is if the index files for biological data are not explicitly specified as a Snakefile input, the generated job for that rule will fail due to the missing index files. + +#### Example + +In the example below, there are two Snakefile rules: + +* `delly_s`: The rule runs Delly to call SVs and outputs an unfiltered BCF file, followed by quality filtering using `bcftools` filter to retain only the SV calls that pass certain filters. Finally, it indexes the BCF file. +* `delly_merge`: This rule merges or concatenates BCF files containing SV calls from the delly_s rule, producing a single VCF file. The rule requires the index file to be available for each corresponding BAM file. + +```python +rule delly_s: # single-sample analysis + input: + fasta=get_fasta(), + fai=get_faidx()[0], + bam=get_bam("{path}/{sample}"), + bai=get_bai("{path}/{sample}"), + excl_opt=get_bed() + params: + excl_opt='-x "%s"' % get_bed() if exclude_regions() else "", + output: + bcf = os.path.join( + "{path}", + "{sample}", + get_outdir("delly"), + "delly-{}{}".format("{sv_type}", config.file_exts.bcf), + ) + + conda: + "../envs/caller.yaml" + threads: 1 + resources: + mem_mb=config.callers.delly.memory, + tmp_mb=config.callers.delly.tmpspace, + shell: + """ + set -xe + + OUTDIR="$(dirname "{output.bcf}")" + PREFIX="$(basename "{output.bcf}" .bcf)" + OUTFILE="${{OUTDIR}}/${{PREFIX}}.unfiltered.bcf" + + # run dummy or real job + if [ "{config.echo_run}" -eq "1" ]; then + echo "{input}" > "{output}" + else + # use OpenMP for threaded jobs + export OMP_NUM_THREADS={threads} + + # SV calling + delly call \ + -t "{wildcards.sv_type}" \ + -g "{input.fasta}" \ + -o "${{OUTFILE}}" \ + -q 1 `# min.paired-end mapping quality` \ + -s 9 `# insert size cutoff, DELs only` \ + {params.excl_opt} \ + "{input.bam}" + # SV quality filtering + bcftools filter \ + -O b `# compressed BCF format` \ + -o "{output.bcf}" \ + -i "FILTER == 'PASS'" \ + "${{OUTFILE}}" + # index BCF file + bcftools index "{output.bcf}" + fi + """ + +rule delly_merge: # used by both modes + input: + bcf = [ + os.path.join( + "{path}", + "{tumor}--{normal}", + get_outdir("delly"), + "delly-{}{}".format(sv, config.file_exts.bcf), + ) + for sv in config.callers.delly.sv_types + ] + if config.mode is config.mode.PAIRED_SAMPLE + else [ + os.path.join( + "{path}", + "{sample}", + get_outdir("delly"), + "delly-{}{}".format(sv, config.file_exts.bcf), + ) + for sv in config.callers.delly.sv_types + ], + if config.mode is config.mode.PAIRED_SAMPLE + else [ + os.path.join( + "{path}", + "{sample}", + get_outdir("delly"), + "delly-{}{}".format(sv, config.file_exts.bcf), + ) + ".csi" + for sv in config.callers.delly.sv_types + ] + output: + os.path.join( + "{path}", + "{tumor}--{normal}", + get_outdir("delly"), + "delly{}".format(config.file_exts.vcf), + ) + if config.mode is config.mode.PAIRED_SAMPLE + else os.path.join( + "{path}", + "{sample}", + get_outdir("delly"), + "delly{}".format(config.file_exts.vcf), + ), + conda: + "../envs/caller.yaml" + threads: 1 + resources: + mem_mb=1024, + tmp_mb=0, + shell: + """ + set -x + + # run dummy or real job + if [ "{config.echo_run}" -eq "1" ]; then + cat {input} > "{output}" + else + # concatenate rather than merge BCF files + bcftools concat \ + -a `# 
allow overlaps` \ + -O v `# uncompressed VCF format` \ + -o "{output}" \ + {input.bcf} + fi + """ +``` + +The above code will fail with the error: + +```bash +Failed to open: /root/workflow/data/bam/3/T3--N3/delly_out/delly-BND.bcf.csi +``` + +### Solution + +The task failed because the BAM index files (ending with `bcf.csi`) are produced by the `delly_s` rule but is not explicitly specified as input to the `delly_merge` rule. Hence, the index files are not downloaded into the task that executes the `delly_merge` rule. + +To resolve the error, we need to add the index files as the output of the `delly_s` rule and the input of the `delly_merge` rule: + +```python +rule delly_s: # single-sample analysis + input: + fasta=get_fasta(), + fai=get_faidx()[0], + bam=get_bam("{path}/{sample}"), + bai=get_bai("{path}/{sample}"), + excl_opt=get_bed() + params: + excl_opt='-x "%s"' % get_bed() if exclude_regions() else "", + output: + bcf = os.path.join( + "{path}", + "{sample}", + get_outdir("delly"), + "delly-{}{}".format("{sv_type}", config.file_exts.bcf), + ), + + # Add bcf_index as the rule's output + bcf_index = os.path.join( + "{path}", + "{sample}", + get_outdir("delly"), + "delly-{}{}".format("{sv_type}", config.file_exts.bcf), + ) + ".csi" + ... +``` + +```python +rule delly_merge: # used by both modes + input: + bcf = [ + os.path.join( + "{path}", + "{tumor}--{normal}", + get_outdir("delly"), + "delly-{}{}".format(sv, config.file_exts.bcf), + ) + for sv in config.callers.delly.sv_types + ] + if config.mode is config.mode.PAIRED_SAMPLE + else [ + os.path.join( + "{path}", + "{sample}", + get_outdir("delly"), + "delly-{}{}".format(sv, config.file_exts.bcf), + ) + for sv in config.callers.delly.sv_types + ], + + # Add bcf_index as input + bcf_index = [ + os.path.join( + "{path}", + "{tumor}--{normal}", + get_outdir("delly"), + "delly-{}{}".format(sv, config.file_exts.bcf), + ) + ".csi" + for sv in config.callers.delly.sv_types + ] + ... +``` + +## Make sure shared code doesn't rely on input files + +Tasks at runtime will only download files their target rules explicitly depend on. Shared code, or Snakefile code that is not under any rule, will usually fail if it tries to read input files. + +#### Example + +```python +# ERROR: this reads a directory, regardless of which rule is executing! +samples = Path("inputs").glob("*.fastq") + +rule all: + input: + expand("fastqc/{sample}.html", sample=samples) + +rule fastqc: + input: + "inputs/{sample}.fastq" + output: + "fastqc/{sample}.html" + shellcmd: + fastqc {input} -o {output} +``` + +Since the `Path("inputs").glob(...)` call is not under any rule, _it runs in all tasks._ Because the `fastqc` rule does not specify `input_dir` as an `input` , it will not be downloaded and the code will throw an error. + +### Solution + +Only access files when necessary (i.e. when computing dependencies as in the example, or in a rule body) by placing problematic code within rule definitions. Either directly inline the variable or write a function to use in place of the variable. 
+
+#### Example
+
+```python
+rule all_inline:
+    input:
+        # This code will only run in the JIT step
+        expand("fastqc/{sample}.html", sample=Path("inputs").glob("*.fastq"))
+
+def get_samples():
+    # This code will only run if the function is called
+    samples = Path("inputs").glob("*.fastq")
+    return samples
+
+rule all_function:
+    input:
+        expand("fastqc/{sample}.html", sample=get_samples())
+```
+
+This works because the JIT step replaces `input`, `output`, `params`, and other declarations with static strings for the runtime workflow, so any function calls within them will be replaced with pre-computed strings and the Snakefile will not attempt to read the files again.
+
+**Same example at runtime:**
+
+```python
+rule all_inline:
+    input:
+        "fastqc/example.html"
+
+def get_samples():
+    # Note: this function is no longer called anywhere in the file
+    samples = Path("inputs").glob("*.fastq")
+    return samples
+
+rule all_function:
+    input:
+        "fastqc/example.html"
+```
+
+**Example using multiple return values:**
+
+```python
+def get_samples_data():
+    samples = Path("inputs").glob("*.fastq")
+    return {
+        "samples": samples,
+        "names": [x.name for x in samples]
+    }
+
+rule all:
+    input:
+        expand("fastqc/{sample}.html", sample=get_samples_data()["samples"]),
+        expand("reports/{name}.txt", name=get_samples_data()["names"]),
+```
+
+## Add `resources` directives
+
+It is common for a Snakefile rule to run into out-of-memory errors.
+
+#### Example
+
+The following workflow failed because Kraken2 requires at least 256GB of RAM to run, but the rule only requests 128GB (`mem_mb=128000`).
+
+```python
+rule kraken:
+    input:
+        reads = lambda wildcards: get_samples()["sample_reads"][wildcards.samp],
+    output:
+        krak = join(outdir, "classification/{samp}.krak"),
+        krak_report = join(outdir, "classification/{samp}.krak.report")
+    params:
+        db = config['database'],
+        paired_string = get_paired_string(),
+        confidence_threshold = confidence_threshold
+    threads: 16
+    resources:
+        mem_mb=128000,
+    singularity: "docker://quay.io/biocontainers/kraken2:2.1.2--pl5262h7d875b9_0"
+    shell: """
+        s5cmd cp 's3://latch-public/test-data/4034/kraken_test/db/*' {params.db} &&\
+
+        time kraken2 --db {params.db} --threads 16 --output {output.krak} \
+            --report {output.krak_report} {params.paired_string} {input.reads} \
+            --confidence {params.confidence_threshold} --use-names
+        """
+```
+
+### Solution
+
+Increase `mem_mb` in the rule's `resources` directive so the task is scheduled on a machine with enough memory for the tool (at least 256GB here):
+
+```python
+rule kraken:
+    ...
+    resources:
+        mem_mb=256000
+    ...
+```
+
+## Optimize data transfer
+
+When a Snakemake workflow runs on Latch, each rule is executed on a separate, isolated machine. As a result, all input files specified for a rule are downloaded to the machine every time the rule is run. Frequent downloading of the same input files across multiple rules can lead to increased workflow runtime and higher costs, especially if the data files are large.
+
+To optimize performance and minimize costs, it is recommended to consolidate the logic that relies on shared inputs into a single rule.
+ +#### Example + +* Inefficient example with multiple rules processing the same BAM file: + +```python +rule all: + input: + "results/final_variants.vcf" + +rule mark_duplicates: + input: + "data/sample.bam" + output: + "results/dedupped_sample.bam" + shell: + """ + gatk MarkDuplicates \ + -I {input} \ + -O {output} \ + -M results/metrics.txt + """ + +rule call_variants: + input: + bam = "results/dedupped_sample.bam", + ref = "data/reference.fasta" + output: + "results/raw_variants.vcf" + shell: + """ + gatk HaplotypeCaller \ + -R {input.ref} \ + -I {input.bam} \ + -O {output} + """ + +rule filter_variants: + input: + "results/raw_variants.vcf" + output: + "results/final_variants.vcf" + shell: + """ + gatk VariantFiltration \ + -V {input} \ + -O {output} \ + --filter-name "QD_filter" \ + --filter-expression "QD < 2.0" + """ +``` + +### Solution + +Instead of having separate rules processing the BAM file for marking duplicates, calling variants, and filtering variants, we consolidate the logic into a single rule, reducing redundant data downloads. + +```python +# Efficient Example - Consolidated logic to minimize input data downloads +rule process_and_call_variants: + input: + bam = "data/sample.bam", + ref = "data/reference.fasta" + output: + vcf = "results/final_variants.vcf", + dedupped_bam = temp("results/dedupped_sample.bam"), + raw_vcf = temp("results/raw_variants.vcf") + shell: + """ + # Mark duplicates using GATK + gatk MarkDuplicates \ + -I {input.bam} \ + -O {output.dedupped_bam} \ + -M results/metrics.txt + + # Call variants using GATK HaplotypeCaller + gatk HaplotypeCaller \ + -R {input.ref} \ + -I {output.dedupped_bam} \ + -O {output.raw_vcf} + + # Filter variants using GATK VariantFiltration + gatk VariantFiltration \ + -V {output.raw_vcf} \ + -O {output.vcf} \ + --filter-name "QD_filter" \ + --filter-expression "QD < 2.0" + """ +``` diff --git a/docs/source/snakemake/lifecycle.md b/docs/source/snakemake/lifecycle.md new file mode 100644 index 00000000..7e6207c9 --- /dev/null +++ b/docs/source/snakemake/lifecycle.md @@ -0,0 +1,89 @@ +# Lifecycle of a Snakemake Execution on Latch + +Snakemake support is currently based on JIT (Just-In-Time) registraton. This means that the workflow produced by `latch register` will only register a second workflow, which will run the actual pipeline tasks. This is because the actual structure of the workflow cannot be specified until parameter values are provided. + +### JIT Workflow + +The first ("JIT") workflow does the following: + +1. Download all input files +2. Import the Snakefile, calculate the dependency graph, determine which jobs need to be run +3. Generate a Latch SDK workflow Python script for the second ("runtime") workflow and register it +4. Run the runtime workflow using the same inputs + +Debugging: + +* The generated runtime workflow entrypoint is uploaded to `latch:///.snakemake_latch/workflows//entrypoint.py` +* Internal workflow specifications are uploaded to `latch:///.snakemake_latch/workflows//spec` + +### Runtime Workflow + +The runtime workflow contains a task per each Snakemake job. This means that there will be a separate task per each wildcard instatiation of each rule. This can lead to workflows with hundreds of tasks. Note that the execution graph can be filtered by task status. + +Each task runs a modified Snakemake executable using a script from the Latch SDK which monkey-patches the appropriate parts of the Snakemake package. This executable is different in two ways: + +1. 
Rules that are not part of the task's target are entirely ignored +2. The target rule has all of its properties (currently inputs, outputs, benchmark, log, shellcode) replaced with the job-specific strings. This is the same as the value of these directives with all wildcards expanded and lazy values evaluated + +Debugging: + +* The Snakemake-compiled tasks are uploaded to `latch:///.snakemake_latch/workflows//compiled_tasks` + +#### Example + +Snakefile rules: + +```Snakemake +rule all: + input: + os.path.join(WORKDIR, "qc", "fastqc", "read1_fastqc.html"), + os.path.join(WORKDIR, "qc", "fastqc", "read2_fastqc.html") + +rule fastqc: + input: os.path.join(WORKDIR, "fastq", "{sample}.fastq") + output: os.path.join(WORKDIR, "qc", "fastqc", "{sample}_fastqc.html") + shellcmd: "fastqc {input} -o {output}" +``` + +Produced jobs: + +1. Rule: `fastqc` Wildcards: `sample=read1` +1. Rule: `fastqc` Wildcards: `sample=read2` + +Resulting single-job executable for job 1: + +```py +# @workflow.rule(name='all', lineno=1, snakefile='/root/Snakefile') +# @workflow.input( # os.path.join(WORKDIR, "qc", "fastqc", "read1_fastqc.html"), +# # os.path.join(WORKDIR, "qc", "fastqc", "read2_fastqc.html"), +# ) +# @workflow.norun() +# @workflow.run +# def __rule_all(input, output, ...): +# pass + +@workflow.rule(name='fastqc', lineno=6, snakefile='/root/Snakefile') +@workflow.input("work/fastq/read1.fastq" # os.path.join(WORKDIR, "fastq", "{sample}.fastq") +) +@workflow.shellcmd("fastqc work/fastq/read1.fastq -o work/qc/fastqc/read1_fastqc.html") +@workflow.run +def __rule_fastqc(input, output, ...): + shell("fastqc {input} -o {output}", ...) +``` + +Note: + +* The "all" rule is entirely commented out +* The "fastqc" rule has no wildcards in its decorators + +### Limitations + +1. The workflow will execute the first rule defined in the Snakefile (matching standard Snakemake behavior). There is no way to change the default rule other than by moving the desired rule up in the file +1. The workflow will output files that are not used by downstream tasks. This means that intermediate files cannot be included in the output. The only way to exclude an output is to write a rule that lists it as an input +1. Input files and directories are downloaded fully, even if they are not used to generate the dependency graph. This commonly leads to issues with large directories being downloaded just to list the files contained within, delaying the JIT workflow by a large amount of time and requiring a large amount of disk space +1. Only the JIT workflow downloads input files. Rules only download their individual inputs, which can be a subset of the input files. If the Snakefile tries to read input files outside of rules it will usually fail at runtime +1. Large files that move between tasks will need to be uploaded by the outputting task and downloaded by each consuming task. This can take a large amount of time. Frequently it's possible to merge the producer and the consumer into one task to improve performance +1. Environment dependencies (Conda packages, Python packages, other software) must be well-specified. Missing dependencies will lead to JIT-time or runtime crashes +1. Config files are not supported and must be hard-coded into the workflow Docker image +1. `conda` directives will frequently fail with timeouts/SSL errors because Conda does not react well to dozens of tasks trying to install conda environments over a short timespan. It is recommended that all conda environments are included in the Docker image +1. 
The JIT workflow hard-codes the latch paths for rule inputs, outputs and other files. If these files are missing when the runtime workflow task runs, it will fail diff --git a/docs/source/snakemake/metadata.md b/docs/source/snakemake/metadata.md new file mode 100644 index 00000000..28b1200d --- /dev/null +++ b/docs/source/snakemake/metadata.md @@ -0,0 +1,210 @@ +# Metadata + +The Snakemake framework was designed to allow developers to both define and execute their workflows. This often means that the workflow parameters are sometimes ill-defined and scattered throughout the project as configuration values, static values in the `Snakefile` or command line flags. + +To construct a graphical interface from a snakemake workflow, the file parameters need to be explicitly identified and defined so that they can be presented to scientists to be filled out through a web application. + +The `latch_metadata.py` file holds these parameter definitions, along with any styling or cosmetic modifications the developer wishes to make to each parameter. + +To generate a `latch_metadata.py` file, type: +```console +latch generate-metadata +``` + +The command automatically parses the existing `config.yaml` file in the Snakemake repository, and create a Python parameters file. + +#### Examples + +Below is an example `config.yaml` file from the [rna-seq-star-deseq2 workflow](https://github.com/snakemake-workflows/rna-seq-star-deseq2) from Snakemake workflow catalog. + +`config.yaml` +```yaml +# path or URL to sample sheet (TSV format, columns: sample, condition, ...) +samples: config/samples.tsv +# path or URL to sequencing unit sheet (TSV format, columns: sample, unit, fq1, fq2) +# Units are technical replicates (e.g. lanes, or resequencing of the same biological +# sample). +units: config/units.tsv + + +ref: + # Ensembl species name + species: homo_sapiens + # Ensembl release (make sure to take one where snpeff data is available, check 'snpEff databases' output) + release: 100 + # Genome build + build: GRCh38 + +trimming: + # If you activate trimming by setting this to `True`, you will have to + # specify the respective cutadapt adapter trimming flag for each unit + # in the `units.tsv` file's `adapters` column + activate: False + +pca: + activate: True + # Per default, a separate PCA plot is generated for each of the + # `variables_of_interest` and the `batch_effects`, coloring according to + # that variables groups. 
+ # If you want PCA plots for further columns in the samples.tsv sheet, you + # can request them under labels as a list, for example: + # - relatively_uninteresting_variable_X + # - possible_batch_effect_Y + labels: "" + +diffexp: + # variables for whome you are interested in whether they have an effect on + # expression levels + variables_of_interest: + treatment_1: + # any fold change will be relative to this factor level + base_level: B + treatment_2: + # any fold change will be relative to this factor level + base_level: C + # variables whose effect you want to model to separate them from your + # variables_of_interest + batch_effects: + - jointly_handled + # contrasts for the deseq2 results method to determine fold changes + contrasts: + A-vs-B_treatment_1: + # must be one of the variables_of_interest, for details see: + # https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#contrasts + variable_of_interest: treatment_1 + # must be a level present in the variable_of_interest that is not the + # base_level specified above + level_of_interest: A + # The default model includes all interactions among variables_of_interest + # and batch_effects added on. For the example above this implicitly is: + # model: ~jointly_handled + treatment_1 * treatment_2 + # For the default model to be used, simply specify an empty `model: ""` below. + # If you want to introduce different assumptions into your model, you can + # specify a different model to use, for example skipping the interaction: + # model: ~jointly_handled + treatment_1 + treatment_2 + model: "" + + +params: + cutadapt-pe: "" + cutadapt-se: "" + star: "" +``` + +The Python `latch_metadata.py` generated from the Latch command: +```python +from dataclasses import dataclass +import typing + +from latch.types.metadata import SnakemakeParameter, SnakemakeFileParameter +from latch.types.file import LatchFile +from latch.types.directory import LatchDir + +@dataclass +class ref: + species: str + release: int + build: str + + +@dataclass +class trimming: + activate: bool + + +@dataclass +class pca: + activate: bool + labels: str + + +@dataclass +class treatment_1: + base_level: str + + +@dataclass +class treatment_2: + base_level: str + + +@dataclass +class variables_of_interest: + treatment_1: treatment_1 + treatment_2: treatment_2 + + +@dataclass +class A_vs_B_treatment_1: + variable_of_interest: str + level_of_interest: str + + +@dataclass +class contrasts: + A_vs_B_treatment_1: A_vs_B_treatment_1 + + +@dataclass +class diffexp: + variables_of_interest: variables_of_interest + batch_effects: typing.List[str] + contrasts: contrasts + model: str + + +@dataclass +class params: + cutadapt_pe: str + cutadapt_se: str + star: str + + + + +# Import these into your `__init__.py` file: +# +# from .parameters import generated_parameters +# +generated_parameters = { + 'samples': SnakemakeFileParameter( + display_name='samples', + type=LatchFile, + config=True, + ), + 'units': SnakemakeFileParameter( + display_name='units', + type=LatchFile, + config=True, + ), + 'ref': SnakemakeParameter( + display_name='ref', + type=ref, + default=ref(species='homo_sapiens', release=100, build='GRCh38'), + ), + 'trimming': SnakemakeParameter( + display_name='trimming', + type=trimming, + default=trimming(activate=False), + ), + 'pca': SnakemakeParameter( + display_name='pca', + type=pca, + default=pca(activate=True, labels=''), + ), + 'diffexp': SnakemakeParameter( + display_name='diffexp', + type=diffexp, + 
default=diffexp(variables_of_interest=variables_of_interest(treatment_1=treatment_1(base_level='B'), treatment_2=treatment_2(base_level='C')), batch_effects=['jointly_handled'], contrasts=contrasts(A_vs_B_treatment_1=A_vs_B_treatment_1(variable_of_interest='treatment_1', level_of_interest='A')), model=''), + ), + 'params': SnakemakeParameter( + display_name='params', + type=params, + default=params(cutadapt_pe='', cutadapt_se='', star=''), + ), +} +``` + +Once the workflow is registered to Latch, it will receive an interface like below: + +![Snakemake workflow GUI](../assets/snakemake/metadata.png) diff --git a/docs/source/snakemake/troubleshooting.md b/docs/source/snakemake/troubleshooting.md new file mode 100644 index 00000000..5d22a76c --- /dev/null +++ b/docs/source/snakemake/troubleshooting.md @@ -0,0 +1,175 @@ +# Troubleshooting + +The following page outlines common problems with uploading Snakemake workflows and solutions. + +| Problem | Common Solution | +| -------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `The above error occured when reading the Snakefile to extract workflow metadata.` | Snakefile has errors outside of any rules. Frequently caused by missing dependencies (look for `ModuleNotFoundError` ). Either install dependencies or add a `latch_metadata.py` file | +| `snakemake.exceptions.WorkflowError: Workflow defines configfile config.yaml but it is not present or accessible (full checked path: /root/config.yaml)` | Include a `config.yaml` in the workflow Docker image. Currently, config files cannot be generated from workflow parameters. | +| `Command '['/usr/local/bin/python', '-m', 'latch_cli.snakemake.single_task_snakemake', ...]' returned non-zero exit status 1.` | The runtime single-job task failed. Look at logs to find the error. It will be marked with the string `[!] Failed` . | +| Runtime workflow task fails with `FileNotFoundError in file /root/workflow/Snakefile` but the file is specified in workflow parameters | Wrap the code that reads the file in a function. **See section "Input Files Referenced Outside of Rules"** | +| MultiQC `No analysis results found. Cleaning up..` | FastQC outputs two files for every FastQ file: the raw `.zip` data and the HTML report. Include the raw `.zip` outputs of FastQC in the MultiQC rule inputs. **See section "Input Files Not Explicitly Defined in Rules"** " + +## Troubleshooting: Input Files Referenced Outside of Rules + +Only the JIT workflow downloads every input file. Tasks at runtime will only download files their target rules explicitly depend on. This means that Snakefile code that is not under a rule will usually fail if it tries to read input files. + +**Example:** + +```python +# ERROR: this reads a directory, regardless of which rule is executing! +samples = Path("inputs").glob("*.fastq") + +rule all: + input: + expand("fastqc/{sample}.html", sample=samples) + +rule fastqc: + input: + "inputs/{sample}.fastq" + output: + "fastqc/{sample}.html" + shellcmd: + fastqc {input} -o {output} +``` + +Since the `Path("inputs").glob(...)` call is not under any rule, _it runs in all tasks._ Because the `fastqc` rule does not specify `input_dir` as an `input` , it will not be downloaded and the code will throw an error. 
+ +### Solution + +Only access files when necessary (i.e. when computing dependencies as in the example, or in a rule body) by placing problematic code within rule definitions. Either directly inline the variable or write a function to use in place of the variable. + +**Example:** + +```python +rule all_inline: + input: + # This code will only run in the JIT step + expand("fastqc/{sample}.html", sample=Path("inputs").glob("*.fastq")) + +def get_samples(): + # This code will only run if the function is called + samples = Path("inputs").glob("*.fastq") + return samples + +rule all_function: + input: + expand("fastqc/{sample}.html", sample=get_samples()) +``` + +This works because the JIT step replaces `input` , `output` , `params` , and other declarations with static strings for the runtime workflow so any function calls within them will be replaced with pre-computed strings and the Snakefile will not attempt to read the files again. + +**Same example at runtime:** + +```python +rule all_inline: + input: + "fastqc/example.html" + +def get_samples(): + # Note: this function is no longer called anywhere in the file + samples = Path("inputs").glob("*.fastq") + return samples + +rule all_function: + input: + "fastqc/example.html" +``` + +**Example using multiple return values:** + +```python +def get_samples_data(): + samples = Path("inputs").glob("*.fastq") + return { + "samples": samples, + "names": [x.name for x in samples] + } + +rule all: + input: + expand("fastqc/{sample}.html", sample=get_samples_data()["samples"]), + expand("reports/{name}.txt", name=get_samples_data()["names"]), +``` + +## Troubleshooting: Input Files Not Explicitly Defined in Rules + +When running the snakemake workflow locally, not all input files must be explicitly defined in every rule because all files are generated on one computer. However, tasks on Latch only download files specified by their target rules. Thus, unspecified input files will cause the Snakefile rule to fail due to missing input files. + +**Example** + +```python +# ERROR: the .zip file produced by the the fastqc rule is not found in the multiqc rule! + +WORKDIR = "/root/" + +rule fastqc: + input: join(WORKDIR, 'fastq', 'raw', "{sample}.fastq") + output: + html = join(WORKDIR, "QC", "fastqc", 'raw', "Sample_{sample}") + params: + join(WORKDIR, "QC","fastqc", 'raw', "Sample_{sample}") + run: + if not os.path.exists(join(WORKDIR, str(params))): + os.makedirs(join(WORKDIR, str(params))) + shell("fastqc -o {params} --noextract -k 5 -t 8 -f fastq {input} 2>{log}") + +rule multiqc: + input: + aligned_sequences = join(WORKDIR, "plasmid_wells_aligned_sequences.csv") + output: directory(join(WORKDIR, "QC", "multiqc_report", 'raw')) + params: + join(WORKDIR, "QC", "fastqc", 'raw') + benchmark: + join(BENCHMARKDIR, "multiqc.txt") + log: + join(LOGDIR, "multiqc.log") + shell: + "multiqc {params} -o {output} --force" +``` + +### Solution + +For programs that produce multiple types of input files (e.g. `.zip` and `.html` in the case of FastQC), explicitly specify these files in the outputs of the previous rule and in the inputs of the subsequent rule. 
+ +**Example** + +```python +def get_samples(): + samples = Path("/root").glob("*fastqc.zip") + return samples + +WORKDIR = "/root/" +rule fastqc: + input: join(WORKDIR, 'fastq', 'raw', "{sample}.fastq") + output: + html = join(WORKDIR, "QC", "fastqc", 'raw', "Sample_{sample}", "_{sample}_fastqc.html") + # Specify zip as the output for every sample from fastqc + zip = join(WORKDIR, "QC", "fastqc", 'raw', "Sample_{sample}", "_{sample}_fastqc.zip") + params: + join(WORKDIR, "QC","fastqc", 'raw', "Sample_{sample}") + run: + if not os.path.exists(join(WORKDIR, str(params))): + os.makedirs(join(WORKDIR, str(params))) + shell("fastqc -o {params} --noextract -k 5 -t 8 -f fastq {input} 2>{log}") + +rule multiqc: + input: + aligned_sequences = join(WORKDIR, "plasmid_wells_aligned_sequences.csv") + # Specify zip as the input for every sample from fastqc + zip = expand( + join(WORKDIR, "QC", "fastqc", 'raw', "Sample_{sample}", "_{sample}_fastqc.zip"), sample=get_samples() + ) + output: directory(join(WORKDIR, "QC", "multiqc_report", 'raw')) + params: + join(WORKDIR, "QC", "fastqc", 'raw') + benchmark: + join(BENCHMARKDIR, "multiqc.txt") + log: + join(LOGDIR, "multiqc.log") + shell: + # Explicitly pass the input into the script instead of the Snakefile rule `params` + # Before: "multiqc {params} -o {output} --force" + # After + "multiqc {input.zip} -o {output} --force" +``` diff --git a/docs/source/manual/tutorial.md b/docs/source/snakemake/tutorial.md similarity index 75% rename from docs/source/manual/tutorial.md rename to docs/source/snakemake/tutorial.md index b36b87f2..0d0ed430 100644 --- a/docs/source/manual/tutorial.md +++ b/docs/source/snakemake/tutorial.md @@ -1,4 +1,17 @@ -# A simple Snakemake example +# Getting started + +## Motivation +Latch's snakemake integration allows developers to build graphical interfaces to expose their workflows to wet lab teams. It also provides managed cloud infrastructure for the execution of the workflow's jobs. + +A primary design goal for the Snakemake integration is to allow developers to register existing projects with minimal added boilerplate and modifications to code. Here, we outline these changes and why they are needed. + +## How to Upload a Snakemake Workflow +Recall a snakemake project consists of a `Snakefile` , which describes workflow +rules in an ["extension"](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html) of Python, and associated python code imported and called by these rules. To make this project compatible with Latch, we need to do the following: + +1. Identify and construct explicit parameters for each file dependency in `latch_metadata.py` +2. Build a container with all runtime dependencies +3. Ensure your `Snakefile` is compatible with cloud execution In this guide, we will walk through how you can upload a simple Snakemake workflow to Latch. @@ -12,8 +25,6 @@ The example being used here comes from the [short tutorial in Snakemake's docume pip install latch[snakemake] ``` -* Install [Docker](https://www.docker.com/get-started/) and have Docker run locally - ## Step 1 First, initialize an example Snakemake workflow: @@ -93,6 +104,11 @@ SnakemakeMetadata( For each `LatchFile`/`LatchDir` parameter, the `path` keyword specifies the path where files will be copied before the Snakemake workflow is run and should match the paths of the inputs for each rule in the Snakefile. 
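+
+For example, here is a minimal sketch of how the two line up. The rule is the `bwa_map` rule from the Snakemake short tutorial this guide is based on; the parameter entry is illustrative and mirrors the `SnakemakeFileParameter` pattern in the metadata example above. A rule that reads its inputs from fixed paths:
+
+```python
+# Snakefile: this rule expects its inputs at fixed paths
+rule bwa_map:
+    input:
+        "data/genome.fa",
+        "data/samples/{sample}.fastq"
+    output:
+        "mapped_reads/{sample}.bam"
+    shell:
+        "bwa mem {input} | samtools view -Sb - > {output}"
+```
+
+pairs with a parameter whose `path` points at the same location, so the file selected in the interface is downloaded to exactly where the rule reads it:
+
+```python
+# latch_metadata.py: `path` matches the path the rule reads from
+"genome": SnakemakeFileParameter(
+    display_name="Reference Genome",
+    type=LatchFile,
+    path=Path("data/genome.fa"),
+),
+```
+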
+If your Snakemake project has an existing `config.yaml` file, you can automatically generate the `latch_metadata.py` file by typing: +```console +latch generate-metadata +``` + ## Step 3: Add dependencies Next, create an `environment.yaml` file to specify the dependencies that the Snakefile needs to run successfully: @@ -140,5 +156,7 @@ Once the workflow finishes running, results will be deposited to [Latch Data](ht ## Next Steps -* Learn more in-depth about how Snakemake integration works on Latch by reading our [manual](../manual/snakemake.md). +* Learn more about the lifecycle of a Snakemake workflow on Latch by reading our [manual](../snakemake/lifecycle.md). +* Learn about how to modify Snakemake workflows to be cloud-compatible [here](../snakemake/cloud.md). +* Visit [troubleshooting](../snakemake/troubleshooting.md) to diagnose and find solutions to common issues. * Visit the repository of [public examples](https://github.com/latchbio/latch-snakemake-examples) of Snakemake workflows on Latch.