
Data directory is disorganized #8

Open

lovettse opened this issue Mar 27, 2019 · 2 comments
Expected behavior

Clean organization of input/output data separated by sample (and preferably project)
Separation of input/output data from reference data

Actual behavior

Unstructured data directory containing all input and output data

The data directory makes organization difficult and would quickly become extremely cluttered with regular use. One possible structure that would help:

data/ref
data/PROJECT/seq/SAMPLE
data/PROJECT/analysis/SAMPLE

This is related to issue #1 in that the requirement to run all analyses from the same location exacerbates this problem by forcing every analysis for every sample to land in the same place.
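The proposed layout can be sketched in plain Python; this is a hypothetical helper for illustration only (the function name and the "PROJECT"/"SAMPLE" values are placeholders, not part of the pipeline):

```python
from pathlib import Path

# Hypothetical sketch of the proposed layout; the helper name and the
# project/sample values are illustrative, not part of the pipeline.
def sample_dirs(base, project, sample):
    """Return the ref/, seq/, and analysis/ directories for one sample."""
    root = Path(base)
    return {
        "ref": root / "ref",
        "seq": root / project / "seq" / sample,
        "analysis": root / project / "analysis" / sample,
    }

dirs = sample_dirs("data", "PROJECT", "SAMPLE")
for d in dirs.values():
    d.mkdir(parents=True, exist_ok=True)
```

Keeping reference data under a single data/ref while per-sample files live under their project keeps the two from mixing, which is the separation requested above.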


dsommer commented Mar 29, 2019

Snakemake automatically creates any missing output directories for every rule when it executes. To organize the output, just specify a sub-directory in each rule's output paths.

For example, I subsample the input reads at different coverages and run SPAdes on each to see which coverage works best. In the rule below, Snakemake automatically makes a separate output sub-directory for each sample + coverage combination.

rule spades:
    input:
        read1='{sample}/seq/filtered_{sample}_R1.fastq.gz.cov{cov}.fastq',
        read2='{sample}/seq/filtered_{sample}_R2.fastq.gz.cov{cov}.fastq'
    output:
        '{sample}/analysis/spades.{cov}/asm.fasta'
    message: "Running spades"
    run:
        # assumes 'import os, shutil' at the top of the Snakefile
        out_dir = os.path.dirname(output[0])
        shell("spades.py -t {config[threads]} -m {config[memory]} -1 {input.read1} -2 {input.read2} -o {out_dir}")
        shutil.copyfile(out_dir + "/scaffolds.fasta", output[0])

P.S. This was written before Snakemake 5.2, which added support for declaring a directory as output. As of 5.2 you wouldn't need the os.path.dirname call.
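For reference, a hedged sketch of how the same rule might look using the directory() output marker available in Snakemake ≥ 5.2 (adapted from the rule above, untested):

```
rule spades:
    input:
        read1='{sample}/seq/filtered_{sample}_R1.fastq.gz.cov{cov}.fastq',
        read2='{sample}/seq/filtered_{sample}_R2.fastq.gz.cov{cov}.fastq'
    output:
        directory('{sample}/analysis/spades.{cov}')
    shell:
        'spades.py -t {config[threads]} -m {config[memory]} '
        '-1 {input.read1} -2 {input.read2} -o {output}'
```

Note that with a directory as the sole output, downstream rules would need to depend on the directory rather than on asm.fasta directly, so the copyfile step above may still be preferable.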


dsommer commented Apr 1, 2019

To extend this example even further, below is the rule that follows the spades rule. The coverage_eval_filter rule doesn't need to know the exact name of the spades output directory because of Snakemake wildcards: it dynamically matches '{assembler}.{cov}' to the correct output directory. Hopefully some of this helps with managing your output organization.

rule coverage_eval_filter:
    input:
        asm='{sample}/analysis/{assembler}.{cov}/asm.fasta',
        read1='{sample}/seq/filtered_{sample}_R1.fastq.gz.cov{cov}.fastq',
        read2='{sample}/seq/filtered_{sample}_R2.fastq.gz.cov{cov}.fastq'
    params:
        path="{sample}/analysis/{assembler}.{cov}/"
    output:
        '{sample}/analysis/{assembler}.{cov}/score',
        '{sample}/analysis/{assembler}.{cov}/asm.fasta.bam'
    shell:
        'coverage_eval.sh {input} {params.path} {config[threads]}'

rule spades:
    input:
        read1='{sample}/seq/filtered_{sample}_R1.fastq.gz.cov{cov}.fastq',
        read2='{sample}/seq/filtered_{sample}_R2.fastq.gz.cov{cov}.fastq'
    output:
        '{sample}/analysis/spades.{cov}/asm.fasta'
    message: "Running spades"
    run:
        out_dir = os.path.dirname(output[0])
        shell("spades.py -t {config[threads]} -m {config[memory]} -1 {input.read1} -2 {input.read2} -o {out_dir}")
        shutil.copyfile(out_dir + "/scaffolds.fasta", output[0])
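The wildcard matching described above can be approximated in plain Python. This is an illustrative sketch only, not Snakemake's actual implementation (Snakemake's default wildcard regex is more permissive than the `[^/]+` used here):

```python
import re

# Illustrative sketch of wildcard matching (not Snakemake's actual code):
# split the pattern on '{name}' placeholders, escape the literal parts,
# and turn each placeholder into a named regex group.
def match_wildcards(pattern, path):
    parts = re.split(r'\{(\w+)\}', pattern)
    regex = ''.join(
        re.escape(p) if i % 2 == 0 else f'(?P<{p}>[^/]+)'
        for i, p in enumerate(parts)
    )
    m = re.fullmatch(regex, path)
    return m.groupdict() if m else None

wc = match_wildcards('{sample}/analysis/{assembler}.{cov}/asm.fasta',
                     'sampleA/analysis/spades.50/asm.fasta')
# wc == {'sample': 'sampleA', 'assembler': 'spades', 'cov': '50'}
```

Escaping the literal parts matters: without it, the '.' between {assembler} and {cov} would match any character and the groups could split in the wrong place.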
