
Data directory is disorganized #8

Open

lovettse opened this issue Mar 27, 2019 · 2 comments
Expected behavior

Clean organization of input/output data separated by sample (and preferably project)
Separation of input/output data from reference data

Actual behavior

Unstructured data directory containing all input and output data

The data directory makes organization difficult and would quickly become extremely cluttered with regular use. One possible structure that would help:

data/ref
data/PROJECT/seq/SAMPLE
data/PROJECT/analysis/SAMPLE

This is related to issue #1 in that the requirement to run all analyses from the same location exacerbates this problem by forcing every analysis for every sample to land in the same place.
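The proposed layout can be sketched in plain Python; this is a hypothetical helper for illustration only (the function name and the "PROJECT"/"SAMPLE" values are placeholders, not part of the pipeline):

```python
from pathlib import Path

# Hypothetical sketch of the proposed layout; the helper name and the
# project/sample values are illustrative, not part of the pipeline.
def sample_dirs(base, project, sample):
    """Return the ref/, seq/, and analysis/ directories for one sample."""
    root = Path(base)
    return {
        "ref": root / "ref",
        "seq": root / project / "seq" / sample,
        "analysis": root / project / "analysis" / sample,
    }

dirs = sample_dirs("data", "PROJECT", "SAMPLE")
for d in dirs.values():
    d.mkdir(parents=True, exist_ok=True)
```

Keeping reference data under a single data/ref while per-sample files live under their project keeps the two from mixing, which is the separation requested above.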


dsommer commented Mar 29, 2019

Snakemake automatically creates any missing output directories for every rule when it executes. To organize the output, just specify a sub-directory in each rule's output paths.

For example, I subsample the input reads at different coverages and run SPAdes on each to see which coverage works best. In the rule below, Snakemake automatically makes a separate output sub-directory for each sample + coverage combination.

rule spades:
    input:
        read1='{sample}/seq/filtered_{sample}_R1.fastq.gz.cov{cov}.fastq',
        read2='{sample}/seq/filtered_{sample}_R2.fastq.gz.cov{cov}.fastq'
    output:
        '{sample}/analysis/spades.{cov}/asm.fasta'
    message: "Running spades"
    run:
        # assumes 'import os, shutil' at the top of the Snakefile
        out_dir = os.path.dirname(output[0])
        shell("spades.py -t {config[threads]} -m {config[memory]} -1 {input.read1} -2 {input.read2} -o {out_dir}")
        shutil.copyfile(out_dir + "/scaffolds.fasta", output[0])

P.S. This was written before Snakemake 5.2, which added support for declaring a directory as output. As of 5.2 you wouldn't need the os.path.dirname call.
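For reference, a hedged sketch of how the same rule might look using the directory() output marker available in Snakemake ≥ 5.2 (adapted from the rule above, untested):

```
rule spades:
    input:
        read1='{sample}/seq/filtered_{sample}_R1.fastq.gz.cov{cov}.fastq',
        read2='{sample}/seq/filtered_{sample}_R2.fastq.gz.cov{cov}.fastq'
    output:
        directory('{sample}/analysis/spades.{cov}')
    shell:
        'spades.py -t {config[threads]} -m {config[memory]} '
        '-1 {input.read1} -2 {input.read2} -o {output}'
```

Note that with a directory as the sole output, downstream rules would need to depend on the directory rather than on asm.fasta directly, so the copyfile step above may still be preferable.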


dsommer commented Apr 1, 2019

To extend this example even further, below is the rule that follows the spades rule. The coverage_eval_filter rule doesn't need to know the exact name of the spades output directory because of Snakemake wildcards: it dynamically matches '{assembler}.{cov}' to the correct output directory. Hopefully some of this helps with managing your output organization.

rule coverage_eval_filter:
    input:
        asm='{sample}/analysis/{assembler}.{cov}/asm.fasta',
        read1='{sample}/seq/filtered_{sample}_R1.fastq.gz.cov{cov}.fastq',
        read2='{sample}/seq/filtered_{sample}_R2.fastq.gz.cov{cov}.fastq'
    params:
        path="{sample}/analysis/{assembler}.{cov}/"
    output:
        '{sample}/analysis/{assembler}.{cov}/score',
        '{sample}/analysis/{assembler}.{cov}/asm.fasta.bam'
    shell:
        'coverage_eval.sh {input} {params.path} {config[threads]}'

rule spades:
    input:
        read1='{sample}/seq/filtered_{sample}_R1.fastq.gz.cov{cov}.fastq',
        read2='{sample}/seq/filtered_{sample}_R2.fastq.gz.cov{cov}.fastq'
    output:
        '{sample}/analysis/spades.{cov}/asm.fasta'
    message: "Running spades"
    run:
        out_dir = os.path.dirname(output[0])
        shell("spades.py -t {config[threads]} -m {config[memory]} -1 {input.read1} -2 {input.read2} -o {out_dir}")
        shutil.copyfile(out_dir + "/scaffolds.fasta", output[0])
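The wildcard matching described above can be approximated in plain Python. This is an illustrative sketch only, not Snakemake's actual implementation (Snakemake's default wildcard regex is more permissive than the `[^/]+` used here):

```python
import re

# Illustrative sketch of wildcard matching (not Snakemake's actual code):
# split the pattern on '{name}' placeholders, escape the literal parts,
# and turn each placeholder into a named regex group.
def match_wildcards(pattern, path):
    parts = re.split(r'\{(\w+)\}', pattern)
    regex = ''.join(
        re.escape(p) if i % 2 == 0 else f'(?P<{p}>[^/]+)'
        for i, p in enumerate(parts)
    )
    m = re.fullmatch(regex, path)
    return m.groupdict() if m else None

wc = match_wildcards('{sample}/analysis/{assembler}.{cov}/asm.fasta',
                     'sampleA/analysis/spades.50/asm.fasta')
# wc == {'sample': 'sampleA', 'assembler': 'spades', 'cov': '50'}
```

Escaping the literal parts matters: without it, the '.' between {assembler} and {cov} would match any character and the groups could split in the wrong place.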
