Skip to content

Latest commit

 

History

History
191 lines (145 loc) · 6.45 KB

README.md

File metadata and controls

191 lines (145 loc) · 6.45 KB

Read Processing for paired-end Illumina reads

This repository contains a Makefile to process paired-end illumina data with bowtie2, samtools, picard, and gatk. Much of the analyses are based on @knausb's bam processing workflow, but tweaked for a haploid plant pathogen with an available genome (here we use it for Sclerotinia sclerotiorum). This is designed to run on the HCC SLURM Cluster. There are no guarantees that this will work anywhere else.

Running the workflow

To build your analysis, add your data to directories called reads/ and genome/. In your shell, type:

make

This will generate the genome index, sam files, bam files, g.vcf files and a final GATK/res.vcf.gz.

Makefile options

You can find all of the makefile options by typing make help:

$ make help

COMMANDS
============

all		makes everything: res.vcf.gz
index		generate the bowtie2 index
trim		trims reads with trimmomatic
map		map reads with samtools
bam		convert sam to bam, filter, and remove duplicates
validate	run validation stats on the bam and sam files
gvcf		create g.vcf files from HaplotypeCaller
vcf		create vcf files (this is the longest step)

help		show this message
clean		remove all *.jid files
cleanall	REMOVE ALL GENERATED FILES

PARAMETERS
============

ROOT DIR  :  /work/everhartlab/kamvarz/test/read-processing
TEMP DIR  :  $TMPDIR
INDEX DIR :  bt2-index
PREFIX    :  Ssc
GENOME    :  genome/GCA_001857865.1_ASM185786v1_genomic.fna.gz genome/sclerotinia_sclerotiorum_mitochondria_2_contigs.fasta.gz
READS     :  reads/SS.11.01 reads/SS.11.02

Modules
============

BOWTIE   :  bowtie/2.2
TRIMMOD  :  trimmomatic/0.36
SAMTOOLS :  samtools/1.3
VCFTOOLS :  vcftools/0.1
PICARD   :  picard/2.9
GATK     :  gatk/3.4

Workflow

A workflow with two samples and two genome files (full and mitochondria) would look like this:

"a graph representation of the workflow of make all"

  • Blue nodes: user-supplied files
  • Grey nodes: makefile-generated folders
  • Pink nodes: derived files per sample
  • Purple/periwinkle nodes: indexed genomic data
  • brown nodes: scripts in the scripts directory

You can find the pdf version of this graph here. The color palette used for the graph is from the GrandBudapest2 palette of the The Wes Anderson Palettes by Karthik Ram.

Adding steps to the workflow

If you want to add steps/rules to the workflow, you should first be familiar or comfortable with Makefiles. Here are some helpful guides:

Many of the recipes in the makefile take the form of:

# Converting In to Out --------------------------------------------------------
$(OUT_DIR)/%.out : $(FROM_DIR)/%.in scripts/in-to-out.sh | $(OUT_DIR) $(OUT_RUN)
    sbatch \
    -D $(ROOT_DIR) \
    -J IN-TO-OUT \
    --dependency=afterok:$$(bash scripts/get-job.sh $<.jid) \
    -o $(OUT_RUN)/$*.out \
    -e $(OUT_RUN)/$*.err \
    scripts/in-to-out.sh $< | cut -c 21- > $@.jid

Where sbatch is being used on the HCC to submit a custom SLURM script scripts/in-to-out.sh, which will convert .in files to .out files. One of the first things to note are the trailing backslashes. These are continuation lines for BASH, so when it runs it will run like this:

sbatch -D $(ROOT_DIR) -J IN-TO-OUT --dependency=afterok:$$(bash scripts/get-job.sh $<.jid) -o $(OUT_RUN)/$*.out -e $(OUT_RUN)/$*.err scripts/in-to-out.sh $< | cut -c 21- > [email protected]

All jobs run from sbatch will return the following:

Submitted batch job XXXXX

where XXXXX is the JOBID. Since make does not know anything about jobs running on SLURM, we need to tell SLURM to hold off on running a given job until the dependent jobs are done. We take advantage of this by piping this to cut -c 21- > [email protected], which creates a file with the JOBID that we can use in a downstream process.

The arguments to sbatch are:

  • D Root directory for the system
  • J jobname (we try to make them match the script names)
  • --dependency This stipulates that this job can only run if the the dependency from the .jid file has been fulfilled. Note that multiple jid files can be specified. Note that because $ is special in make, we have to use another $ to escape it.
  • -o,-e the stdout and stderr files, respectively.

You may have noticed that this uses automatic makefile variables such as $< and $@, which stand for the first dependency and target, respectively. Depending on the needs of various scripts, we need different variables. It would be a good idea to keep that page open as a reference when creating new steps.

Here's an example from the script:

# Sorting and Converting to BAM files -----------------------------------------
$(BAM_DIR)/%_nsort.bam : $(SAM_DIR)/%.sam scripts/sam-to-bam.sh | $(BAM_DIR) $(BAM_RUN)
    sbatch \
    -D $(ROOT_DIR) \
    -J SAM-TO-BAM \
    --dependency=afterok:$$(bash scripts/get-job.sh $<.jid) \
    -o $(BAM_RUN)/$*_nsort.out \
    -e $(BAM_RUN)/$*_nsort.err \
    scripts/sam-to-bam.sh $(<D) $(@D) $* $(SAMTOOLS) | \
       cut -c 21- > $@.jid

In this example, we are using the sam-to-bam.sh script to convert sam files to bam files. If you run the script by itself, it will tell you the requirements:

Usage: sam-to-bam.sh <SAMDIR> <BAMDIR> <BASE> <SAMTOOLS>

    <SAMDIR> - the directory for the samfiles
    <BAMDIR> - the directory for the bamfiles
    <BASE>   - base name for the sample (e.g. SS.11.01)
    <SAMTOOLS> - the samtools module (e.g. samtools/1.3)

Generated directories

  • bt2-index/: genome index files generated via make index
  • TRIM/: home for the trimmed reads via trimmomatic
  • SAMS/: mapped sam files generated via make map
  • BAMS/: filtered bam files
  • GVCF/: *.g.vcf and *.vcf files generated via GATK
  • runs/: std out and std err of runs