This repository hosts a pipeline for analysing metatranscriptomic data. The pipeline is launched with Nextflow and runs inside a Singularity container. It performs de novo assembly and differential expression analysis, which allows novel genes and transcripts to be identified.
The pipeline is intended to be used on a cluster or server. The basic command for running the pipeline is as follows:
nextflow run trinity.nf -c config -profile cluster
The options for the command are:
-c
[REQUIRED] path to the configuration file
-profile
[REQUIRED] defines the infrastructure used for execution: cluster for running on a cluster, server for running on a local server
--reads
[REQUIRED] full path to the tab-delimited specimen file (4 columns for paired-end reads, 3 columns for single-end reads)
--seqtype
[REQUIRED] defines the type of reads and takes one of two values; the default is fq
--libtype
[REQUIRED] defines the strand-specific RNA-seq read orientation. For paired-end reads it can be either FR or RF; for single-end reads it can be either F or R
--mem
[OPTIONAL] maximum memory available for the algorithm to run. Input should be a string in the format "10G". The default value within the pipeline is 50G; the right value depends on the resources available on the execution infrastructure and on the size of the dataset
--cpus
[REQUIRED] number of CPUs to be used at each stage of the assembly process. The default value is 8
--dispersion
[REQUIRED] dispersion value used for differential expression analysis. Default is 0.1
--bfly_HeapSpaceMax
[OPTIONAL] maximum heap space for Butterfly execution, default is 30G
--bfly_HeapSpaceInit
[OPTIONAL] initial heap space for Butterfly execution, default is 5G
--bfly_GCThreads
[OPTIONAL] specifies threads to be made available for Butterfly execution, default is 10
--bfly_CPU
[OPTIONAL] specifies maximum CPUs for the Butterfly algorithm, default is 6
--bowtie2_thr
[OPTIONAL] defines the threads to be made available for bowtie2, default is 8
--outdir
[OPTIONAL] specifies the name of the output directory to save pipeline output to. Default name is results.
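Putting several of the options above together, a fuller invocation might look like the following sketch. The specimen file path is hypothetical, and RF is only an assumed library type for a paired-end strand-specific library; the remaining values are the defaults listed above. The command is printed first so it can be checked before launching.

```shell
# Hypothetical location of the specimen file; replace with your own.
reads="/data/project/specimens.tsv"

# Collect the command as positional parameters so each argument stays intact.
# --libtype RF is an assumption for a paired-end library; the other values
# are the defaults documented in the option list.
set -- nextflow run trinity.nf -c config -profile cluster \
  --reads "$reads" --seqtype fq --libtype RF \
  --mem 50G --cpus 8 --dispersion 0.1 --outdir results

# Keep a printable copy of the command and show it for review.
cmd_line="$*"
echo "$cmd_line"

# Uncomment to launch the pipeline.
# "$@"
```

Printing before running is a cheap way to catch a wrong path or a missing value before a long assembly job is submitted.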
The pipeline requires:
Nextflow 21 or higher
Singularity 3.5 or higher
The pipeline processes raw paired-end or single-end fastq reads. It takes a 4 column tab-delimited file for paired-end reads or a 3 column tab-delimited file for single-end reads. The tab-delimited specimen file specifies the sample ID, group name and absolute paths of the raw reads.
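As an illustration, a paired-end specimen file could be written and sanity-checked as in the sketch below. The sample IDs, group names and read paths are hypothetical, and the column order (sample ID, group name, first read file, second read file) is an assumption taken from the order of the description above; confirm it against the pipeline before use.

```shell
# Hypothetical output name; pass this file's full path to --reads.
specimen_file="specimens.tsv"

# One row per sample. Assumed column order:
# sample ID, group name, first read file, second read file.
# printf reuses the format for each group of four arguments.
printf '%s\t%s\t%s\t%s\n' \
  sampleA control /data/reads/sampleA_R1.fastq /data/reads/sampleA_R2.fastq \
  sampleB treated /data/reads/sampleB_R1.fastq /data/reads/sampleB_R2.fastq \
  > "$specimen_file"

# Check that every row has 4 tab-separated fields and absolute read paths.
awk -F'\t' '
  NF != 4                    { printf "line %d: expected 4 fields, found %d\n", NR, NF; bad = 1 }
  $3 !~ /^\// || $4 !~ /^\// { printf "line %d: read paths should be absolute\n", NR; bad = 1 }
  END { exit bad }
' "$specimen_file" && echo "specimen file looks well-formed"
```

For single-end reads the same idea applies with three fields per row and a single path column.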
A directory named results (or the name given with --outdir) is saved in the current working directory. The following output is saved:
- butterfly.txt - log file for the last stage of assembly, where the Butterfly algorithm runs.
- chrysalis.txt - log file for the assembly stage where the Chrysalis algorithm runs.
- DE - directory containing output from differential expression analysis. MA plots and volcano plots comparing groups are saved here, along with the matrices used to generate them.
- inchworm.txt - log file for the Inchworm stage of assembly.
- jellyfish.txt - log file for the first step of assembly.
- trinity_matrix - directory containing the count matrix.
- trinity_out_dir - directory containing intermediate assembly files and trimmed reads.
- trinity_out_dir.Trinity.fasta - assembly output with all assembled transcripts in FASTA format.
- trinity_out_dir.Trinity.fasta.gene_trans_map - gene-to-transcript mapping file.
- trinity_quantification - directory containing isoform and gene counts.
The container runs Trinity (version 2.14.0). The container build and its software dependencies are specified in the definition file.