Merge branch 'master' into 'public'
Merge for version 1.4

See merge request icbi-lab/pipelines/rnaseq-nf!41
riederd committed Jul 28, 2023
2 parents 8ac296d + 5c570e0 commit 71046cf
Showing 14 changed files with 544 additions and 375 deletions.
35 changes: 26 additions & 9 deletions README.md
Original file line number Diff line number Diff line change
@@ -66,26 +66,29 @@ curl -s https://get.nextflow.io | bash

The pipeline will install almost all required tools via Singularity images or conda environments. If preferred, one can also use local installations of all tools (not recommended; please see `Manual installation` at the end of this document).

The software that needs to be present on the system is **Java** (minimum version 8), **Nextflow** (see above), **Singularity**, **Conda** (optional).
The software that needs to be present on the system is **Java** (minimum version 8; if running with conda, Java version 17 or higher is needed), **Nextflow** (see above), **Singularity**, and **Conda** (optional).

If you intend to run the pipeline with the `conda` profile instead of singularity, we recommend installing `mamba` (<https://github.com/mamba-org/mamba>)
to speed up the creation of conda environments. If you cannot install `mamba`, please set `conda.useMamba = false` for the `conda` profile in `conf/profiles.config`.
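As a sketch, the relevant stanza in `conf/profiles.config` might look like this (the exact structure of the shipped file may differ; only the `conda.useMamba` setting is taken from the text above):

```
profiles {
    conda {
        conda.useMamba = false   // set to false only if mamba is unavailable
    }
}
```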

**Optional but recommended:**
Due to license restrictions you may also need to download and install **HLA-HD** on your own, and set the installation path in ```conf/params.config```. _If HLA-HD is not available, Class II neoepitopes will NOT be predicted._
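For illustration, setting the HLA-HD path in `conf/params.config` might look like the following sketch (the parameter name `HLAHD_DIR` is the one documented in section 2 of this README; the path itself is a placeholder):

```
params {
    // placeholder path to your local HLA-HD installation
    HLAHD_DIR = "/opt/hlahd"
}
```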

### 1.2 References
### 1.3 References

The pipeline requires different reference files, indexes and databases:

please see ```conf/resources.config```

For each nextNEOpi version we prepared a bundle with all needed references, indexes and databases which can be obtained from:

`https://apps-01.i-med.ac.at/resources/nextneopi/`
<https://apps-01.i-med.ac.at/resources/nextneopi/>

The bundle is named to match the release version: `nextNEOpi_<version>_resources.tar.gz`

e.g.:

<https://apps-01.i-med.ac.at/resources/nextneopi/nextNEOpi_1.3_resources.tar.gz>
<https://apps-01.i-med.ac.at/resources/nextneopi/nextNEOpi_1.4_resources.tar.gz>

Download and extract the contents of the archive into the directory you specified for ```resourcesBaseDir``` in the ```conf/params.config``` file.
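As a minimal sketch, fetching and unpacking the 1.4 bundle could look like this (the target directory is a placeholder, and the actual download/extract commands are left commented so you can adapt them to your setup first):

```shell
# Build the bundle URL from the release version (1.4 at the time of this commit).
VERSION="1.4"
BUNDLE="nextNEOpi_${VERSION}_resources.tar.gz"
URL="https://apps-01.i-med.ac.at/resources/nextneopi/${BUNDLE}"
echo "${URL}"

# Then download and extract into your configured resourcesBaseDir, e.g.:
# curl -fSL -O "${URL}"
# tar -xzf "${BUNDLE}" -C /path/to/resourcesBaseDir
```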

@@ -117,6 +120,15 @@ Refs:
* <https://gdc.cancer.gov/about-data/gdc-data-processing/gdc-reference-files>
* <https://www.gencodegenes.org/human/>

### 1.4 Testdata

If you want to test the pipeline using a working minimal test dataset, you may download one from

<https://apps-01.i-med.ac.at/resources/nextneopi/nextNEOpi_testdata.tar.gz>

Please note that due to the limited read coverage, `CNVkit` will not run successfully on this test dataset. Please run the
pipeline with the parameter `--CNVkit false` when testing with this dataset.
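A test run could then be invoked roughly as follows (the batch file name is an assumption; adjust it to whatever the extracted test data actually contains, and see section 2 for the full usage):

```shell
# Fetch and unpack the test data first (run these manually):
# curl -fSL -O https://apps-01.i-med.ac.at/resources/nextneopi/nextNEOpi_testdata.tar.gz
# tar -xzf nextNEOpi_testdata.tar.gz

# Disable CNVkit, since the test data's read coverage is too low for it.
CMD="nextflow run nextNEOpi.nf --batchFile batchFile_FASTQ.csv --CNVkit false -profile singularity -config conf/params.config"
echo "NXF_VER=22.10.8 ${CMD}"
```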

## 2. Usage

Before running the pipeline, the config files in the ```conf/``` directory may need to be edited. In the
@@ -127,8 +139,11 @@ the number of CPUs assigned for each process and adjust according to your system
Most pipeline parameters can be edited in the ```params.config``` file or changed at run time with command-line options, using ```--NameOfTheParameter``` as given in ```params.config```.
References and databases should be edited in the ```resources.config``` file.

**Note**: nextNEOpi is currently written in Nextflow DSL 1, which is only supported up to Nextflow version 22.10.8. This means you need to pin the Nextflow
version by setting the environment variable `NXF_VER=22.10.8` in case you have installed a newer Nextflow version.

```
nextflow run nextNEOpi.nf --batchFile <batchFile_FASTQ.csv | batchFile_BAM.csv> -profile singularity|conda,[cluster] [-resume] -config conf/params.config
NXF_VER=22.10.8 nextflow run nextNEOpi.nf --batchFile <batchFile_FASTQ.csv | batchFile_BAM.csv> -profile singularity|conda,[cluster] [-resume] -config conf/params.config
```

**Profiles:** conda or singularity
@@ -269,6 +284,8 @@ nextflow run nextNEOpi.nf \
```--TCR``` Run mixcr for TCR prediction
Default: true

```--CNVkit``` Run CNVkit for detecting CNAs. Default: true

```--HLAHD_DIR``` Specify the path to your HLA-HD installation. Needed if Class II neoantigens should be predicted.

```--HLA_force_RNA``` Use only RNAseq for HLA typing. Default: false
@@ -350,11 +367,11 @@ If you prefer local installation of the analysis tools please install the follow
* BWA (Version >= 0.7.17)
* SAMTOOLS (Version >= 1.9)
* GATK3 (Version 3.8-0)
* GATK4 (Version >= 4.2.5.0)
* VARSCAN (Version 2.4.3)
* GATK4 (Version >= 4.4.0.0)
* VARSCAN (Version 2.4.6)
* MUTECT1 (Version 1.1.7) ---- optional
* BAMREADCOUNT (Version 0.8.0)
* VEP (Version v105)
* VEP (Version v110)
* BGZIP
* TABIX
* BCFTOOLS
@@ -368,7 +385,7 @@ If you prefer local installation of the analysis tools please install the follow
* YARA
* HLA-HD
* ALLELECOUNT
* RSCRIPT (R > 3.6.1)
* RSCRIPT (R > 3.6.2)
* SEQUENZA (3.0)
* CNVkit

2 changes: 1 addition & 1 deletion assets/.mambarc
@@ -3,4 +3,4 @@ channels:
- bioconda
- defaults
always_yes: true

channel_priority: flexible
2 changes: 1 addition & 1 deletion assets/email_template.html
@@ -44,7 +44,7 @@ <h3>Pipeline Configuration:</h3>
</table>

<p>icbi/nextNEOpi</p>
<p><a href="https://gitlab.i-med.ac.at/icbi-lab/pipelines/nextNEOpi">nextNEOpi</a></p>
<p><a href="https://github.com/icbi-lab/nextNEOpi">nextNEOpi</a></p>

</div>

Binary file removed assets/gatkPythonPackageArchive.zip
Binary file not shown.
32 changes: 23 additions & 9 deletions assets/nextNEOpi.def
@@ -1,30 +1,44 @@
Bootstrap: docker
From: mambaorg/micromamba
From: mambaorg/micromamba:0.24.0

%files
nextNEOpi.yml /nextNEOpi.yml
./.mambarc /root/.mambarc

%post
apt-get update && apt-get install -y \
procps \
curl \
unzip
apt-get --allow-releaseinfo-change update && apt-get install -y \
procps \
curl \
unzip \
libgomp1 \
openjdk-17-jdk

# set jdk-17 as default
update-java-alternatives -s java-1.17.0-openjdk-amd64

export LANG=C.UTF-8 LC_ALL=C.UTF-8
export PATH=/opt/conda/bin:$PATH

curl -L -o gatk-4.2.6.1.zip https://github.com/broadinstitute/gatk/releases/download/4.2.6.1/gatk-4.2.6.1.zip
unzip -j gatk-4.2.6.1.zip gatk-4.2.6.1/gatkPythonPackageArchive.zip -d ./
mkdir -p /opt/gatk
mkdir -p /opt/conda/bin

curl -L -o gatk-4.4.0.0.zip https://github.com/broadinstitute/gatk/releases/download/4.4.0.0/gatk-4.4.0.0.zip
unzip -j gatk-4.4.0.0.zip gatk-4.4.0.0/gatkPythonPackageArchive.zip -d ./
unzip -j gatk-4.4.0.0.zip gatk-4.4.0.0/gatk-package-4.4.0.0-local.jar -d ./opt/gatk/
unzip -j gatk-4.4.0.0.zip gatk-4.4.0.0/gatk -d ./opt/gatk/

chmod +x /opt/gatk/gatk
ln -s /opt/gatk/gatk /opt/conda/bin/gatk

micromamba install --yes --name base --file /nextNEOpi.yml

rm -f /nextNEOpi.yml
rm -f gatk-4.2.6.1.zip
rm -f gatk-4.4.0.0.zip
rm -f gatkPythonPackageArchive.zip

apt-get clean
micromamba clean --all --yes

%environment
export LANG=C.UTF-8 LC_ALL=C.UTF-8
export PATH=/opt/conda/bin:$PATH
export PATH=/usr/lib/jvm/java-17-openjdk-amd64/bin/:/opt/conda/bin:$PATH
13 changes: 6 additions & 7 deletions assets/nextNEOpi.yml
@@ -8,23 +8,19 @@ channels:
dependencies:
- bwa
- samtools
- gatk4=4.2.6.1
- fastp
- fastqc
- multiqc
- sambamba
- bcftools
- varscan
- bam-readcount
- yara
- optitype

# core python dependencies for GATK4 (4.2.6.1)
# core python dependencies for GATK4 (4.4.0.0)
- conda-forge::python=3.6.10 # do not update
- pip=20.0.2 # specifying channel may cause a warning to be emitted by conda
- conda-forge::mkl=2019.5 # MKL typically provides dramatic performance increases for theano, tensorflow, and other key dependencies
- conda-forge::mkl-service=2.3.0
- conda-forge::numpy=1.17.5 # do not update, this will break scipy=0.19.1
- conda-forge::numpy=1.17.5 # do not update, this will break scipy=1.0.0
# verify that numpy is compiled against MKL (e.g., by checking *_mkl_info using numpy.show_config())
# and that it is used in tensorflow, theano, and other key dependencies
- conda-forge::theano=1.0.4 # it is unlikely that new versions of theano will be released
@@ -42,9 +38,12 @@ dependencies:
- conda-forge::scikit-learn=0.23.1
- conda-forge::matplotlib=3.2.1
- conda-forge::pandas=1.0.3
- conda-forge::typing_extensions=4.1.1 # see https://github.com/broadinstitute/gatk/issues/7800 and linked PRs
- conda-forge::dill=0.3.4 # used for pickling lambdas in TrainVariantAnnotationsModel
- conda-forge::joblib=1.1.1

# core R dependencies; these should only be used for plotting and do not take precedence over core python dependencies!
- r-base=3.6.2
- r-base>=3.6.2
- r-data.table
- r-dplyr=0.8.5
- r-getopt=1.20.3
6 changes: 1 addition & 5 deletions assets/pVACtools_icbi.def
@@ -83,11 +83,7 @@ From: continuumio/miniconda3:4.9.2
pip install protobuf==3.20.1
pip install tensorflow>=2.2.2

# need to pin pandas version see: https://github.com/griffithlab/pVACtools/issues/779
# will not be required in pVACtools versions > 3.0.0
# pip install pandas==0.25.2

pip install pvactools==3.0.2
pip install pvactools==4.0.1

cd /opt
mkdir tmp_src
3 changes: 2 additions & 1 deletion assets/rigscore.def
@@ -6,7 +6,8 @@ From: mambaorg/micromamba

%post
apt-get update && apt-get install -y \
procps
procps \
patch

export LANG=C.UTF-8 LC_ALL=C.UTF-8
export PATH=/opt/conda/bin:$PATH
6 changes: 3 additions & 3 deletions bin/CSiN.py
@@ -43,8 +43,8 @@ def convert_to_df(pvacseq_1_tsv, pvacseq_2_tsv):
# Rename columns in order for the merge to work properly
pvacseq_2_df.rename(
columns={
"NetMHCIIpan WT Score": "NetMHCpan WT Score",
"NetMHCIIpan MT Score": "NetMHCpan MT Score",
"NetMHCIIpan WT IC50 Score": "NetMHCpan WT IC50 Score",
"NetMHCIIpan MT IC50 Score": "NetMHCpan MT IC50 Score",
"NetMHCIIpan WT Percentile": "NetMHCpan WT Percentile",
"NetMHCIIpan MT Percentile": "NetMHCpan MT Percentile",
},
@@ -63,7 +63,7 @@ def sub_csin(c, IC50_cutoff, xp_cutoff, filtered_df):
# Filter dataframe
filtered_df_tmp = filtered_df
filtered_df_tmp = filtered_df_tmp[filtered_df_tmp["NetMHCpan MT Percentile"] < rank]
filtered_df_tmp = filtered_df_tmp[filtered_df_tmp["Best MT Score"] < IC50_cutoff]
filtered_df_tmp = filtered_df_tmp[filtered_df_tmp["Best MT IC50 Score"] < IC50_cutoff]
filtered_df_tmp = filtered_df_tmp[filtered_df_tmp["Gene Expression"] > xp_cutoff]
# Get the VAF and mean VAF, then normilize
vaf_mean = filtered_df_tmp["Tumor DNA VAF"].mean()
37 changes: 18 additions & 19 deletions bin/get_epitopes.py
@@ -12,26 +12,26 @@
import sys
import csv

def filter_tsv(sample, input_file, output_file):
# Open the input TSV file and create a CSV reader
with open(input_file, 'r', newline='') as infile:
reader = csv.DictReader(infile, delimiter='\t')

def parse_mhcI(inFile, epitopes=[]):
with open(inFile) as in_file:
csv_reader = csv.reader(in_file, delimiter="\t")
in_file.readline()
for line in csv_reader:
if line[19] == "NA" or line[18] == "NA":
pass
else:
# print("%s\t%s\t%s" % (line[18], line[19], line[17]))
epitopes.append("%s\t%s\t%s" % (line[18], line[19], line[17]))
return epitopes
# Open the output TSV file and create a CSV writer
with open(output_file, 'w', newline='') as outfile:
fieldnames = ['Sample_ID', 'mut_peptide', 'Reference', 'peptide_variant_position']
writer = csv.DictWriter(outfile, fieldnames=fieldnames, delimiter='\t')

# Write the header
writer.writeheader()

def write_output(outFile, sample_id, epitopes=[]):
with open(outFile, "w") as out_file:
out_file.write("Sample_ID\tmut_peptide\tReference\tpeptide_variant_position\n")
for epitope in epitopes:
out_file.write("%s\t%s\n" % (sample_id, epitope))
return outFile
# Filter rows and write selected columns to the output TSV file
for row in reader:
if row['WT Epitope Seq'] and row['MT Epitope Seq']:
writer.writerow({'Sample_ID': sample,
'mut_peptide': row['MT Epitope Seq'],
'Reference': row['WT Epitope Seq'],
'peptide_variant_position': row['Mutation Position']})


if __name__ == "__main__":
@@ -45,5 +45,4 @@ def write_output(outFile, sample_id, epitopes=[]):
sample = args.sample_id
epitope_array = []

parse_mhcI(infile, epitope_array)
write_output(outfile, sample, epitope_array)
filter_tsv(sample, infile, outfile)
22 changes: 14 additions & 8 deletions conf/params.config
@@ -49,6 +49,9 @@ params {
// run controlFREEC
controlFREEC = false

// run CNVkit
CNVkit = true

// Panel of normals (see: https://gatk.broadinstitute.org/hc/en-us/articles/360040510131-CreateSomaticPanelOfNormals-BETA-)
mutect2ponFile = 'NO_FILE'

@@ -66,8 +69,7 @@ params {

// Directories (need to be in quotes)
tmpDir = "/tmp/$USER/nextNEOpi/" // Please make sure that there is enough free space (~ 50G)
workDir = "$PWD"
outputDir = "${workDir}/RESULTS"
outputDir = "${PWD}/results"

// Result publishing method
publishDirMode = "auto" // Choose between:
@@ -110,7 +112,7 @@ params {
HLA_HD_genome_version = "hg38"

// URL to the installation package of MiXCR, will be installed automatically.
MIXCR_url = "https://github.com/milaboratory/mixcr/releases/download/v4.0.0/mixcr-4.0.0.zip"
MIXCR_url = "https://github.com/milaboratory/mixcr/releases/download/v4.4.1/mixcr-4.4.1.zip"
MIXCR_lic = "" // path to MiXCR license file
MIXCR = "" // Optional: specify path to mixcr directory if already installed, will be installed automatically otherwise
// analyze TCRs using mixcr
@@ -127,8 +129,8 @@
IGS = "" // optional path to IGS

// IEDB tools urls for MHCI and MHCII. These will be used for IEDB installation into resources.databases.IEDB_dir
IEDB_MHCI_url = "https://downloads.iedb.org/tools/mhci/3.1.2/IEDB_MHC_I-3.1.2.tar.gz"
IEDB_MHCII_url = "https://downloads.iedb.org/tools/mhcii/3.1.6/IEDB_MHC_II-3.1.6.tar.gz"
IEDB_MHCI_url = "https://downloads.iedb.org/tools/mhci/3.1.4/IEDB_MHC_I-3.1.4.tar.gz"
IEDB_MHCII_url = "https://downloads.iedb.org/tools/mhcii/3.1.8/IEDB_MHC_II-3.1.8.tar.gz"


// Java settings: please adjust to your memory available
@@ -176,9 +178,9 @@ params {


// VEP
vep_version = "106.1"
vep_version = "110.0"
vep_assembly = "GRCh38"
vep_cache_version = "106"
vep_cache_version = "110"
vep_species = "homo_sapiens"
vep_options = "--everything" // "--af --af_1kg --af_gnomad --appris --biotype --check_existing --distance 5000 --failed 1 --merged --numbers --polyphen b --protein --pubmed --regulatory --sift b --symbol --xref_refseq --tsl --gene_phenotype"

@@ -204,7 +206,7 @@ params {
// pVACseq settings
mhci_epitope_len = "8,9,10,11"
mhcii_epitope_len = "15,16,17,18,19,20,21,22,23,24,25" // minimum length has to be at least 15 (see pVACtools /opt/iedb/mhc_ii/mhc_II_binding.py line 246)
epitope_prediction_tools = "NetMHCpan MHCflurry NetMHCIIpan"
epitope_prediction_tools = "NetMHCpan NetMHCpanEL MHCflurry MHCflurryEL NetMHCIIpan NetMHCIIpanEL"
use_NetChop = false
use_NetMHCstab = true

@@ -231,18 +233,22 @@ includeConfig './profiles.config'

timeline {
enabled = true
overwrite = true
file = "${params.tracedir}/icbi/nextNEOpi_timeline.html"
}
report {
enabled = true
overwrite = true
file = "${params.tracedir}/icbi/nextNEOpi_report.html"
}
trace {
enabled = true
overwrite = true
file = "${params.tracedir}/icbi/nextNEOpi_trace.txt"
}
dag {
enabled = true
overwrite = true
file = "${params.tracedir}/icbi/nextNEOpi_dag.svg"
}
