-
Notifications
You must be signed in to change notification settings - Fork 17
TranscriptClean Tutorial
User note: If using the chromosome 1 reference genome fasta file provided in the example_files directory, please run the following command to unzip it first:
gunzip chr1.fa.gz
At its most basic, TranscriptClean can be run using an input SAM file of transcripts and the reference genome fasta file that was used to map them. Using the example files provided with TranscriptClean for GM12878 chromosome 1, the command is
python TranscriptClean.py --sam example_files/GM12878_chr1.sam \
--genome chr1.fa \
--outprefix /my/path/outprefix
python TranscriptClean.py --sam example_files/GM12878_chr1.sam \
--genome chr1.fa \
--maxLenIndel 10 \
--outprefix /my/path/outprefix
Indel and mismatch correction can be disabled using the --correctIndels and --correctMismatches options, respectively.
No indel correction:
python TranscriptClean.py --sam example_files/GM12878_chr1.sam \
--genome chr1.fa \
--correctIndels false --outprefix /my/path/outprefix
No mismatch correction:
python TranscriptClean.py --sam example_files/GM12878_chr1.sam \
--genome chr1.fa \
--correctMismatches false \
--outprefix /my/path/outprefix
Correcting all mismatches and indels in the transcripts is not an ideal approach, because some of them are likely to be variants. In particular, many mismatches in the sequence may be single nucleotide polymorphisms (SNPs), or more rarely, RNA editing events. To address this, you can run TranscriptClean in variant-aware mode by providing a VCF file. In this mode, mismatches and indels that exactly match a variant in the VCF file will retain their original sequence. The match must be exact- if a mismatch is in the same place as a known SNP but does not match the known alleles, it will be treated as a sequencing error. One downside of this is that we are not yet able to address special cases where there is an indel variant in the sequence disrupted by a sequencing error. This is something we hope to improve in the future.
python TranscriptClean.py --sam example_files/GM12878_chr1.sam \
--genome chr1.fa \
--variants example_files/GM12878_chr1.vcf.gz \
--outprefix /my/path/outprefix
To enable correction of noncanonical splice junctions, you must supply an annotation file of high-confidence splice sites. At this time, the required format matches the splice junction file that the STAR read aligner (https://github.com/alexdobin/STAR) outputs when mapping short RNA-seq reads to the genome. For each splice junction, the file lists its genomic location, strand, intron motif, whether or not the junction is in the provided GTF transcriptome annotation (i.e. GENCODE v24), and the amount of read support it has. If short reads in fastq format are available for the sample that you are working with, the most straightforward way to get a splice junction file is to run STAR. We use the following command with parameters recommended by the ENCODE consortium:
STAR \
--runThreadN 4 \
--genomeDir STAR_hg38_ENCODE \
--readFilesIn illumina_1.fastq illumina_2.fastq \
--sjdbGTFfile gencode.v24.annotation.gtf \
--outFilterType BySJout \
--outFilterMultimapNmax 20 \
--alignSJoverhangMin 8 \
--alignSJDBoverhangMin 1 \
--outFilterMismatchNmax 999 \
--outFilterMismatchNoverLmax 0.04 \
--alignIntronMin 20 \
--alignIntronMax 1000000 \
--alignMatesGapMax 1000000 \
--outSAMattributes NH HI NM MD jM jI \
--outSAMtype SAM
Once you have a splice junction annotation in the appropriate format, run TranscriptClean with the following command to perform splice junction correction.
python TranscriptClean.py --sam example_files/GM12878_chr1.sam \
--genome chr1.fa \
--spliceJns example_files/GM12878_SJs_chr1.tab \
--outprefix /my/path/outprefix
For each noncanonical splice site, TranscriptClean will find the nearest annotated splice site in the file you provided. By default, if the nearest splice site is ≤ 5 bp away, the original splice site will be modified to match the annotated site. If you prefer to use a different distance threshold, you may change it using the --maxSJOffset option.
python TranscriptClean.py --sam example_files/GM12878_chr1.sam \
--genome chr1.fa \
--spliceJns example_files/GM12878_SJs_chr1.tab \
--maxSJOffset 10
--outprefix /my/path/outprefix