Title: The Cloud and the Shell - Applied Bioinformatics on the Example of Gene Expression Analysis using Unix and freely available Open Source Tools
Author: Alexander Kerner
EMail: training at silico-sciences.com
Seminar Ruprecht-Karls-Universität Heidelberg

Applied Bioinformatics in the Cloud and in the Shell

RNA-Seq Analysis Using Unix and Open Source Tools

[TOC]

Recap

Nohup

nohup: run a command immune to hangups, with output to a non-tty

$ nohup p1 &
[1] 14958
$ less nohup.out
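
As a small sketch of the same idea with an explicit log file: when standard output is redirected, nohup writes to that file instead of nohup.out. The `sh -c '...'` command only stands in for the placeholder p1, and job.log is an arbitrary file name chosen for this example.

```shell
# Stand-in for a long-running program (p1 in the example above).
nohup sh -c 'sleep 1; echo finished' > job.log 2>&1 &

# $! holds the process ID of the most recent background job.
job_pid=$!
echo "started background job with PID $job_pid"

# Block until the job has ended, then inspect its output.
wait "$job_pid"
cat job.log
```

In a real session you would log out instead of calling wait; the job keeps running and job.log can be read after the next login.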

Remote Shell

Preparations

  1. Get and save the key file

  2. cd to the corresponding directory

  3. 'Install' the ssh key

     $ ssh-add training@silico_rsa 
     Enter passphrase for training@silico_rsa: 
     Identity added: training@silico_rsa (training@silico_rsa)
    
  4. In case of a permissions error, fix the permissions

     $ chmod 600 training@silico_rsa
    
  5. In case of an ssh-agent error, fix as described here


ssh: OpenSSH SSH client (remote login program)

$ ssh silico-sciences.com

$ ssh 176.28.21.178

$ ssh [yourname]@silico-sciences.com

  1. Log in to the remote system using [yourname]@silico-sciences.com

  2. Verify successful login:

    1. whoami

    2. hostname

    3. ifconfig

    4. Get some system information:

       $ cat /etc/lsb-release
       DISTRIB_ID=Ubuntu
       DISTRIB_RELEASE=14.04
       DISTRIB_CODENAME=trusty
       DISTRIB_DESCRIPTION="Ubuntu 14.04.3 LTS"
      
    5. Run these commands on both the remote and the local system and compare the output.

RNA-Seq using the Shell, IGV and the Tuxedo Suite

Getting Data

Use wget to download data from here.

There are many ways to retrieve all files in that list. Here are some hints:

  • Download this list as a file.

  • Use sed or tr to replace all newline characters (\n) with a space character ( ).

  • Redirect the resulting list to a file or pipe it directly to wget.

  • Take a look at wget's -i option.

  • Use wget with the --no-directories, the --accept-regex and the --recursive option.
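
The hints above could be combined roughly as follows. This is only a sketch: the two example.org URLs and the file name urls.txt are made up, and the wget calls are printed rather than executed so that nothing is downloaded by accident.

```shell
# Hypothetical URL list; in the seminar this would be the downloaded list file.
cat > urls.txt <<'EOF'
http://example.org/data/sample1_1.fastq.gz
http://example.org/data/sample1_2.fastq.gz
EOF

# Variant 1: join all lines with spaces (tr replaces every newline),
# so the URLs can be passed to wget as ordinary arguments.
url_args=$(tr '\n' ' ' < urls.txt)
echo "wget --no-directories $url_args"

# Variant 2: let wget read the list itself, one URL per line.
# 'wget -i -' would read the list from standard input instead of a file.
echo "wget --no-directories -i urls.txt"
```

Remove the echo in front of one of the two variants to start the actual download.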

Creating a Bowtie2 index

TopHat uses Bowtie2 as a 'mapping engine'. Bowtie2 requires the reference genome to be indexed.

Create this index file as described here.
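
A minimal sketch of the indexing step, assuming the reference is a single FASTA file. The names genome_chr03.fa and genome_chr03 are placeholders, not files from the course server. bowtie2-build takes the FASTA file and an index base name and writes files such as [base].1.bt2; the sketch skips the build if that file already exists or if the tool is not installed.

```shell
ref_fasta="genome_chr03.fa"   # hypothetical reference FASTA
index_base="genome_chr03"     # hypothetical base name for the index files

if [ -f "${index_base}.1.bt2" ]; then
    # Indexing is slow, so reuse an index that is already there.
    index_status="index already present"
elif command -v bowtie2-build >/dev/null 2>&1 && [ -f "$ref_fasta" ]; then
    bowtie2-build "$ref_fasta" "$index_base"
    index_status="index built"
else
    index_status="skipped: bowtie2-build or $ref_fasta missing"
fi
echo "$index_status"
```

The base name (without any .bt2 suffix) is what TopHat expects as its index argument.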

Mapping Reads to the Reference Sequence

  1. Use TopHat to map the reads to the reference genome:

     $ tophat -o [some-out-dir] -G [reference-annotation].gtf [reference-bowtie2-index-file] [reads]_1.fastq [reads]_2.fastq
    
    TopHat produces several output files, among them:

    1. accepted_hits.bam
    

    Note: See here how to avoid repetitive index building. A prebuilt transcriptome index for chromosome 3 is available at /var/data/bi/reference/prebuild/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/transciptome_index/genes_chr03.

  2. Use samtools idxstats to see the number of mapped/unmapped reads in the created accepted_hits.bam file (see here). The output columns are reference name, sequence length, mapped reads, and unmapped reads.

     $ samtools idxstats accepted_hits_sorted.bam | column -t
     3  198022430  14926  0
     *  0          0      0
    
     $ samtools idxstats unmapped_sorted.bam | column -t
     3  198022430  0  0
     *  0          0  2
    
  3. Take a look at the BAM files and compare what each command shows:

    1. less accepted_hits.bam

    2. zless accepted_hits.bam

    3. samtools view accepted_hits.bam
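
To get totals instead of per-reference counts, the third and fourth idxstats columns can be summed with awk. The sketch below replays the example output from step 2 through a here-document instead of calling samtools; the file name idxstats.txt is a placeholder.

```shell
# Replay the example output shown above
# (normally: samtools idxstats accepted_hits_sorted.bam > idxstats.txt).
cat > idxstats.txt <<'EOF'
3  198022430  14926  0
*  0          0      0
EOF

# Column 3 holds the mapped reads, column 4 the unmapped reads.
mapped=$(awk '{ sum += $3 } END { print sum + 0 }' idxstats.txt)
unmapped=$(awk '{ sum += $4 } END { print sum + 0 }' idxstats.txt)

echo "mapped: $mapped, unmapped: $unmapped"
```

With samtools installed, the same awk calls can read directly from a pipe, for example after `samtools idxstats accepted_hits_sorted.bam |`.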

Calculate Gene Expressions

  1. Use Cuffquant to precompute gene expression levels.

     $ cuffquant [reference-annotation].gtf [tophat_out]/accepted_hits.bam
    

    Options (less speed, more accuracy):

    1. -b/--frag-bias-correct: use bias correction - reference fasta required

    2. -u/--multi-read-correct: use 'rescue method' for multi-reads

       $ cuffquant -b [reference-seq].fa -u [reference-annotation].gtf [tophat_out]/accepted_hits.bam
      
  2. Use Cuffdiff to find significant changes in expression level.

     $ cuffdiff -o [some-out-dir] -L Lung,Stomach,Heart [reference-annotation].gtf [lung1-4-cuffquant_out]/abundances.cxb,[lung2-4-cuffquant_out]/abundances.cxb,[lung3-4-cuffquant_out]/abundances.cxb,[lung4-4-cuffquant_out]/abundances.cxb [stomach1-4-cuffquant_out]/abundances.cxb,[stomach2-4-cuffquant_out]/abundances.cxb,[stomach3-4-cuffquant_out]/abundances.cxb,[stomach4-4-cuffquant_out]/abundances.cxb [heart1-4-cuffquant_out]/abundances.cxb,[heart2-4-cuffquant_out]/abundances.cxb,[heart3-4-cuffquant_out]/abundances.cxb,[heart4-4-cuffquant_out]/abundances.cxb
     
     $ cuffdiff -o [some-out-dir] -L Lung,Heart [reference-annotation].gtf [lung1-4-cuffquant_out]/abundances.cxb,[lung2-4-cuffquant_out]/abundances.cxb,[lung3-4-cuffquant_out]/abundances.cxb,[lung4-4-cuffquant_out]/abundances.cxb  [heart1-4-cuffquant_out]/abundances.cxb,[heart2-4-cuffquant_out]/abundances.cxb,[heart3-4-cuffquant_out]/abundances.cxb,[heart4-4-cuffquant_out]/abundances.cxb
    

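    Note: Pay attention to the correct use of commas and spaces. Separate the replicates of one condition with commas (no whitespace after the comma) and separate the conditions from each other with spaces. The labels given to -L are comma-separated and must appear in the same order as the conditions.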
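
Typing such long argument lists by hand is error-prone. One possible way to assemble them is a small helper that joins the replicates with commas. The function name join_by_comma, the directory names and genes.gtf are made up for this sketch, and the cuffdiff call is only printed, not executed.

```shell
# Join all arguments with commas and no whitespace.
# "$*" joins the arguments with the first character of IFS;
# the subshell keeps the changed IFS local to this function.
join_by_comma() {
    ( IFS=,; printf '%s\n' "$*" )
}

# Hypothetical cuffquant output directories, two replicates per condition.
lung=$(join_by_comma lung1_cq/abundances.cxb lung2_cq/abundances.cxb)
heart=$(join_by_comma heart1_cq/abundances.cxb heart2_cq/abundances.cxb)

# Conditions are separated by spaces, labels by commas, both in the same order.
cuffdiff_cmd="cuffdiff -o cuffdiff_out -L Lung,Heart genes.gtf $lung $heart"
echo "$cuffdiff_cmd"
```

Printing the command first makes it easy to check the commas and spaces before starting a long run.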

Data Analysis

Cuffdiff writes fold changes to the table [cuffdiff_out]/gene_exp.diff.

  1. Use cut to cut away columns that we are not interested in.

  2. Use sort to sort the table by

    1. significance and

    2. absolute log2 fold change (descending).

  3. Use grep and a pipe (|) to extract lines with a significant regulation into a new file.

Do not overwrite [cuffdiff_out]/gene_exp.diff. Use pipes instead!
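
One possible pipeline for these steps is sketched below. It assumes the tab-separated 14-column layout of Cuffdiff's gene-level table (called gene_exp.diff in the Cufflinks manual). The three data rows are invented for illustration, and significant_genes.tsv is an arbitrary output name. Because sort cannot order by absolute value on its own, awk adds a temporary helper column that is cut away again afterwards.

```shell
# Made-up rows in the column layout of gene_exp.diff (not real results).
{
  printf 'test_id\tgene_id\tgene\tlocus\tsample_1\tsample_2\tstatus\tvalue_1\tvalue_2\tlog2(fold_change)\ttest_stat\tp_value\tq_value\tsignificant\n'
  printf 'G1\tG1\tGENE_A\t3:100-200\tLung\tHeart\tOK\t10\t40\t2.0\t3.1\t0.0001\t0.001\tyes\n'
  printf 'G2\tG2\tGENE_B\t3:300-400\tLung\tHeart\tOK\t80\t10\t-3.0\t-4.2\t0.0001\t0.001\tyes\n'
  printf 'G3\tG3\tGENE_C\t3:500-600\tLung\tHeart\tOK\t20\t21\t0.07\t0.1\t0.8\t0.9\tno\n'
} > gene_exp.diff

tab=$(printf '\t')

# Skip the header, keep gene, log2 fold change, q-value and significance,
# keep only significant rows, then sort by absolute fold change (descending).
# Note: infinite fold changes (inf/-inf) are not handled by this numeric sort.
tail -n +2 gene_exp.diff \
  | cut -f 3,10,13,14 \
  | grep 'yes$' \
  | awk -F '\t' 'BEGIN { OFS = "\t" } { a = ($2 < 0) ? -$2 : $2; print a, $0 }' \
  | sort -t "$tab" -k1,1nr \
  | cut -f 2- \
  > significant_genes.tsv

cat significant_genes.tsv
```

The input table is only read, never written, so the original Cuffdiff output stays intact.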


If you are wondering whether it is worth writing a script, take a look at this matrix:

matrix



References

  1. SAM/BAM format

  2. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks

  3. Analysis of the Human Tissue-specific Expression by Genome-wide Integration of Transcriptomics and Antibody-based Proteomics

  4. Cufflinks manual