merlinis12 edited this page Nov 6, 2024 · 37 revisions

RNA-Seq Data Analysis in R Workshop

Date: November 12th
Instructor: Simona Merlini
Level: Basic/Intermediate
Requirements: R, RStudio (download link)

Overview

RNA-Seq is a widely used method for analyzing gene expression. This workshop focuses on differential expression analysis using R, guiding participants through essential steps from raw count data to identifying differentially expressed genes.

Objectives

  • Understand the basics of RNA-seq data analysis
  • Perform normalization, model fitting, and hypothesis testing
  • Identify differentially expressed genes
  • Interpret and visualize results

Prerequisites

  • Basic knowledge of R and RStudio
  • Install the required R packages listed in requirements.R

Welcome, everyone! Today, we'll be diving into RNA-Seq data analysis, specifically looking at differential gene expression analysis using R. RNA sequencing, or RNA-seq, is a powerful technique that allows us to study gene expression at a depth and resolution far beyond what we could achieve before. It has transformed our ability to understand how genes are expressed under various conditions, whether in healthy tissue versus disease, different time points, or across various experimental treatments. Throughout this session, we'll go from raw sequencing counts to identifying differentially expressed genes, covering the essential steps and statistical methods along the way.

The goal of today’s workshop is to equip you with the knowledge to process, analyze, and interpret RNA-seq data. We’ll also cover practical tips and hands-on R code, so by the end, you’ll have both the foundational understanding and some practical tools you can use in your own projects.

📆 Agenda

  1. 🧬 Introduction to RNA-Seq and Differential Expression
    Overview of RNA-seq technology, data structure, and objectives.

  2. 🔬 Data Preprocessing and Normalization
    Steps to preprocess and normalize gene count data.

  3. 🛠️ Differential Expression Analysis
    Perform differential expression analysis with DESeq2.

  4. 📊 Results Interpretation and Visualization
    Practical tips for visualizing and interpreting differential expression results.


🧬 Introduction to RNA-Seq and Differential Expression Analysis

Figure: RNA-Seq analysis. Image credit: Wikimedia Commons (CC).

RNA sequencing (RNA-seq) is a powerful high-throughput technique that captures a snapshot of the transcriptome by sequencing cDNA from RNA. This allows for quantifying gene expression across conditions, comparing transcript abundance between samples, and identifying differentially expressed genes. RNA-seq is widely preferred over microarrays due to its sensitivity, specificity, and ability to detect novel transcripts.

Key Aspects of RNA-Seq

  • Technology: RNA molecules are converted to cDNA and sequenced on platforms like Illumina, producing millions of reads.
  • Data Structure:
    • Raw Reads: Short sequence fragments.
    • Aligned Reads: Mapped to a reference genome for quantification.
    • Expression Levels: Transformed into values like CPM (Counts Per Million).
  • Objectives:
    • Gene expression quantification.
    • Differential expression analysis.
    • Transcript isoform analysis.
    • Novel transcript discovery.
    • RNA modification analysis.
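
To make the "Expression Levels" item concrete, here is a minimal base R sketch of how CPM (Counts Per Million) can be computed from a count matrix. The gene names, sample names, and counts are made up for illustration; packages such as edgeR provide their own CPM functions, which this sketch does not try to reproduce.

```r
# Toy count matrix (hypothetical values): rows are genes, columns are samples.
counts <- matrix(
  c(10,  20,
    30,  60,
    60, 120),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("sample1", "sample2"))
)

# Library size: total number of counted reads in each sample.
lib_sizes <- colSums(counts)

# CPM: divide each column by its library size, then scale to one million.
cpm <- sweep(counts, 2, lib_sizes, FUN = "/") * 1e6

print(lib_sizes)
print(cpm)
```

In this toy example sample2 was sequenced twice as deeply as sample1, so the raw counts differ by a factor of two while the CPM values agree. CPM corrects only for library size, not for gene length or composition effects.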

Steps in RNA-Seq Analysis

  1. Sample Preparation: RNA extraction, fragmentation, cDNA synthesis, and library preparation.
  2. Sequencing: High-throughput sequencing of cDNA libraries.
  3. Read Alignment: Mapping reads to a reference genome.
  4. Quantification: Counting reads for expression estimation.
  5. Differential Expression Analysis: Statistical testing to identify genes with significant expression changes.

Differential Gene Expression Analysis Workflow

  1. Sequencing (Biochemistry)
    • RNA extraction
    • Library preparation (including mRNA enrichment)
    • Sequencing
  2. Bioinformatics
    • Processing sequencing reads (alignment)
    • Estimating gene expression levels
    • Normalization
    • Identifying differentially expressed (DE) genes

Experimental Design Considerations

  • Biological Replicates: Include multiple replicates per condition to account for variability.
  • Sequencing Depth: Ensure sufficient depth for accurate expression capture.
  • Normalization: Apply methods to correct for library size and technical biases.

Most RNA-seq experiments aim to identify genes with expression differences across conditions. In downstream analysis, each gene’s expression is tested for changes between conditions, but reliable results require more than one measurement per condition. Gene expression can vary due to multiple factors (e.g., temperature, sex, time of day), not just the experimental condition (e.g., genotype or drug treatment). To isolate expression changes due to the condition of interest, RNA-seq experiments must include:

  • Biological Replicates: To capture variability due to individual differences between organisms or cell populations.
  • Technical Replicates: To capture variability due to procedural or instrument-related factors.
  • Experimental Controls: To address confounding and minimize bias.

A robust experimental design ensures that observed effects are reproducible and primarily due to the treatment, accounting for variability across similar but distinct samples (Altman and Krzywinski, 2014).

❗ Capturing Variability in RNA-Seq Experiments ❗

Accurately capturing variability is essential for reliable statistical inference in RNA-seq experiments. Without a realistic estimate of variance, statistical tests struggle to detect true gene expression differences. If the sample subset is too limited or unrepresentative, results may only reflect specific instances (e.g., the particular animals used or the lab environment).

An adequate number of replicates is crucial to:

  • Capture variability: Reflect the natural breadth of expression variance.
  • Identify and remove outliers: Ensure that outliers, if present due to technical or unrelated biological factors, can be removed without significantly impacting the analysis of background variation.

Only samples with valid technical or biological reasons for outlier status should be excluded, ensuring that the results remain representative of the population of interest.

Note

Types and Importance of Replicates in RNA-Seq Experiments:

  • Technical Replicates: help account for random noise associated with experimental protocols or equipment variations (Blainey et al., 2014). In RNA-seq, technical replicates involve repeated library preparations from the same RNA sample to control for batch effects during library prep (e.g., reverse transcription and PCR amplification), and multiplexing across different lanes of the same flowcell to mitigate lane effects such as variations in sample loading or sequencing efficiency.

Technical variability in RNA-seq is typically low, and using the same protocol and sequencing center for all samples further reduces the need for technical replicates that focus solely on library preparation.

  • Biological Replicates: capture natural biological variation by representing distinct samples, allowing for more accurate estimates of the mean and variance of gene expression (Blainey et al., 2014). Biological replicates should derive from independent cell or tissue growths.

Tip

Recommended Number of Replicates: Most RNA-seq studies use three biological replicates. However, Schurch et al. (2016) suggest:

  • At least six replicates for robust profiling of a condition's transcriptome or for detecting significant gene expression changes.
  • Up to twelve replicates if the aim is to detect a broad range of differentially expressed genes, including those with subtle changes or low expression levels.

Important

The generalizability of RNA-seq findings depends on how well the experiment captures representative samples of the population under study.

Avoiding Bias in RNA-Seq Experiments

A well-planned experiment aims to improve the precision of the results by addressing sources of bias and variability. The following steps should be considered during experimental design.

Key Steps in Reducing Bias:

  1. Identify the question of interest: Clearly define the effect you aim to study.
  2. Identify possible sources of variability (nuisance factors): Recognize and account for factors that could introduce unwanted variation.
  3. Plan to minimize the effect of nuisance factors: Design your experiment to reduce the influence of these factors.
  4. Protect against unknown sources of variation: Safeguard against unforeseen biases that may emerge during the experiment.

If the list of potential nuisance factors becomes overwhelming, return to step 1 and prioritize them. Conducting a pilot experiment may help in refining your experimental design.

Randomization:

True randomization is essential for reducing unconscious selection bias caused by factors like animal activity, appearance, or growth patterns of cell lines. It ensures that factors of interest, such as drug treatments or control conditions, are assigned randomly (Honaas et al., 2016). This approach minimizes subtle biases and enhances the reliability of the results.
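
As a minimal sketch of randomized assignment in base R: the sample IDs, the group sizes, and the seed below are all hypothetical, and a real design may also need to balance across nuisance factors such as cage, batch, or processing day.

```r
# Hypothetical sample IDs; in practice these would be your animals or cultures.
samples <- paste0("mouse_", 1:12)

# Fix the seed so the assignment is reproducible and can be documented.
set.seed(42)

# Shuffle a balanced vector of labels: 6 control and 6 treated.
treatment <- sample(rep(c("control", "treated"), each = 6))

design <- data.frame(sample = samples, treatment = treatment)
print(design)

# Confirm that the groups are balanced.
print(table(design$treatment))
```

Letting `sample()` pick the groups, rather than choosing by hand, removes the chance that the more active animals or the better-looking cultures end up in one condition.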


Quality Control of Raw Sequencing Data

Figure: Typical bioinformatics workflow of differential gene expression analysis with commonly used tools (shown in blue). Tools for quality control are marked in orange (with MultiQC allowing the convenient combination of numerous QC results). The most commonly used file formats to store the results of each processing step are indicated in gray. (Image source: RNA-Seq intro.)

Quality control (QC) should be conducted at every step of RNA-seq data analysis. Proactive and comprehensive QC helps to better understand your data and ensures appropriate assumptions and parameters are used in downstream analyses. Identifying and correcting flaws early in the process improves the overall analysis.

QC Process for Raw Sequencing Data:

The analysis typically begins with raw reads stored in FASTQ files. To evaluate the overall quality of sequenced reads, several issues need to be assessed, including:

  • PCR duplicates
  • Adapter contamination
  • rRNA and tRNA reads
  • Unmappable reads (e.g., from contaminating nucleic acids)

Tools for Quality Control:

One widely used tool for QC is FastQC, a program developed at the Babraham Institute. It evaluates quality scores and sequence composition of reads stored in FASTQ files. FastQC performs various tests, each flagged as:

  • Pass: Indicates good quality.
  • Warning: Suggests potential issues, but not necessarily critical.
  • Fail: Indicates significant problems with the data.

FastQC can be freely downloaded. For hands-on use, see the Michele & Carlos workshop.

While FastQC identifies problems like PCR duplicates and adapter contamination, unmappable reads (from contaminating nucleic acids) often require additional steps. Remember that some sample types may inherently exhibit certain biases, so not all “fail” results in FastQC necessarily mean the sequencing should be repeated.

Common FastQC Tests:

  • Per Base Sequence Quality
  • Per Sequence Quality Scores
  • GC Content
  • Adapter Content
  • Kmer Content

Read Alignment: see the Michele & Carlos workshop.

Read Quantification: see the Michele & Carlos workshop. The main caveats of assigning reads to transcripts are:

  • inconsistent annotation of transcripts
  • multiple isoforms of widely differing lengths
  • anti-sense/overlapping transcripts of different genes

There is no really good solution yet! Be careful with your conclusions and, if possible, limit your analyses to gene-based approaches.

🔬 Data Preprocessing and Normalization

Normalizing and Transforming Read Counts

In RNA-seq experiments, the number of sequenced reads mapped to a gene depends on multiple factors:

  • The gene's own expression level
  • The gene's length
  • The sequencing depth
  • The expression of all other genes within the sample

To accurately compare gene expression across different conditions, normalization is essential to account for the variability in read counts. This variability arises from differences in total RNA content, library complexity, and contamination. Normalization aims to remove systematic effects not associated with the biological differences of interest.

Common Normalization Methods

There are various normalization methods that address different aspects of variability. See Table 13 for a detailed comparison of the most commonly used normalization methods.

Normalization for Sequencing Depth Differences

The size factor method in the R package DESeq2 is commonly used to make read count distributions comparable across libraries, as shown in Figure 14. Raw read counts from featureCounts are read into R and then normalized to adjust for differences in sequencing depth.
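
The idea behind size factors can be sketched in base R as a median-of-ratios calculation. This is an illustration on made-up counts, not the package's own code: each gene's count is compared to its geometric mean across samples, and the median of those ratios within a sample is that sample's size factor.

```r
# Toy count matrix (hypothetical values): rows are genes, columns are samples.
counts <- matrix(
  c(100, 200,
     50, 100,
     10,  20,
      0,   5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), c("sample1", "sample2"))
)

# Log geometric mean of each gene across samples.
# Genes with a zero in any sample give -Inf and are skipped below.
log_geo_means <- rowMeans(log(counts))
usable <- is.finite(log_geo_means)

# Size factor per sample: median ratio of its counts to the gene-wise
# geometric means, computed on the log scale over the usable genes.
size_factors <- apply(counts, 2, function(sample_counts) {
  exp(median((log(sample_counts) - log_geo_means)[usable]))
})

# Normalized counts: divide each column by its size factor.
normalized <- sweep(counts, 2, size_factors, FUN = "/")

print(size_factors)
print(normalized)
```

Using the median rather than the total makes the estimate robust to a handful of very highly expressed genes, which would otherwise dominate a simple library-size correction.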

DESeq2’s Specialized Data Set Object

DESeq2 organizes experiment data in a specialized object called DESeqDataSet, an extension of the SummarizedExperiment class. This object contains all relevant experimental data in a structured way:

  • colData: A data.frame holding information about each sample, such as experimental conditions, sequencing type, and date. The row.names in colData should correspond to unique sample names.
  • rowRanges: Stores details about each gene, such as chromosome, start and end positions, strand, and gene ID.
  • assay: Contains a matrix of values for each gene and sample. In DESeq2, this typically represents countData.
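
Before building the object, it helps to prepare the count matrix and the sample table in base R and check that they line up. The sketch below uses made-up gene names, sample names, and counts; the only requirement it illustrates is that the row names of the sample table match the column names of the count matrix, in the same order.

```r
# Made-up counts: rows are genes, columns are samples.
count_data <- matrix(
  c(120, 98, 310, 280,
     15, 22,  18,  25,
      0,  3,  45,  51),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"),
                  c("ctrl_1", "ctrl_2", "trt_1", "trt_2"))
)

# Sample table: one row per sample; row names are the unique sample names.
col_data <- data.frame(
  condition = factor(c("control", "control", "treated", "treated")),
  row.names = c("ctrl_1", "ctrl_2", "trt_1", "trt_2")
)

# Sanity checks: sample names are unique and in the same order in both objects.
stopifnot(!anyDuplicated(rownames(col_data)))
stopifnot(identical(rownames(col_data), colnames(count_data)))

print(col_data)
```

A mismatch in sample order is easy to miss and silently assigns the wrong condition to a sample, so an explicit check like this is worth keeping in any analysis script.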

Example Workflow:

  1. Import raw read counts from featureCounts.
  2. Load counts into R as a DESeqDataSet object.
  3. Apply the size factor normalization to adjust for sequencing depth differences.
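
The three steps above can be sketched as follows. The input is a tiny made-up file written to a temporary location in the featureCounts layout (a comment line, then Geneid, Chr, Start, End, Strand, Length, and one count column per sample). In real featureCounts output the count columns are usually named after the BAM files, so the simplified sample names here are an assumption. The DESeq2 calls run only if the package is installed.

```r
# Toy featureCounts-style file (hypothetical genes and counts).
fc_file <- tempfile(fileext = ".txt")
writeLines(c(
  "# Program:featureCounts (toy example)",
  "Geneid\tChr\tStart\tEnd\tStrand\tLength\tctrl_1\tctrl_2\ttrt_1\ttrt_2",
  "geneA\tchr1\t100\t900\t+\t801\t120\t98\t310\t280",
  "geneB\tchr1\t2000\t2500\t-\t501\t15\t22\t18\t25",
  "geneC\tchr2\t300\t1200\t+\t901\t40\t35\t90\t102",
  "geneD\tchr2\t5000\t5600\t+\t601\t0\t3\t45\t51"
), fc_file)

# Step 1: import the raw counts, dropping the five annotation columns.
fc <- read.delim(fc_file, comment.char = "#", row.names = 1)
count_data <- as.matrix(fc[, -(1:5)])

# Sample table matching the count columns (conditions are made up).
col_data <- data.frame(
  condition = factor(c("control", "control", "treated", "treated")),
  row.names = colnames(count_data)
)

if (requireNamespace("DESeq2", quietly = TRUE)) {
  suppressPackageStartupMessages(library(DESeq2))

  # Step 2: build the DESeqDataSet from the matrix and the sample table.
  dds <- DESeqDataSetFromMatrix(countData = count_data,
                                colData = col_data,
                                design = ~ condition)

  # Step 3: estimate size factors and extract depth-normalized counts.
  dds <- estimateSizeFactors(dds)
  print(sizeFactors(dds))
  norm_counts <- counts(dds, normalized = TRUE)
  print(norm_counts)
}
```

The design formula `~ condition` is not used by the normalization itself, but it is stored in the object and is needed later for the differential expression step.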

This structured approach in DESeq2 simplifies the management and normalization of RNA-seq data, allowing for efficient and accurate downstream analysis.


Reference

Resources

For questions or further guidance, feel free to reach out!