-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathsm8_gsa.Rmd
861 lines (641 loc) · 33 KB
/
sm8_gsa.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
---
title: "STAT 540 - Seminar 8: Gene Set Enrichment Analysis"
date: "Last update: `r format(Sys.time(), '%d %B, %Y')`"
output:
github_document
---
```{r include=FALSE}
knitr::opts_chunk$set(size="scriptsize", fig.align="center")
knitr::opts_chunk$set(message=FALSE, warning=FALSE)
```
## Attributions
This seminar was developed by Yongjin Park and modified by Keegan Korthauer
## Learning Objectives
1. Testing the over-representation of gene sets for your list of top significant genes: `goseq`
2. Rank-based Gene Set Enrichment Analysis based on your list of gene-level scores: `fgsea`
## Packages required
Please make sure you are able to load the following libraries before starting the seminar
```{r load, echo=TRUE, results='hide', message=FALSE, warning=FALSE}
library(tidyverse) # our good friend
library(patchwork)
library(data.table) # very handy with large tables
library(R.utils) # need this to handle gzipped data
# Please make sure you have biomaRt installed but don't load it yet
# biomaRt creates namespace conflicts with dplyr. So we just want load
# it when required
# library(biomaRt) # annotate gene information
# The following libraries can be installed using Bioconductor:
library(msigdbr) # curated source of gene sets and functions
library(goseq) # tool for GO term enrichment analysis
library(fgsea) # tool for gene set enrichment analysis
## `goseq` and `fgseq` can be installed by bioconductor package manager:
## if (!require("BiocManager", quietly = TRUE))
## install.packages("BiocManager")
## BiocManager::install("goseq")
## BiocManager::install("fgsea")
theme_set(theme_bw()) # classic theme ggplot
```
*Remark on a technical issue (Ubuntu/Debian Linux only)*: You may not be able to install everything from the scratch depending on your machine/platform. If you're using `Ubuntu/Debian` Linux, you might want to install some dependent `R` packages via `apt-get`:
```{sh eval=FALSE}
# for example, to install R package "BiasedUrn" from cran
sudo apt install r-cran-biasedurn
# for example, to install R package "geneLenDataBase" from Bioconductor
sudo apt install r-bioc-genelendatabase
```
*Remark on a technical issue*: You need `C++` compiler installed in your machine if you want to use `fgsea` with best performance.
```{r results='hide', message=FALSE, warning=FALSE, echo=FALSE}
###########################
## some helper functions ##
###########################
#' Convert a number into a scientifically-formatted string
#' @param x number
num.sci <- function(x) {
format(x, scientific=TRUE, digits = 2)
}
#' Convert a number into the comma-separated one
#' @param x number
num.int <- function(x) {
format(x, big.mark = ",")
}
#' Run code if we don't have a `.file`
#' @param .file result file
#' @param .code code to run
run.if.needed <- function(.file, .code) {
if(!file.exists(.file)) { .code }
stopifnot(file.exists(.file))
}
```
Outline:
- [Recap: Differential Expression analysis](#recap-Differential-Expression-Analysis)
- [Part 1: When we want to draw lines between the significant and the insignificant](#part-1-gene-set-analysis-testing-over-representation)
- [Part 2: When we want to use gene-level scores to rank all genes](#part-2-rank-based-gene-set-enrichment-analysis)
- [Exercises](#deliverables)
# Recap Differential Expression Analysis
## Run Differential Expression Gene analysis
Let's use the same data used in [`seminar-05`](https://github.com/STAT540-UBC/seminar-05/blob/main/sm5_differential_expression_analysis.md). This code is just repeating some code from `seminar-05` for convenience, where we'll fit the model with additive factors for genotype and developmental stage. We'll pull out the test for every gene for the genotype (NrlKO vs WT) coefficient which we'll use to illustrate several gene set analyses throughout the seminar. We'll also save the results of testing for any effect of developmental stage in the same model, and save the results (to be loaded for the deliverable).
```{r}
deg.genotype.file <- "data/DEG_genotype.rds"
deg.stage.file <- "data/DEG_stage.rds"
run.if.needed(deg.genotype.file, {
eset.file <- "data/GSE4051.rds"
run.if.needed(eset.file,{
eset <- GEOquery::getGEO("GSE4051", getGPL = FALSE)[[1]]
saveRDS(eset, file=eset.file)
})
eset <- readRDS(eset.file)
meta.data <-
Biobase::pData(eset) %>%
mutate(sample_id = geo_accession) %>%
mutate(dev_stage = case_when(
grepl("E16", title) ~ "E16",
grepl("P2", title) ~ "P02",
grepl("P6", title) ~ "P06",
grepl("P10", title) ~ "P10",
grepl("4 weeks", title) ~ "P28"
)) %>%
mutate(genotype = case_when(
grepl("Nrl-ko", title) ~ "NrlKO",
grepl("wt", title) ~ "WT"
)) %>%
dplyr::select(sample_id, genotype, dev_stage)
expr.mat <- Biobase::exprs(eset)
design.mat <- model.matrix( ~ genotype + dev_stage, meta.data)
DEG.stat.dt <- limma::lmFit(expr.mat, design.mat) %>%
limma::eBayes() %>%
limma::topTable(coef = c("dev_stageP02", "dev_stageP06", "dev_stageP10", "dev_stageP28"), number = Inf) %>%
(function(x) { x %>% mutate(probe = rownames(x)) }) %>%
as.data.table
saveRDS(DEG.stat.dt, deg.stage.file)
DEG.genotype.stat.dt <- limma::lmFit(expr.mat, design.mat) %>%
limma::eBayes() %>%
limma::topTable(coef = "genotypeWT", number = Inf, sort.by = "p") %>%
(function(x) { x %>% mutate(probe = rownames(x)) }) %>%
as.data.table
saveRDS(DEG.genotype.stat.dt , deg.genotype.file)
})
DEG.stat.dt <- readRDS(deg.genotype.file)
```
As a result, we have:
```{r}
head(DEG.stat.dt)
```
## We want to convert the probe names to gene symbols used in gene set databases
Sometimes, feature annotation information is not included in publicly available data. In that case, you must manually map these to gene symbols or ENSEMBL IDs commonly used in pathway annotation databases. We will use `biomaRt` package to construct this map.
```{r}
probe.info.file <- "data/mouse_human_map.rds"
```
Although running `biomaRt` (or a similar kind) is essential in many researches (especially in cross-species analysis), we will not cover the details in this seminar. However, you are welcomed to explore the source code as much as you like [`here`](https://github.com/STAT540-UBC/seminar-05/blob/main/sm7_gsa_network.Rmd). We share `.rds` file for a quick data access.
```{r }
run.if.needed(probe.info.file, {
## Will use biomaRt that can access ENSEMBL database
## and make query to retrive what you need
library(biomaRt)
## Loading biomaRt library can mess up dplyr namespace
## not sure about the current version of R and the package
## 0. Make an access point to mouse and human databases
human.db <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl",
host = "https://dec2021.archive.ensembl.org/")
mouse.db <- biomaRt::useMart("ensembl", dataset = "mmusculus_gene_ensembl",
host = "https://dec2021.archive.ensembl.org/")
## 1. Link mouse and human databases
## This may take sometime depending on your internet speed
mouse.human.map <-
biomaRt::getLDS(
attributes = c("affy_mouse430_2","mgi_symbol"),
filters = "affy_mouse430_2",
values = unique(DEG.stat.dt$`probe`),
mart = mouse.db,
attributesL = c("hgnc_symbol","ensembl_gene_id"),
martL = human.db,
uniqueRows = TRUE,
bmHeader=FALSE) %>%
as.data.table() %>%
unique()
## 2. Take some attributes for human genes
## Read some human gene-specific information
## This will also take a while...
human.prop <- biomaRt::getBM(attributes = c("ensembl_gene_id",
"chromosome_name",
"transcription_start_site",
"transcript_length"),
filters = "ensembl_gene_id",
values = unique(mouse.human.map$`ensembl_gene_id`),
mart = human.db,
bmHeader=FALSE,
useCache=FALSE) %>%
as.data.table() %>%
unique()
## 3. Match them up
.map <- mouse.human.map %>%
left_join(human.prop) %>%
as.data.table() %>%
unique() %>%
dplyr::rename(probe = affy_mouse430_2) %>% # Change the var. name
dplyr::rename(chr = chromosome_name) %>% # shorten the chr name
dplyr::rename(gene_symbol = hgnc_symbol) %>% # Will use human gene symbol
as.data.table()
saveRDS(.map, probe.info.file)
})
```
```{r}
probe.info.map <-
readRDS(probe.info.file) # read RDS
```
It looks like this:
```{r}
probe.info.map %>%
head() %>%
knitr::kable()
```
We could have multiple transcripts within a gene, so let's take the longest one:
```{r}
n0 <- nrow(probe.info.map) # save how many rows before filtering
probe.info.map <-
probe.info.map[order(probe.info.map$transcript_length, decreasing = TRUE),
head(.SD, 1),
by = .(probe, gene_symbol, chr)]
```
The same map, but a shorter list: `r num.int(n0)` to `r num.int(nrow(probe.info.map))`.
```{r}
probe.info.map %>%
head() %>%
knitr::kable()
```
When we match the probe names with the mouse and human gene symbols, we can directly use this DEG table for downstream enrichment analysis of gene sets annotated by human gene symbols.
```{r}
DEG.stat.dt %>%
left_join(probe.info.map) %>%
dplyr::select(probe, gene_symbol, P.Value, adj.P.Val) %>%
na.omit() %>%
head(10) %>%
mutate(P.Value = num.sci(P.Value)) %>%
mutate(adj.P.Val = num.sci(adj.P.Val)) %>%
knitr::kable()
# num.sci is a convenient function defined in the seminar's Rmd file to convert long numbers into scientific notation.
```
- **Why human gene names for the mouse study?** We would like to do some exploration to link the genes found in this study to human genes that have been found to harbour genetic variation that is associated with certain phenotypes (including disease) - notably, Genome-wide Association studies. Note that gene ontologies and pathways are also well-annotated for mouse, so this step is not required (i.e. we could instead perform gene set analysis on the mouse genes, and indeed this type of analysis is very common).
- **Why do some Affymetrix probes correspond to multiple human genes?** The mapping from probe to mouse gene is not necessarily 1:1, and neither is the mapping from a mouse gene to a human gene.
# How/where to obtain biologically meaningful gene sets
Here, we will use the Molecular Signature Database (MSigDB) database and GWAS catalogue, but there are many ways you can obtain gene sets. Traditionally, Gene Ontology (GO) terms have been used as a go-to database for gene sets. However, the enrichment of GO terms might be slightly different from what we want to achieve in a set-based analysis. Unlike pathway annotations and GWAS catalogue, GO terms have hierarchical relationships (or directed acyclic graph) with one another, which we would need to consider in calibrating the null distribution.
## MSigDB: Molecular Signature Database
[MSigDB](http://www.gsea-msigdb.org/gsea/msigdb/) provides a comprehensive list of gene sets and pathways manually curated by experts or derived from previous experimental results, including GO terms. We do not need to download them one by one since someone made a convenient package to retrieve a current version of MSigDB into R. You can download the raw text files from the website, too.
**Note**: We will interchangeably use a pathway and a gene set since we do not deal with gene-gene interactions within a pathway.
There are many species:
```{r msigdb_show_species}
knitr::kable(msigdbr::msigdbr_species())
```
There are many collections available:
```{r}
knitr::kable(msigdbr::msigdbr_collections())
```
We will focus on the [Kyoto Encyclopedia of Genes and Genomes (KEGG)](https://www.genome.jp/kegg/) pathways named by human gene symbols:
```{r}
KEGG.human.db <- msigdbr::msigdbr(species = "human",
category = "C2",
subcategory = "CP:KEGG")
```
## GWAS catalog
We can download public data mapping SNPs to associated diseases/phenotypes in large human cohort studies from [the NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/). The GWAS catalog file is already processed and stored in this repository for your convenience, but you're welcome to try out the code below which generates this file.
```{r download_full_gwas_catalog}
gwas.tidy.file <- "data/gwas_catalog_tidy.tsv.gz"
## make it tidy by taking only what we need
run.if.needed(gwas.tidy.file, {
## It make take some time
gwas.file <- "data/gwas_catalog_v1.0-associations_e105_r2022-02-02.tsv.gz"
run.if.needed(gwas.file, {
url <- "https://www.ebi.ac.uk/gwas/api/search/downloads/full"
.file <- str_remove(gwas.file, ".gz$")
download.file(url, destfile = .file, timeout = 600) # this step can take several minutes
gzip(.file)
})
.dt <-
fread(gwas.file, sep="\t", quote="") %>%
dplyr::select(`MAPPED_GENE`, `DISEASE/TRAIT`, `PVALUE_MLOG`)
## remove redundant associations
.dt <- .dt[order(.dt$PVALUE_MLOG, decreasing = TRUE),
head(.SD, 1),
by = .(`MAPPED_GENE`, `DISEASE/TRAIT`)]
## remove traits with too few associations
.count <- .dt[, .(.N), by = .(`DISEASE/TRAIT`)]
.dt <- left_join(.count[`N` >= 100, ], .dt)[nchar(`MAPPED_GENE`)> 0,]
## simply split gene lists and unlist
.dt <- .dt[,
.(gene_symbol = unlist(strsplit(`MAPPED_GENE`, split="[ ,.-]+"))),
by = .(`DISEASE/TRAIT`, PVALUE_MLOG)]
.dt[, p.value := 10^(-PVALUE_MLOG)]
fwrite(.dt, file=gwas.tidy.file)
file.remove(gwas.file) # remove large file
})
```
```{r}
gwas.db <- fread(gwas.tidy.file) # fread to read gzipped txt file
gwas.db[, gs_name := `DISEASE/TRAIT`] # will use gs_name for a gene set name
# The := annotation is specific to data.table, in this case we will create a column called 'gs_name' which will reference to the column "DISEASE/TRAIT"
```
- *Note*: `fread`, `strsplit`, and `by=.()` operations are usually much faster than the `tidyverse` counterparts--`read_tsv`, `separate`, and `group_by()`. The GWAS catalog is pretty big; the number of rows after pruning can increase to `r format(nrow(gwas.db), big.mark = ",")` gene-level associations.
For instance, we can take GWAS genes associated with Alzheimer's disease-related disorders:
```{r results="asis"}
gwas.db[str_detect(`DISEASE/TRAIT`, "[Aa]lzheimer") & !is.na(gene_symbol)] %>%
head() %>%
mutate(p.value = num.sci(p.value)) %>%
dplyr::select(`gs_name`, `gene_symbol`, `p.value`, `PVALUE_MLOG`) %>%
knitr::kable()
```
- *Note*: `PVALUE_MLOG`: -log10(p-value).
Lets do a quick recap before moving on the interesting part; what have we done so far?
1. Standardized gene nomenclature in results from [`seminar-05`](https://github.com/STAT540-UBC/seminar-05/blob/main/sm5_differential_expression_analysis.md) and matched probes to human gene symbols.
2. Loaded MSigDB information for human KEGG pathways.
3. Obtained known GWAS information about genes associated with disease/phenotype.
# Part 1 Gene set analysis testing over-representation
### Hypergeometric test assumes a theoretical null distribution
#### What are the over-represented KEGG pathways?
How do they look like? For instance, let's take a look at some pathways (`gs_name`).
```{r fig_overrep_kegg_pathways, fig.width=8, fig.height=3}
.gs <- c("KEGG_GLYCOSAMINOGLYCAN_BIOSYNTHESIS_KERATAN_SULFATE",
"KEGG_NOD_LIKE_RECEPTOR_SIGNALING_PATHWAY",
"KEGG_RIBOSOME")
CUTOFF <- 1e-2 # q-value cutoff
.dt <-
KEGG.human.db %>%
filter(`gs_name` %in% .gs) %>%
left_join(probe.info.map) %>%
left_join(DEG.stat.dt) %>%
group_by(gs_name) %>% # for each gene set
arrange(adj.P.Val) %>% # sort genes by p-value
mutate(g = 1:n()) %>% # add gene order index
ungroup() %>%
filter(!is.na(adj.P.Val))
ggplot(.dt, aes(g, -log10(adj.P.Val), colour=`gs_name`)) +
geom_hline(yintercept = -log10(CUTOFF), colour = "red", lty = 2) +
geom_point() + xlab("genes sorted by p-value in each pathway") +
scale_y_continuous("adjusted p-value",
labels = function(x) num.sci(10^(-x)))
```
- Which ones are significant?
#### Remember that we learned about the hypergeometric test in the class!
What are the counts? Let's match gene counts with the black/white ball analogy used in `R`'s `phyper` manual page.
- `q`: the number of genes in the pathway (**white** balls) overlapping with the genes in the DEG list (balls **drawn**)
- `m`: the number of genes in this pathay (**white** balls)
- `n`: the number of genes not in this pathway (**black** balls)
- `k`: the number of genes in the DEG list (balls **drawn**)
Under the null hypothesis:
$$H_{0} : q \le q^{\star}$$
We may observe $q^{\star}$ (out of $k$) genes overlapping with a gene set of interest by random sampling of $k$ genes **without** replacement.
Therefore, we can calculate the p-value:
$$P(q > q^{\star}\|n, m, k) = 1 - \sum_{q = 0}^{q^{\star}} {m \choose q } {n \choose k-q} / {n+m \choose k}$$
In `R`'s `phyper`, it's as simple as:
```{r eval=FALSE}
phyper(q, m, n, k, lower.tail = FALSE)
```
#### Let's test all the genes and pathways more systematically
```{r}
#' @param gene.dt gene-level statitsics with `gene_symbol` and `adj.P.Val`
#' @param geneset.dt gene set membership with `gene_symbol` and `gs_name`
run.hyper.test <- function(gene.dt, geneset.dt, cutoff = CUTOFF) {
.genes <- unique(gene.dt[, .(gene_symbol, adj.P.Val)]) %>%
na.omit()
.sets <- as.data.table(geneset.dt)
.sets <- .sets[gene_symbol %in% .genes$gene_symbol,
.(gene_symbol, gs_name)]
.dt <- left_join(.sets, .genes, by = "gene_symbol", relationship = "many-to-many") %>%
as.data.table()
## Total number of genes
ntot <- length(unique(.dt$gene_symbol))
## Total number of significant DEGs
nsig <- nrow(unique(.dt[adj.P.Val < cutoff, .(gene_symbol)]))
## Gene set size
gs.size <- .dt[,
.(m = length(unique(gene_symbol))),
by = .(gs_name)]
## Gene set overlap size
overlap.size <- .dt[adj.P.Val < cutoff,
.(q = length(unique(gene_symbol))),
by = .(gs_name)]
left_join(gs.size, overlap.size, by = "gs_name") %>%
mutate(`q` = if_else(is.na(`q`), 0, as.numeric(`q`))) %>%
mutate(n = `ntot` - `m`) %>%
mutate(k = `nsig`) %>%
mutate(p.val = phyper(`q`, `m`, `n`, `k`, lower.tail=FALSE)) %>%
arrange(p.val) %>%
as.data.table
}
```
```{r}
deg.dt <- DEG.stat.dt[, .(probe, adj.P.Val)] %>%
left_join(probe.info.map, by = "probe") %>%
as.data.table()
```
#### KEGG pathway enrichment
```{r}
hyper.kegg.dt <- run.hyper.test(deg.dt,
KEGG.human.db,
cutoff=1e-2)
```
```{r echo=FALSE}
knitr::kable(head(hyper.kegg.dt, 10))
```
#### GWAS catalogue enrichment
```{r}
hyper.gwas.dt <- run.hyper.test(deg.dt,
gwas.db,
cutoff=1e-2)
```
```{r echo=FALSE}
knitr::kable(head(hyper.gwas.dt, 10))
```
For the DEGs associated with Nrl knockout, some of them make sense! For example, keratan sulfate seems to be important for the cornea, and Refractive error is a type of vision problem that makes it hard to see clearly. What are your interpretations? But, let's take further precautions.
### Gene set analysis based on empirical null distribution
#### Why do we need another GSA method?
We decided to use a hypergeometric test, we implicitly take the assumption of *a uniform sampling* of balls in urns *without replacement*. Well, the "without replacement" part makes sense because we don't draw a DEG more than once in our analysis. However, the "uniform sampling" part may not hold in practice considering that some balls can be bigger than the other. Those bigger balls can be colored differently.
In the gene set terminology, we need to ask the following questions:
- Are all the genes equally distributed in the gene sets?
- Did our DEG analysis tend to hit genes and pathways "uniformly?"
Let's use all the canonical pathways in the MSig database to demonstrate a potential bias.
```{r}
C2.human.db <- msigdbr::msigdbr(species = "human", category = "C2")
```
Mesh up with our DEG list:
```{r}
gene.level.stat <-
DEG.stat.dt %>%
left_join(probe.info.map, by = "probe") %>%
left_join(C2.human.db, by = "gene_symbol", relationship = "many-to-many") %>%
dplyr::select(gene_symbol, gs_name, P.Value, transcript_length) %>%
filter(!is.na(gene_symbol)) %>%
as.data.table() %>%
(function(.dt){
.dt[, .(transcript_length = max(transcript_length),
P.Value = min(P.Value),
num.gs = length(unique(gs_name))),
by = .(gene_symbol)]
})
```
Let's visualize them in regular intervals.
```{r}
gene.level.stat[, num.gs.tick := round(num.gs/10)*10]
.dt <- gene.level.stat[,
.(mean.length = mean(transcript_length),
sd.length = sd(transcript_length),
mean.pval = mean(-log10(P.Value)),
sd.pval = mean(-log10(P.Value))),
by = .(num.gs.tick)]
```
What do you think?
```{r fig_gene_level_bias, fig.width=8, fig.height=4, results="hide"}
.aes <- aes(num.gs.tick, mean.pval,
ymin=pmax(mean.pval - sd.pval, 0),
ymax=mean.pval + sd.pval)
p1 <-
ggplot(gene.level.stat, aes(num.gs, -log10(P.Value))) +
geom_linerange(.aes, data = .dt, linewidth=.5) +
geom_point(.aes, data=.dt, pch=21, fill = "white", size = 2) +
geom_smooth(method="lm", colour="blue") +
xlab("Number of participating sets per gene")
.aes <- aes(num.gs.tick, mean.length,
ymin=pmax(mean.length - sd.length, 0),
ymax=mean.length + sd.length)
p2 <-
ggplot(gene.level.stat, aes(num.gs, transcript_length)) +
geom_linerange(.aes, data = .dt, linewidth=.5) +
geom_point(.aes, data=.dt, pch=21, fill = "white", size = 2) +
geom_smooth(se=FALSE, method="lm", colour="red") +
xlab("Number of participating sets per gene")
p1 | p2
```
Anecdotally, long genes tend to participate in multiple functions as a result of an evolutionary process. They can be more recent, a constituent of multiple gene duplicates relating to the brain or mammalian-specific functions, such as neurons and energy consumption.
You might want to read this paper: [Gene Size Matters: An Analysis of Gene Length in the Human Genome](https://www.frontiersin.org/articles/10.3389/fgene.2021.559998/full)
### We can handle such a gene-level bias by calibrating a better null model before enrichment analysis
We will use [`goseq`](https://bioconductor.org/packages/release/bioc/html/goseq.html) method--[Gene ontology analysis for RNA-seq: accounting for selection bias](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-2-r14). Even though the method was initially developed for GO term enrichment analysis, it can handle any other types of gene sets.
### Run `goseq` analysis to take into account gene-level biases
Prepare our DEG list for the `goseq` analysis
```{r}
deg.vec <- DEG.stat.dt %>%
left_join(probe.info.map) %>%
filter(!is.na(ensembl_gene_id)) %>%
arrange(P.Value) %>%
as.data.table %>%
(function(.dt) { .dt[, head(.SD,1), by = .(ensembl_gene_id)] }) %>%
mutate(v = as.integer(adj.P.Val < CUTOFF)) %>%
(function(.dt) { v <- .dt$v; names(v) <- .dt$ensembl_gene_id; v })
```
It will look like this:
```{r}
head(deg.vec)
```
or this:
```{r}
tail(deg.vec)
```
The `goseq` method addresses a potential gene-level bias by estimating gene-level weights (a null distribution of gene lengths) in our DEG study. The following is an excerpt from the `goseq` vignette:
> "We first need to obtain a weighting for each gene, depending on its length, given by the PWF. As you may have noticed when running supportedGenomes or supportedGeneIDs, length data is available in the local database for our gene ID, "ensGene" and our genome, "hg19". We will let goseq automatically fetch this data from its databases."
```{r fig_goseq_nullp, fig.width=7, fig.height=5}
pwf <- goseq::nullp(deg.vec,"hg19","ensGene")
```
Using this null PWF object, we can simply run tests for all the GO terms. As the name suggests, we do not need to prepare anything else for GO enrichment analysis. To save some time, the results are provided for you, but you're free to run them yourself by commenting out the `run.if.needed` function.
```{r}
goseq.results <- goseq::goseq(pwf,"hg19","ensGene")
```
What do you think?
```{r}
head(goseq.results, 10) %>%
mutate(over_represented_pvalue = num.sci(over_represented_pvalue)) %>%
knitr::kable()
```
How about other custom gene sets, e.g., KEGG pathways or all the canonical pathways? Let's take a look at the following excerpt in the manual page.
Usage:
goseq(pwf, genome, id, gene2cat = NULL,
test.cats=c("GO:CC", "GO:BP", "GO:MF"),
method = "Wallenius", repcnt = 2000, use_genes_without_cat=FALSE)
Arguments:
pwf: An object containing gene names, DE calls, the probability
weighting function. Usually generated by nullp.
genome: A string identifying the genome that ‘genes’ refer to. For a
list of supported organisms run supportedGenomes.
id: A string identifying the gene identifier used by ‘genes’.
For a list of supported gene IDs run supportedGeneIDs.
gene2cat: A data frame with two columns containing the mapping between
genes and the categories of interest.
We can build a custom `gene2cat` `data.frame`.
```{r}
canonical.gene2cat <- C2.human.db %>%
dplyr::select(ensembl_gene, gs_name) %>%
distinct() %>%
as.data.frame()
canonical.goseq.results <- goseq::goseq(pwf, "hg19", gene2cat = canonical.gene2cat)
```
```{r}
head(canonical.goseq.results, 10) %>%
mutate(over_represented_pvalue = num.sci(over_represented_pvalue)) %>%
knitr::kable()
```
Do you see any pathways or GO categories here that look like they could be retina-related?
# Part 2 Rank-based Gene Set Enrichment Analysis
Let's take a step back. What is the premise of (discrete) gene set analysis? In our DEG analysis, we discovered `r nrow(DEG.stat.dt[adj.P.Val < CUTOFF])` probes/genes (of a total of `r nrow(DEG.stat.dt)` probes) significantly different after Nrl knockout. It can be tricky to determine when to draw the line between the significant and the insignificant genes.
```{r fig_fdr_line, fig.width=7, fig.height=3}
ggplot(DEG.stat.dt, aes(1:nrow(DEG.stat.dt), -log10(adj.P.Val))) +
geom_point(stroke=0) +
xlab("probes") +
geom_hline(yintercept = 1, lty = 2, colour="red") +
geom_hline(yintercept = 2, lty = 2, colour="green") +
geom_hline(yintercept = 4, lty = 2, colour="blue") +
scale_y_continuous("adjusted p-value",
breaks = c(1, 2, 4),
labels=function(x) num.sci(10^(-x)))
```
The original Gene Set Enrichment Analysis method is not so scalable and gunky in permutation steps. To calibrate low P-values, e.g., 1e-6, you would literally need to sample a million permutations for all the gene sets (with some exaggeration). The authors of `fgsea` come up with smart computation tricks and approximate p-value calculation quite accurately. You can check out this preprint: <https://www.biorxiv.org/content/10.1101/060012v3>
## Step 1. Prepare your gene sets to feed them in to `fgsea` arguments
`fgsea` takes a list of lists of genes. We can construct them by looping through gene sets, i.e., `for`, but that's not `R`'s way and would be pretty slow. Here is a helper function to do that for us.
```{r}
make.gs.lol <- function(.dt) {
.dt <- as.data.table(.dt) %>% unique()
.list <-
.dt[, .(gene = .(gene_symbol)), by = .(gs_name)] %>%
as.list()
.names <- .list$gs_name
.ret <- .list$gene
names(.ret) <- .names
return(.ret)
}
```
For instance, we can convert KEGG pathway `tibble` in this way:
```{r}
KEGG.lol <- KEGG.human.db %>% dplyr::select(gene_symbol, gs_name) %>% make.gs.lol()
```
In case you were wondering about the shape, it looks like this:
```{r}
KEGG.lol[1:2]
```
We will also need a named vector of gene-level scores. Here, we use `-log10(P.Value)` as our gene-level score:
```{r}
deg.scores <- DEG.stat.dt %>%
left_join(probe.info.map) %>%
filter(!is.na(ensembl_gene_id)) %>%
arrange(P.Value) %>%
as.data.table %>%
# Here .SD is a reference to the data itself. So in this case we are
# taking the most significant result from every gene symbol
(function(.dt) { .dt[, head(.SD,1), by = .(gene_symbol)] }) %>%
# Convert the adjusted pvalue to a score
mutate(v = -log10(P.Value)) %>%
# Reduce the data frame to a named vector of scores
(function(.dt) { v <- .dt$v; names(v) <- .dt$gene_symbol; v })
```
## Step 2. Run `fgsea` and interpreter the results
Now we have all the ingredients.
```{r}
kegg.fgsea <- fgsea::fgsea(pathways = KEGG.lol, stats = deg.scores, scoreType = "pos")
```
Show top 3 genes for each pathway:
```{r}
# Here, we are "pasting" the first three genes in the leadingEdge column
kegg.fgsea[,
topGenes := paste0(head(unlist(`leadingEdge`), 3), collapse=", "),
by = .(pathway)]
```
Do we see the same results?
```{r}
kegg.fgsea %>%
arrange(pval) %>%
head(10) %>%
dplyr::select(-leadingEdge) %>%
knitr::kable()
```
Note that [NOD-like receptors have been implicated in diabetic retinopathy](https://pubmed.ncbi.nlm.nih.gov/32019187/).
We can do the same thing for the GWAS catalog.
```{r}
gwas.lol <- gwas.db %>% dplyr::select(gene_symbol, gs_name) %>% make.gs.lol()
gwas.fgsea <- fgsea::fgsea(pathways = gwas.lol, stats = deg.scores, scoreType = "pos")
gwas.fgsea[,
topGenes := paste0(head(unlist(`leadingEdge`), 3), collapse=", "),
by = .(pathway)]
gwas.fgsea %>%
arrange(pval) %>%
head(10) %>%
dplyr::select(-leadingEdge) %>%
knitr::kable()
```
# Deliverables (2pt)
Run the discrete version of gene set analysis for the genes significantly associated with the effect of developmental stage (DEG results provided for your convenience). Show your results as the top 10 KEGG pathways or GWAS disease/traits.
First, we'll prepare input: the provided code will fetch DEGs which are associated with the interaction effect.
```{r}
# No need to modify anything in this code chunk
# Loading precomputed limma DEG results for any effect of developmental stage
DEG.devstage.stat.dt <- readRDS(deg.stage.file)
head(DEG.devstage.stat.dt)
# join the results with the variable `probe.info.map`
deg.stage.dt <- DEG.devstage.stat.dt[, .(probe, adj.P.Val)] %>%
left_join(probe.info.map, by = "probe") %>%
as.data.table()
```
- (1 pt) Run hypergeometric tests for `KEGG.human.db` and `gwas.db` we created above using the `run.hyper.test` function we defined earlier. Print the first 10 rows of the results for each.
```{r}
## your code here
```
Next we'll run a `goseq` analysis. First, we need to create the null distribution; this code is provided for you.
```{r}
## creation of the null dist (no need to modify anything here for the deliverable)
deg.stage.vec <- DEG.devstage.stat.dt %>%
left_join(probe.info.map) %>%
filter(!is.na(ensembl_gene_id)) %>%
arrange(P.Value) %>%
as.data.table %>%
(function(.dt) { .dt[, head(.SD,1), by = .(ensembl_gene_id)] }) %>%
mutate(v = as.integer(adj.P.Val < CUTOFF)) %>%
(function(.dt) { v <- .dt$v; names(v) <- .dt$ensembl_gene_id; v })
pwf_stage <- goseq::nullp(deg.stage.vec,"hg19","ensGene")
```
- (1 pt) Run `goseq` analysis using the object with the null distribution created above. Print out the first 10 rows of the results.
```{r}
## your code here
```
# Optional exercise (not marked)
- Run `fgsea` or [`GSEABase`](https://bioconductor.org/packages/release/bioc/html/GSEABase.html) for the same DEGs testing the effect of developmental stage (0.5 pt). What will be your choice of gene-level scores?
For reproducibility, please set the parameter `nproc = 1` from `fgsea()` function and don't change the seed in the following chunk:
```{r}
set.seed(123)
```
*Hint: Remember to use list of lists (lol) for one of the inputs.*
```{r}
## your code here
```
- **Discussion**: Can we handle gene-level bias in `fgsea` or other GSEA methods? Justify your answer and suggest possible improvement/correction ideas (0.5 pt).