Genesets are simply named lists of character vectors, which can be passed directly to hypeR(). Alternatively, one can pass a gsets object, which retains the name and version of the genesets used. This versioning is included when exporting results or generating reports, which helps keep your results reproducible.
genesets <- list("GSET1" = c("GENE1", "GENE2", "GENE3"),
                 "GSET2" = c("GENE4", "GENE6"),
                 "GSET3" = c("GENE7", "GENE8", "GENE9"))
Creating a gsets object is easy…
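A minimal sketch of that step follows. It assumes hypeR exports an R6 class named gsets whose constructor takes the list plus name and version arguments; is_valid_genesets is a hypothetical helper added here for illustration only, and the hypeR call is skipped when the package is not installed.

```r
# Hypothetical helper (not part of hypeR): TRUE if x is a non-empty,
# uniquely named list of character vectors.
is_valid_genesets <- function(x) {
  is.list(x) &&
    length(x) > 0 &&
    !is.null(names(x)) &&
    all(nzchar(names(x))) &&
    !anyDuplicated(names(x)) &&
    all(vapply(x, is.character, logical(1)))
}

genesets <- list("GSET1" = c("GENE1", "GENE2", "GENE3"),
                 "GSET2" = c("GENE4", "GENE6"),
                 "GSET3" = c("GENE7", "GENE8", "GENE9"))

# Fail early if the structure is not what hypeR() expects
stopifnot(is_valid_genesets(genesets))

# Assumed constructor signature: gsets$new(genesets, name, version)
if (requireNamespace("hypeR", quietly = TRUE)) {
  example_gsets <- hypeR::gsets$new(genesets,
                                    name = "Example Genesets",
                                    version = "v1.0")
  print(example_gsets)
}
```

Printing the object should give a summary like the one below; consult ?gsets in your installed version of hypeR to confirm the argument names.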
Example Genesets v1.0
GSET1 (3)
GSET2 (2)
GSET3 (3)
It can be passed directly to hypeR()…
hypeR(signature, genesets)
To aid in workflow efficiency, hypeR enables users to download genesets, wrapped as gsets objects, from multiple data sources.
Most researchers will find the genesets hosted by MSigDB adequate for geneset enrichment analysis. Various types of genesets are available across multiple species.
Please pay attention to the versioning: hypeR defaults to the version of msigdbr installed on your machine, and msigdbr releases follow the genesets curated by the Broad, which are updated frequently. Check to make sure you are using the genesets you expect.
If you want a specific version, you need to install that version of the dependency.
devtools::install_version("msigdbr", version="7.2.1", repos="http://cran.us.r-project.org")
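Before reinstalling, it can help to confirm which version is currently installed. The following is a minimal sketch using base R's packageVersion(); installed_version is a hypothetical helper introduced here, not a function from hypeR or msigdbr.

```r
# Hypothetical helper: return the installed version of a package as a
# string, or NA if the package is not installed.
installed_version <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

# Version of msigdbr that hypeR would pick up on this machine (NA if absent)
installed_version("msigdbr")
```

Recording this value alongside your results makes it easier to reproduce an analysis later with the same geneset release.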
Here we download the Hallmarks genesets…
HALLMARK <- msigdb_gsets(species="Homo sapiens", category="H")
print(HALLMARK)
H v7.2.1
HALLMARK_ADIPOGENESIS (200)
HALLMARK_ALLOGRAFT_REJECTION (200)
HALLMARK_ANDROGEN_RESPONSE (100)
HALLMARK_ANGIOGENESIS (36)
HALLMARK_APICAL_JUNCTION (200)
HALLMARK_APICAL_SURFACE (44)
We can also clean them up by removing the first leading common substring…
HALLMARK <- msigdb_gsets(species="Homo sapiens", category="H", clean=TRUE)
print(HALLMARK)
H v7.2.1
Adipogenesis (200)
Allograft Rejection (200)
Androgen Response (100)
Angiogenesis (36)
Apical Junction (200)
Apical Surface (44)
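To illustrate what this kind of cleaning does to the labels, here is a rough base R sketch. It is not hypeR's implementation: clean_names is a hypothetical helper, and it assumes the shared prefix is everything up to the first underscore, which holds for the Hallmark names shown above but may not hold for other collections.

```r
# Hypothetical helper (not hypeR's own code): strip the leading
# collection prefix, replace underscores with spaces, and title-case.
clean_names <- function(x) {
  x <- sub("^[^_]*_", "", x)   # drop everything up to the first underscore
  x <- gsub("_", " ", x)       # underscores become spaces
  x <- tolower(x)
  # Upper-case the first letter of every word
  gsub("\\b([a-z])", "\\U\\1", x, perl = TRUE)
}

raw <- c("HALLMARK_ADIPOGENESIS",
         "HALLMARK_ALLOGRAFT_REJECTION",
         "HALLMARK_APICAL_JUNCTION")
clean_names(raw)
```

Cleaned labels are easier to read in plots and reports, but the original identifiers are the ones to cite when referring back to the source database.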
The gsets object can be passed directly to hypeR()…
hypeR(signature, genesets=HALLMARK)
Other commonly used genesets include BioCarta, KEGG, and Reactome…
BIOCARTA <- msigdb_gsets(species="Homo sapiens", category="C2", subcategory="CP:BIOCARTA")
KEGG <- msigdb_gsets(species="Homo sapiens", category="C2", subcategory="CP:KEGG")
REACTOME <- msigdb_gsets(species="Homo sapiens", category="C2", subcategory="CP:REACTOME")
If the MSigDB genesets are not sufficient, hypeR provides another set of functions for downloading and loading other publicly available genesets. These functions interface with the publicly available libraries hosted by Enrichr.
available <- enrichr_available()
reactable::reactable(available)
ATLAS <- enrichr_gsets("Human_Gene_Atlas")
print(ATLAS)
Human_Gene_Atlas (Enrichr) Downloaded: 2021-06-30
Thyroid (388)
CD33+ Myeloid (679)
Testis (384)
Salivarygland (57)
CD14+ Monocytes (383)
TestisLeydigCell (107)
Note: These libraries do not have a systematic versioning scheme; however, the download date is recorded.
You can also download genesets for other species if you aren’t working with human or mouse genes!
yeast <- enrichr_gsets("GO_Biological_Process_2018", db="YeastEnrichr")
worm <- enrichr_gsets("GO_Biological_Process_2018", db="WormEnrichr")
fish <- enrichr_gsets("GO_Biological_Process_2018", db="FishEnrichr")
fly <- enrichr_gsets("GO_Biological_Process_2018", db="FlyEnrichr")
When dealing with hundreds of genesets, it’s often useful to understand the relationships between them. This allows researchers to summarize many enriched pathways as more general biological processes. To do this, we rely on curated relationships defined between them. For example, Reactome conveniently defines its genesets in a hierarchy of pathways. This data can be formatted into a relational genesets object called rgsets.
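To make the idea of summarizing concrete, here is a toy sketch in base R. The identifiers are made up, and this is not how hypeR aggregates results internally; it only shows how parent-child relationships let several enriched leaf pathways be grouped under a more general process.

```r
# Toy hierarchy with hypothetical identifiers (not real Reactome data):
# P1 and P2 are general processes, L1-L3 are specific pathways.
edges <- data.frame(from = c("P1", "P1", "P2"),
                    to   = c("L1", "L2", "L3"),
                    stringsAsFactors = FALSE)

# Leaf pathways assumed to have come out of an enrichment test
enriched <- c("L1", "L2", "L3")

# Look up the parent of each child, then group enriched leaves by parent
parent_of <- setNames(edges$from, edges$to)
summary_by_parent <- split(enriched, parent_of[enriched])
summary_by_parent
```

Here three enriched pathways collapse into two parent processes, which is the kind of reduction a hierarchy makes possible.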
We currently curate some relational genesets for use with hypeR and plan to add more continuously.
source gsets
1 KEGG KEGG_v92.0.rds
2 METABOANALYST METABOANALYST_DISEASE_CSF_v5.0.rds
3 METABOANALYST METABOANALYST_DISEASE_FECAL_v5.0.rds
4 METABOANALYST METABOANALYST_DISEASE_URINE_v5.0.rds
5 METABOANALYST METABOANALYST_DRUG_v5.0.rds
6 METABOANALYST METABOANALYST_KEGG_v5.0.rds
7 METABOANALYST METABOANALYST_SMPDB_v5.0.rds
8 METABOANALYST METABOANALYST_WITH_HMDB
9 METABOANALYST METABOANALYST_WITH_HMDB
10 METABOANALYST METABOANALYST_WITH_HMDB
11 METABOANALYST METABOANALYST_WITH_HMDB
12 METABOANALYST METABOANALYST_WITH_HMDB
13 METABOANALYST METABOANALYST_WITH_HMDB
14 METABOANALYST METABOANALYST_WITH_HMDB
15 METABOANALYST README.md
16 REACTOME REACTOME_v70.0.rds
17 SMPDB SMPDB_v2.75.rds
Downloading relational genesets is easy…
genesets <- hyperdb_rgsets("REACTOME", "70.0")
They can be passed directly to hypeR()…
hypeR(signature, genesets)
We try to provide relational genesets for popular databases that include hierarchical information. For users who want to create their own, we provide this example.
Raw data for gsets, nodes, and edges can be directly downloaded.
genesets.url <- "https://reactome.org/download/current/ReactomePathways.gmt.zip"
nodes.url <- "https://reactome.org/download/current/ReactomePathways.txt"
edges.url <- "https://reactome.org/download/current/ReactomePathwaysRelation.txt"
# The pipe operator (%>%) used below comes from magrittr
library(magrittr)

# Genesets
# read.gmt() is assumed to return a named list of character vectors, one per geneset
genesets.tmp <- tempfile(fileext=".gmt.zip")
download.file(genesets.url, destfile = genesets.tmp, mode = "wb")
genesets.raw <- genesets.tmp %>%
    unzip() %>%
    read.gmt() %>%
    lapply(function(x) {
        # Drop the "Reactome Pathway" entry and upper-case the gene symbols
        toupper(x[x != "Reactome Pathway"])
    })

# Nodes: pathway identifier, label, and species
nodes.raw <- nodes.url %>%
    read.delim(sep="\t",
               header=FALSE,
               fill=TRUE,
               col.names=c("id", "label", "species"),
               stringsAsFactors=FALSE)

# Edges: parent (from) and child (to) pathway identifiers
edges.raw <- edges.url %>%
    read.delim(sep="\t",
               header=FALSE,
               fill=TRUE,
               col.names=c("from", "to"),
               stringsAsFactors=FALSE)

# Species-specific nodes, keyed by identifier
nodes <- nodes.raw %>%
    dplyr::filter( label %in% names(genesets.raw) ) %>%
    dplyr::filter( species == "Homo sapiens" ) %>%
    dplyr::filter(! duplicated(id) ) %>%
    magrittr::set_rownames( .$id ) %>%
    { .[, "label", drop=FALSE] }

# Species-specific edges: keep edges whose endpoints are both retained nodes
edges <- edges.raw %>%
    dplyr::filter( from %in% rownames(nodes) ) %>%
    dplyr::filter( to %in% rownames(nodes) )

# Leaf genesets: nodes that never appear as a parent
genesets <- nodes %>%
    rownames() %>%
    .[! . %in% edges$from] %>%
    sapply( function(x) nodes[x, "label"] ) %>%
    genesets.raw[.]
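The leaf-selection step is the least obvious part of the code above, so here is the same logic on a toy hierarchy in base R. All identifiers, labels, and genes are made up for illustration.

```r
# Toy inputs mirroring the structure used above (hypothetical data)
nodes <- data.frame(label = c("Root process", "Leaf A", "Leaf B"),
                    row.names = c("N1", "N2", "N3"),
                    stringsAsFactors = FALSE)
edges <- data.frame(from = c("N1", "N1"),
                    to   = c("N2", "N3"),
                    stringsAsFactors = FALSE)
genesets.raw <- list("Root process" = c("G1", "G2", "G3"),
                     "Leaf A"       = c("G1", "G2"),
                     "Leaf B"       = "G3")

# A leaf is a node that never appears in the "from" (parent) column
leaf_ids <- setdiff(rownames(nodes), edges$from)

# Map leaf identifiers to labels, then pull the matching genesets
genesets <- genesets.raw[nodes[leaf_ids, "label"]]
names(genesets)
```

Only the two leaf genesets are kept; the root's geneset is dropped because N1 appears as a parent in the edges.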
nodes: a single-column data frame of labels whose row names are unique identifiers. Leaf node labels should have an associated geneset, while internal nodes do not have to. Only the genesets in the list of genesets will be tested.
head(nodes)
label
R-HSA-164843 2-LTR circle formation
R-HSA-73843 5-Phosphoribose 1-diphosphate biosynthesis
R-HSA-1971475 A tetrasaccharide linker sequence is required for GAG synthesis
R-HSA-5619084 ABC transporter disorders
R-HSA-1369062 ABC transporters in lipid homeostasis
R-HSA-382556 ABC-family proteins mediated transport
edges: a data frame with two columns of identifiers, indicating directed edges between nodes in the hierarchy.
head(edges)
from to
1 R-HSA-109581 R-HSA-109606
2 R-HSA-109581 R-HSA-169911
3 R-HSA-109581 R-HSA-5357769
4 R-HSA-109581 R-HSA-75153
5 R-HSA-109582 R-HSA-140877
6 R-HSA-109582 R-HSA-202733
genesets: a list of character vectors, named by the geneset labels. Genesets typically sit at the leaves of the hierarchy, although this is not required.
[1] "2-LTR circle formation"
[2] "5-Phosphoribose 1-diphosphate biosynthesis"
[3] "A tetrasaccharide linker sequence is required for GAG synthesis"
[4] "ABC transporters in lipid homeostasis"
[5] "ABO blood group biosynthesis"
[6] "ADORA2B mediated anti-inflammatory cytokines production"
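Because the three pieces must agree with each other, a quick consistency check before building the relational object can save debugging time. The sketch below is an assumption about what is worth checking, based on the descriptions above; check_rgsets_inputs is a hypothetical helper and is not part of hypeR.

```r
# Hypothetical helper (not part of hypeR): basic cross-checks between
# the genesets list, the nodes data frame, and the edges data frame.
check_rgsets_inputs <- function(genesets, nodes, edges) {
  c(
    # Every edge endpoint should be a known node identifier
    edges_in_nodes    = all(c(edges$from, edges$to) %in% rownames(nodes)),
    # Every geneset name should match a node label
    genesets_in_nodes = all(names(genesets) %in% nodes$label)
  )
}

# Toy data with made-up identifiers
nodes <- data.frame(label = c("Root process", "Leaf A", "Leaf B"),
                    row.names = c("N1", "N2", "N3"),
                    stringsAsFactors = FALSE)
edges <- data.frame(from = c("N1", "N1"),
                    to   = c("N2", "N3"),
                    stringsAsFactors = FALSE)
genesets <- list("Leaf A" = c("G1", "G2"), "Leaf B" = "G3")

check_rgsets_inputs(genesets, nodes, edges)

# An edge pointing at an unknown node is flagged
bad_edges <- rbind(edges, data.frame(from = "N1", to = "N9",
                                     stringsAsFactors = FALSE))
check_rgsets_inputs(genesets, nodes, bad_edges)
```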
rgsets Object
REACTOME v70.0
Genesets
2-LTR circle formation (13)
5-Phosphoribose 1-diphosphate biosynthesis (3)
A tetrasaccharide linker sequence is required for GAG synthesis (26)
ABC transporters in lipid homeostasis (18)
ABO blood group biosynthesis (3)
ADORA2B mediated anti-inflammatory cytokines production (134)
Nodes
label
R-HSA-164843 2-LTR circle formation
R-HSA-73843 5-Phosphoribose 1-diphosphate biosynthesis
R-HSA-1971475 A tetrasaccharide linker sequence is required for GAG synthesis
R-HSA-5619084 ABC transporter disorders
R-HSA-1369062 ABC transporters in lipid homeostasis
R-HSA-382556 ABC-family proteins mediated transport
id length
R-HSA-164843 R-HSA-164843 13
R-HSA-73843 R-HSA-73843 3
R-HSA-1971475 R-HSA-1971475 26
R-HSA-5619084 R-HSA-5619084 78
R-HSA-1369062 R-HSA-1369062 18
R-HSA-382556 R-HSA-382556 22
Edges
from to
1 R-HSA-109581 R-HSA-109606
2 R-HSA-109581 R-HSA-169911
3 R-HSA-109581 R-HSA-5357769
4 R-HSA-109581 R-HSA-75153
5 R-HSA-109582 R-HSA-140877
6 R-HSA-109582 R-HSA-202733