Object Oriented Genesets

Genesets are simply a named list of character vectors that can be passed directly to hypeR(). Alternatively, one can pass a gsets object, which retains the name and version of the genesets used. This versioning is included when exporting results or generating reports, which helps keep your results reproducible.

genesets <- list("GSET1" = c("GENE1", "GENE2", "GENE3"),
                 "GSET2" = c("GENE4", "GENE6"),
                 "GSET3" = c("GENE7", "GENE8", "GENE9"))

Creating a gsets object is easy…

genesets <- gsets$new(genesets, name="Example Genesets", version="v1.0")
print(genesets)

Example Genesets v1.0 
GSET1 (3)
GSET2 (2)
GSET3 (3)
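To make the idea of a versioned geneset container concrete, here is a minimal base R sketch. It is not hypeR's gsets implementation; toy_gsets and format_toy_gsets are hypothetical names used only for illustration.

```r
# Hypothetical stand-in for a versioned geneset container (not hypeR's gsets class)
toy_gsets <- function(genesets, name, version) {
  # Genesets must be a named list of character vectors
  stopifnot(is.list(genesets),
            !is.null(names(genesets)),
            all(nzchar(names(genesets))),
            all(vapply(genesets, is.character, logical(1))))
  list(genesets = genesets, name = name, version = version)
}

# Build the summary lines: a header followed by "NAME (size)" for each geneset
format_toy_gsets <- function(x) {
  sizes <- lengths(x$genesets)
  c(paste(x$name, x$version),
    paste0(names(sizes), " (", sizes, ")"))
}

example <- toy_gsets(list("GSET1" = c("GENE1", "GENE2", "GENE3"),
                          "GSET2" = c("GENE4", "GENE6")),
                     name = "Example Genesets", version = "v1.0")
writeLines(format_toy_gsets(example))
```

Keeping the name and version next to the genesets means any downstream report can print them without the user having to track them separately.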

And can be passed directly to hypeR()…

hypeR(signature, genesets)
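As a rough illustration of the two inputs only, the base R sketch below counts how many signature genes fall in each geneset. It assumes the signature is a plain character vector of gene symbols; the values are made up, and this overlap count is not the statistical test that hypeR() performs.

```r
# Made-up signature: a character vector of gene symbols
signature <- c("GENE1", "GENE2", "GENE6", "GENE9")

# Genesets as a plain named list of character vectors
genesets <- list("GSET1" = c("GENE1", "GENE2", "GENE3"),
                 "GSET2" = c("GENE4", "GENE6"),
                 "GSET3" = c("GENE7", "GENE8", "GENE9"))

# Number of signature genes found in each geneset
overlaps <- vapply(genesets,
                   function(gs) length(intersect(signature, gs)),
                   integer(1))
print(overlaps)
```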

To aid in workflow efficiency, hypeR enables users to download genesets, wrapped as gsets objects, from multiple data sources.

Downloading msigdb Genesets

Most researchers will find the genesets hosted by msigdb are adequate to perform geneset enrichment analysis. There are various types of genesets available across multiple species.

Please pay attention to the versioning. hypeR defaults to the version of msigdbr installed on your machine, and msigdbr is updated frequently to track the genesets curated by the Broad Institute. Check to make sure you are using the genesets you expect.

msigdb_version()

If you want a specific version, you need to reinstall the dependency.

devtools::install_version("msigdbr", version="7.2.1", repos="http://cran.us.r-project.org")
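After reinstalling, it can be worth confirming which version is actually in use. The sketch below relies only on base R and utils; the expected version is simply the example value from the command above, and the comparison logic is an illustration rather than part of hypeR.

```r
# Version the analysis was designed around (example value, adjust as needed)
expected <- package_version("7.2.1")

# packageVersion() fails if the package is missing, so check for it first
if (requireNamespace("msigdbr", quietly = TRUE)) {
  installed <- utils::packageVersion("msigdbr")
  if (installed != expected) {
    message("msigdbr ", as.character(installed),
            " is installed, but ", as.character(expected), " was expected")
  }
} else {
  installed <- NA
  message("msigdbr is not installed")
}
```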

msigdb_info()

Here we download the Hallmarks genesets…

HALLMARK <- msigdb_gsets(species="Homo sapiens", category="H")
print(HALLMARK)

H v7.2.1 
HALLMARK_ADIPOGENESIS (200)
HALLMARK_ALLOGRAFT_REJECTION (200)
HALLMARK_ANDROGEN_RESPONSE (100)
HALLMARK_ANGIOGENESIS (36)
HALLMARK_APICAL_JUNCTION (200)
HALLMARK_APICAL_SURFACE (44)

We can also clean up the names by removing the leading common substring…

HALLMARK <- msigdb_gsets(species="Homo sapiens", category="H", clean=TRUE)
print(HALLMARK)

H v7.2.1 
Adipogenesis (200)
Allograft Rejection (200)
Androgen Response (100)
Angiogenesis (36)
Apical Junction (200)
Apical Surface (44)
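The effect of cleaning can be approximated in base R. This is only a sketch of the kind of transformation shown above, not hypeR's own code; clean_names is a hypothetical helper that assumes names are underscore-delimited and share the same first token.

```r
# Hypothetical helper approximating the cleaned names shown above
clean_names <- function(x) {
  # Drop the first underscore-delimited token if every name shares it
  first <- sub("_.*$", "", x)
  if (all(grepl("_", x)) && length(unique(first)) == 1) {
    x <- sub("^[^_]*_", "", x)
  }
  # Underscores to spaces, then capitalize the first letter of each word
  x <- tolower(gsub("_", " ", x))
  gsub("\\b([a-z])", "\\U\\1", x, perl = TRUE)
}

clean_names(c("HALLMARK_ADIPOGENESIS",
              "HALLMARK_ALLOGRAFT_REJECTION",
              "HALLMARK_APICAL_SURFACE"))
```

Stripping a whole token rather than the longest common character prefix avoids eating the shared leading "A" in names such as ADIPOGENESIS and ALLOGRAFT.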

This can be passed directly to hypeR()…

hypeR(signature, genesets=HALLMARK)

Other commonly used genesets include BioCarta, KEGG, and Reactome…

BIOCARTA <- msigdb_gsets(species="Homo sapiens", category="C2", subcategory="CP:BIOCARTA")
KEGG     <- msigdb_gsets(species="Homo sapiens", category="C2", subcategory="CP:KEGG")
REACTOME <- msigdb_gsets(species="Homo sapiens", category="C2", subcategory="CP:REACTOME")

Downloading enrichr Genesets

If msigdb genesets are not sufficient, we have also provided another set of functions for downloading and loading other publicly available genesets. This is facilitated by interfacing with the publicly available libraries hosted by enrichr.

available <- enrichr_available()
reactable(available)

ATLAS <- enrichr_gsets("Human_Gene_Atlas")
print(ATLAS)

Human_Gene_Atlas (Enrichr) Downloaded: 2021-06-30 
Thyroid (388)
CD33+ Myeloid (679)
Testis (384)
Salivarygland (57)
CD14+ Monocytes (383)
TestisLeydigCell (107)

Note: These libraries do not have a systematic versioning scheme; however, the date downloaded will be recorded.

You can also download genesets for other species if you aren’t working with human or mouse genes!
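When no version exists, a download date is a reasonable substitute for provenance. The following sketch shows one way such a label could be built in base R; stamp_download is a hypothetical helper and is not how hypeR constructs its label internally.

```r
# Hypothetical helper: label an unversioned library with its download date
stamp_download <- function(library_name, source = "Enrichr", date = Sys.Date()) {
  paste0(library_name, " (", source, ") Downloaded: ", format(date, "%Y-%m-%d"))
}

# A fixed date is passed here so the result is reproducible
label <- stamp_download("Human_Gene_Atlas", date = as.Date("2021-06-30"))
print(label)
```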

yeast <- enrichr_gsets("GO_Biological_Process_2018", db="YeastEnrichr")
worm <- enrichr_gsets("GO_Biological_Process_2018", db="WormEnrichr")
fish <- enrichr_gsets("GO_Biological_Process_2018", db="FishEnrichr")
fly <- enrichr_gsets("GO_Biological_Process_2018", db="FlyEnrichr")

Relational Genesets

When dealing with hundreds of genesets, it’s often useful to understand the relationships between them. This allows researchers to summarize many enriched pathways as more general biological processes. To do this, we rely on curated relationships defined between them. For example, Reactome conveniently defines its genesets in a hierarchy of pathways. This data can be formatted into a relational genesets object called rgsets.

We currently curate some relational genesets for use with hypeR and plan to add more continuously.

hyperdb_available()

          source                                gsets
1           KEGG                       KEGG_v92.0.rds
2  METABOANALYST   METABOANALYST_DISEASE_CSF_v5.0.rds
3  METABOANALYST METABOANALYST_DISEASE_FECAL_v5.0.rds
4  METABOANALYST METABOANALYST_DISEASE_URINE_v5.0.rds
5  METABOANALYST          METABOANALYST_DRUG_v5.0.rds
6  METABOANALYST          METABOANALYST_KEGG_v5.0.rds
7  METABOANALYST         METABOANALYST_SMPDB_v5.0.rds
8  METABOANALYST              METABOANALYST_WITH_HMDB
9  METABOANALYST              METABOANALYST_WITH_HMDB
10 METABOANALYST              METABOANALYST_WITH_HMDB
11 METABOANALYST              METABOANALYST_WITH_HMDB
12 METABOANALYST              METABOANALYST_WITH_HMDB
13 METABOANALYST              METABOANALYST_WITH_HMDB
14 METABOANALYST              METABOANALYST_WITH_HMDB
15 METABOANALYST                            README.md
16      REACTOME                   REACTOME_v70.0.rds
17         SMPDB                      SMPDB_v2.75.rds

Downloading relational genesets is easy…

genesets <- hyperdb_rgsets("REACTOME", "70.0")
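The file names in the listing appear to encode the name and version that are passed to hyperdb_rgsets(). As a sketch, assuming the NAME_vVERSION.rds pattern seen in that listing holds, the two parts can be pulled apart with a regular expression in base R.

```r
# A few file names copied from the listing, including one that is not versioned
files <- c("KEGG_v92.0.rds", "REACTOME_v70.0.rds", "SMPDB_v2.75.rds", "README.md")

# Assumed pattern: <name>_v<version>.rds
pattern <- "^(.+)_v([0-9.]+)\\.rds$"
versioned <- grepl(pattern, files)

parsed <- data.frame(name = sub(pattern, "\\1", files[versioned]),
                     version = sub(pattern, "\\2", files[versioned]),
                     stringsAsFactors = FALSE)
print(parsed)
```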

And can be passed directly to hypeR()…

hypeR(signature, genesets)

Creating Relational Genesets

We try to provide relational genesets for popular databases that include hierarchical information. For users who want to create their own, we provide this example.

Raw data for gsets, nodes, and edges can be directly downloaded.

genesets.url <- "https://reactome.org/download/current/ReactomePathways.gmt.zip"
nodes.url <- "https://reactome.org/download/current/ReactomePathways.txt"
edges.url <- "https://reactome.org/download/current/ReactomePathwaysRelation.txt"

Loading Data

library(magrittr)  # provides the %>% pipe used below

# Genesets
genesets.tmp <- tempfile(fileext=".gmt.zip")
download.file(genesets.url, destfile = genesets.tmp, mode = "wb")
genesets.raw <- genesets.tmp %>%
                unzip() %>%
                read.gmt() %>%
                lapply(function(x) {
                    toupper(x[x != "Reactome Pathway"])
                })
# Nodes
nodes.raw <- nodes.url %>%
             read.delim(sep="\t", 
                        header=FALSE, 
                        fill=TRUE, 
                        col.names=c("id", "label", "species"), 
                        stringsAsFactors=FALSE)
# Edges
edges.raw <- edges.url %>%
             read.delim(sep="\t", 
                        header=FALSE, 
                        fill=TRUE, 
                        col.names=c("from", "to"),
                        stringsAsFactors=FALSE)
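For readers unfamiliar with the GMT format, each line is tab-delimited: a geneset name, a description field, and then the member genes. The sketch below parses such lines in base R; parse_gmt_lines is a hypothetical helper written for illustration and is not the read.gmt() function used above.

```r
# Hypothetical helper: parse GMT-formatted lines into a named list of genes
parse_gmt_lines <- function(lines) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  # Everything after the name and description fields is a gene
  genesets <- lapply(fields, function(f) f[-(1:2)])
  names(genesets) <- vapply(fields, `[`, character(1), 1)
  genesets
}

# Two made-up GMT lines
lines <- c("PATHWAY A\tdescription\tgene1\tgene2",
           "PATHWAY B\tdescription\tgene3")

# Upper-case the symbols, as the loading code above does
toy <- lapply(parse_gmt_lines(lines), toupper)
str(toy)
```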

Organizing a Hierarchy

# Species-specific nodes
nodes <- nodes.raw %>%
         dplyr::filter( label %in% names(genesets.raw) ) %>%
         dplyr::filter( species == "Homo sapiens" ) %>%
         dplyr::filter(! duplicated(id) ) %>%
         magrittr::set_rownames( .$id ) %>%
         { .[, "label", drop=FALSE] }

# Species-specific edges
edges <- edges.raw %>%
         dplyr::filter( from %in% rownames(nodes) ) %>%
         dplyr::filter( to %in% rownames(nodes) )

# Leaf genesets
genesets <- nodes %>%
            rownames() %>%
            .[! . %in% edges$from] %>%
            sapply( function(x) nodes[x, "label"] ) %>%
            genesets.raw[.]
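The leaf selection is easier to follow on a toy hierarchy. The sketch below uses made-up identifiers and assumes, as the filtering above implies, that the from column holds the parent and the to column the child, so a leaf is any node that never appears in from.

```r
# Toy hierarchy: N1 -> N2 -> {N3, N4}
nodes <- data.frame(label = c("Root", "Branch", "Leaf 1", "Leaf 2"),
                    row.names = c("N1", "N2", "N3", "N4"),
                    stringsAsFactors = FALSE)
edges <- data.frame(from = c("N1", "N2", "N2"),
                    to   = c("N2", "N3", "N4"),
                    stringsAsFactors = FALSE)
genesets.raw <- list("Root"   = c("A", "B", "C"),
                     "Branch" = c("A", "B"),
                     "Leaf 1" = "A",
                     "Leaf 2" = "B")

# Leaves are nodes that are never a parent
leaf.ids <- setdiff(rownames(nodes), edges$from)

# Map leaf identifiers to labels, then select the matching genesets
leaf.labels <- nodes[leaf.ids, "label"]
genesets <- genesets.raw[leaf.labels]
names(genesets)
```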

nodes

A single-column data frame of labels whose row names are unique identifiers. Leaf node labels should have an associated geneset, while internal nodes do not need one. Only the genesets in the list of genesets will be tested.

head(nodes)

                                                                        label
R-HSA-164843                                           2-LTR circle formation
R-HSA-73843                        5-Phosphoribose 1-diphosphate biosynthesis
R-HSA-1971475 A tetrasaccharide linker sequence is required for GAG synthesis
R-HSA-5619084                                       ABC transporter disorders
R-HSA-1369062                           ABC transporters in lipid homeostasis
R-HSA-382556                           ABC-family proteins mediated transport

edges

A data frame with two columns of identifiers, indicating directed edges between nodes in the hierarchy.

head(edges)

          from            to
1 R-HSA-109581  R-HSA-109606
2 R-HSA-109581  R-HSA-169911
3 R-HSA-109581 R-HSA-5357769
4 R-HSA-109581   R-HSA-75153
5 R-HSA-109582  R-HSA-140877
6 R-HSA-109582  R-HSA-202733
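Because the edges are directed, they can be walked to collect everything beneath a given node, which is what makes it possible to summarize many leaf pathways under a broader parent. The following base R sketch uses a made-up edge table and a hypothetical descendants helper, and assumes from is the parent and to is the child.

```r
# Made-up edge table: A -> {B, C}, B -> {D, E}
edges <- data.frame(from = c("A", "A", "B", "B"),
                    to   = c("B", "C", "D", "E"),
                    stringsAsFactors = FALSE)

# Hypothetical helper: collect all nodes reachable from `node`
descendants <- function(edges, node) {
  found <- character(0)
  frontier <- node
  while (length(frontier) > 0) {
    children <- edges$to[edges$from %in% frontier]
    # Only expand nodes not seen before, so the loop always terminates
    frontier <- setdiff(children, found)
    found <- union(found, frontier)
  }
  found
}

descendants(edges, "A")
descendants(edges, "B")
```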

genesets

A list of character vectors, named by the geneset labels. Typically, genesets will be at the leaves of the hierarchy, although this is not required.

head(names(genesets))

[1] "2-LTR circle formation"                                         
[2] "5-Phosphoribose 1-diphosphate biosynthesis"                     
[3] "A tetrasaccharide linker sequence is required for GAG synthesis"
[4] "ABC transporters in lipid homeostasis"                          
[5] "ABO blood group biosynthesis"                                   
[6] "ADORA2B mediated anti-inflammatory cytokines production"
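Before combining the three pieces, it can help to check that they agree with one another. The sketch below encodes the expectations described above as simple base R checks; check_inputs is a hypothetical helper, not a hypeR function, and the toy data is made up.

```r
# Hypothetical helper: return a character vector of problems (empty if none)
check_inputs <- function(genesets, nodes, edges) {
  problems <- character(0)
  if (!all(c(edges$from, edges$to) %in% rownames(nodes))) {
    problems <- c(problems, "edges reference identifiers missing from nodes")
  }
  if (!all(names(genesets) %in% nodes$label)) {
    problems <- c(problems, "genesets are named with labels missing from nodes")
  }
  # Leaves are nodes that never appear as a parent
  leaves <- setdiff(rownames(nodes), edges$from)
  if (!all(nodes[leaves, "label"] %in% names(genesets))) {
    problems <- c(problems, "some leaf nodes have no geneset")
  }
  problems
}

# Made-up inputs: one internal node with two leaves
nodes <- data.frame(label = c("Parent", "Leaf 1", "Leaf 2"),
                    row.names = c("N1", "N2", "N3"),
                    stringsAsFactors = FALSE)
edges <- data.frame(from = c("N1", "N1"), to = c("N2", "N3"),
                    stringsAsFactors = FALSE)
genesets <- list("Leaf 1" = c("A", "B"), "Leaf 2" = c("C"))

check_inputs(genesets, nodes, edges)
```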

The rgsets Object

genesets <- rgsets$new(genesets, nodes, edges, name="REACTOME", version="v70.0")
print(genesets)

REACTOME v70.0 

Genesets

2-LTR circle formation (13)
5-Phosphoribose 1-diphosphate biosynthesis (3)
A tetrasaccharide linker sequence is required for GAG synthesis (26)
ABC transporters in lipid homeostasis (18)
ABO blood group biosynthesis (3)
ADORA2B mediated anti-inflammatory cytokines production (134)

Nodes

                                                                        label
R-HSA-164843                                           2-LTR circle formation
R-HSA-73843                        5-Phosphoribose 1-diphosphate biosynthesis
R-HSA-1971475 A tetrasaccharide linker sequence is required for GAG synthesis
R-HSA-5619084                                       ABC transporter disorders
R-HSA-1369062                           ABC transporters in lipid homeostasis
R-HSA-382556                           ABC-family proteins mediated transport
                         id length
R-HSA-164843   R-HSA-164843     13
R-HSA-73843     R-HSA-73843      3
R-HSA-1971475 R-HSA-1971475     26
R-HSA-5619084 R-HSA-5619084     78
R-HSA-1369062 R-HSA-1369062     18
R-HSA-382556   R-HSA-382556     22

Edges

          from            to
1 R-HSA-109581  R-HSA-109606
2 R-HSA-109581  R-HSA-169911
3 R-HSA-109581 R-HSA-5357769
4 R-HSA-109581   R-HSA-75153
5 R-HSA-109582  R-HSA-140877
6 R-HSA-109582  R-HSA-202733

Downloading Genesets