devtools::load_all(".")
##  Loading hypeR.GEM

What are genome-scale metabolic models(GEMs)?

Genome-scale metabolic models(GEMs) are curated metabolic models collect all known metabolic information of a biological system(e.g. bacteria, human cell, etc.). There are three essential components in a GEM:

  • metabolites: all known metabolites of a biological system.
  • reactions: all known reactions of a biological system.
  • enzyme-coding genes: enzymes that catalyze the biological reactions.

Within a GEM, each metabolite is associated with one or more reactions, in which it participates as either a reactant or a product. Each reaction, in turn, is catalyzed by one or more enzyme-coding genes, with exceptions such as exchange reactions and certain transport reactions that are not catalyzed by enzymes. HypeR-GEM leverages this metabolite–reaction–gene triad to establish a framework for mapping metabolites to genes.

For more details, see here.

A schematic representation of a GEM is shown below:

In the diagram, capital letters denote metabolites, and rir_{i} denotes the ithi^{th} reaction. A prototypical metabolic reaction consists of three components:

  • reactants(substrates): one or more metabolites consumed by the reaction (left-hand side).
  • products: one or more metabolites produced by the reaction (right-hand side).
  • catalyst: enzyme-coding genes whose protein products catalyze the reaction.

For example, reaction r1r_{1} consumes metabolites AA and DD and produces metabolite BB, and it may be catalyzed by one or more enzyme-coding genes.

Mapping workflow

To map a metabolite to its associated enzyme-coding genes, hypeR-GEM defines three types of associations:

  • metabolite-reaction association: A metabolite is associated with a reaction if:

    • The metabolite is either the reactant or the product of the reaction (undirectional)
    • The metabolite is is the product of the reaction (directional)
  • reaction-enzyme(-coding) gene association: A reaction is associated with the enzyme-coding genes that encode enzymes catalyzing that reaction,

  • metabolite-enzyme(-coding) gene association: A metabolite is associated with all enzyme-coding genes that catalyze the reactions in which the metabolite participates.

The overall workflow is illustrated in the following diagram:

Given an input metabolic signature, a set of metabolites annotated using the RefMet nomenclature, hypeR-GEM identifies all metabolic reactions in which each metabolite participates as either reactant or product. For reversible reactions, metabolites are treated as both reactants and products. For each reaction (excluding exchange reactions and a subset of transport reactions), the associated enzyme-coding genes are then mapped to the metabolites according to two configurable rules: a non-directional mapping that considers all reactions involving the metabolite, and a directional mapping that restricts associations to reactions producing the metabolite based on the stoichiometry encoded in the model.

Accounting for Gene Specificity

Enzyme-coding genes differ substantially in their reaction specificity. Non-specific (promiscuous) enzymes can catalyze a broad range of reactions, giving rise to dense gene–reaction–metabolite connections and consequently weaker, less specific associations with a given metabolic signature. To quantify gene-level association specificity and mitigate the influence of such “promiscuous” genes, we compute a gene-specificity score based on the hypergeometric distribution.

For each gene gig_{i}, let MM denote he total number of metabolites represented in the GEM, SS the number of metabolites in the metabolic signature, AiA_{i} the number of metabolites associated with gig_{i} and aia_{i} the number of those metabolites that overlap with the signature. The significance score sis_{i} is defined as the upper-tail hypergeometric probability of observing at least aia_{i} overlapping metabolites by chance:

si=P(Xai)=x=aimin(Ai,S)(Aix)(MAiSx)(MS) s_i = P(X \ge a_i) = \sum_{x = a_i}^{\min(A_i, S)} \frac{\binom{A_i}{x}\binom{M - A_i}{S - x}}{\binom{M}{S}}

Smaller values of sis_{i} indicate a more specific association between gene gig_{i} and the metabolic signature.

Sigmoid-based weighting of gene specificity

To incorporate gene-specificity information into downstream enrichment analyses, each gene is assigned a continuous weight wi[0,1]w_{i} \in [0, 1] via a sigmoid transformation of its significance score:

wi=11+ea(log10(si)b) w_i = \frac{1}{1 + e^{a\left(\log_{10}(s_i) - b\right)}} This transformation is monotonic and bounded, ensuring that genes with stronger specificity (smaller sis_{i}) receive larger weights, while genes with weak or nonspecific associations are smoothly down-weighted.

The parameter bb determines the inflection point of the curve (the half-point, where wi=0.5w_{i}=0.5), with default value b=1b=-1 allowing users to control the weight threshold. By default, b=1b=-1, corresponding to a half-point at si=0.1s_{i} = 0.1.

The parameter a>0a>0 regulates the steepness of the curve. Larger values of aa produce a sharper transition, whereas smaller values yield a smoother weighting scheme. The default value a=1a=1 provides a balanced compromise between sensitivity and smoothness.

Alternative weighting scheme

For additional flexibility, hypeR-GEM also supports a linear weighting scheme based directly on the significance score:

wi=1si w_i = 1 - s_{i} which assigns weights proportional to the complement of the gene-specificity p-value (adjusted p-value).

Weighted hypergeometric test

The standard hypergeometric test used in over-representation analysis (ORA) evaluates whether a gene set contains more genes associated with a given category (e.g., pathway or ontology term) than expected by chance, implicitly assuming that all genes contribute equally to the enrichment. However, as described in the “Accounting for Gene Specificity” section, enzyme-coding genes mapped through genome-scale metabolic models (GEMs) can differ substantially in their association specificity with a given metabolic signature. To incorporate this heterogeneity, hypeR-GEM extends the standard hypergeometric test to a weighted formulation that accounts for gene-specific association strengths.

Formally, the probability of observing at least kk overlapping genes between a gene set and the input signature under the standard hypergeometric model is given by:

P(Xk)=j=kn(Kj)(NKnj)(Nn), P(X \ge k) = \sum_{j = k}^{n} \frac{\binom{K}{j}\binom{N - K}{n - j}}{\binom{N}{n}},

where NN denotes the background size, nn is the number of GEM-mapped enzyme-coding genes in the signature, KK is the size of the gene set, and kk is the number of overlapping genes.

In the weighted formulation, each gene ii is assigned a weight wi[0,1]w_i \in [0,1] reflecting its association specificity. The effective signature size and overlap are then defined as continuous, weight-adjusted counts:

nw=inwi,kw=ikwi. n_w = \sum_{i \in n} w_i, \qquad k_w = \sum_{i \in k} w_i.

Because the hypergeometric distribution is defined for integer-valued inputs, these quantities are rounded to the nearest integer using standard rounding:

nw=12+inwi,kw=12+ikwi, n_w = \left\lfloor \tfrac{1}{2} + \sum_{i \in n} w_i \right\rfloor, \qquad k_w = \left\lfloor \tfrac{1}{2} + \sum_{i \in k} w_i \right\rfloor,

where \lfloor \cdot \rfloor denotes the floor operator.

Substituting nwn_w and kwk_w into the hypergeometric test yields a weighted enrichment score that down-weights non-specific genes, thereby reducing noise and limiting spurious enrichments.

Workflow Illustration

Here, we illustrate the workflow using metabolite signatures associated with coronavirus disease 2019 (COVID-19), derived from human urine samples(See reference). This study profiled the urinary metabolome of 71 patients with COVID-19 and 27 healthy controls. The publicly available COVID-19–associated metabolites were classified into two signatures: up-regulated and down-regulated.

Load metabolite signature

data(COVID_urine)

HypeR-GEM mapping

The core function in hypeR-GEM is signature2gene(), which maps metabolite signatures to enzyme-coding genes using GEMs. The function takes the following key arguments:

  • signature: A list of metabolic signatures. Each element must be a data frame containing a column whose name matches the reference_key argument, , which specifies the RefMet annotation for each metabolite.

  • directional: Logical argument specifying the metabolite–reaction mapping rule.

    • If TRUE, metabolites are mapped only to reactions in which they appear as products (directional mapping).
    • If FALSE, metabolites are mapped to reactions in which they appear as either reactants or products (non-directional mapping).
  • promiscuous_threshold: Maximum allowable number of metabolites associated with a gene. Genes exceeding this threshold are classified as promiscuous and excluded from downstream analysis.

  • background: Background used in the gene-specific hypergeometric test. By default, this is set to the total number of metabolites represented in the GEM.

  • reference_key: A character string indicating the column name in each signature data frame that contains RefMet metabolite identifiers.

## Undirectional mapping
hypeR_GEM_obj <- hypeR.GEM::signature2gene(signatures = COVID_urine, 
                                           directional = FALSE,
                                           promiscuous_threshold = 100, 
                                           reference_key = 'refmet_name',
                                           background = NULL)

##Directional mapping
hypeR_GEM_obj_di <- hypeR.GEM::signature2gene(signatures = COVID_urine,
                                              directional = TRUE,
                                              promiscuous_threshold = 100,
                                              reference_key = 'refmet_name',
                                              background = NULL)

HypeR-GEM output

Here, we illustrate the output of hypeR-GEM using the results from non-directional mapping. The output of hypeR.GEM::signature2gene() is a structured list comprising three major components:

  • promiscuous_met: A data frame listing metabolites considered promiscuous, i.e., those associated with an excessively large number of genes. The filtering threshold is defined by the promiscuous_threshold parameter in signature2gene().

  • mapped_metabolite_signatures: A list in which each element corresponds to a metabolite signature. Each element is a data frame that provides the identifier of the metabolite in the GEM (name), its full name (fullname), the corresponding RefMet annotation (refmet_name), and the number of genes associated with that metabolite (gene_association).

  • gene_table: The core output of the mapping procedure. This is a list where each element corresponds to a metabolite signature and contains a data frame of the mapped enzyme-coding genes. For each gene, the data frame reports:

    • the number of catalyzed reactions (associated_reactions)
    • the total number of associated metabolites in the GEM (total_association)
    • the number of associated metabolites within the given signature (signature_association)
    • the p-value quantifying the mapping specificity (p_value)

promiscuous_met

head(hypeR_GEM_obj$promiscuous_met)
##            refmet_name gene_association
## 1        Glutamic acid              109
## 2 S-Adenosylmethionine              152

mapped_metabolite_signatures

Given that we have two metabolite signatures, namely up-regulated and down-regulated, the mapped_metabolite_signatures object is a list of length two. For illustration, we focus here on the up-regulated signature as an example.

hypeR_GEM_obj$mapped_metabolite_signatures$up %>% 
  head(.)
##       name               fullname                refmet_name gene_association
## 1 MAM00654 2-hydroxyphenylacetate 2-Hydroxyphenylacetic acid                2
## 2 MAM00664 2-methylbutyrylglycine     2-Methylbutyrylglycine                4
## 3 MAM00672       2-oxoglutaramate     2-Keto-glutaramic acid                4
## 4 MAM00775  3-hydroxyanthranilate  3-Hydroxyanthranilic acid                3
## 5 MAM00788 3-hydroxy-L-kynurenine          Hydroxykynurenine                7
## 6 MAM00952   4-acetamidobutanoate   4-Acetamidobutanoic acid                6
##      formula exact_mass   super_class               main_class
## 1     C8H8O3   152.0473 Organic acids         Phenylpropanoids
## 2   C7H13NO3   159.0895 Organic acids Amino acids and peptides
## 3    C5H7NO4   145.0375 Organic acids               Keto acids
## 4    C7H7NO3   153.0426    Benzenoids                 Benzenes
## 5 C10H12N2O4   224.0797 Organic acids Amino acids and peptides
## 6   C6H11NO3   145.0739 Organic acids Amino acids and peptides
##                sub_class
## 1         Cinnamic acids
## 2            Amino acids
## 3 Short-chain keto acids
## 4   Hydroxybenzoic acids
## 5            Amino acids
## 6            Amino acids

gene_table

Similarly, the gene_table is also a list object of length two. For illustration, we focus here on the mapped enzyme-coding genes derived from the up-regulated metabolite signature as an example.

hypeR_GEM_obj$gene_tables$up %>% 
  head(.) 
##              name  symbol associated_reactions total_association
## 1 ENSG00000158104     HPD                    2                 6
## 2 ENSG00000155380 SLC16A1                   80                20
## 3 ENSG00000149124   GLYAT                   11                23
## 4 ENSG00000156689 GLYATL2                    5                13
## 5 ENSG00000166840 GLYATL1                    5                13
## 6 ENSG00000165970  SLC6A5                    5                 6
##   signature_association signature_size background      p_value         fdr
## 1                     1             49       4112 0.0020241802 0.003844373
## 2                     1             49       4112 0.0230551953 0.025443797
## 3                     3             49       4112 0.0001334704 0.001090009
## 4                     2             49       4112 0.0004183142 0.002440166
## 5                     2             49       4112 0.0004183142 0.002440166
## 6                     1             49       4112 0.0020241802 0.003844373
##   one_minus_p_value one_minus_fdr
## 1         0.9979758     0.9961556
## 2         0.9769448     0.9745562
## 3         0.9998665     0.9989100
## 4         0.9995817     0.9975598
## 5         0.9995817     0.9975598
## 6         0.9979758     0.9961556
##                                    associated_metabolites
## 1                              2-Hydroxyphenylacetic acid
## 2                              2-Hydroxyphenylacetic acid
## 3 2-Methylbutyrylglycine;Isobutyrylglycine;Suberylglycine
## 4                2-Methylbutyrylglycine;Isobutyrylglycine
## 5                2-Methylbutyrylglycine;Isobutyrylglycine
## 6                                  2-Methylbutyrylglycine

Enrichment analysis

Here, we use the REACTOME gene sets as an example to illustrate the downstream enrichment analysis of enzyme-coding genes mapped by hypeR-GEM.

data(reactome)

In this example, the background is defined as all enzyme-coding genes represented in the Human GEM model (see reference). The enrichment() function performs gene-set enrichment analysis and accepts the following key arguments:

  • hypeR_GEM_obj: The object returned by signature2gene() function.

  • genesets: A list of gene-sets

  • genesets_name: A descriptive name for the gene-set compendium (e.g., “REACTOME”).

  • method: Enrichment method. “unweighted” applies the standard Fisher/hypergeometric test, while “weighted” applies the weighted hypergeometric test.

  • weighted_by: Used only when method = "weighted". Specifies the column in hyper_GEM_obj[["gene_table"]] containing the significance score sis_{i} for each gene gig_{i}.

  • sigmoid_transformation: Logical. If TRUE, applies the sigmoid transformation to sis_{i}, if FALSE, 1si1-s_{i} is used.

  • min_metabolite: Additional filtering criterion that specifies the minimum number of metabolites required for enrichment for account for metabolite specificity.

Downstream visualization is supported by the enrichment_plot() and rctbls() functions.

In our example, results obtained using the non-directional mapping rule show that pathways such as Sialic acid metabolism, Trafficking and processing of endosomal TLR, and Choline catabolism are enriched only under the unweighted test, suggesting that these signals are primarily driven by non-specific genes.

Unweighted (standard) hypergeometric test

max_fdr <- 0.05

## Enrichment analysis from undirectional mapping
enrichment_obj <- hypeR.GEM::enrichment(hypeR_GEM_obj,
                                        genesets = reactome,
                                        genesets_name = "REACTOME",
                                        method='unweighted',
                                        min_metabolite = 2,
                                        background=3068)

p <- hypeR.GEM::enrichment_plot(enrichment_obj,
                                top=40,
                                abrv=50,
                                size_by="significance",
                                fdr_cutoff= max_fdr,
                                val='fdr')+
  ggplot2::ggtitle(paste("FDR ≤", max_fdr)) +
  ggplot2::theme(axis.text.y = element_text(size = 7))

p

Reactable table

rctbls(enrichment_obj)

Weighted hypergeometric test

max_fdr <- 0.05

## Enrichment analysis from undirectional mapping
enrichment_obj_wt <- hypeR.GEM::enrichment(hypeR_GEM_obj,
                                          genesets = reactome,
                                          genesets_name = "REACTOME",
                                          method='weighted',
                                          weighted_by = 'fdr',
                                          sigmoid_transformation = TRUE,
                                          min_metabolite = 2,
                                          background=3068)



p <- hypeR.GEM::enrichment_plot(enrichment_obj_wt,
                                top=40,
                                abrv=50,
                                size_by="significance",
                                fdr_cutoff= max_fdr,
                                val='fdr')+
  ggplot2::ggtitle(paste("FDR ≤", max_fdr)) +
  ggplot2::theme(axis.text.y = element_text(size = 7))

p

Reactable table

rctbls(enrichment_obj_wt)

Sankey diagram

The sankey_plot() function generates a three-layer Sankey diagrams connecting metabolites, the metabolite sets (MSets), and the gene sets (GSets) derived from hypeR-GEM. Edge weights between the metabolite and Msets layers reflect the number of shared metabolites, while edges between the MSets and GSets layers reflect the overlap between hypeR-GEM–mapped genes and genes in each GSets. The functions takes the following key arguments:

  • hypeR_GEM_obj: The object returned by signature2gene() function.

  • msets_list: A list of metabolite-sets. Each element should be a named list where names are set names and values are character vectors of metabolite identifiers matching the key argument.

  • gsets_list: A list of gene-sets. Each element should be a named list where names are set names and values are character vectors of metabolite identifiers matching the key argument.

  • key: Column name in hypeR_GEM_obj[["mapped_metabolite_signatures"]] used as the metabolite identifier.

As an illustrative example, we applied this function to the hypeR-GEM_obj constructed from age-associated metabolites identified in the New England Centenarian Study (NECS) (See reference). A curated subset of enriched MSets from these metabolites, together with a curated subset of enriched GSets derived from hypeR-GEM–mapped enzyme-coding genes, were visualized in an integrative, multi-omics context.

data("hypeR_object_NECS")
data("metabolon_msets_list")
data("gsets_list")

p <- sankey_plot(hypeR_object_NECS, 
                 msets_list = list(metabolon_msets_list$metabolonSupPthwy), 
                 gsets_list = list(gsets_list$hallmark), 
                 key = "refmet_name",
                 font_size=16,
                 node_width=20)

p

Export to Excel files

Mapping results can be exported to Excel files using the hypGEM_to_excel() function:

hypGEM_to_excel(hypeR_GEM_obj, file = "hypeR_GEM_non_directional.xlsx")
hypGEM_to_excel(hypeR_GEM_obj_di, file = "hypeR_GEM_directional.xlsx")