Create OmicSignature
Vanessa Mengze Li
05/15/2025
CreateOmS.Rmd
1. Cheatsheet
An OmicSignature
object contains three parts:
metadata, a list.
Required fields:
“signature_name”, “organism”, “direction_type”, “assay_type”, “phenotype”.
Recommended optional fields, if applicable:
“platform”, “sample_type”, “description”, “covariates”, “score_cutoff”, “adj_p_cutoff”.signature, a data frame.
Required columns:
“probe_id” (unique identifier for each feature)
“feature_name” (e.g. ENSEMBL ID, Uniprot ID)
Required for bi-directional and categorical signatures: “direction”
Recommended optional column, if applicable: “score”difexp (optional), a data frame of differential expression analysis results.
Required columns:
“probe_id” (unique identifier for each feature)
“feature_name” (e.g. ENSEMBL ID, Uniprot ID)
“score”
At least one of the following: “p_value”, “q_value”, or “adj_p”.
Recommended optional column, if applicable: “gene_symbol”
Create the object:
OmS <- OmicSignature$new(
metadata = metadata,
signature = signature,
difexp = difexp
)
2. Create an OmicSignature
Object Step-by-Step
The example provided below is from an experiment evaluating the
effects of Myc reduced expression by comparing liver profiles of
24-month old
Mycvs.
Myc
mice. This is a bi-directional signature example, since it
contains up and down regulated features (genes). For ease of exposition,
only the top 1000 genes are here included.
2.1. Metadata
A list with the following required fields:
“signature_name”, “organism”,
“direction_type”, “assay_type”,
“phenotype”.
Not required, but highly recommended fields:
“platform”, “sample_type”,
“description”, “covariates”,
“score_cutoff”, “adj_p_cutoff”,
“logfc_cutoff”.
Option 1: Create metadata
by hand. This is not
recommended, because typos can occur.
metadata <- list(
"signature_name" = Myc_reduce_mice_liver_24m,
"organism" = "Mus musculus",
"sample_type" = "liver",
"phenotype" = "Myc_reduce",
"direction_type" = "bi-directional",
"assay_type" = "transcriptomics",
"platform" = "transcriptomics by array"
)
Option 2: Use createMetadata()
(recommended).
This function helps remind you of the built-in attributes. The full list
of current built-in attributes is shown here.
You can also provide your own customized attributes using the “others”
field.
metadata <- createMetadata(
# required attributes:
signature_name = "Myc_reduce_mice_liver_24m",
organism = "Mus musculus",
direction_type = "bi-directional",
assay_type = "transcriptomics",
phenotype = "Myc_reduce",
# optional and recommended:
covariates = "none",
description = "mice Myc haploinsufficient (Myc(+/-))",
platform = "transcriptomics by array",
sample_type = "liver", # use BRENDA ontology
# optional cut-off attributes.
# specifying them can facilitate the extraction of signatures.
logfc_cutoff = NULL,
p_value_cutoff = NULL,
adj_p_cutoff = 0.05,
score_cutoff = 5,
# other optional built-in attributes:
keywords = c("Myc", "KO", "longevity"),
cutoff_description = NULL,
author = NULL,
PMID = 25619689,
year = 2015,
# example of customized attributes:
others = list("animal_strain" = "C57BL/6")
)
2.1.1 “phenotype”
“phenotype” is free text, one or two-word description of the
experimental conditions, such as a drug treatment, a gene knock out, or
a clinical trait. For example, age or breast
cancer.
Providing a detailed (free text) “description” is highly recommended.
For instance, it may include information about how the treatment was
administered and how each group was defined.
2.1.2 “organism
“organism” is free-text. We provide a list of common organisms, which
you can search using searchOrganism
. Other entries are
allowed, but please use standard naming conventions (e.g., “Homo
sapiens”, “Mus musculus”) to ensure consistency.
searchOrganism("homo")
#> [1] "Homo sapiens"
2.1.3 “sample_type” and “platform”
While both “sample_type” and “platform” are free text, it is
recommended that predefined terms are used. Thus, “sample_type” should
be a BRENDA ontology term, and “platform” should be one of a set of
predefined platforms, if appropriate. You can search for predefined
terms via the searchSampleType
and
searchPlatform
functions. See “Sample
Type & Platform Info” for details.
2.1.4 “direction_type”
direction_type
must be one of the following:
“uni-directional”. Only a list of significant feature names is available. Examples including “genes mutated in a disease” and “markers of a cell type”.
“bi-directional”. Significant features can be grouped into “up” and “down” categories. For example, when comparing treatment vs. control groups, some features will be higher (“up”, or “+”) and some will be lower (“down” or “-”) in the treatment group. When the phenotype is a continuous trait, such as age, some features will increase (“up”, or “+”) with age, while others will decrease (“down”, or “-”) with age.
“categorical”. Used with multi-valued categorical phenotypes (e.g., “A” vs. “B” vs. “C”), usually analyzed by ANOVA.
2.2. Signature
signature is a dataframe with columns
“probe_id” and “feature_name”. If the
signature is “bi-directional” or “categorical” (specified in
direction_type
in metadata
), then column
“direction” is also required.
An optional column “score” is recommended when applicable.
Option 1: Extract signature from a differential
analysis results table.
In this example, we extract a bi-directional signature from a
difexp
object (see next section) using the
score_cutoff
and adj_p_cutoff
specified in the
metadata. A difexp
object is simply a properly formatted
data frame reporting the results of a differential analysis (e.g., by
Limma). It looks something like this:
#> probe_id logfc mean score p_value adj_p b feature_name
#> 1 10345228 -0.167 7.106 -1.470 0.186 0.560 -5.866 ENSMUSG00000103746
#> 2 10354534 0.041 4.351 0.520 0.620 0.870 -6.780 ENSMUSG00000060715
#> 3 10354529 -0.175 4.955 -0.941 0.379 0.731 -6.458 ENSMUSG00000043629
#> 4 10346337 0.025 8.621 0.188 0.857 0.962 -6.911 ENSMUSG00000038323
#> 5 10353792 -0.025 6.063 -0.284 0.785 0.936 -6.885 ENSMUSG00000045815
#> 6 10350848 -0.055 7.595 -0.630 0.549 0.836 -6.712 ENSMUSG00000049881
We can then extract the significant features as follows:
signature <- difexp %>%
dplyr::filter(abs(score) > metadata$score_cutoff & adj_p < metadata$adj_p_cutoff) %>%
dplyr::select(probe_id, feature_name, score) %>%
dplyr::mutate(direction = ifelse(score > 0, "+", "-"))
head(signature)
#> probe_id feature_name score direction
#> 1 10346882 ENSMUSG00000025964 -6.990 -
#> 2 10353878 ENSMUSG00000067653 -7.867 -
#> 3 10349648 ENSMUSG00000004552 14.762 +
#> 4 10355278 ENSMUSG00000062209 6.083 +
#> 5 10353192 ENSMUSG00000025932 10.487 +
#> 6 10345762 ENSMUSG00000026072 -13.543 -
Function standardizeSigDF()
can help remove duplicated
and empty names.
signature <- standardizeSigDF(signature)
head(signature)
#> probe_id feature_name score direction
#> 1 10349648 ENSMUSG00000004552 14.762 +
#> 2 10345762 ENSMUSG00000026072 -13.543 -
#> 3 10353192 ENSMUSG00000025932 10.487 +
#> 4 10355259 ENSMUSG00000061816 -10.315 -
#> 5 10351477 ENSMUSG00000102418 8.818 +
#> 6 10353878 ENSMUSG00000067653 -7.867 -
Option 2: Manually create signature dataframe.
For uni-directional signature:
signature <- data.frame("probe_id" = c(1, 2, 3), "feature_name" = c("gene1", "gene2", "gene3"))
For bi-directional signature:
signature <- data.frame(
"probe_id" = c(1, 2, 3),
"feature_name" = c("gene1", "gene2", "gene3"),
"score" = c(0.45, -3.21, 2.44),
"direction" = c("+", "-", "+")
)
For categorical signature:
signature <- data.frame(
"probe_id" = c(1, 2, 3, 4),
"feature_name" = c("gene1", "gene2", "gene3", "gene4"),
"score" = c(0.45, -3.21, 2.44, -2.45),
"direction" = c("group1", "group1", "group2", "group3")
)
2.3. Differential expression analysis results (difexp)
A differential expression dataframe is optional but
recommended if available. It facilitates downstream
signature extraction, and signature comparison by rank-based test.
difexp
is a dataframe with the following
required columns:
“probe_id”, “feature_name”,
“score”, along with at least one of the following:
“p_value”, “q_value”, or
“adj_p”.
“probe_id” is a unique identifier, usually a
platform-specific identifider (e.g., probe IDs for Affymetrix
microarrays, or aptamer IDs for SomaScan assays). If not provided, it
will be automatically generated.
“feature_name” is a name that identifies each feature,
examples include ENSEMBL IDs for transcripts, UniProt IDs for proteins,
and Refmet IDs for metabolites. To better identify the features, it is
recommended to add an additional annotation column(s), e.g.,
“gene_symbol”. For metabolite features, it is recommended to include
multiple annotation columns if available, e.g., HMDB ID and InChI
key.
“p_value”, “q_value”, or
“adj_p” refers to the p- or q-value representing the
significance of each feature.
“score” is a numeric value that indicates the importance or significance
of a feature. Depending on how the signature was derived, this can be
the t-statistics, log-fold change, Z-score, or other summary
statistics.
Here we use an example from the differential expression analysis
using the limma
package.
# Version reading from a txt file
# difexp <- read.table(
# file.path(
# system.file("extdata", package = "OmicSignature"),
# "difmatrix_Myc_mice_liver_24m.txt"
# ),
# header = TRUE, sep = "\t", stringsAsFactors = FALSE
# )
# Version reading from a binary file
difexp <- readRDS(
file.path(
system.file("extdata", package = "OmicSignature"),
"difmatrix_Myc_mice_liver_24m.rds"
)
)
head(difexp)
#> Probe.ID logFC AveExpr t P.Value adj.P.Val b ensembl
#> 1 10345228 -0.167 7.106 -1.470 0.186 0.560 -5.866 ENSMUSG00000103746
#> 2 10354534 0.041 4.351 0.520 0.620 0.870 -6.780 ENSMUSG00000060715
#> 3 10354529 -0.175 4.955 -0.941 0.379 0.731 -6.458 ENSMUSG00000043629
#> 4 10346337 0.025 8.621 0.188 0.857 0.962 -6.911 ENSMUSG00000038323
#> 5 10353792 -0.025 6.063 -0.284 0.785 0.936 -6.885 ENSMUSG00000045815
#> 6 10350848 -0.055 7.595 -0.630 0.549 0.836 -6.712 ENSMUSG00000049881
#> gene_symbol
#> 1 1700001G17Rik
#> 2 1700019A02Rik
#> 3 1700019D03Rik
#> 4 1700066M21Rik
#> 5 1700101I19Rik
#> 6 2810025M15Rik
Manually change the column names to match the requirement. The
built-in function replaceDifexpCol()
is designed to replace
some frequently used alternative column names.
difexp <- difexp %>% rename(feature_name = ensembl)
colnames(difexp) <- replaceDifexpCol(colnames(difexp))
head(difexp)
#> probe_id logfc mean score p_value adj_p b feature_name
#> 1 10345228 -0.167 7.106 -1.470 0.186 0.560 -5.866 ENSMUSG00000103746
#> 2 10354534 0.041 4.351 0.520 0.620 0.870 -6.780 ENSMUSG00000060715
#> 3 10354529 -0.175 4.955 -0.941 0.379 0.731 -6.458 ENSMUSG00000043629
#> 4 10346337 0.025 8.621 0.188 0.857 0.962 -6.911 ENSMUSG00000038323
#> 5 10353792 -0.025 6.063 -0.284 0.785 0.936 -6.885 ENSMUSG00000045815
#> 6 10350848 -0.055 7.595 -0.630 0.549 0.836 -6.712 ENSMUSG00000049881
#> gene_symbol
#> 1 1700001G17Rik
#> 2 1700019A02Rik
#> 3 1700019D03Rik
#> 4 1700066M21Rik
#> 5 1700101I19Rik
#> 6 2810025M15Rik
2.4. Create the OmicSignature
object
OmS <- OmicSignature$new(
metadata = metadata,
signature = signature,
difexp = difexp
)
#> [Success] OmicSignature object Myc_reduce_mice_liver_24m created.
Set print_message
= TRUE
to see the
messages.
OmS <- OmicSignature$new(
metadata = metadata,
signature = signature,
difexp = difexp,
print_message = TRUE
)
#> --Required attributes for metadata: signature_name, phenotype, organism, direction_type, assay_type --
#> [Success] Metadata is saved.
#> [Success] Signature is valid.
#> [Success] difexp is valid.
#> [Success] OmicSignature object Myc_reduce_mice_liver_24m created.
See the created object information:
print(OmS)
#> Signature Object:
#> Metadata:
#> adj_p_cutoff = 0.05
#> assay_type = transcriptomics
#> covariates = none
#> description = mice Myc haploinsufficient (Myc(+/-))
#> direction_type = bi-directional
#> keywords = Myc, KO, longevity
#> organism = Mus musculus
#> others = C57BL/6
#> phenotype = Myc_reduce
#> platform = transcriptomics by array
#> PMID = 25619689
#> sample_type = liver
#> score_cutoff = 5
#> signature_name = Myc_reduce_mice_liver_24m
#> year = 2015
#> Metadata user defined fields:
#> animal_strain = C57BL/6
#> Signature:
#> Length (15)
#> Class (character)
#> Mode (character)
#> Differential Expression Data:
#> 884 x 9
Use new criteria to extract significant features:
(this does not change the signature
saved in the
object)
OmS$extractSignature("abs(score) > 10; adj_p < 0.01")
#> probe_id feature_name score direction
#> 1 10349648 ENSMUSG00000004552 14.762 +
#> 2 10345762 ENSMUSG00000026072 -13.543 -
#> 3 10353192 ENSMUSG00000025932 10.487 +
#> 4 10355259 ENSMUSG00000061816 -10.315 -
Besides saving and reading OmicSignature object in .rds
format, you can export the object as a json text file.
saveRDS(OmS, "Myc_reduce_mice_liver_24m_OmS.rds")
writeJson(OmS, "Myc_reduce_mice_liver_24m_OmS.json")
See more in “Functionalities
of OmicSignature” section.
3. Create an OmicSignature
from difexp
and
metadata
You can by-pass the generating signature process once you are an
expert. Simply provide cutoffs (e.g. adj_p_cutoff
and
score_cutoff
) in the metadata
, make sure
difexp
has “adj_p” and “score” columns, and use
OmicSigFromDifexp()
to automatically extract significant
features and create the OmicSignature
object.
OmS1 <- OmicSigFromDifexp(difexp, metadata)
#> -- criterias used to extract signatures: abs(score) >= 5; adj_p <= 0.05 .
#>
#> [Success] OmicSignature object Myc_reduce_mice_liver_24m created.
OmS1
#> Signature Object:
#> Metadata:
#> adj_p_cutoff = 0.05
#> assay_type = transcriptomics
#> covariates = none
#> description = mice Myc haploinsufficient (Myc(+/-))
#> direction_type = bi-directional
#> keywords = Myc, KO, longevity
#> organism = Mus musculus
#> others = C57BL/6
#> phenotype = Myc_reduce
#> platform = transcriptomics by array
#> PMID = 25619689
#> sample_type = liver
#> score_cutoff = 5
#> signature_name = Myc_reduce_mice_liver_24m
#> year = 2015
#> Metadata user defined fields:
#> animal_strain = C57BL/6
#> Signature:
#> Length (15)
#> Class (character)
#> Mode (character)
#> Differential Expression Data:
#> 884 x 10