Create OmicSignature
Vanessa Mengze Li
10/24/2024
devtools::load_all(".")
library(dplyr)
1. Cheatsheet
An OmicSignature object contains three parts:
metadata, a list.
Required fields:
“signature_name”, “organism”, “direction_type”, “assay_type”, “phenotype”.
Recommended optional fields, if applicable:
“platform”, “sample_type”, “description”, “covariates”, “score_cutoff”, “adj_p_cutoff”.
signature, a data frame.
Required columns:
“probe_id” (unique identifier for each feature)
“feature_name” (e.g. ENSEMBL ID, Uniprot ID)
Required for bi-directional and categorical signatures: “direction”
Recommended optional column, if applicable: “score”
difexp (optional), a data frame of differential expression analysis results.
Required columns:
“probe_id” (unique identifier for each feature)
“feature_name” (e.g. ENSEMBL ID, Uniprot ID)
“score”
At least one of the following: “p_value”, “q_value”, or “adj_p”.
Recommended optional column, if applicable: “gene_symbol”
Create the object:
OmS <- OmicSignature$new(
  metadata = metadata,
  signature = signature,
  difexp = difexp
)
2. Create an OmicSignature Object Step-by-Step
The example below comes from an experiment on reduced Myc gene
expression in mice. Signatures were extracted by comparing liver samples
from treatment and control mice at 24 months of age. To save space, only
the results for 1000 genes are included.
This is a bi-directional signature example, which contains up- and
down-regulated features (genes).
2.1. Metadata
A list with the following required fields:
“signature_name”, “organism”,
“direction_type”, “assay_type”,
“phenotype”.
Not required, but highly recommended fields:
“platform”, “sample_type”,
“description”,
“covariates”, “score_cutoff”,
“adj_p_cutoff”, “logfc_cutoff”.
“phenotype” is a one- or two-word description of a topic, such as a
drug treatment, a gene knockout, or a research area, for example
“longevity” or “breast cancer”.
Providing a detailed “description” is highly recommended. For instance,
it may include information about how the treatment was administered and
how each group was defined.
Option 1: Create metadata by hand. This is not recommended, because
typos can occur.
metadata <- list(
  "signature_name" = "Myc_reduce_mice_liver_24m",
  "organism" = "Mus Musculus",
  "sample_type" = "liver",
  "phenotype" = "Myc_reduce",
  "direction_type" = "bi-directional",
  "assay_type" = "transcriptomics",
  "platform" = "GPL6246"
)
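Since the concern with hand-written lists is typos, a quick check against the required field names listed above can catch a misspelled field before the object is built. The sketch below uses only base R; `required_fields` and `missing_fields()` are illustrative names, not part of OmicSignature.

```r
# Required metadata fields, as listed in section 2.1.
required_fields <- c("signature_name", "organism", "direction_type",
                     "assay_type", "phenotype")

# Hypothetical helper: return the required fields that are absent or NA.
missing_fields <- function(metadata, required = required_fields) {
  present <- vapply(
    required,
    function(f) !is.null(metadata[[f]]) && !all(is.na(metadata[[f]])),
    logical(1)
  )
  required[!present]
}

# A list with a typo in one field name ("organsim" instead of "organism").
metadata_typo <- list(
  signature_name = "Myc_reduce_mice_liver_24m",
  organsim = "Mus Musculus",
  direction_type = "bi-directional",
  assay_type = "transcriptomics",
  phenotype = "Myc_reduce"
)

missing_fields(metadata_typo)
#> [1] "organism"
```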
Option 2: Use createMetadata() (recommended).
This function helps remind you of the built-in attributes. The full list
of current built-in attributes is shown here.
You can also provide your own customized attributes using the “others”
field.
metadata <- createMetadata(
  # required attributes:
  signature_name = "Myc_reduce_mice_liver_24m",
  organism = "Mus Musculus",
  direction_type = "bi-directional",
  assay_type = "transcriptomics",
  phenotype = "Myc_reduce",
  # optional and recommended:
  covariates = "none",
  description = "mice MYC reduced expression",
  platform = "GPL6246", # use GEO platform ID
  sample_type = "liver", # use BRENDA ontology
  # optional cut-off attributes.
  # specifying them can facilitate the extraction of signatures.
  logfc_cutoff = NULL,
  p_value_cutoff = NULL,
  adj_p_cutoff = 0.05,
  score_cutoff = 5,
  # other optional built-in attributes:
  keywords = c("Myc", "KO", "longevity"),
  cutoff_description = NULL,
  author = NULL,
  PMID = 25619689,
  year = 2015,
  # example of customized attributes:
  others = list("animal_strain" = "C57BL/6")
)
2.1.1 “sample_type” and “platform”
“sample_type” should be a BRENDA ontology term, and “platform” should
be a GEO platform accession ID.
See how to search for the correct term to use in “Sample
Type & Platform Info”.
2.1.2 “direction_type”
direction_type must be one of the following:
“uni-directional”. Only a list of significant feature names is available. Examples include “genes mutated in a disease” and “markers of a cell type”.
“bi-directional”. Significant features can be grouped into “up” and “down” categories. For example, when comparing treatment vs. control groups, some features will be higher (“up”, or “+”) and some will be lower (“down” or “-”) in the treatment group. When the phenotype is a continuous trait, such as age, some features will increase (“up”, or “+”) with age, while others will decrease (“down”, or “-”) with age.
“categorical”. Used with multi-valued categorical phenotypes (e.g., “A” vs. “B” vs. “C”), usually analyzed by ANOVA.
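Because the three values must be spelled exactly, it can help to validate the string before using it. The following is a minimal base-R sketch; `direction_types` and `validate_direction_type()` are hypothetical names introduced here for illustration and are not OmicSignature functions.

```r
# Allowed values, as listed above.
direction_types <- c("uni-directional", "bi-directional", "categorical")

# Hypothetical helper: return the value unchanged if it is allowed,
# otherwise stop with a message listing the allowed values.
validate_direction_type <- function(direction_type) {
  if (length(direction_type) != 1 || !direction_type %in% direction_types) {
    stop("direction_type must be one of: ",
         paste(direction_types, collapse = ", "))
  }
  direction_type
}

validate_direction_type("bi-directional")

# A misspelling such as "bidirectional" is rejected.
result <- try(validate_direction_type("bidirectional"), silent = TRUE)
inherits(result, "try-error")
```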
2.2. Differential expression analysis results (difexp)
A differential expression dataframe is optional but
highly recommended if available. It facilitates
downstream signature extraction.
difexp is a dataframe with the following required columns:
“probe_id”, “feature_name”, “score”, along with at least one of the
following: “p_value”, “q_value”, or “adj_p”.
“probe_id” is a unique identifier for each feature.
If not provided, it will be automatically generated. Common examples
include unique numbers and probe IDs.
“feature_name” is a name that identifies each feature. Examples
include ENSEMBL IDs, UniProt IDs, and RefMet IDs. To better identify the
features, it is recommended to add one or more annotation columns,
e.g., “gene_symbol”. For metabolite features, it is recommended to
include multiple annotation columns if available, e.g., HMDB ID and
InChI key.
“p_value”, “q_value”, and “adj_p” refer to the p- or q-value
representing the significance of each feature.
“score” is a numeric value that indicates the importance or significance
of a feature. Common examples include t-test statistics and
Z-scores.
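The column requirements above can be checked before building the object. This is a minimal base-R sketch, not the package's own validation; `check_difexp_columns()` is a hypothetical helper. Because the text says “probe_id” is generated automatically when absent, the sketch only checks its uniqueness when the column is present.

```r
# Hypothetical helper: return a character vector describing problems,
# or character(0) when the data frame meets the stated requirements.
check_difexp_columns <- function(difexp) {
  required <- c("feature_name", "score")
  significance <- c("p_value", "q_value", "adj_p")
  problems <- character(0)

  missing_required <- setdiff(required, colnames(difexp))
  if (length(missing_required) > 0) {
    problems <- c(problems, paste("missing column:", missing_required))
  }
  if (!any(significance %in% colnames(difexp))) {
    problems <- c(problems, "need at least one of: p_value, q_value, adj_p")
  }
  if ("probe_id" %in% colnames(difexp) && anyDuplicated(difexp$probe_id) > 0) {
    problems <- c(problems, "probe_id values are not unique")
  }
  if ("score" %in% colnames(difexp) && !is.numeric(difexp$score)) {
    problems <- c(problems, "score is not numeric")
  }
  problems
}

# Toy data: no significance column and a repeated probe_id.
toy_difexp <- data.frame(
  probe_id = c(1, 2, 2),
  feature_name = c("gene1", "gene2", "gene3"),
  score = c(1.2, -0.4, 3.3)
)
check_difexp_columns(toy_difexp)
```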
Here we use example output from a differential expression analysis
performed with the limma package.
# difexp <- read.table(file.path(system.file("extdata", package = "OmicSignature"), "difmatrix_Myc_mice_liver_24m.txt"),
# header = TRUE, sep = "\t", stringsAsFactors = FALSE
# )
difexp <- readRDS(file.path(system.file("extdata", package = "OmicSignature"), "difmatrix_Myc_mice_liver_24m.rds"))
head(difexp)
#> Probe.ID logFC AveExpr t P.Value adj.P.Val b ensembl
#> 1 10345228 -0.167 7.106 -1.470 0.186 0.560 -5.866 ENSMUSG00000103746
#> 2 10354534 0.041 4.351 0.520 0.620 0.870 -6.780 ENSMUSG00000060715
#> 3 10354529 -0.175 4.955 -0.941 0.379 0.731 -6.458 ENSMUSG00000043629
#> 4 10346337 0.025 8.621 0.188 0.857 0.962 -6.911 ENSMUSG00000038323
#> 5 10353792 -0.025 6.063 -0.284 0.785 0.936 -6.885 ENSMUSG00000045815
#> 6 10350848 -0.055 7.595 -0.630 0.549 0.836 -6.712 ENSMUSG00000049881
#> gene_symbol
#> 1 1700001G17Rik
#> 2 1700019A02Rik
#> 3 1700019D03Rik
#> 4 1700066M21Rik
#> 5 1700101I19Rik
#> 6 2810025M15Rik
Manually change the column names to match the requirements. The
built-in function replaceDifexpCol() is designed to replace some
frequently used alternative column names.
difexp <- difexp %>% rename(feature_name = ensembl)
colnames(difexp) <- replaceDifexpCol(colnames(difexp))
head(difexp)
#> probe_id logfc mean score p_value adj_p b feature_name
#> 1 10345228 -0.167 7.106 -1.470 0.186 0.560 -5.866 ENSMUSG00000103746
#> 2 10354534 0.041 4.351 0.520 0.620 0.870 -6.780 ENSMUSG00000060715
#> 3 10354529 -0.175 4.955 -0.941 0.379 0.731 -6.458 ENSMUSG00000043629
#> 4 10346337 0.025 8.621 0.188 0.857 0.962 -6.911 ENSMUSG00000038323
#> 5 10353792 -0.025 6.063 -0.284 0.785 0.936 -6.885 ENSMUSG00000045815
#> 6 10350848 -0.055 7.595 -0.630 0.549 0.836 -6.712 ENSMUSG00000049881
#> gene_symbol
#> 1 1700001G17Rik
#> 2 1700019A02Rik
#> 3 1700019D03Rik
#> 4 1700066M21Rik
#> 5 1700101I19Rik
#> 6 2810025M15Rik
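If you would rather make the renaming explicit, the same result can be written as a lookup table. The mapping below is read off the before/after column names shown above (including the “ensembl” to “feature_name” rename done with dplyr); it is a sketch for this example only and does not describe the full set of names that replaceDifexpCol() recognizes. `col_map` and `rename_columns()` are illustrative names.

```r
# Old name -> new name, taken from the two head(difexp) outputs above.
col_map <- c(
  "Probe.ID"  = "probe_id",
  "logFC"     = "logfc",
  "AveExpr"   = "mean",
  "t"         = "score",
  "P.Value"   = "p_value",
  "adj.P.Val" = "adj_p",
  "ensembl"   = "feature_name"
)

# Hypothetical helper: replace names found in the map, keep the rest.
rename_columns <- function(old_names, map = col_map) {
  hit <- old_names %in% names(map)
  old_names[hit] <- unname(map[old_names[hit]])
  old_names
}

original_names <- c("Probe.ID", "logFC", "AveExpr", "t", "P.Value",
                    "adj.P.Val", "b", "ensembl", "gene_symbol")
rename_columns(original_names)
```

Columns that are not in the map, such as “b” and “gene_symbol”, pass through unchanged.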
2.3. Signature
signature is a dataframe with columns “probe_id” and “feature_name”.
If the signature is “bi-directional” or “categorical” (specified in
direction_type in metadata), then the column “direction” is also required.
An optional column “score” is recommended when applicable.
Option 1: Extract signature from difexp.
In this example, we create a bi-directional signature from the difexp
using the score_cutoff and adj_p_cutoff specified in the metadata.
signature <- difexp %>%
  dplyr::filter(abs(score) > metadata$score_cutoff & adj_p < metadata$adj_p_cutoff) %>%
  dplyr::select(probe_id, feature_name, score) %>%
  dplyr::mutate(direction = ifelse(score > 0, "+", "-"))
head(signature)
#> probe_id feature_name score direction
#> 1 10346882 ENSMUSG00000025964 -6.990 -
#> 2 10353878 ENSMUSG00000067653 -7.867 -
#> 3 10349648 ENSMUSG00000004552 14.762 +
#> 4 10355278 ENSMUSG00000062209 6.083 +
#> 5 10353192 ENSMUSG00000025932 10.487 +
#> 6 10345762 ENSMUSG00000026072 -13.543 -
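For readers who do not use dplyr, the same filter, select, and mutate steps can be written in base R. The sketch below runs on a small made-up data frame (`toy_difexp`, with invented values) so that each step is easy to follow; the cutoffs match the ones set in the metadata above.

```r
# Toy differential expression table (values are made up for illustration).
toy_difexp <- data.frame(
  probe_id = 1:5,
  feature_name = paste0("gene", 1:5),
  score = c(6.2, -7.5, 1.1, -5.4, 9.0),
  adj_p = c(0.01, 0.001, 0.80, 0.20, 0.03),
  stringsAsFactors = FALSE
)

score_cutoff <- 5
adj_p_cutoff <- 0.05

# Keep features that pass both cutoffs (strict comparisons, as above).
keep <- abs(toy_difexp$score) > score_cutoff & toy_difexp$adj_p < adj_p_cutoff
toy_signature <- toy_difexp[keep, c("probe_id", "feature_name", "score")]

# Positive scores are labelled "+", negative scores "-".
toy_signature$direction <- ifelse(toy_signature$score > 0, "+", "-")
toy_signature
```

Here gene3 fails the score cutoff and gene4 fails the adjusted p-value cutoff, so three features remain.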
The function standardizeSigDF() can help remove duplicated and empty
names.
signature <- standardizeSigDF(signature)
head(signature)
#> probe_id feature_name score direction
#> 1 10349648 ENSMUSG00000004552 14.762 +
#> 2 10345762 ENSMUSG00000026072 -13.543 -
#> 3 10353192 ENSMUSG00000025932 10.487 +
#> 4 10355259 ENSMUSG00000061816 -10.315 -
#> 5 10351477 ENSMUSG00000102418 8.818 +
#> 6 10353878 ENSMUSG00000067653 -7.867 -
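To give a feel for what this kind of cleanup involves, here is a rough base-R approximation. It is not the implementation of standardizeSigDF(): the ordering by absolute score is inferred from the output above, and keeping the entry with the largest absolute score among duplicates is an assumption made for this sketch. `clean_signature()` is a hypothetical name.

```r
# Hypothetical helper: drop empty names, order by |score|, drop duplicates.
clean_signature <- function(sig) {
  # Remove rows whose feature name is missing or empty.
  sig <- sig[!is.na(sig$feature_name) & sig$feature_name != "", , drop = FALSE]
  # Order by absolute score, largest first.
  sig <- sig[order(abs(sig$score), decreasing = TRUE), , drop = FALSE]
  # Keep the first (largest |score|) row for each feature name (assumption).
  sig <- sig[!duplicated(sig$feature_name), , drop = FALSE]
  rownames(sig) <- NULL
  sig
}

toy_signature <- data.frame(
  feature_name = c("gene1", "", "gene2", "gene1", NA),
  score = c(2, 9, -3, -5, 1),
  stringsAsFactors = FALSE
)
clean_signature(toy_signature)
```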
Option 2: Manually create signature dataframe.
For uni-directional signature:
signature <- data.frame("probe_id" = c(1, 2, 3), "feature_name" = c("gene1", "gene2", "gene3"))
For bi-directional signature:
signature <- data.frame(
  "probe_id" = c(1, 2, 3),
  "feature_name" = c("gene1", "gene2", "gene3"),
  "score" = c(0.45, -3.21, 2.44),
  "direction" = c("+", "-", "+")
)
For categorical signature:
signature <- data.frame(
  "probe_id" = c(1, 2, 3, 4),
  "feature_name" = c("gene1", "gene2", "gene3", "gene4"),
  "score" = c(0.45, -3.21, 2.44, -2.45),
  "direction" = c("group1", "group1", "group2", "group3")
)
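When building a signature by hand, it is easy to forget the “direction” column. The sketch below checks the column rules stated at the start of section 2.3 using base R; `check_signature_columns()` is a hypothetical helper, not a package function.

```r
# Hypothetical helper: return the required columns that are missing,
# given the direction_type declared in the metadata.
check_signature_columns <- function(signature, direction_type) {
  required <- c("probe_id", "feature_name")
  if (direction_type %in% c("bi-directional", "categorical")) {
    required <- c(required, "direction")
  }
  setdiff(required, colnames(signature))
}

uni_signature <- data.frame(
  probe_id = c(1, 2, 3),
  feature_name = c("gene1", "gene2", "gene3")
)

# Fine as a uni-directional signature: nothing is missing.
check_signature_columns(uni_signature, "uni-directional")

# The same data frame lacks "direction" if declared bi-directional.
check_signature_columns(uni_signature, "bi-directional")
#> [1] "direction"
```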
2.4. Create the OmicSignature object
OmS <- OmicSignature$new(
  metadata = metadata,
  signature = signature,
  difexp = difexp
)
#> [Success] OmicSignature object Myc_reduce_mice_liver_24m created.
Set print_message = TRUE to see the messages.
OmS <- OmicSignature$new(
  metadata = metadata,
  signature = signature,
  difexp = difexp,
  print_message = TRUE
)
#> --Required attributes for metadata: signature_name, phenotype, organism, direction_type, assay_type --
#> [Success] Metadata is saved.
#> [Success] Signature is valid.
#> [Success] difexp is valid.
#> [Success] OmicSignature object Myc_reduce_mice_liver_24m created.
See the created object information:
print(OmS)
#> Signature Object:
#> Metadata:
#> adj_p_cutoff = 0.05
#> assay_type = transcriptomics
#> covariates = none
#> description = mice MYC reduced expression
#> direction_type = bi-directional
#> keywords = Myc, KO, longevity
#> organism = Mus Musculus
#> others = C57BL/6
#> phenotype = Myc_reduce
#> platform = GPL6246
#> PMID = 25619689
#> sample_type = liver
#> score_cutoff = 5
#> signature_name = Myc_reduce_mice_liver_24m
#> year = 2015
#> Metadata user defined fields:
#> animal_strain = C57BL/6
#> Signature:
#> Length (15)
#> Class (character)
#> Mode (character)
#> Differential Expression Data:
#> 884 x 9
Use new criteria to extract significant features
(this does not change the signature saved in the object):
OmS$extractSignature("abs(score) > 10; adj_p < 0.01")
#> probe_id feature_name score direction
#> 1 10349648 ENSMUSG00000004552 14.762 +
#> 2 10345762 ENSMUSG00000026072 -13.543 -
#> 3 10353192 ENSMUSG00000025932 10.487 +
#> 4 10355259 ENSMUSG00000061816 -10.315 -
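The criteria string above combines conditions separated by semicolons. One way such a string could be applied to a data frame is sketched below in base R; this is only an illustration of the idea, not how extractSignature() is implemented, and `apply_criteria()` is a hypothetical name. Note that evaluating text with eval(parse()) should only be done with strings you trust.

```r
# Hypothetical helper: keep rows that satisfy every ";"-separated condition.
apply_criteria <- function(df, criteria) {
  conditions <- trimws(strsplit(criteria, ";", fixed = TRUE)[[1]])
  conditions <- conditions[nzchar(conditions)]
  keep <- rep(TRUE, nrow(df))
  for (cond in conditions) {
    # Evaluate the condition with the data frame columns as variables.
    result <- eval(parse(text = cond), envir = df, enclos = baseenv())
    keep <- keep & !is.na(result) & result
  }
  df[keep, , drop = FALSE]
}

# Toy data (values are made up for illustration).
toy_difexp <- data.frame(
  feature_name = c("gene1", "gene2", "gene3", "gene4"),
  score = c(14.8, -13.5, 10.5, 6.1),
  adj_p = c(0.001, 0.002, 0.02, 0.001),
  stringsAsFactors = FALSE
)

apply_criteria(toy_difexp, "abs(score) > 10; adj_p < 0.01")
```

In this toy table gene3 fails the adjusted p-value condition and gene4 fails the score condition, so gene1 and gene2 are returned.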
Besides saving and reading an OmicSignature object in .rds format, you
can export the object as a JSON text file.
saveRDS(OmS, "Myc_reduce_mice_liver_24m_OmS.rds")
writeJson(OmS, "Myc_reduce_mice_liver_24m_OmS.json")
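The code above shows saving only. Reading an .rds file back uses base R's readRDS(). The round trip is sketched below with a plain list standing in for the OmicSignature object and a temporary file path, so that it runs without the package; restoring a real OmicSignature object presumably also requires the package to be loaded, which is an assumption here.

```r
# A plain list used as a stand-in for an OmicSignature object.
toy_object <- list(
  metadata = list(signature_name = "toy_signature"),
  signature = data.frame(
    probe_id = c(1, 2, 3),
    feature_name = c("gene1", "gene2", "gene3"),
    stringsAsFactors = FALSE
  )
)

# Write to a temporary file, then read it back.
rds_path <- file.path(tempdir(), "toy_signature.rds")
saveRDS(toy_object, rds_path)
restored <- readRDS(rds_path)

# Compare the restored object with the original.
all.equal(restored, toy_object)
```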
See more in the “Functionalities of OmicSignature” section.
3. Create an OmicSignature from difexp and metadata
Once you are familiar with the workflow, you can skip the manual
signature-extraction step. Provide cutoffs (e.g. adj_p_cutoff and
score_cutoff) in the metadata, make sure difexp has “adj_p” and “score”
columns, and use OmicSigFromDifexp() to extract significant features
automatically and create the OmicSignature object.
OmS1 <- OmicSigFromDifexp(difexp, metadata)
#> -- criterias used to extract signatures: abs(score) >= 5; adj_p <= 0.05 .
#>
#> [Success] OmicSignature object Myc_reduce_mice_liver_24m created.
OmS1
#> Signature Object:
#> Metadata:
#> adj_p_cutoff = 0.05
#> assay_type = transcriptomics
#> covariates = none
#> description = mice MYC reduced expression
#> direction_type = bi-directional
#> keywords = Myc, KO, longevity
#> organism = Mus Musculus
#> others = C57BL/6
#> phenotype = Myc_reduce
#> platform = GPL6246
#> PMID = 25619689
#> sample_type = liver
#> score_cutoff = 5
#> signature_name = Myc_reduce_mice_liver_24m
#> year = 2015
#> Metadata user defined fields:
#> animal_strain = C57BL/6
#> Signature:
#> Length (15)
#> Class (character)
#> Mode (character)
#> Differential Expression Data:
#> 884 x 10
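The message printed by OmicSigFromDifexp() reports inclusive comparisons (`>=` and `<=`), whereas the manual extraction in section 2.3 used strict ones (`>` and `<`). The two differ only for features that sit exactly on a cutoff, as this small base-R sketch with made-up values shows.

```r
# Toy data: gene1 sits exactly on both cutoffs.
toy_difexp <- data.frame(
  feature_name = c("gene1", "gene2", "gene3"),
  score = c(5, -6, 4),
  adj_p = c(0.05, 0.01, 0.001),
  stringsAsFactors = FALSE
)

score_cutoff <- 5
adj_p_cutoff <- 0.05

# Strict comparisons, as in the manual extraction of section 2.3.
strict <- abs(toy_difexp$score) > score_cutoff & toy_difexp$adj_p < adj_p_cutoff

# Inclusive comparisons, as reported in the message above.
inclusive <- abs(toy_difexp$score) >= score_cutoff & toy_difexp$adj_p <= adj_p_cutoff

toy_difexp$feature_name[strict]
#> [1] "gene2"
toy_difexp$feature_name[inclusive]
#> [1] "gene1" "gene2"
```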