Skip to contents
devtools::load_all(".")
library(dplyr)

1. Cheatsheet

An OmicSignature object contains three parts:

  • metadata, a list.
    Required fields:
    signature_name”, “organism”, “direction_type”, “assay_type”, “phenotype”.
    Recommended optional fields, if applicable:
    “platform”, “sample_type”, “description”, “covariates”, “score_cutoff”, “adj_p_cutoff”.

  • signature, a data frame.
    Required columns:
    probe_id” (unique identifier for each feature)
    feature_name” (e.g. ENSEMBL ID, Uniprot ID)
    Required for bi-directional and categorical signatures: “group_label
    Recommended optional column, if applicable: “score”

  • difexp (optional), a data frame of differential expression analysis results.
    Required columns:
    probe_id” (unique identifier for each feature)
    feature_name” (e.g. ENSEMBL ID, Uniprot ID)
    score
    Required for bi-directional and categorical signatures: “group_label
    At least one of the following: “p_value”, “q_value”, or “adj_p”.
    Recommended optional column, if applicable: “gene_symbol”

Create the object:

OmS <- OmicSignature$new(
  metadata = metadata,
  signature = signature,
  difexp = difexp
)

2. Create an OmicSignature Object Step-by-Step

The example provided below is from an experiment evaluating the effects of Myc reduced expression by comparing liver profiles of 24-month old Myc+/+^{+/+}vs. Myc+/^{+/-} mice. This is a bi-directional signature example, since it is a comparison between two groups, and contains up and down regulated features (genes). For ease of exposition, only the top 1000 genes are here included.

2.1. Metadata

A list with the following required fields:
signature_name”, “organism”, “direction_type”, “assay_type”, “phenotype”.
Not required, but highly recommended fields:
platform”, “sample_type”, “description”, “covariates”, “score_cutoff”, “adj_p_cutoff”, “logfc_cutoff”.

Option 1: Create metadata by hand. This is not recommended, because typos can occur.

metadata <- list(
  "signature_name" = Myc_reduce_mice_liver_24m,
  "organism" = "Mus musculus",
  "sample_type" = "liver",
  "phenotype" = "Myc_reduce",
  "direction_type" = "bi-directional",
  "assay_type" = "transcriptomics", 
  "platform" = "transcriptomics by array"
)

Option 2: Use createMetadata() (recommended).
This function helps remind you of the built-in attributes. The full list of current built-in attributes is shown here.
You can also provide your own customized attributes using the “others” field.

metadata <- createMetadata(
  # required attributes:
  signature_name = "Myc_reduce_mice_liver_24m",
  organism = "Mus musculus",
  direction_type = "bi-directional",
  assay_type = "transcriptomics",
  phenotype = "Myc_reduce",

  # optional and recommended:
  covariates = "none",
  description = "mice Myc haploinsufficient (Myc(+/-))",
  platform = "transcriptomics by array",
  sample_type = "liver", # use BRENDA ontology

  # optional cut-off attributes.
  # specifying them can facilitate the extraction of signatures.
  logfc_cutoff = NULL,
  p_value_cutoff = NULL,
  adj_p_cutoff = 0.05,
  score_cutoff = 5,

  # other optional built-in attributes:
  keywords = c("Myc", "KO", "longevity"),
  cutoff_description = NULL,
  author = NULL,
  PMID = 25619689,
  year = 2015,

  # example of customized attributes:
  others = list("animal_strain" = "C57BL/6")
)

2.1.1 “phenotype”

“‘phenotype’ is free text, usually a one- or two-word description of the experimental condition or trait under study, such as a drug treatment, a gene knockout, or a clinical characteristic. Examples include age, breast cancer, and drug X.

Providing a detailed (free text) “description” is highly recommended. For instance, it may include information about how the treatment was administered and how each group was defined.

2.1.2 “organism

“organism” is free text. We provide a list of common organisms, which you can search using searchOrganism. Other entries are allowed, but please use standard naming conventions (e.g., “Homo sapiens”, “Mus musculus”) to ensure consistency.

searchOrganism("homo")
#> [1] "Homo sapiens"

2.1.3 “sample_type” and “platform”

While both “sample_type” and “platform” are free text, it is recommended that predefined terms are used. Thus, “sample_type” should be a BRENDA ontology term, and “platform” should be one of a set of predefined platforms, if appropriate. You can search for predefined terms via the searchSampleType and searchPlatform functions. See “Sample Type & Platform Info” for details.

2.1.4 “direction_type”

direction_type must be one of the following:

  • “uni-directional”. Only a list of significant feature names is available. Examples including “genes mutated in a disease” and “markers of a specific cell type”.

  • “bi-directional”. Significant features are derived from comparison of two groups, or a single continuous trait. Thus, the resulting significant features can be grouped into two categories. For example, when comparing treatment vs. control groups, some features will be higher and some will be lower in the treatment group. When the phenotype is a continuous trait, such as age, some features will increase with age, while others will decrease with age.

  • “categorical”. Used with multi-valued categorical phenotypes (e.g., “A” vs. “B” vs. “C”), usually analyzed by ANOVA.

2.1.5 “assay_type”

assay_type is one of the following:
- “transcriptomics” (e.g. RNA-seq, micro-array)
- “proteomics”
- “metabolomics”
- “methylomics”
- “genetic_variations” (e.g. SNP, GWAS)
- “DNA_binding_sites” (e.g. ChIP-seq)
- “other”

2.2. Signature

signature is a dataframe with the columns “probe_id” and “feature_name”. If the signature is bi-directional or categorical (as specified in direction_type within metadata), an additional column, “group_label”, is also required.
An optional column “score” is highly recommended when applicable.

probe_id” is a unique identifier, usually a platform-specific identifider (e.g., probe IDs for Affymetrix microarrays, or aptamer IDs for SomaScan assays). If not provided, it will be automatically generated.
feature_name” is a name that identifies each feature, examples include ENSEMBL IDs for transcripts, UniProt IDs for proteins, and Refmet IDs for metabolites. To better identify the features, it is recommended to add an additional annotation column(s), e.g., “gene_symbol”. For metabolite features, it is recommended to include multiple annotation columns if available, e.g., HMDB ID and InChI key.
“group_label” is a factor column. This column indicates the experimental group in which a feature is more highly expressed or more significant. For example, if the analysis identifies genes differentially expressed in Treatment v.s. Control, all gene features with a positive score should be labeled “Treatment” in this column, indicating their higher expression in the Treatment group. Features with negative scores should be labeled “Control”. Similarly, if the analysis concerns protein expression changes associated with BMI, protein features with positive scores should be labeled “higher BMI”, while those with negative scores should be labeled “lower BMI”.

Option 1: Extract signature from a differential analysis results table.
In this example, we extract a bi-directional signature from a difexp object (see next section) using the score_cutoff and adj_p_cutoff specified in the metadata. A difexp object is a properly formatted data frame reporting the results of a differential analysis (e.g., by Limma).

#>   probe_id  logfc  mean  score p_value adj_p      b       feature_name
#> 1 10345228 -0.167 7.106 -1.470   0.186 0.560 -5.866 ENSMUSG00000103746
#> 2 10354534  0.041 4.351  0.520   0.620 0.870 -6.780 ENSMUSG00000060715
#> 3 10354529 -0.175 4.955 -0.941   0.379 0.731 -6.458 ENSMUSG00000043629
#> 4 10346337  0.025 8.621  0.188   0.857 0.962 -6.911 ENSMUSG00000038323
#> 5 10353792 -0.025 6.063 -0.284   0.785 0.936 -6.885 ENSMUSG00000045815
#> 6 10350848 -0.055 7.595 -0.630   0.549 0.836 -6.712 ENSMUSG00000049881

We can then extract the significant features as follows:

signature <- difexp %>%
  dplyr::filter(abs(score) > metadata$score_cutoff & adj_p < metadata$adj_p_cutoff) %>%
  dplyr::select(probe_id, feature_name, score) %>%
  dplyr::mutate(group_label = as.factor(ifelse(score > 0, "MYC Reduce", "WT")))
head(signature)
#>   probe_id       feature_name   score group_label
#> 1 10346882 ENSMUSG00000025964  -6.990          WT
#> 2 10353878 ENSMUSG00000067653  -7.867          WT
#> 3 10349648 ENSMUSG00000004552  14.762  MYC Reduce
#> 4 10355278 ENSMUSG00000062209   6.083  MYC Reduce
#> 5 10353192 ENSMUSG00000025932  10.487  MYC Reduce
#> 6 10345762 ENSMUSG00000026072 -13.543          WT

Function standardizeSigDF() can help remove duplicated and empty names.

signature <- standardizeSigDF(signature)
head(signature)
#>   probe_id       feature_name   score group_label
#> 1 10349648 ENSMUSG00000004552  14.762  MYC Reduce
#> 2 10345762 ENSMUSG00000026072 -13.543          WT
#> 3 10353192 ENSMUSG00000025932  10.487  MYC Reduce
#> 4 10355259 ENSMUSG00000061816 -10.315          WT
#> 5 10351477 ENSMUSG00000102418   8.818  MYC Reduce
#> 6 10353878 ENSMUSG00000067653  -7.867          WT

Option 2: Manually create signature dataframe.
For uni-directional signature:

signature <- data.frame("probe_id" = c(1, 2, 3), "feature_name" = c("gene1", "gene2", "gene3"))

For bi-directional signature:

signature <- data.frame(
  "probe_id" = c(1, 2, 3),
  "feature_name" = c("gene1", "gene2", "gene3"),
  "score" = c(0.45, -3.21, 2.44),
  "group_label" = c("Treatment", "Control", "Treatment")
)

For multi-categorical signature:

signature <- data.frame(
  "probe_id" = c(1, 2, 3, 4),
  "feature_name" = c("gene1", "gene2", "gene3", "gene4"),
  "score" = c(0.45, -3.21, 2.44, -2.45),
  "group_label" = c("group1", "group1", "group2", "group3")
)

2.3. Differential expression analysis results (difexp)

A differential expression dataframe is optional but recommended if available. It facilitates downstream signature extraction, and signature comparison by rank-based test.

difexp is a dataframe with the following required columns:
probe_id”, “feature_name”, “score”, along with at least one of the following: “p_value”, “q_value”, or “adj_p”. Same as in the signature dataframe, “group_label” is also required when the signature is bi-directional or categorical.
Descriptions of probe_id, feature_name, and group_label were provided in the signature section above.
p_value”, “q_value”, or “adj_p” refers to the p- or q-value representing the significance of each feature.
score” is a numeric value that indicates the importance or significance of a feature. Depending on how the signature was derived, this can be the t-statistics, log-fold change, Z-score, or other summary statistics.

Here we use an example from the differential expression analysis using the limma package.

# Version reading from a txt file
# difexp <- read.table(
#   file.path(
#     system.file("extdata", package = "OmicSignature"),
#     "difmatrix_Myc_mice_liver_24m.txt"
#   ),
#   header = TRUE, sep = "\t", stringsAsFactors = FALSE
# )
# Version reading from a binary file
difexp <- readRDS(
  file.path(
    system.file("extdata", package = "OmicSignature"),
    "difmatrix_Myc_mice_liver_24m.rds"
  )
)
head(difexp)
#>   Probe.ID  logFC AveExpr      t P.Value adj.P.Val      b            ensembl
#> 1 10345228 -0.167   7.106 -1.470   0.186     0.560 -5.866 ENSMUSG00000103746
#> 2 10354534  0.041   4.351  0.520   0.620     0.870 -6.780 ENSMUSG00000060715
#> 3 10354529 -0.175   4.955 -0.941   0.379     0.731 -6.458 ENSMUSG00000043629
#> 4 10346337  0.025   8.621  0.188   0.857     0.962 -6.911 ENSMUSG00000038323
#> 5 10353792 -0.025   6.063 -0.284   0.785     0.936 -6.885 ENSMUSG00000045815
#> 6 10350848 -0.055   7.595 -0.630   0.549     0.836 -6.712 ENSMUSG00000049881
#>     gene_symbol
#> 1 1700001G17Rik
#> 2 1700019A02Rik
#> 3 1700019D03Rik
#> 4 1700066M21Rik
#> 5 1700101I19Rik
#> 6 2810025M15Rik

Manually change the column names to match the requirement. The built-in function replaceDifexpCol() is designed to replace some of the frequently used alternative column names.

colnames(difexp) <- replaceDifexpCol(colnames(difexp))
#> Warning in replaceDifexpCol(colnames(difexp)): Required column for
#> OmicSignature object difexp: feature_name, is not found in your input. This may
#> cause problem when creating your OmicSignature object.
difexp <- difexp %>%
  rename(feature_name = ensembl) %>%
  mutate(group_label = as.factor(ifelse(score > 0, "MYC Reduce", "WT")))
head(difexp)
#>   probe_id  logfc  mean  score p_value adj_p      b       feature_name
#> 1 10345228 -0.167 7.106 -1.470   0.186 0.560 -5.866 ENSMUSG00000103746
#> 2 10354534  0.041 4.351  0.520   0.620 0.870 -6.780 ENSMUSG00000060715
#> 3 10354529 -0.175 4.955 -0.941   0.379 0.731 -6.458 ENSMUSG00000043629
#> 4 10346337  0.025 8.621  0.188   0.857 0.962 -6.911 ENSMUSG00000038323
#> 5 10353792 -0.025 6.063 -0.284   0.785 0.936 -6.885 ENSMUSG00000045815
#> 6 10350848 -0.055 7.595 -0.630   0.549 0.836 -6.712 ENSMUSG00000049881
#>     gene_symbol group_label
#> 1 1700001G17Rik          WT
#> 2 1700019A02Rik  MYC Reduce
#> 3 1700019D03Rik          WT
#> 4 1700066M21Rik  MYC Reduce
#> 5 1700101I19Rik          WT
#> 6 2810025M15Rik          WT

2.4. Create the OmicSignature object

OmS <- OmicSignature$new(
  metadata = metadata,
  signature = signature,
  difexp = difexp
)
#>   [Success] OmicSignature object Myc_reduce_mice_liver_24m created.

Set print_message = TRUE to see the messages.

OmS <- OmicSignature$new(
  metadata = metadata,
  signature = signature,
  difexp = difexp,
  print_message = TRUE
)
#>   --Required attributes for metadata: signature_name, phenotype, organism, direction_type, assay_type --
#>   [Success] Metadata is saved. 
#>   [Success] Signature is valid. 
#>   [Success] difexp is valid. 
#>   [Success] OmicSignature object Myc_reduce_mice_liver_24m created.

See the created object information:

print(OmS)
#> Signature Object: 
#>   Metadata: 
#>     adj_p_cutoff = 0.05 
#>     assay_type = transcriptomics 
#>     covariates = none 
#>     description = mice Myc haploinsufficient (Myc(+/-)) 
#>     direction_type = bi-directional 
#>     keywords = Myc, KO, longevity 
#>     organism = Mus musculus 
#>     others = C57BL/6 
#>     phenotype = Myc_reduce 
#>     platform = transcriptomics by array 
#>     PMID = 25619689 
#>     sample_type = liver 
#>     score_cutoff = 5 
#>     signature_name = Myc_reduce_mice_liver_24m 
#>     year = 2015 
#>   Metadata user defined fields: 
#>     animal_strain = C57BL/6 
#>   Signature: 
#>     MYC Reduce (5)
#>     WT (10)
#>   Differential Expression Data: 
#>     884 x 10

Use new criteria to extract significant features:
(this does not change the signature saved in the object)

OmS$extractSignature("abs(score) > 10; adj_p < 0.01")
#>   probe_id       feature_name   score group_label
#> 1 10349648 ENSMUSG00000004552  14.762  MYC Reduce
#> 2 10345762 ENSMUSG00000026072 -13.543          WT
#> 3 10353192 ENSMUSG00000025932  10.487  MYC Reduce
#> 4 10355259 ENSMUSG00000061816 -10.315          WT

Besides saving and reading OmicSignature object in .rds format, you can export the object as a text file in json format.

saveRDS(OmS, "Myc_reduce_mice_liver_24m_OmS.rds")
writeJson(OmS, "Myc_reduce_mice_liver_24m_OmS.json")

See more in “Functionalities of OmicSignature” section.

3. Create an OmicSignature from difexp and metadata

You can by-pass the generating signature process once you are an expert. Simply provide cutoffs (e.g. adj_p_cutoff and score_cutoff) in the metadata, make sure difexp has “adj_p”, “score” and “group_label” columns, and use OmicSigFromDifexp() to automatically extract significant features and create the OmicSignature object.

OmS1 <- OmicSigFromDifexp(difexp, metadata)
#> -- criterias used to extract signatures:  abs(score) >= 5; adj_p <= 0.05 . 
#> 
#>   [Success] OmicSignature object Myc_reduce_mice_liver_24m created.
OmS1
#> Signature Object: 
#>   Metadata: 
#>     adj_p_cutoff = 0.05 
#>     assay_type = transcriptomics 
#>     covariates = none 
#>     description = mice Myc haploinsufficient (Myc(+/-)) 
#>     direction_type = bi-directional 
#>     keywords = Myc, KO, longevity 
#>     organism = Mus musculus 
#>     others = C57BL/6 
#>     phenotype = Myc_reduce 
#>     platform = transcriptomics by array 
#>     PMID = 25619689 
#>     sample_type = liver 
#>     score_cutoff = 5 
#>     signature_name = Myc_reduce_mice_liver_24m 
#>     year = 2015 
#>   Metadata user defined fields: 
#>     animal_strain = C57BL/6 
#>   Signature: 
#>     MYC Reduce (5)
#>     WT (10)
#>   Differential Expression Data: 
#>     884 x 10