devtools::load_all(".")
library(dplyr)

1. Cheatsheet

An OmicSignature object contains three parts:

  • metadata, a list.
    Required fields:
    “signature_name”, “organism”, “direction_type”, “assay_type”, “phenotype”.
    Recommended optional fields, if applicable:
    “platform”, “sample_type”, “description”, “covariates”, “score_cutoff”, “adj_p_cutoff”.

  • signature, a data frame.
    Required columns:
    “probe_id” (unique identifier for each feature)
    “feature_name” (e.g. ENSEMBL ID, Uniprot ID)
    Required for bi-directional and categorical signatures: “direction”
    Recommended optional column, if applicable: “score”

  • difexp (optional), a data frame of differential expression analysis results.
    Required columns:
    “probe_id” (unique identifier for each feature)
    “feature_name” (e.g. ENSEMBL ID, Uniprot ID)
    At least one of the following is required: “p_value”, “q_value”, or “adj_p”.
    Recommended optional columns, if applicable: “score”, “gene_symbol”, “hgnc_symbol”

Create the object:

OmS <- OmicSignature$new(
  metadata = metadata,
  signature = signature,
  difexp = difexp
)

2. Create an OmicSignature Object Step-by-Step

The example below comes from an experiment on reduced Myc gene expression in mice. Signatures were extracted by comparing liver tissue from treatment and control mice at 24 months of age. To save space, only the results for 1000 genes are included.

This is a bi-directional signature example, which contains up- and down-regulated features (genes).

2.1. Metadata

A list with the following required fields:
“signature_name”, “organism”, “direction_type”, “assay_type”, “phenotype”.
Not required, but highly recommended fields:
“platform”, “sample_type”, “description”,
“covariates”, “score_cutoff”, “adj_p_cutoff”, “logfc_cutoff”.

“phenotype” is a one- or two-word description of a topic, such as a drug treatment, a gene knockout, or a research area. Examples include “longevity” and “breast cancer”.
Providing a detailed “description” is highly recommended. For instance, it may include information about how the treatment was administered and how each group was defined.

Option 1: Create metadata by hand. This is not recommended, because typos can occur.

metadata <- list(
  "signature_name" = "Myc_reduce_mice_liver_24m",
  "organism" = "Mus Musculus",
  "sample_type" = "liver",
  "phenotype" = "Myc_reduce",
  "direction_type" = "bi-directional",
  "assay_type" = "transcriptomics", 
  "platform" = "GPL6246"
)
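
Because a hand-written list is easy to get wrong, it can help to check it against the required fields before going further. The following is a minimal base R sketch; missing_metadata_fields() is a hypothetical helper written for this illustration and is not part of OmicSignature.

```r
# Hypothetical helper (not part of OmicSignature): report required
# metadata fields that are absent or empty in a hand-built list.
required_fields <- c(
  "signature_name", "organism", "direction_type",
  "assay_type", "phenotype"
)

missing_metadata_fields <- function(metadata, required = required_fields) {
  # A field counts as present only if it exists and is not NULL/empty.
  non_empty <- vapply(metadata, function(x) length(x) > 0, logical(1))
  present <- names(metadata)[non_empty]
  setdiff(required, present)
}

# Toy list that forgets "assay_type".
example_metadata <- list(
  signature_name = "Myc_reduce_mice_liver_24m",
  organism = "Mus Musculus",
  sample_type = "liver",
  phenotype = "Myc_reduce",
  direction_type = "bi-directional",
  platform = "GPL6246"
)

missing_fields <- missing_metadata_fields(example_metadata)
print(missing_fields)
```

An empty result means every required field is present; a misspelled field name shows up here as a missing required field.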

Option 2: Use createMetadata() (recommended).
This function helps remind you of the built-in attributes. The full list of current built-in attributes is shown here.
You can also provide your own customized attributes using the “others” field.

metadata <- createMetadata(
  # required attributes:
  signature_name = "Myc_reduce_mice_liver_24m",
  organism = "Mus Musculus",
  direction_type = "bi-directional",
  assay_type = "transcriptomics",
  phenotype = "Myc_reduce",

  # optional and recommended:
  covariates = "none",
  description = "mice MYC reduced expression",
  platform = "GPL6246", # use GEO platform ID
  sample_type = "liver", # use BRENDA ontology

  # optional cut-off attributes.
  # specifying them can facilitate the extraction of signatures.
  logfc_cutoff = NULL,
  p_value_cutoff = NULL,
  adj_p_cutoff = 0.05,
  score_cutoff = 5,

  # other optional built-in attributes:
  keywords = c("Myc", "KO", "longevity"),
  cutoff_description = NULL,
  author = NULL,
  PMID = 25619689,
  year = 2015,

  # example of customized attributes:
  others = list("animal_strain" = "C57BL/6")
)

2.1.1 “sample_type” and “platform”

“sample_type” should be a BRENDA ontology term, and “platform” should be a GEO platform accession ID.
See how to search for the correct term to use in “Sample Type & Platform Info”.

2.1.2 “direction_type”

direction_type must be one of the following:

  • “uni-directional”. Only a list of significant feature names is available. Examples include “genes mutated in a disease” and “markers of a cell type”.

  • “bi-directional”. Significant features can be grouped into “up” and “down” categories. For example, when comparing treatment vs. control groups, some features will be higher (“up”, or “+”) and some will be lower (“down” or “-”) in the treatment group. When the phenotype is a continuous trait, such as age, some features will increase (“up”, or “+”) with age, while others will decrease (“down”, or “-”) with age.

  • “categorical”. Used with multi-valued categorical phenotypes (e.g., “A” vs. “B” vs. “C”), usually analyzed by ANOVA.

2.1.3 “assay_type”

assay_type is one of the following:
- “transcriptomics” (e.g. RNA-seq, micro-array)
- “proteomics”
- “metabolomics”
- “methylomics”
- “genetic_variations” (e.g. SNP, GWAS)
- “DNA_binding_sites” (e.g. ChIP-seq)
- “others”
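
Both direction_type and assay_type take values from a fixed vocabulary, so a near miss such as "bidirectional" is worth catching early. The sketch below shows one way to do that in base R. check_choice() is a hypothetical helper for illustration only, and the two vectors simply restate the values listed in this section.

```r
# Allowed values, copied from the lists in this section.
direction_types <- c("uni-directional", "bi-directional", "categorical")
assay_types <- c(
  "transcriptomics", "proteomics", "metabolomics", "methylomics",
  "genetic_variations", "DNA_binding_sites", "others"
)

# Hypothetical helper (not part of OmicSignature): stop with a readable
# message when a value is not a single string from the allowed set.
check_choice <- function(value, choices, field) {
  if (!is.character(value) || length(value) != 1 || !value %in% choices) {
    stop(sprintf(
      "%s must be one of: %s", field, paste(choices, collapse = ", ")
    ))
  }
  invisible(value)
}

check_choice("bi-directional", direction_types, "direction_type")
check_choice("transcriptomics", assay_types, "assay_type")

# A misspelled value is rejected; the message is captured here so the
# script keeps running.
bad_value_message <- tryCatch(
  check_choice("bidirectional", direction_types, "direction_type"),
  error = function(e) conditionMessage(e)
)
print(bad_value_message)
```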

2.2. Differential expression analysis results (difexp)

A differential expression dataframe is optional but highly recommended if available. It facilitates downstream signature extraction.

difexp is a dataframe with the following required columns:
“probe_id”, “feature_name”, along with at least one of the following: “p_value”, “q_value”, or “adj_p”.

“probe_id” is a unique identifier for each feature. If not provided, it will be automatically generated. Common examples include unique numbers and probe IDs.
“feature_name” is a name that identifies each feature; examples include ENSEMBL IDs, UniProt IDs, and RefMet IDs. To better identify the features, it is recommended to add one or more annotation columns, e.g., “hgnc_symbol”. For metabolite features, it is recommended to include multiple annotation columns if available, e.g., HMDB ID and InChI key.
“p_value”, “q_value”, and “adj_p” refer to the p- or q-value representing the significance of each feature.
“score” is a numeric value that indicates the importance or significance of a feature. Common examples include t-test statistics and Z-scores.
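
The column rules above can be checked on a data frame before it is passed to the package. This is a rough base R sketch on toy data; check_difexp_cols() is a hypothetical helper and does not reproduce the package's own validation.

```r
# Hypothetical helper (not part of OmicSignature): collect problems with
# the column layout of a difexp-style data frame.
check_difexp_cols <- function(difexp) {
  required <- c("probe_id", "feature_name")
  significance <- c("p_value", "q_value", "adj_p")
  problems <- character(0)

  missing_required <- setdiff(required, colnames(difexp))
  if (length(missing_required) > 0) {
    problems <- c(problems, paste("missing column:", missing_required))
  }
  if (!any(significance %in% colnames(difexp))) {
    problems <- c(problems, "need at least one of: p_value, q_value, adj_p")
  }
  if ("probe_id" %in% colnames(difexp) &&
      anyDuplicated(difexp$probe_id) > 0) {
    problems <- c(problems, "probe_id is not unique")
  }
  problems
}

# Toy data: no significance column, and probe_id 2 appears twice.
toy_difexp <- data.frame(
  probe_id = c(1, 2, 2),
  feature_name = c("gene1", "gene2", "gene3"),
  score = c(0.45, -3.21, 2.44)
)

difexp_problems <- check_difexp_cols(toy_difexp)
print(difexp_problems)
```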

Here we use example output from a differential expression analysis performed with the limma package.

# difexp <- read.table(file.path(system.file("extdata", package = "OmicSignature"), "difmatrix_Myc_mice_liver_24m.txt"),
#   header = TRUE, sep = "\t", stringsAsFactors = FALSE
# )
difexp <- readRDS(file.path(system.file("extdata", package = "OmicSignature"), "difmatrix_Myc_mice_liver_24m.rds"))
head(difexp)
#>   Probe.ID  logFC AveExpr      t P.Value adj.P.Val      b            ensembl
#> 1 10345228 -0.167   7.106 -1.470   0.186     0.560 -5.866 ENSMUSG00000103746
#> 2 10354534  0.041   4.351  0.520   0.620     0.870 -6.780 ENSMUSG00000060715
#> 3 10354529 -0.175   4.955 -0.941   0.379     0.731 -6.458 ENSMUSG00000043629
#> 4 10346337  0.025   8.621  0.188   0.857     0.962 -6.911 ENSMUSG00000038323
#> 5 10353792 -0.025   6.063 -0.284   0.785     0.936 -6.885 ENSMUSG00000045815
#> 6 10350848 -0.055   7.595 -0.630   0.549     0.836 -6.712 ENSMUSG00000049881
#>     gene_symbol
#> 1 1700001G17Rik
#> 2 1700019A02Rik
#> 3 1700019D03Rik
#> 4 1700066M21Rik
#> 5 1700101I19Rik
#> 6 2810025M15Rik

Manually change the column names to match the requirements. The built-in function replaceDifexpCol() is designed to replace some frequently used alternative column names.

difexp <- difexp %>% rename(feature_name = ensembl)
colnames(difexp) <- replaceDifexpCol(colnames(difexp))
head(difexp)
#>   probe_id  logfc  mean  score p_value adj_p      b       feature_name
#> 1 10345228 -0.167 7.106 -1.470   0.186 0.560 -5.866 ENSMUSG00000103746
#> 2 10354534  0.041 4.351  0.520   0.620 0.870 -6.780 ENSMUSG00000060715
#> 3 10354529 -0.175 4.955 -0.941   0.379 0.731 -6.458 ENSMUSG00000043629
#> 4 10346337  0.025 8.621  0.188   0.857 0.962 -6.911 ENSMUSG00000038323
#> 5 10353792 -0.025 6.063 -0.284   0.785 0.936 -6.885 ENSMUSG00000045815
#> 6 10350848 -0.055 7.595 -0.630   0.549 0.836 -6.712 ENSMUSG00000049881
#>       gene_name
#> 1 1700001G17Rik
#> 2 1700019A02Rik
#> 3 1700019D03Rik
#> 4 1700066M21Rik
#> 5 1700101I19Rik
#> 6 2810025M15Rik
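
If a column name is not covered by the built-in replacements, the same kind of renaming can be done with a lookup vector. The sketch below is only an illustration in base R: the mapping is inferred from the before and after column names shown above, rename_cols() is a hypothetical helper, and none of this is meant to describe how replaceDifexpCol() is implemented.

```r
# Assumed mapping from limma-style names to the standard names,
# inferred from the two head(difexp) outputs in this section.
limma_to_standard <- c(
  "Probe.ID" = "probe_id",
  "logFC" = "logfc",
  "AveExpr" = "mean",
  "t" = "score",
  "P.Value" = "p_value",
  "adj.P.Val" = "adj_p"
)

# Hypothetical helper: rename the columns found in the lookup and leave
# every other name unchanged.
rename_cols <- function(col_names, lookup = limma_to_standard) {
  hit <- col_names %in% names(lookup)
  col_names[hit] <- unname(lookup[col_names[hit]])
  col_names
}

old_names <- c(
  "Probe.ID", "logFC", "AveExpr", "t", "P.Value", "adj.P.Val", "b", "ensembl"
)
new_names <- rename_cols(old_names)
print(new_names)
```

Names without an entry in the lookup, such as "b" and "ensembl" here, pass through untouched, so the lookup can be extended one entry at a time.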

2.3. Signature

signature is a dataframe with columns “probe_id” and “feature_name”. If the signature is “bi-directional” or “categorical” (as specified by direction_type in the metadata), then the column “direction” is also required.
An optional column “score” is recommended when applicable.

Option 1: Extract signature from difexp.
In this example, we create a bi-directional signature from the difexp using the score_cutoff and adj_p_cutoff specified in the metadata.

signature <- difexp %>%
  dplyr::filter(abs(score) > metadata$score_cutoff & adj_p < metadata$adj_p_cutoff) %>%
  dplyr::select(probe_id, feature_name, score) %>%
  dplyr::mutate(direction = ifelse(score > 0, "+", "-"))
head(signature)
#>   probe_id       feature_name   score direction
#> 1 10346882 ENSMUSG00000025964  -6.990         -
#> 2 10353878 ENSMUSG00000067653  -7.867         -
#> 3 10349648 ENSMUSG00000004552  14.762         +
#> 4 10355278 ENSMUSG00000062209   6.083         +
#> 5 10353192 ENSMUSG00000025932  10.487         +
#> 6 10345762 ENSMUSG00000026072 -13.543         -

The function standardizeSigDF() can help remove duplicated and empty names.

signature <- standardizeSigDF(signature)
head(signature)
#>   probe_id       feature_name   score direction
#> 1 10349648 ENSMUSG00000004552  14.762         +
#> 2 10345762 ENSMUSG00000026072 -13.543         -
#> 3 10353192 ENSMUSG00000025932  10.487         +
#> 4 10355259 ENSMUSG00000061816 -10.315         -
#> 5 10351477 ENSMUSG00000102418   8.818         +
#> 6 10353878 ENSMUSG00000067653  -7.867         -
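
To make the idea of removing duplicated and empty names concrete, here is a rough base R sketch on toy data. clean_signature() is a hypothetical helper, not the implementation of standardizeSigDF(); in particular, keeping the duplicate with the largest absolute score is an assumption made for this illustration.

```r
# Hypothetical helper (not standardizeSigDF()): drop rows with empty or
# missing feature names, then keep one row per feature name.
clean_signature <- function(signature) {
  keep <- !is.na(signature$feature_name) &
    trimws(signature$feature_name) != ""
  signature <- signature[keep, , drop = FALSE]

  # Assumption: among duplicates, keep the row with the largest |score|,
  # so sort strongest first before removing repeated names.
  signature <- signature[order(-abs(signature$score)), , drop = FALSE]
  signature <- signature[!duplicated(signature$feature_name), , drop = FALSE]

  rownames(signature) <- NULL
  signature
}

# Toy signature: one empty name, and "gene1" listed twice.
toy_signature <- data.frame(
  probe_id = c(1, 2, 3, 4),
  feature_name = c("gene1", "", "gene2", "gene1"),
  score = c(6.0, 7.5, -9.1, -8.2),
  direction = c("+", "+", "-", "-"),
  stringsAsFactors = FALSE
)

cleaned <- clean_signature(toy_signature)
print(cleaned)
```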

Option 2: Manually create the signature dataframe.
For a uni-directional signature:

signature <- data.frame("probe_id" = c(1, 2, 3), "feature_name" = c("gene1", "gene2", "gene3"))

For bi-directional signature:

signature <- data.frame(
  "probe_id" = c(1, 2, 3),
  "feature_name" = c("gene1", "gene2", "gene3"),
  "score" = c(0.45, -3.21, 2.44),
  "direction" = c("+", "-", "+")
)

For categorical signature:

signature <- data.frame(
  "probe_id" = c(1, 2, 3, 4),
  "feature_name" = c("gene1", "gene2", "gene3", "gene4"),
  "score" = c(0.45, -3.21, 2.44, -2.45),
  "direction" = c("group1", "group1", "group2", "group3")
)
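
Whether the “direction” column is needed depends on the direction_type, which is easy to overlook when building a signature by hand. The sketch below shows that rule as a small base R check; missing_signature_cols() is a hypothetical helper for illustration and is not part of OmicSignature.

```r
# Hypothetical helper (not part of OmicSignature): list the columns a
# hand-built signature still needs for a given direction_type.
missing_signature_cols <- function(signature, direction_type) {
  required <- c("probe_id", "feature_name")
  if (direction_type %in% c("bi-directional", "categorical")) {
    required <- c(required, "direction")
  }
  setdiff(required, colnames(signature))
}

# Toy signature with names only: enough for a uni-directional signature,
# but not for a bi-directional one.
toy_signature <- data.frame(
  probe_id = c(1, 2, 3),
  feature_name = c("gene1", "gene2", "gene3")
)

missing_uni <- missing_signature_cols(toy_signature, "uni-directional")
missing_bi <- missing_signature_cols(toy_signature, "bi-directional")
print(missing_uni)
print(missing_bi)
```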

2.4. Create the OmicSignature object

OmS <- OmicSignature$new(
  metadata = metadata,
  signature = signature,
  difexp = difexp
)
#>   [Success] OmicSignature object Myc_reduce_mice_liver_24m created.

Set print_message = TRUE to see the messages.

OmS <- OmicSignature$new(
  metadata = metadata,
  signature = signature,
  difexp = difexp,
  print_message = TRUE
)
#>   --Required attributes for metadata: signature_name, organism, direction_type, assay_type, phenotype --
#>   [Success] Metadata is saved. 
#>   [Success] Signature is valid. 
#>   [Success] difexp is valid. 
#>   [Success] OmicSignature object Myc_reduce_mice_liver_24m created.

See the created object information:

print(OmS)
#> Signature Object: 
#>   Metadata: 
#>     adj_p_cutoff = 0.05 
#>     assay_type = transcriptomics 
#>     covariates = none 
#>     description = mice MYC reduced expression 
#>     direction_type = bi-directional 
#>     keywords = Myc, KO, longevity 
#>     organism = Mus Musculus 
#>     others = C57BL/6 
#>     phenotype = Myc_reduce 
#>     platform = GPL6246 
#>     PMID = 25619689 
#>     sample_type = liver 
#>     score_cutoff = 5 
#>     signature_name = Myc_reduce_mice_liver_24m 
#>     year = 2015 
#>   Metadata user defined fields: 
#>     animal_strain = C57BL/6 
#>   Signature: 
#>     Length (15)
#>     Class (character)
#>     Mode (character)
#>   Differential Expression Data: 
#>     884 x 9

Use new criteria to extract significant features:
(this does not change the signature saved in the object)

OmS$extractSignature("abs(score) > 10; adj_p < 0.01")
#>   probe_id       feature_name   score direction
#> 1 10349648 ENSMUSG00000004552  14.762         +
#> 2 10345762 ENSMUSG00000026072 -13.543         -
#> 3 10353192 ENSMUSG00000025932  10.487         +
#> 4 10355259 ENSMUSG00000061816 -10.315         -

Besides saving and reading an OmicSignature object in .rds format, you can export the object as a JSON text file.

saveRDS(OmS, "Myc_reduce_mice_liver_24m_OmS.rds")
writeJson(OmS, "Myc_reduce_mice_liver_24m_OmS.json")

See more in the “Functionalities of OmicSignature” section.

3. Create an OmicSignature from difexp and metadata

Once you are familiar with the workflow, you can bypass the signature-generation step. Simply provide cutoffs (e.g. adj_p_cutoff and score_cutoff) in the metadata, make sure difexp has “adj_p” and “score” columns, and use OmicSigFromDifexp() to automatically extract significant features and create the OmicSignature object.

OmS1 <- OmicSigFromDifexp(difexp, metadata)
#> -- criterias used to extract signatures:  abs(score) >= 5; adj_p <= 0.05 . 
#> 
#>   [Success] OmicSignature object Myc_reduce_mice_liver_24m created.
OmS1
#> Signature Object: 
#>   Metadata: 
#>     adj_p_cutoff = 0.05 
#>     assay_type = transcriptomics 
#>     covariates = none 
#>     description = mice MYC reduced expression 
#>     direction_type = bi-directional 
#>     keywords = Myc, KO, longevity 
#>     organism = Mus Musculus 
#>     others = C57BL/6 
#>     phenotype = Myc_reduce 
#>     platform = GPL6246 
#>     PMID = 25619689 
#>     sample_type = liver 
#>     score_cutoff = 5 
#>     signature_name = Myc_reduce_mice_liver_24m 
#>     year = 2015 
#>   Metadata user defined fields: 
#>     animal_strain = C57BL/6 
#>   Signature: 
#>     Length (15)
#>     Class (character)
#>     Mode (character)
#>   Differential Expression Data: 
#>     884 x 10