Function to create K2 object for pre-processing

This function generates an object of class K2 and runs the pre-processing functions required by the K2 Taxonomer procedure.

Usage

K2preproc(
  object,
  cohorts = NULL,
  eMatDS = NULL,
  colData = NULL,
  vehicle = NULL,
  variables = NULL,
  seuAssay = "RNA",
  seuAssayDS = "RNA",
  sceAssay = "logcounts",
  sceAssayDS = NULL,
  block = NULL,
  logCounts = TRUE,
  use = c("Z", "MEAN"),
  nFeats = "sqrt",
  featMetric = c("F", "mad", "sd", "Sn"),
  DGEmethod = c("limma", "mast"),
  DGEexpThreshold = 0.25,
  recalcDataMatrix = TRUE,
  nBoots = 500,
  useCors = 1,
  clustFunc = "cKmeansDownsampleSqrt",
  clustList = NULL,
  linkage = c("mcquitty", "ward.D", "ward.D2", "single", "complete", "average",
    "centroid"),
  info = NULL,
  genesets = NULL,
  qthresh = 0.05,
  cthresh = 0,
  ntotal = 20000,
  ScoreGeneSetMethod = c("GSVA", "AUCELL"),
  oneoff = TRUE,
  stabThresh = 0,
  geneURL = NULL,
  genesetURL = NULL
)

Arguments

object

An object of class matrix, dgCMatrix, Seurat, SingleCellExperiment, or ExpressionSet. For matrix and dgCMatrix, include genes and observations (single-cell/bulk profiles) as rows and columns, respectively.

cohorts

Character. The column in the meta data of 'object' that contains cohort IDs. Default is NULL, for data with no cohorts.

eMatDS

Numeric matrix. A matrix with the same number of observations as 'object' containing normalized expression data to be used in analyses downstream of partitioning results.

colData

data.frame. Only used if 'object' is a matrix or dgCMatrix. A data frame with named rows and columns containing observation data for each column in 'object'.
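As a minimal sketch of the layout described for 'object' and 'colData', the following base R code builds a toy matrix with genes as rows and observations as columns, plus a data frame whose row names match the matrix columns. The gene, observation, and column names (`cohort`, `batch`) are invented for illustration.

```r
# Toy expression matrix: 4 genes (rows) x 6 observations (columns).
set.seed(1)
eMat <- matrix(
  rnorm(24), nrow = 4,
  dimnames = list(paste0("gene", 1:4), paste0("obs", 1:6))
)

# Observation-level data: one named row per column of eMat.
# The column names 'cohort' and 'batch' are hypothetical.
colData <- data.frame(
  cohort = rep(c("A", "B", "C"), each = 2),
  batch = rep(c("b1", "b2"), times = 3),
  row.names = colnames(eMat)
)

# Confirm that rows of colData line up with columns of eMat.
stopifnot(identical(rownames(colData), colnames(eMat)))

# These objects could then be supplied as, for example:
# K2preproc(object = eMat, colData = colData, cohorts = "cohort")
```

Checking the alignment up front makes a mismatch between observation data and expression columns fail early rather than inside a later step.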

vehicle

The value in the cohort variable identifying the observations to use as control. Default is NULL, for no vehicle.

variables

Character. Columns in meta data of 'object' to control for in differential analyses.

seuAssay

Character. Name of assay in Seurat object containing expression data for running partitioning algorithm. If cohorts are based on clustering, this should be the assay used.

seuAssayDS

Character. Name of assay in Seurat object containing normalized expression data to be used in analyses downstream of partitioning algorithm.

sceAssay

Character. Name of assay in SingleCellExperiment object containing expression data for running partitioning algorithm. If cohorts are based on clustering, this should be the assay used.

sceAssayDS

Character. Name of assay in SingleCellExperiment object containing normalized expression data to be used in analyses downstream of partitioning algorithm.

block

Character. Block parameter in limma for modeling hierarchical data structure, such as multiple observations per individual.

logCounts

Logical. Whether expression values are log-scale counts or log-normalized counts from RNA-seq. Default is TRUE.

use

Character. Options are 'Z' to generate test statistics or 'MEAN' to use means from differential analysis for clustering.

nFeats

'sqrt' or a numeric value <= the number of features. The number of features to subset for each partition.

featMetric

Character. Metric to use to assign gene-level variance/signal score.

F: F-statistic from evaluating differences in means across cohort
mad: Median absolute deviation
sd: Standard deviation
Sn: Robust scale estimator
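The sketch below illustrates how gene-level scores of the kinds listed above might be computed with base R and then used to keep a subset of features. It is not the package's internal code. Two assumptions are made: that the F-statistic is a one-way ANOVA F across cohorts, and that `nFeats = "sqrt"` means keeping the square root of the number of features (rounded up here). The Sn estimator is omitted because base R does not provide it; implementations exist in add-on packages such as robustbase.

```r
set.seed(1)
# Toy data: 100 genes x 12 observations in 3 cohorts of 4.
eMat <- matrix(
  rnorm(1200), nrow = 100,
  dimnames = list(paste0("gene", 1:100), paste0("obs", 1:12))
)
cohort <- factor(rep(c("A", "B", "C"), each = 4))

# Per-gene dispersion scores.
madScore <- apply(eMat, 1, mad)
sdScore <- apply(eMat, 1, sd)

# Per-gene one-way ANOVA F-statistic across cohorts (assumed definition).
fScore <- apply(eMat, 1, function(x) {
  anova(lm(x ~ cohort))[1, "F value"]
})

# Assumed reading of nFeats = "sqrt": keep ceiling(sqrt(number of genes)).
nKeep <- ceiling(sqrt(nrow(eMat)))
topGenes <- names(sort(madScore, decreasing = TRUE))[seq_len(nKeep)]
```

With 100 genes this keeps 10. Swapping `madScore` for `sdScore` or `fScore` in the last line changes which genes are ranked highest.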

DGEmethod

Character. Method for running differential gene expression analyses. Use one of either 'limma' (default) or 'mast'.

DGEexpThreshold

Numeric. A value between 0 and 1 used to filter lowly expressed genes for partition-specific differential gene expression: the minimum proportion of observations with counts > 0 in at least one subgroup at a specific partition.
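To make the filtering rule concrete, here is a small base R sketch of the idea: for each gene, compute the proportion of observations with counts > 0 in each subgroup, then keep the gene if any subgroup reaches the threshold. The data, the subgroup labels, and the use of `>=` rather than `>` are assumptions for illustration.

```r
set.seed(1)
# Toy counts: 5 genes x 8 observations, two subgroups at one partition.
counts <- matrix(
  rpois(40, lambda = 0.5), nrow = 5,
  dimnames = list(paste0("gene", 1:5), paste0("obs", 1:8))
)
subgroup <- rep(c("g1", "g2"), each = 4)
threshold <- 0.25

# Proportion of observations with counts > 0, per gene and subgroup.
propExpressed <- sapply(split(seq_along(subgroup), subgroup), function(idx) {
  rowMeans(counts[, idx, drop = FALSE] > 0)
})

# Keep a gene if at least one subgroup reaches the threshold
# (whether the comparison is ">=" or ">" is an assumption).
keep <- apply(propExpressed, 1, max) >= threshold
filtered <- counts[keep, , drop = FALSE]
```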

recalcDataMatrix

Logical. Recalculate dataMatrix for each partition? Default is TRUE.

nBoots

Numeric. The number of bootstraps to run at each partition. Default is 500.

useCors

Numeric. Number of cores to use for parallelizable processes.

clustFunc

Character. Wrapper function to be used in recursive partitioning.

cKmeansDownsampleSqrt: Perform constrained K-means clustering after subsampling each cohort by the square root of the number of observations
cKmeansDownsampleSmallest: Perform constrained K-means clustering after subsampling each cohort by the size of the smallest cohort
hclustWrapper: Perform hierarchical clustering

clustList

Optional named list of parameters to use with clustFunc.

cKmeansDownsampleSqrt:
- maxIter: The maximum number of iterations to use with lcvqe()
cKmeansDownsampleSmallest:
- maxIter: The maximum number of iterations to use with lcvqe()
hclustWrapper:
- aggMethod: One of the hierarchical methods specified by hclust() function
- distMetric: One of the distance metrics specified by dist() function
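The parameter lists above can be written as ordinary named lists. The sketch below shows one for each kind of wrapper and then illustrates, in base R, what `aggMethod` and `distMetric` correspond to in `hclust()` and `dist()`. The specific values (`maxIter = 10`, `"ward.D2"`, `"euclidean"`) and the toy data are illustrative choices, not recommended settings.

```r
# Named lists using the parameter names listed above (values are illustrative).
clustListKmeans <- list(maxIter = 10)
clustListHclust <- list(aggMethod = "ward.D2", distMetric = "euclidean")

# What the hclustWrapper parameters refer to in base R:
set.seed(1)
x <- matrix(rnorm(40), nrow = 8, dimnames = list(paste0("obs", 1:8), NULL))
d <- dist(x, method = clustListHclust$distMetric)     # distance metric
hc <- hclust(d, method = clustListHclust$aggMethod)   # agglomeration method
groups <- cutree(hc, k = 2)                           # split into two groups
```

A list like `clustListHclust` would be passed together with `clustFunc = "hclustWrapper"`, since the parameter names are specific to each wrapper.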

linkage

Character. Linkage criteria for splitting cosine matrix ('method' in hclust). 'average' by default.
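As an illustration of what a linkage criterion does to a cosine matrix, the following base R sketch computes cosine similarities between rows of a toy matrix, converts them to dissimilarities, and splits the result in two with `hclust()`. How K2preproc builds and splits its cosine matrix internally is not described here, so the `1 - cosine` conversion and the two-way cut are assumptions.

```r
set.seed(1)
# Toy matrix: 6 items (rows) x 20 features.
x <- matrix(rnorm(120), nrow = 6, dimnames = list(paste0("item", 1:6), NULL))

# Cosine similarity between rows.
norms <- sqrt(rowSums(x^2))
cosSim <- (x %*% t(x)) / (norms %o% norms)

# Convert to a dissimilarity (assumed: 1 - cosine) and cluster
# with the chosen linkage, passed as 'method' to hclust().
hc <- hclust(as.dist(1 - cosSim), method = "average")
split2 <- cutree(hc, k = 2)
```

Replacing `"average"` with another of the listed options, such as `"mcquitty"` or `"complete"`, changes how distances between merged groups are updated and can change the resulting split.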

info

Character. A vector of column names in the meta data of 'object' that contain information to be used in cohort annotation of the dashboard visualization.

genesets

Named list. Feature sets to be included in enrichment-based analyses.
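A named list of feature sets can be built as shown in this minimal sketch, where each element is a character vector of feature identifiers. The set names and gene identifiers are invented, and the assumption is that identifiers should match the row names of the expression data.

```r
# Hypothetical feature sets: names identify the sets,
# values are character vectors of feature identifiers.
genesets <- list(
  setA = c("gene1", "gene2", "gene3"),
  setB = c("gene2", "gene4")
)

# Basic structural checks before use.
stopifnot(
  is.list(genesets),
  !is.null(names(genesets)),
  all(vapply(genesets, is.character, logical(1)))
)
```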

qthresh

Numeric. A value between 0 and 1 indicating the FDR cutoff to define feature sets.

cthresh

Numeric. A positive value for the coefficient cutoff to define feature sets.

ntotal

Numeric. A positive value to use as the background feature count. 20000 by default.

ScoreGeneSetMethod

Character. Method for gene set scoring. Use one of either 'GSVA' (default) or 'AUCELL'.

oneoff

Logical. Allow 1 observation partition groups? Default is TRUE.

stabThresh

Numeric. A value between 0 and 1 indicating the threshold for ending clustering.

geneURL

Named list. URLs linking genes to external resources.

genesetURL

Named list. URLs linking gene sets to external resources.

Value

An object of class, `K2`.

References

Reed ER, Monti S (2021). “Multi-resolution characterization of molecular taxonomies in bulk and single-cell transcriptomics data.” Nucleic Acids Research. doi:10.1093/nar/gkab552, https://pubmed.ncbi.nlm.nih.gov/34226941/.

Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015). “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Research, 43(7), e47. ISSN 1362-4962, 0305-1048, doi:10.1093/nar/gkv007, http://academic.oup.com/nar/article/43/7/e47/2414268/limma-powers-differential-expression-analyses-for.

Rousseeuw PJ, Croux C (1993). “Alternatives to the Median Absolute Deviation.” Journal of the American Statistical Association, 88(424), 1273-1283. ISSN 0162-1459, doi:10.1080/01621459.1993.10476408.