Wrapper for constrained K-means on subsampled data — cKmeansWrapperSubsample • K2Taxonomer

This function is a wrapper for the constrained K-means algorithm, using `lcvqe` from the `conclust` package. Before clustering, it subsamples each cohort down to the size of the cohort with the fewest observations. It is not meant to be run on its own, but to be passed as the 'clustFunc' argument to `K2preproc()`, `runK2Taxonomer()`, or `K2tax()`.
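The subsampling step described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: `subsample_cohorts()` and its `seed` argument are hypothetical names, and the toy matrix stands in for `clustList$eMat`.

```r
# Hypothetical helper, not part of K2Taxonomer: keep the same number of
# columns from every cohort, equal to the size of the smallest cohort.
subsample_cohorts <- function(eMat, labs, seed = 1) {
    stopifnot(ncol(eMat) == length(labs))
    set.seed(seed)
    n_min <- min(table(labs))                      # size of smallest cohort
    idx_by_cohort <- split(seq_along(labs), labs)  # column indices per cohort
    keep <- unlist(lapply(idx_by_cohort, function(idx) {
        # sample.int avoids the sample() pitfall with length-1 vectors
        idx[sample.int(length(idx), n_min)]
    }), use.names = FALSE)
    keep <- sort(keep)                             # preserve column order
    list(eMat = eMat[, keep, drop = FALSE], labs = labs[keep])
}

# Toy data: 4 features, cohorts of size 5, 3, and 2
set.seed(42)
labs <- rep(c("A", "B", "C"), times = c(5, 3, 2))
eMat <- matrix(rnorm(4 * length(labs)), nrow = 4)
sub <- subsample_cohorts(eMat, labs)
table(sub$labs)
```

After subsampling, every cohort contributes the same number of observations, so no single large cohort dominates the clustering.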

cKmeansWrapperSubsample(dataMatrix, clustList)

Arguments

dataMatrix

A P x C numeric matrix of data, where C is the number of cohort labels.

clustList

List of objects to use for clustering procedure.

'eMat': P x N expression matrix of the data set. See `?Biobase::exprs()`.
'labs': Vector of cohort labels of the observations in the data set, corresponding to the columns of `clustList$eMat`.
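To illustrate how cohort labels such as `clustList$labs` could drive a constrained clustering, the sketch below derives must-link pairs from a label vector in base R. It assumes that observations sharing a cohort label are constrained to stay together and that constraints are expressed as a two-column matrix of index pairs. Both points are assumptions made for illustration, not statements about how this wrapper or `conclust` builds its constraints.

```r
# Toy cohort labels for five observations (hypothetical)
labs <- c("A", "A", "B", "B", "B")

# TRUE where two observations carry the same cohort label
same <- outer(labs, labs, "==")

# Keep each unordered pair once (upper triangle, excluding the diagonal)
pairs <- which(same & upper.tri(same), arr.ind = TRUE)
colnames(pairs) <- c("i", "j")
pairs
```

Each row of `pairs` names two observations from the same cohort. Cohort A yields one pair and cohort B yields three.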

Value

A character string of concatenated 1s and 2s giving the cluster assignment of each column in `dataMatrix`.
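A returned string can be decoded into two groups of cohorts with base R. In this small sketch the string `"11222"` and the cohort names `A` to `E` are placeholders. In practice the names would be taken from `colnames(dataMatrix)`, in the same order.

```r
# Placeholder return value and placeholder column names of dataMatrix
split_string <- "11222"
cohorts <- c("A", "B", "C", "D", "E")

# One character per column of dataMatrix, converted to integer cluster ids
assignments <- as.integer(strsplit(split_string, "")[[1]])
names(assignments) <- cohorts

# Cohort names grouped by cluster id
groups <- split(names(assignments), assignments)
groups
```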

References

Reed ER, Monti S (2020). “Multi-resolution characterization of molecular taxonomies in bulk and single-cell transcriptomics data.” Bioinformatics. doi: 10.1101/2020.11.05.370197, http://biorxiv.org/lookup/doi/10.1101/2020.11.05.370197.

Wagstaff K, Cardie C, Rogers S, Schrodl S (2001). “Constrained K-means Clustering with Background Knowledge.” In ICML, 577-584.

Examples


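## Load the packages used in this example
library(K2Taxonomer)
library(Biobase)
library(SummarizedExperiment)

dat <- scRNAseq::ReprocessedAllenData(assays='rsem_tpm')[seq_len(50),]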
#> snapshotDate(): 2020-10-27
#> see ?scRNAseq and browseVignettes('scRNAseq') for documentation
#> loading from cache
#> see ?scRNAseq and browseVignettes('scRNAseq') for documentation
#> loading from cache
#> see ?scRNAseq and browseVignettes('scRNAseq') for documentation
#> loading from cache

eSet <- ExpressionSet(assayData=assay(dat))
pData(eSet) <- as.data.frame(colData(dat))
exprs(eSet) <- log2(exprs(eSet) + 1)

## Subset for fewer cluster labels for this example
eSet <- eSet[, !is.na(eSet$Primary.Type) &
            eSet$Primary.Type %in% c('L4 Arf5',
                'L4 Ctxn3', 'L4 Scnn1a', 'L5 Ucma', 'L5a Batf3')]

## Create cell type variable with spaces replaced by underscores
eSet$celltype <- gsub(' ', '_', eSet$Primary.Type)

## Create clustList
cL <- list(
    eMat=exprs(eSet),
    labs=eSet$celltype,
    maxIter=10)

## Run K2preproc to generate a data matrix
## with a column for each celltype.
K2res <- K2preproc(eSet,
                cohorts='celltype',
                featMetric='F',
                logCounts=TRUE)
#> Collapsing group-level values with LIMMA.
dm <- K2data(K2res)

## Generate K=2 split with constrained K-means
cKmeansWrapperSubsample(dm, cL)
#> [1] "11222"