Introduction

This vignette describes the workflow for running K2Taxonomer on single-cell gene expression data (Reed and Monti 2020). Many steps in this workflow is shared with that of bulk gene expression. Accordingly, these steps are described in more detail here.

Requirements

Load packages

## K2Taxonomer package
library(K2Taxonomer)

## For drawing dendrograms
library(ggdendro)

Read in single-cell RNAseq data

For this example we will use a single-cell RNAseq dat set a of 379 cells from the mouse visual cortex which is a subset from (Tasic et al. 2016), made available by the scRNAseq package.

This data set includes identifiers of 17 cell types belonging to three major classes of cells based on surface markers vasoactive intestinal peptide (Vip), parvalbumin (Pvalp) and somatostatin (Sst). Vip cells include additional subclasses based on additional surface markers: L4, L5, L5a, L5b and L6a. Finally, all of these classes are further subtyped based on additional markers genes.

dat <- scRNAseq::ReprocessedAllenData(assays="rsem_tpm")

Data-processing

Convert to ExpressionSet object

As expression data input K2Taxonomer currently requires an ExpressionSet object. ExpressionSet objects can be easily created from the SingleCellExperiment object uploaded by the scRNAseq package.


# Convert to expression set
eSet <- ExpressionSet(assayData=assay(dat))
pData(eSet) <- as.data.frame(colData(dat))

## Log the data
exprs(eSet) <- log2(exprs(eSet) + 1)

Removing lowly represented cell types

Given that this data set is relatively small, we will need to remove lowly represented cell types. These will include six cell types with fewer than 5 cells. Additionally, we will remove 30 cells without cell type labels.

## Table of cell types
typeTable <- table(eSet$Primary.Type, useNA="ifany")
print(typeTable)
#> 
#>     L4 Arf5    L4 Ctxn3   L4 Scnn1a     L5 Ucma   L5a Batf3 L5a Hsd11b1 
#>          17          19          47          14          41          26 
#>   L5a Pde1c L5a Tcerg1l   L5b Cdh13  L5b Chrna6    L5b Tph2   L6a Car12 
#>           4          25          32           8          24           3 
#>     L6a Mgp     L6a Sla   L6a Syt17 Pvalb Tacr3    Sst Myh8        <NA> 
#>          33          52           2           1           1          30
# Remove NAs
eSet <- eSet[, !is.na(eSet$Primary.Type)]

# Keep cell types with at least 5 observations
eSet <- eSet[, eSet$Primary.Type %in% names(typeTable)[typeTable >= 5]]

print(eSet)
#> ExpressionSet (storageMode: lockedEnvironment)
#> assayData: 20816 features, 338 samples 
#>   element names: exprs 
#> protocolData: none
#> phenoData
#>   sampleNames: SRR2140028 SRR2140022 ... SRR2139336 (338 total)
#>   varLabels: NREADS NALIGNED ... passes_qc_checks_s (22 total)
#>   varMetadata: labelDescription
#> featureData: none
#> experimentData: use 'experimentData(object)'
#> Annotation:

Formatting cell type labels

Due to data wrangling procedures in the K2Taxonomer functions, cell type label values can only contain letters, numbers, and underscores. Here we replace the spaces in the cell type labels with underscores and create a new variable celltype.

eSet$celltype <- gsub(" ", "_", eSet$Primary.Type)

Set up partioning function

For single-cell analysis, K2Taxonomer implements the constrained K-means algorithm using the lcvqe() function from the conclust R package. K2Taxonomer includes two approaches for implementing this algorithm, specified by the cKmeansWrapper() and cKmeansWrapperSubsample() functions. While each runs the same constrained K-means algorithm, to improve computational efficiency cKmeansWrapperSubsample() first randomly subsamples the data to the size of the smallest represented cell type for each perturbation data sets (Wagstaff et al. 2001).

In order to run either function, the user must include a named list with the following components:

  • eMat: The matrix of expression values from the ExpressionSet object.
  • labs: The vector of cell type labels
  • maxIter: The maximum number of iterations for running lcvqe(), which uses 10 as the default.
# Create clustList
wrapperList <- list(
    eMat=exprs(eSet),
    labs=eSet$celltype,
    maxIter=10
)

Running K2Taxonomer

Initialize K2 object

# Run K2Taxonomer
K2res <- K2preproc(eSet,
                cohorts="celltype",
                featMetric="F",
                logCounts=TRUE,
                nBoots=100,
                clustFunc=cKmeansWrapperSubsample,
                clustList=wrapperList)
#> Collapsing group-level values with LIMMA.

Run recursive partitioning algorithm

At this point the rest of the analysis identical to that of the basic analysis as presented here (Reed and Monti 2020). Please refer to this document for further description of these steps.

## Run K2Taxonomer aglorithm
K2res <- K2tax(K2res)

Generate dendrogram from K2Taxonomer results

## Get dendrogram from K2Taxonomer
dendro <- K2dendro(K2res)

## Get dendrogram data
ggdendrogram(dendro)

Annotate K2Taxonomer results

Differential analysis

## Run DGE
K2res <- runDGEmods(K2res)

## Concatenate all results and generate table
DGEtable <- getDGETable(K2res)
head(DGEtable)
#>      gene     coef     mean        t          pval           fdr        B edge
#> 1   Pde1a 7.607703 6.795236 33.04659 1.191313e-107 1.487902e-102 234.4221    2
#> 2     Ptn 8.753100 6.087870 40.05787  2.249965e-94  1.405058e-89 203.6672    1
#> 3    Bcl6 3.989053 1.105827 31.13960  1.406226e-88  5.854398e-84 190.8268    2
#> 4 Neurod6 7.148083 6.023724 29.42144  1.107549e-83  3.458212e-79 179.7376    1
#> 5  Fam84a 6.077725 2.119998 29.36922  1.569144e-83  3.919595e-79 179.3947    2
#> 6   Foxp2 5.197925 2.479363 33.31899  4.130923e-81  8.598930e-77 173.8696    2
#>   node direction
#> 1    A        up
#> 2    E        up
#> 3    C        up
#> 4    C        up
#> 5    C        up
#> 6    E        up

Gene set hyperenrichment

## Create dummy set of gene sets
genes <- unique(DGEtable$gene)
genesetsMadeUp <- list(
    GS1=genes[1:50],
    GS2=genes[51:100],
    GS3=genes[101:150]
)

## Run gene set hyperenrichment
K2res <- runGSEmods(K2res,
                genesets=genesetsMadeUp,
                qthresh=0.1)

Single-sample gene set enrichment

## Create expression matrix with ssGSEA(GSVA) estimates
K2res <- runGSVAmods(K2res,
                ssGSEAalg="gsva",
                ssGSEAcores=1,
                verbose=FALSE)

Differential single-sample enrichment analysis

## Run differential analysis on enrichment score Expression Set
K2res <- runDSSEmods(K2res)

## Get differential results for one split
head(K2results(K2res)$A$dsse)
#>   category        coef        mean         t         pval          fdr
#> 1      GS3 0.126609098 -0.02959331 7.3074367 1.423274e-12 1.280947e-11
#> 2      GS2 0.049090375 -0.01559724 2.6250996 8.984562e-03 1.347684e-02
#> 3      GS1 0.005320382 -0.02537272 0.3296895 7.418021e-01 7.854375e-01
#>           B edge
#> 1 17.961540    1
#> 2 -3.581029    2
#> 3 -6.922921    2

## Extract table of all hyper and sample-level enrichment tests
K2enrRes <- getEnrichmentTable(K2res)
head(K2enrRes)
#>   category node edge direction   pval_hyper    fdr_hyper nhits ndrawn ncats
#> 1      GS3    G    1        up 1.079260e-18 3.885337e-17    19    467    50
#> 2      GS2    F    1        up 2.192957e-17 3.947323e-16    16    309    50
#> 3      GS1    C    1        up 3.680606e-16 3.312545e-15    18    541    50
#> 4      GS2    C    1        up 3.680606e-16 3.312545e-15    18    541    50
#> 5      GS3    A    1        up 4.612150e-16 3.320748e-15    21    875    50
#> 6      GS1    E    1        up 3.002425e-14 1.801455e-13    23   1382    50
#>    ntot   pval_limma    fdr_limma      coef         mean        t        B
#> 1 20000 1.437757e-08 4.313270e-08 0.1524040 -0.001945461 5.822848  9.97524
#> 2 20000 3.187581e-02 4.413574e-02 0.0813716 -0.107220173 2.189929 -4.02103
#> 3 20000           NA           NA        NA           NA       NA       NA
#> 4 20000 1.015150e-10 4.568173e-10 0.1384178 -0.003542505 6.675373 13.87399
#> 5 20000 1.423274e-12 1.280947e-11 0.1266091 -0.029593310 7.307437 17.96154
#> 6 20000 1.297947e-10 4.672610e-10 0.1242961 -0.048398313 6.549259 14.15506
#>                                                                                                                                                   hits
#> 1                            Nav1,Tmem47,Camk2n1,Sphkap,Rims3,Lypd1,Galntl6,Ccl27b,Gm13306,Coro6,Fam19a2,Lamp5,Npcd,Hs3st4,Dkkl1,Whrn,Nrep,Coch,Scnn1a
#> 2                                             Nos1ap,Crym,Gucy1a3,Stk17b,Enc1,Dkk3,Fstl5,Ldb2,6430573F11Rik,Luzp2,Dynll1,Meis2,Bcl11b,Ntrk3,Cck,Igfbp4
#> 3                                         Pde1a,Neurod6,Foxp2,Pcp4,Cux2,Slc1a2,Nptx1,Ncald,Slc30a3,Rorb,Atp1b2,Olfm1,Camk2d,Cdh10,Cbln4,Per3,Tle4,Hpca
#> 4                              Unc5d,Zdhhc22,Car10,Kcnv1,Pamr1,Deptor,Tmsb10,Dkk3,Spock1,Alcam,Tspan7,Ccdc80,Gpm6a,S100a10,Fxyd7,Abracl,Scube1,Gadd45a
#> 5                     Camk2n1,Lin7a,Sphkap,Chgb,Pou3f2,Rims3,Rph3a,Cpne9,Fam19a2,Npcd,Dcdc2a,Mapk4,Dkkl1,Whrn,Ppm1h,Pcdh20,Kcna2,Nrep,Coch,Scnn1a,Rgs4
#> 6 Ptn,Neurod6,Fam84a,Cux2,Srgap3,Slc30a3,Rorb,Plcxd2,Atp1b2,Btbd3,Lrrtm4,Olfm1,Utrn,Camk2d,Cdh10,Cbln4,Fam19a1,Per3,Cacna2d3,Serpine2,Vat1l,Hpca,Grik1

Generate interactive dashboard to explore results

For more information see go here.


## NOT RUN
K2dashboard(K2res)

Runtime for this example

#> Time difference of 1.457723 mins

References

Reed, Eric R., and Stefano Monti. 2020. “Multi-Resolution Characterization of Molecular Taxonomies in Bulk and Single-Cell Transcriptomics Data.” Preprint. Bioinformatics. https://doi.org/10.1101/2020.11.05.370197.

Tasic, Bosiljka, Vilas Menon, Thuc Nghi Nguyen, Tae Kyung Kim, Tim Jarsky, Zizhen Yao, Boaz Levi, et al. 2016. “Adult Mouse Cortical Cell Taxonomy Revealed by Single Cell Transcriptomics.” Nature Neuroscience 19 (2): 335–46. https://doi.org/10.1038/nn.4216.

Wagstaff, Kiri, Claire Cardie, Seth Rogers, and Stefan Schrodl. 2001. “Constrained K-Means Clustering with Background Knowledge.” In ICML, 577–84.