vignettes/docs/Diffanal_LimmaBC.Rmd
Diffanal_LimmaBC.RmdHere, we present a couple of simple examples of differential analysis
based on limma. In particular, we show how the design
matrix can be constructed using different ‘codings’ of the regression
variables. We also define a simple wrapper function that can help us
remember the different limma steps.
This document also provides examples of how to embed equations into an R markdown document, and how to cross-reference those equations.
We start by loading and pre-processing the necessary expression dataset.
## let us load the ExpressionSet, an already simplified version of the
## ..breast cancer dataset
data(renamedBreastDB)
#load(file.path(Sys.getenv("BS831"),"data/renamedBreastDB.rda"))
dat <- renamedBreastDB
head(pData(dat))## individual diseaseState
## GSM85500 T44 nonbasal
## GSM85504 T175 nonbasal
## GSM85493 T37 nonbasal
## GSM85496 T161 nonbasal
## GSM85510 T162 nonbasal
## GSM85509 T74 nonbasal
table(dat$diseaseState)##
## nonbasal basal
## 20 18
Limma
When the regression variable is categorical (binary in this case), we can choose different (yet equivalent) ‘codings’. In particular, we can fit a standard model
\[ \begin{equation} y = \beta_0 + \beta_1 X_{group}, \label{eq:lm1} \tag{1} \end{equation} \]
where \(X_{group} = 0,1,\) if the observation is from a nonbasal- or a basal-type tumor, respectively. Alternatively, we can fit the following model
\[ \begin{equation} y = \beta_{\not{basal}} X_{\not{basal}} + \beta_{basal}X_{basal}, \label{eq:lm2} \tag{2} \end{equation} \]
where \(X_{\not{basal}} = 1,0\), if the observation is from a nonbasal- or a basal-type tumor, respectively. Conversely, \(X_{basal} = 0,1\), if the observation is from a nonbasal- or a basal-type tumor, respectively.
The corresponding models in limma are defined based on
the two design matrices design1, for model \(\eqref{eq:lm1}\), and design2,
for model \(\eqref{eq:lm2}\), as
follows.
design1 <- model.matrix( ~ diseaseState, data = pData(dat) )
colnames(design1) <- c("baseline","basal") # let's simplify column names
design2 <- model.matrix( ~ 0 + dat$diseaseState )
colnames(design2) <- levels( dat$diseaseState ) # ditto
print(unique(design1)) # showing only one instance of each class## baseline basal
## GSM85500 1 0
## GSM85480 1 1
## nonbasal basal
## 1 1 0
## 21 0 1
The different designs require slightly different approaches to model fitting. In model \(\eqref{eq:lm1}\), we want to extract the p-value corresponding to the \(\beta_{group}\) parameter. In model \(\eqref{eq:lm2}\), we care about the difference \(\beta_{basal} - \beta_{\not{basal}}\).
## baseline basal
## AGR3 9.241425 -5.476069
## SCGB2A2 9.049974 -4.020716
## AGR2 10.349870 -5.139503
## CALML5 6.314069 2.564793
## KRT23 6.348392 3.193484
## FOXA1 7.215394 -3.997576
fit1 <- eBayes(fit1) # pooling of variance across like-genes
fit1table <- topTable(fit1,coef="basal",adjust="BH",number=Inf,sort.by="P")
head(fit1table)## affy_hg_u133_plus_2 hgnc_symbol
## MLPH 218211_s_at MLPH
## FOXA1 237086_at FOXA1
## FOXC1 1553613_s_at FOXC1
## AGR2 209173_at AGR2
## SFRP1 202035_s_at SFRP1
## SIDT1 219734_at SIDT1
## description
## MLPH melanophilin [Source:HGNC Symbol;Acc:HGNC:29643]
## FOXA1 forkhead box A1 [Source:HGNC Symbol;Acc:HGNC:5021]
## FOXC1 forkhead box C1 [Source:HGNC Symbol;Acc:HGNC:3800]
## AGR2 anterior gradient 2 [Source:HGNC Symbol;Acc:HGNC:328]
## SFRP1 secreted frizzled-related protein 1 [Source:HGNC Symbol;Acc:HGNC:10776]
## SIDT1 SID1 transmembrane family, member 1 [Source:HGNC Symbol;Acc:HGNC:25967]
## logFC AveExpr t P.Value adj.P.Val B
## MLPH -4.539281 9.487324 -15.77794 1.198422e-19 2.539912e-16 34.35447
## FOXA1 -3.997576 5.321805 -15.63213 1.693275e-19 2.539912e-16 34.02400
## FOXC1 3.774299 7.847670 14.27359 4.745041e-18 4.745041e-15 30.82304
## AGR2 -5.139503 7.915368 -12.95288 1.488409e-16 1.116307e-13 27.48939
## SFRP1 4.144489 8.377033 12.32914 8.162432e-16 4.897459e-13 25.83515
## SIDT1 -2.605517 6.013343 -12.23990 1.045491e-15 5.227453e-13 25.59416
## nonbasal basal
## AGR3 9.241425 3.765357
## SCGB2A2 9.049974 5.029258
## AGR2 10.349870 5.210367
## CALML5 6.314069 8.878863
## KRT23 6.348392 9.541876
## FOXA1 7.215394 3.217818
## let's define the contrast matrix we need
contrast.matrix <- makeContrasts((basal-nonbasal),levels=design2)
fit2 <- contrasts.fit(fit2,contrast.matrix)
## notice that the contrast coefficient takes the same values as the basal coefficient in design1
head(fit2$coefficients)## Contrasts
## (basal - nonbasal)
## AGR3 -5.476069
## SCGB2A2 -4.020716
## AGR2 -5.139503
## CALML5 2.564793
## KRT23 3.193484
## FOXA1 -3.997576
fit2 <- eBayes(fit2) # pooling of variance across like-genes
fit2table <- topTable(fit2,coef=1,adjust="BH",number=Inf,sort.by="P")
head(fit2table)## affy_hg_u133_plus_2 hgnc_symbol
## MLPH 218211_s_at MLPH
## FOXA1 237086_at FOXA1
## FOXC1 1553613_s_at FOXC1
## AGR2 209173_at AGR2
## SFRP1 202035_s_at SFRP1
## SIDT1 219734_at SIDT1
## description
## MLPH melanophilin [Source:HGNC Symbol;Acc:HGNC:29643]
## FOXA1 forkhead box A1 [Source:HGNC Symbol;Acc:HGNC:5021]
## FOXC1 forkhead box C1 [Source:HGNC Symbol;Acc:HGNC:3800]
## AGR2 anterior gradient 2 [Source:HGNC Symbol;Acc:HGNC:328]
## SFRP1 secreted frizzled-related protein 1 [Source:HGNC Symbol;Acc:HGNC:10776]
## SIDT1 SID1 transmembrane family, member 1 [Source:HGNC Symbol;Acc:HGNC:25967]
## logFC AveExpr t P.Value adj.P.Val B
## MLPH -4.539281 9.487324 -15.77794 1.198422e-19 2.539912e-16 34.35447
## FOXA1 -3.997576 5.321805 -15.63213 1.693275e-19 2.539912e-16 34.02400
## FOXC1 3.774299 7.847670 14.27359 4.745041e-18 4.745041e-15 30.82304
## AGR2 -5.139503 7.915368 -12.95288 1.488409e-16 1.116307e-13 27.48939
## SFRP1 4.144489 8.377033 12.32914 8.162432e-16 4.897459e-13 25.83515
## SIDT1 -2.605517 6.013343 -12.23990 1.045491e-15 5.227453e-13 25.59416
Regardless of the ‘coding’, the results are the same.
## compare the results from the two 'codings'
all.equal(fit2table[,"t"],fit1table[rownames(fit2table),"t"])## [1] TRUE

Since running limma requires memorizing several commands
that are not easy to remember, here we define a simple wrapper function
that will relief us from remembering them all. Incidentally, this is
also (a simplified version of) the wrapper function defined in
CBMRtools, and it corresponds to the design2,
model \(\eqref{eq:lm2}\), shown
above.
It should be noted that this simple wrapper function does not allow you to specify additional covariates (confounders). In that case, you will have to either define a more advanced wrapper (any volunteer?), or revert back to the direct use of limma commands.
run_limma <- function(
eset, # the expression set
treatment, # pData variable name of the phenotype
cond, # label of the 'treatment' group
control, # label of the 'control' group
sort.by="P", # sort output by given criterion
decreasing = FALSE, # sort order
cutoff = NA # extract only genes above/below 'sort.by' cutoff
)
{
if ( ncol(fData(eset))<1 )
stop( "empty fData" )
cat( "running limma\n" )
test_name <- paste(control, "_vs_", cond, sep = "")
treatmentvec <- pData(eset)[, treatment]
eset <- eset[, treatmentvec %in% c(cond, control)]
treatmentvec <- factor( pData(eset)[, treatment] )
design <- model.matrix( ~ 0 + treatmentvec )
colnames(design) <- levels( treatmentvec )
fit <- lmFit(eset, design)
command_str <- paste("makeContrasts(", "(", cond , "-", control, ")",
",levels = design)", sep = "")
contrast.matrix <- eval( parse(text=command_str) )
fit2 <- contrasts.fit( fit, contrast.matrix )
fit2 <- eBayes(fit2)
## extract full table
fit2.table <- topTable(fit2, coef=1, adjust="BH", number=length(fit2), sort.by=sort.by)
## extract genes above/below cutoff
if( !is.na(cutoff) ) {
if (decreasing) {
fit2.table <- fit2.table[fit2.table[,sort.by] > cutoff,]
}
else {
fit2.table <- fit2.table[fit2.table[,sort.by] < cutoff,]
}
}
return(fit2.table)
}Now, let us test the newly defined function on our data.
fit3table <- run_limma(dat,treatment="diseaseState",cond="basal",control="nonbasal")## running limma
head(fit3table)## affy_hg_u133_plus_2 hgnc_symbol
## MLPH 218211_s_at MLPH
## FOXA1 237086_at FOXA1
## FOXC1 1553613_s_at FOXC1
## AGR2 209173_at AGR2
## SFRP1 202035_s_at SFRP1
## SIDT1 219734_at SIDT1
## description
## MLPH melanophilin [Source:HGNC Symbol;Acc:HGNC:29643]
## FOXA1 forkhead box A1 [Source:HGNC Symbol;Acc:HGNC:5021]
## FOXC1 forkhead box C1 [Source:HGNC Symbol;Acc:HGNC:3800]
## AGR2 anterior gradient 2 [Source:HGNC Symbol;Acc:HGNC:328]
## SFRP1 secreted frizzled-related protein 1 [Source:HGNC Symbol;Acc:HGNC:10776]
## SIDT1 SID1 transmembrane family, member 1 [Source:HGNC Symbol;Acc:HGNC:25967]
## logFC AveExpr t P.Value adj.P.Val B
## MLPH -4.539281 9.487324 -15.77794 1.198422e-19 2.539912e-16 34.35447
## FOXA1 -3.997576 5.321805 -15.63213 1.693275e-19 2.539912e-16 34.02400
## FOXC1 3.774299 7.847670 14.27359 4.745041e-18 4.745041e-15 30.82304
## AGR2 -5.139503 7.915368 -12.95288 1.488409e-16 1.116307e-13 27.48939
## SFRP1 4.144489 8.377033 12.32914 8.162432e-16 4.897459e-13 25.83515
## SIDT1 -2.605517 6.013343 -12.23990 1.045491e-15 5.227453e-13 25.59416
## [1] "'is.NA' value mismatch: 2973 in current 0 in target"