vignettes/docs/Diffanal_LimmaBC.Rmd
Diffanal_LimmaBC.Rmd
Here, we present a couple of simple examples of differential analysis
based on limma
. In particular, we show how the design
matrix can be constructed using different ‘codings’ of the regression
variables. We also define a simple wrapper function that can help us
remember the different limma
steps.
This document also provides examples of how to embed equations into an R markdown document, and how to cross-reference those equations.
We start by loading and pre-processing the necessary expression dataset.
## let us load the ExpressionSet, an already simplified version of the
## ..breast cancer dataset
data(renamedBreastDB)
#load(file.path(Sys.getenv("BS831"),"data/renamedBreastDB.rda"))
dat <- renamedBreastDB
head(pData(dat))
## individual diseaseState
## GSM85500 T44 nonbasal
## GSM85504 T175 nonbasal
## GSM85493 T37 nonbasal
## GSM85496 T161 nonbasal
## GSM85510 T162 nonbasal
## GSM85509 T74 nonbasal
table(dat$diseaseState)
##
## nonbasal basal
## 20 18
Limma
When the regression variable is categorical (binary in this case), we can choose different (yet equivalent) ‘codings’. In particular, we can fit a standard model
\[ \begin{equation} y = \beta_0 + \beta_1 X_{group}, \label{eq:lm1} \tag{1} \end{equation} \]
where \(X_{group} = 0,1,\) if the observation is from a nonbasal- or a basal-type tumor, respectively. Alternatively, we can fit the following model
\[ \begin{equation} y = \beta_{\not{basal}} X_{\not{basal}} + \beta_{basal}X_{basal}, \label{eq:lm2} \tag{2} \end{equation} \]
where \(X_{\not{basal}} = 1,0\), if the observation is from a nonbasal- or a basal-type tumor, respectively. Conversely, \(X_{basal} = 0,1\), if the observation is from a nonbasal- or a basal-type tumor, respectively.
The corresponding models in limma
are defined based on
the two design matrices design1
, for model \(\eqref{eq:lm1}\), and design2
,
for model \(\eqref{eq:lm2}\), as
follows.
design1 <- model.matrix( ~ diseaseState, data = pData(dat) )
colnames(design1) <- c("baseline","basal") # let's simplify column names
design2 <- model.matrix( ~ 0 + dat$diseaseState )
colnames(design2) <- levels( dat$diseaseState ) # ditto
print(unique(design1)) # showing only one instance of each class
## baseline basal
## GSM85500 1 0
## GSM85480 1 1
## nonbasal basal
## 1 1 0
## 21 0 1
The different designs require slightly different approaches to model fitting. In model \(\eqref{eq:lm1}\), we want to extract the p-value corresponding to the \(\beta_{group}\) parameter. In model \(\eqref{eq:lm2}\), we care about the difference \(\beta_{basal} - \beta_{\not{basal}}\).
## baseline basal
## AGR3 9.241425 -5.476069
## SCGB2A2 9.049974 -4.020716
## AGR2 10.349870 -5.139503
## CALML5 6.314069 2.564793
## KRT23 6.348392 3.193484
## FOXA1 7.215394 -3.997576
fit1 <- eBayes(fit1) # pooling of variance across like-genes
fit1table <- topTable(fit1,coef="basal",adjust="BH",number=Inf,sort.by="P")
head(fit1table)
## affy_hg_u133_plus_2 hgnc_symbol
## MLPH 218211_s_at MLPH
## FOXA1 237086_at FOXA1
## FOXC1 1553613_s_at FOXC1
## AGR2 209173_at AGR2
## SFRP1 202035_s_at SFRP1
## SIDT1 219734_at SIDT1
## description
## MLPH melanophilin [Source:HGNC Symbol;Acc:HGNC:29643]
## FOXA1 forkhead box A1 [Source:HGNC Symbol;Acc:HGNC:5021]
## FOXC1 forkhead box C1 [Source:HGNC Symbol;Acc:HGNC:3800]
## AGR2 anterior gradient 2 [Source:HGNC Symbol;Acc:HGNC:328]
## SFRP1 secreted frizzled-related protein 1 [Source:HGNC Symbol;Acc:HGNC:10776]
## SIDT1 SID1 transmembrane family, member 1 [Source:HGNC Symbol;Acc:HGNC:25967]
## logFC AveExpr t P.Value adj.P.Val B
## MLPH -4.539281 9.487324 -15.77794 1.198422e-19 2.539912e-16 34.35447
## FOXA1 -3.997576 5.321805 -15.63213 1.693275e-19 2.539912e-16 34.02400
## FOXC1 3.774299 7.847670 14.27359 4.745041e-18 4.745041e-15 30.82304
## AGR2 -5.139503 7.915368 -12.95288 1.488409e-16 1.116307e-13 27.48939
## SFRP1 4.144489 8.377033 12.32914 8.162432e-16 4.897459e-13 25.83515
## SIDT1 -2.605517 6.013343 -12.23990 1.045491e-15 5.227453e-13 25.59416
## nonbasal basal
## AGR3 9.241425 3.765357
## SCGB2A2 9.049974 5.029258
## AGR2 10.349870 5.210367
## CALML5 6.314069 8.878863
## KRT23 6.348392 9.541876
## FOXA1 7.215394 3.217818
## let's define the contrast matrix we need
contrast.matrix <- makeContrasts((basal-nonbasal),levels=design2)
fit2 <- contrasts.fit(fit2,contrast.matrix)
## notice that the contrast coefficient takes the same values as the basal coefficient in design1
head(fit2$coefficients)
## Contrasts
## (basal - nonbasal)
## AGR3 -5.476069
## SCGB2A2 -4.020716
## AGR2 -5.139503
## CALML5 2.564793
## KRT23 3.193484
## FOXA1 -3.997576
fit2 <- eBayes(fit2) # pooling of variance across like-genes
fit2table <- topTable(fit2,coef=1,adjust="BH",number=Inf,sort.by="P")
head(fit2table)
## affy_hg_u133_plus_2 hgnc_symbol
## MLPH 218211_s_at MLPH
## FOXA1 237086_at FOXA1
## FOXC1 1553613_s_at FOXC1
## AGR2 209173_at AGR2
## SFRP1 202035_s_at SFRP1
## SIDT1 219734_at SIDT1
## description
## MLPH melanophilin [Source:HGNC Symbol;Acc:HGNC:29643]
## FOXA1 forkhead box A1 [Source:HGNC Symbol;Acc:HGNC:5021]
## FOXC1 forkhead box C1 [Source:HGNC Symbol;Acc:HGNC:3800]
## AGR2 anterior gradient 2 [Source:HGNC Symbol;Acc:HGNC:328]
## SFRP1 secreted frizzled-related protein 1 [Source:HGNC Symbol;Acc:HGNC:10776]
## SIDT1 SID1 transmembrane family, member 1 [Source:HGNC Symbol;Acc:HGNC:25967]
## logFC AveExpr t P.Value adj.P.Val B
## MLPH -4.539281 9.487324 -15.77794 1.198422e-19 2.539912e-16 34.35447
## FOXA1 -3.997576 5.321805 -15.63213 1.693275e-19 2.539912e-16 34.02400
## FOXC1 3.774299 7.847670 14.27359 4.745041e-18 4.745041e-15 30.82304
## AGR2 -5.139503 7.915368 -12.95288 1.488409e-16 1.116307e-13 27.48939
## SFRP1 4.144489 8.377033 12.32914 8.162432e-16 4.897459e-13 25.83515
## SIDT1 -2.605517 6.013343 -12.23990 1.045491e-15 5.227453e-13 25.59416
Regardless of the ‘coding’, the results are the same.
## compare the results from the two 'codings'
all.equal(fit2table[,"t"],fit1table[rownames(fit2table),"t"])
## [1] TRUE
Since running limma
requires memorizing several commands
that are not easy to remember, here we define a simple wrapper function
that will relief us from remembering them all. Incidentally, this is
also (a simplified version of) the wrapper function defined in
CBMRtools
, and it corresponds to the design2
,
model \(\eqref{eq:lm2}\), shown
above.
It should be noted that this simple wrapper function does not allow you to specify additional covariates (confounders). In that case, you will have to either define a more advanced wrapper (any volunteer?), or revert back to the direct use of limma commands.
run_limma <- function(
eset, # the expression set
treatment, # pData variable name of the phenotype
cond, # label of the 'treatment' group
control, # label of the 'control' group
sort.by="P", # sort output by given criterion
decreasing = FALSE, # sort order
cutoff = NA # extract only genes above/below 'sort.by' cutoff
)
{
if ( ncol(fData(eset))<1 )
stop( "empty fData" )
cat( "running limma\n" )
test_name <- paste(control, "_vs_", cond, sep = "")
treatmentvec <- pData(eset)[, treatment]
eset <- eset[, treatmentvec %in% c(cond, control)]
treatmentvec <- factor( pData(eset)[, treatment] )
design <- model.matrix( ~ 0 + treatmentvec )
colnames(design) <- levels( treatmentvec )
fit <- lmFit(eset, design)
command_str <- paste("makeContrasts(", "(", cond , "-", control, ")",
",levels = design)", sep = "")
contrast.matrix <- eval( parse(text=command_str) )
fit2 <- contrasts.fit( fit, contrast.matrix )
fit2 <- eBayes(fit2)
## extract full table
fit2.table <- topTable(fit2, coef=1, adjust="BH", number=length(fit2), sort.by=sort.by)
## extract genes above/below cutoff
if( !is.na(cutoff) ) {
if (decreasing) {
fit2.table <- fit2.table[fit2.table[,sort.by] > cutoff,]
}
else {
fit2.table <- fit2.table[fit2.table[,sort.by] < cutoff,]
}
}
return(fit2.table)
}
Now, let us test the newly defined function on our data.
fit3table <- run_limma(dat,treatment="diseaseState",cond="basal",control="nonbasal")
## running limma
head(fit3table)
## affy_hg_u133_plus_2 hgnc_symbol
## MLPH 218211_s_at MLPH
## FOXA1 237086_at FOXA1
## FOXC1 1553613_s_at FOXC1
## AGR2 209173_at AGR2
## SFRP1 202035_s_at SFRP1
## SIDT1 219734_at SIDT1
## description
## MLPH melanophilin [Source:HGNC Symbol;Acc:HGNC:29643]
## FOXA1 forkhead box A1 [Source:HGNC Symbol;Acc:HGNC:5021]
## FOXC1 forkhead box C1 [Source:HGNC Symbol;Acc:HGNC:3800]
## AGR2 anterior gradient 2 [Source:HGNC Symbol;Acc:HGNC:328]
## SFRP1 secreted frizzled-related protein 1 [Source:HGNC Symbol;Acc:HGNC:10776]
## SIDT1 SID1 transmembrane family, member 1 [Source:HGNC Symbol;Acc:HGNC:25967]
## logFC AveExpr t P.Value adj.P.Val B
## MLPH -4.539281 9.487324 -15.77794 1.198422e-19 2.539912e-16 34.35447
## FOXA1 -3.997576 5.321805 -15.63213 1.693275e-19 2.539912e-16 34.02400
## FOXC1 3.774299 7.847670 14.27359 4.745041e-18 4.745041e-15 30.82304
## AGR2 -5.139503 7.915368 -12.95288 1.488409e-16 1.116307e-13 27.48939
## SFRP1 4.144489 8.377033 12.32914 8.162432e-16 4.897459e-13 25.83515
## SIDT1 -2.605517 6.013343 -12.23990 1.045491e-15 5.227453e-13 25.59416
## [1] "'is.NA' value mismatch: 2973 in current 0 in target"