Gene Expression Differential Analysis based on Limma

Here, we present a couple of simple examples of differential analysis based on limma. In particular, we show how the design matrix can be constructed using different ‘codings’ of the regression variables. We also define a simple wrapper function that can help us remember the different limma steps.

This document also provides examples of how to embed equations into an R markdown document, and how to cross-reference those equations.

library(BS831)
library(Biobase)
library(limma)

Load the Data

We start by loading and pre-processing the necessary expression dataset.

## let us load the ExpressionSet, an already simplified version of the
## ..breast cancer dataset
data(renamedBreastDB)

#load(file.path(Sys.getenv("BS831"),"data/renamedBreastDB.rda"))

dat <- renamedBreastDB
head(pData(dat))

##          individual diseaseState
## GSM85500        T44     nonbasal
## GSM85504       T175     nonbasal
## GSM85493        T37     nonbasal
## GSM85496       T161     nonbasal
## GSM85510       T162     nonbasal
## GSM85509        T74     nonbasal

table(dat$diseaseState)

## 
## nonbasal    basal 
##       20       18

Differential Analysis based on `Limma`

When the regression variable is categorical (binary in this case), we can choose different (yet equivalent) ‘codings’. In particular, we can fit a standard model

\[ \begin{equation} y = \beta_0 + \beta_1 X_{group}, \label{eq:lm1} \tag{1} \end{equation} \]

where \(X_{group} = 0,1,\) if the observation is from a nonbasal- or a basal-type tumor, respectively. Alternatively, we can fit the following model

\[ \begin{equation} y = \beta_{\not{basal}} X_{\not{basal}} + \beta_{basal}X_{basal}, \label{eq:lm2} \tag{2} \end{equation} \]

where \(X_{\not{basal}} = 1,0\), if the observation is from a nonbasal- or a basal-type tumor, respectively. Conversely, \(X_{basal} = 0,1\), if the observation is from a nonbasal- or a basal-type tumor, respectively.

The corresponding models in limma are defined based on the two design matrices design1, for model \(\eqref{eq:lm1}\), and design2, for model \(\eqref{eq:lm2}\), as follows.

design1 <- model.matrix( ~ diseaseState, data = pData(dat) )
colnames(design1) <- c("baseline","basal")      # let's simplify column names  
design2 <- model.matrix( ~ 0 + dat$diseaseState )
colnames(design2) <- levels( dat$diseaseState ) # ditto

print(unique(design1)) # showing only one instance of each class

##          baseline basal
## GSM85500        1     0
## GSM85480        1     1

print(unique(design2)) # ditto

##    nonbasal basal
## 1         1     0
## 21        0     1

The different designs require slightly different approaches to model fitting. In model \(\eqref{eq:lm1}\), we want to extract the p-value corresponding to the \(\beta_{group}\) parameter. In model \(\eqref{eq:lm2}\), we care about the difference \(\beta_{basal} - \beta_{\not{basal}}\).

## model (1)
fit1 <- lmFit(dat,design1) # fitting of linear model
head(fit1$coefficients)

##          baseline     basal
## AGR3     9.241425 -5.476069
## SCGB2A2  9.049974 -4.020716
## AGR2    10.349870 -5.139503
## CALML5   6.314069  2.564793
## KRT23    6.348392  3.193484
## FOXA1    7.215394 -3.997576

fit1 <- eBayes(fit1)       # pooling of variance across like-genes
fit1table <- topTable(fit1,coef="basal",adjust="BH",number=Inf,sort.by="P")
head(fit1table)

##       affy_hg_u133_plus_2 hgnc_symbol
## MLPH          218211_s_at        MLPH
## FOXA1           237086_at       FOXA1
## FOXC1        1553613_s_at       FOXC1
## AGR2            209173_at        AGR2
## SFRP1         202035_s_at       SFRP1
## SIDT1           219734_at       SIDT1
##                                                                   description
## MLPH                         melanophilin [Source:HGNC Symbol;Acc:HGNC:29643]
## FOXA1                      forkhead box A1 [Source:HGNC Symbol;Acc:HGNC:5021]
## FOXC1                      forkhead box C1 [Source:HGNC Symbol;Acc:HGNC:3800]
## AGR2                    anterior gradient 2 [Source:HGNC Symbol;Acc:HGNC:328]
## SFRP1 secreted frizzled-related protein 1 [Source:HGNC Symbol;Acc:HGNC:10776]
## SIDT1 SID1 transmembrane family, member 1 [Source:HGNC Symbol;Acc:HGNC:25967]
##           logFC  AveExpr         t      P.Value    adj.P.Val        B
## MLPH  -4.539281 9.487324 -15.77794 1.198422e-19 2.539912e-16 34.35447
## FOXA1 -3.997576 5.321805 -15.63213 1.693275e-19 2.539912e-16 34.02400
## FOXC1  3.774299 7.847670  14.27359 4.745041e-18 4.745041e-15 30.82304
## AGR2  -5.139503 7.915368 -12.95288 1.488409e-16 1.116307e-13 27.48939
## SFRP1  4.144489 8.377033  12.32914 8.162432e-16 4.897459e-13 25.83515
## SIDT1 -2.605517 6.013343 -12.23990 1.045491e-15 5.227453e-13 25.59416

## model (2)
fit2 <- lmFit(dat,design2) # fitting of linear model
head(fit2$coefficients)

##          nonbasal    basal
## AGR3     9.241425 3.765357
## SCGB2A2  9.049974 5.029258
## AGR2    10.349870 5.210367
## CALML5   6.314069 8.878863
## KRT23    6.348392 9.541876
## FOXA1    7.215394 3.217818

## let's define the contrast matrix we need
contrast.matrix <- makeContrasts((basal-nonbasal),levels=design2)
fit2 <- contrasts.fit(fit2,contrast.matrix)
## notice that the contrast coefficient takes the same values as the basal coefficient in design1
head(fit2$coefficients)

##          Contrasts
##           (basal - nonbasal)
##   AGR3             -5.476069
##   SCGB2A2          -4.020716
##   AGR2             -5.139503
##   CALML5            2.564793
##   KRT23             3.193484
##   FOXA1            -3.997576

fit2 <- eBayes(fit2)       # pooling of variance across like-genes
fit2table <- topTable(fit2,coef=1,adjust="BH",number=Inf,sort.by="P")
head(fit2table)

##       affy_hg_u133_plus_2 hgnc_symbol
## MLPH          218211_s_at        MLPH
## FOXA1           237086_at       FOXA1
## FOXC1        1553613_s_at       FOXC1
## AGR2            209173_at        AGR2
## SFRP1         202035_s_at       SFRP1
## SIDT1           219734_at       SIDT1
##                                                                   description
## MLPH                         melanophilin [Source:HGNC Symbol;Acc:HGNC:29643]
## FOXA1                      forkhead box A1 [Source:HGNC Symbol;Acc:HGNC:5021]
## FOXC1                      forkhead box C1 [Source:HGNC Symbol;Acc:HGNC:3800]
## AGR2                    anterior gradient 2 [Source:HGNC Symbol;Acc:HGNC:328]
## SFRP1 secreted frizzled-related protein 1 [Source:HGNC Symbol;Acc:HGNC:10776]
## SIDT1 SID1 transmembrane family, member 1 [Source:HGNC Symbol;Acc:HGNC:25967]
##           logFC  AveExpr         t      P.Value    adj.P.Val        B
## MLPH  -4.539281 9.487324 -15.77794 1.198422e-19 2.539912e-16 34.35447
## FOXA1 -3.997576 5.321805 -15.63213 1.693275e-19 2.539912e-16 34.02400
## FOXC1  3.774299 7.847670  14.27359 4.745041e-18 4.745041e-15 30.82304
## AGR2  -5.139503 7.915368 -12.95288 1.488409e-16 1.116307e-13 27.48939
## SFRP1  4.144489 8.377033  12.32914 8.162432e-16 4.897459e-13 25.83515
## SIDT1 -2.605517 6.013343 -12.23990 1.045491e-15 5.227453e-13 25.59416

Regardless of the ‘coding’, the results are the same.

## compare the results from the two 'codings'
all.equal(fit2table[,"t"],fit1table[rownames(fit2table),"t"])

## [1] TRUE

plot(fit2table[,"t"],fit1table[rownames(fit2table),"t"],xlab="design2",ylab="design1")

Defining a wrapper function

Since running limma requires memorizing several commands that are not easy to remember, here we define a simple wrapper function that will relief us from remembering them all. Incidentally, this is also (a simplified version of) the wrapper function defined in CBMRtools, and it corresponds to the design2, model \(\eqref{eq:lm2}\), shown above.

It should be noted that this simple wrapper function does not allow you to specify additional covariates (confounders). In that case, you will have to either define a more advanced wrapper (any volunteer?), or revert back to the direct use of limma commands.

run_limma <- function(
    eset,                # the expression set
    treatment,           # pData variable name of the phenotype
    cond,                # label of the 'treatment' group
    control,             # label of the 'control' group
    sort.by="P",         # sort output by given criterion
    decreasing = FALSE,  # sort order
    cutoff = NA          # extract only genes above/below 'sort.by' cutoff
)
{
    if ( ncol(fData(eset))<1 ) 
        stop( "empty fData" )
    
    cat( "running limma\n" )
    test_name <- paste(control, "_vs_", cond, sep = "")
    treatmentvec <- pData(eset)[, treatment]
    eset <- eset[, treatmentvec %in% c(cond, control)]
    treatmentvec <- factor( pData(eset)[, treatment] )
    design <- model.matrix( ~ 0 + treatmentvec )
    colnames(design) <- levels( treatmentvec )
    fit <- lmFit(eset, design)
    command_str <- paste("makeContrasts(", "(", cond , "-", control, ")", 
                         ",levels = design)", sep = "")
    contrast.matrix <- eval( parse(text=command_str) ) 
    fit2 <- contrasts.fit( fit, contrast.matrix )
    fit2 <- eBayes(fit2)
    
    ## extract full table
    fit2.table <- topTable(fit2, coef=1, adjust="BH", number=length(fit2), sort.by=sort.by)

    ## extract genes above/below cutoff
    if( !is.na(cutoff) ) {
        if (decreasing) {
            fit2.table <- fit2.table[fit2.table[,sort.by] > cutoff,]
        }
        else {
            fit2.table <- fit2.table[fit2.table[,sort.by] < cutoff,]
        }
    }
    return(fit2.table)
}

Now, let us test the newly defined function on our data.

fit3table <- run_limma(dat,treatment="diseaseState",cond="basal",control="nonbasal")

## running limma

head(fit3table)

##       affy_hg_u133_plus_2 hgnc_symbol
## MLPH          218211_s_at        MLPH
## FOXA1           237086_at       FOXA1
## FOXC1        1553613_s_at       FOXC1
## AGR2            209173_at        AGR2
## SFRP1         202035_s_at       SFRP1
## SIDT1           219734_at       SIDT1
##                                                                   description
## MLPH                         melanophilin [Source:HGNC Symbol;Acc:HGNC:29643]
## FOXA1                      forkhead box A1 [Source:HGNC Symbol;Acc:HGNC:5021]
## FOXC1                      forkhead box C1 [Source:HGNC Symbol;Acc:HGNC:3800]
## AGR2                    anterior gradient 2 [Source:HGNC Symbol;Acc:HGNC:328]
## SFRP1 secreted frizzled-related protein 1 [Source:HGNC Symbol;Acc:HGNC:10776]
## SIDT1 SID1 transmembrane family, member 1 [Source:HGNC Symbol;Acc:HGNC:25967]
##           logFC  AveExpr         t      P.Value    adj.P.Val        B
## MLPH  -4.539281 9.487324 -15.77794 1.198422e-19 2.539912e-16 34.35447
## FOXA1 -3.997576 5.321805 -15.63213 1.693275e-19 2.539912e-16 34.02400
## FOXC1  3.774299 7.847670  14.27359 4.745041e-18 4.745041e-15 30.82304
## AGR2  -5.139503 7.915368 -12.95288 1.488409e-16 1.116307e-13 27.48939
## SFRP1  4.144489 8.377033  12.32914 8.162432e-16 4.897459e-13 25.83515
## SIDT1 -2.605517 6.013343 -12.23990 1.045491e-15 5.227453e-13 25.59416

all.equal(fit2table[,"t"],fit3table[rownames(fit2table),"t"])

## [1] "'is.NA' value mismatch: 2973 in current 0 in target"

Stefano Monti

Load the Data

Differential Analysis based on Limma

Defining a wrapper function

Differential Analysis based on `Limma`