There is always a limit to the number of variables that can be used in high-dimensional graphical modeling. This is typically determined by the number of samples available as well as the methodology. In some cases, it may be suitable to build networks on curated genes of interest. In other cases we may need to to filter out genes due some failure to meet a requirement or select genes based on favorable property. We provide some useful functions for variable filtering and selection that consider multiple networks.

library(shine)

We use a toy expression set object with three subtypes to illustrate…

data(toy)
dim(toy)
Features  Samples 
     150       30 
head(pData(toy))
   subtype
S1       A
S2       A
S3       A
S4       A
S5       A
S6       A
table(toy$subtype)

 A  B  C 
10 10 10 
exprs(toy)[1:5,1:5]
           S1         S2         S3          S4         S5
G1 -0.2354423 -1.0229070  1.3398810 -0.45969696  2.5326985
G2  0.8372641  0.8283583 -0.6982707  0.21500515 -1.5961633
G3  0.2556321 -0.2623000 -0.6208904 -0.07742454 -0.2845315
G4 -0.3441312  0.5300648 -1.2360177  0.68617017 -1.5084982
G5  1.1543899  0.3910137 -0.4389481 -0.48118378 -1.5107490

Variation Filtering

Graphical modeling considers the co-variance of variables across samples. Therefore, variables must have non-zero variance to be useful. Additionally, when learning multiple networks, the same variable must have non-zero variance within each subtype. Here is a function used to find variables with non-zero variance within every subtype.

[1] "G1" "G2" "G3" "G4" "G5" "G6"
# Edit in zero variance in at least one subtype
exprs(toy)["G2", toy$subtype == "A"] <- 0

genes.filtered <- keep.var(toy, 
                           column="subtype", 
                           subtypes=c("A", "B", "C"), 
                           fn=var)

head(genes.filtered)
[1] "G1" "G3" "G4" "G5" "G6" "G7"

Variable Selection

Even after filtering, there still may be too many variables. One method to reduce the number of variables is to rank them by some favorable property and select the top x amount. When finding structural constraints, we rely on biweight midcorrelation which is more reliable when variables have a high median absolute deviation. Since there are multiple networks, we give each network an opportunity to contribute a high-ranking variable in a turn-based system until the limit has been reached.

genes.selected <- rank.var(toy, 
                           column="subtype",
                           subtypes=c("A", "B", "C"), 
                           limit=30, 
                           genes=genes.filtered, 
                           fn=mad)

head(genes.selected)
[1] "G52"  "G109" "G115" "G10"  "G148" "G117"
reactable(rankings,
          compact=TRUE, 
          fullWidth=TRUE,
          defaultPageSize=30,
          striped=TRUE,
          showPageSizeOptions=FALSE)