There is always a limit to the number of variables that can be used in high-dimensional graphical modeling. This is typically determined by the number of samples available as well as the methodology. In some cases, it may be suitable to build networks on curated genes of interest. In other cases we may need to to filter out genes due some failure to meet a requirement or select genes based on favorable property. We provide some useful functions for variable filtering and selection that consider multiple networks.

library(shine)

We use a toy expression set object with three subtypes to illustrate…

data(toy)
dim(toy)
Features  Samples 
     150       30 
head(pData(toy))
   subtype
S1       A
S2       A
S3       A
S4       A
S5       A
S6       A
table(toy$subtype)

 A  B  C 
10 10 10 
exprs(toy)[1:5,1:5]
           S1         S2         S3          S4         S5
G1 -0.2354423 -1.0229070  1.3398810 -0.45969696  2.5326985
G2  0.8372641  0.8283583 -0.6982707  0.21500515 -1.5961633
G3  0.2556321 -0.2623000 -0.6208904 -0.07742454 -0.2845315
G4 -0.3441312  0.5300648 -1.2360177  0.68617017 -1.5084982
G5  1.1543899  0.3910137 -0.4389481 -0.48118378 -1.5107490

Variation Filtering

Graphical modeling considers the co-variance of variables across samples. Therefore, variables must have non-zero variance to be useful. Additionally, when learning multiple networks, the same variable must have non-zero variance within each subtype. Here is a function used to find variables with non-zero variance within every subtype.

[1] "G1" "G2" "G3" "G4" "G5" "G6"
# Edit in zero variance in at least one subtype
exprs(toy)["G2", toy$subtype == "A"] <- 0

genes.filtered <- keep.var(toy, 
                           column="subtype", 
                           subtypes=c("A", "B", "C"), 
                           fn=var)

head(genes.filtered)
[1] "G1" "G3" "G4" "G5" "G6" "G7"

Variable Selection

Even after filtering, there still may be too many variables. One method to reduce the number of variables is to rank them by some favorable property and select the top x amount. When finding structural constraints, we rely on biweight midcorrelation which is more reliable when variables have a high median absolute deviation. Since there are multiple networks, we give each network an opportunity to contribute a high-ranking variable in a turn-based system until the limit has been reached.

genes.selected <- rank.var(toy, 
                           column="subtype",
                           subtypes=c("A", "B", "C"), 
                           limit=30, 
                           genes=genes.filtered, 
                           fn=mad)

head(genes.selected)
[1] "G52"  "G109" "G115" "G10"  "G148" "G117"
reactable(rankings,
          compact=TRUE, 
          fullWidth=TRUE,
          defaultPageSize=30,
          striped=TRUE,
          showPageSizeOptions=FALSE)
A
B
C
G52
1
1
2
G109
30
2
17
G115
3
11
1
G10
2
6
6
G148
14
3
93
G117
28
45
3
G71
4
15
7
G20
82
4
15
G44
6
8
4
G51
5
7
18
G25
11
5
8
G19
53
66
5
G12
7
31
30
G119
10
9
39
G68
66
24
9
G69
8
82
57
G67
57
10
14
G18
20
40
10
G37
9
52
12
G121
19
12
76
G63
65
35
11
G6
12
20
50
G40
67
13
23
G43
33
71
13
G49
13
22
88
G116
50
16
73
G59
27
64
16
G126
15
19
37
G17
18
17
40
G54
32
46
19