Contextualization of protein-protein interaction databases by cell line
If you just want the data, it’s easy to load into R…
library(magrittr)  # provides %>% and set_colnames()
library(ggplot2)   # labs(), theme(), scale_fill_viridis_c()
library(ggpubr)    # provides ggbarplot()
ppi <- read.delim("data/v_1_00/PPI-Context.txt", header=TRUE, sep="\t", stringsAsFactors=FALSE)
# Plot the 30 cell lines with the most annotated PPIs
data.frame(sort(table(ppi$cell_name), decreasing=TRUE)) %>%
  set_colnames(c("var", "freq")) %>%
  head(30) %>%
  ggbarplot(x="var", y="freq", fill="freq") +
  labs(title="", x="Cell Line Name", y="PPI") +
  scale_fill_viridis_c(option="inferno", begin=0, end=0.8) +
  theme(legend.position="none", axis.text.x=element_text(angle=45, hjust=1, size=12, face="bold"))
| PPI - Context (v1.0)
usage: ppictx.py [-h] [-r] [-d]
                 [-fh PATH_HIPPIE]
                 [-fp PATH_PUBTATOR]
                 [-fc PATH_CELLOSAURUS]

optional arguments:
  -h, --help            show this help message and exit
  -r, --run             run pipeline
  -d, --download        download raw data first
  -fh PATH_HIPPIE       path to downloaded Hippie data (optional)
  -fp PATH_PUBTATOR     path to downloaded Pubtator data (optional)
  -fc PATH_CELLOSAURUS  path to downloaded Cellosaurus data (optional)
In most cases you will need to download the latest bulk data first and then process it…
| PPI - Context (v1.0)
| Downloading raw data...
| Processing raw data
~ [PPI]
~ [PID -> CLA]
~ [CLA -> CID]
~ [PPI -> PID -> CLA -> CID]
In other cases, you might already have previously downloaded versions of the data to process…
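The two cases above can be sketched as a small Python helper that assembles the command line from the flags listed in the usage text. This is only an illustration: the helper name `build_command` and the file paths in the demo are hypothetical, and it assumes that `-r` is combined with `-d` or with the `-f*` path options, which the usage text does not state explicitly.

```python
import sys


def build_command(download=False, hippie=None, pubtator=None,
                  cellosaurus=None, script="ppictx.py"):
    """Assemble an argv list for ppictx.py using only flags from its usage text."""
    cmd = [sys.executable, script, "--run"]
    if download:
        # Fetch the latest bulk data before processing
        cmd.append("--download")
    # Optional paths to raw data that was downloaded earlier
    for flag, path in (("-fh", hippie), ("-fp", pubtator), ("-fc", cellosaurus)):
        if path is not None:
            cmd.extend([flag, str(path)])
    return cmd


if __name__ == "__main__":
    # Case 1: download the latest data, then process it
    print(" ".join(build_command(download=True)))
    # Case 2: process existing files (these paths are placeholders)
    print(" ".join(build_command(hippie="raw/hippie.txt",
                                 pubtator="raw/pubtator.txt",
                                 cellosaurus="raw/cellosaurus.txt")))
```

To actually launch the pipeline, the returned list could be passed to `subprocess.run(cmd, check=True)` from the directory containing `ppictx.py`.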
Cell lines that are used in research primarily for their efficiency as expression systems (e.g. HeLa, HEK, CHO, Sf9) may not be useful representations of cell-specific protein dynamics. It may therefore be useful to filter out PPIs annotated with these cell lines.
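A minimal Python sketch of such a filter is shown here, using only the standard library. The `cell_name` column comes from the R example above; the exact spellings in the exclusion set, the helper names, and the tiny demo table (including its `ppi` column) are assumptions for illustration and should be checked against your copy of the data.

```python
import csv
import io

# Assumed spellings; verify them against the cell_name values in your data.
EXPRESSION_HOST_LINES = {"HeLa", "HEK", "CHO", "Sf9"}


def read_tsv(handle):
    """Read a tab-separated file handle into a list of dict rows."""
    return list(csv.DictReader(handle, delimiter="\t"))


def drop_cell_lines(rows, excluded=EXPRESSION_HOST_LINES, column="cell_name"):
    """Keep rows whose cell-line name is not in `excluded` (case-insensitive, exact match)."""
    excluded_lower = {name.lower() for name in excluded}
    return [row for row in rows
            if row[column].strip().lower() not in excluded_lower]


if __name__ == "__main__":
    # Made-up rows for demonstration, not real PPI-Context records
    demo = io.StringIO("ppi\tcell_name\nP1\tHeLa\nP2\tJurkat\nP3\tSf9\n")
    kept = drop_cell_lines(read_tsv(demo))
    print([row["cell_name"] for row in kept])
```

Matching is on the whole name, so a name such as HEK293 is not removed by the entry HEK; add each variant you want to exclude to the set.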
Cellosaurus contains synonymous cell line entries, so some annotations, such as HEK (CVCL_M624) and HEK293 (CVCL_0045), refer to the same cell line. Users should be aware of the synonymous cell lines relevant to their interests and filter accordingly.
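One way to handle this is to collapse synonymous accessions onto a single identifier before counting or filtering. The sketch below assumes a user-maintained synonym map (seeded only with the HEK / HEK293 pair mentioned above) and a hypothetical accession column named `cell_id`; neither the map nor the column name is taken from the PPI-Context output format.

```python
from collections import Counter

# User-maintained map: synonym accession -> accession chosen as canonical.
# The single entry reflects the HEK / HEK293 example in the text.
SYNONYMS = {"CVCL_M624": "CVCL_0045"}


def canonical_accession(accession, synonyms=SYNONYMS):
    """Return the canonical accession; unmapped accessions pass through unchanged."""
    return synonyms.get(accession, accession)


def count_by_canonical(rows, column="cell_id", synonyms=SYNONYMS):
    """Count rows per cell line after merging synonymous accessions."""
    counts = Counter()
    for row in rows:
        counts[canonical_accession(row[column], synonyms)] += 1
    return counts


if __name__ == "__main__":
    # Made-up rows; `cell_id` is a hypothetical column name
    rows = [{"cell_id": "CVCL_M624"}, {"cell_id": "CVCL_0045"}]
    print(count_by_canonical(rows))
```

Which accession to treat as canonical is a choice left to the user; the map only needs to be consistent across an analysis.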