Multi-level analysis of molecular mechanisms underlying cancer development

Here I present some of my research as examples of how analytical approaches built at different levels can contribute to the study of cancers. The methods presented here can be applied on their own or serve as a complement to other approaches.

At the cellular level:

A structural systems biology approach for quantifying the phenotypic impact of missense mutations

Annotating the phenotypic outcomes of missense mutations is an important issue in the pursuit of personalised medicine. However, it is non-trivial to understand how the amino acid changes that missense mutations introduce in proteins affect a cell's behaviour, because complex information at both the protein and pathway levels has to be considered. To tackle this task, I have explored and validated an approach, PEPP (Phenotype Extrapolation via Pathway and Protein information), that integrates information from both levels.

To validate PEPP, I use the G2 to mitosis transition mechanism in the cell cycle of fission yeast (Schizosaccharomyces pombe) as a model, in which the systemic impact of a missense mutation can be quantified through the size of yeast cells. Using yeast cells that each contain a single missense mutation in a protein involved in the G2 to mitosis mechanism as benchmarks, PEPP quantifies the systemic impact of missense mutations in a manner that correlates well with the in vivo cell lengths. This work was awarded an RCSB PDB Poster Prize at ISMB in 2011, was presented as one of the Late Breaking Research contributions at the ISMB Conference 2012, and has recently been published in PLoS Comput. Biol. (PubMed link)
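The validation step above compares predicted impact with measured cell lengths. A minimal sketch of such a comparison is given below. It is not the PEPP implementation: the function name `pearson` and all numbers are hypothetical, chosen only to show how a correlation between predicted scores and in vivo cell lengths could be computed.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for illustration only (not data from the study):
# one predicted impact score and one measured cell length per mutant strain.
predicted_impact = [0.10, 0.35, 0.50, 0.80, 0.95]
cell_length_um = [14.2, 12.9, 11.8, 9.6, 8.1]

r = pearson(predicted_impact, cell_length_um)
print(f"Pearson r = {r:.3f}")
```

A coefficient close to +1 or -1 would indicate that the predicted scores track the measured phenotype closely. A rank-based coefficient could be substituted if only the ordering of mutants matters.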

At the pathway or network level:

A multivariate statistical model to identify analyte clusters that serve as useful biomarkers

Most conventional methods for biomarker studies focus on individual analytes expressed differently between patients and normal controls. An obvious disadvantage of this type of approach is its weakness in capturing collective relationships between multiple analytes.

To tackle this problem, I introduced a new statistical model, targeted analyte cluster (TAC, PubMed link), which considers the patterned behaviours of a small set of analytes. TAC can be used to analyse gene or protein expression data and can therefore be applied in a broad range of research areas. So far TAC has facilitated studies of bipolar disorder and schizophrenia by bringing new insights into their disease mechanisms (Neuropsychopharmacology 37: 364-377; Mol Psychiatry 16:848-859).
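The motivation for looking at analytes jointly can be shown with a toy example. The sketch below is not the TAC model, whose details are not given here. It only illustrates, with hypothetical measurements and a hypothetical helper `effect_size`, that a combination of two analytes can separate patients from controls far better than either analyte on its own.

```python
import math
import statistics


def effect_size(group_a, group_b):
    """Absolute standardised mean difference between two equal-sized groups."""
    pooled_sd = math.sqrt(
        (statistics.variance(group_a) + statistics.variance(group_b)) / 2
    )
    if pooled_sd == 0:
        raise ValueError("effect size is undefined when both groups are constant")
    return abs(statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical (analyte A, analyte B) measurements per subject.
controls = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.1), (4.0, 3.9)]
patients = [(1.0, 1.6), (2.0, 2.4), (3.0, 3.6), (4.0, 4.4)]

# Univariate view: each analyte considered separately.
d_a = effect_size([a for a, _ in controls], [a for a, _ in patients])
d_b = effect_size([b for _, b in controls], [b for _, b in patients])

# Joint view: the difference B - A captures the relationship between analytes.
d_diff = effect_size([b - a for a, b in controls], [b - a for a, b in patients])

print(f"A alone: {d_a:.2f}, B alone: {d_b:.2f}, B - A: {d_diff:.2f}")
```

In this constructed data set the two analytes vary widely between subjects, so neither separates the groups well, while their difference is nearly constant within each group and shifts between groups.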


nsSNPs are single nucleotide variations in the human genome that cause amino acid changes in proteins. They can affect an individual's susceptibility to cancers and response to drugs, so a good estimate of their tendency to cause cancers is an essential step towards personalised medicine.

To estimate the likelihood that an nsSNP is associated with cancers, Bongo (PubMed link) applies graph theory to predict the disease susceptibility conferred by a specific nsSNP, evaluating its impact on the amino acid network within the host protein. nsSNPs that significantly destabilise the internal networks of their host proteins are considered potential mutations contributing to cancer development. (A new web server for Bongo is currently under construction.)
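The general idea of comparing residue networks before and after a mutation can be sketched as follows. This is a simplified illustration under stated assumptions, not Bongo's algorithm: the graph definition (one representative point per residue, a 7 Å distance cutoff), the change measure (the fraction of contacts that differ) and all coordinates are hypothetical.

```python
import math
from itertools import combinations


def contact_graph(coords, cutoff=7.0):
    """Build an undirected residue graph from one coordinate per residue.

    coords maps residue id -> (x, y, z); two residues are linked when their
    representative points lie within `cutoff` angstroms (assumed threshold).
    """
    graph = {res: set() for res in coords}
    for a, b in combinations(coords, 2):
        if math.dist(coords[a], coords[b]) <= cutoff:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def edge_set(graph):
    """Return the edges of an undirected graph as a set of frozensets."""
    return {frozenset((a, b)) for a, nbrs in graph.items() for b in nbrs}


def network_change(wild_type, mutant, cutoff=7.0):
    """Fraction of contacts gained or lost between two structures (0 to 1)."""
    wt_edges = edge_set(contact_graph(wild_type, cutoff))
    mt_edges = edge_set(contact_graph(mutant, cutoff))
    union = wt_edges | mt_edges
    return len(wt_edges ^ mt_edges) / len(union) if union else 0.0


# Hypothetical coordinates: residue 4 moves away from its neighbours
# in the mutant model, so its contacts are lost.
wild_type = {1: (0.0, 0.0, 0.0), 2: (3.8, 0.0, 0.0),
             3: (7.6, 0.0, 0.0), 4: (3.8, 5.0, 0.0)}
mutant = dict(wild_type)
mutant[4] = (3.8, 9.0, 0.0)

print(f"network change = {network_change(wild_type, mutant):.2f}")
```

A mutation that leaves the contact network largely intact would score near 0, while one that rewires many contacts would score closer to 1. Any threshold for calling a change "significant" would have to be calibrated against known variants.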


At the protein level:

Analysing structural effects of non-synonymous single nucleotide polymorphisms (nsSNPs) on proteins

Estimating structural effects of nsSNPs on protein complexes through protein-protein docking

In addition to their structural impact on individual proteins, nsSNPs can also affect interactions between proteins. To identify nsSNPs that are likely to affect protein interactions, I worked with Dr Juan Fernandez-Recio on a rigid-body protein-protein docking program, pyDock (PubMed link). pyDock models the structure of protein complexes and thus helps to identify nsSNPs at protein interfaces. It uses either FTDOCK or ZDOCK to generate the complex conformations and has an optimised scoring function for selecting the best solutions. The performance of pyDock is comparable to, if not better than, that of contemporary rigid-body approaches.
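Once a model of the complex is available, locating nsSNPs at the interface reduces to a distance check. The sketch below assumes a simple definition of the interface (any atom within 5 Å of the partner protein) and uses hypothetical residue numbers and coordinates. It is not part of pyDock and does not use its API.

```python
import math


def interface_residues(receptor, ligand, cutoff=5.0):
    """Find residues with any atom within `cutoff` angstroms of the partner.

    receptor and ligand map residue id -> list of (x, y, z) atom coordinates.
    Returns a pair of sets: (receptor residue ids, ligand residue ids).
    """
    receptor_hits, ligand_hits = set(), set()
    for rec_id, rec_atoms in receptor.items():
        for lig_id, lig_atoms in ligand.items():
            if any(math.dist(a, b) <= cutoff
                   for a in rec_atoms for b in lig_atoms):
                receptor_hits.add(rec_id)
                ligand_hits.add(lig_id)
    return receptor_hits, ligand_hits


# Hypothetical docked model with one atom per residue, for brevity.
receptor = {10: [(0.0, 0.0, 0.0)], 11: [(20.0, 0.0, 0.0)]}
ligand = {5: [(3.0, 0.0, 0.0)], 6: [(40.0, 0.0, 0.0)]}

# Hypothetical nsSNP positions on the receptor.
nssnp_positions = {10, 11}

receptor_interface, ligand_interface = interface_residues(receptor, ligand)
nssnps_at_interface = nssnp_positions & receptor_interface
print(sorted(nssnps_at_interface))
```

The all-pairs loop is adequate for a sketch. For full-size structures a spatial index such as a grid or k-d tree would avoid comparing every atom pair.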

Because approximately two thirds of proteins in prokaryotes and four fifths of proteins in eukaryotes are multi-domain proteins, I also developed pyDockTET (PubMed link), a distance-restrained docking method for predicting the structural assembly of two-domain proteins. pyDockTET is not only helpful for identifying nsSNPs located at domain-domain interfaces but is also one of the few methods available for predicting the structural assembly of two-domain proteins.
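One way a distance restraint can be used in domain-domain docking is to discard poses in which the two domain termini are too far apart to be joined by the linker. The sketch below illustrates that idea only. The exact restraint used by pyDockTET is not described here, and the `Pose` class, the scores and the limit of about 3.8 Å per residue (the typical spacing between consecutive C-alpha atoms) are assumptions.

```python
import math
from dataclasses import dataclass

MAX_CA_SPACING = 3.8  # approximate C-alpha to C-alpha distance, in angstroms


@dataclass
class Pose:
    """A hypothetical docking pose for a two-domain protein."""
    name: str
    score: float           # assumed convention: lower is better
    c_term_domain1: tuple  # (x, y, z) of the last residue of domain 1
    n_term_domain2: tuple  # (x, y, z) of the first residue of domain 2


def satisfies_linker(pose, linker_residues):
    """True if a linker of the given length could span the termini."""
    # n linker residues give n + 1 C-alpha steps between the two termini.
    max_span = (linker_residues + 1) * MAX_CA_SPACING
    return math.dist(pose.c_term_domain1, pose.n_term_domain2) <= max_span


def filter_and_rank(poses, linker_residues):
    """Keep poses compatible with the linker, best score first."""
    kept = [p for p in poses if satisfies_linker(p, linker_residues)]
    return sorted(kept, key=lambda p: p.score)


# Hypothetical poses: A scores best but its termini are 50 angstroms apart.
poses = [
    Pose("A", -30.0, (0.0, 0.0, 0.0), (50.0, 0.0, 0.0)),
    Pose("B", -25.0, (0.0, 0.0, 0.0), (10.0, 0.0, 0.0)),
    Pose("C", -10.0, (0.0, 0.0, 0.0), (20.0, 0.0, 0.0)),
]

ranked = filter_and_rank(poses, linker_residues=5)
print([p.name for p in ranked])
```

A hard cutoff is the simplest choice. A softer alternative would add a penalty that grows with the violation, so that near-miss poses are down-weighted and not removed outright.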