Tree Decomposable Models for Computational Biology
Many computational biology problems are difficult (e.g., NP-hard). In applications, however, they are often require to have accurate and yet efficient solutions, presenting a great challenge in computation. My research, while solving individual problems in computational biology and bioinformatics, trys to address the general issue of effectively modeling biological objects and systems so that upon such models, fast algorithms can be developed with desired accuracy. The work has been based on tree decomposable graphs that allow to apply tree decomposition techniques to achieve fast (e.g., linear-time) practical algorithms on graph models of small tree width.
Tree decomposable graphs are graphs with small tree width. Rooted at the deep graph minor theorems by Robertson and Seymour, tree decomposition and tree decomposition based techniques are sophisticated yet effective in solving problems on graphs of small tree width. As many optimization problems, NP-hard on general graphs, can be solved efficiently on trees, tree decomposition allows many such problems to be solved in time O(ct n) on graph of tree width t.
My current research investigates computational biology problems that can be formulated as graph-theoretic problems. They usually have natural and small parameters that can potentially correspond to the tree width of graphs. My research work has investigated a number of important problems in genomics, structural genomics, proteomics, and systems biology. Some of these problems can be directly modeled with graphs of small tree width, while others can be done through biological and statistical analyses on the data or by non-trivial parameterized reductions to monadic second order logic, yielding graphs of small tree width at the same time. Some of my projects are:
- Non-coding RNA genes finding through RNA secondary structure search. The core part of the search is structure-sequence alignment between the profiled structure and the target sequence. The RNA structure can be profiled with conformational graphs that almost always are of small tree width. The target sequence can be preprocessed with the structure profile to result in a sequence graph. The structure-sequence
alignment is essentially the subgraph isomorphism problem between the structure graph and the sequence graph. We apply this method to identify yeast and plant telomerase RNA genes [8][9][14][15][17][18][20][21[22].
- Proteins identification based on mass spectrometry data. MS-based
peptide identification can be either de novo or by database search. Both methods can be formulated as the problem of finding the longest antisymmetric path on the spectrum graph of the new protein. Spectrum graphs are constructed from mass spectrometry data in which mass peaks are presented as vertices that are connected with edges if the mass difference equals the mass of some amino acid. Such graphs are of small tree width. In the case of MS data containing noise and post translational modifications, it becomes important to find short signficant peptides in the new protein sequence to use them as tags to filter protein database for efficient search. In this case, all-pair longest antisymmetric paths are needed [7][16].
- Protein tertiary structure prediction via protein threading. The threading is to align a target protein sequence to a database of structure templates. The structure-sequence alignment resembles the task in RNA structure search. It can be shown that conformational graphs for protein structures are of small tree width as well. The tree decomposable graph model allows the protein folding not only to be based on the backbone but also to simultaneously consider side-chain packing to achieve higher prediction accuracy [4][5][8][10].
- Phylogentic network inference by haplotyping [12].
- Biological pathway inference by pathway alignment [1][3].
- Rapid ab initio single RNA folding [11].
- etc.