|UCLA Computer Science Department|
Bioinformatics at UCLA, and other bioinformatics-related links
In relational database systems the index is generally a second-class construct: users cannot explicitly use an index. (In fact, the keyword INDEX is not even defined in the SQL2 (SQL92) standard.) The principle that "indexes should be used but not seen" has been followed for decades, and is often justified as necessary in order to avoid the complexities introduced by explicit access paths and navigational queries.
We review arguments for and against this principle, and for making the index a first-class construct in relational database systems. For large and complex databases, such as those arising in bioinformatics, the second-class status of indexing can be in conflict with its importance. The case for a first-class index appears strongest for situations like these.
We investigate ways to incorporate first-class indexing into the relational database model, surfacing indexes as functionals. This investigation gives insights about the relational model, and also suggests ways for relational databases to support applications like bioinformatics, for which indexing is of central importance.
The ability to perform genome-wide and cross-genome data analyses can dramatically reduce the time required for new biological discoveries. This raises important issues in bioinformatics database research involving data representations and data integration. Essential biological datatypes (such as sequence locations) and tools (such as the popular BLAST sequence alignment tools) are not supported in traditional database systems, which has forced researchers to represent biological knowledge counterintuitively, and implement codes for data operations.
This paper introduces BioIndexing, a conceptual infrastructure for representing and managing biological information that permits this information to be queried within a modern database system. The paper also describes an implementation -- BLASTgres, an extension of the PostgreSQL database system -- that provides indexable bioinformatics datatypes and joinable BLAST alignment. The sequence location datatype in BLASTgres is of specific interest, since it is indexable, essential to the sequence alignment information produced by BLAST, and pervasive in existing biological information.
We argue the significance of a fundamental shift in bioinformatics, from in-the-small to in-the-large. Adopting a large-scale perspective is a way to manage the problems endemic to the world of the small -- constellations of incompatible tools for which the effort required to assemble an integrated system exceeds the perceived benefit of the integration. Where bioinformatics in-the-small is about data and tools, bioinformatics in-the-large is about metadata and dependencies. Dependencies represent the complexities of large-scale integration, including the requirements and assumptions governing the composition of tools. The popular make utility is a very effective system for defining and maintaining simple dependencies, and it offers a number of insights about the essence of bioinformatics in-the-large. Keeping an in-the-large perspective has been very useful to us in large bioinformatics projects. We give two fairly different examples, and extract lessons from them showing how it has helped. These examples both suggest the benefit of explicitly defining and managing knowledge flows and knowledge maps (which represent metadata regarding types, flows, and dependencies), and also suggest approaches for developing bioinformatics database systems. Generally, we argue that large-scale engineering principles can be successfully adapted from disciplines such as software engineering and data management, and that having an in-the-large perspective will be a key advantage in the next phase of bioinformatics development.
Multiple Sequence Alignment (MSA) is a fundamental tool of bioinformatics. Row-Column MSA (RC-MSA) methods such as CLUSTALW produce tabular alignments that are now familiar. However, these methods have a number of shortcomings, including difficulty of understanding the result, high computational complexity, questionable assumptions, and other artifacts (such as poor handling of prefix- and suffix-alignment).
Partial Order Alignment was proposed recently as an alternative approach to MSA. Partial Order MSA (PO-MSA) methods produce a partial order --- a labeled directed acyclic graph --- that includes the input sequences as subgraphs. This approach has its own set of strengths and shortcomings. In particular, when the number of sequences grows large, POA can provide a clearer view of structure shared among these sequences. Issues involving complexity, assumptions, and artifacts all diverge from those for RC-MSA.
In this paper, we formalize PO-MSA as a graph problem, show that it corresponds to finding a Minimal Common Supergraph for a set of partial order graphs, and characterize how such a supergraph can be derived. This formalization offers some perspective on MSA generally, and also on particular tradeoffs between RC-MSA and PO-MSA.
Partial Order Alignment (POA) has been proposed recently as an alternative to conventional sequence alignment. Instead of the familiar tabular alignments, POA methods produce a partial order --- a labeled directed acyclic graph --- that includes the input sequences. In this paper, we formalize POA in terms of graphs, and show it corresponds to finding a Minimal Common Supergraph for a set of partial order graphs. We also show how this formulation may serve as a guide in the development of new alignment algorithms, and in addressing perennial problems, such as how to align alignments.