castanyes blaves

Random ramblings about some random stuff, and things; but more stuff than things -- all in a mesmerizing and kaleidoscopic soapbox-like flow of words.

3/18/2009

The blurring line between protein alignments and genomic alignments

As more and more data is poured into the public sequence databases, an increasingly detailed map is being drawn that relates sequences from different individuals or different species, mainly in the form of what the field calls protein or genomic (DNA) alignments. One could call this twenty-first century molecular cartography.

All references to molecular evolution this year should be accompanied by an analogy to Darwin's work, so here is how it works in this case: Darwin's next generation machine, the Beagle, went on a journey to accumulate an enormous variety of specimens that, when compared together, allowed Darwin to draw the first phylogenetic tree.

Contrary to what one would think, alignments with more sequences are easier to resolve than ones with fewer sequences, at least when the added sequences increase the detail of the phylogenetic tree relating them, which is almost always the case. This is allowing researchers to generate genomic alignments for phylogenetically dense groups of genomes while, in parallel, the protein alignments for the corresponding protein coding genes in these genomes are combined with those of more distantly related species. This dense taxon sampling is making the distinction between protein alignments and genomic alignments less and less obvious.

As an example, one can use the highly conserved protein coding exons as anchor points in the different chromosomes that define stretches of conserved synteny among the genomes, and then align these DNA stretches together with a genomic aligner. At the same time, one can use the exon boundaries defined in the DNA sequences of the coding genes to help infer the right protein alignment at the amino acid level.
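To make the anchoring idea a bit more concrete, here is a minimal sketch in Python. It is not how any particular genomic aligner does it: the `Anchor` record, the `chain_anchors` function and the `max_gap` threshold are all made up for illustration, and it assumes each anchor is a pair of orthologous exon positions on the forward strand of two genomes. It simply walks the anchors in order along genome A and starts a new block whenever collinearity with genome B breaks.

```python
from typing import NamedTuple


class Anchor(NamedTuple):
    """Hypothetical record: one conserved exon found in two genomes."""
    chrom_a: str
    pos_a: int
    chrom_b: str
    pos_b: int


def chain_anchors(anchors, max_gap=100_000):
    """Group collinear exon anchors into candidate syntenic blocks.

    Simplifying assumptions: forward strand only, and a fixed maximum
    distance (max_gap, an arbitrary value) between consecutive anchors.
    """
    blocks = []
    current = []
    for anchor in sorted(anchors, key=lambda a: (a.chrom_a, a.pos_a)):
        if current:
            prev = current[-1]
            collinear = (
                anchor.chrom_a == prev.chrom_a
                and anchor.chrom_b == prev.chrom_b
                and anchor.pos_b > prev.pos_b  # same order in both genomes
                and anchor.pos_a - prev.pos_a <= max_gap
                and anchor.pos_b - prev.pos_b <= max_gap
            )
            if not collinear:
                # Collinearity is broken: close the block, start a new one.
                blocks.append(current)
                current = []
        current.append(anchor)
    if current:
        blocks.append(current)
    return blocks


def block_span(block):
    """Return the DNA stretch covered by a block in each genome."""
    first, last = block[0], block[-1]
    return (
        (first.chrom_a, first.pos_a, last.pos_a),
        (first.chrom_b, first.pos_b, last.pos_b),
    )


if __name__ == "__main__":
    # Invented coordinates, only to show the shape of the input.
    example = [
        Anchor("chr1", 1_000, "chr4", 5_000),
        Anchor("chr1", 20_000, "chr4", 26_000),
        Anchor("chr1", 900_000, "chr7", 3_000),
    ]
    for block in chain_anchors(example):
        print(block_span(block))
```

Each span returned by `block_span` is a pair of DNA stretches that could then be handed to a genomic aligner. A real pipeline would also have to deal with inversions, duplications and more than two genomes, which this sketch ignores.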

A new opportunity is now arising to exploit the information that is contained separately in the genomic and protein alignments by combining them into a single object representing both. New methods are being developed that will exploit the landmarks that both genomic and protein alignments have correctly placed to converge into a single intertwined alignment object. This new type of alignment has in a way already been represented in closely related prokaryotic genomes. But prokaryotic genomes are less interesting for some topics, like alternative splicing, repetitive elements or recombination hotspots. Combined genomic and protein alignments will bring together elements of detail that have so far been scattered, and hopefully some new and brilliant mechanistic explanations of the innards of molecular evolution will arise from them, much as they arose from Darwin's specimens nearly two centuries ago.

So a deluge of sequencing data is not really a problem but an opportunity.