There are ongoing discussions in our Campus about whether to use de novo clustering or HMM classification to update the family models in an orthology database from one release to the next.
One trend, which I don't favour, is to take the HMMs from the previous build, classify the current protein sets against them, and then re-run the alignment and tree-building steps.
The other trend, which I do favour, is to run new BLAST/phmmer searches for the new proteins, re-cluster them together with the existing hits in the updated graph, and then re-run the alignment and tree-building steps on the new set of family models.
People who argue in favour of the HMM classification procedure and want to convince me of its feasibility should give convincing answers to these questions:
Let's say you only have a few complete genome sequences with provisional gene predictions from your clade, but expect to have 20% more finished genomes with better gene predictions every two months. Over the course of a year you will have more than doubled the number of genomes. Do you trust the HMMs you are building today to represent the family models in two, four, six, eight, ten and twelve months' time?
Let's say that you have answered the previous question with a 'yes'. Where, then, do you draw the line and rebuild your HMMs from updated genome sets? Why don't you take Human, S.cerevisiae, A.thaliana, E.coli and P.furiosus and call that the ultimate representation of all family models on Earth? Do you think those families would be as good as the ones obtained using 80 genomes instead?
If so, then it's very easy for you: just use those 5 genomes as your family model set. But I don't think that is the way to go.
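To put numbers on the growth scenario from the first question, here is a minimal Python sketch. The starting count of 10 genomes is purely an assumption for illustration (the post only says "a few"), and `genome_counts` is a hypothetical helper, not part of any pipeline. Since "20% more every two months" can be read either as compounding on the current total or as a fixed increment on the starting set, the sketch computes both.

```python
def genome_counts(initial, growth=0.20, periods=6, compound=True):
    """Return the genome count at each two-month step over one year.

    compound=True : each step adds 20% of the current total.
    compound=False: each step adds 20% of the initial total.
    """
    counts = [float(initial)]
    for _ in range(periods):
        base = counts[-1] if compound else counts[0]
        counts.append(counts[-1] + growth * base)
    return counts


# Assumed starting point: 10 genomes (illustrative only).
compound = genome_counts(10, compound=True)
linear = genome_counts(10, compound=False)

for month, c, l in zip(range(0, 13, 2), compound, linear):
    print(f"month {month:2d}: compounding {c:5.1f}   fixed increment {l:5.1f}")
```

Under either reading the set more than doubles within the year: roughly 3x with compounding and 2.2x with a fixed increment. So HMMs built today would have been trained on less than half of the genomes they are later asked to classify.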
Labels: ensembl, genomics