It turns out that the Solexa machines are improving at such a pace that the calculations the 1000 Genomes Project originally made no longer hold. On the assumption that the read length and throughput of these NextGen machines would increase during 2008 and 2009, the project was given enough funding to fully sequence 1000 genomes from a panel of diverse ethnicities. Production capacity is currently led by the Sanger Institute, the Beijing Genomics Institute and the Broad Institute, followed by Baylor and WashU. The Max Planck offered to do some production in mid-2008, and Illumina, Roche and ABI are also contributing. Now the machines are better, which means that the project is going to aim for even more than 1000 genomes.
Where does the money come from? Well, it comes from biomedical research funding, as the aim of the project is to create a deep catalog of human genetic variation that will represent all rare shared variants in our species. This catalog will facilitate biomedical research by making it possible to survey phenotypes across all the sampled genotypes, and to link the two in order to identify the causes of human diseases and traits. Beyond this obvious goal, such a deep sampling of population genomics data will give us great clues about the evolutionary processes that have shaped our genome over the last hundreds of thousands of years. In particular, one will be able to see the polymorphism patterns along the chromosomes, and how these correlate with all the genetic features we are getting from another big project, the scaled-up ENCODE project. Add to that the comparative genomics information from closely related primates, to compare divergence vs polymorphism levels, and you have a winner!
Now that we even have a browser for the 1000 Genomes Project, you can get a glimpse of the kind of data the project will produce:
http://browser.1000genomes.org/Homo_sapiens/genesnpview?db=core;gene=ENSG00000128573;context=200
http://browser.1000genomes.org/Homo_sapiens/transcriptsnpview?db=core;transcript=ENST00000393489;context=200
Notice the "context=200" argument in the GeneSNPView and TranscriptSNPView URLs. Some people may mistakenly conclude that intronic sequences are depleted of variation, when one would expect most of the SNPs to be there. Well, they are there, but the context setting in the Gene and Transcript SNP Views restricts intronic SNPs to 100bp on either side of each exon by default. This allows for a more coding-centric view of variation which, according to the Ensembl HelpDesk tickets, is what people working in hospitals around the world really like about this view.
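To see how the context argument fits into these URLs, here is a minimal Python sketch that rebuilds the two links above from their parts. The helper name snp_view_url and its parameters are made up for this post and are not part of any Ensembl API. The wider value of 1000 in the last call is only an assumption; whether the browser accepts it is something you would have to try.

```python
# Hypothetical helper: assembles a GeneSNPView / TranscriptSNPView link
# in the same shape as the URLs quoted in this post.
BASE = "http://browser.1000genomes.org/Homo_sapiens"


def snp_view_url(view, id_param, stable_id, context=200):
    """Build a SNP view URL with semicolon-separated arguments.

    view      -- page name as it appears in the URL, e.g. "genesnpview"
    id_param  -- name of the identifier argument, "gene" or "transcript"
    stable_id -- Ensembl stable ID of the gene or transcript
    context   -- value for the "context" argument (200 in the links above)
    """
    params = [("db", "core"), (id_param, stable_id), ("context", str(context))]
    query = ";".join(f"{key}={value}" for key, value in params)
    return f"{BASE}/{view}?{query}"


# The two links from the post.
gene_url = snp_view_url("genesnpview", "gene", "ENSG00000128573")
transcript_url = snp_view_url("transcriptsnpview", "transcript", "ENST00000393489")

# Assumption: a larger context value widens the intronic window shown.
wide_url = snp_view_url("genesnpview", "gene", "ENSG00000128573", context=1000)

print(gene_url)
print(transcript_url)
print(wide_url)
```

Only the context argument changes between the first and the last link, so it is easy to compare the coding-centric view with a wider one for the same gene.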
I remember that when I joined the Ensembl project three years ago, these new machines were only a rumour, something that was secretly happening in a small science park in Great Chesterford, something that people at the time were dismissing simply as an undelivered promise: "Oh, but I've heard that they can only sequence 25 base pairs...", etc, etc. It's been like that for a lot of other scientific and technological promises:
- production plug-in electric/hybrid cars -- "Oh, but I've heard that they only have a range of a few miles..."
- inexpensive solar energy on the roof of your house -- "Oh, but I've heard that it only pays for itself after 25 years..."
- your very own robotic butler -- "Oh, but I've heard that it doesn't even know how to make a good latte..."