Random ramblings about some random stuff, and things; but more stuff than things -- all in a mesmerizing and kaleidoscopic soapbox-like flow of words.
Illumina short-read transcriptome data has the potential to help solve many problems with curating gene models and the genomic sequence in C. elegans. This is an initial look at the data and some examples of how it can be used.
So far, C. elegans gene models:
36% - fully confirmed by ESTs
48% - partially confirmed
14% - no transcript confirmation
RNAseq data -- different worms than the genome, so some polymorphisms expected -- 200bp inserts, 36bp paired end reads
MAQ or cross-match to genomic or transcript sequences
6137 new splice junctions (6% increase)
Jumped from 70000 to 98000 splice junctions.
3x as many polyA sites
80 possible new coding genes
V-shaped coverages -- validation against traces, then:
- Detected sequencing error, correction needed for the reference
- Detected alternative haplotype
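The V-shaped coverage idea can be sketched in a few lines. Presumably, with short reads and a strict aligner, reads overlapping a base where the sample differs from the reference tend not to align, so coverage drops towards that base and recovers afterwards. This is only a minimal sketch under that assumption: the function name `find_v_dips` and the `flank` and `drop` parameters are hypothetical and not taken from the talk.

```python
def find_v_dips(coverage, flank=36, drop=0.5):
    """Return positions that look like the bottom of a V in per-base coverage.

    coverage: list of read depths, one value per base (hypothetical input).
    flank:    distance to the "shoulders" of the V, roughly one read length.
    drop:     the bottom must be at most this fraction of the lower shoulder.
    """
    dips = []
    for i in range(flank, len(coverage) - flank):
        # Depth one read length away on each side of the candidate base.
        shoulder = min(coverage[i - flank], coverage[i + flank])
        window = coverage[i - flank:i + flank + 1]
        is_bottom = coverage[i] == min(window)
        if shoulder > 0 and is_bottom and coverage[i] <= drop * shoulder:
            dips.append(i)
    return dips


# Toy example: flat depth of 20 with a V centred on position 50.
cov = [min(20, 2 * abs(i - 50)) for i in range(100)]
print(find_v_dips(cov, flank=10))
```

Each candidate would still need checking against the traces, as in the notes above, to decide between a reference error and an alternative haplotype. A flat-bottomed dip would report every base of the plateau, which is acceptable for a first pass.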
Moving towards single-cell sequencing -- not sequencing in tiny cells but sequencing each cell in each developmental state in the worm. Moving towards RNA sequencing of C. briggsae and C. remanei.
Updated gene builds will be given to other projects. The next Ensembl Metazoa comparative genomics build may already have the modENCODE-updated C. elegans and D. melanogaster gene builds.
Labels: ensembl, genomics, nextgen sequencing, scientific talk
Social or behavioural disorders affect a quarter of individuals at some time during their lives; however, the molecules and mechanisms that mediate social cues, process their meaning, and initiate the corresponding behaviour are unknown. Instinctive social behaviours in mammals are thought to be largely promoted by pheromones: specialized olfactory cues secreted by one animal that directly influence the behaviour of another. Here I will describe studies into two instinctive, olfactory-mediated behaviours in mice: aggression and pup suckling.
Our studies found that aggression is promoted by specific protein pheromones excreted in male urine. These activate specialized, finely tuned sensory neurons in the noses of other males, resulting in a robust aggressive behaviour in the recipient. Our genomic and functional characterization of the gene family encoding these pheromones reveals an extraordinary scope for information-coding. I will describe our recent efforts to elucidate their social significance using cellular and behavioural techniques.
Pup suckling is a behaviour that is found in all mammals and is thought to be promoted by pheromones emitted by the mother and detected by the infant. We found that newborn mice do use maternal odour cues to promote suckling but, in contrast to the aggression pheromones, these cues are not genetically predetermined to elicit behaviour. Instead, the cues are complex, variable and learned by pups around birth. Suckling is subsequently initiated when the pup recognizes the same odour pattern in the context of its mother's nipple. The sensory neurons that mediate this are not specialized and are found in the noses of all mammals, including humans.
Together these studies demonstrate a diversity of mechanisms and molecules that underlie instinctive behaviours, and are a first step towards understanding the neural circuitry of social interaction.
Labels: scientific talk
Interesting case where a small skipped exon generates an extra copy of the Znf-C2H2 domain.
REST is an essential vertebrate transcription factor with very diverse roles. It has an important role in regulating the secretory pathway. Independently confirmed by 2 other groups.
RE1 array used to identify misregulated REST target genes in diseases like Huntington's.
RE1 "half" sites. Canonical/Transfac/Discovered motifs.
Different evolutionary pressures over RE1 sites seem to be associated with different functional subsets: common sites are less well conserved than unique sites. Unique sites need to be tissue specific, so they are bound to keep a general binding weakness to turn off binding in non-specific tissue (if I understand correctly).
Solexa sequencing is quite good at identifying high affinity motifs, but poor at low affinity motifs.
There is an in vivo hierarchy among RE1 sites for REST binding, and it can be discriminated at the DNA sequence level.
Labels: genomics, scientific talk
European Genotype Archive: Genome Wide Association Studies (GWAS) like WTCCC data and others. Only public information is publicly available, under very strict rules.
NHGRI GWAS will be imported in Ensembl: it's got manually curated data of high quality.
Links in Variation view: link "Phenotype Data (n)": situation will now improve in Ensembl.
Locus specific databases (LSDBs):
p53, ABO, collagen, albinism, cystic fibrosis, Alzheimer's disease, ... >700
The main aim is to be able to link the reference CDS sequence used by the biomedical community to the most up-to-date reference sequence in the genomics community. This mapping will allow clinicians to link all phenotype data on their end to the genomic data in the genomic community.
Political pros and cons have to be carefully handled and continuously explained. Ensembl's openness, existing infrastructure and visibility are the biggest selling points for having these dbs linked in a common LSDB resource.
will have LRG XML files and prettified HTML reports soon.
Labels: ensembl, genomics, scientific talk
All Ensembl gene predictions for all vertebrate species are based on experimental evidence:
- NCBI RefSeq proteins and mRNAs
- EMBL Nucleotide Sequence Archive
Aligning the evidence back to the gene prediction with Exonerate. Types of alignment results:
- added start
- longer region
- missing start
- non-matching start
- non-matching region
- shorter region
Exonerate has an exhaustive mode that takes a lot more time but fixes some of the mini-intron and mini-exon issues that sometimes occur. Exonerate cdna2genome is very useful for quality checking.
Genebuild now uses head-to-head alignments of genewise and exonerate, and takes the best in each case.
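The "take the best in each case" step can be illustrated with a small sketch. The talk did not say how "best" is decided, so the ranking used here (coverage of the evidence first, then identity) is purely an assumption, and the `Alignment` class and `pick_best` function are hypothetical names, not part of the Ensembl genebuild code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    evidence_id: str  # accession of the protein or cDNA evidence
    program: str      # "genewise" or "exonerate"
    coverage: float   # fraction of the evidence that aligned (0 to 1)
    identity: float   # fraction of identical residues (0 to 1)


def pick_best(alignments):
    """Keep one alignment per piece of evidence.

    Assumed ranking: higher coverage wins, identity breaks ties.
    """
    best = {}
    for aln in alignments:
        current = best.get(aln.evidence_id)
        key = (aln.coverage, aln.identity)
        if current is None or key > (current.coverage, current.identity):
            best[aln.evidence_id] = aln
    return best


candidates = [
    Alignment("P1", "genewise", 0.92, 0.99),
    Alignment("P1", "exonerate", 0.98, 0.97),
    Alignment("P2", "genewise", 1.00, 0.95),
    Alignment("P2", "exonerate", 1.00, 0.93),
]
for evidence_id, aln in pick_best(candidates).items():
    print(evidence_id, aln.program)
```

Comparing per piece of evidence, rather than choosing one program globally, is what lets each aligner win where it does better.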
Some cases are still difficult to get right with algorithmic solutions: this is where the curators are needed.
Labels: ensembl, gene prediction, genomics, scientific talk
Here is an example success story of using phylogenetic information to improve human gene annotation. What do you see wrong in this EnsemblCompara GeneTree?
The human gene prediction has been split into one third to the left and two thirds to the right. Some of the other species have the full length prediction, but some of the 2x and projected genomes also have this issue. This case was reported to the Havana team at the Sanger and they have now built a human and mouse full-length prediction for the gene (notice the Havana_genes in blue and dark green).
The next Ensembl/Havana merge will hopefully reflect this change but, right now, you need to activate the Havana_genes DAS track to see the most up-to-date Havana annotation. There is a good number of these fixed now in the highly loved genomes, aka human, mouse and zebrafish. But there is a second level of genomes that are not getting any manual annotation here but may be annotated somewhere else...
Labels: ensembl, genomics
Interesting to see all the buzz that the Benjamin Franklin Award has generated in the blogosphere, twittersphere, facebooksphere and any of the other spheres out there... I still think there is something we should try and resolve in open source bioinformatics, which is promoting Open Source software to create more awareness in the scientific community. We need to reconcile the promotion of modularity and generality with the fact that giving credit to the scientists who contribute to Open Source Bioinformatics software is still important. Projects that have built very modular and generic components may be doing a lot for the bioinformatics community at large but, at the same time, the less atomic and single-purpose your software is, the more difficult it is to publish it in a prominent scientific journal. The same goes for citing it in the downstream publications: very atomic programs are very successful in citation metrics, but infrastructure code is not. This means that well-designed, well-implemented and well-tested software is often not prominent enough for new people to notice, and too many bioinformaticians resort to their own glue code for building their bioinformatics infrastructure.
There wouldn't be anything wrong with rewriting your own code over and over again if it weren't that: (a) people spend too much time writing scaffolding code that will let them access what is really new and interesting in their project and (b) that code tends to be used and tested only internally and almost never reused by any other party unless it has been very well designed and documented --- hence the name scaffolding.
There is a really good chance now to build an infrastructure that brings up a terminal next to the next generation Petabyte-size data sources, using emerging "cloud" technologies. These technologies are already advanced in fields other than bioinformatics, so we can leverage what has already been done for us and make extensive use of it. These terminals don't need to be silly, and the community should provide in them as much prebuilt code as possible so that the new breed of bioinformaticians get used to having this software at their fingertips.
A few years ago all the effort was in building packages for different Linux distributions, so that people could easily install Open Source software on their in-house CPU clusters. I think we need to shift gears now to cloud software accessibility. The good news is that it seems everybody is happy with the common Ubuntu system as a start. I fear the proliferation of iPhone-like SDKs around that will make the existing bioinformatics software useless. In an era where everybody is acutely aware of governments having to pour our taxes into infrastructure that was already paid for, no one will like to see all existing bioinformatics software become a "toxic" or "legacy asset"!
Labels: genomics, nextgen sequencing, open source
A quarterly activity report for the different activities that take place under what can be considered "sequence" at the WTGC in Hinxton, Cambridge, UK:
* This is a personal blog. Things said here are not to be taken as official reports.
- HGNC has made great progress in resolving nomenclature for 130 cases where the community has a diversity of opinions and it's difficult to agree on something. Very good point in saying that in the Internet era it's better to give a gene a name that is distinct from common words that would otherwise clobber your Google search results. There is now a forum set up for different communities to use in discussions of gene family names.
- Havana has now started using RNAseq data to confirm new genes found in human and zebrafish that didn't have evidence before. One new feature is a "confirmed intron" for when paired Solexa reads bridge two exons, with an associated score for read depth. Confident this type of data will bring out many interesting new annotations that couldn't be found before, e.g. genes expressed in a given tissue during a window of a few hours in development. There are already a few examples in zebrafish.
- Wormbase has been working hard on compiling more data from modENCODE. Small but cool infrastructure achievement in having VMware images running for old releases that investigators can just pull and bring up on demand.
- Ensembl Genomes has been successfully testing the beta sites for Bacteria, Protists and the first Metazoa build. Another Metazoa build is in progress, with all the phylogenetics goodness of the 12 Drosophila genomes plus the vectors plus C. elegans and a few other outgroups. The modENCODE project is about to complete the re-annotation of gene models using CAGE data that will bring more precise gene starts for melano and elegans. Ensembl Genomes is still working together with Manchester and now the US to put together an Aspergillus resource that provides the best value for money to researchers. PombBase is also being pursued, with lots of labs interested in having it Ensemblified and ready to use.
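The "confirmed intron" feature from the Havana item above can be sketched roughly as follows. This is only an illustration under simple assumptions: exons and read alignments are inclusive (start, end) intervals on one strand, a pair "bridges" an intron when its two mates fall entirely inside adjacent exons, and the score is just the number of such pairs. The function name `confirmed_introns` and the data layout are hypothetical, not Havana's actual implementation.

```python
from collections import Counter


def confirmed_introns(exons, read_pairs):
    """Count read pairs whose mates land in adjacent exons.

    exons:      sorted, non-overlapping (start, end) intervals, inclusive.
    read_pairs: list of (mate1, mate2), each mate an aligned (start, end).
    Returns {(intron_start, intron_end): supporting_pair_count}.
    """
    def exon_index(interval):
        start, end = interval
        for idx, (ex_start, ex_end) in enumerate(exons):
            if ex_start <= start and end <= ex_end:
                return idx
        return None  # mate is not fully contained in any exon

    depth = Counter()
    for mate1, mate2 in read_pairs:
        a, b = exon_index(mate1), exon_index(mate2)
        if a is None or b is None:
            continue
        lo, hi = sorted((a, b))
        if hi - lo == 1:  # mates sit in neighbouring exons
            intron = (exons[lo][1] + 1, exons[hi][0] - 1)
            depth[intron] += 1
    return dict(depth)


exons = [(100, 200), (301, 400), (501, 600)]
pairs = [
    ((150, 185), (310, 345)),  # bridges the first intron
    ((160, 195), (320, 355)),  # bridges the first intron
    ((350, 385), (510, 545)),  # bridges the second intron
    ((110, 145), (150, 185)),  # both mates in the same exon
]
print(confirmed_introns(exons, pairs))
```

Pairs that skip an exon are ignored here; a fuller version would probably want to treat those as evidence for alternative splicing.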
Labels: ensembl, genomics