BioMart is a query-oriented data management system developed jointly by the Ontario Institute for Cancer Research (OICR)
and the European Bioinformatics Institute (EBI).
Queries were crashing the mart servers or taking a long time, so any query running longer than 60 seconds was logged.
New hardware in v51 dropped the number of slow queries from 95000 to 3300. In v52 the count rose again to 12000, probably because of a larger user base.
Something went wrong on Oct 13th, when a lot of slow queries happened. Another 2000 came on Oct 19th.
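The 60-second rule and the per-day tallies can be sketched roughly as below. This is only a minimal sketch: the record format (ISO timestamp plus duration in seconds), the function names and the sample data, including the year, are assumptions made for illustration, not the actual mart logging setup.

```python
from collections import Counter
from datetime import datetime

SLOW_THRESHOLD_SECONDS = 60  # threshold mentioned in the talk


def is_slow(duration_seconds, threshold=SLOW_THRESHOLD_SECONDS):
    """Return True when a query ran longer than the threshold."""
    return duration_seconds > threshold


def slow_queries_per_day(records):
    """Count slow queries per calendar day.

    records: iterable of (iso_timestamp, duration_seconds) pairs.
    This record format is hypothetical.
    """
    counts = Counter()
    for timestamp, duration in records:
        if is_slow(duration):
            day = datetime.fromisoformat(timestamp).date()
            counts[day.isoformat()] += 1
    return counts


# Invented sample records, for illustration only (the year is arbitrary).
sample = [
    ("2008-10-13T09:15:00", 75.0),
    ("2008-10-13T11:02:00", 240.5),
    ("2008-10-13T11:30:00", 12.0),
    ("2008-10-19T08:00:00", 61.0),
]
spike_counts = slow_queries_per_day(sample)
```

Sorting the resulting counter by value is one simple way to spot days that stand out, such as the two spikes mentioned above.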
Typical slow queries: one-go pull-downs of EMBL, MGI and UniprotSWISSPROT; all GO; all est and gnf data; all protein_feature PFAM and tfhmm data. There is a bit of HGNC filtering, but most of the time people want everything in one query with no filtering.
Conclusion: there is a fine line between using BioMart and fetching zipped files from an FTP server. Some users seem to prefer BioMart even though it is much slower than going to the FTP site.
For v52, lots of the slow queries came from the snp_marts, with people sending queries that filter on lots of things.
All Xref EMBL, EntrezGene and protein_id: some genes have thousands of EMBL links, and multiplying these by thousands of protein_ids makes queries slow.
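To make the multiplication concrete, here is a small sketch of how the result size grows when two independent cross-reference lists are attached to the same gene. It assumes the result is a plain cross product, as the note above suggests; the identifiers and counts are made up, and the function names are hypothetical.

```python
from itertools import product


def joined_row_count(*link_counts):
    """Number of result rows when each list of links multiplies the others."""
    total = 1
    for count in link_counts:
        total *= count
    return total


def joined_rows(gene_id, embl_links, protein_ids):
    """Materialise the cross product for one gene (fine for tiny inputs only)."""
    return [(gene_id, embl, prot) for embl, prot in product(embl_links, protein_ids)]


# Tiny invented example: 3 EMBL links x 2 protein_ids gives 6 rows for one gene.
rows = joined_rows("GENE1", ["E1", "E2", "E3"], ["P1", "P2"])

# Hypothetical "thousands times thousands" case for a single gene.
big_case = joined_row_count(2000, 3000)
```

With these invented numbers a single gene already yields six million rows, which is why a limit on external attributes helps.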
SNP52: strain polymorphism table with a lot of strain filters.
SNP52: transcript_variation (variation_feature_ids)
E!52: exon_transcript table + transcript_variation table + filterings.
- remove duplication and solve the issue of rows with all NULL values (e.g. gnf)
- limits on external attributes
- limits on est and gnf attributes
- indexed expression tables
- new hardware
- increase result batch size?
- merge 3 GO categories?
- Remove unused tables?
- Canned queries?
- Stop user ability to re-send a query?
- Keep analysing the logs and make it more automated and informative -- keep an eye on what is happening
- Maybe limits on formats for "gimme all for X" --> goes to the FTP link
- Maybe new FTP dumps for "gimme all attributes for Y+Z" --> goes to the FTP link
- More complicated combinations --> cache them on a "dynamic FTP"
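The "dynamic FTP" idea could look something like the sketch below: a query is reduced to a stable key, and its result is stored as a file that later identical queries are served from. This is an assumption about how such a cache might work, not a description of anything BioMart does; `cache_key`, `fetch_cached`, the `.tsv` file layout and the stand-in `fake_run_query` are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_key(dataset, attributes, filters):
    """Build a stable key from a query description.

    Attribute order is kept because it decides the column order of the result.
    """
    canonical = json.dumps(
        {"dataset": dataset, "attributes": list(attributes), "filters": filters},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def fetch_cached(cache_dir, dataset, attributes, filters, run_query):
    """Return (result_text, was_cached), running the query only on a miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / (cache_key(dataset, attributes, filters) + ".tsv")
    if path.exists():
        return path.read_text(encoding="utf-8"), True
    result = run_query(dataset, attributes, filters)
    path.write_text(result, encoding="utf-8")
    return result, False


# Stand-in for the real (slow) query; records how often it is called.
calls = []


def fake_run_query(dataset, attributes, filters):
    calls.append(dataset)
    return "gene_id\tgo_id\n"


with tempfile.TemporaryDirectory() as tmp:
    query = ("hypothetical_dataset", ["gene_id", "go_id"], {"chromosome": "1"})
    first, first_hit = fetch_cached(tmp, *query, fake_run_query)
    second, second_hit = fetch_cached(tmp, *query, fake_run_query)
```

In this sketch the second identical request never reaches the database. A real setup would also need to expire the files whenever a new release comes out.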
Lots of little things that together improve things a lot.