castanyes blaves

Random ramblings about some random stuff, and things; but more stuff than things -- all in a mesmerizing and kaleidoscopic soapbox-like flow of words.

3/09/2009

 

BioMart Slow Query Analysis -- Rhoda Kinsella -- European Bioinformatics Institute - EMBL

BioMart is a query-oriented data management system developed jointly by the Ontario Institute for Cancer Research (OICR) and the European Bioinformatics Institute (EBI).

Some queries were crashing the mart servers or taking a very long time, so any query that runs longer than 60 seconds now gets logged.
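The logging rule can be sketched in a few lines. This is not how the mart servers do it, only a minimal Python illustration: `run_logged`, the logger name and the `execute` callable are all made up for the example, and the only number taken from the talk is the 60 second threshold.

```python
import logging
import time

SLOW_QUERY_SECONDS = 60.0  # threshold mentioned in the talk

logger = logging.getLogger("slow_queries")  # hypothetical logger name


def run_logged(query_text, execute, threshold=SLOW_QUERY_SECONDS, clock=time.monotonic):
    """Run execute(query_text) and log the query if it took longer than threshold.

    `execute` is a stand-in for whatever actually runs the query.
    """
    start = clock()
    try:
        return execute(query_text)
    finally:
        # Logged even when the query fails, since crashes were part of the problem.
        elapsed = clock() - start
        if elapsed > threshold:
            logger.warning("slow query (%.1fs): %s", elapsed, query_text)
```

The `clock` parameter is only there so the timing can be faked when trying the function out.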

New hardware in v51 dropped the number of slow queries from 95000 to 3300. The count went back up to 12000 in v52, probably because the user base grew.

Something went wrong on Oct 13th, as a lot of slow queries happened that day. Another 2000 came in on Oct 19th.
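Spotting days like these is easy to automate once the slow queries are in a log. A rough sketch, assuming a made-up log format where each line starts with an ISO date (the real log format was not shown in the talk) and a made-up rule that a "spike" is any day above three times the median:

```python
from collections import Counter


def slow_queries_per_day(log_lines):
    """Count slow-query log entries per day.

    Assumes each non-empty line starts with a date such as 2000-01-02
    (hypothetical format).
    """
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        day = line.split()[0]
        counts[day] += 1
    return counts


def spike_days(counts, factor=3.0):
    """Return the days whose count exceeds `factor` times the median daily count."""
    if not counts:
        return []
    ordered = sorted(counts.values())
    median = ordered[len(ordered) // 2]
    return sorted(day for day, n in counts.items() if n > factor * median)
```

The factor of three is arbitrary; any threshold that separates a normal day from a bad one would do.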

The typical slow queries are one-go pull-downs: all of EMBL, MGI, UniprotSWISSPROT; all GO; all est and gnf data; all protein_feature PFAM and tfhmm data. There is a bit of HGNC filtering, but most of the time people want everything in one query with no filtering.

Conclusion: there is a fine line between using BioMart and fetching zipped files from an FTP server. Some users seem to prefer BioMart even though it will be much slower than going to the FTP site.

For v52, lots of the slow queries hit the snp_marts, with people sending queries that filter on lots of things at once.

All Xref EMBL, EntrezGene and protein_id: some genes have thousands of EMBL links, and multiplying those by thousands of protein_ids makes the queries slow.
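The multiplication is worth spelling out. In the toy Python sketch below, every EMBL link of a gene gets paired with every protein_id, which is what asking for both attributes in one unfiltered query amounts to. The gene name, the identifiers and the 2000 x 1000 figures are invented for illustration; they are not numbers from the talk.

```python
from itertools import product


def joined_rows(gene, embl_links, protein_ids):
    """Pair every EMBL link with every protein_id for one gene."""
    return [(gene, embl, prot) for embl, prot in product(embl_links, protein_ids)]


rows = joined_rows("geneA", ["E1", "E2", "E3"], ["P1", "P2"])
print(len(rows))  # 3 links * 2 protein_ids = 6 rows for a single gene

# Illustrative scale-up: 2000 EMBL links and 1000 protein_ids on one gene
print(2000 * 1000)  # 2000000 rows for that gene alone
```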

SNP52: strain polymorphism table with a lot of strain filters.

SNP52: transcript_variation (variation_feature_ids)

E!52: exon_transcript table + transcript_variation table + filterings.

Solutions:
  • remove duplication and fix the rows that hold only NULL values (e.g. gnf)
  • limits on external attributes
  • limits on est and gnf attributes
  • indexed expression tables
  • new hardware
Upcoming solutions:
  • increase result batch size?
  • merge 3 GO categories?
  • Remove unused tables?
  • Canned queries?
  • Stop user ability to re-send a query?
  • Keep analysing the logs and make it more automated and informative -- keep an eye on what is happening
  • Maybe limits on formats for "gimme all for X" --> goes to the FTP link
  • Maybe new FTP dumps for "gimme all attributes for Y+Z" --> goes to the FTP link
  • More complicated combinations --> cache them on a "dynamic FTP"
Lots of little things that together improve things a lot.
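To make the attribute limits and the "goes to the FTP link" idea concrete, here is a small Python sketch of a pre-flight check on a query. Nothing in it comes from BioMart itself: the function name, the limit of three and the attribute names are all assumptions picked for the example.

```python
MAX_EXTERNAL_ATTRIBUTES = 3  # hypothetical limit, not a BioMart setting

# Hypothetical names for the "external" attributes the limit applies to
EXTERNAL_ATTRIBUTES = {"embl", "entrezgene", "protein_id", "mgi", "go"}


def check_query(attributes, filters):
    """Decide what to do with a query before running it.

    Returns "ftp" for an unfiltered dump of external data, "reject" when
    too many external attributes are requested, and "run" otherwise.
    """
    external = [a for a in attributes if a in EXTERNAL_ATTRIBUTES]
    if external and not filters:
        # "gimme all for X" with no filtering: point the user at the FTP dump
        return "ftp"
    if len(external) > MAX_EXTERNAL_ATTRIBUTES:
        return "reject"
    return "run"
```

For example, `check_query(["gene_id", "embl", "go"], filters=[])` returns `"ftp"`, while the same attributes with an HGNC filter return `"run"`.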
