Searching the gene database? Expect a wait, says expert

By Pete Young
Tuesday, 23 July, 2002

Gene and protein database searches that now take a few minutes could require hour-long waits within a few years, according to a search engine expert.

That scenario is inevitable if more efficient search algorithms aren't applied, says Dr Hugh Williams, head of a software research group at RMIT's School of Computer Science and Information Technology.

Williams and his team are moving to commercialise a more efficient search engine, named Cafe. The project has received pre-seed funding from RMIT, and the team is now looking for outside investors.

Their work could be important because DNA sequence databases such as GenBank are doubling in size every 13 months. At that rate they are outpacing the capabilities of commonly used search algorithms like BLAST.

"I believe current search techniques are unsustainable," says Williams.

Sequence database searches that took 10 seconds to produce answers three years ago are now taking several minutes, he says.

Simple extrapolation suggests those times could blow out to half an hour or an hour in another three years unless corrective measures are applied.
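
The extrapolation can be checked with a few lines of arithmetic. The sketch below assumes that search times keep multiplying by the same factor over each three-year period, and reads "several minutes" as two to three minutes. Both are illustrative assumptions rather than figures supplied by Williams, and the function name is hypothetical.

```python
def extrapolate(past_seconds, now_seconds, periods=1):
    """Project a search time forward, assuming the growth factor seen over
    the last period repeats in each future period (an assumption)."""
    factor = now_seconds / past_seconds
    return now_seconds * factor ** periods

# "Several minutes" is read here as 2 to 3 minutes (an assumption).
for now_minutes in (2, 3):
    projected = extrapolate(10, now_minutes * 60)
    print(f"{now_minutes} min today -> {projected / 60:.0f} min in three years")
```

Under those assumptions the projection lands at roughly 24 to 54 minutes, in line with the half-hour-to-an-hour range quoted.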

Williams believes his team will contribute to the solution.

In the past 12 years, the group has built up an international reputation for its search engine prowess. One leading internet search engine company, Google, trolls for new recruits on the RMIT campus and regularly offers jobs to members of Williams' team, he says.

The group's specialty lies in building faster, more efficient engines and addressing scalability issues so database growth doesn't translate into longer search times.

"Our skills lie with speeding up the search process using data compression and algorithms designed to process data in faster ways."

About seven years ago, the group began applying the techniques it originally developed for search engines to what it saw as the fruitful area of protein and DNA databases.

Williams says search algorithms like BLAST treat databases as one huge text file and don't scale well, meaning they become rapidly less efficient as database sizes increase.
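
The difference between scanning a database as one long text and consulting an index can be sketched in a few lines. The example below is a generic illustration, not Cafe's or BLAST's actual method: it uses exact substring matching in place of real similarity scoring, and the function names, the toy sequences and the choice of three-letter fragments are all assumptions made for the example.

```python
from collections import defaultdict

def linear_scan(database, query):
    """Check every sequence in turn; the work grows with database size."""
    return [i for i, seq in enumerate(database) if query in seq]

def build_kmer_index(database, k=3):
    """Map each k-letter fragment to the set of sequences containing it."""
    index = defaultdict(set)
    for i, seq in enumerate(database):
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].add(i)
    return index

def indexed_search(database, index, query, k=3):
    """Use the index to shortlist candidates, then verify only those."""
    kmers = [query[p:p + k] for p in range(len(query) - k + 1)]
    if not kmers:  # query shorter than k: fall back to scanning
        return linear_scan(database, query)
    candidates = set.intersection(*(index.get(km, set()) for km in kmers))
    return sorted(i for i in candidates if query in database[i])

# Toy data, invented for illustration.
database = ["ACGTACGT", "TTTTGGGG", "GGACGTTT", "CCCCCCCC"]
index = build_kmer_index(database)
print(linear_scan(database, "ACGT"))
print(indexed_search(database, index, "ACGT"))
```

Both calls return the same matches, but the indexed version only inspects sequences that share every fragment with the query, so most of the database is never read. The index costs extra storage and has to be built in advance.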

Nor is faster hardware the answer. Some observers believe gene and protein information databases are exploding faster than Moore's Law (governing the growth of affordable processing power) can handle.
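
A rough calculation shows why hardware alone may not keep up. The sketch below assumes a search whose cost grows in proportion to database size, data that doubles every 13 months as quoted above for GenBank, and processing power that doubles every 18 months. The 18-month figure is a commonly cited reading of Moore's Law and is an assumption here, as is the function name.

```python
def relative_slowdown(months, data_doubling=13.0, hardware_doubling=18.0):
    """Factor by which a size-proportional search slows down after `months`,
    when data and hardware both grow exponentially (assumed rates)."""
    return 2 ** (months / data_doubling - months / hardware_doubling)

for years in (1, 3, 6):
    factor = relative_slowdown(years * 12)
    print(f"after {years} years: {factor:.1f}x slower on up-to-date hardware")
```

On these assumed rates the same search is about 1.7 times slower after three years even on the newest hardware, and the gap keeps widening.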

The techniques developed by Williams' group will make searches 100 times faster than current algorithms permit and will scale up as databases continue to expand.