Desperately seeking storage
Wednesday, 06 November, 2002
At a Sydney conference last month, Kerri Hartland, the executive general manager of Commonwealth government agency Biotechnology Australia, revealed the results of a poll which asked Australian biotechs what they thought were their biggest challenges.
The usual suspects topped the list -- tax, staff, venture capital -- but the surprise omission was bioinformatics. Not even the most basic bio-IT requirement for just about any life science company, data storage and management, rated in the poll. According to one industry insider, though, that's not because storage isn't an issue, but more a case of "when you're starving, you don't worry about your next Ferrari".
Data storage is an issue, but complicating the issue is the fact that data storage requirements vary dramatically between different strata of the life sciences sector, according to market research company IDC Asia Pacific's regional director for life sciences and consulting, Philip Fersht. "The storage-thirst from some life science companies -- particularly those involved in proteomics or bio-content (GenBank, Inpharmatica, Incyte) are massive, because their requirements can be huge and often at short notice," Fersht says.
"For others, like stem cell, biomedical, there is probably little difference between their storage needs and those of a regular small business." The main trend in storage hardware is its plummeting cost -- a trend that is "great news for the industry", if it continues.
Of the major storage vendors in the region, which Fersht lists as IBM, Hitachi, EMC Corp, Sun Microsystems, Silicon Graphics and HP, the slowest in making a commitment to life sciences is probably Sun, he says.
IDC believes revenues generated by data storage sales into Australia's life sciences sector will grow by an average of just over 25 per cent a year between now and 2006. That lags well behind the 52 per cent growth expected in total sales of hardware, software and services in 2003-4 (of which the service component lies in triple digits). Overall, storage should generate estimated revenues in Australia for 2002 of $US100 million ($180.8 million), rising to $US270 million ($488 million) in 2006, according to IDC. In the larger Asia-Pacific picture, excluding Japan, the market researcher tips storage revenues to jump from about $US400 million ($723 million) this year to $US1 billion ($1.8 billion) in 2006.
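As a rough check on how the quoted revenue figures relate to an annual growth rate, the compound rate they imply can be worked out directly. This is only a sketch: it assumes four compounding years (2002 to 2006) and smooth year-on-year growth, neither of which IDC states, and the function name `cagr` is illustrative.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years) - 1


# Revenue figures quoted from IDC, in millions of US dollars.
# Assumption: 2002 -> 2006 is treated as four compounding years.
australia = cagr(100, 270, 4)
asia_pacific = cagr(400, 1000, 4)

print(f"Australia: {australia:.1%} a year")
print(f"Asia-Pacific excluding Japan: {asia_pacific:.1%} a year")
```

Under those assumptions both markets come out a little above 25 per cent a year.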
The largest data handling issue facing the life sciences sector is not simple storage so much as management. A never-ending stream of new data formats is presenting bioindustry with the challenge of integrating data sources so intelligent queries can be carried out. The arrival of every new technology, such as microarrays, brings with it a new data type so that "data we are dealing with now is fundamentally different than anything we were dealing with 10 years ago," says Australian National Genomic Information Service (ANGIS) CEO Mike Poidinger.
ANGIS provides access to gene sequence databases including GenBank as well as a range of database-related software and services. "In biology, we are asking the IBMs of this world how we can federatedly store and make intelligent queries of all our different types of data," says Poidinger.
The application challenge
The challenge is really at the application layer, where software applications must be modified to recognise each new data format before they can manipulate it. "Writing software is the slowest part of any product set," Poidinger says.
IBM's offering in this area is DiscoveryLink, a middleware product that promises single-query access to multiple existing databases, applications and search engines.
One solution to integrating diverse data sources is the creation of software "wrappers" which smooth the assimilation of new data sources by existing data structures. Software company geneticXchange, which grew out of Singapore but has a US headquarters and Australian office, has a data integration aid called discoveryHub which is specifically designed for the life sciences market. The company has developed software wrappers which allow discoveryHub to query and integrate about 70 public and private data sources, according to Asia-Pacific vice-president Lorraine Noffke.
One client is the Genome Institute of Singapore and the company is currently in pilot projects which it hopes will produce its first Australian customer.
Life science companies follow the same ground rules as truck manufacturers when working out the most cost-effective ways to store and manage digital data.
The difference lies in the larger volumes of data that many bioindustry players must store and the plethora of new data types they face compared to other business sectors. Their data volumes also tend to shoot up much faster, according to Wayne Glynne, business unit manager for IBM Storage, Australia and New Zealand.
"Data amounts can double every six to 12 months for a life science organisation that starts with a concept and goes into research phase," he says. "The challenge for them is to scale up without a cost blow-out."
Transferring large gulps of information out of databases across inadequate networks forms another potential bottleneck, according to ANGIS' Poidinger. The downloading and maintenance of ANGIS' local copy of GenBank, for example, "is a real issue," he says.
Updating it weekly requires a four gigabyte download session that takes up to five hours. Only one gigabyte of that is new material but the entire four gigabytes must be downloaded because GenBank supplies the update as a single file which includes data from the previous week.
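The figures above translate into a surprisingly low effective line speed. The sketch below works it out, assuming decimal gigabytes (10**9 bytes) and taking the full five hours as the transfer time; the function names are illustrative.

```python
# Assumption: a gigabyte is taken as 10**9 bytes.
GIGABYTE = 10**9


def effective_mbit_per_s(size_bytes: int, hours: float) -> float:
    """Average throughput of a transfer, in megabits per second."""
    seconds = hours * 3600
    return size_bytes * 8 / seconds / 10**6


def redundant_fraction(total_bytes: int, new_bytes: int) -> float:
    """Share of a download that repeats data already held locally."""
    return (total_bytes - new_bytes) / total_bytes


download = 4 * GIGABYTE      # weekly update file
new_material = 1 * GIGABYTE  # portion that is actually new

rate = effective_mbit_per_s(download, hours=5)
waste = redundant_fraction(download, new_material)

print(f"Effective rate: {rate:.2f} Mbit/s")
print(f"Redundant share of each update: {waste:.0%}")
```

On these numbers the link averages under two megabits per second, and three-quarters of each weekly session re-fetches data the local copy already holds.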
"So there is an issue with moving data from larger databases and maintaining them," says Poidinger. "Network speeds and cost are what it is about." From the vendors' point of view, given enough money, there is no problem with shifting vast volumes of information out of databases onto servers and workstations where researchers can make use of them. "Technology is accelerating in the area of data transfer between storage devices and servers," says IBM's Glynne.
"It is done over a storage area network (SAN) these days and, in the space of a couple of years, transfer speeds have doubled to two gigabits per second. That would handle the workloads of most organisations today and by 2004 we expect speeds to be approaching 10 gigabits per second."
An extra hiccup is the generation gap between 64-bit and 32-bit operating systems. The large overseas databases sit in 64-bit environments while organisations such as ANGIS are still using 32-bit systems. Communication is by means of a 64-bit emulation mode which imposes the penalty on ANGIS of not being able to handle files larger than two gigabytes without splitting them into smaller chunks.
"It is not a major issue but it is one more data-handling complication," says Poidinger.
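Splitting a large file into pieces that stay under a two gigabyte ceiling, and joining them again afterwards, might look something like the following. This is a generic sketch rather than a description of ANGIS' actual tooling; the function names, the `.partNNNN` naming scheme and the choice of staying one byte below 2 GiB are all assumptions.

```python
from pathlib import Path

# Assumption: stay strictly below 2 GiB so each part fits a signed 32-bit offset.
MAX_PART_BYTES = 2 * 1024**3 - 1


def split_file(source: Path, part_bytes: int = MAX_PART_BYTES,
               buffer_bytes: int = 1024 * 1024) -> list[Path]:
    """Copy `source` into numbered parts of at most `part_bytes` each."""
    parts = []
    index = 0
    with source.open("rb") as src:
        while True:
            part_path = source.with_name(f"{source.name}.part{index:04d}")
            written = 0
            with part_path.open("wb") as dst:
                # Read in small buffers so memory use stays low.
                while written < part_bytes:
                    block = src.read(min(buffer_bytes, part_bytes - written))
                    if not block:
                        break
                    dst.write(block)
                    written += len(block)
            if written == 0:
                part_path.unlink()  # nothing left to copy; drop the empty part
                break
            parts.append(part_path)
            index += 1
    return parts


def join_files(parts: list[Path], target: Path) -> None:
    """Concatenate the parts, in order, back into a single file."""
    with target.open("wb") as dst:
        for part in parts:
            with part.open("rb") as src:
                while block := src.read(1024 * 1024):
                    dst.write(block)
```

A four gigabyte update would come out as three parts under these settings: two at the ceiling and a small remainder.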
One of the larger vendors of externally based data storage, EMC Corp, is maintaining a watching brief on the Australian life sciences market at the moment, according to director of marketing and sales support Clive Gold. But Gold noted that a recent EMC initiative known as Centera is drawing interest from the life sciences community in the US.
Centera technology uses content-based addressing software to ease the storage management headache over data that must be retained for more than a decade. As such it appeals to drug discovery and pharma companies who need to manage large volumes of permanent data related to R&D, clinical trials and manufacturing processes over long periods.
The Centera storage architecture links data storage with the applications used to create, view, and store data. Its content address feature relieves applications of the need to keep track of a data file's physical location and simplifies long-term storage management, according to EMC.
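The general idea behind content addressing can be illustrated in a few lines: the address of an object is derived from its contents, so an application holds on to the address rather than a file path. The sketch below is a toy in-memory version and says nothing about how Centera itself is built; the `ContentStore` class and the use of SHA-256 are assumptions chosen for illustration.

```python
import hashlib


class ContentStore:
    """Toy in-memory store that addresses objects by a hash of their content."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # Assumption: SHA-256 stands in for whatever digest a real product uses.
        address = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same address, so it is stored once.
        self._objects.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        data = self._objects[address]
        # Re-hashing on read shows whether the stored bytes have changed.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("stored object no longer matches its address")
        return data


store = ContentStore()
address = store.put(b"clinical trial batch record")
duplicate = store.put(b"clinical trial batch record")

print(address == duplicate)  # same content, same address
print(store.get(address))
```

Because the address depends only on the bytes, the store is free to move an object between media without the application needing to know, which is the property that matters for long retention periods.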
Another way the life sciences differ in their use of data storage is the need for long-term retention. Tax rules encourage corporate Australia to think in terms of seven-year cycles for financial data. But a drug development pipeline stretching over 12 years and more imposes a longer scale on life sciences data storage. To offset cost increases as more and more data is stored, older data can be migrated to progressively less expensive media than disk, starting with optical and ending with tape. The danger is that archiving data to tape prematurely can create problems for users who still need to access it.
A variation on the cost-reduction theme is the arrival of newer tape technologies over the passage of time which allow the same amount of data to be stored more cheaply. Vendors provide special software, such as Tivoli's TSM (Tivoli Storage Manager), as tools to avoid disk wastage and ensure migrated data remains accessible to software applications.
Goodbye terabytes, hello brontobytes
Back when Celera Genomics was winning the race to sequence the human genome, it filled its databases with 70 terabytes of data, according to market research firm IDC.
But a terabyte isn't what it used to be. Earlier this year IBM demonstrated in the lab that it could cram a terabyte of data onto a single linear digital tape cartridge. It might be another eight years before that ultra-capacity cartridge will become routinely available.
The demand is already there. The size of databases needed to hold the complexities of the proteomics era will be a thousand times larger than Celera's genome-cracking effort, which punches them into the petabyte zone.
Weighing in at one million gigabytes, a petabyte is a seriously large number. IBM isn't saying when it plans to bring out a one petabyte tape cartridge. When it does, however, it will still have plenty of goals to shoot for. In ascending order they will be the exabyte, the zettabyte, the yottabyte and the brontobyte -- each one 1000 times larger than its predecessor.
To put brontobytes into perspective, each one equals one billion trillion megabytes or the number one followed by 27 zeroes worth of bytes.
If a storage vendor of the future ever does come out with a one brontobyte disk, Celera's entire 70 terabyte database would fill far less than a trillionth of it.
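The scale of these units is easier to see with the arithmetic written out. The sketch below uses decimal steps of 1000 and takes a brontobyte as 10**27 bytes, the figure given above; "bronto" is an informal name rather than a standard prefix, and the variable names are illustrative.

```python
# Each step up the ladder is 1000 times the previous one.
PREFIXES = ["kilo", "mega", "giga", "tera", "peta",
            "exa", "zetta", "yotta", "bronto"]
BYTES = {name: 1000 ** (i + 1) for i, name in enumerate(PREFIXES)}

celera = 70 * BYTES["tera"]   # Celera's database, per IDC
proteomics = 1000 * celera    # "a thousand times larger"

bronto_in_mega = BYTES["bronto"] // BYTES["mega"]
share = celera / BYTES["bronto"]

print(f"Proteomics-era database: {proteomics // BYTES['peta']} petabytes")
print(f"Megabytes in a brontobyte: 10**{len(str(bronto_in_mega)) - 1}")
print(f"Celera's share of a brontobyte disk: {share:.0e}")
```

A database a thousand times the size of Celera's works out at 70 petabytes, while the same 70 terabytes would occupy only a few hundred-trillionths of a one brontobyte disk.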