Feature: Towards the $1000 genome

By Graeme O'Neill
Tuesday, 27 July, 2010

As a postdoctoral researcher at the University of California, San Francisco, in 1977, pioneering Australian molecular geneticist John Shine, now head of the Garvan Institute of Medical Research, took weeks to decode the 1000-nucleotide human insulin gene using the Maxam-Gilbert biochemical method. Any of today's next-generation sequencers could sprint through the same task in seconds.

Sanger sequencing, invented in the same year that Shine manually sequenced the rat and human insulin genes, has been the sequencing technology of choice for three decades, but in recent years has fallen orders of magnitude off the pace for major genome projects.

Dr Sue Forrest, CEO of the Australian Genome Research Facility, believes that many in the Australian research community have yet to benefit from the revolutionary potential of next-generation sequencing. The AGRF still makes extensive use of Sanger sequencing, which is much cheaper for smaller samples; it costs a few dollars compared with around $1000 for a next-generation run. It also provides a turnaround time of a few days, compared with several weeks for next-generation technology.

Sanger sequencers employ the dideoxy termination method to add nucleotides, one at a time, to cloned DNA fragments with a typical read length between 750 and 1000 bases, and read them into computer databases.

But Sanger sequencing will lose its advantage in read length this year, when 454 Life Sciences, an independent subsidiary of Roche since 2005, upgrades its GS FLX sequencer to provide a read length of 1000 bases. 454 is also about to launch a compact benchtop sequencer, the GS Junior, complete with a desktop computer optimised to process 454 sequence data.

Sequencing: The next generation

The AGRF's Brisbane node acquired Australia's first GS-FLX sequencer in May 2007, and stepped up the pace by acquiring a new Illumina Genome Analyzer II (GAII) sequencer late in 2008, which began providing contract sequencing in January last year.

The next-generation sequencing revolution was already afoot as the Human Genome Project headed for the tape in 2000. Massive investment in the Human Genome Project resulted in a very large-scale, but only modestly parallel, operation involving thousands of Sanger sequencers in special centres in the US, Japan, the UK and Europe. The project delivered its final draft, covering 90 per cent of the human genome, in 2003, a decade after it began in 1993.

Yet in 2000, a year before the first draft of the Human Genome Project was completed, Swedish company, Pyrosequencing AB, began marketing the machinery and reagents required to sequence short DNA fragments by a simple but ingenious new method.

Illumina, 454 and Sequencing by Oligonucleotide Ligation and Detection (SOLiD) sequencers differ in proprietary detail, but rely on the same basic methodology of cleaving chromosomes into short DNA fragments and amplifying each fragment in situ on a sequencing substrate.

The amplified material serves as a template for the enzyme-mediated assembly of labelled nucleotides that are scanned automatically, revealing the original, anonymous sequence. Computer algorithms then match up overlapping sequences to reconstruct the full sequence of the original chromosome.
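The overlap-matching step can be illustrated with a toy example. The Python sketch below is a minimal greedy merger, not the algorithm used by any particular sequencer's software: the function names, the three short reads and the minimum overlap of three bases are all assumptions chosen for illustration, and real assemblers must also cope with sequencing errors, repeats and reverse-complement reads.

```python
from __future__ import annotations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len` (an assumed cutoff).
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three hypothetical reads that tile a 13-base sequence
reads = ["ATGGCGT", "GCGTACG", "TACGGAT"]
print(greedy_assemble(reads))
```

With these three reads the sketch joins everything into a single 13-base sequence. The pairwise comparison used here becomes impractical at hundreds of millions of reads, which is one reason assembly is a substantial bioinformatics problem.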

The technique can be applied to virtually any nucleic acid starting material: genomic DNA, messenger RNA transcriptomes, cDNAs, cloned DNA from bacterial artificial chromosomes, small non-coding RNA extracts, or polymerase chain reaction (PCR) sequences.

In 2003, the year that the Human Genome Project was completed, 454 Life Sciences and the Baylor College of Medicine began a second human genome sequence. Two years later, in 2005, the partners announced they had completed sequencing the personal genome of DNA pioneer and Nobel laureate James Watson. It took one-fifth the time of the Human Genome Project.

Dr Mark Crowe, manager of AGRF's Brisbane node, says the 454 instrument can simultaneously sequence a million randomly cleaved DNA fragments with an average read length of 400 to 450 DNA bases, which provides adequate sequence overlap for rapid alignment and mapping.

"It allows us to do genome sequencing on organisms like bacteria, or transcriptome sequencing on non-model organisms, at relatively low cost," he says. "We use the 454 sequencer to do a lot of work on small-genome organisms, but when you get into the realm of higher-order genomes, the Illumina sequencer comes into its own."

Crowe says the AGRF will shortly be adding to its arsenal of sequencers with the impending arrival of two Illumina HiSeq 2000 instruments, each capable of producing more than 200 gigabase pairs per run - equivalent to 70 human genomes' worth of sequence - in just over a week.
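The "70 genomes" figure is simple division, as the short Python sketch below shows. The genome size of roughly three billion bases and the 30-fold coverage depth are assumptions added for illustration, not figures from the AGRF; the point is only that raw output and finished genomes are different measures, because each base of a genome is normally read many times over.

```python
HUMAN_GENOME_BP = 3.0e9  # assumed haploid human genome size, roughly 3 billion bases
RUN_OUTPUT_BP = 200e9    # "more than 200 gigabase pairs per run", as quoted above


def genome_equivalents(run_bp: float,
                       genome_bp: float = HUMAN_GENOME_BP,
                       coverage: float = 1.0) -> float:
    """Number of genomes one run could cover at the given read depth."""
    return run_bp / (genome_bp * coverage)


# Raw genome-lengths of sequence in one run
print(round(genome_equivalents(RUN_OUTPUT_BP)))                   # 67
# Genomes per run if each base is read 30 times (an assumed depth)
print(round(genome_equivalents(RUN_OUTPUT_BP, coverage=30), 1))   # 2.2
```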

The AGRF has already used Illumina sequencing technology to sequence the genome of the Great Barrier Reef coral, Acropora millepora (see 'Illuminating coral'), and is currently collaborating with bioinformaticians around the country on the painstaking task of stitching together the hundreds of millions of individual sequencing reads.

Crowe says that when the Illumina sequences are aligned and compared against the Human Genome Project's template, they are ideal for identifying and comparing single nucleotide polymorphisms (SNPs) and other variation, such as copy number variants, deletions, insertions and duplications.

"There's a lot of interest in human resequencing now that there are good reference sequences available," he says.

Genome projects

In April, the International Cancer Genome Consortium (ICGC) also released the first results from its massive project to sequence 500 genomes from each of 50 different types and subtypes of human cancer.

University of Queensland pancreatic cancer expert Professor Sean Grimmond leads the Australian contribution to the ICGC, which is funded by a $27.5 million grant from the National Health and Medical Research Council, the largest single grant the Council has ever awarded.

Grimmond says the project is already revolutionising cancer research. His team has already confirmed suspicions that the selected pancreatic cancer cell lines used for research differ significantly from the genomes of patients' pancreatic cancers.

According to Crowe, carcinoma profiling will be essential for the coming age of personalised, targeted cancer therapies. "Cancer researchers are looking for mutations that occur frequently in related cancer types, to distinguish causative mutations in specific cancers from mutations that are merely innocent bystanders.

"Another opportunity for Illumina sequencing is messenger RNA sequencing for transcriptome analysis in unfamiliar organisms for which no reference samples are available," he says.

"Illumina's GAII sequencer gives a lot more sequence reads than Roche's 454 sequencing, but they're much shorter, at around 100 bases. One of the challenges is that the pipelines and assembly software developed for the Human Genome Project are designed for long sequence reads with good overlaps.

"It's much harder to get rapid assembly of shorter sequences, so there's a huge bioinformatics effort dedicated to developing new software. It's quite challenging just to reassemble the mRNAs from a transcriptome. But when there's a reference genome like the Human Genome to map against, Illumina sequencing can give very good depth.

"Not only can you identify the gene that produced an mRNA, you can tell how the exons were combined to produce a particular splice variant. It's a different approach to doing transcriptomics with microarrays because you're getting a digital measure of gene expression, rather than just confirmation of hybridisation."
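The "digital measure" Crowe describes amounts to counting reads. As a minimal Python sketch, assuming each read has already been mapped to a gene, expression can be tallied and scaled to reads per million so that runs of different sizes are comparable. The function name and the gene labels are hypothetical, and real pipelines also correct for factors such as transcript length.

```python
from __future__ import annotations

from collections import Counter


def counts_per_million(read_assignments: list[str]) -> dict[str, float]:
    """Count reads per gene and scale to reads per million mapped reads."""
    counts = Counter(read_assignments)
    total = sum(counts.values())
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}


# Hypothetical mapping result: one gene label per mapped read
assignments = ["geneA", "geneA", "geneB", "geneA"]
print(counts_per_million(assignments))
```

Because each read is a discrete count, a gene seen three times as often yields three times the value, whereas a microarray reports a continuous hybridisation signal.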

With their short read lengths, next-generation sequencers are transforming the study of ancient genomes. DNA is very durable, but over time it fractures into very short segments. Crowe says ancient DNA poses a particular challenge for Sanger sequencing because of the difficulties in cloning these small fragments, a process not required for next-generation sequencing. The much greater coverage provided by next-generation sequencing also helps to differentiate single-base errors in these old, degraded samples from naturally occurring SNPs in the ancient genome.

Next-generation sequencers do have a higher error rate per base of sequence, says Crowe, but because they produce so many separate reads, individually amplified sequences can be compared against each other to distinguish process errors from natural variation.
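The reasoning here, that many overlapping reads let errors be outvoted, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than any production variant caller: the function name and the 20 per cent threshold are hypothetical, and real callers also weigh base quality scores and the expected ratios for inherited variants.

```python
from collections import Counter


def call_position(reference_base: str, observed: str, min_fraction: float = 0.2) -> str:
    """Classify one genome position from the bases seen in overlapping reads.

    `observed` holds one base per read covering the position. A mismatch seen
    in only a small fraction of reads is treated as a sequencing error; one
    seen in at least `min_fraction` of reads (an assumed cutoff) is flagged
    as a candidate variant.
    """
    depth = len(observed)
    if depth == 0:
        return "no coverage"
    counts = Counter(observed)
    # Most frequent base that differs from the reference, if any
    alt, alt_count = max(
        ((base, n) for base, n in counts.items() if base != reference_base),
        key=lambda item: item[1],
        default=(None, 0),
    )
    if alt is None or alt_count / depth < min_fraction:
        return "matches reference"
    return f"candidate variant {reference_base}>{alt}"


print(call_position("A", "AAAAAAAAAG"))  # one odd read in ten: likely an error
print(call_position("A", "AAAAAGGGGG"))  # half the reads disagree: likely real
```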

Even Moore's Law is struggling to keep pace with the challenges of storing and processing the swelling torrent of data from next-generation DNA sequencers. And the problem will only grow with the advent of third-generation sequencers with read lengths measured in tens of kilobases.

"While that should solve many of the assembly challenges, it means much higher throughput at lower cost, so the number of genome projects will increase," says Crowe. "Associated with other developments, it will be possible to do 100 bacterial genomes for $100 to $200 each, and when bacterial researchers can do experiments on that sort of scale, it's going to transform metagenomics," the analysis of complex bacterial communities in terrestrial, aquatic and marine systems.

Crowe says the $1000 personal genome is probably only 18 months to two years away - and that will place enormous demands on processing and storage capacity.

"The AGRF has tens of terabytes of storage capacity, but we'll soon need hundreds. We'll probably need petabyte and exabyte drives in the foreseeable future, which will really push the boundaries of storage capacity, let alone our ability to analyse the data."

This feature appeared in the May/June 2010 issue of Australian Life Scientist.
