When p is greater than n

By Kate McDonald
Friday, 19 January, 2007

Just as the technology enabling biological investigation has exploded over the past two decades, so has the amount of data that technology produces. Getting sensible information out of that mountain of data is a challenge facing all researchers, not least those who went into the wet end of science to avoid the horrors of mathematics in the first place.

At CSIRO Mathematical and Information Sciences (CMIS), however, that challenge is being taken up with gusto. The 120-odd researchers working within the division come from many areas of science, including molecular biology, computer science, mathematics and statistics. Within the division is a group comprising 40 people researching biotechnology and imaging, led by Dr David Mitchell.

Mitchell's group covers three areas, all slightly different in application but sharing a strong statistical approach. There are two statistical bioinformatics groups: one focused on agriculture and led by Dr David Lovell, and the other on health, led by Dr Bill Wilson. The third area is biotech imaging, led by Dr Pascal Vallotton.

It is the statistical approach to bioinformatics that this group concentrates on, rather than the computationally intensive work traditionally associated with the discipline. As Mitchell characterises it, statistical bioinformaticians work on any data that fits the "p bigger than n" format - that is, many more variables than subjects.
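To make that format concrete, here is a minimal sketch in Python (synthetic numbers, not CSIRO data) of the shape a typical microarray dataset takes: a few dozen subjects, each measured on tens of thousands of variables.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 40, 20000                 # 40 subjects, 20,000 gene-expression variables
    X = rng.normal(size=(n, p))      # expression matrix: one row per subject
    print(X.shape)                   # (40, 20000) - p dwarfs n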

It may sound rather dry to those new to the area, but "you have to get past the mathematical phobia," Mitchell says. "Mathematics gives you some fantastic insights into different ways of doing things that are not apparent to you as a biologist."

Mitchell himself did his PhD with CSIRO on rotavirus before moving to Switzerland to work on viral diagnostics and then industrial enzymes. He has a strong commercial background, having set up a lab supplies company and then a fledgling biotech company, and was the founding CEO of Cryosite, the cryogenic storage facility.

"Bioinformatics is very broad," he says. "In bioinformatics you have all of the traditional sequencing stuff - blasting and blatting. You have [biological] diversity so you have databases that have bacteria and insects. You have statistical bioinformatics and statistical genetics, which is about linkage. You have the whole microarray type of area, which is a combination of statistics and traditional bioinformatics and then you have a whole area around protein folding, pathway analysis, digital life.

"But they are all arbitrary definitions. We don't do protein folding, we don't do biodiversity, we don't do much sequencing style bioinformatics or comparative genomics. We do statistical bioinformatics and that's mainly around microarray style data, scanning SNP data, proteomics, metabolomics - any data that fits in the format 'p>n' ."

Mitchell says that, from a data-centric point of view, all of the areas his group studies share similar characteristics on which to work. Beginning with designing analysis methods for microarray data, the group is now working in areas such as single nucleotide polymorphism (SNP) chip data analysis, broader discovery biology, neurobiology, biological markers - any area in which analysis of massively multivariate data is required.

"This has been tremendously productive because we are now working with people who look at things fundamentally differently from biologists and other bioinformaticians, because they have a statistical and quantitative background," he says. "Many bioinformaticians have either computer science or biology backgrounds - most molecular biologists do not have quantitative skills. That's why we did biology!

"Rather than just being consumers of technology we are now starting to say this is a really interesting platform - the Affymetrix platform, SNP chips, for example - how can we change the information on that chip to do something different? All of a sudden things that were not possible become possible."

Data is ubiquitous and it is the analysis that will make the difference, he says. "Think about the stock market - everybody is pretty much working on the same information, there's more of it than you know what to do with, and it's the people who can really use it better, or who have outside information, that are better off."

The microarray data mountain

Mitchell stresses that his is a research group, not a service group - that's what the 'R' in CSIRO stands for, after all. Having said that, a lot of what the group researches and develops is aimed at helping discovery science wade through that data mountain.

One tool CSIRO has developed that many would be aware of is GeneRaVE, a statistical technique developed specifically for microarray data to identify sets of genes important in various diseases. GeneRaVE makes it possible to analyse the massive amounts of data produced by contemporary microarray experiments by building classifiers from small sets of genes rather than the hundred or so commonly arising from high-throughput techniques.

For example, GeneRaVE is being used to study paediatric acute lymphoblastic leukaemia, which features six subtypes. CSIRO has been able to develop a classification system of only seven genes that distinguish those subtypes, a major advance on the hundreds of genes associated with the disease.

GeneRaVE has since given rise to NetRaVE, which can identify the networks of genes around these classifier genes in the dataset. Both techniques are being applied to several areas using microarray data: to develop simpler clinical diagnostics, to understand the mode of action of candidate drugs in toxicogenomics, and to identify subpopulations of patients who respond differently to therapies by looking at their SNP profiles.

Mitchell calls the approach his group takes to analysing data 'sparse'. "Sparse means that when we look at our analysis techniques for classifying cancers, for example, compared to other techniques we might use three genes and they might use 50."
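GeneRaVE itself is CSIRO's own technique and its internals aren't spelled out in this article, but the flavour of a sparse classifier can be conveyed with an off-the-shelf analogue: L1-penalised logistic regression, which drives most gene coefficients to exactly zero and keeps only a handful. A minimal sketch on synthetic data (the sizes and penalty strength here are illustrative assumptions, not CSIRO's method):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    n, p = 80, 5000                        # 80 samples, 5000 candidate genes
    X = rng.normal(size=(n, p))
    y = (X[:, 0] - X[:, 1] + X[:, 2] > 0).astype(int)   # three genes carry the signal

    # The L1 penalty zeroes out most coefficients, leaving a sparse gene signature.
    model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
    model.fit(X, y)
    selected = np.flatnonzero(model.coef_)
    print(f'{selected.size} genes selected out of {p}:', selected)

The contrast Mitchell draws - three genes instead of 50 - is exactly the effect of tightening such a sparsity penalty.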

Mitchell and his colleague Bill Wilson, who trained as a molecular biologist concentrating on gene regulation before moving into bioinformatics, are currently working on developing a general purpose vertebrate chip for discovery biology. With microarray platforms adding more and more oligonucleotides every year on a chip with the same surface area, there is a concurrent increase in the available data, to the extent that it can become overwhelming.

"Standard transcriptional profiling chips? 1.35 million oligos," Mitchell says. "Tiling arrays? 6.5 million, but physically the size of your thumb. All of a sudden, that's a lot of real estate. But what can you do with it?"

Wilson says the approach his team is taking is to look at the chip in a different way, a "genome-wide analysis for genome-wide measurements".

"We don't see it as a genome on a chip for measuring physical genes that have been transcribed," Wilson says. "We can start looking at is as data points - 1.3 million data points and what we can do with those data points."

As Mitchell explains, if you fed those data points through GeneSpring you would get a list of the top 20,000 oligos. "How useful is that? With ours you'd get four, or ten. You can actually do something with that."

For the vertebrate chip it won't matter which particular animal you are looking at due to the similarity in genes, Wilson says. "What we know about vertebrate genomes at the moment is that it's the gene regulation that's really different between the vertebrates, not the make-up of the genes. There seems to be a little subset of really unique genes for each organism but they are tiny. Again it will be a combination of these oligos that will give us the answers. It will be a tool that will allow people working on marginalised organisms to have access to microarrays."

Mitchell says the philosophy behind the all-purpose chip is that it is a place to get started. "At the moment the way that you tackle a new species is that you go out and make an EST [expressed sequence tag] library, but there are 5000 different sorts of mammals and 46,000 vertebrates and everyone would like to try this sort of technology.

"Our hope is that you can take a couple of these chips out of the fridge and have a go, see what sort of results you'll get. We're sure you will get some data and then you can make an assessment of that data based on your biological knowledge of the area."

The team is currently finalising the sorting of the oligos and will then pass the design on to Affymetrix for manufacture. Once that is complete, they are hoping to do the same thing for metagenomic communities, particularly bacterial ones.

Asking the right questions of the data

The analytical methods developed by the team are aimed at being broadly applicable, according to David Lovell, who heads the bioinformatics for agribusiness group. This group applies mathematics and statistics to benefit agribusiness, still an area of enormous consequence to Australia's economy.

Lovell, who adds to the eclectic mix within the group by having a degree in electrical engineering, says his team has a specific focus on high-throughput measurement technologies such as microarrays, SNP arrays and gas chromatography/mass spectrometry, working in collaboration with researchers and other bodies such as CRCs.

"For example, we are working in partnership with the grape and wine research and development corporation, the CRC for Sugar Industry Innovation through Biotechnology and we have indirect links with the Aquafin CRC," Lovell says. "In all of those cases we are holding hands with researchers in the plant biology domain, marine aquaculture domain and the sugar domain. It's a hugely important sector in terms of the national and international markets.

"One of the things that keeps Australia agriculturally competitive is access to the latest technology that enables effective breeds to be created, that enables us to deal with agriculture in a changing climate, a demanding climate."

One area the group is looking at is how to breed sugar cane so that it is resistant to invasive diseases such as sugarcane smut, he says.

"Until recently this fungal disease wasn't well known in Australia but it is now well and truly entrenched and is having significant impacts on a billion dollar industry. So by working to identify genes associated with resistance to this fungal disease we hope to be able to speed up breeding programs to benefit industry."

The team is also working with groups in rice functional genomics to develop a system that will support all of the information that comes from massive insertional mutagenesis experiments. It has developed an information management system for the Rice Gene Machine project to help ensure the data is accurate and consistent and that the biologists get maximum mileage out of it.

"This is a big problem for biology in this age of industrialisation because people are dealing with huge amounts of data," Lovell says. "You can generate many, many thousands, sometimes millions of measurements quite quickly. Our aim is to help people get maximum value out of measurements from high-throughput platforms.

"Ideally we would like to ask the right question of the data. You can measure lots of things on relatively few individuals but how do you go about analysing that in the right way? At the end of the day you have approaches that are applicable to a variety of measurement platforms."

Lovell believes one of the most valuable things statisticians add to the biosciences is asking awkward questions at the beginning of an experiment, before the resources have been committed. His colleague Dr Glenn Stone, whom Lovell jokingly refers to as a 'statistical renaissance man' and who likes nothing better than to explain exactly what multivariate analysis is by way of a diagram, paraphrases the famous statistician Ronald Fisher to describe the team's philosophy.

"Calling in a statistician after an experiment has already been run may be useful in that he can provide a post-mortem to tell you what the experiment died of," he says.

Stone, who has degrees in mathematics, computer science and statistics, has been working with colleague Maree O'Sullivan to help a group from the Kolling Institute of Medical Research in Sydney identify gene and protein sets that discriminate between high- and low-grade intracranial tumours.

The Kolling researchers, led by Dr Kerrie McDonald, are collaborating with the CSIRO team to improve the accuracy of diagnosing these gliomas, O'Sullivan says.

"The problem is that currently it's a subjective diagnosis - they have the histology of the tumour sample, the pathologists have a look at it - but if you gave the tumour to six pathologists it is unlikely they would all agree," she says. "The identification of new molecular tools is needed to supplement current diagnostic methods.

"There are a number of different intercranial tumour sub-types, ranging from low grade to high grade. The high-grade tumours are obviously very aggressive and you are often looking at a 12 to 15-month survival time.

"What we are trying to do is develop a diagnostic that will improve accuracy. We have used our algorithms and looked at Compugen gene expression data."

O'Sullivan and Stone have used two algorithms to improve the accuracy of the diagnosis. As in the other applications the team has worked on, the advantage of the algorithms is that they pick out small subsets of genes, as opposed to current methodologies, which generate a whole list of potential genes, O'Sullivan says.

"With our methodologies we identify a small number of subsets of genes and then the validation is done on a small number, which cuts out a lot of the cost and the time. In every run we did it picked out pairs of genes, and in one case a three-gene subset. That's out of 19,000 potential genes."

Kerrie McDonald then did a real-time PCR validation of the gene expression and needed to validate only the 18 genes identified by the CSIRO algorithms rather than 50 or 100, O'Sullivan says.

CSIRO's major role in the project was to do the data analysis differently, Stone says. It is an approach that has far wider application than intracranial tumours. "We have worked with other groups on things like childhood leukaemia and amoebic gill disease in salmon - being statistical analysts we don't really care where the data comes from. We have developed these statistical methods to be widely applicable so they can produce as much information as possible.

"All of our methods are basically about saying are there combinations of several of these things that make a better classifier. The reason standard multivariate methods don't work is that there are more measurements than patients. This p>n situation is what makes life difficult for statisticians so we have new ways of looking at things. There you go - multivariate analysis 101."

Keeping track of brain changes

Dr Pascal Vallotton leads the biotech imaging group within CMIS, which has developed nerve cell analysis software allowing neuroscientists to automatically characterise the branching topology of neurites. Called HCA-Vision, the software also allows researchers to quantify proteins in different parts of a cell.

"Researchers are sitting at a microscope with a digital camera attached and they take an image of a cell undergoing a certain process," he explains. "For example, the cell is being subjected to some treatment or a series of different chemicals and researchers want to see whether these chemicals stimulate the growth of neurites.

"Our software analyses these images automatically and delivers a report on the geometry of these neurons - the length of the neurites, the number of branching points, or the intensities along branches. It also quantifies the surface of the cell body.

"You get an awful lot of statistics out of that. Obtaining accurate statistics is the really challenging part of our work."

Vallotton says that with other analysis software, cell morphologies that appear distinctly different to a human observer can end up sharing the same statistics and hence be considered identical by an automated classifier downstream.

"[Our] software reports the number of primary or root neurites as well as the number of layers associated with these primary neurites - secondary, tertiary, quaternary etc. These additional features allow users to selectively screen for compounds triggering different types of neurite outgrowth behaviour. Results are available either on a cell-by-cell basis or as averages over images.

"In the high-content analysis area you have a lot of high-end instruments that will typically cost half a million dollars and that will have bundled software tailored for that application. At the other end of the market you have a lot of software that is generic image analysis, such as Matlab, but you have to become an expert in order to do good measurement of cells. What is not available is high content analysis software for people who don't have these high end instruments or are not expert image analysts. With HCA-Vision we are addressing this important need."

Vallotton says the biotech imaging group, which comprises 10 researchers with backgrounds in mathematics, statistics and software engineering alongside his own in biophysics and computational biology, is looking to expand the initial neurite analysis module into more applications.

"We are going to release additional modules for quantifying protein translocation or to look at morphology of nuclei, or to look at systems that are more complex, such as systems that have mixed cell populations, as that is more like what happens in vivo.

"We are also trying to develop a 3D version of our software. Neurons are 3D by nature and looking at them in 2D in dishes is not very representative of what happens in the brain. Researchers now have technology that allows them to image in 3D almost in situ but they need to do the same quantification as they do in 2D and that demands a lot of computational resources.

"We are also very interested in micronuclei - little nuclei that appear around the nucleus - when the cell is stressed or when it is exposed to carcinogens. These micronuclei are interesting because they are indicative of the toxicology of a candidate compound. They are typically counted, often manually, in pre-clinical trials. We are collaborating with a number of labs in that area who want an automated solution to that."

Another project the team is working on is headed out of Brisbane by team member Paul Jackway. This project is looking at counting probiotic bacteria using image analysis rather than PCR or flow cytometry.

"The benefit of image analysis is that it potentially can give much more information on the population of bacteria that you have in your sample," Vallotton says. "In our solution, everything is automated - from loading the slides on the microscope, to acquiring mosaic images and counting the number of different types of bacteria."

Other CMIS projects

1. Automated microscopy for counting bacteria in faecal samples

CSIRO biotech imaging specialists are working with others in the Preventative Health National Research Flagship searching for bacterial species in human faeces that may protect against colorectal cancer. The research requires an automated fluorescence microscope system that can count fluorescently tagged bacteria faster and more accurately than manual counting. CSIRO has written software that drives an ordinary fluorescence microscope to focus automatically, determine which parts of the image represent bacteria, count them and record the results in a spreadsheet.

2. Discovering biomarkers to prevent colon cancer

Preventative Health Flagship scientists want to identify biomarkers for precancerous lesions (adenomas) in the colon to create better diagnostic products. Using CSIRO's statistical techniques, the researchers are investigating gene expression in colon tissue by looking at adenomas rather than cancers. They have identified two sets of biomarkers from the data: one predictive of the position of healthy tissue along the length of the colon and the other diagnostic of adenomas. They have also built a molecular map of how gene expression changes along the colon in healthy people and are using it as a foundation for studying gene expression in diseased tissue.
