Mining biotech's data mother lode

An EU-sponsored project has developed a suite of tools that will enable biotech companies to mine the vast quantities of data created by modern life-science labs and find the nuggets of genetic gold that lie within.

The BioGrid project brought together six partners from the UK, Germany, Cyprus and The Netherlands to address one of the key problems facing the life sciences today.

"How to integrate the huge volume of disparate data – on gene expression, protein interactions and the vast output of literature both inside and outside laboratories – to find out what is important," says Dr Michael Schroeder, Professor of the Bioinformatics group at Dresden Technical University and coordinator of this IST-funded project.

"I attended a workshop recently, held by the W3 consortium, and many of the companies there said that this was the biggest problem they face."

Currently, pharmaceutical and biotech companies produce vast quantities of raw data on the problems that interest them. Microarrays process thousands of samples to discover which genes are over-expressing.

These over-expressing genes, which can themselves number in the thousands, create proteins. The researchers then need to discover which interactions are taking place among all the different proteins created by the over-expressing genes. This is not trivial.

If a researcher can identify protein interactions, they then need to search the company intranet to see what relevant work other company labs have produced. Since the data output at biotech labs is vast, this is also not trivial.

Finally, the researcher must perform a search of academic journals to find relevant journal papers. Currently PubMed, the most important public literature database available, has 15,000,000 entries, and the number is growing every day. Finding relevant data there is again not a trivial task.

Dr Schroeder gives an example. "The medical faculty here were studying pancreatic tumours. They found 1,000 genes over expressing. Using our software they were able to find, among others, three protein interactions that were particularly relevant. Using our literature search ontology they were able to discover that two of these interactions were novel. They are now going to study these novel interactions more closely," he says.

BioGrid explained
This is how the project will help companies integrate all the data they need to make relevant discoveries using a BioGrid. A BioGrid is essentially a data and computational Grid created through a suite of tools developed by the project.

Biotech and life-science companies will use these tools internally to integrate data. They can also use the same tools to create ad-hoc distributed processing networks to crank through computationally intensive problems, if needed.

Here's how it works. One element of the software suite analyses the over-expressing genes discovered in microarray experiments to establish which proteins they encode. This uses standard techniques.

A second analysis tool in the suite predicts what possible protein-protein interactions are taking place. This is novel. When a gene encodes a protein, the protein folds up into a unique shape, forming a 3D structure. This structure can only interact, or fit, with some proteins, but not others, like pieces of a jigsaw puzzle.

BioGrid's protein interaction software includes a database of the 20,000 known protein structures and uses that database to identify which ones could potentially interact, among the thousands of proteins created by the over-expressing genes. This focuses research efforts on the most interesting candidates for a particular problem, like, for example, pancreatic tumours.
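The filtering idea described above can be illustrated with a minimal sketch. It is not BioGrid's actual method: it assumes each protein has already been assigned to a structural family, and that a lookup table lists which pairs of families are known to interact. The family names, the protein labels and the function candidate_interactions are all hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical lookup: pairs of structural families assumed to interact.
# These entries are invented for illustration, not taken from BioGrid or Scoppi.
KNOWN_INTERACTING_FAMILIES = {
    frozenset({"kinase", "sh2_domain"}),
    frozenset({"protease", "inhibitor"}),
}


def candidate_interactions(protein_families, known_pairs):
    """Return protein pairs whose structural families are listed as interacting."""
    candidates = []
    for a, b in combinations(sorted(protein_families), 2):
        pair = frozenset({protein_families[a], protein_families[b]})
        if pair in known_pairs:
            candidates.append((a, b))
    return candidates


# Hypothetical proteins produced by over-expressing genes, with assumed families.
proteins = {"P1": "kinase", "P2": "sh2_domain", "P3": "protease", "P4": "globin"}
result = candidate_interactions(proteins, KNOWN_INTERACTING_FAMILIES)
print(result)
```

With thousands of proteins the number of possible pairs runs into the millions, so a structural filter of this kind is what narrows the list down to a handful of candidates worth studying in the lab.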

A smart way to search
Once interesting potential protein interactions are known, BioGrid's ontology-based search technology can mine company or journal data for any relevant information. The Gene Ontology (GO) was established by the life sciences as a vocabulary to describe all the different genetic processes. There are 20,000 terms in this vocabulary so far.

"The problem is researchers never use exact terms from the Gene Ontology to do their search of the literature. Our innovation was to create an algorithm that intelligently matches the terms used by researchers to the Gene Ontology," says Dr Schroeder.

This smart search produces a list of all the possible GO terms and lists all the articles relevant to that specific term. Researchers are presented with a vast quantity of information broken down by sub-topic, so they can quickly drill down to the most relevant information.
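As a rough illustration of grouping results by ontology term, the sketch below matches a researcher's free-text query to vocabulary terms by simple word overlap and then lists the articles whose abstracts mention each matched term. Word overlap is a stand-in chosen for brevity, not the matching algorithm Dr Schroeder describes, and the function names, article titles and abstracts are hypothetical.

```python
import re
from collections import defaultdict


def tokens(text):
    """Lower-case a string and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def match_terms(query, vocabulary):
    """Vocabulary terms that share at least one word with the query."""
    query_words = tokens(query)
    return sorted(term for term in vocabulary if query_words & tokens(term))


def group_articles(query, vocabulary, articles):
    """Map each matched term to the articles whose abstract contains all its words."""
    grouped = defaultdict(list)
    for term in match_terms(query, vocabulary):
        term_words = tokens(term)
        for title, abstract in articles.items():
            if term_words <= tokens(abstract):
                grouped[term].append(title)
    return dict(grouped)


# A three-term vocabulary and three invented abstracts, for illustration only.
vocabulary = ["apoptotic process", "cell cycle", "protein folding"]
articles = {
    "Paper A": "Regulation of the cell cycle in pancreatic tumours",
    "Paper B": "Chaperones assist protein folding",
    "Paper C": "Cell cycle arrest and protein folding stress",
}
grouped = group_articles("cell death and folding", vocabulary, articles)
print(grouped)
```

The point of the grouping step is that the researcher sees results organised under each matched term rather than as one flat list, which is what makes drilling down by sub-topic possible.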

Linking all these software tools together is a rules-based Java scripting language called Prova, also developed by the BioGrid team. It is the glue that sticks the gene expression, protein interaction and ontology-based literature analysis together into an integrated, cohesive unit. "It's an open source language, available at www.prova.ws, and about 20 groups are using it around the world right now. We made it open source because you need to develop a community to keep a programming language alive," says Dr Schroeder.

To help biotech companies establish their own BioGrids on their intranets, some of the researchers involved in BioGrid created the spin-off company Molgenis, supported by a BioPartner first-stage grant. In addition, other partners are in the process of obtaining seed funding for a spin-off company called Transinsight.

"BioGrid is a knowledge Grid that offers the proper software and support so that companies can do rapid prototyping for their data-integration and large computations," says Dr Schroeder. "Transinsight will continue working on new, innovative and novel solutions to data problems faced by the life sciences industry," he hopes. "The company will look at natural language processing and the semantic Grid for the life sciences. It is currently running two feasibility studies with two international customers. Molgenis provides software to support the storage and analysis of gene expression data."

Many of the tools developed by BioGrid are available for public use. The ontology-based search is available at GoPubMed.org, while the protein interaction database is at Scoppi.org. But GoPubMed is limited to 100 searches while Scoppi does not include the predictive analysis tool. These limitations will not be there for commercial clients. For those, BioGrid offers an earthmover for data mining.

Contact:
Michael Schroeder
Professor in Bioinformatics
Biotec/Dept. of Computing, TU Dresden
Tatzberg 47-51
D-01307 Dresden
Germany
Tel: +49-351-46340060
Fax: +49-351-46340061

Source: IST Results Portal
