Linked Data and Translational Research

The World Wide Web Consortium (W3C) has been building and organizing technologies to construct a "Semantic Web" for the past few years. The Semantic Web is envisioned to be for data what the Web is for documents. Just what that might mean has received as much attention as the implementation itself.

The main focus has been on constructing tools that process information that is more structured than most of the information found in HTML pages on the Web. Structure here usually means that the contents of Web files share a common syntax and a common "interpretation" that clients accessing those files can rely on. In addition, structured data content continues to link data sources together using common Web mechanisms (namely, accessible URLs).

These characteristics enable a number of capabilities. First, programs can interrogate and navigate these files and process the information they contain to perform user-directed operations. Second, users can navigate these files as if they were Web pages, although the files will not be very reader-friendly, since they are designed for automated processing rather than reading. Programs, by contrast, will find them quite "program-friendly."

This technology has been taken up by many information providers, though the most striking applications have involved the construction of large data warehouses combining multiple data resources, rather than the construction of separate, distributed data resources. For example, the Bio2RDF project has gathered around 40 bioinformatics resources together in a single massive collection including over 2 billion pieces of information (triples).
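To make the term "triple" concrete, here is a minimal Python sketch of one triple and its serialization as a line of N-Triples. The example.org identifiers and the Triple type are invented for illustration and are not taken from Bio2RDF; only the rdfs:label URI is a standard one.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str                  # URI of the thing being described
    predicate: str                # URI naming the relationship
    obj: str                      # URI of another resource, or a literal value
    obj_is_literal: bool = False  # True when obj is plain text, not a URI


def to_ntriples(t: Triple) -> str:
    """Serialize one triple as a single N-Triples line."""
    if t.obj_is_literal:
        # Escape backslashes, double quotes, and newlines inside the literal.
        escaped = (
            t.obj.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        )
        obj = f'"{escaped}"'
    else:
        obj = f"<{t.obj}>"
    return f"<{t.subject}> <{t.predicate}> {obj} ."


# Invented identifiers: a link between two resources, and a text label.
link = Triple(
    "http://example.org/gene/1",
    "http://example.org/vocab/encodes",
    "http://example.org/protein/7",
)
label = Triple(
    "http://example.org/gene/1",
    "http://www.w3.org/2000/01/rdf-schema#label",
    "example gene",
    True,
)

print(to_ntriples(link))
print(to_ntriples(label))
```

Because the object of the first triple is itself a URL, a client can follow it to find more triples, which is the linking mechanism described above.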

The DBpedia project has similarly warehoused a large number of data points taken from Wikipedia. Auer writes in 2007:

"The DBpedia dataset currently provides information about more than 1.95 million "things", including at least 80,000 persons, 70,000 places, 35,000 music albums, 12,000 films. It contains 657,000 links to images, 1,600,000 links to relevant external web pages, 180,000 external links into other RDF datasets, 207,000 Wikipedia categories and 75,000 YAGO categories [16]."

SPARQL is the main query language used to access and manipulate data stored within the Linked Data Web. SPARQL endpoints publish Web forms that allow users to enter SPARQL queries, and they can usually also accept queries embedded in URLs or sent through Web Services.
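As a rough illustration of a query embedded in a URL, the Python sketch below encodes a generic SPARQL query into a GET request URL using the "query" parameter defined by the SPARQL protocol. The endpoint address is a hypothetical placeholder, and the two helper functions are invented for this example; no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint address, used only for illustration.
ENDPOINT = "http://example.org/sparql"

# A generic query that asks for up to ten arbitrary triples.
QUERY = """SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10"""


def build_query_url(endpoint: str, query: str) -> str:
    """Embed a SPARQL query in a GET URL via the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


def extract_query(url: str) -> str:
    """Recover the query text from such a URL, as an endpoint would."""
    return parse_qs(urlsplit(url).query)["query"][0]


url = build_query_url(ENDPOINT, QUERY)
print(url)
print(extract_query(url) == QUERY)
```

A URL built this way can be pasted into a browser or fetched by a program, which is what makes endpoints usable from both Web forms and scripts.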

The Virtuoso Universal Server supports a SPARQL endpoint with a good collection of interfaces and return formats, and is probably the most-used implementation of the SPARQL endpoint protocol.

There are several areas of ongoing research associated with the Linked Data Web:

A number of bioinformatics research projects have used semantic technologies as PART OF THEIR EXPERIMENTAL ANALYSIS, as opposed to simple data storage and retrieval. Some papers are mentioned below.

The following document provides a context for work with Linked Data in the area of translational research:

Alan Ruttenberg, et al., "Advancing translational research with the Semantic Web", BMC Bioinformatics 2007, 8(Suppl 3):S2 doi:10.1186/1471-2105-8-S3-S2.

In this paper the authors

"present a scenario that shows the value of the information environment the Semantic Web can support for aiding neuroscience researchers. We then report on several projects by members of the HCLSIG, in the process illustrating the range of Semantic Web technologies that have applications in areas of biomedicine."

They conclude:

"Current [Semantic Web] tools and standards are already adequate to implement components of the bench-to-bedside vision. On the other hand, these technologies are young."

Sahoo, et al. provide an excellent example application motivating the use of semantic approaches in:

Sahoo, et al., "An experiment in integrating large biomedical knowledge resources with RDF: Application to associating genotype and phenotype information", Proceedings of the workshop on Health Care and Life Sciences Data Integration for the Semantic Web (8--12 May 2007).

They write:

"Semantic Web technologies provide a valid framework for information integration in the life sciences. Ontology-driven integration represents a flexible, sustainable and extensible solution to the integration of large volumes of information. Additional resources, which enable the creation of mappings between information sources, are required to compensate for heterogeneity across namespaces."

They based this claim on a proof of concept project in which they used:

"an ontology-driven approach to integrate two gene resources (Entrez Gene and HomoloGene) and three pathway resources (KEGG, Reactome and BioCyc), for five organisms, including humans. We created the Entrez Knowledge Model (EKoM), an information model in OWL for the gene resources, and integrated it with the extant BioPAX ontology designed for pathway resources."

Another paper demonstrating possible applications for the semantic approach is:

Jentzsch, Anja, "Enabling Tailored Therapeutics with Linked Data," Linked Data on the Web, 2009.

Jentzsch, et al. examine "the applicability and potential benefits of using Linked Data to connect drug and clinical trials related data sources . . ." and present a use-case "that demonstrates the immediate benefit of this work in enabling data to be browsed from disease, to clinical trials, drugs, targets and companies." They conclude:

The Linked Data approach is very promising for the pharmaceutical industry, and its value will increase as more data sources become available.

There are many reasons this approach will be useful, but two stand out from the Sahoo paper. Quoting from that paper:

"Bridging between genotype and phenotype is generally achieved through the integration of knowledge sources such as Entrez Gene (EG), Online Mendelian Inheritance in Man (OMIM) and the Gene Ontology (GO)."
and
"The interpretation of experimental data generally requires physicians and biologists to compare their clinical and biological data to already existing data sets and to reference knowledge bases."

Now to be more specific, in their proof-of-concept, they

"were primarily interested in demonstrating how one particular hypothesis, i.e., the existence of an association between glycosyltransferase and congenital muscular dystrophy, could be refined through the existence of paths in the RDF graph."

So perhaps we can say "Semantic Web approaches will help researchers refine hypotheses", or if we want to be really precise we could say "Semantic Web approaches will help researchers demonstrate associations between glycosyltransferase and congenital muscular dystrophy."

Alternatively, "researchers could create SPARQL queries to identify all classes of enzymes involved with a given disease, or with an arbitrary list of diseases, thus generating hypotheses, not only refining ad hoc hypotheses."
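The idea of refining or generating hypotheses through "paths in the RDF graph" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the triples and "ex:" identifiers are invented and encode no real biology, the search follows links in one direction only, and the function names are hypothetical rather than taken from the paper.

```python
from collections import deque

# Toy triples with invented "ex:" identifiers; they encode no real biology.
TRIPLES = [
    ("ex:enzymeClassX", "ex:hasMember", "ex:geneA"),
    ("ex:geneA", "ex:participatesIn", "ex:pathwayP"),
    ("ex:pathwayP", "ex:implicatedIn", "ex:diseaseY"),
    ("ex:enzymeClassZ", "ex:hasMember", "ex:geneB"),
    ("ex:geneB", "ex:participatesIn", "ex:pathwayQ"),
]


def find_path(triples, start, goal):
    """Breadth-first search for a shortest directed path from start to goal.

    Returns the list of triples along the path, or None if no path exists.
    """
    outgoing = {}
    for s, p, o in triples:
        outgoing.setdefault(s, []).append((p, o))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for p, o in outgoing.get(node, []):
            if o not in seen:
                seen.add(o)
                queue.append((o, path + [(node, p, o)]))
    return None


def sources_reaching(triples, candidates, goals):
    """Return the candidates that have a path to at least one goal."""
    return {
        c
        for c in candidates
        if any(find_path(triples, c, g) is not None for g in goals)
    }


# Refining one hypothesis: is there a path between a given pair?
print(find_path(TRIPLES, "ex:enzymeClassX", "ex:diseaseY"))
# Generating hypotheses: which candidate classes reach any listed disease?
print(sources_reaching(TRIPLES, ["ex:enzymeClassX", "ex:enzymeClassZ"], ["ex:diseaseY"]))
```

The first call corresponds to refining an ad hoc hypothesis about one pair of entities; the second corresponds to scanning a list of candidates, which is closer to hypothesis generation.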

Ruttenberg, et al. set out to demonstrate the value of the Semantic Web to neuroscience researchers, claiming that,

in general, the Semantic approach will facilitate the "interdisciplinary knowledge transfer needed to improve the bench-to-bedside process." Improving the bench-to-bedside process will in turn improve the quality of life of Hoosiers, et al. by getting newer medical treatments into use sooner than they otherwise would be, and by feeding results from such treatments back into the process of clinical experimentation and discovery faster than they otherwise would be.

To be more specific, they state that in order to improve this process:

"Queries need to be made across experimental data regardless of the community in which it originates. Making cross-disease connections and combining knowledge from the molecular to the clinical level has to be practical in order to enable cross-disciplinary projects. Both well-structured standardized representation of data as well as linking and discovery of convergent and divergent interpretations of it must be supported in order to support activities of scientists and clinicians. Finally, the elements of this information environment should be linked to both the current and evolving scientific publication process and culture."

As it happens, the Semantic Web excels at this kind of interlinking, and will therefore help researchers make cross-disease connections, combine knowledge from different levels of abstraction and from different communities, and link these kinds of information to the scientific publication process.

They focus on Alzheimer's Disease as an example, pointing out that:

"Increasingly, researchers recognize that AD, PD, and HD share various features at the clinical [13], neural [14-17], cellular [18-20], and molecular levels [21,22]. Nonetheless, it is still common for biologists in different subspecialties to be unaware of the key literature in one another's domain."

Additional Information

For more information about Bio2RDF, one main web site is bio2rdf.org, but MOST info about the collection is on the Wiki:

http://bio2rdf.wiki.sourceforge.net/

Here is a link to a list of databases included in Bio2RDF:

http://www.freebase.com/view/user/bio2rdf/public/sparql

and here is a list of the actual download files from which to choose:

http://quebec.bio2rdf.org/download/n3/

The best SPARQL endpoint provided by the Bio2RDF group is at

http://quebec.bio2rdf.org:8890/sparql

which can be queried with the example queries listed at:

http://bio2rdf.wiki.sourceforge.net/Demo+queries

Bio2RDF is introduced by Francois Belleau, et al. (out of Quebec's Laval University) in this paper:

Belleau, F., et al., "Bio2RDF: Towards a mashup to build bioinformatics knowledge systems", 2008.

and here is a paper the same group did for a workshop:

http://www2007.org/workshops/paper_143.pdf

and a useful introductory PPT slide set:

http://carbon.videolectures.net/2008/active/iswc08_karlsruhe/prudhommeaux_swhcls/iswc08_prudhommeaux_swhcls_05.ppt#690,2,Overview

Modifications made to Virtuoso to accommodate semantic content are described in

Erling, Orri, and Ivan Mikhailov, "RDF Support in the Virtuoso DBMS"

http://www.csee.umbc.edu/691s/papers/rdfdb1.pdf

Clients

Note that if you run SPARQL queries interactively in Firefox with the Tabulator plug-in installed, and set the query Web form's return format to "Auto", Tabulator will let you select from a variety of display formats and actually browse the resulting documents, and the documents to which they link, almost . . .

. . . as if the Web were a group of interconnected RDF documents. . .

. . . a Gigantic Global Graph, one might say.

Another very interesting Linked Data client is Explorator, a kind of data workbench that can download multiple datasets from SPARQL endpoints and manipulate them in combination on the desktop.

This approach is useful for dealing with extracts of manageable size from otherwise humongous data stores, and also provides a more user-friendly interface using a set algebra model of data manipulation:

Araujo S.; Schwabe D., Barbosa S. - Experimenting with Explorator: a Direct Manipulation Generic RDF Browser and Querying Tool. Visual Interfaces to the Social and the Semantic Web (VISSW 2009), Sanibel Island, Florida - February 2009
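A set algebra model of data manipulation can be illustrated with ordinary Python sets. This sketch is not Explorator's interface; the two extracts and their "ex:" identifiers are invented stand-ins for results downloaded from two endpoints, and the subjects helper is hypothetical.

```python
# Two invented extracts standing in for results fetched from two endpoints.
# Each triple is a (subject, predicate, object) tuple.
extract_a = {
    ("ex:drug1", "ex:targets", "ex:protein1"),
    ("ex:drug2", "ex:targets", "ex:protein2"),
}
extract_b = {
    ("ex:drug2", "ex:targets", "ex:protein2"),
    ("ex:drug3", "ex:targets", "ex:protein1"),
}


def subjects(triples):
    """Project a set of triples onto the set of their subjects."""
    return {s for s, _, _ in triples}


combined = extract_a | extract_b  # union: everything either source says
shared = extract_a & extract_b    # intersection: statements both sources make
only_a = extract_a - extract_b    # difference: statements unique to extract_a

print(len(combined))
print(shared)
print(subjects(only_a))
```

Working on small extracts like these keeps each operation cheap, which is the practical appeal of pulling manageable subsets out of a very large store before combining them.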

Since many DBpedia users will not be interested in composing SPARQL queries, some work from Leipzig University in

Auer, Soren and Jens Lehmann, "What have Innsbruck and Leipzig in common? Extracting Semantics from Wiki Content", European Semantic Web Conference (ESWC), 2007.

seems very interesting because it has resulted in a pattern-matching interface, which is illustrated in that paper.

An interface like this could well be used as a search interface for many web sites.

(Note: That paper is also interesting for the implications of its title. To wit: it's relatively easy to build a semantic query that will produce a list of all the predicates for which two entities have matching objects. That is, you can determine to a first approximation what those entities have "in common" with respect to the network being queried. This would be much more difficult in a relational setting.)
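The "in common" query described in the note can be sketched as follows. The SPARQL text simply repeats the same predicate and object variables for both entities; the Python function does the equivalent over an in-memory set of triples. The function names and the "ex:" toy data are invented for illustration.

```python
def common_properties_query(entity_a: str, entity_b: str) -> str:
    """Build a SPARQL query listing predicate/object pairs shared by two URIs."""
    return (
        "SELECT DISTINCT ?p ?o\n"
        "WHERE {\n"
        f"  <{entity_a}> ?p ?o .\n"
        f"  <{entity_b}> ?p ?o .\n"
        "}"
    )


def common_properties(triples, entity_a, entity_b):
    """Do the same matching in memory over (subject, predicate, object) tuples."""
    a = {(p, o) for s, p, o in triples if s == entity_a}
    b = {(p, o) for s, p, o in triples if s == entity_b}
    return a & b


# Invented toy data; it states nothing about any real place.
TOY = {
    ("ex:cityA", "ex:type", "ex:City"),
    ("ex:cityA", "ex:hasFeature", "ex:University"),
    ("ex:cityA", "ex:inCountry", "ex:country1"),
    ("ex:cityB", "ex:type", "ex:City"),
    ("ex:cityB", "ex:hasFeature", "ex:University"),
    ("ex:cityB", "ex:inCountry", "ex:country2"),
}

print(common_properties_query("http://example.org/cityA", "http://example.org/cityB"))
print(common_properties(TOY, "ex:cityA", "ex:cityB"))
```

Neither version needs to know in advance which predicates exist, which is the point of the comparison with a relational setting, where the columns to compare would normally have to be named.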

Even more information

And finally, here are a couple more links about using Linked Data from the bioinformatics research literature:

Conclusion

In general, the Semantic/Linked Data Web appears to lie at the boundary between research and practice. The technology is stable enough to be deployed at scale, but has not been well tested by its various target audiences. It has also not reached the maturity that comes from understanding its power and functionality relative to other such technologies.

For example, the power of both the relational data representation model and the SQL query language is well understood relative to other database management schemes and to other formal languages possible within the relational approach. This understanding accrued over several decades, and a comparable understanding has yet to emerge for the Semantic model.