BIG4 - BLOG POST: OpenBiodiv: The Semantic Web comes to biodiversity informatics

This post was written by Viktor Senderov.

The Semantic Web is a vision for the future of the web where not only documents but also data are connected. As part of the Semantic Web, the goal of the Pensoft-based OpenBiodiv project is to create a system for managing biodiversity knowledge extracted from scientific articles and shared as Linked Open Data. The system consists of our ontology OpenBiodiv-O, a knowledge graph, the RDF4R package and related source code, and a front-end user interface available at http://openbiodiv.net.

First of all, it is useful to understand the term knowledge management system, both by looking at some explicit definitions and by looking at several examples in practice. Variations of the term were already widely discussed in the 1980s and early 1990s and were understood to mean the use of ideas from both database management systems (DBMS) and artificial intelligence (AI) to create a type of computer system called a knowledge base management system (KBMS). Knowledge-bases differ from traditional databases in that they contain "prestored rules and facts from which useful inferences and conclusions may be drawn by an inference engine." In other words, using a knowledge-base, it is possible not only to retrieve and filter information stored in a database, but also to obtain facts that have been inferred (logically deduced) from what is already stated in the knowledge-base. For example, using FactForge one can find all airports within a given radius of London. More examples of modern knowledge-bases include Freebase (which recently migrated to Wikidata), DBpedia, and what most people have seen without realizing the mechanism behind it: the infoboxes that appear next to Google searches.
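To make the difference between retrieval and inference concrete, here is a minimal sketch in Python of a rule-based knowledge-base. It is a toy under stated assumptions: the facts, the predicate name locatedIn and the function infer_transitive are all made up for illustration, and it is not how FactForge or OpenBiodiv are implemented. It stores a few facts as triples and applies a single rule (if a is located in b and b is located in c, then a is located in c) until nothing new can be deduced.

```python
# Toy knowledge-base: facts are (subject, predicate, object) triples.
# All names below are hypothetical and chosen only for illustration.
facts = {
    ("Heathrow", "locatedIn", "London"),
    ("London", "locatedIn", "England"),
    ("England", "locatedIn", "UnitedKingdom"),
}


def infer_transitive(triples, predicate):
    """Apply one rule repeatedly until no new facts appear:
    (a predicate b) and (b predicate c) => (a predicate c)."""
    known = set(triples)
    while True:
        new = {
            (a, predicate, d)
            for (a, p1, b) in known if p1 == predicate
            for (c, p2, d) in known if p2 == predicate and b == c
        } - known
        if not new:
            return known
        known |= new


all_facts = infer_transitive(facts, "locatedIn")

# Facts that were never stated explicitly but follow from the rule.
inferred = all_facts - facts
for triple in sorted(inferred):
    print(triple)
```

A plain database could only return the three stated facts. The inference step additionally yields, for instance, that Heathrow is located in the United Kingdom, although nobody entered that fact.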

There has also been progress in incorporating statistical techniques into databases and representing knowledge under uncertainty. In such probabilistic databases, new knowledge is not logically deduced but is rather inferred under uncertainty with techniques from Bayesian statistics. However, the architecture chosen for the OpenBiodiv project is one of a classical rule-based knowledge-base.

Modern knowledge-based systems also incorporate Linked Open Data (LOD) principles to emphasize the community aspects of knowledge sharing and to make data more interconnected and reusable. Tim Berners-Lee, the creator of the World Wide Web, writes that, similar to how the hypertext web functions, "with linked data, when you have some of it, you can find other, related, data."

OpenBiodiv’s store rises to the challenge of LOD by interlinking data from three sources: more than 5000 biodiversity-related articles published in Pensoft’s journals, data extracted from legacy literature provided by Plazi, and the Global Biodiversity Information Facility (GBIF). In order to illustrate the capabilities of OpenBiodiv and draw attention to the impact of the tragically lost collection of the Museu Nacional de Rio de Janeiro (MNRJ), I can ask our system how many times a specimen from that collection was used in a taxonomic article, and in which articles. It turns out that MNRJ has been mentioned 195 times in our system in a total of 22 articles published by Pensoft. Perhaps more interestingly, we can see specimens of which taxa may have been lost. Examples include insects (Xestoblatta, Charinus, Lamproclasiopa, etc.), nematode worms (Paracamallanus, Cucullanus, Pseudascarophis, etc.), birds (Ichthyouris), fish (Sphoeroides), and many others, for a total of 6,127 distinct names mentioned in taxonomic articles whose materials and methods sections include MNRJ. I believe a similar analysis may prove helpful in the effort to rebuild MNRJ’s collection by highlighting what may have been lost and giving access to the primary literature describing the lost species. Imagine what a detailed inventory we could have built if OpenBiodiv contained knowledge extracted from all taxonomic literature!

It is hard for me in such a short format to describe all of the results of OpenBiodiv. For a full treatment, the reader is well advised to look directly at the text of my dissertation, which I develop in open access, and at the system itself at http://openbiodiv.net. Nevertheless, I would like to highlight the tool-development work that I have been doing as part of OpenBiodiv. The knowledge graph is stored in a GraphDB triple store, whereas the input data comes in the form of XML files published by Pensoft and Plazi and a huge CSV file in the case of GBIF’s taxonomic backbone. In order to convert XML to RDF I wrote an R library called RDF4R, whose usage is described in a vignette. During the development process I learnt a lot about free and open source software development and how frustrating and fun it can be at the same time. Perhaps the fun part is best illustrated by a feature of the library called query_factory. It is a functional-programming inspired widget that "magically" transforms a given input SPARQL query s and a given SPARQL access point a into an R function f. Invoking f executes the query s, parameterized by f’s arguments, on the access point a. However, neither the query nor the access point is an argument of f itself! The "magic" becomes apparent when one understands functions as closures, i.e. an execution sequence together with an execution environment: the query and the access point are enclosed in f from its parent query_factory.
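The closure idea behind query_factory can be sketched in a few lines. The sketch below is in Python rather than R and is not the RDF4R implementation. The "%name" placeholder syntax, the submit argument, the fake_submit stand-in and the example.org URL are all assumptions made for illustration, so that the example runs without a real SPARQL access point.

```python
def query_factory(query_template, access_point, submit):
    """Return a function f that runs query_template on access_point.

    `submit` is a hypothetical callable (query_text, access_point) -> result
    standing in for whatever actually sends the query over the network.
    """
    def f(**params):
        # query_template, access_point and submit are NOT arguments of f;
        # they are captured from the enclosing call to query_factory.
        query = query_template
        for name, value in params.items():
            # Assumed placeholder convention: "%name" in the template.
            query = query.replace("%" + name, value)
        return submit(query, access_point)
    return f


def fake_submit(query, access_point):
    """Stand-in for a network call: just echo what would be sent."""
    return {"endpoint": access_point, "query": query}


# A made-up query and endpoint, for illustration only.
template = 'SELECT ?s WHERE { ?s rdfs:label "%label" }'
find_by_label = query_factory(template, "http://example.org/sparql", fake_submit)

result = find_by_label(label="Xestoblatta")
print(result["query"])
```

The caller of find_by_label only supplies the label. The query text and the access point travel with the function, which is exactly what makes it a closure.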

I think it must have become clear by now that, thanks to the programming I did as a part of BIG4, I became even more of a geek than I had been. It cemented my desire to look for a career combining science and programming in the future. This blog post comes so late because (amongst other reasons) I just submitted an application for a Marie Skłodowska-Curie Individual Fellowship to do a postdoc in probabilistic programming for phylogenetics at a BIG4 project partner: the Ronquist lab at the Swedish Museum of Natural History in Stockholm. BIG4 is responsible not only for this great connection but also for many other professional and personal friendships that I made. Certainly, I am indebted for life to the friendly and professional people at Pensoft, and in particular to my advisor Lyubomir Penev, as well as Kiril Simov, and I hope to maintain my connection to the project, which will from now on be led by Pensoft’s new Ph.D. student, Maria Dimitrova.

Publications within Viktor's project:

Senderov, Viktor et al. 2018. "OpenBiodiv-O: Ontology of the OpenBiodiv Knowledge Management System." Journal of Biomedical Semantics 9(1). https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-017-0174-5 (February 13, 2018).
Senderov, Viktor, and Lyubomir Penev. 2016. "The Open Biodiversity Knowledge Management System in Scholarly Publishing." Research Ideas and Outcomes 2: e7757. http://rio.pensoft.net/articles.php?id=7757 (July 22, 2017).
Cardoso, Pedro et al. 2016. "Species Conservation Profiles Compliant with the IUCN Red List of Threatened Species." Biodiversity Data Journal 4: e10356. http://bdj.pensoft.net/articles.php?id=10356 (March 15, 2018).
Senderov, Viktor, Teodor Georgiev, and Lyubomir Penev. 2016. "Online Direct Import of Specimen Records into Manuscripts and Automatic Creation of Data Papers from Biological Databases." Research Ideas and Outcomes 2: e10617. http://rio.pensoft.net/articles.php?id=10617 (March 15, 2018).