Abstract
The Linked Data cloud contains large amounts of RDF data generated from databases.
This research is based upon work supported in part by the Intelligence Advanced Research Projects Activity (IARPA) via Air Force Research Laboratory (AFRL) contract number FA8650-10-C-7058. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, AFRL, or the U.S. Government.
1 Introduction
The Linked Data cloud contains large amounts of RDF data generated from databases. Much of this RDF data, generated using tools such as D2R, is expressed in terms of vocabularies automatically derived from the schema of the original database. The generated RDF would be significantly more useful if it were expressed in terms of commonly used vocabularies. With today’s tools, this is labor-intensive. For example, one can first use D2R to automatically generate RDF from a database and then use R2R to translate the automatically generated RDF into RDF expressed in a new vocabulary. The problem is that defining the R2R mappings is difficult and labor-intensive because one needs to write the mapping rules in terms of SPARQL graph patterns.
In this work, we present a semi-automatic approach for building mappings that translate data in structured sources to RDF expressed in terms of a vocabulary of the user’s choice. Our system, Karma, automatically derives these mappings and provides an easy-to-use interface that enables users to guide the automated process toward the desired mappings. In our evaluation, users needed to interact with the system less than once per column, on average, to construct the desired mapping rules. The system then uses these mapping rules to generate semantically rich RDF for the data sources.
We demonstrate Karma using a bioinformatics example and contrast it with other approaches used in that community. Bio2RDF [7] and the Semantic MediaWiki Linked Data Extension (SMW-LDE) [2] are examples of efforts that integrate bioinformatics datasets by mapping them to a common vocabulary. We applied our approach to a scenario used in the SMW-LDE work that integrates the ABA, Uniprot, KEGG Pathway, PharmGKB, and Linking Open Drug Data datasets using a common vocabulary. In this demonstration, we first show how a user can interactively map these datasets to the SMW-LDE vocabulary, and then we use these mappings to generate RDF for these sources.
2 Application: Karma
Karma is a web application that enables users to perform data-integration tasks by example [8]. Karma provides support for extracting data from a variety of sources (relational databases, CSV files, JSON, and XML), for cleaning and normalizing data, for modeling it according to a vocabulary of the user’s choice, for integrating multiple data sources, and for publishing in a variety of formats (CSV, KML, and RDF). In this demonstration we focus on the capabilities to interactively model sources according to a chosen vocabulary and to publish data in RDF.
The modeling process takes as input a vocabulary defined in an OWL ontology, one or more data sources to be modeled, and a database of semantic types learned in previous modeling sessions. It outputs a formal mapping between the source and the ontology that can be then used to generate RDF. The key technologies that this process exploits are the learning of semantic types using conditional random fields (CRF) [6] and a Steiner tree algorithm to compute the relationships among the schema elements of a source.
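To give an intuition for the Steiner-tree step, the sketch below finds a small subtree of a toy ontology graph that connects a set of assigned semantic types. This is not Karma's implementation; it is a simple approximation (union of BFS paths from one terminal to the rest) over a hypothetical class graph whose names echo the paper's example domain:

```python
from collections import deque

def bfs_parents(graph, root):
    """Parent pointers of a BFS tree rooted at `root`."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return parent

def approx_steiner_tree(graph, terminals):
    """Union of BFS-tree paths from one terminal to the others.
    A crude stand-in for the minimal tree connecting semantic types."""
    root, *rest = terminals
    parent = bfs_parents(graph, root)
    edges = set()
    for terminal in rest:
        node = terminal
        while parent[node] is not None:  # walk back to the root
            edges.add(frozenset((node, parent[node])))
            node = parent[node]
    return edges

# Hypothetical ontology graph: nodes are classes, edges stand for
# object properties linking them (labels omitted in this sketch).
ontology = {
    "Pathway": ["Gene", "Disease"],
    "Gene": ["Pathway"],
    "Disease": ["Pathway", "Drug"],
    "Drug": ["Disease"],
}
tree = approx_steiner_tree(ontology, ["Gene", "Drug"])
# The tree connects Gene and Drug via Pathway and Disease.
```

Because several subtrees can connect the same terminals, the minimum-cost tree is not always the intended model, which is why Karma lets users force specific properties into the result.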
Semantic types characterize the meaning of data. For example, consider a dataset with a column containing PharmGKB accession identifiers for pathways. The syntactic type of the values is String. In our formulation, we represent their semantic type as a pair consisting of the class Pathway and the property pharmGKBId to capture the idea that these values are a particular type of pathway identifier. In RDF terms, the values are the objects of triples whose subject is of type Pathway and whose property is pharmGKBId. Karma infers semantic types automatically using the semantic types it has been trained to recognize. When Karma is unable to infer the semantic type for a column, users can interactively assign the desired type; Karma uses the assigned type and the data in the column to train a CRF model to recognize the type in the future [4]. The semantic types are used by our Steiner tree algorithm to compute the source model as the minimum tree that connects the assigned semantic types via properties in the ontology (the details of the approach are published elsewhere [5]). Because the minimum model is not always the desired model, Karma provides a user interface to enable users to force this algorithm to include specific properties in the model.
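The class/property pairing can be made concrete with a short sketch. The type names (Pathway, pharmGKBId) follow the paper's example, but the function, prefix, and identifier values below are illustrative, not Karma's internal API:

```python
# A semantic type pairs an ontology class with a property.
semantic_type = ("Pathway", "pharmGKBId")

def column_to_triples(values, semantic_type, prefix="ex:"):
    """Turn a column of values into (subject, predicate, object) triples:
    each value becomes the object of a triple whose subject is a fresh
    instance of the semantic type's class."""
    cls, prop = semantic_type
    triples = []
    for i, value in enumerate(values):
        subject = f"{prefix}{cls.lower()}{i}"
        triples.append((subject, "rdf:type", prefix + cls))
        triples.append((subject, prefix + prop, value))
    return triples

# Made-up accession identifiers, two triples per cell.
triples = column_to_triples(["PA0001", "PA0002"], semantic_type)
```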
Most of the existing mapping generation tools, such as Clio [3], Altova MapForce (altova.com), or NEON’s ODEMapster [1], rely on the user to manually specify the mappings in a graphical interface. In contrast, Karma provides a semi-automatic approach to achieve the same objective, enabling domain experts (and not just DB administrators or ontology engineers) to specify the mappings.
3 Demonstration
In this demonstration, we first show how users model structured sources according to an ontology they select; then we show how Karma can use the model to generate RDF represented using the classes and properties defined in the ontology. We will illustrate the process using a bioinformatics example.
In the first part of the demonstration we provide an overview of the Karma workspace (Fig. 1) and show how to import data into Karma.
In the second part we show the model that Karma automatically infers for a source. Karma builds the initial model using the existing database of semantic types and visualizes it as hierarchical headings over the worksheet data. The inferred semantic types are shown in the grey boxes nested inside the dark blue boxes that show the column names.
In the third part we show how users can adjust the automatically generated model. We show how users can fix incorrectly assigned semantic types, and how users can adjust the model when Karma infers incorrect relationships between columns.
In our example shown in Fig. 1, when the user loads the source, Karma incorrectly assigns the semantic type Gene.name to the DRUG_NAME column. To correct the problem, users click on the semantic type to bring up the semantic type specification dialog (Fig. 2). The dialog shows the top options computed by the CRF model. When the correct option is in the list, users can select it with a single click. Otherwise, users specify the class and property by typing it (with completion) or by selecting the appropriate class or property from an ontology browser. In our example, the correct semantic type Drug.name is the fourth option. After each adjustment to the semantic types, Karma retrains the CRF model and invokes the Steiner tree algorithm to recalculate the set of properties that tie together the semantic types. Figure 3 shows the updated model incorporating the user changes.
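Karma's actual labeling uses a CRF over token-level features [4]; purely as intuition for how candidate semantic types get ranked, here is a deliberately simplified stand-in that scores a new column against each type's training examples by shared character trigrams. All names and values are illustrative:

```python
from collections import Counter

def trigrams(text, n=3):
    """Multiset of overlapping character n-grams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def score(column_values, example_values):
    """Shared-trigram count between a new column and a type's training
    examples -- a toy proxy for a CRF's per-type score."""
    column = trigrams(" ".join(column_values))
    examples = trigrams(" ".join(example_values))
    return sum((column & examples).values())

# Hypothetical training data accumulated from earlier sessions.
training = {
    ("Drug", "name"): ["aspirin", "ibuprofen", "acetaminophen"],
    ("Gene", "name"): ["BRCA1", "TP53", "EGFR"],
}
column = ["naproxen", "aspirin"]
ranked = sorted(training, key=lambda t: score(column, training[t]),
                reverse=True)
# Drug.name outscores Gene.name for this drug-like column.
```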
The model proposed by Karma in Fig. 3 is not correct because it specifies that the Gene columns contain information about genes that cause the disease described in the Disease columns (it models the relationship using the isCausedBy property). The correct model is that the genes are involved in the pathways that are disrupted by the disease. Users can specify the correct properties by clicking on the pencil icons.
Figure 4 shows the pop-up that appears when clicking the pencil icon on the isCausedBy Gene cell. The pop-up shows domain/property pairs that satisfy two conditions: first, the class is a valid domain for the property, and second, the class the user clicked (Gene in our example) is a valid range for the property. In our example, the correct choice is the first one because the information in the table is about Pathways that involve our Gene. After users make a selection, Karma recomputes the Steiner tree, which is now required to include the class/property selections users make [5]. Figure 5 shows the correct, updated model.
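The domain/range filtering behind that pop-up can be sketched in a few lines. The mini-ontology below is hypothetical (isCausedBy follows the paper; the other property names are invented for illustration):

```python
# Each property maps to its declared domain and range classes.
ontology_properties = {
    "isCausedBy": {"domain": "Disease", "range": "Gene"},
    "involves": {"domain": "Pathway", "range": "Gene"},
    "isTreatedBy": {"domain": "Disease", "range": "Drug"},
}

def candidate_pairs(clicked_class, properties):
    """Domain/property pairs whose range admits the clicked class."""
    return [(spec["domain"], name)
            for name, spec in properties.items()
            if spec["range"] == clicked_class]

# Clicking Gene offers both Disease-isCausedBy and Pathway-involves.
pairs = candidate_pairs("Gene", ontology_properties)
```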
In the last part of the demonstration we show the RDF generation process. Once users are satisfied with the source model, they can generate and download the RDF for the whole source or view the RDF generated for a single cell (Fig. 5). A movie of the whole user-interaction process is available online.
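Conceptually, RDF generation applies the model's semantic types to each worksheet row. The stand-alone sketch below emits N-Triples lines; the base URI, column names, and subject-URI scheme are all hypothetical, not Karma's actual output format:

```python
def row_to_ntriples(row, model, base="http://example.org/"):
    """Apply (class, property) semantic types to one worksheet row.
    `model` maps column name -> (class, property); each mapped cell
    yields a type triple and a literal-valued property triple."""
    lines = []
    for column, (cls, prop) in model.items():
        subject = f"<{base}{cls}/{row['id']}>"
        lines.append(f"{subject} "
                     "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> "
                     f"<{base}{cls}> .")
        lines.append(f'{subject} <{base}{prop}> "{row[column]}" .')
    return lines

# One row of the corrected example: DRUG_NAME modeled as Drug.name.
model = {"DRUG_NAME": ("Drug", "name")}
row = {"id": "42", "DRUG_NAME": "aspirin"}
lines = row_to_ntriples(row, model)
```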
References
Barrasa-Rodriguez, J., Gómez-Pérez, A.: Upgrading relational legacy data to the semantic web. In: Proceedings of WWW Conference, pp. 1069–1070 (2006)
Becker, C., Bizer, C., Erdmann, M., Greaves, M.: Extending SMW+ with a Linked Data integration framework. In: Proceedings of ISWC (2010)
Fagin, R., Haas, L.M., Hernández, M., Miller, R.J., Popa, L., Velegrakis, Y.: Clio: schema mapping creation and data exchange. In: Borgida, A.T., Chaudhri, V.K., Giorgini, P., Yu, E.S. (eds.) Mylopoulos Festschrift. LNCS, vol. 5600, pp. 198–236. Springer, Heidelberg (2009)
Goel, A., Knoblock, C.A., Lerman, K.: Using conditional random fields to exploit token structure and labels for accurate semantic annotation. In: Proceedings of AAAI-11 (2011)
Knoblock, C.A., et al.: Semi-automatically mapping structured sources into the semantic web. In: Simperl, E., Cimiano, P., Polleres, A., Corcho, O., Presutti, V. (eds.) ESWC 2012. LNCS, vol. 7295, pp. 375–390. Springer, Heidelberg (2012)
Lafferty, J., McCallum, A., Pereira, F.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the Eighteenth International Conference on Machine Learning, pp. 282–289 (2001)
Ansell, P.: Model and prototype for querying multiple linked scientific datasets. Future Gener. Comput. Syst. 27(3), 329–333 (2011). http://www.sciencedirect.com/science/article/pii/S0167739X10001706
Tuchinda, R., Knoblock, C.A., Szekely, P.: Building mashups by demonstration. ACM Trans. Web (TWEB) 5(3), 1–50 (2011)
© 2015 Springer-Verlag Berlin Heidelberg
Cite this paper
Gupta, S., Szekely, P., Knoblock, C.A., Goel, A., Taheriyan, M., Muslea, M. (2015). Karma: A System for Mapping Structured Sources into the Semantic Web. In: Simperl, E., et al. The Semantic Web: ESWC 2012 Satellite Events. Lecture Notes in Computer Science, vol 7540. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46641-4_40
Print ISBN: 978-3-662-46640-7
Online ISBN: 978-3-662-46641-4