VIVO ETL using open source tools
- Keywords: etl, data loading, data mapping, json2rdf, robot, SPARQL
- TL;DR: An open source pipeline is available for loading data into VIVO from APIs, CSV, or JSON
- Abstract: Loading data into VIVO requires creating triples that conform to the VIVO ontologies. Data may come from a variety of sources and in a variety of formats. vivo-etl (https://github.com/mconlon17/vivo-etl) is a simple open source command-line pipeline that uses existing open source tools to extract data from a source, transform it to VIVO triples, and load the triples into a VIVO TDB data store. The method extracts data from an API using wget, transforms CSV or JSON data to "raw" RDF, and then transforms the "raw" RDF to VIVO RDF using a SPARQL CONSTRUCT query executed from the command line with robot, an open source tool (http://robot.obolibrary.org/). The VIVO triples can then be loaded using tdbloader. The method can be used to transform data from any source (CERIF, PubMed, Dimensions, local repositories) to the current VIVO ontologies, or to ontologies under development by the VIVO Ontology Interest Group. The presentation includes a demonstration that gathers data from ROR (Research Organization Registry) and provides the data as VIVO triples.
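
The intermediate "raw" RDF step can be illustrated with a minimal sketch. It is not code from vivo-etl: the `http://example.org/raw/` namespace, the function names, and the sample record are all hypothetical, and it assumes a flat JSON object whose keys are safe to append to a URI. Each scalar field becomes one N-Triples statement, with no attempt yet to match the VIVO ontologies.

```python
import json

# Hypothetical namespace for "raw" properties; not taken from vivo-etl.
RAW = "http://example.org/raw/"


def escape_literal(value):
    """Escape a string for use inside an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def json_to_raw_triples(record, subject):
    """Return one N-Triples line per scalar field of a flat JSON object."""
    lines = []
    for key, value in record.items():
        if isinstance(value, (dict, list)) or value is None:
            continue  # nested structures and nulls are skipped in this sketch
        predicate = f"<{RAW}{key}>"
        literal = f'"{escape_literal(str(value))}"'
        lines.append(f"<{subject}> {predicate} {literal} .")
    return lines


# Made-up sample record; real source data will have a different shape.
sample = json.loads(
    '{"id": "https://example.org/org/1",'
    ' "name": "Example University",'
    ' "established": 1900}'
)
triples = json_to_raw_triples(sample, sample["id"])
print("\n".join(triples))
```

Raw triples of this kind carry the source's own field names as predicates. In the pipeline described above, a SPARQL CONSTRUCT query would then rewrite them into triples that use VIVO classes and properties.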