Building massive knowledge graphs using automated ETL pipelines


Wolfgang Schell, Pauline Leoncio


Reading time: 10-12 minutes


Is it possible to build a large-scale knowledge graph that represents vast datasets with millions of diverse entities, integrates data from multiple sources and expresses complex, intricate relations?

 

A knowledge graph is a flexible semantic technology that can serve as the foundation for a wide variety of purposes and use cases, such as fast-tracking drug discovery and reducing research costs, smart manufacturing solutions to support human manufacturing planners, and global fraud detection and risk management. It can also unlock AI initiatives by enriching existing black-box solutions with machine-interpretable semantics and adding a layer of trust and transparency. While knowledge graphs are extremely versatile, there are a few considerations to keep in mind before you can start building your own, especially one of a grand scale. 

 

In this blog post, we’ll explore how to build a massive knowledge graph from existing information or external sources in a repeatable and scalable manner. We’ll go through the process step-by-step, and discuss how the Graph-Massivizer project supports the development of multiple large knowledge graphs and the considerations you need to take when creating your own graph. Keep reading! 

 


What is a knowledge graph?

In a previous metaphacts blog post, we delved into the characteristics and importance of the semantic knowledge graph. We defined knowledge graphs as: 

 

Large networks of entities representing real-world objects, like people and organizations, and abstract concepts, like professions and topics, together with their semantic relations and attributes. While many other definitions exist, ours places the emphasis on the semantic relations between entities, which are central to providing humans and machines with context and a means for automated reasoning.

 

Knowledge graphs can help organizations centralize, organize and understand internal data that is often stored away in disparate sources. A knowledge graph can be created from existing data sources such as databases, documents and structured data (JSON, XML, CSV, etc.). Depending on the volume of your datasets, it can range from a simple graph to an extensive repository with millions of entities and interlinked relations.

 

Considerations before creating a knowledge graph

Before a knowledge graph can be deployed and implemented for a specific use case, it first needs to be built, and several factors need to be considered when creating one. 

 

1. Source data

A knowledge graph can be constructed using data from various sources, enabling the linking and consolidation of diverse data in heterogeneous formats into a centralized place. However, there are still a few aspects to keep in mind, including the various types, formats and partitioning of source data. 

 

Structured vs. Unstructured data

 

While a knowledge graph may be populated with manually authored data (e.g., using form-based authoring), in most cases data is imported into the knowledge graph from existing sources.

 

Data sources can be divided into two main types: structured data and unstructured data.

 

Structured data includes machine-readable files and formats like JSON, XML, CSV or data obtained from (relational) databases that follow a well-defined syntax and may optionally also have schema information (e.g., relational databases). This kind of data is easy to read by scripts or code and is generally accessible using ready-made libraries and parsers.

 

Unstructured data includes documents such as emails, PDF files, images and scanned paper documents, which generally do not make the contained information available in a machine-readable format. Reading, parsing, understanding and converting unstructured data is a difficult task and is often approached using machine learning (ML) techniques like Optical Character Recognition (OCR), Named Entity Recognition (NER), text classification, or Entity Linking (EL) using services or libraries like spaCy.

Parsing and converting unstructured data to RDF is worth a separate blog post, so below, we’ll concentrate on graph creation from structured data.

 

2. Graph data model

While structured source data may be stored in several different tabular or hierarchical formats, the target data model, a graph, is represented using the Resource Description Framework (RDF), a semantic web standard for data interchange on the web. Using RDF, a graph can be expressed as a set of statements (or triples), each of which describes a single fact: an attribute of a resource (a node) or a relation (an edge) between two resources. This allows for easy merging, linking and sharing of structured and semi-structured data across various systems and applications.
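To make this concrete, here is a minimal sketch using the Python library rdflib; the resources and properties are made up for illustration:

from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import FOAF

EX = Namespace("http://example.org/")

g = Graph()
g.add((EX.alice, RDF.type, FOAF.Person))         # type of a node
g.add((EX.alice, FOAF.name, Literal("Alice")))   # attribute of a resource
g.add((EX.alice, EX.worksFor, EX.exampleCorp))   # relation between two resources

print(g.serialize(format="turtle"))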

 

While other graph models exist, such as Labeled Property Graphs (LPG), we are only reviewing RDF-based graphs here. RDF-star, an extension of RDF, supports expressing statements about statements, which allows you to model edges with attributes. Since any graph can be expressed using RDF-star, it can also serve as a bridge to and from Labeled Property Graphs.

 

3. Data partitioning

Real-world datasets are often too large to treat as a single unit, so partitioning comes into play to divide the data into workable chunks. Data can be partitioned by several different criteria, including:

 

  • source/origin of data

  • application domain, i.e., entity types

  • licensing

  • visibility/security

  • size

 

The criteria used depend on the overall size and use case.

 

Partitioning source data

 

When converting big datasets to RDF, handling the whole dataset in a single operation won't work. Breaking data into manageable pieces, e.g., breaking down big datasets into individual files, helps to perform RDF conversion and ingestion in a scalable manner.

 

When integrating data from a database, the data should be exported into a set of files, each holding an appropriate amount of data per conversion unit, up to a maximum size. This moves paging and partitioning out of the conversion process, since they depend heavily on the structure and amount of data as well as additional considerations such as (de-)normalization, the number and size of files to split into, and whether the full dataset or just an increment is handled.
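As a rough sketch of such a chunked export, assuming a SQLite database with a products table and a chunk size of 100,000 rows (adapt the query and the database driver to your own setup):

import csv
import sqlite3

CHUNK_SIZE = 100_000  # rows per conversion unit

conn = sqlite3.connect("source.db")
cursor = conn.execute("SELECT id, name, category FROM products ORDER BY id")

chunk_no = 0
while True:
    rows = cursor.fetchmany(CHUNK_SIZE)
    if not rows:
        break
    # one CSV file per chunk, later converted to RDF independently
    with open(f"products_{chunk_no:05d}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "name", "category"])  # header used by the RDF mapping
        writer.writerows(rows)
    chunk_no += 1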

 

Partitioning graph data into named graphs

 

Partitioning the source data is mainly relevant for RDF conversion. In addition, data also needs partitioning within the knowledge graph.

 

Partitioning graph data is performed by storing sub-graphs in so-called named graphs. A named graph is identified by an IRI and groups a set of RDF statements. Each named graph is further described using graph metadata, which includes type, description or license information.

 

Graph data may be split into named graphs based on different criteria, such as:

  • named graph per data domain (i.e., entity type)
  • named graph per source
  • a mix of those approaches 

Note that size (amount of data) is not typically considered for data partitioning within a knowledge graph. The selected partitioning strategy highly depends on the use case and application data.
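The following sketch shows how named graphs can be populated with rdflib; the graph IRIs and statements are purely illustrative:

from rdflib import Dataset, Literal, Namespace, URIRef

EX = Namespace("http://example.org/")

ds = Dataset()

# one named graph per data domain, each identified by an IRI
persons = ds.graph(URIRef("http://example.org/graphs/persons"))
persons.add((EX.alice, EX.name, Literal("Alice")))

orgs = ds.graph(URIRef("http://example.org/graphs/organizations"))
orgs.add((EX.exampleCorp, EX.name, Literal("Example Corp")))

# TriG serializes quads, i.e., statements together with their named graph
print(ds.serialize(format="trig"))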

 

These are some of the considerations you need to keep in mind when first building your knowledge graph. Understanding and aligning on these factors from the outset will ensure a smooth creation process. 

 

Knowledge graph creation approach

The next step in creating a knowledge graph is deciding on your methodology. You can create a knowledge graph using different approaches optimized for either performance or loose coupling.

 

Materialization

The most common approach to creating a knowledge graph is to materialize RDF in a graph database. This involves retrieving source data from the original data source, converting it to RDF (if required), and persisting it in a graph database.

 

A fully materialized knowledge graph provides the best performance for querying, as data is stored in a native graph format within a graph database and is optimized for graph querying and processing. Materialization works especially well for highly connected data, exposing relationships from data integrated from multiple sources.

 

Virtualization

With virtualization, data resides in its original data store and in its original format, e.g., as tabular data in a relational database. When querying the knowledge graph, data is retrieved from the original data source and the results are transformed to RDF on the fly.

 

This leads to loose coupling as data is only transferred and converted to RDF when requested and does not need to be maintained in two data stores at the same time. This approach is well suited for data that is required only rarely and follows a tabular structure, e.g., time series data or other types of mass data.

 

A drawback to this approach is that query performance is rather low since data has to be queried from another external source and transformed to RDF ad hoc.

 

Federation

Federation is a variant of virtualization: data is already available in RDF format and from a SPARQL endpoint, so the query semantics are the same as querying data from just a single RDF database. This approach is useful for combining data from public datasets like Wikidata with internal data or from multiple sources.

 

As data sources can be both internal and external, it allows you to combine confidential or sensitive data with public information. The original data is maintained by their respective owners or stewards but can be combined into one knowledge graph for specific purposes. The SPARQL 1.1 query language provides a built-in feature for expressing information needs over multiple RDF knowledge graphs. Using SERVICE clauses, it is possible to combine information from remote RDF sources with data from the local RDF database.
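As a sketch of such a federated query, the following Python snippet uses SPARQLWrapper to query a local SPARQL endpoint and pull in the population of cities from Wikidata via a SERVICE clause; the local endpoint URL and the basedIn property are assumptions for this example:

from SPARQLWrapper import SPARQLWrapper, JSON

# local SPARQL endpoint (URL is an assumption; adjust to your own setup)
sparql = SPARQLWrapper("http://localhost:7200/repositories/my-kg")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    SELECT ?company ?population WHERE {
        # local data: companies and the city they are based in
        ?company <http://example.org/basedIn> ?city .
        # remote data: population of that city, fetched from Wikidata
        SERVICE <https://query.wikidata.org/sparql> {
            ?city wdt:P1082 ?population .
        }
    }
""")

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["company"]["value"], row["population"]["value"])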

 

See our blog post on federation for more details.

 

Deciding on the approach

As described above, there are multiple approaches to creating a knowledge graph of this size. Here, we follow the materialization approach, as it provides the best trade-off for huge and highly connected graphs created from multiple sources.

 

This can be implemented using an ETL (extract-transform-load) or ELT (extract-load-transform) pipeline, which we’ll explore below. 

 

An ETL pipeline was developed as part of Graph-Massivizer, an EU-funded research project dedicated to researching and developing a scalable, sustainable and high-performing platform based on the massive graph representation of extreme data.

 

Knowledge graph creation process

Once you’ve decided on your approach, you can start creating your knowledge graph, which involves many different tasks and phases. Below, we’ll explore in-depth the important factors to consider during the knowledge graph creation process.

 

FAIR Data Principles


 

A knowledge graph is only one system within an enterprise environment consisting of interconnected systems and data sources. One key aspect in this world of interconnected systems and data is following FAIR Data principles:

 

  • Findable: Metadata and data should be easy to find for both humans and computers. Machine-readable metadata is essential for automatic discovery of datasets and services.

 

  • Accessible: Documentation should be available on how data can be accessed, possibly including authentication and authorization.

 

  • Interoperable: Data should be stored using models and metadata that allow it to be integrated with other data and be used with applications or workflows for analysis, storage, and processing.

 

  • Reusable: The ultimate goal of FAIR data is to facilitate the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings. 

 

Following the FAIR data principles ensures the reusability and interoperability of your knowledge graph, allowing you to enrich and extend your knowledge graph for various use cases and applications. 

 

A knowledge graph supports the FAIR principles through unique and persistent identifiers for entities, linked metadata describing the origin of datasets and the modalities for accessing and using them, semantic data models, and open standards for storing, accessing, and querying data.

 

The following sections provide more information on how this works in detail.

 

Iterative approach

When creating a knowledge graph from scratch, it is useful to apply an iterative approach, involving:

 

  • identifying source datasets and making them accessible

  • defining a semantic data model using ontologies and vocabularies

  • defining RDF mappings to convert from structured source data to RDF

  • pre-processing source data (per file), e.g., to clean up data

  • performing RDF conversion using the provided mappings

  • post-processing intermediate results (per file), e.g., to create additional relations or aggregate data

  • ingesting RDF data into the knowledge graph to persist the data in a graph database

  • post-processing intermediate results (whole graph), e.g., to create additional relations or aggregate data

  • performing data validation to ensure the graph conforms to the defined data model

 

The iterative approach to creating a knowledge graph

 

When violations are observed during data validation, the results can be used as a starting point for improving the pipeline. For example, source data can be fixed through data cleansing, or the ontology or RDF mappings can be adjusted, followed by another iteration of the data integration process or ETL pipeline.

 

Providing dataset metadata

Data catalogs are a core building block for any FAIR data implementation, as they connect the available data assets with the knowledge graph. They support both interoperability as well as accessibility, as defined in the FAIR data principles.

 

In this approach, the data catalog is represented as a knowledge graph itself. It is semantically described with descriptive metadata and access metadata, is interlinked with other parts of the knowledge graph, such as ontologies and vocabularies, and is embedded into and connected with the data assets. It therefore provides a unified access layer for the end user, with traceable and explainable connections down to the original data sources.

 

Dataset descriptions (or data catalogs) are based on the open and extensible W3C standard Data Catalog Vocabulary (DCAT) to make the data discoverable, accessible and traceable.

 

Dataset metadata may provide information on several important aspects of a dataset, such as:

  • provenance information to describe where data originated from

  • lineage information to record previously applied processing steps and how data passed from one step in a data mesh to another

  • licensing information to specify whether and how data can be used, changed or published

  • the timestamp of creation and last update to help understand how recent the contained data is

  • links to distributions of the dataset, to allow automated access and selecting the right format
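A minimal DCAT description covering a few of the aspects above might look like this (IRIs, titles and dates are illustrative); here it is parsed into an rdflib graph:

from rdflib import Graph

dataset_metadata = """
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <http://example.org/datasets/> .

ex:covid-publications a dcat:Dataset ;
    dct:title "COVID-19 publications" ;
    dct:license <https://creativecommons.org/licenses/by/4.0/> ;
    dct:modified "2023-05-01"^^<http://www.w3.org/2001/XMLSchema#date> ;
    dcat:distribution [
        a dcat:Distribution ;
        dcat:downloadURL <http://example.org/files/publications.csv> ;
        dcat:mediaType "text/csv"
    ] .
"""

catalog = Graph().parse(data=dataset_metadata, format="turtle")
print(len(catalog), "metadata statements loaded")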

 

With dataset descriptions, humans and machines (i.e., AI/ML algorithms) can consume data in context since the data is directly linked to the models and dataset descriptions, which themselves are based on open standards, are shareable and can even be queried all at once through a single, semantic query language.

 

See our blog post on data catalogs for more details on dataset metadata.

 

Defining a semantic data model

The next step in creating your knowledge graph is defining your data model. A knowledge graph typically follows one or multiple well-defined schemas which are specified using ontologies and vocabularies.

 

Ontologies

 

Ontologies are semantic data models that define the types of entities that exist in a domain and the properties that can be used to describe them. An ontology combines a representation, formal naming and definition of the elements (such as classes, attributes and relations) that define the domain of discourse. You may think of it as the logical graph model that defines what types (sets) of entities exist, their shared attributes and logical relations. Ontologies can be specified using open standards like OWL and SHACL.
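For illustration, a tiny ontology fragment defining a class together with a SHACL shape constraining it might look like this; the namespace and names are made up:

from rdflib import Graph

ontology = """
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/ontology/> .

ex:Publication a owl:Class .
ex:title a owl:DatatypeProperty .

ex:PublicationShape a sh:NodeShape ;
    sh:targetClass ex:Publication ;
    sh:property [
        sh:path ex:title ;
        sh:datatype xsd:string ;
        sh:minCount 1    # every publication needs at least one title
    ] .
"""

g = Graph().parse(data=ontology, format="turtle")
print(len(g), "ontology statements")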

 

Vocabularies

 

Vocabularies are controlled term collections organized in concept schemes that support knowledge graph experts, domain experts and business users in capturing business-relevant terminology, i.e., synonyms, hyponyms, spelling and language variations. A term may include preferred and alternative labels (synonyms) in multiple languages and carry natural language definitions. Terms can be related to each other, i.e., a term can be marked as equal to, broader or narrower than (hyponyms), or loosely related to another term. The most common examples of different types of vocabularies are thesauri, taxonomies, terminologies, glossaries, classification schemes and subject headings, which can be managed using SKOS as an open standard.

 

Multiple uses of the semantic data model

 

A data model defined using ontologies and vocabularies can be used for multiple purposes:

 

  • It can be used for documentation (ideally including a graphical view) of the data model, which helps in understanding the data as well as finding connections between entities.

  • It can help when creating data mappings. The source data model (i.e., the data model of a JSON file or the schema of a relational database, for example) needs to be mapped to the RDF data model for the conversion process. Being able to easily identify properties and relations or connections between entity types greatly facilitates the authoring of mappings.

  • It allows for data validation. When the ontology is defined using OWL and SHACL, it can be used to automatically validate the database and ensure that data follows the defined data model.

  • It can drive an explorative user interface. When data is fully described, generic exploration of the dataset is much easier. Also, a knowledge graph engineer or application engineer may build a custom user interface for the dataset, which is greatly facilitated by good documentation of the data model.

 

See our blog post on ontologies and vocabularies for more information.

 

Develop RDF mappings

The mapping process enables the conversion of huge amounts of source data to RDF in an automated fashion. Converting structured data to RDF can be done by mapping certain elements and attributes from the source files to RDF data using a set of mapping rules.

 

As an example, all values of a column in a CSV file or a table in a relational database are mapped to RDF statements with the row's unique key being mapped to a subject IRI, the column to a predicate and the row value to the object position of a triple. Mapping rules can be provided either in a declarative way or programmatically.

 

Declarative mappings

Declarative mappings follow the so-called no-code approach, meaning they can be defined using a simple text editor or visual tools, without requiring special programming skills.

 

The mappings are defined using the standardized RDF Mapping Language (RML). RML itself is based on RDF, so the data model (ontology), the mappings (RML maps) and the instance data all use the same format. RML supports both tabular/relational and hierarchical data structures in formats like CSV, JSON or XML. Support for other formats can be provided as well.

 

RML defines just the mapping language. A wide range of implementations in the form of mapping engines (most of them open-source) are available. They can be used either as stand-alone tools or embedded into custom applications as a library.
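To give an impression of what this looks like in practice, here is a small, hypothetical RML mapping for a CSV file of publications, materialized with morph-kgc, one such open-source engine that can be used as a Python library (file names, vocabulary and the exact configuration are assumptions):

import morph_kgc  # one of several open-source RML mapping engines

# RML mapping: each row of publications.csv becomes one ex:Publication resource
rml_mapping = """
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .
@prefix ex:  <http://example.org/ontology/> .

<#PublicationMap>
    rml:logicalSource [
        rml:source "publications.csv" ;
        rml:referenceFormulation ql:CSV
    ] ;
    rr:subjectMap [
        rr:template "http://example.org/publication/{id}" ;
        rr:class ex:Publication
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:title ;
        rr:objectMap [ rml:reference "title" ]
    ] .
"""

with open("mapping.rml.ttl", "w") as f:
    f.write(rml_mapping)

config = """
[Publications]
mappings: mapping.rml.ttl
"""

# materialize() returns an rdflib graph with the converted RDF data
graph = morph_kgc.materialize(config)
graph.serialize(destination="publications.ttl", format="turtle")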

 

Programmatic mappings

Implementing the mapping process using a custom program is the most flexible way to convert data to RDF. All means provided by the programming language and its ecosystem—such as frameworks and libraries—can be used (e.g., accessing data in various formats). Also, language-specific connectors, such as JDBC to access relational databases in the Java programming language, or web service connectors provide great flexibility. The biggest advantage is full control over the mapping process, as any kind of algorithm, data generation, use of caches and memory, navigating data structure or control flow is possible.
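A minimal programmatic counterpart to the declarative mapping above, written directly in Python with the csv module and rdflib (column and property names are again illustrative):

import csv
from rdflib import Graph, Literal, Namespace, RDF, URIRef

EX = Namespace("http://example.org/ontology/")

g = Graph()
with open("publications.csv", newline="") as f:
    for row in csv.DictReader(f):
        subject = URIRef(f"http://example.org/publication/{row['id']}")
        g.add((subject, RDF.type, EX.Publication))
        g.add((subject, EX.title, Literal(row["title"])))
        # full control: arbitrary cleanup, lookups or caching could happen here

g.serialize(destination="publications.ttl", format="turtle")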

 

Choosing between a declarative or programmatic approach

Using declarative mappings based on RML is the quickest and easiest way to implement mappings from structured data to RDF as it follows a pre-defined approach that covers many use cases and formats and does not require special programming skills.

 

Only when declarative mappings do not suffice for the mapping at hand should mappings be implemented as a custom program. While programmatic mappings allow for greater flexibility, this approach also requires more effort and programming skills, which are not necessarily available to the people implementing a data pipeline.

 

In cases where declarative mappings cover most of the data structures to be mapped to RDF and only a few more complicated cases remain, a hybrid approach may be suitable: most mappings are implemented declaratively in RML, and only the special cases are handled by custom code.

 

Performing pre- and post-processing

Besides converting source data as-is to RDF, additional steps are sometimes required to conform to the graph data model. These may be performed as pre- or post-processing steps, either on the original source data before the RDF conversion or on the converted RDF afterwards.

 

Pre-processing steps typically work on the unit of a single source file. Typical examples are:

  • data cleansing
  • filtering of invalid data
  • splitting out units from numerical values
  • datatype conversions to conform with certain numeric or date-time formats
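A small pandas sketch covering a few of these pre-processing steps on a hypothetical measurements file might look like this:

import pandas as pd

df = pd.read_csv("measurements.csv")

# filter out rows with missing identifiers (invalid data)
df = df.dropna(subset=["id"])

# split a value like "12.5 kg" into a numeric column and a unit column
df[["weight_value", "weight_unit"]] = df["weight"].str.split(" ", n=1, expand=True)
df["weight_value"] = pd.to_numeric(df["weight_value"], errors="coerce")

# normalize dates to ISO 8601 so they map cleanly to xsd:date
df["created"] = pd.to_datetime(df["created"], errors="coerce").dt.strftime("%Y-%m-%d")

df.to_csv("measurements_clean.csv", index=False)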

 

Post-processing steps may either be performed on the intermediate RDF files or the whole graph. Typical examples are:

  • specifying the named graph for a set of statements
  • updating graph metadata, such as the timestamp of the last update of a dataset based on source data
  • creating links between entities
  • generating a SKOS vocabulary out of keywords stored as attributes on some types of entities
  • rewriting or replacing subject IRIs based on existing data
  • aggregating data
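As an example of such a post-processing step on an intermediate RDF file, the following sketch uses a SPARQL update to create direct links between publications that share an author; the property names are illustrative:

from rdflib import Graph

g = Graph().parse("publications.ttl", format="turtle")

# derive a direct co-publication link from shared authorship
g.update("""
    PREFIX ex: <http://example.org/ontology/>
    INSERT { ?pub1 ex:relatedTo ?pub2 }
    WHERE {
        ?pub1 ex:hasAuthor ?author .
        ?pub2 ex:hasAuthor ?author .
        FILTER(?pub1 != ?pub2)
    }
""")

g.serialize(destination="publications_enriched.ttl", format="turtle")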

 

Data Ingestion

The result of the previous steps is a set of files in RDF format. This set of files may already be used to distribute the data in RDF format, e.g., as a data product.

 

As a next step, ingesting this file-based dataset into a graph database provides the basis for easy querying and graph analytics supported by the database engine.
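Here is a minimal ingestion sketch using the RDF4J-style REST API exposed by GraphDB, the database we use in the implementation below; the repository URL and named graph IRI are assumptions to be adapted to your setup:

import requests

GRAPHDB_REPO = "http://localhost:7200/repositories/my-kg"
NAMED_GRAPH = "<http://example.org/graphs/publications>"

with open("publications_enriched.ttl", "rb") as f:
    response = requests.post(
        f"{GRAPHDB_REPO}/statements",
        params={"context": NAMED_GRAPH},   # target named graph
        data=f,
        headers={"Content-Type": "text/turtle"},
    )
response.raise_for_status()
print("Ingestion finished with status", response.status_code)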

 

In addition to loading the data into the database for querying using the SPARQL query language, creating a full-text search index enables additional capabilities when searching for textual data in the graph. This is typically handed off from the database to specialized and tightly integrated full-text search engines like Lucene, Solr, or Elasticsearch.

 

Performing data validation

Once all data has been converted to RDF and ingested into the database, it can be validated to ensure good data quality.

 

When defining the ontology using OWL and SHACL, the model description can be used to automatically validate the database and ensure that data follows the defined model. This can be done using a so-called SHACL engine, which verifies that the data in the database adheres to the shapes defined in the ontology. SHACL engines are provided by (commercial) RDF databases as well as open-source projects or commercial tools such as metaphactory.
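With pySHACL, one such open-source SHACL engine for Python, a validation run might look like this; the file names are illustrative:

from pyshacl import validate
from rdflib import Graph

data_graph = Graph().parse("publications_enriched.ttl", format="turtle")
shapes_graph = Graph().parse("ontology.ttl", format="turtle")  # contains the SHACL shapes

conforms, report_graph, report_text = validate(
    data_graph,
    shacl_graph=shapes_graph,
    inference="rdfs",     # optionally apply RDFS inference before validation
)

print("Data conforms to the model:", conforms)
if not conforms:
    print(report_text)    # human-readable list of violations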

 

In addition to verifying the conformity of the data to the data model, it is possible to check for additional assumptions. Examples include verifying the existence of dataset metadata for each named graph or confirming an expected set of named graphs after full ingestion to detect any missing data. 

 

Graph updates

Once a knowledge graph is created, it needs to be kept up to date to reflect changes in the underlying source datasets. This can be done using different approaches.

 

Full update

The easiest approach (though not the most efficient as we will see in the next section) is to simply re-create the whole graph whenever new or updated source data is available. This follows the process of creating the original graph with the new version replacing the previous one. 

 

For interactive uses of a graph, a new version might be created while the previous one is still in use. Once the graph creation is complete, the graphs can simply be swapped/replaced.

 

Incremental update

Re-creating the full graph for every change can be costly, time-consuming or simply not feasible for big graphs or for frequent yet small changes to the source datasets. Incremental updates help isolate changes in the source datasets and apply them selectively to the knowledge graph.

 

For new data, this can be done by extracting the newly added data as a set of files that are converted to RDF and ingested into the database. You need to be mindful of data partitioning during this process: depending on the newly added data and the partitioning strategy, the new data can be put into a new partition (i.e., named graph) with corresponding metadata (including updated timestamps) or simply be added to existing named graphs, e.g., the one for the respective entity type.

 

Any changes to the data or removal of data need to be reflected in the graph accordingly. Changes in source data translate to deleted and added statements in the corresponding RDF dataset, so careful tracking of obsolete data is required. 
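A sketch of such an incremental change, expressed as a SPARQL 1.1 update sent to the database (endpoint and IRIs are again assumptions): an outdated title is deleted and the corrected one inserted in a single operation:

import requests

GRAPHDB_REPO = "http://localhost:7200/repositories/my-kg"

update = """
    PREFIX ex: <http://example.org/ontology/>
    WITH <http://example.org/graphs/publications>
    DELETE { <http://example.org/publication/123> ex:title ?old }
    INSERT { <http://example.org/publication/123> ex:title "Corrected title" }
    WHERE  { <http://example.org/publication/123> ex:title ?old }
"""

# SPARQL 1.1 protocol: updates are sent as the 'update' form parameter
response = requests.post(f"{GRAPHDB_REPO}/statements", data={"update": update})
response.raise_for_status()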

 

Applying incremental graph updates is a complex and highly domain- and data source-specific task, but it has the great benefit of keeping the graph up to date without requiring a full re-creation of the whole knowledge graph.

 

Automation

The creation of a knowledge graph is often not merely a one-time effort but rather a process that needs to be repeated continuously to maintain the graph's currency and incorporate new data sources as necessary.

 

Automating this process is the key to quick and efficient graph creation with repeatable outcomes, and helps to avoid errors caused by the manual execution of the steps defined above. For example, automation can be achieved using an ETL pipeline, including a workflow engine that is well-integrated with all relevant systems and services. The overall workflow needs to be resilient against transient errors in a distributed system landscape, as well as customizable and adjustable to the user's runtime environment.

 

Implementation

The conceptual overview of an ETL pipeline, as outlined above, describes all relevant aspects of large-scale knowledge graph creation. The pipeline follows the Materialization approach and can be used whenever a knowledge graph must be created from highly connected source data.

 

In this section, we’ll cover a concrete implementation in the public cloud, taking advantage of many of the existing building blocks and scale-out options on the massive infrastructure provided by AWS (Amazon Web Services). As a graph database, we use GraphDB from Ontotext, which provides enterprise features like a performant and scalable database, integrated full-text search, native support for advanced graph analytics, data validation using SHACL and many more.

 

The following diagram describes the architecture of the ETL pipeline on AWS:

 

The pipeline uses a multitude of AWS services to implement the RDF conversion and ingestion process with a cloud-native approach resulting in high parallelization and efficient use of resources.

 

The following services are used:

 

Configuration files and user-provided RML mapping files are stored in an S3 bucket (AWS's object storage service), making them accessible from all AWS services. Similarly, the source files to be converted and ingested are expected in an S3 bucket, as is the output of the RDF conversion. This allows us to keep the RDF version of the source data and use or distribute it independently from the data in the RDF database, e.g., as a data product.
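For example, a conversion worker can pick up partitioned source files from the input bucket and write its RDF output back to another bucket, roughly as follows (bucket names and keys are made up):

import boto3

s3 = boto3.client("s3")

# list the partitioned source files waiting for conversion
objects = s3.list_objects_v2(Bucket="my-kg-source-data", Prefix="publications/")
for obj in objects.get("Contents", []):
    key = obj["Key"]
    s3.download_file("my-kg-source-data", key, "/tmp/input.csv")

    # ... run the RDF conversion on /tmp/input.csv, producing /tmp/output.ttl ...

    s3.upload_file("/tmp/output.ttl", "my-kg-rdf-output", key.replace(".csv", ".ttl"))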

 

The pipeline uses a sophisticated workflow for conversion, ingestion, state management and plumbing. The individual workflow steps are described in detail in this architecture documentation.

 

Try it for yourself!

Even with multiple external data sources and vast amounts of data, it’s possible to construct a knowledge graph of considerable size in a way that scales and can be extended.

 

Ready to try it for yourself? For a public demonstration, we’ll use the Dimensions Covid Dataset as an example of a large, publicly available dataset to create a scientific knowledge graph using the ETL pipeline. 

 

The dataset provides information on global publications, academic papers, authors, research organizations, funders, grants, datasets and clinical trials. The zipped dataset (1.09GB) is available for download on Figshare. The data files are in CSV format and the fields are described in the documentation of the main Dimensions dataset (although not all documented fields are available in this publicly available subset).

 

The semantic data model and dataset description as well as the corresponding RML mappings are provided as an example in the ETL pipeline Git repository.

 

Download the Git repository here, which includes all artifacts, and follow the set-up guide and simple example based on the Dimensions Covid dataset. See here for the prerequisites for deploying and running the pipeline and an example on AWS.

 

We’d love to hear about your experience in constructing your knowledge graph. Share it with us on social media or send us a note on how it went.

 

Happy knowledge graph creation!

Wolfgang Schell

As a Principal Software Architect at metaphacts, Wolfgang works with the software engineering team to translate customer needs into sustainable features and implement these in a holistic architecture. As an enthusiastic software developer, he is also involved with the Mannheim Java User Group (Majug) and the JugendHackt Lab Mannheim.

Pauline Leoncio

Pauline Leoncio is an experienced copywriter and content marketer with over six years in marketing. She's developed content plans and creative marketing material for growing B2B and B2C tech companies and covers a range of topics including finance, advanced tech, semantic web, food, art & culture and more.