Transparent federation with FedX - Solving challenges with new optimization techniques


Andreas Schwarte


Reading time: approx. 10 - 12 minutes


Federation with FedX

Discover how transparent federation in metaphactory enables a unified view across distributed knowledge graphs, addressing challenges like data silos and compliance. Principal Engineer Andreas Schwarte discusses its evolution, optimization and real-world applications.

 

 

Transparent federation with FedX

In many enterprises, data is distributed across multiple knowledge graphs. This distribution supports considerations such as governance policies, decentralization strategies, regulatory compliance and access control mechanisms. Individuals or organizational units manage and govern their knowledge, and implement distinct applications and use cases.

 

On a global level, there is an opportunity to connect distributed knowledge and obtain an integrated view of all available information, without requiring centralized integration into a single graph. Consider, for example, an enterprise operating in multiple regions, where business data is distributed by region (e.g., to achieve compliance). For management, a holistic view of all data is clearly desirable. As another example, consider multiple projects that each populate their own knowledge graph. When combining information from different, isolated databases, possibly even enriching it with information from public sources, new knowledge and insights may be derived.

 

In metaphactory, we provide transparent federation as a technology to respond to those requirements and opportunities. Transparent federation virtually integrates multiple knowledge graphs and further allows augmenting the information with data from public sources. 

 

In this blog post, Andreas Schwarte, Principal Engineer at metaphacts, reflects on the history of transparent federation with FedX and provides an overview of the recently introduced optimization techniques that address our own and our customers’ challenges. Lastly, he will also present several techniques that support practical federated use-cases. Keep reading! 

 

Table of contents

  • A retrospective on more than 10 years of FedX
  • FedX and metaphactory
  • Optimizations at source selection time
  • Optimizations during query execution
  • Further techniques supporting specific use cases
  • Summary & outlook
  • Try it for yourself

A retrospective on more than 10 years of FedX

FedX was initially designed and implemented while I was in my master's program in 2010/2011. It came at a time when federation was still in its infancy. Transparent federation over distributed SPARQL data sources (with a focus on finding practical approaches and optimization techniques) was a particularly hot topic, especially in research.

 

In that period, a number of relevant foundations were developed and published, such as our work on "Optimization techniques for federated query processing on linked data", presented at ISWC 2011. Other researchers also put emphasis on federation techniques, looking at various aspects of federation from different angles. I should note that the SPARQL 1.1 standard with the federated query extensions was not yet widely available at that time; it only formally became a W3C Recommendation in 2013.

 

The figure below depicts the federated query processing model in FedX. After a parsing step, the relevant sources for each statement pattern are determined; in FedX, this is done with ASK queries in combination with a cache. Using this information, global optimizations are applied, which specifically include grouping statements that have the same relevant source and adjusting the overall join order. During query execution, the engine generates sub-queries and executes them concurrently at the relevant sources. The retrieved partial results are locally aggregated and supplied to subsequent operators. Finally, the query result is returned.

 

Image: Federated query processing model in FedX

 

A crucial performance aspect in federated query processing is the number of remote requests to the federation members. Besides source selection (which focuses query processing on the relevant sources that can contribute to the final result), the central technique of FedX is that of bind joins.

 

Bind joins group input bindings from intermediate results into blocks and thereby push the join to the database. In the initial implementation, this was done using a complex SPARQL UNION construct, which was later rewritten to make use of the SPARQL 1.1 VALUES clause. Together with the engine's concurrency infrastructure, this allows FedX to achieve good overall performance in federated query processing.

 

Over the next few years, the FedX engine was integrated and used in commercial products. In 2019, it was contributed to the open source community and integrated into the RDF4J project. As part of that, a major rewrite and modernization (specifically of the concurrency infrastructure) was carried out. However, the central optimization techniques and their implementation are still very much the same as in the initial version.

 

FedX and metaphactory

Around the same time, the FedX engine was integrated into metaphactory as a feature to allow virtual integration of information from different managed repositories. Managed repositories here specifically means that, with FedX being part of metaphactory, aspects like authentication to the individual databases and centralized configuration are possible. Additionally, FedX has become the underlying technical engine for Ephedra, supporting hybrid queries for augmenting data with information from non-SPARQL sources. With this tight integration, the engine has further matured over time, both through its maintenance in the RDF4J project and through its use in metaphactory.

 

While being used by metaphacts and our customers, we have identified several challenges and new requirements for specific use-cases. In the following, we describe some of our novel techniques to address these.

 

Optimizations at source selection time

a) Model-driven source selection

In enterprise federation scenarios, we often find that the data is distributed by domains. Let's say there is one database for projects, one for products and one for customers. There can be a variety of reasons for this, but one of them is governance and ownership (e.g., by different departments of an organization). Typically, you find such data being modeled according to ontologies.

 

The motivation for model-driven source selection is to provide the federation engine with knowledge about the data distribution in the form of such schema information: as part of source selection, the ontology information can be used to statically provide locality information for statement patterns, i.e., additional database interaction using ASK queries for source selection can be avoided.

 

Following up on the example above with projects, products and customers, let's assume there exists a Product ontology with a class po:Product. In the FedX transparent federation, this ontology can be associated with the "products" federation member as follows:

fedx:member [
  fedx:store "ResolvableRepository" ;
  fedx:repositoryName "products" ;
  fedx:model [
    fedx:ontology <https://example.com/product-ontology>
  ]
]
... 

Similarly, ontologies can be associated with other federation members, and it is also possible to associate multiple ontologies with a given federation member. Finally, for advanced cases it is possible to restrict the fedx:model to a subset of included ontology elements.
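As an illustration, the following sketch associates two ontologies with the "customers" federation member, following the configuration style above (the address ontology IRI is hypothetical and used only for illustration):

fedx:member [
  fedx:store "ResolvableRepository" ;
  fedx:repositoryName "customers" ;
  fedx:model [
    # the customer ontology and an additional (hypothetical) address ontology
    fedx:ontology <https://example.com/customer-ontology> , <https://example.com/address-ontology>
  ]
]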

 

Model-driven source selection applies to two kinds of statement patterns:

  • ?subject PREDICATE ?object, where PREDICATE is an attribute or relation defined in one of the ontologies

  • ?subject a CLASS, where CLASS is a class defined in one of the ontologies

 

Technically, the FedX engine maintains an index with mappings from predicates to relevant endpoints (i.e., owl:DatatypeProperty and owl:ObjectProperty elements declared in the referenced ontologies), as well as mappings from types to relevant endpoints (i.e., owl:Class instances). This information is then used at source selection time to annotate applicable statement patterns with their sources. For patterns not handled this way, the usual FedX source selection mechanism with ASK queries applies as a fallback.
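For illustration, a minimal excerpt of the product ontology could declare elements like the following (a sketch; po:displayName is a hypothetical attribute used only for illustration):

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix po: <https://example.com/product-ontology/> .

# indexed as a type for "?subject a CLASS" patterns
po:Product a owl:Class .

# indexed as predicates for "?subject PREDICATE ?object" patterns
po:relevantFor a owl:ObjectProperty .
po:displayName a owl:DatatypeProperty .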

 

Please note that for model-driven source selection, a "white-listing" approach is applied: when a specific ontology is associated with a given federation member and an ontology element matches a statement pattern, the federation assumes that only this repository can provide information for the statement pattern. As a consequence, if another repository could also provide information for a statement matching the ontology but is not explicitly associated with it, that repository is not considered during source selection.

 

Example:

  • Member 1 has statement (:bob schema:name "Bob")

  • Member 2 has statement (:alice schema:name "Alice")

  • The Schema.org ontology is associated with Member 1 via configuration

  • Source selection is applied on statement pattern ?person schema:name ?name

Here, model-driven source selection will derive Member 1 as the single relevant source, i.e., the federation will ignore Member 2 during query evaluation. 

 

In addition to the static configuration of associating an ontology (and optionally specific elements) with a federation member, metaphactory supports a query-driven configuration, which allows describing data sources and the model using RDF. As an outlook, we foresee integrating this approach with dataset descriptions in metaphactory in the future.

 

Reducing source selection to static information is beneficial as it decreases the number of remote requests and hence improves performance. Moreover, it allows the engine to make use of a model that describes the overall enterprise data federation.

 

b) Co-Located Statements optimizer 

The co-located statements optimizer makes use of a data modeling convention that is often found in practice: the co-location of (outgoing) statements belonging to a resource in a single database.

 

Consider again, for illustration, the scenario with products, customers and projects. In this scenario, the products with their respective attributes (e.g., display names) can be found in one database, while customer information and projects reside in different databases.

 

Looking at a generic query as below, which lists products and their display names, we can see the difficulty for the federation layer. The federation engine, without further knowledge, must assume that the rdfs:label statement pattern needs to be evaluated at all federation members.

 

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX po: <https://example.com/product-ontology/>
SELECT * WHERE {
  ?product rdf:type po:Product .
  ?product rdfs:label ?label .
}

This optimization takes data locality into account: it assumes that in practice all outgoing properties of a resource are defined in the same repository. Hence, the optimizer identifies statement patterns of a join group with an unbound subject and groups them by their subject variable name. The statement sources of those statements are then defined as the intersection of the group's statement sources. In the example above, we can push the entire join to the products database. So, instead of first fetching all products and then, for each product, attempting to look up the label in all federation members, this concrete example can be answered with a single remote request to the products database.

 

The optimization generalizes to join groups within the query and specifically supports properties defined in upper ontologies. In the following example, both products and customers define an identifier using the core:id attribute, which is defined in the shared core ontology.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX core: <https://example.com/core-ontology/>
PREFIX po: <https://example.com/product-ontology/>
PREFIX co: <https://example.com/customer-ontology/>
SELECT * WHERE {
  ?product a po:Product .
  ?product core:id ?productId .
  ?product po:relevantFor ?customer .
  ?customer a co:Customer .
  ?customer core:id ?customerId . 
}
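Conceptually, with the co-located statements optimizer enabled, the engine can group the patterns by subject variable and evaluate them as two sub-queries, roughly as sketched below (an illustrative sketch; the actual sub-queries generated by the engine may differ). The join on ?customer is then performed by the federation engine, e.g., using bind joins.

# evaluated at the "products" member (prefixes as above)
SELECT * WHERE {
  ?product a po:Product .
  ?product core:id ?productId .
  ?product po:relevantFor ?customer .
}

# evaluated at the "customers" member (prefixes as above)
SELECT * WHERE {
  ?customer a co:Customer .
  ?customer core:id ?customerId .
}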

 

As the applicability of the co-located statements optimizer depends on how the data is organized and partitioned, it is an opt-in optimizer which needs to be explicitly turned on. This can be done in the federated repository configuration using the fedx:coLocationOptimizer setting. Note that it can be globally enabled (fedx:coLocationOptimizer true) or restricted to a set of included predicates:

fedx:coLocationOptimizer
  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ,
  <http://www.w3.org/2000/01/rdf-schema#label> , ... ;

 

c) Namespace-based source selection

In metaphactory, we find many resource-based access patterns. These range from navigating to a resource to display its information, over displaying and listing resources (e.g., in search results), to visual exploration in the Graph Canvas.

 

All these use-cases have in common that query patterns are subject-oriented, i.e., the subject is typically bound to a concrete resource:

 

# a concrete product in the products namespace
<https://example.com/products/My-Product> rdf:type ?type

# a concrete customer in the customers namespace
<https://example.com/customers/ABC-Org> rdfs:label | schema:name ?label

# the Wikidata identifier for metaphacts
<http://www.wikidata.org/entity/Q22132500> ?property ?value

In such scenarios, the federation members can be associated with static information about namespaces. In metaphactory, this information is defined using the void:uriSpace property as part of the federation member configuration:

@prefix config: <tag:rdf4j.org,2023:config/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix mph: <http://www.metaphacts.com/ontologies/platform/repository#> .
@prefix void: <http://rdfs.org/ns/void#> .

[] a config:Repository;
  config:rep.id "wikidata";
  rdfs:label "Wikidata member Repository";
  config:rep.impl [
      config:rep.type "metaphactory:SPARQLRepository";
      config:sparql.queryEndpoint <https://wikidata.metaphacts.com/sparql>;
  ];
  mph:extension [
      mph:extensionType "metaphacts:RepositoryMetadata";
      void:uriSpace <http://www.wikidata.org/entity/>
  ] .

This configuration means that when the source selection optimizer finds a statement pattern with a bound subject in the http://www.wikidata.org/entity/ namespace, it statically assigns the wikidata federation member as the source. Therefore, a number of remote requests for source selection are avoided. (Note that, depending on the data organization and partitioning, this plays very well together with the co-located statements optimizer.)

 

Optimizations during query execution

During query execution, the federation engine makes use of bind joins: intermediate results are grouped and then passed on to the next join operand as a block of bindings. In terms of query execution, this means that the bindings are injected into the sub-query using a VALUES clause. This optimization was one of the early core contributions of the FedX federation engine.
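As an illustration, a bind join evaluating the pattern ?product rdfs:label ?label for a block of three previously retrieved products could send a sub-query like the one sketched below (the product IRIs are hypothetical, and the sub-queries actually generated by FedX may look slightly different):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?product ?label WHERE {
  # block of input bindings injected by the federation engine
  VALUES ?product {
    <https://example.com/products/Product-1>
    <https://example.com/products/Product-2>
    <https://example.com/products/Product-3>
  }
  ?product rdfs:label ?label .
}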

 

In our enterprise data federation, we observed scenarios where this technique can be improved further. We extended the federation engine with adaptive bind joins and transferred the idea of bind joins to the evaluation of OPTIONAL joins (i.e., left joins).

 

a) Adaptive bind joins

Adaptive bind joins are an improvement for bind joins with large intermediate result sets. The original bind join approach assumes equally sized blocks of intermediate inputs.

 

Example: 

 

With a bind join size of 50 and 200 bindings as intermediate inputs, the bind join would be executed as 4 sub-queries of 50 bindings each.

 

However, in practice, we see use cases where thousands or tens of thousands of intermediate bindings need to be processed. In such scenarios, a fixed bind join size slows down the overall query execution pipeline due to the high number of required sub-queries.

 

Adaptive bind joins use a dynamic bind join size for the blocks: for the first 500 intermediate results, the default bind join size of 50 applies; for processing further items (up to 10,000), the bind join size is set to 1,000; and finally, if more than 10,000 bindings need to be processed, the engine uses a bind join size of 10,000. The exact split points are configurable; the values above turned out to be the best-performing ones in our evaluations.
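As a rough worked example with the default split points: for 25,000 intermediate bindings, the engine would issue about 10 sub-queries of 50 bindings for the first 500 results, roughly 10 sub-queries of 1,000 bindings until the 10,000 mark is reached, and 2 sub-queries of up to 10,000 bindings for the remainder, i.e., around 22 sub-queries in total instead of the 500 sub-queries required with a fixed bind join size of 50.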

 

 

 

Image: Adaptive bind joins

 

b) Left Bind Joins

The Left Bind Join algorithm adapts the ideas of regular bind joins and applies them to OPTIONAL joins. The goal of this optimization is the same as for bind joins: reducing the number of remote requests to federation members.

 

The implementation also makes use of VALUES clauses to inject a block of bindings into sub-queries. However, compared to regular bind joins, the post-processing is different in order to respect the optional semantics: if the remote source does not provide any bindings for a given sub-query and input binding, the input binding is still returned.
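As an illustrative sketch (again with hypothetical product IRIs; the actual query rewriting performed by the engine may differ), a left bind join for an optional label lookup could send a sub-query like this:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?product ?label WHERE {
  # block of input bindings injected by the federation engine
  VALUES ?product {
    <https://example.com/products/Product-1>
    <https://example.com/products/Product-2>
  }
  # optional lookup: input bindings without a label are still returned
  OPTIONAL { ?product rdfs:label ?label }
}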

 

Further techniques supporting specific use cases

In the following, we’ll focus on use cases of federation technology and on the specific features of the federation that support them.

 

a) Resilient Federation

metaphactory offers an opt-in resilience feature for the transparent FedX federation. This means that if an endpoint is not accessible in the given query execution context, for whatever reason, that endpoint is ignored right at source selection time.

 

This feature can, for instance, be applied in an enterprise data federation, where database access permissions are defined on the database level. Consider an enterprise federation with several databases owned by individual departments and a few shared databases. An employee of a department may be granted permission to see information from their own department’s database, and maybe some additional shared ones. In contrast, the executive management may have access to all information.

 

In such a scenario, the enterprise data federation can be configured with all databases. The resilience feature of the federation determines whether a given federation member is accessible in the current user context, and if not, it is ignored for processing the query. Note that such a use-case requires a corresponding security scheme implemented inside the database as well as impersonation of database requests (i.e., executing database requests on behalf of the user instead of a technical service user).

 

On the application user interface, this means that the user will only see information from accessible data sources (e.g., only information on products, but no customer information).

Resilience can be activated using the fedx:isResilientFederation true setting. Please refer to the section on Advanced Configuration Parameters for parameter configuration. Note that resilience may mean that queries return empty results without giving the user explicit feedback.
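As a minimal sketch, following the fedx: configuration style shown earlier (the exact placement of the setting within the repository configuration may differ):

# enable resilient federation (sketch)
fedx:isResilientFederation true ;
fedx:member [
  fedx:store "ResolvableRepository" ;
  fedx:repositoryName "products"
] ;
...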

 

b) Federated Search

As an abstraction layer, metaphactory offers search services for different databases (e.g., to benefit from full-text indices available in a specific database). Such a service is registered in metaphactory as a companion to the repository (i.e., the database instance) and becomes accessible to the federation through federated search.

 

For the scope of this blog post, we won’t go into details about the configuration of individual search services and refer the interested reader to the Search Service documentation.

 

For federations, metaphactory provides a federated search service. This service performs the search by automatically delegating entire search requests to the corresponding search service for each active federation member. The individual results from all federated search services are then aggregated and returned to the caller. 

 

This means that in a federation we can benefit from database-specific full-text indices and do not have to rely on performing a search through SPARQL over the transparent federation (e.g., using regex filtering). Users not only benefit from better results but, more crucially, from the performance gain.

c) Federated Enterprise Knowledge Graph

As mentioned previously, we often find that enterprises have multiple existing knowledge graphs. One reason for these separate knowledge graphs clearly lies in the ownership of information by individual departments of the organization. Naturally, enterprises aim for an integrated view of their knowledge and, in some cases, go beyond that to augment private data with publicly available information.

 

The federation technology provided by metaphactory can help to build applications operating on a federated enterprise knowledge graph. One of the main use cases is to provide an entry point to the information available in the enterprise: applications on top of metaphactory provide tailored dashboards or search interfaces that allow users to find relevant resources. Those resources can then be explored, giving the user a summary of all available information about the resource from the distributed knowledge graphs. For more in-depth use cases, additional links into metaphactory deployments operating on an individual knowledge graph can be provided.

 

In such application use cases, the individual optimizations and features described in this blog post can be combined to establish an overall good user experience. This ranges from describing the information of each individual knowledge graph with a model, over specific search service configurations that are exposed through the federated search service, to controlling access to information: with resilient federation, metaphactory offers a tool to let users see only the information they are allowed to access on the database level.

 

Summary & outlook

Transparent federation and federated environments are a complex field, where metaphactory provides a solution that enables our customers to build tailored applications for their use cases. Setting up the federation together with the customer and combining the individual optimizations and features in the optimal way can be a longer journey, but it leads to custom applications for unique use cases.

 

While on this journey, we often discover new environment-specific challenges (e.g., due to the distribution of data across the knowledge graphs and the size of the information), which we learn from and from which we derive suitable enhancements to further improve our federation technology.

 

One of the next steps we will be looking into is fair query execution: in one of our use cases, we observed that long-running analytical queries interfere with short-running resource access queries in the federation. We found that we can further fine-tune the schedulers in the federation engine to execute sub-queries in a way that yields higher overall throughput.

 

Try it for yourself

Now that you’ve learned about the history of transparent federation with FedX and how it works with metaphactory, you can try it for yourself! 

 

Speak with one of our experts to discuss your organization and specific use case, and learn how metaphactory can support your journey. Ask for a demo or free 4-week trial of metaphactory.

 

Contact us»

Andreas Schwarte

As a Principal Software Engineer at metaphacts and a specialist in semantic technologies, Linked Data, SPARQL and federated query processing, Andreas leads our software engineering team in developing, documenting, and testing metaphactory to ensure that the platform meets our customers' needs and helps them achieve their business goals.