In this article, we’ll discuss how LLMs and Symbolic AI drive precise, comprehensive and contextualized Causal Relationship Search and how it works in the Dimensions Knowledge Graph.
Causal Relationship Search, knowledge graphs and LLMs
As new and improved data annotation and analysis techniques have revolutionized the life sciences industry, a significant challenge remains: how to understand the full implications and relationships between proteins, diseases, compounds etc. mentioned in all scientific literature being produced?
It is impossible for an organization to read all available publications considering the rate at which information is published, much less identify all causal relations that are mentioned from this imposing amount of data. Information is also sometimes represented differently based on each paper, scientist or research group, and is often buried beneath highly technical language, thus adding to the problem.
Previous AI models have emphasized generality so that they apply to as many use cases as possible, and so cannot easily be applied to the hyper-specific language of scientific research. However, using AI to automate or streamline tasks such as the identification of causal relations offers enormous societal and economic value: it can support the development of new life-saving drugs and kickstart groundbreaking research projects.
In this article, we’ll discuss the pressing data challenges pharma and life sciences organizations and researchers are currently facing and how LLMs and knowledge graphs can offer a precise, comprehensive and contextualized solution. Keep reading!
Drowning in data: A problem of too much research
Pharmaceutical companies sometimes have to browse through more than 140 million publications from a single dataset. The sheer number of reports and publications makes it nearly impossible to review everything relevant, yet time and innovation are of the essence in an industry where it typically takes 10-15 years to develop a drug through the necessary stages of research, clinical trials, and approval from authorities.
Even with the AI technologies and search tools available today, searching the literature remains cumbersome and time-consuming; a faster tool that can handle complexity is needed. Such a tool would save time, manpower and money, and would help identify novel relationships within the literature. Scientists are also limited by their own research bias: humans tend to pursue familiar pathways, at the risk of missing connections hidden across large information sets.
Today we must identify new antibiotics in the face of microbial resistance, develop new vaccines for the largest human population in history, and take advantage of the promise of targeted drugs and personalized medicine based on genetic profiles. A robust and accurate solution is needed, and this is where Causal Relationship Search offers significant opportunity.
Causal Relationship Search within the Dimensions Knowledge Graph
In 2024, our team at metaphacts had the honor to introduce the new Dimensions Knowledge Graph, a groundbreaking collaboration between Digital Science and metaphacts. The Dimensions Knowledge Graph, powered by metaphactory, is a large knowledge graph with over 32 billion statements, built on top of a unified semantic model that delivers a layer of trust and explainability for AI algorithms and applications.
As a result of the positive feedback we have received from academic researchers and our customers at pharmaceutical companies, we have refined this new offering and launched a beta version of Causal Relationship Search within the Dimensions Knowledge Graph.
Causal Relationship Search is a search method designed to identify cause-and-effect relationships between agents such as proteins and drugs, in order to gain novel insights and discover previously unseen relations that could support one’s research. By harnessing the power of Large Language Models (LLMs) and Symbolic AI, Causal Relationship Search in the Dimensions Knowledge Graph identifies causal relationships across vast amounts of scientific literature, providing a transformative solution to the challenge of information overload. The image below illustrates the functionality of Causal Relationship Search, showing results for proteins that induce or increase the secretion of Tumor Necrosis Factor-alpha (TNFα).
Image: Relationship search in the Dimensions Knowledge Graph
This high-precision search method is tailored to researchers navigating the complex fields of pharmacology and biology, which are central to drug development. Whether you're exploring drug target identification, biomarker discovery, or compound development, the tool delivers targeted, actionable insights that would be impossible, difficult or too time-consuming to uncover through traditional methods.
Conventional free-text searches often yield thousands of documents, many of which are only tangentially related to the specific question at hand. This deluge of information can create a knowledge gap, as researchers struggle to sift through irrelevant results to find meaningful insights.
Causal Relationship Search eliminates this hurdle by focusing not just on named entities of interest, but also on the specific relationships between them. For example, in the case of protein-protein interactions, the system can identify and categorize connections across eleven distinct biological processes, providing unparalleled clarity and precision.
This tool is designed to exceed the capabilities of traditional search engines in two critical ways:
- Comprehensive Coverage: The search is not restricted to open-access publications, ensuring that users have access to a broader range of relevant scientific content.
- Deep Contextual Insights: Beyond simply delivering lists of relevant references and text snippets, the tool provides a factual overview of entities involved in specific relationships. For example, researchers can instantly view all proteins that increase the phosphorylation of TNF, enabling them to pinpoint crucial connections at a glance.
By transforming how researchers approach the vast body of scientific literature, Causal Relationship Search empowers users to uncover insights with speed and accuracy, advancing the pace of discovery and validation in critical areas of science.
How Causal Relationship Search works
How does the system function? The Dimensions Knowledge Graph has six general relationship types linking the separate entities that the user wishes to investigate. In order of strength, they are: direct, increase, decrease, modulates, relates and potentially relates, where the last two are vague or not necessarily causal. Causal Relationship Search includes multiple biological relations such as binding, transcription and metabolization, as shown in the image below (although the list shown is not exhaustive). These expressions are not possible in most other offerings in the industry.
Large Language Models (LLMs) have become useful in recent years as computing has become more affordable and more powerful. An LLM is a probabilistic approach to artificial intelligence based on a machine’s understanding of statistical relationships between words in text.
Image: Supported causal relationship types in the Dimensions Knowledge Graph
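The ordering of the six general relationship types can be captured in a small enumeration. The following Python sketch is illustrative only: the class name `RelationStrength`, the numeric ranks and the helper `is_clearly_causal` are assumptions made for this example and are not part of the Dimensions Knowledge Graph itself. Only the six names and their order come from the description above.

```python
from enum import IntEnum


class RelationStrength(IntEnum):
    """Six general relationship types; a higher value means a stronger statement.

    The numeric ranks are arbitrary (an assumption); only the order matters.
    """
    POTENTIALLY_RELATES = 1
    RELATES = 2
    MODULATES = 3
    DECREASE = 4
    INCREASE = 5
    DIRECT = 6


def is_clearly_causal(relation: RelationStrength) -> bool:
    """Treat the two weakest types as vague or not necessarily causal."""
    return relation > RelationStrength.RELATES


# List the types from strongest to weakest.
strongest_first = sorted(RelationStrength, reverse=True)
print([r.name.lower() for r in strongest_first])
```

An ordered type like this would let a search filter out the vague "relates" and "potentially relates" statements when only clearly causal evidence is wanted.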
The LLM can assess the strength of the author’s statements because it was trained on sentences annotated by humans, who were tasked with identifying the relationships between the entities in each sentence before it was fed into the LLM. These relationships and entities are organized within an ontology, which serves as the foundational semantic model that makes the data usable within the Dimensions Knowledge Graph. This framework also links the dataset to other integrated data sources, such as OpenTargets and STRINGdb.
Classic search algorithms often struggle because the specific keywords describing a biological process are not explicitly mentioned in the text. Instead, the biological process might be described in a more nuanced way, making keyword-based search approaches ineffective at finding relevant documents. For instance, the LLM accurately identified the relationship “increases secretion” from the text, which a term-based search would have missed: “…and in some systems IL-17 directly stimulates release of TNFα.”
Image: A text snippet taken from a scientific publication as a starting point
Here we see a text snippet taken from a scientific publication as a starting point. This text is then converted into a data structure using symbolic AI (entity recognition) in combination with large language models. The result can be imported directly into the knowledge graph as so-called triples, each consisting of a subject (here, a protein), a predicate (the type of relationship or effect on the object) and an object (again, a protein). In this way, millions of documents are used to create a huge knowledge graph that describes how proteins interact with each other.
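To make the triple structure concrete, here is a minimal Python sketch. It is not the actual data model of the Dimensions Knowledge Graph: the `Triple` class, its field names and the `subjects_for` helper are hypothetical, and the second entry is a made-up placeholder. Only the IL-17 example and its sentence are taken from this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str    # e.g. a protein
    predicate: str  # type of relationship or effect on the object
    obj: str        # e.g. another protein
    evidence: str   # the sentence the statement was extracted from


def subjects_for(triples, predicate, obj):
    """Return all subjects linked to `obj` by `predicate`, in input order."""
    return [t.subject for t in triples if t.predicate == predicate and t.obj == obj]


triples = [
    # Taken from the example sentence quoted in this article.
    Triple("IL-17", "increases secretion", "TNFα",
           "…and in some systems IL-17 directly stimulates release of TNFα."),
    # Made-up placeholder, present only to show that the filter is selective.
    Triple("PLACEHOLDER_PROTEIN", "decreases secretion", "TNFα",
           "placeholder sentence, not from any publication"),
]

print(subjects_for(triples, "increases secretion", "TNFα"))  # ['IL-17']

# A literal keyword match on the source sentence would not have found it:
print("secretion" in triples[0].evidence.lower())  # False
```

Once the relationship is stored as a predicate, the query no longer depends on the wording of the sentence, which is why the structured lookup succeeds where the keyword match fails.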
The first release of Relationship Search supports causal relationship search across three entity types: genes and proteins, diseases, and substances. In total, the system can distinguish more than 200 relationships. We aim to introduce further parameters to expand the number of offered relationships in future iterations.
Managing contradictions
Contradictions provide a valuable metric for evaluating the precision of a system's ability to identify causal relationships. This involves analyzing instances where the system identifies conflicting relationships, such as "A increases B" and simultaneously "A decreases B," which constitute contradictions. While such contradictions can sometimes be contextually accurate, in most cases they arise from errors in the system's interpretation. Incorporating a categorization of the strength of the authors' statements significantly enhanced the system's precision. This approach helped reduce contradictions to below 10% overall, ensuring more reliable identification of causal relationships.
In the image below, you can see that 1,241 publications provide evidence that insulin induces or increases the activity of Protein Kinase B, while only 27 publications suggest that insulin inhibits or decreases its activity. By incorporating a categorization system that evaluates the strength of the authors' statements, we have been able to reduce contradictions to below 10%, as reflected in this data.
Image: The image illustrates how most publications support insulin increasing Protein Kinase B activity, with contradictions reduced to below 10% through a statement-strength categorization system.
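The insulin figures give a feel for the numbers involved. The article does not define exactly how the overall contradiction rate is calculated, so the sketch below makes an assumption: for a single pair of entities, it measures the share of publications that support the less-supported direction. The function name `minority_share` is hypothetical.

```python
def minority_share(supporting: int, opposing: int) -> float:
    """Share of publications backing the less-supported direction."""
    total = supporting + opposing
    if total == 0:
        raise ValueError("no publications for this pair")
    return min(supporting, opposing) / total


# Counts quoted in the article for insulin and Protein Kinase B activity.
increases = 1241
decreases = 27

share = minority_share(increases, decreases)
print(f"{share:.1%}")  # 2.1%
```

Under this assumed definition, the insulin example sits at roughly 2%, well within the below-10% figure quoted above.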
Use cases for Causal Relationship Search
The relationships between proteins/genes, diseases and compounds that we have identified and made searchable in the Dimensions Knowledge Graph are relevant to the core use cases in drug development: drug target and biomarker identification, drug repurposing, toxicity analysis and compound development. Literature is a very important source of information for researchers working on such use cases. With relationship search across full-text scientific documents, they now have a tool that lets them use scientific papers with unprecedented efficiency to validate experimental results or to create new hypotheses. The screenshot below shows an example of using Causal Relationship Search for drug target discovery: users can go from the actual search and identification of relationships to the literature that supports those relationships.
Image: example of using Relationship Search for Drug Target Discovery
Summary
With so much information at their disposal, researchers need to organize and filter the knowledge to suit a defined query. With the Dimensions Knowledge Graph, users can search massive data sets in seconds and obtain results that are more likely to generate novel ideas, based on relationships supported by the literature. Work smarter, not harder!
Causal Relationship Search is at the cutting edge of the AI transformation of life sciences. The amount of data is increasing faster than ever before, and until this past year database and search technology did not have the capabilities to deliver such a solution. We look forward to adding new features in the future.
We had the pleasure of introducing Causal Relationship Search at BioTechX in Basel, Switzerland in October 2024, at our session delivered by Dr. Peter Doerr, Director of Presales at metaphacts. The Q&A discussion following Peter’s talk focused on how to deal with contradictions in scientific literature, the importance of context in understanding interactions in different parts of the human body, and the question of how customers' internal data sets, such as multiomics data, can be mapped to data points in the Dimensions Knowledge Graph. We addressed all of these topics by outlining our approach to managing contradictions, providing users with context and integrating internal data. We thank BioTechX for the opportunity to present at the conference, and appreciate the enthusiasm and engagement from our attendees. If you would like to learn more about Causal Relationship Search and the Dimensions Knowledge Graph, you can try it for yourself.
Try it for yourself
The Dimensions Knowledge Graph is designed to significantly accelerate the process of generating and validating insights, providing a comprehensive and unified dataset to speed up development processes and reduce the risk of failure. By doing so, it tackles some of the most pressing challenges facing the industry today.
Relationship Search is one of the most advanced features of the Dimensions Knowledge Graph, designed to achieve these goals. It even goes beyond scientific publications and can also be applied to internal scientific documents, enabling seamless integration of unstructured internal data.
The currently available version of Relationship Search is in beta release. Interested in requesting a demo?