Knowledge graphs have emerged as a powerful tool in the field of life sciences, revolutionizing drug development, literature review, and regulatory reporting. With an overwhelming amount of information to navigate, researchers can now leverage knowledge graphs to quickly identify relevant insights.
By connecting disparate data sources and visualizing relationships, knowledge graphs enable researchers to integrate data from multiple domains and gain a comprehensive understanding of diseases, genes, and treatments and how they affect patients. Leading companies in the industry are leveraging knowledge graphs to accelerate drug discovery, identify potential drug candidates, and optimize their design.
One key advantage of knowledge graphs is their ability to integrate structured and unstructured data, allowing researchers to uncover hidden patterns and relationships. Layering different types of evidence to support relationships between entities of the same type (e.g. protein-protein interactions) or different types (e.g. protein-drug interactions) provides a rich resource for finding indirect connections. That said, building a useful knowledge graph can require significant investment in time, resources, and expertise in data modeling and graph database technology.
In this post, our Data Scientists and Data Engineers explore use cases and best practices for companies looking to maintain their competitive edge by fully leveraging knowledge graphs.
Developing and running clinical trials requires the coordination of many types of disparate data. These include biological, chemical, and pharmacological information about the drug and information about the indication being tested. While the trial is being run, enrollment, progress of the trial, and specific results must be tracked. And after the trial is completed, updates to the status of the drug from regulatory bodies, or reports from patients taking the drug (such as those tracked in the FAERS database) must also be managed. The relationships between these different types of information are easily modeled in a graph database.
Not only are these relationships easy to represent, but having them in a graph database also makes them much easier to query for deeper insights. As a specific example, one goal of a clinical trial might be to improve patient outcomes with fewer adverse events or less severe adverse events than existing treatments. It would be easy to query adverse events reported for existing drugs for the same indication to determine the benchmarks that need to be met in the new trial.
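As a rough illustration of the benchmark query described above, here is a minimal sketch in plain Python. It assumes a toy in-memory edge list rather than a real graph database, and every drug name, indication, adverse event, and rate in it is a hypothetical placeholder. The relation names `TREATS` and `HAS_ADVERSE_EVENT` are also assumptions, not a standard schema.

```python
from collections import defaultdict

# Toy graph as (subject, relation, object, properties) tuples.
# All names and rates below are hypothetical placeholders, not real data.
EDGES = [
    ("DrugA", "TREATS", "IndicationX", {}),
    ("DrugB", "TREATS", "IndicationX", {}),
    ("DrugC", "TREATS", "IndicationY", {}),
    ("DrugA", "HAS_ADVERSE_EVENT", "nausea", {"rate": 0.12}),
    ("DrugA", "HAS_ADVERSE_EVENT", "headache", {"rate": 0.05}),
    ("DrugB", "HAS_ADVERSE_EVENT", "nausea", {"rate": 0.08}),
    ("DrugC", "HAS_ADVERSE_EVENT", "nausea", {"rate": 0.30}),
]


def adverse_event_benchmarks(edges, indication):
    """Return the lowest reported rate per adverse event among
    the drugs that treat the given indication."""
    # Hop 1: find the existing drugs for this indication.
    drugs = {s for s, rel, o, _ in edges if rel == "TREATS" and o == indication}
    # Hop 2: collect adverse event rates for those drugs only.
    rates = defaultdict(list)
    for s, rel, o, props in edges:
        if rel == "HAS_ADVERSE_EVENT" and s in drugs:
            rates[o].append(props["rate"])
    # The best existing rate is the benchmark a new trial would need to beat.
    return {event: min(values) for event, values in rates.items()}


benchmarks = adverse_event_benchmarks(EDGES, "IndicationX")
print(benchmarks)
```

In a graph database the same lookup would typically be expressed as a single two-hop query; the point of the sketch is only that the benchmark falls out of following two relationships.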
Building knowledge graphs that include drugs, diseases or conditions, and targets helps to surface hidden connections between these different types of entities. Being able to query how a target is associated with specific diseases can highlight a previously overlooked use case for a drug that is currently under development. As an example, many drugs have been repositioned after observation of specific side effects. Many readers may have heard of semaglutide injectables, which were originally used to treat diabetes and are now widely approved for weight loss, a prominent side effect observed in the original application. Similar repositioning findings could be made more quickly with appropriately constructed knowledge graphs.
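The kind of drug-to-target-to-disease lookup described above can be sketched as follows. This is an illustrative toy, assuming three small dictionaries in place of a real graph; all drug, target, and disease names are hypothetical placeholders and do not represent real findings.

```python
# Hypothetical placeholder entities; none of these are real findings.
TARGETS = {"DrugA": {"Target1", "Target2"}}      # drug -> targets it acts on
ASSOCIATIONS = {                                  # target -> associated diseases
    "Target1": {"DiseaseX"},
    "Target2": {"DiseaseX", "DiseaseY"},
}
INDICATIONS = {"DrugA": {"DiseaseX"}}             # drug -> current indications


def repositioning_candidates(drug):
    """Map each disease that is linked to one of the drug's targets,
    but is not a current indication, to the targets providing the link."""
    current = INDICATIONS.get(drug, set())
    candidates = {}
    for target in sorted(TARGETS.get(drug, ())):
        for disease in sorted(ASSOCIATIONS.get(target, ())):
            if disease not in current:
                candidates.setdefault(disease, []).append(target)
    return candidates


candidates = repositioning_candidates("DrugA")
print(candidates)
```

Keeping the linking targets alongside each candidate disease preserves the evidence trail, which matters when a candidate is handed to a scientist for review.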
A key benefit of using biological networks to identify drivers of biological phenotypes is having a tool to manage all the known pathways by which a phenotype can be reached. A classic example is tracking cell metabolism, which behaves one way under normal circumstances but can be dysregulated in disease phenotypes. Now that knowledge graphs are becoming sufficiently detailed, they can readily recapitulate findings that took years to determine in the lab. Traditional graph theory and biological techniques (such as flux analyses) or newer machine-learning-assisted methods can help uncover changes such as the metabolic shifts that take place in cancer cells (e.g. the Warburg effect). Explore the potential of knowledge graphs in deciphering complex biological relationships and consider utilizing internal knowledge assets to enhance analysis and generate actionable insights.
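To make the idea of managing all known pathways to a phenotype concrete, the sketch below enumerates every cycle-free path between two nodes with a depth-first search. It is a simple graph-theory illustration only, not a flux analysis, and the node names are hypothetical placeholders rather than a real metabolic network.

```python
# Hypothetical directed graph: node -> downstream nodes.
PATHWAYS = {
    "SignalA": ["EnzymeB", "EnzymeC"],
    "EnzymeB": ["MetaboliteD"],
    "EnzymeC": ["MetaboliteD", "PhenotypeZ"],
    "MetaboliteD": ["PhenotypeZ"],
}


def all_paths(graph, start, goal, path=None):
    """Yield every simple (cycle-free) path from start to goal."""
    path = (path or []) + [start]
    if start == goal:
        yield path
        return
    for nxt in graph.get(start, []):
        if nxt not in path:  # skip nodes already visited on this path
            yield from all_paths(graph, nxt, goal, path)


paths = list(all_paths(PATHWAYS, "SignalA", "PhenotypeZ"))
for p in paths:
    print(" -> ".join(p))
```

Exhaustive enumeration like this is only practical on small subgraphs; on a full biological network a graph database or a bounded path length would be the more realistic choice.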
When combining data from multiple sources, whether they are public or proprietary, avoiding duplication and properly mapping entities in the graph is important. In the life sciences this is particularly difficult because entities of a given type (e.g. genes) often have multiple valid names (e.g. SHH and Sonic Hedgehog Signaling Molecule). Identifying and harmonizing these synonyms will have a significant impact on the usefulness of the graph, especially if it is used in applications such as natural language processing where differentiating between different contexts is essential.
Recommendation: When initially structuring your knowledge graph, you should define a taxonomy or ontology that you will use as your standard. Ideally, start from a foundation of a widely-used, publicly-available ontology, such as MeSH. Before attempting to integrate a new knowledge source into a knowledge graph, survey the types of entities it will add and the nomenclature used to describe those entities. Choose how you will handle mapping outliers. Always keep the purposes of your knowledge graph in mind.
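A minimal sketch of this harmonization step might look like the following. It assumes a hand-written synonym table and simple case and whitespace normalization; the identifier `GENE:0001` is a hypothetical placeholder, not a real ontology ID, and a production pipeline would draw its synonyms from the chosen ontology instead.

```python
# Hypothetical synonym table; "GENE:0001" is a placeholder, not a real identifier.
SYNONYMS = {
    "GENE:0001": ["SHH", "Sonic Hedgehog Signaling Molecule"],
}


def normalize(name):
    """Lowercase and collapse whitespace so trivial variants match."""
    return " ".join(name.lower().split())


# Reverse index: normalized alias -> canonical identifier.
LOOKUP = {
    normalize(alias): canonical_id
    for canonical_id, aliases in SYNONYMS.items()
    for alias in aliases
}


def harmonize(names):
    """Split incoming names into mapped entities and mapping outliers."""
    mapped, unmapped = {}, []
    for name in names:
        canonical_id = LOOKUP.get(normalize(name))
        if canonical_id is None:
            unmapped.append(name)  # outlier: route to manual review
        else:
            mapped[name] = canonical_id
    return mapped, unmapped


mapped, unmapped = harmonize(
    ["shh", "Sonic  hedgehog signaling molecule", "UnknownGene"]
)
print(mapped)
print(unmapped)
```

Returning the outliers as a separate list, rather than silently dropping them or creating new nodes, is one way to make the "how will you handle mapping outliers" decision explicit.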
When creating a knowledge graph, there are often multiple ways of representing the same information. While all of them may be valid data models, the choice can drastically influence performance, extensibility, and interoperability down the road. The data model should be designed with these factors in mind. One widely applied standard that guides knowledge graph creation is the set of FAIR principles. The acronym stands for Findable, Accessible, Interoperable, Reusable. Following these principles helps ensure the quality and compatibility of data within knowledge graphs. In applying them, it is crucial to establish context-specific minimum information standards. These standards define the level of information and metadata annotation required for effective analysis. Although the specific standards may vary depending on the research question, adhering to them helps maintain data integrity and facilitates reliable analysis.
Recommendation: When defining a data model, start by defining the types of questions you need the knowledge graph to be able to answer. Then, define a data model capable of answering those questions, making sure you have the data to support any nodes, relationships, or attributes you define. Ensure that each type of node and edge includes the minimum set of attributes required to make it useful. Research and analysis can be hampered if nodes or relationships in a knowledge graph do not have enough real-world evidence to support the work users are trying to do.
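One way to enforce a minimum attribute set is to check each node or edge record against a small schema before loading it. The sketch below assumes records are plain dictionaries with a `type` key; the type names and required attributes are hypothetical examples, not a recommended standard.

```python
# Hypothetical minimum-information schema: record type -> required attributes.
REQUIRED = {
    "Drug": {"id", "name", "source"},
    "Disease": {"id", "name", "ontology_term"},
    "TREATS": {"source", "evidence"},  # an edge type
}


def missing_attributes(record):
    """Return the required attributes that are absent or empty on a record.

    Record types with no schema entry are treated as having no requirements.
    """
    required = REQUIRED.get(record.get("type"), set())
    return sorted(attr for attr in required if record.get(attr) in (None, ""))


incomplete_drug = {"type": "Drug", "id": "D1", "name": "DrugA"}
complete_edge = {"type": "TREATS", "source": "trial-registry", "evidence": "phase 3"}

print(missing_attributes(incomplete_drug))
print(missing_attributes(complete_edge))
```

Running a check like this at import time surfaces under-annotated records early, before users discover the gaps during analysis.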
When knowledge graphs generate recommendations or insights that impact patient care, it is essential to validate the findings and comply with relevant regulations, guidelines, ethics, and privacy requirements. Rigorous validation processes and adherence to data protection regulations, such as HIPAA or GDPR, ensure that knowledge graphs contribute to responsible and ethical decision-making. If any regulations apply to your data, make sure the deployment of your knowledge graph meets the minimum requirements to protect that data. This will likely require you to protect your knowledge graph from outside users, and may also require you to restrict access within your own organization so that only those who should be able to view certain sensitive data can do so.
Recommendation: To determine what security standards your knowledge graph deployment must meet, check whether any of the data provide information about patients or other protected populations. To ensure the data is being used appropriately, define queries with known answers to check that the data has been imported into the knowledge graph correctly, and that the connections between the data do not misrepresent it.
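The known-answer queries mentioned above can be kept as a small regression suite that runs after every import. The sketch below assumes the graph is a set of triples and uses a hypothetical `objects` helper as the query function; all entities are placeholders.

```python
# Hypothetical imported triples: (subject, relation, object).
TRIPLES = {
    ("DrugA", "TREATS", "DiseaseX"),
    ("DrugA", "TARGETS", "Target1"),
    ("Target1", "ASSOCIATED_WITH", "DiseaseX"),
}


def objects(subject, relation):
    """All objects connected to `subject` by `relation`."""
    return {o for s, r, o in TRIPLES if s == subject and r == relation}


# Each check: (description, query to run, answer known in advance).
KNOWN_ANSWERS = [
    ("DrugA treats DiseaseX", lambda: objects("DrugA", "TREATS"), {"DiseaseX"}),
    ("DrugA targets Target1", lambda: objects("DrugA", "TARGETS"), {"Target1"}),
    # Negative check: a target should never appear as the subject of TREATS.
    ("Target1 has no TREATS edges", lambda: objects("Target1", "TREATS"), set()),
]


def run_checks(checks):
    """Return descriptions of checks whose result differs from the expected answer."""
    return [desc for desc, query, expected in checks if query() != expected]


failures = run_checks(KNOWN_ANSWERS)
print("failed checks:", failures)
```

Including negative checks, such as the last one, helps catch imports that connect data in ways that misrepresent it, not just imports that drop data.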
The rise of knowledge graphs in life sciences is ushering in a new era of accelerated discoveries and enhanced patient care. Knowledge graphs are also empowering the current boom in the use of large language models (LLMs). They provide a very useful source for LLMs to draw on, especially in the case of retrieval-augmented generation (RAG), which provides transparency into the sources the LLM used in generating an answer. With proper data modeling, LLM tuning, and prompting, an LLM can even write the queries needed to help answer users' questions. By harnessing the power of knowledge graphs, researchers can navigate complex biomedical landscapes more efficiently, identify viable drug targets, and make informed clinical decisions.