Biomedical science is a broad, interdisciplinary field that applies biology, medicine, and related sciences to understand, diagnose, treat, and prevent disease. It relies on the availability and quality of biological data, which is being generated and published at an unprecedented rate and scale. However, managing and integrating this data is a challenging, time-consuming task that requires expert knowledge and skills.
Biocurators play a vital role in this field by extracting knowledge from unstructured biological data (most commonly, published literature), structured data (such as existing databases of genes, proteins, pathways, and other biological entities), or primary data, and converting it into an organized, computable form. The goal is to make the data Findable, Accessible, Interoperable, and Reusable (FAIR), in line with the widely adopted guiding principles for scientific data management. In this blog, we review the benefits of biocuration and how our BioMedInsights offering, powered by Generative AI (GenAI), helps researchers make strides in their important and impactful work toward better healthcare outcomes.
The Benefits of Biocuration
Biocuration helps to create reliable, standard, and comprehensive resources to empower researchers in many ways:
- Faster and more efficient literature review: Enables researchers to quickly find relevant information without wasting time sifting through vast amounts of literature
- Enhanced research quality: Facilitates the design and execution of robust research studies by providing easy access to accurate, standardized data, which supports hypothesis generation, experimental design, and drug discovery
- Improved collaboration and knowledge quality: Provides a common ground for researchers to share and access knowledge
- Knowledge discovery and data integration: Allows for standardized data organization with improved data quality
The traditional biocuration method involves manually reading scientific articles, extracting relevant information, annotating the data with controlled vocabularies or ontologies, and storing it in a structured format. This process is prone to human error, inconsistency, and delay, and cannot keep up with the exponential growth of the biomedical literature. Moreover, data curated by different databases may not be easily integrated or shared because common standards and formats are lacking. Together, these factors limit scalability, consistency, and efficiency.
GenAI offers a promising approach to transforming biocuration: it can address the limitations of traditional methods by automating literature screening, enhancing information extraction, and improving efficiency and scalability. Additionally, collaboration with human experts, such as biocurators, researchers, and clinicians, can make the generated annotations interactive, explainable, and verifiable. One framework commonly used to build such GenAI applications is LangChain, an open-source framework for composing large language models (LLMs) with prompts, retrieval components, and external data sources.
The BioMedInsights Solution
Leveraging the power of GenAI, we have developed an in-house solution, BioMedInsights, which helps derive actionable insights from large volumes of textual biomedical data from any source. Pre-clinical and clinical research depends on biomedical literature for experimental trial design and validation. Moreover, when conventional methods are not applicable to a patient’s profile, clinicians refer to biomedical and clinical literature (public and/or in-house) and population studies to explore alternative therapeutic options.
We developed BioMedInsights using Python, Chainlit, Azure/OpenAI, and GPT-3.5-Turbo-instruct. The BioMedInsights workflow takes keyword-based queries from users and fetches the associated abstracts from PubMed (Figure 1). To give a quick overview of the themes in the results, it performs topic modeling on the abstracts, clustering them by the similarity of their recurring words and labeling each topic with its most representative words. Each abstract is assigned a probability score that represents its association with a topic, and users can select topics of interest based on the representative keywords and scores. The solution’s GenAI-powered virtual assistant, which is based on Retrieval-Augmented Generation (RAG), lets users interact with the data easily and derive insights faster, with answers supported by appropriate references.
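To make the topic-scoring step more concrete, here is a minimal sketch in plain Python. It is not the BioMedInsights implementation: the post does not name the topic model used, so this toy version stands in for it by scoring each abstract against hand-written keyword sets. The topic names, keywords, and the `topic_probabilities` and `select_abstracts` helpers are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical topics, each described by a few representative keywords.
# A real topic model would learn these from the abstracts themselves.
TOPICS = {
    "drug_resistance": {"resistance", "resistant", "mutation", "t790m"},
    "chemotherapy": {"chemotherapy", "cisplatin", "pemetrexed", "docetaxel"},
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def topic_probabilities(abstract, topics=TOPICS):
    """Return {topic: score}; scores sum to 1, or are all 0.0 if no keyword matches."""
    counts = Counter(tokenize(abstract))
    raw = {name: sum(counts[w] for w in words) for name, words in topics.items()}
    total = sum(raw.values())
    if total == 0:
        return {name: 0.0 for name in topics}
    return {name: value / total for name, value in raw.items()}


def select_abstracts(abstracts, topic, threshold=0.5):
    """Keep the abstracts whose score for `topic` meets the threshold."""
    return [a for a in abstracts if topic_probabilities(a)[topic] >= threshold]
```

For example, calling `select_abstracts(abstracts, "chemotherapy")` on a list of abstract strings keeps only those dominated by the chemotherapy keywords, which mirrors how a user might narrow the results to topics of interest.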
BioMedInsights in Action
In one representative example, we demonstrate evidence extraction using BioMedInsights to identify suitable drugs for a patient with lung cancer who has no known actionable biomarkers.
We searched PubMed for articles associated with the biomarker gene EGFR in lung cancer and retrieved more than 9,000 abstracts, which were pre-processed and used to build the topic model. After optimizing the model for performance, we grouped the articles into 147 thematic clusters, or topics. The model's visualizations help users get an overview of the topics and select those of interest for further exploration.
For this example, we looked at topics related to EGFR-negative status, chemotherapy drugs, and drug resistance to understand the therapeutic options available for the patient. The selected topics contain approximately 80 articles, whose full text was used as the backend data for our RAG-based virtual assistant. Our solution provides quick data summaries and succinct answers to a user’s query (Figure 2), supported by references and with minimal hallucinations. It can also be used to expedite medical writing through document summarization.
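As an illustration of how a keyword query like the one above could be sent to PubMed, the sketch below builds a search URL for the public NCBI E-utilities ESearch endpoint. The post does not say how BioMedInsights queries PubMed, so the `build_pubmed_search_url` helper, the way keywords are joined with `AND`, and the default `retmax` value are assumptions. ESearch returns PubMed IDs only; fetching the abstracts themselves would need a follow-up request, which is omitted here.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(keywords, retmax=100):
    """Build an ESearch URL that looks up PubMed IDs matching all keywords.

    Joining keywords with AND is an assumption made for this sketch.
    """
    term = " AND ".join(keywords)
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Only constructs the URL; no network request is made here.
url = build_pubmed_search_url(["EGFR", "lung cancer"])
```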
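The retrieval half of a RAG assistant like this can be sketched in a few lines. The version below is a toy: it ranks passages by bag-of-words cosine similarity instead of the learned embeddings a production system would likely use, and it stops at building the prompt rather than calling a language model. The `retrieve` and `build_prompt` helpers, the prompt wording, and the reference IDs are all hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphanumeric tokens in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, passages, k=2):
    """Return the k (ref_id, text) passages most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(
        passages, key=lambda p: cosine(q, bag_of_words(p[1])), reverse=True
    )
    return ranked[:k]


def build_prompt(query, passages, k=2):
    """Assemble a prompt that asks the model to answer from cited context only."""
    hits = retrieve(query, passages, k)
    context = "\n".join(f"[{ref}] {text}" for ref, text in hits)
    return (
        "Answer using only the context below and cite the reference IDs.\n"
        f"Context:\n{context}\n"
        f"Question: {query}"
    )
```

Keeping the reference ID attached to each retrieved passage is what allows the final answer to point back to its sources, and restricting the model to the supplied context is one common way to reduce hallucinations.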
The solution can be enriched with more varied knowledge sources, such as clinical trial databases, systematic reviews, or expert annotations, to make the results more comprehensive and reliable. Additionally, handling longer and more complex documents, including their different sections, tables, figures, and citations, will add value to document summarization for actionable insights. BioMedInsights streamlines the derivation of actionable outcomes from biomedical and clinical literature mining and can have a significant impact on healthcare text mining, report generation, and medical document drafting. To learn more about BioMedInsights and our healthcare and life sciences offerings, reach out to us.