Scientific papers are highly heterogeneous, often employing diverse terminologies for the same entities, using varied methodologies to study biological phenomena, and presenting findings within distinct contexts. Extracting meaningful insights from these papers requires a profound understanding of biology, a critical evaluation of methodologies, and the ability to discern robust findings from irrelevant or less reliable ones.
Scientists must carefully interpret the context, assess the reliability of experimental evidence, and identify potential biases or limitations in studies. Given the high-precision demands of critical decision-making in disease modeling, it is imperative that only high-quality knowledge is incorporated into the biological findings.
Large language models (LLMs), when integrated into a retrieval-augmented generation (RAG) pipeline, present a game-changing opportunity to automate and expedite the curation of biological findings. By optimizing the extraction of insights from scientific papers, LLMs dramatically improve the scalability of this process. These models can sift through far more papers than any individual could manually review and uncover a significantly larger volume of relevant findings.
The team at CytoReason, a member of the NVIDIA Inception program, develops computational disease models, harnessing AI to mine vast amounts of molecular and textual data to support biopharma’s decision-making. By capturing mechanisms of action (MOAs), gene regulation, patient responses, and more, these models can simulate human diseases at the tissue, cellular, and gene levels.
This enables researchers to predict disease progression, evaluate treatment responses, prioritize biological targets, and identify relevant patient subpopulations. One of the analyses in the CytoReason computational disease models is based on biological findings reported in the literature. Manually mining the growing number of scientific papers requires a deep understanding of biology and considerable time.
This post introduces CytoReason’s method for expediting the curation process of biological insights from literature.
RAG pipeline powered by NVIDIA NIM
The CytoReason team developed a RAG pipeline powered by NVIDIA NIM microservices to scale up the mining of biological findings integrated into CytoReason’s computational disease models. Figure 1 illustrates the flow.

The output of the pipeline is a list of biological evidence extracted from literature. This evidence is aggregated across entity types and conditions, providing a comprehensive summary that offers valuable insights into disease biology. An example of an output supporting the increased expression of the IL6 gene in ileal Crohn’s disease patients is shown in Figure 2.
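For illustration only, one aggregated evidence entry might take a shape like the following sketch; the field names and placeholder values are hypothetical and not actual pipeline output:

```python
# Hypothetical shape of one aggregated evidence entry (placeholder values only).
evidence_entry = {
    "entity_type": "gene",
    "entity": "IL6",
    "disease": "Crohn's disease",
    "tissue": "ileum",
    "comparison": "inflamed vs. healthy",
    "direction": "increased",
    "proofs": [
        {
            "sentence": "<supporting sentence extracted from the paper>",
            "source": "<DOI or URL of the paper>",
        }
    ],
}
```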
NVIDIA LLM NIM microservices, such as Mistral 12B Instruct, are remarkably easy to use, enabling seamless integration into this pipeline. By using NIM microservices, the team achieved high throughput, reducing the time immunologists spend constructing such a list from days to just hours, while also achieving higher coverage.
Structured input
The RAG pipeline begins with a structured input designed to meet the needs of the team’s biologists. This input is defined by four key parameters: entity type (gene, pathway, or cell type, for example), disease, tissue, and conditions. For example, an input might involve retrieving literature evidence to support changes in gene expression linked to Crohn’s disease in ileum tissue, comparing healthy versus inflamed conditions.
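As a minimal sketch, the structured input can be represented as a simple record like the one below; the class and field names are illustrative assumptions, not CytoReason’s actual schema:

```python
from dataclasses import dataclass

@dataclass
class CurationRequest:
    """Hypothetical structured input for the curation pipeline."""
    entity_type: str   # e.g., "gene", "pathway", or "cell type"
    disease: str       # e.g., "Crohn's disease"
    tissue: str        # e.g., "ileum"
    conditions: tuple  # the two conditions to compare

# Example: gene-expression evidence for Crohn's disease in the ileum,
# comparing healthy versus inflamed tissue.
request = CurationRequest(
    entity_type="gene",
    disease="Crohn's disease",
    tissue="ileum",
    conditions=("healthy", "inflamed"),
)
```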
Retrieval engine
The retrieval module is responsible for querying databases, such as Google Scholar, PubMed, or other scientific repositories, to fetch relevant papers based on the input. To increase the potential for diverse findings, the retrieval engine processes dozens of queries compiled from the same input. The scientific papers retrieved from these queries are then consolidated into a single unified set. Each paper is stored with detailed metadata, including the title, authors, publication date, abstract, Google snippet, journal or source, and DOI/URL.
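A minimal sketch of this step, assuming PubMed’s public E-utilities API as the source and reusing the hypothetical `CurationRequest` record from the previous sketch, could look like the following; the query templates and helper names are illustrative, not the production implementation:

```python
import requests

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_queries(request):
    """Expand one structured input into several search queries (illustrative templates)."""
    base = f'"{request.disease}" AND "{request.tissue}"'
    return [
        f'{base} AND "gene expression"',
        f'{base} AND "{request.conditions[0]}" AND "{request.conditions[1]}"',
        f'{base} AND "differential expression"',
    ]

def retrieve_papers(request, per_query=50):
    """Run every query against PubMed and consolidate the results into one unified set."""
    seen, papers = set(), []
    for query in build_queries(request):
        resp = requests.get(
            ESEARCH_URL,
            params={"db": "pubmed", "term": query, "retmax": per_query, "retmode": "json"},
            timeout=30,
        )
        for pmid in resp.json()["esearchresult"]["idlist"]:
            if pmid not in seen:  # deduplicate papers returned by multiple queries
                seen.add(pmid)
                papers.append({"pmid": pmid, "query": query})
    return papers  # metadata (title, abstract, DOI, ...) can then be fetched via esummary/efetch
```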
Biological guardrails
After the retrieval component compiles a repository of papers and associated metadata, a guardrails process using the Mistral 12B Instruct NIM is applied to refine the collection into a highly specific and relevant set of papers. This step is guided by a prompt consisting of the following three criteria:
- Human sample-based studies: Excluding papers that are based only on nonhuman samples, such as animal models or in vitro studies.
- Relevance to disease and tissue: Ensuring the papers focus on the specific disease and tissue of interest. For example, a single paper might include data on multiple IBD conditions affecting different locations in the bowel. This step ensures that the specific condition and tissue (Crohn’s disease in the ileum, for example) fall within the paper’s scope.
- Presence of comparison conditions: Comparative studies are essential for deriving meaningful insights, such as identifying differential gene expression or discovering biomarkers. Papers lacking clear information on comparison conditions, such as ‘diseased versus healthy’ or ‘treated versus untreated,’ are excluded, as they are less likely to align with the analytical objectives.
In addition, the prompt includes elements such as instructions, few-shot examples, guided solution steps (chain of thought), questions, and requirements for high-confidence results.
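Because NIM microservices expose an OpenAI-compatible API, a guardrails check of this kind can be sketched as one chat-completion call per paper. The endpoint URL, model identifier, and prompt below are simplified placeholders rather than the production prompt:

```python
from openai import OpenAI

# A deployed NIM exposes an OpenAI-compatible endpoint (URL and model id are placeholders).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-for-local-nim")

GUARDRAILS_PROMPT = """You are screening scientific papers for a curation pipeline.
Given the title and abstract below, answer YES only if ALL criteria hold:
1. The study uses human samples (not only animal models or in vitro work).
2. The paper covers {disease} in {tissue}.
3. The paper compares conditions such as {cond_a} versus {cond_b}.
Answer with YES or NO only.

Title: {title}
Abstract: {abstract}"""

def passes_guardrails(paper, request):
    """Ask the LLM whether a paper satisfies the three inclusion criteria."""
    prompt = GUARDRAILS_PROMPT.format(
        disease=request.disease,
        tissue=request.tissue,
        cond_a=request.conditions[0],
        cond_b=request.conditions[1],
        title=paper["title"],
        abstract=paper["abstract"],
    )
    response = client.chat.completions.create(
        model="mistral-nemo-12b-instruct",  # placeholder model id
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```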
Biological proof extraction
During this stage, the scientific content of each remaining paper is processed, chunk by chunk. For each chunk, an NVIDIA LLM NIM is employed to extract evidence about the entities of interest in relation to disease, tissue, and conditions. The prompt provided to the LLM is carefully designed, similar to the approach used in the biological guardrails stage.
The extracted information is organized in a structured format (for example, JSON), facilitating efficient downstream processing and analysis. Finally, the output includes proofs with links to the source papers, as depicted in Figure 2. Genes are classified based on the change in their expression (increased, decreased, unchanged, or unknown) between two conditions (for example, disease versus healthy); Figure 2 shows literature evidence supporting the increased expression of the IL6 gene in ileal Crohn’s disease patients.
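As a simplified sketch, this extraction step can iterate over text chunks and ask the model for one JSON record per gene. The prompt, JSON schema, and endpoint below are illustrative assumptions, not the actual pipeline code:

```python
import json
from openai import OpenAI

# Same placeholder OpenAI-compatible NIM endpoint as in the guardrails sketch.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-for-local-nim")

EXTRACTION_PROMPT = """From the text below, extract evidence about gene expression changes
in {disease} ({tissue}), comparing {cond_a} versus {cond_b}.
Return a JSON list with one object per gene, using the keys:
"gene", "direction" (increased / decreased / unchanged / unknown), "evidence_sentence".
Return [] if there is no relevant evidence.

Text: {chunk}"""

def extract_evidence(chunks, request, paper_url):
    """Run the extraction prompt over every chunk and collect structured proofs."""
    proofs = []
    for chunk in chunks:
        prompt = EXTRACTION_PROMPT.format(
            disease=request.disease,
            tissue=request.tissue,
            cond_a=request.conditions[0],
            cond_b=request.conditions[1],
            chunk=chunk,
        )
        response = client.chat.completions.create(
            model="mistral-nemo-12b-instruct",  # placeholder model id
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
        )
        try:
            records = json.loads(response.choices[0].message.content)
        except json.JSONDecodeError:
            continue  # skip malformed model output
        for record in records:
            record["source"] = paper_url  # link each proof back to its paper
            proofs.append(record)
    return proofs
```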

Results
The team evaluated the RAG pipeline using a benchmark focused on gene expression in Crohn’s disease in the ileum. In a manual curation process that took an immunologist several days, a total of 101 genes were identified as differentially expressed (either upregulated or downregulated) between healthy and inflamed conditions.
The RAG pipeline extracted information about 99 genes in a matter of minutes, 70 of which overlapped with those identified through manual curation. The remaining 29 genes were new discoveries and were subsequently validated for accuracy by an expert. The evidence produced by the pipeline for all genes was accurate in 96% of the cases.
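Scoring such a benchmark reduces to simple set arithmetic over gene symbols; a minimal sketch under assumed file names and a hypothetical helper:

```python
def load_gene_set(path):
    """Read one gene symbol per line into a set (illustrative helper)."""
    with open(path) as f:
        return {line.strip().upper() for line in f if line.strip()}

manual_genes = load_gene_set("manual_curation.txt")    # e.g., the 101 manually curated genes
pipeline_genes = load_gene_set("pipeline_output.txt")  # e.g., the 99 genes from the pipeline

overlap = pipeline_genes & manual_genes         # genes found by both (70 in this benchmark)
new_candidates = pipeline_genes - manual_genes  # pipeline-only genes sent for expert review (29)
print(f"overlap: {len(overlap)}, new candidates: {len(new_candidates)}")
```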
Notably, the pipeline successfully identified 13 out of 14 hallmark genes with a substantial number of evidence sentences for each. This highlights its ability to extract critical information with high accuracy, as hallmark genes are strongly associated with a particular disease and frequently discussed in scientific literature.
Summary
Mining biological insights from literature is a complex task that traditionally takes days and requires deep expertise in biology. By leveraging NVIDIA NIM and LLM technology, CytoReason has significantly reduced the time required for this process, from days to just a few hours. The results above show that the precision of these insights is remarkably high, with even greater coverage of biological entities than manual curation by human scientists achieves.
To get started with NVIDIA NIM, visit NVIDIA NIM for Developers.
Acknowledgments
We would like to thank NVIDIA for their professional, patient, and welcoming support throughout this project. We are also grateful to our colleagues at CytoReason who contributed their time and expertise. A special thanks to Greg Minevich, Shimon Sheiba, Inbal Beracha, Dan Aizik, Jonatan Enk, Elina Starosvetsky, Zeev Benshachar, Yoav Schumacher, and Ronen Schuster for their critical role in designing, implementing, and reviewing the technology discussed in this post. Their insights and feedback were invaluable in shaping both the development process and the content.