Automating Biomedical Evidence Synthesis: From Retrieval to Reasoning

Massimiliano Pronesti et al.
TL;DR
  • A three-paper journey from evidence extraction to end-to-end forest plot generation.
  • Each paper builds on the previous: from data and retrieval, to numeric reasoning, to automation.
  • Final goal: enabling scalable, interpretable synthesis for biomedical meta-analyses.

Query-driven Document-level Scientific Evidence Extraction from Biomedical Studies

IBM Research Europe - Ireland, Dublin City University, IT:U University Transformation Austria

Abstract

Extracting scientific evidence from biomedical studies for clinical research questions (e.g., Does stem cell transplantation improve quality of life in patients with medically refractory Crohn's disease compared to placebo?) is a crucial step in synthesising biomedical evidence.

In this paper, we focus on the task of document-level scientific evidence extraction for clinical questions with conflicting evidence. To support this task, we create a dataset called CochraneForest, leveraging forest plots from Cochrane systematic reviews. It comprises 202 annotated forest plots, associated clinical research questions, full texts of studies, and study-specific conclusions. Building on CochraneForest, we propose URCA (Uniform Retrieval Clustered Augmentation), a retrieval-augmented generation framework designed to tackle the unique challenges of evidence extraction. Our experiments show that URCA outperforms the best existing methods by up to 10.3% in F1 score on this task. However, the results also underscore the complexity of CochraneForest, establishing it as a challenging testbed for advancing automated evidence synthesis systems.

The CochraneForest Dataset

The CochraneForest dataset comprises 202 forest plots extracted from 48 Cochrane systematic reviews, covering 263 unique studies and 923 research question-study pairs. Each plot is annotated with a clinical research question, study-level conclusions, and full-text papers associated with each study. To build the dataset, reviews were filtered to include only those with fully accessible study texts and at least two studies with conflicting conclusions. Research questions were generated and refined using LLMs, and study conclusions were labeled based on forest plot confidence intervals. Three annotation tasks ensured consistency across research questions, intervention labels, and conclusions. Inter-annotator agreement showed high semantic consistency, establishing CochraneForest as a reliable and challenging benchmark for evidence synthesis.
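To make the dataset structure concrete, a single CochraneForest instance can be pictured roughly as below. This is a hedged sketch: the field names, identifiers and conclusion labels are illustrative assumptions, not the released schema.

# Illustrative sketch of a single CochraneForest instance (field names,
# identifiers and conclusion labels are assumptions for illustration,
# not the released schema).
example_instance = {
    "research_question": (
        "Does stem cell transplantation improve quality of life in patients "
        "with medically refractory Crohn's disease compared to placebo?"
    ),
    "forest_plot_id": "review-042/plot-3",          # hypothetical identifier
    "studies": [
        {
            "study_id": "Study A",                  # hypothetical study
            "full_text": "...",                     # full paper text
            "conclusion": "no significant effect",  # derived from the plot's CI
        },
        {
            "study_id": "Study B",
            "full_text": "...",
            "conclusion": "favours intervention",
        },
    ],
}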

Method

Figure: Overview of the URCA framework.

Our method is divided into three phases (a minimal code sketch follows the list):

  1. Uniform Retrieval
    Given a clinical question and the associated set of study papers, we retrieve a fixed number of passages from each source to ensure balanced evidence coverage. This prevents over-representation of longer studies and improves the diversity of retrieved content.
  2. Clustering and Knowledge Extraction
    Retrieved passages are embedded, reduced in dimensionality with UMAP, and grouped using Gaussian Mixture Models. For each cluster, a large language model (LLM) extracts query-relevant evidence, discarding unrelated content and highlighting meaningful insights.
  3. Answer Generation
    The extracted evidence passages are fed to the LLM alongside the clinical question to generate the final answer. This stage synthesizes semantically aligned, cluster-aware information into a concise conclusion for each study.
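The sketch below outlines the three phases in code. It is a minimal illustration under assumptions: the retriever and LLM calls are placeholders supplied by the caller, and the embedding model and hyperparameters (passages per study, UMAP dimensions, number of clusters) are not the paper's actual settings.

# Minimal sketch of the URCA pipeline. The retriever and LLM calls are
# placeholders supplied by the caller; the embedding model and the
# hyperparameters below are assumptions, not the paper's settings.
from sentence_transformers import SentenceTransformer
from sklearn.mixture import GaussianMixture
import umap  # umap-learn


def urca(question, studies, retrieve, extract, answer,
         k_per_study=5, umap_dims=5, n_clusters=4):
    # 1. Uniform retrieval: a fixed number of passages from every study,
    #    so longer papers cannot dominate the evidence pool.
    passages = [p for study in studies
                for p in retrieve(question, study, k=k_per_study)]

    # 2. Clustering and knowledge extraction: embed the passages, reduce
    #    dimensionality with UMAP, cluster with a Gaussian Mixture Model,
    #    then let the LLM pull query-relevant evidence out of each cluster.
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(passages)
    reduced = umap.UMAP(n_components=umap_dims).fit_transform(embeddings)
    labels = GaussianMixture(n_components=n_clusters).fit_predict(reduced)
    evidence = [extract(question, [p for p, c in zip(passages, labels) if c == cluster])
                for cluster in set(labels)]

    # 3. Answer generation: synthesise the cluster-level evidence into a
    #    study-specific conclusion.
    return answer(question, evidence)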

Enhancing Study-Level Inference from Clinical Trial Papers via RL-based Numeric Reasoning

IBM Research Europe - Ireland, Dublin City University, IT:U University Transformation Austria

Abstract

Systematic reviews in medicine play a critical role in evidence-based decision-making by aggregating findings from multiple studies. A central bottleneck in automating this process is extracting numeric evidence and determining study-level conclusions for specific outcomes and comparisons. Prior work has framed this problem as a textual inference task by retrieving relevant content fragments and inferring conclusions from them. However, such approaches often rely on shallow textual cues and fail to capture the underlying numeric reasoning behind expert assessments.

In this work, we conceptualise the problem as one of quantitative reasoning. Rather than inferring conclusions from surface text, we extract structured numerical evidence (e.g., event counts or standard deviations) and apply logic informed by domain knowledge to derive outcome-specific conclusions. We develop a numeric reasoning system composed of a numeric data extraction model and an effect estimation component, enabling more accurate and interpretable inference aligned with domain-expert principles. We train the numeric data extraction model using different strategies, including supervised fine-tuning (SFT) and reinforcement learning (RL) with a new value reward model.

When evaluated on the CochraneForest benchmark, our best-performing approach — using RL to train a small-scale number extraction model — yields up to a 21% absolute improvement in F1 score over retrieval-based systems and outperforms general-purpose LLMs of over 400B parameters by up to 9%. Our results demonstrate the promise of reasoning-driven approaches for automating systematic evidence synthesis.

Method

Our method is divided into two stages:

  1. Numeric Evidence Extraction: A compact language model is trained to extract structured numerical data (e.g., means, standard deviations, event counts) from full-text clinical trial papers. We explore several training strategies, including supervised fine-tuning (SFT) and reinforcement learning (RL) with a new value reward model (an illustrative reward sketch follows this list).
  2. Effect Estimation and Inference: Using domain-specific statistical formulas, we compute effect sizes (mean differences or risk ratios) and corresponding 95% confidence intervals. Study-level conclusions are then derived by checking whether the confidence interval supports a statistically significant effect (a worked example appears after the next paragraph).
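As a purely illustrative sketch of what a value-style reward for numeric extraction could look like, the snippet below scores a prediction by the fraction of gold numeric fields it reproduces. The paper's actual value reward model is not described here, so the field names and scoring rule are assumptions.

# Purely illustrative value-style reward for numeric extraction: score a
# prediction by the fraction of gold numeric fields it reproduces. The
# paper's actual value reward model is not shown here; field names and the
# scoring rule are assumptions.
def numeric_extraction_reward(predicted: dict, gold: dict, tol: float = 1e-6) -> float:
    if not gold:
        return 0.0
    matched = sum(
        1 for field, value in gold.items()
        if field in predicted and abs(predicted[field] - value) <= tol
    )
    return matched / len(gold)

# Example: two of three gold fields match, so the reward is 2/3.
reward = numeric_extraction_reward(
    predicted={"events_treatment": 12, "n_treatment": 50, "n_control": 48},
    gold={"events_treatment": 12, "n_treatment": 50, "n_control": 50},
)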

This structured approach allows us to directly generate entries for forest plots, enabling more transparent and automatable evidence synthesis.
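To make the effect-estimation stage concrete, the sketch below applies the standard meta-analysis formulas for a risk ratio and a mean difference with 95% confidence intervals. The exact formulas, corrections and thresholds used in the paper may differ, so treat this as an illustrative approximation.

# Illustrative effect-estimation step using standard meta-analysis formulas
# (the paper's exact formulas, corrections and thresholds may differ).
import math

def risk_ratio_ci(e1, n1, e2, n2, z=1.96):
    """Risk ratio and 95% CI from event counts (e) and group sizes (n)."""
    rr = (e1 / n1) / (e2 / n2)
    se_log_rr = math.sqrt(1 / e1 - 1 / n1 + 1 / e2 - 1 / n2)
    low = math.exp(math.log(rr) - z * se_log_rr)
    high = math.exp(math.log(rr) + z * se_log_rr)
    return rr, low, high

def mean_difference_ci(m1, sd1, n1, m2, sd2, n2, z=1.96):
    """Mean difference and 95% CI from group means, SDs and sizes."""
    md = m1 - m2
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return md, md - z * se, md + z * se

# A study-level conclusion follows from whether the CI excludes the null
# value (1 for risk ratios, 0 for mean differences).
rr, low, high = risk_ratio_ci(e1=12, n1=50, e2=25, n2=50)
significant = not (low <= 1.0 <= high)  # True here: the CI is roughly (0.27, 0.85)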

Key Results

The RL-trained numeric reasoning model achieves substantial improvements:

  1. Up to 21 F1 points higher than retrieval-based baselines on CochraneForest.
  2. Outperforms GPT-4 and other large models by up to 9 F1 points, despite being significantly smaller (7B vs 400B+).
  3. Produces better-structured, traceable, and factually grounded reasoning with fewer hallucinations.
  4. Develops numerical reasoning capabilities that go beyond mere extraction.
Figure: Model performance vs retrieval precision.

AutoForest: Automatically Generating Forest Plots from Biomedical Studies

IBM Research Europe - Ireland, Dublin City University, University of Oxford, IT:U University Transformation Austria


Abstract

Systematic reviews rely on forest plots to synthesise quantitative evidence across biomedical studies, but generating them remains a fragmented and labour-intensive process. Researchers must interpret complex clinical texts, manually extract outcome data from trials, define appropriate interventions and comparators, harmonise inconsistent study designs, and carry out meta-analytic computations—typically using specialised software that demands structured inputs and domain expertise. While recent work has demonstrated that large language models can extract study-level data from unstructured text, no existing system automates the complete pipeline from raw documents to synthesised forest plots.

To address this gap, we introduce AutoForest, the first end-to-end system that generates publication-ready forest plots directly from biomedical papers. Given one or more study papers, AutoForest automatically suggests ICO (Intervention, Comparator, Outcome) elements, extracts outcome data, performs statistical synthesis, and renders the final forest plot—all with minimal human input. We describe the system architecture and user interface, and demonstrate its effectiveness on real-world examples through a user study, showing how AutoForest accelerates evidence synthesis and substantially lowers the barrier to conducting meta-analyses.
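For intuition about the final rendering step, the sketch below draws a basic forest plot from per-study effect estimates using matplotlib. It is not AutoForest's implementation, and the study names and numbers are invented for illustration.

# Minimal illustration of the rendering step only (not AutoForest's
# implementation): draw a simple forest plot from per-study risk ratios
# and 95% CIs with matplotlib. Study names and numbers are invented.
import matplotlib.pyplot as plt

studies = ["Study A", "Study B", "Study C"]
effects = [0.85, 1.10, 0.70]   # point estimates (risk ratios)
ci_low = [0.60, 0.90, 0.50]
ci_high = [1.20, 1.35, 0.98]

fig, ax = plt.subplots(figsize=(5, 2.5))
ys = range(len(studies))
ax.errorbar(effects, ys,
            xerr=[[e - lo for e, lo in zip(effects, ci_low)],
                  [hi - e for e, hi in zip(effects, ci_high)]],
            fmt="s", color="black", capsize=3)
ax.axvline(1.0, linestyle="--", color="grey")  # null effect for risk ratios
ax.set_yticks(list(ys))
ax.set_yticklabels(studies)
ax.set_xlabel("Risk ratio (95% CI)")
ax.invert_yaxis()
plt.tight_layout()
plt.show()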

At ACL and want to see a live demo? Visit the IBM booth on Monday from 10:30 to 11:30 am and from 6:30 to 7:30 pm!

BibTeX

@inproceedings{pronesti-etal-2025-query,
    title = "Query-driven Document-level Scientific Evidence Extraction from Biomedical Studies",
    author = "Pronesti, Massimiliano  and
      Bettencourt-Silva, Joao H  and
      Flanagan, Paul  and
      Pascale, Alessandra  and
      Redmond, Ois{\'i}n  and
      Belz, Anya  and
      Hou, Yufang",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.acl-long.1359/",
    pages = "28034--28051",
    ISBN = "979-8-89176-251-0",
    abstract = "Extracting scientific evidence from biomedical studies for clinical research questions (e.g., Does stem cell transplantation improve quality of life in patients with medically refractory Crohn{'}s disease compared to placebo?) is a crucial step in synthesising biomedical evidence. In this paper, we focus on the task of document-level scientific evidence extraction for clinical questions with conflicting evidence. To support this task, we create a dataset called CochraneForest leveraging forest plots from Cochrane systematic reviews. It comprises 202 annotated forest plots, associated clinical research questions, full texts of studies, and study-specific conclusions. Building on CochraneForest, we propose URCA (Uniform Retrieval Clustered Augmentation), a retrieval-augmented generation framework designed to tackle the unique challenges of evidence extraction. Our experiments show that URCA outperforms the best existing methods by up to 10.3{\%} in F1 score on this task. However, the results also underscore the complexity of CochraneForest, establishing it as a challenging testbed for advancing automated evidence synthesis systems."
}

@article{pronesti2025enhancing,
  title={Enhancing Study-Level Inference from Clinical Trial Papers via RL-based Numeric Reasoning},
  author={Pronesti, Massimiliano and Lorandi, Michela and Flanagan, Paul and Redmond, Ois{\'i}n and Belz, Anya and Hou, Yufang},
  journal={arXiv preprint arXiv:2505.22928},
  year={2025},
  url={https://arxiv.org/abs/2505.22928}
}

@unpublished{pronesti2025autoforest,
  title={AutoForest: Automatically Generating Forest Plots from Biomedical Studies with End-to-End Evidence Extraction and Synthesis},
  author={Pronesti, Massimiliano and Flanagan, Paul and Redmond, Ois{\'i}n and Bettencourt-Silva, Joao and Mannu, Gurdeep S. and Belz, Anya and Hou, Yufang},
  note={Coming soon},
  year={2025}
}