mTREE: Multi-level Text-guided Representation End-to-end Learning for Whole Slide Image Analysis

Quan Liu; Ruining Deng; Can Cui; Tianyuan Yao; Yuechen Yang; Vishwesh Nath; Bingshan Li; You Chen; Yucheng Tang; Yuankai Huo

doi:10.2352/EI.2025.37.12.HPCI-183

Abstract

Multi-modal learning integrates visual and textual data effectively, but its application to histopathology image and text analysis remains challenging, particularly for large, high-resolution images such as gigapixel Whole Slide Images (WSIs). Current methods typically rely on manual region labeling or multi-stage learning to assemble local representations (e.g., patch-level) into global features (e.g., slide-level). However, there is no effective way to integrate multi-scale image representations with text data in a seamless end-to-end process. In this study, we introduce Multi-Level Text-Guided Representation End-to-End Learning (mTREE), a text-guided approach that captures multi-scale WSI representations by using the accompanying textual pathology information. mTREE combines the localization of key areas (“global-to-local”) and the development of a WSI-level image-text representation (“local-to-global”) into a unified, end-to-end learning framework. In this model, textual information serves a dual purpose: first, it functions as an attention map to identify key areas; second, it acts as a conduit for integrating textual features into the comprehensive representation of the image. We demonstrate the effectiveness of mTREE through quantitative analyses on two image-related tasks, classification and survival prediction, where it outperforms the baselines. Code and trained models are available at https://github.com/hrlblab/mTREE.
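
The dual role of text described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). It assumes that patch and text embeddings share one dimension, that attention is a scaled dot product followed by a softmax, and that fusion is plain concatenation. The function name `text_guided_pool` and the parameter `top_k` are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def text_guided_pool(patch_embs, text_emb, top_k):
    """Hypothetical sketch: text attention selects patches, then pools them."""
    d = len(text_emb)
    # Assumption: scaled dot-product similarity between each patch and the text.
    scores = [dot(p, text_emb) / math.sqrt(d) for p in patch_embs]
    attn = softmax(scores)
    # "Global-to-local": keep the indices of the top_k most attended patches.
    keep = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)[:top_k]
    # Renormalize the attention weights over the kept patches.
    z = sum(attn[i] for i in keep)
    weights = [attn[i] / z for i in keep]
    # "Local-to-global": attention-weighted average of the kept patches.
    pooled = [sum(w * patch_embs[i][j] for w, i in zip(weights, keep))
              for j in range(d)]
    # Assumption: fuse image and text features by simple concatenation.
    return keep, pooled + list(text_emb)
```

With the toy patches `[[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]`, the text embedding `[1.0, 0.0]`, and `top_k=2`, the two patches most aligned with the text (indices 0 and 2) are kept, and the fused vector has length 4.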

Electronic Imaging

2470-1173

Society for Imaging Science and Technology

IS&T 7003 Kilworth Lane, Springfield, VA 22151 USA

HPCI-183

Proceedings Paper

Quan Liu, Vanderbilt University, US

Ruining Deng, Vanderbilt University, US

Can Cui, Vanderbilt University, US

Tianyuan Yao, Vanderbilt University, US

Yuechen Yang, Vanderbilt University, US

Vishwesh Nath, NVIDIA

Bingshan Li, Vanderbilt University, US

You Chen, Vanderbilt University, US

Yucheng Tang, NVIDIA

Yuankai Huo, Vanderbilt University, US

2/2/2025

HPCI

High Performance Computing for Imaging 2025

183-1

183-7

2025

Visual language model; Representation learning; Prognosis analysis; Pathology
