Unified Multimodal Model for Brain MRI Imputation and Understanding

Song, Zhiyun; Liu, Che; Xia, Tian; Kori, Avinash; Bai, Wenjia

Imperial College London
Early accepted to MICCAI 2026

UniBrain integrates missing modality imputation and clinical diagnosis within a single autoregressive process.

Abstract

Multimodal large language models (MLLMs) hold great potential for medicine, as they inherit knowledge from LLMs and allow multiple data modalities to be integrated, analysed and interpreted in natural language. However, the field of medical MLLMs is constrained by non-trivial challenges, notably the scarcity of high-quality training data and the frequent occurrence of missing data in the real-world clinical setting. Here, we propose a novel unified multimodal model, UniBrain, for brain magnetic resonance imaging (MRI) analysis. To address potential missing brain MRI modalities, we employ a unified training strategy to perform joint imaging modality imputation and brain image understanding. During training, an interleaved and description-enriched data flow is constructed to train the model in an autoregressive manner, enabling medical reasoning with generated multimodal data. A self-alignment strategy is introduced to leverage dense image embeddings to learn fine-grained anatomical features without requiring detailed image captions. Furthermore, we propose a dynamic hidden state mechanism to alleviate the exposure bias during long-context multimodal inference. Extensive experiments on a multi-disease brain MRI dataset demonstrate that UniBrain achieves high performance for brain image imputation, understanding, and disease diagnosis under various extents of modality incompleteness.

Methodology

Problem

Lack of suitable datasets for unified modeling

Solution

Interleaved data flow

We interleave image and text data by formulating sequential reasoning-enriched generation tasks before the final understanding.
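
As a rough illustration of such an interleaved flow, the sketch below assembles one training sequence in plain Python. The modality order follows the imputation chain reported in the results (T1w → T2w → T2f → T1c). The segment labels, the `build_interleaved_sequence` helper, the placeholder strings, and the choice to place each description right after its image are all hypothetical and not taken from the released code.

```python
from dataclasses import dataclass
from typing import Dict, List

# Order assumed from the reported imputation chain T1w -> T2w -> T2f -> T1c.
MODALITY_ORDER = ["T1w", "T2w", "T2f", "T1c"]


@dataclass
class Segment:
    kind: str      # "image", "target_image", or "text" (hypothetical labels)
    modality: str  # modality the segment refers to, "" for the final report
    content: str   # placeholder standing in for image tokens or text


def build_interleaved_sequence(descriptions: Dict[str, str],
                               available: List[str],
                               report: str) -> List[Segment]:
    """Interleave images with their descriptions and end with the report.

    Modalities listed in `available` are treated as inputs; the remaining
    ones become generation targets. Every image is followed by its
    description, and the diagnostic report closes the sequence.
    """
    sequence: List[Segment] = []
    for modality in MODALITY_ORDER:
        kind = "image" if modality in available else "target_image"
        sequence.append(Segment(kind, modality, f"<{modality} tokens>"))
        sequence.append(Segment("text", modality, descriptions[modality]))
    sequence.append(Segment("text", "", report))
    return sequence


# Example: only T1w is acquired, so T2w, T2f and T1c must be generated.
descriptions = {m: f"Description of the {m} slice." for m in MODALITY_ORDER}
example = build_interleaved_sequence(descriptions, ["T1w"], "Final diagnosis text.")
for segment in example:
    print(segment.kind, segment.modality)
```

Flattening the sequence in this order means an autoregressive model would see every generated image and its description before it produces the final report.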

Problem

Gap between medical understanding and image generation

Solution

Self-alignment refinement

Dense ViT-guided reconstruction reduces domain gap and enriches understanding-enhanced generation in a self-supervised manner.
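
One way to picture this kind of alignment is a patch-wise similarity loss between the features the model produces and dense embeddings from a vision transformer. The sketch below uses a mean cosine distance over patches. This is a stand-in chosen for illustration, not the loss defined in the paper, and the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_loss(predicted: Sequence[Sequence[float]],
                   target: Sequence[Sequence[float]]) -> float:
    """Mean cosine distance between per-patch predicted and target embeddings.

    `predicted` stands for the model's patch features and `target` for the
    dense ViT embeddings of the same image (both are assumptions here).
    """
    if len(predicted) != len(target) or not predicted:
        raise ValueError("need the same non-zero number of patches")
    distances = [1.0 - cosine_similarity(p, t) for p, t in zip(predicted, target)]
    return sum(distances) / len(distances)


# Two patches: the first is perfectly aligned, the second is orthogonal.
loss = alignment_loss([[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [0.0, 3.0]])
print(loss)
```

Because the target embeddings come from the image itself, a loss of this shape needs no caption, which matches the self-supervised framing above.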

Problem

Exposure bias in long-context medical reasoning

Solution

Dynamic hidden states

Training-time KV-cache conditioning improves robustness to generated visual context.
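
The mechanism itself is not detailed on this page. The sketch below only illustrates the general idea behind reducing exposure bias: during training, some of the ground-truth context is replaced by the model's own output, in the style of scheduled sampling. It works on placeholder strings rather than a real KV cache, and `mix_context` and `p_generated` are hypothetical names.

```python
import random
from typing import List, Sequence


def mix_context(ground_truth: Sequence[str],
                generated: Sequence[str],
                p_generated: float,
                rng: random.Random) -> List[str]:
    """Pick, per context slot, the ground truth or the model's own output.

    With probability `p_generated` a slot uses the generated item, so the
    model is trained on the kind of context it will see at inference time.
    """
    if len(ground_truth) != len(generated):
        raise ValueError("context lists must have the same length")
    if not 0.0 <= p_generated <= 1.0:
        raise ValueError("p_generated must lie in [0, 1]")
    return [gen if rng.random() < p_generated else gt
            for gt, gen in zip(ground_truth, generated)]


# Example with placeholder strings standing in for cached image states.
rng = random.Random(0)
mixed = mix_context(["gt_T2w", "gt_T2f"], ["gen_T2w", "gen_T2f"], 0.5, rng)
print(mixed)
```

Setting `p_generated` to 0 recovers plain teacher forcing, while 1 trains only on generated context.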

Experimental Results

MRI Diagnosis and Report Generation

Methods	T1w only (Top-1)	T1w only (ROUGE)	T1w + T2w (Top-1)	T1w + T2w (ROUGE)	T1w + T2w + T2f (Top-1)	T1w + T2w + T2f (ROUGE)	Complete Data (Top-1)	Complete Data (ROUGE)
SimMLM (Implicit)	65.98	-	74.47	-	76.60	-	78.72	-
M2DN + UniBrain (Explicit)	56.03	33.93	75.18	36.38	76.60	38.36	-	-
UniMedVL (MLLM)	29.79	13.90	30.50	14.73	32.62	13.81	38.30	13.70
Lingshu (MLLM)	21.99	18.08	24.82	19.49	29.79	18.13	41.13	20.26
UniBrain Und.	69.50	37.35	73.05	35.94	76.60	38.05	82.06	38.94
UniBrain (Ours)	74.47	36.93	76.60	38.23	78.01	38.68	82.06	38.94

MRI Modality Imputation

Dedicated explicit modality imputers prioritize low-level pixel similarity (PSNR/SSIM) but suffer from disjointed feature spaces. UniBrain generates clinically usable outputs that match or exceed their downstream Top-1 accuracy, for example 74.47 versus at most 61.70 for the dedicated imputers on T1w, T2w, T2f → T1c.

Qualitative imputation results for three tasks: T1w → T2w; T1w, T2w → T2f; and T1w, T2w, T2f → T1c. Columns: MM-GAN, ResViT, M2DN, UniMedVL, UniBrain (Gen.), UniBrain, and Target.

Methods	T1w → T2w (PSNR)	T1w → T2w (Top-1)	T1w, T2w → T2f (PSNR)	T1w, T2w → T2f (Top-1)	T1w, T2w, T2f → T1c (PSNR)	T1w, T2w, T2f → T1c (Top-1)
MM-GAN (GAN)	23.08	56.74	23.32	56.03	23.40	60.19
ResViT (Transformer)	22.81	57.45	23.13	67.38	23.00	61.70
M2DN (Diffusion)	22.79	51.06	22.46	51.06	22.05	61.70
UniMedVL (UMM)	19.82	56.03	19.96	63.12	21.53	56.74
UniBrain	22.23	68.09	22.58	67.38	22.26	74.47
UniBrain (Ensemble)	23.43	63.83	23.49	68.08	23.52	76.60
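
For readers less familiar with the two metrics contrasted above, the following sketch computes PSNR on flattened intensity lists and Top-1 accuracy on predicted labels. The intensity range (`max_value=1.0`) and the label strings are assumptions for illustration; the evaluation settings actually used in the paper are not given here.

```python
import math
from typing import Sequence


def psnr(predicted: Sequence[float],
         target: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def top1_accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Percentage of cases whose predicted label equals the reference label."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * correct / len(reference)


# A uniform error of 0.1 on a [0, 1] intensity scale.
print(psnr([0.1, 0.1, 0.1, 0.1], [0.0, 0.0, 0.0, 0.0]))

# Hypothetical labels: three of four predictions are correct.
print(top1_accuracy(["glioma", "normal", "stroke", "glioma"],
                    ["glioma", "normal", "glioma", "glioma"]))
```

PSNR depends only on pixel-wise error, whereas Top-1 is measured after a diagnosis model has consumed the generated image, which is why the two columns can rank methods differently.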

Ablation Studies

We evaluate the contribution of each core component in UniBrain: unified modeling with interleaved data, SA (Self-Alignment) for fine-grained representation, and DHS (Dynamic Hidden States) for robust autoregressive generation. Starting from a vanilla baseline (Model A), unified modeling raises T1w-only diagnosis accuracy (Acc-1) from 70.05 to 75.11. Adding SA improves generation quality (PSNR from 21.28 to 22.09, downstream Top-1 from 70.12 to 74.03). Adding DHS gives the best generation results (PSNR 22.47, Top-1 76.60) and the highest ROUGE and RaTEScore, with an Acc-1 of 74.47, which is the best overall balance between generation and understanding.

Understanding (Und.) is evaluated with T1w input only; generation (Gen.) follows the chain T1w → ... → T1c.

Model	Unified	SA	DHS	Und. Acc-1	Und. ROUGE	Und. RaTEScore	Gen. PSNR	Gen. SSIM	Gen. Top-1
Model A (Baseline)	✗	✗	✗	70.05	35.71	60.12	-	-	-
Model B	✓	✗	✗	75.11	36.35	60.52	21.28	0.8329	70.12
Model C	✓	✓	✗	73.76	35.34	59.23	22.09	0.8456	74.03
UniBrain (Ours)	✓	✓	✓	74.47	36.93	61.57	22.47	0.8519	76.60

Future Work & Limitations

Main limitation: the current framework only supports 2D modeling, which results in visible flickering across slices in generation tasks and biased textual descriptions in understanding tasks.

Example of 2D slice-by-slice generation flickering.

Ground Truth 3D Volume for reference.

In addition, external evaluations, including radiologist assessments, generalization to other datasets, and extended data modalities, are valuable future directions.

BibTeX

@inproceedings{song2026unibrain,
  title={Unified Multimodal Model for Brain MRI Imputation and Understanding},
  author={Song, Zhiyun and Liu, Che and Xia, Tian and Kori, Avinash and Bai, Wenjia},
  booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI)},
  year={2026}
}