Unified Multimodal Model for Brain MRI Imputation and Understanding

Imperial College London
Early accepted to MICCAI 2026

UniBrain integrates missing modality imputation and clinical diagnosis within a single autoregressive process.

Abstract

Multimodal large language models (MLLMs) hold great potential for medicine, as they inherit knowledge from LLMs and allow multiple data modalities to be integrated, analysed and interpreted in natural language. However, the field of medical MLLMs is constrained by non-trivial challenges, notably the scarcity of high-quality training data and the frequent occurrence of missing data in real-world clinical settings. Here, we propose a novel unified multimodal model, UniBrain, for brain magnetic resonance imaging (MRI) analysis. To address potentially missing brain MRI modalities, we employ a unified training strategy to perform joint imaging modality imputation and brain image understanding. During training, an interleaved and description-enriched data flow is constructed to train the model in an autoregressive manner, enabling medical reasoning with generated multimodal data. A self-alignment strategy is introduced to leverage dense image embeddings to learn fine-grained anatomical features without requiring detailed image captions. Furthermore, we propose a dynamic hidden state mechanism to alleviate exposure bias during long-context multimodal inference. Extensive experiments on a multi-disease brain MRI dataset demonstrate that UniBrain achieves high performance for brain image imputation, understanding, and disease diagnosis under various degrees of modality incompleteness.

Methodology

A
Problem

Lack of suitable datasets for unified modeling

Solution

Interleaved data flow

We interleave image and text data by formulating sequential reasoning-enriched generation tasks before the final understanding.
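
As a rough illustration of what an interleaved sample might look like, the sketch below builds one sequence from the available modalities, the missing modalities to be generated, and a final report. The `Segment` type, the `build_interleaved_sequence` helper, the placeholder strings, and the choice to place a description before each generated image are all assumptions made for illustration; the page does not specify the actual data format.

```python
from dataclasses import dataclass
from typing import Dict, List

# Imputation order taken from the task chain reported on this page.
MODALITY_ORDER = ["T1w", "T2w", "T2f", "T1c"]


@dataclass
class Segment:
    kind: str     # "image" or "text"
    role: str     # "input", "description", "generated", or "report"
    content: str  # placeholder standing in for image tokens or text


def build_interleaved_sequence(available: List[str],
                               descriptions: Dict[str, str],
                               report: str) -> List[Segment]:
    """Interleave image and text segments, ending with the understanding target."""
    sequence: List[Segment] = []
    for modality in MODALITY_ORDER:
        if modality in available:
            sequence.append(Segment("image", "input", f"<{modality}>"))
            continue
        # Assumption: a textual description, when given, precedes the image
        # that the model has to generate for a missing modality.
        if modality in descriptions:
            sequence.append(Segment("text", "description", descriptions[modality]))
        sequence.append(Segment("image", "generated", f"<{modality}>"))
    # The final understanding task (diagnosis and report) closes the sequence.
    sequence.append(Segment("text", "report", report))
    return sequence


example = build_interleaved_sequence(
    available=["T1w"],
    descriptions={"T2w": "hypothetical T2w description"},
    report="hypothetical diagnosis and report",
)
roles = [(segment.kind, segment.role) for segment in example]
```

With only T1w available, the sequence contains one input image, three images to generate, one description, and the closing report, so a single autoregressive pass covers both imputation and understanding.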

B
Problem

Gap between medical understanding and image generation

Solution

Self-alignment refinement

Dense ViT-guided reconstruction reduces domain gap and enriches understanding-enhanced generation in a self-supervised manner.
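
The exact self-alignment objective is not given on this page. One common way to align predicted features with dense image embeddings is a per-patch cosine distance, sketched below in plain Python. The function names and the choice of cosine distance are assumptions, not the authors' loss.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_loss(predicted: Sequence[Sequence[float]],
                   target: Sequence[Sequence[float]]) -> float:
    """Mean cosine distance (1 - similarity) over corresponding patch embeddings.

    `predicted` stands for the model's per-patch features and `target` for
    dense ViT embeddings of the same image (both hypothetical here).
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same number of patches")
    if not predicted:
        raise ValueError("at least one patch embedding is required")
    distances = [1.0 - cosine_similarity(p, t) for p, t in zip(predicted, target)]
    return sum(distances) / len(distances)


same = alignment_loss([[1.0, 2.0], [0.5, 0.5]], [[1.0, 2.0], [0.5, 0.5]])
orthogonal = alignment_loss([[1.0, 0.0]], [[0.0, 1.0]])
```

Because the target embeddings come from the image itself, a loss of this kind needs no captions, which matches the self-supervised setting described above.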

C
Problem

Exposure bias in long-context medical reasoning

Solution

Dynamic hidden states

Training-time KV-cache conditioning improves robustness to generated visual context.
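
Exposure bias arises when a model is trained only on ground-truth context but must condition on its own outputs at inference. The sketch below is not the dynamic hidden state mechanism itself, whose details are not given on this page. It shows a simpler, scheduled-sampling-style stand-in that operates on context segments rather than on the KV cache: each ground-truth segment is swapped for a model-generated one with some probability. The `mix_context` helper and the `p_generated` parameter are hypothetical.

```python
import random
from typing import List, Sequence


def mix_context(ground_truth: Sequence[str],
                generated: Sequence[str],
                p_generated: float,
                rng: random.Random) -> List[str]:
    """Replace each ground-truth segment with its generated counterpart
    with probability `p_generated`, so that later steps are trained on
    context resembling what the model sees at inference time."""
    if len(ground_truth) != len(generated):
        raise ValueError("ground_truth and generated must have equal length")
    if not 0.0 <= p_generated <= 1.0:
        raise ValueError("p_generated must lie in [0, 1]")
    return [
        gen if rng.random() < p_generated else gt
        for gt, gen in zip(ground_truth, generated)
    ]


gt_segments = ["gt_T2w", "gt_T2f"]
gen_segments = ["gen_T2w", "gen_T2f"]
teacher_forced = mix_context(gt_segments, gen_segments, 0.0, random.Random(0))
fully_generated = mix_context(gt_segments, gen_segments, 1.0, random.Random(0))
```

Setting `p_generated` to 0 recovers plain teacher forcing, while 1 trains entirely on generated context. Intermediate values trade stability for robustness.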

Experimental Results

MRI Diagnosis and Report Generation

MRI diagnosis and report generation results
Methods                     T1w only          T1w + T2w         T1w + T2w + T2f   Complete Data
                            Top-1   ROUGE     Top-1   ROUGE     Top-1   ROUGE     Top-1   ROUGE
SimMLM (Implicit)           65.98   -         74.47   -         76.60   -         78.72   -
M2DN + UniBrain (Explicit)  56.03   33.93     75.18   36.38     76.60   38.36     -       -
UniMedVL (MLLM)             29.79   13.90     30.50   14.73     32.62   13.81     38.30   13.70
Lingshu (MLLM)              21.99   18.08     24.82   19.49     29.79   18.13     41.13   20.26
UniBrain Und.               69.50   37.35     73.05   35.94     76.60   38.05     82.06   38.94
UniBrain (Ours)             74.47   36.93     76.60   38.23     78.01   38.68     82.06   38.94

MRI Modality Imputation

Dedicated explicit modality imputers prioritize low-level pixel similarity (PSNR/SSIM) but suffer from disjointed feature spaces. UniBrain generates clinically usable outputs that boost downstream Top-1 accuracy: for T1w, T2w, T2f → T1c it reaches 74.47, versus at most 61.70 for the other methods.

Figure: imputed images for each task (rows: T1w → T2w; T1w, T2w → T2f; T1w, T2w, T2f → T1c). Columns: MM-GAN, ResViT, M2DN, UniMedVL, UniBrain (Gen.), UniBrain, Target.

Methods                 T1w → T2w         T1w, T2w → T2f    T1w, T2w, T2f → T1c
                        PSNR    Top-1     PSNR    Top-1     PSNR    Top-1
MM-GAN (GAN)            23.08   56.74     23.32   56.03     23.40   60.19
ResViT (Transformer)    22.81   57.45     23.13   67.38     23.00   61.70
M2DN (Diffusion)        22.79   51.06     22.46   51.06     22.05   61.70
UniMedVL (UMM)          19.82   56.03     19.96   63.12     21.53   56.74
UniBrain                22.23   68.09     22.58   67.38     22.26   74.47
UniBrain (Ensemble)     23.43   63.83     23.49   68.08     23.52   76.60
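
For reference, PSNR follows the standard definition 10 · log10(MAX² / MSE). The sketch below computes it over flattened intensity lists. The page does not state the intensity range or whether PSNR is computed per slice or per volume, so the `max_value=1.0` default and the flat-list input are assumptions.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float],
         generated: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Both images are given as flattened intensity sequences; `max_value`
    is the largest possible intensity (assumed 1.0 for normalized images).
    """
    if len(reference) != len(generated):
        raise ValueError("images must have the same number of pixels")
    if not reference:
        raise ValueError("images must not be empty")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01.
value = psnr([0.0, 0.0, 0.0, 0.0], [0.1, 0.1, 0.1, 0.1])
```

Since PSNR only measures pixel-level error, a method can score well on it and still yield lower downstream Top-1 accuracy, as the table above shows.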

Ablation Studies

We evaluated the contributions of each core component in UniBrain: Unified modeling with interleaved data, SA (Self-Alignment) for fine-grained representation, and DHS (Dynamic Hidden States) for robust autoregressive generation. Starting from a vanilla baseline (Model A), unified modeling greatly improves diagnosis performance. Adding SA benefits generation quality, while the final DHS mechanism achieves the optimal overall balance for both generation and understanding tasks.

Model               Components          Understanding (T1w only)    Generation (T1w → ... → T1c)
                    Unified  SA   DHS   Acc-1   ROUGE   RaTEScore   PSNR    SSIM     Top-1
Model A (Baseline)  -        -    -     70.05   35.71   60.12       -       -        -
Model B             ✓        -    -     75.11   36.35   60.52       21.28   0.8329   70.12
Model C             ✓        ✓    -     73.76   35.34   59.23       22.09   0.8456   74.03
UniBrain (Ours)     ✓        ✓    ✓     74.47   36.93   61.57       22.47   0.8519   76.60

Future Work & Limitations

Main limitation: the current framework only supports 2D modeling, which causes visible flickering between slices in the generation task and biased textual descriptions in the understanding task.

Example of 2D slice-by-slice generation flickering.

Ground Truth 3D Volume for reference.

In addition, external evaluations, including radiologist assessments, generalization to other datasets, and extension to further data modalities, are valuable future directions.

BibTeX

@inproceedings{song2026unibrain,
  title={Unified Multimodal Model for Brain MRI Imputation and Understanding},
  author={Song, Zhiyun and Liu, Che and Xia, Tian and Kori, Avinash and Bai, Wenjia},
  booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI)},
  year={2026}
}