Lack of suitable datasets for unified modeling
Interleaved data flow
We interleave image and text data by formulating sequential reasoning-enriched generation tasks before the final understanding.
Multimodal large language models (MLLMs) hold great potential for medicine, as they inherit knowledge from LLMs and allow multiple data modalities to be integrated, analysed, and interpreted in natural language. However, the field of medical MLLMs is constrained by non-trivial challenges, notably the scarcity of high-quality training data and the frequent occurrence of missing data in real-world clinical settings. Here, we propose a novel unified multimodal model, UniBrain, for brain magnetic resonance imaging (MRI) analysis. To address potentially missing brain MRI modalities, we employ a unified training strategy that jointly performs imaging modality imputation and brain image understanding. During training, an interleaved and description-enriched data flow is constructed to train the model in an autoregressive manner, enabling medical reasoning with generated multimodal data. A self-alignment strategy is introduced to leverage dense image embeddings to learn fine-grained anatomical features without requiring detailed image captions. Furthermore, we propose a dynamic hidden state mechanism to alleviate exposure bias during long-context multimodal inference. Extensive experiments on a multi-disease brain MRI dataset demonstrate that UniBrain achieves high performance for brain image imputation, understanding, and disease diagnosis under varying degrees of modality incompleteness.
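As a rough illustration of the interleaved, description-enriched data flow mentioned above, the sketch below lays out one training sequence for a case with missing modalities. The modality order follows the imputation chain reported in the results (T1w, T2w, T2f, T1c); the `Segment` type, the `build_interleaved_sequence` helper, the per-image description segments, and the segment labels are all hypothetical and are not taken from the UniBrain implementation.

```python
from dataclasses import dataclass

# Modality order follows the imputation chain reported in the results
# (T1w -> T2w -> T2f -> T1c). The segment layout below is an assumption.
MODALITY_ORDER = ("T1w", "T2w", "T2f", "T1c")


@dataclass(frozen=True)
class Segment:
    kind: str        # "image" or "text"
    name: str        # hypothetical label, e.g. "T2w" or "T2w description"
    generated: bool  # True if the model must produce this segment itself


def build_interleaved_sequence(available):
    """Interleave image and text segments, ending with the understanding task."""
    unknown = set(available) - set(MODALITY_ORDER)
    if unknown:
        raise ValueError(f"unknown modalities: {sorted(unknown)}")
    segments = []
    for modality in MODALITY_ORDER:
        missing = modality not in available
        # A missing modality becomes an image-generation (imputation) step.
        segments.append(Segment("image", modality, generated=missing))
        # Assumption: every image is followed by a generated text description.
        segments.append(Segment("text", f"{modality} description", generated=True))
    # The final understanding step comes after all generation steps.
    segments.append(Segment("text", "final understanding", generated=True))
    return segments


if __name__ == "__main__":
    for segment in build_interleaved_sequence({"T1w"}):
        role = "generate" if segment.generated else "given"
        print(f"{role:8s} {segment.kind:5s} {segment.name}")
```

With only T1w available, the three remaining images become generation steps that precede the final understanding segment, which mirrors the "T1w only" evaluation setting.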
Gap between medical understanding and image generation
Dense ViT-guided reconstruction reduces domain gap and enriches understanding-enhanced generation in a self-supervised manner.
Exposure bias in long-context medical reasoning
Training-time KV-cache conditioning improves robustness to generated visual context.
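Exposure bias arises when a model is trained only on ground-truth context but must condition on its own outputs at inference time. The sketch below shows one generic way to expose a model to generated context during training: a scheduled-sampling-style mix of ground-truth and generated segments. This is not the dynamic hidden state mechanism itself, whose details are not given here; `mix_context` and its parameters are hypothetical names used only for illustration.

```python
import random


def mix_context(ground_truth, generated, p_generated, rng):
    """Per segment, use the generated context with probability p_generated.

    Scheduled-sampling-style mixing, shown only to illustrate training on
    generated context; it is an assumption, not the UniBrain DHS mechanism.
    """
    if len(ground_truth) != len(generated):
        raise ValueError("context lists must have the same length")
    if not 0.0 <= p_generated <= 1.0:
        raise ValueError("p_generated must lie in [0, 1]")
    return [
        gen if rng.random() < p_generated else gt
        for gt, gen in zip(ground_truth, generated)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    gt = ["T2w (real)", "T2f (real)", "T1c (real)"]
    gen = ["T2w (generated)", "T2f (generated)", "T1c (generated)"]
    print(mix_context(gt, gen, p_generated=0.5, rng=rng))
```

Setting `p_generated` to 0 recovers pure teacher forcing, while 1 trains entirely on generated context.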
| Methods | T1w only: Top-1 | T1w only: ROUGE | T1w + T2w: Top-1 | T1w + T2w: ROUGE | T1w + T2w + T2f: Top-1 | T1w + T2w + T2f: ROUGE | Complete data: Top-1 | Complete data: ROUGE |
|---|---|---|---|---|---|---|---|---|
| SimMLM (Implicit) | 65.98 | - | 74.47 | - | 76.60 | - | 78.72 | - |
| M2DN + UniBrain (Explicit) | 56.03 | 33.93 | 75.18 | 36.38 | 76.60 | 38.36 | - | - |
| UniMedVL (MLLM) | 29.79 | 13.90 | 30.50 | 14.73 | 32.62 | 13.81 | 38.30 | 13.70 |
| Lingshu (MLLM) | 21.99 | 18.08 | 24.82 | 19.49 | 29.79 | 18.13 | 41.13 | 20.26 |
| UniBrain Und. | 69.50 | 37.35 | 73.05 | 35.94 | 76.60 | 38.05 | 82.06 | 38.94 |
| UniBrain (Ours) | 74.47 | 36.93 | 76.60 | 38.23 | 78.01 | 38.68 | 82.06 | 38.94 |
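For readers unfamiliar with the two metrics in the table above, the following minimal sketch computes Top-1 accuracy and a ROUGE score, both as percentages. The table does not state which ROUGE variant is used, so ROUGE-L F1 over whitespace tokens is assumed here; the function names and the toy labels and reports are illustrative only.

```python
def top1_accuracy(predicted, truth):
    """Percentage of cases whose top predicted label matches the true label."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == t for p, t in zip(predicted, truth))
    return 100.0 * correct / len(truth)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 (as a percentage) over lowercased whitespace tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 100.0 * 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(top1_accuracy(["glioma", "normal", "stroke"], ["glioma", "normal", "glioma"]))
    print(rouge_l_f1("mass in left frontal lobe", "lesion in left temporal lobe"))
```

Published ROUGE implementations often add stemming and other tokenization rules, so scores from this sketch would not be directly comparable to the reported numbers.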
Dedicated explicit modality imputers prioritize low-level pixel similarity (PSNR/SSIM) but suffer from disjointed feature spaces. UniBrain instead generates clinically usable outputs that substantially boost downstream Top-1 accuracy, for example 74.47 versus at most 61.70 for the dedicated imputers on T1w, T2w, T2f → T1c.
[Figure: qualitative comparison of imputed images for three tasks (T1w → T2w; T1w, T2w → T2f; T1w, T2w, T2f → T1c). Columns: MM-GAN, ResViT, M2DN, UniMedVL, UniBrain (Gen.), UniBrain, Target.]
| Methods | T1w → T2w: PSNR | T1w → T2w: Top-1 | T1w, T2w → T2f: PSNR | T1w, T2w → T2f: Top-1 | T1w, T2w, T2f → T1c: PSNR | T1w, T2w, T2f → T1c: Top-1 |
|---|---|---|---|---|---|---|
| MM-GAN (GAN) | 23.08 | 56.74 | 23.32 | 56.03 | 23.40 | 60.19 |
| ResViT (Transformer) | 22.81 | 57.45 | 23.13 | 67.38 | 23.00 | 61.70 |
| M2DN (Diffusion) | 22.79 | 51.06 | 22.46 | 51.06 | 22.05 | 61.70 |
| UniMedVL (UMM) | 19.82 | 56.03 | 19.96 | 63.12 | 21.53 | 56.74 |
| UniBrain | 22.23 | 68.09 | 22.58 | 67.38 | 22.26 | 74.47 |
| UniBrain (Ensemble) | 23.43 | 63.83 | 23.49 | 68.08 | 23.52 | 76.60 |
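The PSNR columns above measure pixel-level similarity between an imputed image and its target. A minimal sketch of the standard definition, PSNR = 10 · log10(MAX² / MSE), is given below. It assumes intensities are flattened into equal-length sequences and normalized to a known `data_range`; the normalization and any masking used for the reported numbers are not stated here.

```python
import math


def psnr(reference, generated, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    if len(reference) != len(generated) or not reference:
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


if __name__ == "__main__":
    target = [0.0, 0.5, 1.0, 0.25]
    imputed = [0.1, 0.4, 0.9, 0.35]
    print(f"PSNR: {psnr(target, imputed):.2f} dB")
```

Because PSNR depends only on mean squared error, a method can score well on it while still producing images that are less useful for downstream diagnosis, which is the contrast the Top-1 columns capture.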
We evaluated the contribution of each core component in UniBrain: unified modeling with interleaved data, self-alignment (SA) for fine-grained representation, and dynamic hidden states (DHS) for robust autoregressive generation. Starting from a vanilla baseline (Model A), unified modeling improves diagnosis accuracy (Acc-1: 70.05 → 75.11). Adding SA improves generation quality, and the final DHS mechanism achieves the best overall balance between generation and understanding.
| Model | Unified | SA | DHS | Understanding (T1w only): Acc-1 | Understanding (T1w only): ROUGE | Understanding (T1w only): RaTEScore | Generation (T1w → ... → T1c): PSNR | Generation (T1w → ... → T1c): SSIM | Generation (T1w → ... → T1c): Top-1 |
|---|---|---|---|---|---|---|---|---|---|
| Model A (Baseline) | | | | 70.05 | 35.71 | 60.12 | - | - | - |
| Model B | ✓ | | | 75.11 | 36.35 | 60.52 | 21.28 | 0.8329 | 70.12 |
| Model C | ✓ | ✓ | | 73.76 | 35.34 | 59.23 | 22.09 | 0.8456 | 74.03 |
| UniBrain (Ours) | ✓ | ✓ | ✓ | 74.47 | 36.93 | 61.57 | 22.47 | 0.8519 | 76.60 |
Main limitation: the current framework only supports 2D modeling, which results in visible flickering in the generation task and biased textual descriptions in the understanding task.
[Figure: example of 2D slice-by-slice generation flickering, with the ground-truth 3D volume for reference.]
In addition, external evaluations, including radiologist assessments, generalization to other datasets, and extension to further data modalities, are valuable future directions.
@inproceedings{song2026unibrain,
title={Unified Multimodal Model for Brain MRI Imputation and Understanding},
author={Song, Zhiyun and Liu, Che and Xia, Tian and Kori, Avinash and Bai, Wenjia},
booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI)},
year={2026}
}