InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Apr 14, 2025
Jinguo Zhu*
Weiyun Wang*
Zhe Chen*
Zhaoyang Liu*
Shenglong Ye*
Lixin Gu*
Hao Tian*
Yuchen Duan*
Weijie Su
Jie Shao
Zhangwei Gao
Erfei Cui
Xuehui Wang
Yue Cao
Yangzhou Liu
Xingguang Wei
Hongjie Zhang
Haomin Wang
Weiye Xu
Hao Li
Jiahao Wang
Nianchen Deng
Songze Li
Yinan He
Tan Jiang
Jiapeng Luo
Yi Wang
Conghui He
Botian Shi
Xingcheng Zhang
Wenqi Shao
Junjun He
Yingtong Xiong
Wenwen Qu
Peng Sun
Penglong Jiao
Han Lv
Lijun Wu
Kaipeng Zhang
Huipeng Deng
Jiaye Ge
Kai Chen
Limin Wang
Min Dou
Lewei Lu
Xizhou Zhu
Tong Lu
Dahua Lin
Yu Qiao
Jifeng Dai
Wenhai Wang

Abstract
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multimodal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
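The variable visual position encoding (V2PE) mentioned in the abstract extends the usable multimodal context by advancing the position index with a smaller, fractional step for visual tokens than for text tokens. The sketch below is only a conceptual illustration of that idea, not the released implementation; the function name v2pe_position_ids, the delta parameter, and the token-type convention are assumptions made for this example.

from typing import List

def v2pe_position_ids(token_types: List[str], delta: float = 0.25) -> List[float]:
    """Conceptual sketch of variable visual position encoding (V2PE).

    Text tokens advance the position index by 1, while visual tokens
    advance it by a smaller increment delta (< 1), so long image token
    sequences consume less of the positional range and longer
    multimodal contexts fit under the same maximum position.
    """
    positions, current = [], 0.0
    for t in token_types:
        positions.append(current)
        current += 1.0 if t == "text" else delta
    return positions

# Example: a prompt with 4 text tokens followed by 8 visual tokens.
tokens = ["text"] * 4 + ["image"] * 8
print(v2pe_position_ids(tokens))
# The text tokens occupy positions 0..3, while the 8 visual tokens
# only span 4.0..5.75 instead of 4..11.

Under these assumptions, an image that contributes hundreds of tokens occupies only a fraction of the positional budget, which is how V2PE supports the extended multimodal contexts described above.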
Citation
If you find this project useful in your research, please consider citing:
@article{zhu2025internvl3,
  title={InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models},
  author={Zhu, Jinguo and Wang, Weiyun and Chen, Zhe and Liu, Zhaoyang and Ye, Shenglong and Gu, Lixin and Duan, Yuchen and Tian, Hao and Su, Weijie and Shao, Jie and others},
  journal={arXiv preprint arXiv:2504.10479},
  year={2025}
}