InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Aug 25, 2025
Weiyun Wang*
Zhangwei Gao*
Lixin Gu*
Hengjun Pu*
Long Cui*
Xingguang Wei*
Zhaoyang Liu*
Linglin Jing*
Shenglong Ye*
Jie Shao*
Zhaokai Wang*
Zhe Chen*
Hongjie Zhang
Ganlin Yang
Haomin Wang
Qi Wei
Jinhui Yin
Wenhao Li
Erfei Cui
Guanzhou Chen
Zichen Ding
Changyao Tian
Zhenyu Wu
Jingjing Xie
Zehao Li
Bowen Yang
Yuchen Duan
Xuehui Wang
Zhi Hou
Haoran Hao
Tianyi Zhang
Songze Li
Xiangyu Zhao
Haodong Duan
Nianchen Deng
Bin Fu
Yinan He
Yi Wang
Conghui He
Botian Shi
Junjun He
Yingtong Xiong
Han Lv
Lijun Wu
Wenqi Shao
Kaipeng Zhang
Huipeng Deng
Biqing Qi
Jiaye Ge
Qipeng Guo
Wenwei Zhang
Songyang Zhang
Maosong Cao
Junyao Lin
Kexian Tang
Jianfei Gao
Haian Huang
Yuzhe Gu
Chengqi Lyu
Huanze Tang
Rui Wang
Haijun Lv
Wanli Ouyang
Limin Wang
Min Dou
Xizhou Zhu
Tong Lu
Dahua Lin
Jifeng Dai
Weijie Su
Bowen Zhou
Kai Chen
Yu Qiao
Wenhai Wang
Gen Luo
Abstract
We introduce InternVL3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency within the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy yields substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. Together, these contributions enable InternVL3.5 to achieve up to a +16.0% gain in overall reasoning performance and a 4.05× inference speedup over its predecessor, InternVL3. In addition, InternVL3.5 supports novel capabilities such as GUI interaction and embodied agency. Notably, our largest model, InternVL3.5-241B-A28B, attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks, narrowing the performance gap with leading commercial models such as GPT-5. All models and code are publicly released.
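To make the two-stage Cascade RL schedule concrete, below is a minimal, hypothetical PyTorch sketch of the coarse-to-fine ordering described in the abstract. All names here (TinyPolicy, offline_rl_stage, online_rl_stage) and the DPO-like preference loss and GRPO-style relative advantage are illustrative assumptions, not the paper's actual implementation or API.

# Hypothetical sketch, not the released code: names and losses are stand-ins
# for the offline/online stages outlined in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyPolicy(nn.Module):
    """Stand-in for the MLLM policy: assigns a scalar score per candidate response."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scorer(x).squeeze(-1)

def offline_rl_stage(policy, sample_pairs, steps=100, lr=1e-3):
    """Stage 1 (coarse): preference optimization on pre-collected rollouts,
    pushing chosen responses above rejected ones for stable convergence."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(steps):
        chosen, rejected = sample_pairs()
        margin = policy(chosen) - policy(rejected)
        loss = -F.logsigmoid(margin).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

def online_rl_stage(policy, reward_fn, steps=100, lr=1e-4, group_size=8):
    """Stage 2 (fine): score fresh rollouts with a verifiable reward and
    reinforce those above the group mean (a GRPO-style relative advantage)."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(steps):
        candidates = torch.randn(group_size, 16)  # fresh rollouts, stubbed as noise
        rewards = reward_fn(candidates)
        advantages = rewards - rewards.mean()     # group-relative advantage
        log_probs = torch.log_softmax(policy(candidates), dim=0)
        loss = -(log_probs * advantages).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

policy = TinyPolicy()
offline_rl_stage(policy, lambda: (torch.randn(32, 16) + 0.5, torch.randn(32, 16)))
online_rl_stage(policy, lambda c: c.sum(dim=-1))  # toy "verifiable" reward

The design choice mirrored here is the coarse-to-fine ordering: a cheap, stable offline preference stage first moves the policy into a good basin, after which the costlier online rollout stage refines alignment against a reward signal.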
Citation
If you find this project useful in your research, please consider citing:
@article{wang2025internvl3,
  title={InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency},
  author={Wang, Weiyun and Gao, Zhangwei and Gu, Lixin and Pu, Hengjun and Cui, Long and Wei, Xingguang and Liu, Zhaoyang and Jing, Linglin and Ye, Shenglong and Shao, Jie and others},
  journal={arXiv preprint arXiv:2508.18265},
  year={2025}
}