InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models
Oct 13, 2025
1 min read
Haomin Wang*
Jinhui Yin*
Qi Wei*
Wenguang Zeng
Lixin Gu
Shenglong Ye
Zhangwei Gao
Yaohui Wang
Yanting Zhang
Yuanqi Li
Yanwen Guo
Wenhai Wang
Kai Chen
Yu Qiao
Zhe Chen*
Hongjie Zhang
Abstract
General SVG modeling remains challenging due to fragmented datasets, limited transferability of methods across tasks, and the difficulty of handling structural complexity. In response, we leverage the strong transfer and generalization capabilities of multimodal large language models (MLLMs) to achieve unified modeling for SVG understanding, editing, and generation. We present the InternSVG family, an integrated data-benchmark-model suite. At its core is SAgoge, the largest and most comprehensive multimodal dataset for SVG tasks, encompassing both static graphics and dynamic animations. It covers icons, long-sequence illustrations, scientific diagrams, and dynamic animations, supporting tasks of varied difficulty levels and providing deeper hierarchies with richer attributes compared to previous datasets. Based on this resource, we introduce SArena, a companion benchmark with comprehensive task definitions and standardized evaluation that aligns with the domains and difficulty spectrum covered by SAgoge. Building on these foundations, we propose InternSVG, a unified MLLM for SVG understanding, editing, and generation with SVG-specific special tokens, subword-based embedding initialization, and a two-stage training strategy that progresses from short static SVGs to long-sequence illustrations and complex animations. This unified formulation induces positive transfer and improves overall performance. Experiments on SArena and prior benchmarks confirm that InternSVG achieves substantial gains and consistently outperforms leading open and proprietary counterparts.
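To make the "SVG-specific special tokens" and "subword-based embedding initialization" mentioned above concrete, here is a minimal sketch of one common way to implement such an initialization with Hugging Face transformers. The base model name and the SVG token strings are illustrative placeholders, not the released InternSVG configuration.

```python
# Minimal sketch (not the authors' released code): add SVG-specific special
# tokens to an MLLM's tokenizer and initialize each new embedding from the
# subword pieces of its surface string.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder base LLM (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical SVG-specific special tokens, e.g. for common tags/commands.
svg_tokens = ["<svg_path>", "<svg_rect>", "<svg_circle>", "<svg_animate>"]
tokenizer.add_special_tokens({"additional_special_tokens": svg_tokens})
model.resize_token_embeddings(len(tokenizer))

embeddings = model.get_input_embeddings().weight
with torch.no_grad():
    for tok in svg_tokens:
        new_id = tokenizer.convert_tokens_to_ids(tok)
        # Subword-based initialization: average the embeddings of the subword
        # pieces the base tokenizer produces for the token's surface string
        # (e.g. "svg_path"), instead of using a random vector.
        sub_ids = tokenizer(tok.strip("<>"), add_special_tokens=False)["input_ids"]
        embeddings[new_id] = embeddings[sub_ids].mean(dim=0)
```

The averaging step gives each new token a semantically reasonable starting point, which typically stabilizes early fine-tuning compared with random initialization.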
Citation
If you find this project useful in your research, please consider citing:
@article{wang2025internvl3,
  title={InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency},
  author={Wang, Weiyun and Gao, Zhangwei and Gu, Lixin and Pu, Hengjun and Cui, Long and Wei, Xingguang and Liu, Zhaoyang and Jing, Linglin and Ye, Shenglong and Shao, Jie and others},
  journal={arXiv preprint arXiv:2508.18265},
  year={2025}
}