InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models

Haomin Wang1,2*, Jinhui Yin3,2*, Qi Wei3,2*, Wenguang Zeng4, Lixin Gu2, Shenglong Ye2, Zhangwei Gao1,2, Yaohui Wang2, Yanting Zhang4, Yuanqi Li3, Yanwen Guo3, Wenhai Wang5, Kai Chen2, Yu Qiao2, Hongjie Zhang2†
1Shanghai Jiao Tong University, 2Shanghai AI Laboratory, 3Nanjing University, 4Donghua University, 5The Chinese University of Hong Kong
*Indicates Equal Contribution, †Indicates Corresponding Author
Overview of our InternSVG family

Abstract

Vector graphics, represented in Scalable Vector Graphics (SVG) format, serve as a core medium for digital design and web rendering. Existing works on SVG tasks often focus on isolated subtasks such as generation, editing, or understanding. In this paper, we propose InternSVG, a unified framework based on multimodal large language models that jointly addresses SVG-related tasks across perception and creation. By representing SVGs as structured sequences and aligning them with textual descriptions and raster renderings, InternSVG enables a generalizable interface for vector reasoning, generation, and manipulation. Extensive experiments demonstrate its versatility and performance across diverse SVG benchmarks.

SAgoge: A Comprehensive Multimodal SVG Dataset

We introduce SAgoge, a large-scale and comprehensive dataset for SVG tasks with more than 16 million training samples spanning icons, illustrations, chemical structures, and animations.

Dataset Pipeline

Raw SVGs are gathered from the web and a custom synthesis pipeline, then normalized to a 128 × 128 canvas and simplified to shorten code. The rendered images or videos, processed SVG code, and handcrafted prompts are fed to an MLLM to synthesize high-quality training samples for understanding, editing, and generation.
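The canvas normalization and code shortening described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the SAgoge pipeline itself: the function names (`normalize_canvas`, `_round_decimals`, `_to_float`), the wrapping `<g transform>` approach, and the choice to round path coordinates to one decimal are all hypothetical, since this page does not specify how normalization or simplification is implemented.

```python
import re
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # serialize without an "ns0:" prefix


def _to_float(text, default):
    """Parse a length such as '256' or '256px'; fall back to default."""
    digits = re.sub(r"[^0-9.]", "", text or "")
    return float(digits) if digits else float(default)


def _round_decimals(data, digits=1):
    """Round every decimal literal in a path string to shorten the code."""
    def repl(match):
        text = f"{float(match.group()):.{digits}f}"
        if "." in text:
            text = text.rstrip("0").rstrip(".")
        return text
    return re.sub(r"-?\d+\.\d+", repl, data)


def normalize_canvas(svg_text, size=128, digits=1):
    """Rescale an SVG to a size x size canvas (assumed procedure, see lead-in)."""
    root = ET.fromstring(svg_text)

    # Source coordinate system: prefer viewBox, else width/height attributes.
    view_box = root.get("viewBox")
    if view_box:
        min_x, min_y, width, height = (
            float(v) for v in re.split(r"[,\s]+", view_box.strip())
        )
    else:
        min_x = min_y = 0.0
        width = _to_float(root.get("width"), size)
        height = _to_float(root.get("height"), size)
    scale = size / max(width, height)  # keeps aspect ratio; content is not centered

    # Shorten path data by rounding decimals (in source units, before scaling).
    for element in root.iter():
        if element.get("d") is not None:
            element.set("d", _round_decimals(element.get("d"), digits))

    # Wrap all children in one group that maps source coordinates onto the canvas.
    namespace = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    group = ET.Element(namespace + "g")
    group.set("transform", f"scale({scale:g}) translate({0 - min_x:g} {0 - min_y:g})")
    for child in list(root):
        root.remove(child)
        group.append(child)
    root.append(group)

    root.set("viewBox", f"0 0 {size} {size}")
    root.set("width", str(size))
    root.set("height", str(size))
    return ET.tostring(root, encoding="unicode")
```

Calling `normalize_canvas(svg_text)` on an SVG with `viewBox="0 0 256 256"` returns the same drawing on a 128 × 128 canvas, with its content wrapped in a group scaled by 0.5. A production pipeline would more likely rewrite the coordinates themselves rather than add a wrapper group, which costs extra tokens.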

InternSVG: A Unified MLLM for SVG Understanding, Editing, and Generation

InternSVG follows the “ViT–MLP–LLM” paradigm, using InternViT-300M as the vision encoder and Qwen2.5-7B as the language model. We further design SVG-specific special tokens and introduce a tailored embedding initialization strategy so that the model incorporates SVG content effectively.
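To make the idea of initializing embeddings for new special tokens concrete, here is a minimal sketch of one common approach: each new token's embedding starts as the mean of the embeddings of the subword pieces that spell it. This is an assumption for illustration only; this page does not state which initialization InternSVG uses. The token `<path>`, the toy tokenizer, and the function names are hypothetical.

```python
import re


def toy_tokenize(text):
    """Hypothetical stand-in for a subword tokenizer: splits words and symbols."""
    return re.findall(r"[A-Za-z]+|[^A-Za-z\s]", text)


def add_tokens_with_mean_init(embeddings, vocab, new_tokens, tokenize=toy_tokenize):
    """Append one embedding row per new token, set to the mean of its pieces' rows.

    embeddings: list of rows (lists of floats), one per existing vocabulary entry
    vocab:      dict mapping token string -> row index; updated in place
    """
    dim = len(embeddings[0])
    for token in new_tokens:
        piece_ids = [vocab[piece] for piece in tokenize(token)]
        mean_row = [
            sum(embeddings[i][d] for i in piece_ids) / len(piece_ids)
            for d in range(dim)
        ]
        vocab[token] = len(embeddings)  # the new token takes the next free index
        embeddings.append(mean_row)
    return embeddings, vocab


# Toy vocabulary with 2-dimensional embeddings.
vocab = {"<": 0, "path": 1, ">": 2}
embeddings = [[0.0, 0.0], [3.0, 6.0], [6.0, 0.0]]
add_tokens_with_mean_init(embeddings, vocab, ["<path>"])
```

Compared with random initialization, starting from the mean of existing pieces places the new token near vectors the language model already associates with the same text, which is the usual motivation for this kind of scheme.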

See InternSVG in Action!

[Three demos and a gallery of 30 sample SVGs are shown here on the project page.]

SArena: A Companion Benchmark

To enable systematic evaluation across SVG understanding, editing, and generation, we introduce SArena, a benchmark that aligns with the domains and difficulty spectrum covered by SAgoge and provides standardized tasks and metrics. SArena includes four sub-benchmarks: icons, illustrations, chemical structures, and animations.

Generation Performance

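T2S = Text-to-SVG, I2S = Image-to-SVG. "--" marks a result that is not reported.

| Model | T2S FID ↓ | T2S FID-C ↓ | T2S CLIP-T2I ↑ | T2S CLIP-I2I ↑ | T2S Tokens | I2S DINO ↑ | I2S SSIM ↑ | I2S LPIPS ↓ | I2S PSNR ↑ | I2S Tokens |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3.1-8B | 19.428 | 11.247 | 21.863 | 71.859 | 280 | -- | -- | -- | -- | -- |
| Qwen2.5-VL-7B | 24.781 | 15.454 | 21.538 | 71.384 | 249 | 0.781 | 0.506 | 0.378 | 6.534 | 281 |
| Keye-VL-8B | 21.961 | 14.393 | 21.557 | 71.167 | 227 | 0.801 | 0.531 | 0.368 | 6.939 | 286 |
| GLM-4.1V-9B | 22.684 | 10.447 | 22.562 | 73.197 | 269 | 0.820 | 0.539 | 0.345 | 7.329 | 289 |
| InternVL3-8B | 23.061 | 14.303 | 21.897 | 71.450 | 269 | 0.812 | 0.557 | 0.361 | 7.220 | 256 |
| Llama-3.2-11B | 28.156 | 14.345 | 21.711 | 71.485 | 261 | 0.759 | 0.467 | 0.389 | 5.908 | 216 |
| Gemma-3-12B | 17.137 | 10.409 | 22.023 | 71.622 | 290 | 0.821 | 0.576 | 0.352 | 7.632 | 360 |
| InternVL3-14B | 18.996 | 13.224 | 22.066 | 71.493 | 227 | 0.825 | 0.562 | 0.359 | 7.343 | 216 |
| Kimi-VL-A3B | 30.807 | 16.996 | 21.439 | 70.536 | 228 | 0.798 | 0.562 | 0.362 | 7.179 | 245 |
| Gemma-3-27B | 15.145 | 9.303 | 22.526 | 73.277 | 249 | 0.826 | 0.595 | 0.354 | 7.833 | 267 |
| Qwen2.5-VL-32B | 20.043 | 10.393 | 22.783 | 73.228 | 317 | 0.836 | 0.562 | 0.357 | 7.503 | 309 |
| InternVL3-38B | 18.014 | 11.042 | 22.795 | 73.077 | 251 | 0.829 | 0.549 | 0.351 | 7.305 | 230 |
| Grok-3 | 21.967 | 8.694 | 24.122 | 76.797 | 346 | -- | -- | -- | -- | -- |
| Llama-3.1-70B | 18.032 | 8.300 | 22.747 | 73.876 | 255 | -- | -- | -- | -- | -- |
| Llama-3.1-405B | 16.794 | 8.390 | 22.822 | 73.920 | 236 | -- | -- | -- | -- | -- |
| DeepSeek-V3 | 24.990 | 8.803 | 23.790 | 76.470 | 251 | -- | -- | -- | -- | -- |
| GPT-4o | 15.178 | 6.763 | 24.617 | 77.742 | 246 | 0.874 | 0.616 | 0.316 | 8.435 | 231 |
| Gemini-2.5-Flash | 16.720 | 5.208 | 24.658 | 78.218 | 451 | 0.876 | 0.587 | 0.316 | 8.324 | 533 |
| Claude-Sonnet-3.7 | 14.383 | 3.499 | 25.294 | 80.786 | 417 | 0.909 | 0.647 | 0.290 | 9.259 | 389 |
| Claude-Sonnet-4 | 15.840 | 4.291 | 25.421 | 80.579 | 444 | 0.915 | 0.665 | 0.276 | 9.855 | 541 |
| Llama-3.2-90B | 19.309 | 8.550 | 22.841 | 74.006 | 249 | 0.757 | 0.437 | 0.377 | 5.777 | 192 |
| Llama-4-Scout | 17.908 | 9.382 | 22.849 | 73.563 | 256 | 0.844 | 0.582 | 0.346 | 7.736 | 246 |
| Llama-4-Maverick | 14.931 | 6.526 | 23.570 | 75.816 | 265 | 0.863 | 0.596 | 0.329 | 8.027 | 255 |
| GLM-4.5V | 16.641 | 5.093 | 24.450 | 78.349 | 372 | 0.872 | 0.627 | 0.315 | 8.666 | 322 |
| Step3-321B | 20.061 | 9.706 | 23.053 | 74.184 | 308 | 0.834 | 0.555 | 0.340 | 7.516 | 301 |
| Qwen2.5-VL-72B | 15.948 | 9.875 | 22.946 | 73.681 | 275 | 0.837 | 0.584 | 0.346 | 7.834 | 372 |
| InternVL3-78B | 17.580 | 10.596 | 22.805 | 73.123 | 252 | 0.850 | 0.584 | 0.339 | 7.802 | 234 |
| Starvector 8B | -- | -- | -- | -- | -- | 0.871 | 0.623 | 0.206 | 13.595 | 951 |
| LLM4SVG 7B | 21.939 | 8.611 | 19.458 | 70.726 | 705 | 0.748 | 0.472 | 0.409 | 5.375 | 485 |
| OmniSVG 3B | 28.292 | 11.318 | 21.679 | 74.831 | 1.7k | 0.894 | 0.756 | 0.186 | 12.669 | 2.4k |
| InternSVG 8B | 8.715 | 1.876 | 23.916 | 80.911 | 1.0k | 0.949 | 0.811 | 0.127 | 18.226 | 1.3k |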
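For readers unfamiliar with the PSNR column in the Image-to-SVG results, the standard definition is PSNR = 10 · log10(MAX² / MSE), where MSE is the mean squared pixel difference between the rendered output and the reference image. The sketch below implements that textbook formula on grayscale images given as nested lists. It assumes 8-bit pixel values (MAX = 255); the rendering resolution and preprocessing used by SArena are not given on this page, so the benchmark's actual evaluation code may differ.

```python
import math


def psnr(image_a, image_b, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized grayscale images.

    Each image is a list of rows, each row a list of pixel intensities.
    Higher is better; identical images give infinity.
    """
    flat_a = [pixel for row in image_a for pixel in row]
    flat_b = [pixel for row in image_b for pixel in row]
    if len(flat_a) != len(flat_b) or not flat_a:
        raise ValueError("images must be non-empty and have the same size")

    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(flat_a, flat_b)) / len(flat_a)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [[10, 10], [10, 10]]
rendered = [[20, 20], [20, 20]]
score = psnr(reference, rendered)  # every pixel is off by 10, so MSE = 100
```

Because the scale is logarithmic, a 10 dB increase corresponds to a tenfold reduction in mean squared error.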

SGP-Bench

To further validate the effectiveness of SAgoge in enhancing model capabilities for SVG modeling, we conduct comparative experiments on SGP-Bench, a benchmark specifically designed to evaluate semantic and structural understanding of symbolic graphic programs.

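| Model | Semantics ↑ | Count ↑ | Color ↑ | Shape ↑ | Reasoning ↑ | Overall ↑ |
|---|---|---|---|---|---|---|
| Gemma-1.1-2B | 32.1 | 33.3 | 25.0 | 35.6 | 28.7 | 31.7 |
| InternLM2.5-7B | 27.3 | 31.7 | 59.8 | 51.5 | 28.2 | 42.1 |
| Keye-VL-8B | 41.4 | 47.5 | 71.4 | 54.9 | 40.6 | 52.2 |
| GLM-4.1V-9B | 41.6 | 55.6 | 79.1 | 61.5 | 40.0 | 57.1 |
| InternVL3-8B | 33.7 | 46.5 | 69.8 | 59.1 | 36.1 | 50.6 |
| Gemma-3-12B | 24.8 | 30.8 | 47.2 | 25.7 | 22.8 | 30.5 |
| DeepSeek-Coder-V2-16B | 30.9 | 37.9 | 63.7 | 54.8 | 26.8 | 45.1 |
| InternVL3-14B | 38.2 | 52.9 | 74.4 | 54.1 | 41.7 | 52.9 |
| Kimi-VL-A3B-2506 | 31.1 | 41.5 | 67.0 | 47.4 | 32.4 | 44.9 |
| Gemma-3-27B | 36.7 | 51.4 | 76.3 | 62.1 | 39.4 | 54.7 |
| Qwen2.5-VL-32B | 40.0 | 55.7 | 76.3 | 61.2 | 43.9 | 56.5 |
| InternVL3-38B | 40.8 | 58.7 | 82.2 | 63.6 | 43.9 | 59.1 |
| GPT-4o | 45.9 | 56.8 | 87.3 | 75.2 | 50.4 | 64.8 |
| Gemini-2.5-Flash | 53.8 | 57.8 | 88.1 | 75.6 | 55.5 | 67.6 |
| Claude-Sonnet-4 | 55.9 | 67.6 | 89.5 | 79.0 | 58.9 | 71.5 |
| GLM-4.5V | 47.3 | 63.7 | 87.3 | 72.3 | 55.8 | 66.1 |
| Qwen2.5-VL-72B | 40.2 | 55.1 | 80.1 | 62.0 | 41.1 | 57.1 |
| InternVL3-78B | 41.0 | 59.1 | 84.0 | 65.2 | 47.0 | 60.3 |
| Step3-321B-A38B | 35.9 | 54.0 | 82.8 | 63.2 | 38.6 | 56.5 |
| InternSVG 8B | 54.6 | 70.7 | 85.5 | 82.4 | 57.5 | 72.3 |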

Comparison with Baselines

We compare the generated SVGs with those produced by baseline methods to assess visual quality.

SArena-Icon

SArena-Illustration

SArena-Chemistry

SArena-Animation

BibTeX

@article{wang2025internsvg,
    title={InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models},
    author={Wang, Haomin and Yin, Jinhui and Wei, Qi and Zeng, Wenguang and Gu, Lixin and Ye, Shenglong and Gao, Zhangwei
    and Wang, Yaohui and Zhang, Yanting and Li, Yuanqi and others},
    journal={arXiv preprint arXiv:2510.11341},
    year={2025}
}