InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models

Haomin Wang¹,²*, Jinhui Yin³,²*, Qi Wei³,²*, Wenguang Zeng⁴, Lixin Gu², Shenglong Ye², Zhangwei Gao¹,², Yaohui Wang², Yanting Zhang⁴, Yuanqi Li³, Yanwen Guo³, Wenhai Wang⁵, Kai Chen², Yu Qiao², Hongjie Zhang²†

¹Shanghai Jiao Tong University, ²Shanghai AI Laboratory, ³Nanjing University, ⁴Donghua University, ⁵The Chinese University of Hong Kong
* Equal contribution. † Corresponding author.

Overview of our InternSVG family

Abstract

Vector graphics, represented in Scalable Vector Graphics (SVG) format, serve as a core medium for digital design and web rendering. Existing works on SVG tasks often focus on isolated subtasks such as generation, editing, or understanding. In this paper, we propose InternSVG, a unified framework based on multimodal large language models that jointly addresses SVG-related tasks across perception and creation. By representing SVGs as structured sequences and aligning them with textual descriptions and raster renderings, InternSVG enables a generalizable interface for vector reasoning, generation, and manipulation. Extensive experiments demonstrate its versatility and performance across diverse SVG benchmarks.

SAgoge: A Comprehensive Multimodal SVG Dataset

We introduce SAgoge, a large-scale and comprehensive dataset for SVG tasks with more than 16 million training samples spanning icons, illustrations, chemical structures, and animations.

Dataset Pipeline

Raw SVGs are gathered from the web and a custom synthesis pipeline, then normalized to a 128 × 128 canvas and simplified to shorten code. The rendered images or videos, processed SVG code, and handcrafted prompts are fed to an MLLM to synthesize high-quality training samples for understanding, editing, and generation.
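The canvas normalization step can be illustrated with a minimal Python sketch. It assumes a namespaced SVG with an explicit `viewBox`, and the function name `normalize_canvas` and the wrap-in-a-group approach are illustrative choices, not the pipeline's actual implementation (the page does not describe it in detail).

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # keep the output free of ns0: prefixes


def normalize_canvas(svg_text: str, size: int = 128) -> str:
    """Rescale an SVG so that its viewBox becomes `0 0 size size`.

    Hypothetical helper: assumes the root element carries a viewBox of the
    form "min_x min_y width height".
    """
    root = ET.fromstring(svg_text)
    min_x, min_y, width, height = (
        float(v) for v in root.get("viewBox").replace(",", " ").split()
    )
    scale = size / max(width, height)  # uniform scale keeps the aspect ratio

    # Wrap all drawing content in one group that carries the transform.
    # Points are first shifted to the origin, then scaled.
    group = ET.Element(
        f"{{{SVG_NS}}}g",
        {"transform": f"scale({scale:g}) translate({0 - min_x:g} {0 - min_y:g})"},
    )
    for child in list(root):
        root.remove(child)
        group.append(child)
    root.append(group)

    root.set("viewBox", f"0 0 {size} {size}")
    root.set("width", str(size))
    root.set("height", str(size))
    return ET.tostring(root, encoding="unicode")


example = (
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 256 256">'
    '<rect width="256" height="256"/></svg>'
)
normalized = normalize_canvas(example)
```

A production pipeline would more likely rewrite the path coordinates directly so that the code gets shorter rather than longer; wrapping in a group is used here only because it keeps the sketch small. Non-square inputs are scaled to fit but not centered.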

InternSVG: A Unified MLLM for SVG Understanding, Editing, and Generation

InternSVG follows the “ViT–MLP–LLM” paradigm, using InternViT-300M as the vision encoder and Qwen2.5-7B as the language model. We further design SVG-specific special tokens and introduce a tailored embedding initialization strategy to incorporate SVG content effectively.
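The page does not spell out the initialization strategy. One common way to initialize the embedding of a newly added special token is to average the embeddings of the subwords that spell out its text. The sketch below assumes that approach, and the names `init_special_token_embedding`, `toy_table`, and `toy_tokenize` are hypothetical, not taken from the InternSVG code.

```python
from statistics import fmean


def init_special_token_embedding(token_text, tokenize, embedding_table):
    """Return the mean of the existing subword embeddings for `token_text`.

    Assumption: the new token starts from the average of the vectors the
    base tokenizer would have produced for the same string.
    """
    pieces = tokenize(token_text)
    vectors = [embedding_table[piece] for piece in pieces]
    dim = len(vectors[0])
    return [fmean(vec[i] for vec in vectors) for i in range(dim)]


# Toy 2-dimensional embedding table and tokenizer, for illustration only.
toy_table = {"<": [0.0, 2.0], "path": [4.0, 0.0], ">": [2.0, 1.0]}


def toy_tokenize(text):
    # Hypothetical split of "<path>" into three subwords.
    return ["<", "path", ">"]


new_vec = init_special_token_embedding("<path>", toy_tokenize, toy_table)
```

Starting from an average of related subwords places the new token near semantically relevant vectors instead of at a random point, which is the usual motivation for this kind of initialization.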

SArena: A Companion Benchmark

To enable systematic evaluation across SVG understanding, editing, and generation, we introduce SArena, a benchmark that aligns with the domains and difficulty spectrum covered by SAgoge and provides standardized tasks and metrics. SArena includes four sub-benchmarks: icons, illustrations, chemical structures, and animations.
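Several of the reported metrics compare a rendered output against a reference image. As an illustration, PSNR can be computed as in the sketch below. The cap of 100 dB for identical images is an assumption inferred from the 100.000 entries in the editing tables; the benchmark's exact implementation is not given on this page, and the function name `psnr` is illustrative.

```python
import math


def psnr(img_a, img_b, max_value=255.0, cap=100.0):
    """Peak signal-to-noise ratio in dB between two flat pixel sequences.

    `cap` is an assumed upper bound returned when the images are identical
    (MSE of zero), where the uncapped value would be infinite.
    """
    if len(img_a) != len(img_b):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)
    if mse == 0:
        return cap
    return min(cap, 10.0 * math.log10(max_value ** 2 / mse))


identical = psnr([0, 128, 255], [0, 128, 255])  # hits the assumed cap
opposite = psnr([0, 0], [255, 255])             # maximal error at every pixel
```

Higher PSNR means the output is closer to the reference, which matches the ↑ marker used in the tables.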

Sub-benchmarks: SArena-Icon, SArena-Illustration, SArena-Chemistry, SArena-Animation
Tasks: Generation, Editing, Understanding

SArena-Icon: Generation Performance

Model	Text-to-SVG					Image-to-SVG
Model	FID ↓	FID-C ↓	CLIP-T2I ↑	CLIP-I2I ↑	Tokens	DINO ↑	SSIM ↑	LPIPS ↓	PSNR ↑	Tokens
Llama-3.1-8B	19.428	11.247	21.863	71.859	280	--	--	--	--	--
Qwen2.5-VL-7B	24.781	15.454	21.538	71.384	249	0.781	0.506	0.378	6.534	281
Keye-VL-8B	21.961	14.393	21.557	71.167	227	0.801	0.531	0.368	6.939	286
GLM-4.1V-9B	22.684	10.447	22.562	73.197	269	0.820	0.539	0.345	7.329	289
InternVL3-8B	23.061	14.303	21.897	71.450	269	0.812	0.557	0.361	7.220	256
Llama-3.2-11B	28.156	14.345	21.711	71.485	261	0.759	0.467	0.389	5.908	216
Gemma-3-12B	17.137	10.409	22.023	71.622	290	0.821	0.576	0.352	7.632	360
InternVL3-14B	18.996	13.224	22.066	71.493	227	0.825	0.562	0.359	7.343	216
Kimi-VL-A3B	30.807	16.996	21.439	70.536	228	0.798	0.562	0.362	7.179	245
Gemma-3-27B	15.145	9.303	22.526	73.277	249	0.826	0.595	0.354	7.833	267
Qwen2.5-VL-32B	20.043	10.393	22.783	73.228	317	0.836	0.562	0.357	7.503	309
InternVL3-38B	18.014	11.042	22.795	73.077	251	0.829	0.549	0.351	7.305	230
Grok-3	21.967	8.694	24.122	76.797	346	--	--	--	--	--
Llama-3.1-70B	18.032	8.300	22.747	73.876	255	--	--	--	--	--
Llama-3.1-405B	16.794	8.390	22.822	73.920	236	--	--	--	--	--
DeepSeek-V3	24.990	8.803	23.790	76.470	251	--	--	--	--	--
GPT-4o	15.178	6.763	24.617	77.742	246	0.874	0.616	0.316	8.435	231
Gemini-2.5-Flash	16.720	5.208	24.658	78.218	451	0.876	0.587	0.316	8.324	533
Claude-Sonnet-3.7	14.383	3.499	25.294	80.786	417	0.909	0.647	0.290	9.259	389
Claude-Sonnet-4	15.840	4.291	25.421	80.579	444	0.915	0.665	0.276	9.855	541
Llama-3.2-90B	19.309	8.550	22.841	74.006	249	0.757	0.437	0.377	5.777	192
Llama-4-Scout	17.908	9.382	22.849	73.563	256	0.844	0.582	0.346	7.736	246
Llama-4-Maverick	14.931	6.526	23.570	75.816	265	0.863	0.596	0.329	8.027	255
GLM-4.5V	16.641	5.093	24.450	78.349	372	0.872	0.627	0.315	8.666	322
Step3-321B	20.061	9.706	23.053	74.184	308	0.834	0.555	0.340	7.516	301
Qwen2.5-VL-72B	15.948	9.875	22.946	73.681	275	0.837	0.584	0.346	7.834	372
InternVL3-78B	17.580	10.596	22.805	73.123	252	0.850	0.584	0.339	7.802	234
Starvector 8B	--	--	--	--	--	0.871	0.623	0.206	13.595	951
LLM4SVG 7B	21.939	8.611	19.458	70.726	705	0.748	0.472	0.409	5.375	485
OmniSVG 3B	28.292	11.318	21.679	74.831	1.7k	0.894	0.756	0.186	12.669	2.4k
InternSVG 8B	8.715	1.876	23.916	80.911	1.0k	0.949	0.811	0.127	18.226	1.3k

Simple Editing Tasks Performance

Model	Simple Editing Performance
	Low-level Color Editing				Cropping				Flipping				Rotation				Scaling				Adding Stroke				Translation				Transparency
	DINO↑	SSIM↑	LPIPS↓	PSNR↑	DINO↑	SSIM↑	LPIPS↓	PSNR↑	DINO↑	SSIM↑	LPIPS↓	PSNR↑	DINO↑	SSIM↑	LPIPS↓	PSNR↑	DINO↑	SSIM↑	LPIPS↓	PSNR↑	DINO↑	SSIM↑	LPIPS↓	PSNR↑	DINO↑	SSIM↑	LPIPS↓	PSNR↑	DINO↑	SSIM↑	LPIPS↓	PSNR↑
Qwen2.5-VL-7B	0.958	0.892	0.061	73.123	0.870	0.673	0.270	10.087	0.852	0.636	0.313	9.683	0.919	0.803	0.152	47.833	0.902	0.653	0.262	12.466	0.917	0.728	0.180	25.767	0.908	0.634	0.295	13.257	0.966	0.889	0.073	50.893
InternVL3-8B	0.963	0.903	0.055	75.568	0.884	0.705	0.257	10.271	0.842	0.704	0.259	23.198	0.979	0.818	0.157	48.211	0.923	0.684	0.231	12.403	0.933	0.791	0.150	35.333	0.916	0.708	0.222	27.231	0.982	0.954	0.026	67.912
InternVL3.5-8B	0.999	0.992	0.007	88.473	0.881	0.761	0.195	11.376	0.905	0.704	0.241	13.358	0.886	0.697	0.246	21.118	0.932	0.710	0.234	16.638	0.936	0.721	0.162	20.350	0.917	0.660	0.276	12.508	0.989	0.967	0.024	59.713
Gemma-3-27B	1.000	1.000	0.000	99.057	0.885	0.619	0.297	14.116	0.995	0.982	0.008	96.554	0.991	0.945	0.041	85.314	0.943	0.846	0.100	67.280	0.968	0.857	0.116	40.216	0.962	0.896	0.045	82.705	0.883	0.687	0.141	63.444
InternVL3.5-30B	0.999	0.995	0.005	91.706	0.889	0.732	0.235	10.902	0.916	0.769	0.195	23.892	0.869	0.708	0.262	18.751	0.930	0.693	0.236	14.118	0.949	0.769	0.135	27.933	0.947	0.746	0.222	32.944	0.992	0.968	0.024	63.038
Qwen2.5-VL-32B	0.967	0.914	0.044	88.400	0.903	0.657	0.306	9.062	0.919	0.807	0.154	35.634	0.986	0.959	0.024	90.586	0.917	0.673	0.236	19.639	0.932	0.739	0.139	33.796	0.934	0.748	0.191	31.632	0.980	0.949	0.029	80.879
Llama-4-Scout	0.969	0.925	0.049	87.067	0.879	0.652	0.283	9.134	0.901	0.755	0.206	21.027	0.974	0.926	0.051	80.043	0.925	0.705	0.226	18.068	0.960	0.840	0.104	38.360	0.926	0.686	0.251	18.387	0.983	0.957	0.028	66.797
Llama-4-Maverick	0.998	0.996	0.006	94.874	0.903	0.677	0.301	9.404	0.955	0.914	0.074	76.565	0.989	0.967	0.024	88.142	0.927	0.776	0.194	23.361	0.970	0.886	0.073	52.249	0.956	0.741	0.226	31.710	0.996	0.991	0.006	94.987
Qwen2.5-VL-72B	0.995	0.986	0.008	97.542	0.909	0.668	0.307	9.174	0.948	0.874	0.090	52.671	0.992	0.949	0.045	82.266	0.901	0.678	0.267	11.492	0.965	0.875	0.105	44.055	0.951	0.704	0.256	18.695	0.995	0.992	0.010	72.101
InternVL3-78B	0.995	0.987	0.008	96.985	0.909	0.682	0.299	9.599	0.936	0.833	0.129	32.765	0.994	0.974	0.017	92.534	0.931	0.695	0.238	12.792	0.947	0.790	0.145	37.317	0.957	0.831	0.134	46.221	0.992	0.984	0.015	68.573
InternVL3.5-241B	0.983	0.956	0.021	91.262	0.904	0.763	0.225	11.763	0.896	0.754	0.165	30.961	0.901	0.783	0.188	39.965	0.919	0.661	0.245	11.857	0.948	0.762	0.136	27.335	0.928	0.750	0.160	25.850	0.956	0.882	0.059	64.399
GPT-4o	0.995	0.987	0.007	98.406	0.913	0.688	0.300	9.556	0.994	0.976	0.017	87.340	0.995	0.986	0.010	94.845	0.947	0.811	0.163	45.845	0.966	0.864	0.093	48.913	0.982	0.928	0.060	72.016	0.990	0.977	0.014	85.619
Gemini-2.5-Flash	1.000	1.000	9.761	99.057	0.885	0.619	0.297	14.116	0.995	0.982	0.008	96.554	0.991	0.945	0.041	85.314	0.943	0.846	0.100	67.280	0.968	0.857	0.116	40.216	0.962	0.896	0.045	82.705	0.883	0.687	0.141	63.444
Claude-Sonnet-4	1.000	1.000	0.000	100.000	0.928	0.696	0.291	9.626	0.944	0.943	0.055	73.786	0.999	0.994	0.006	96.676	0.953	0.833	0.138	50.330	0.982	0.907	0.055	51.913	0.999	0.997	0.002	87.758	0.999	1.000	0.000	97.535
InternSVG 8B	1.000	1.000	0.000	100.000	1.000	1.000	0.000	100.000	0.996	0.987	0.005	98.672	1.000	1.000	0.000	99.692	0.999	1.000	0.000	98.655	1.000	1.000	0.000	99.488	1.000	1.000	0.000	100.000	1.000	1.000	0.000	99.968

Hard Editing Tasks Performance

Model	Semantic-level Color Editing				Style Transfer
Model	DINO↑	SSIM↑	LPIPS↓	PSNR↑	DINO↑	SSIM↑	LPIPS↓	PSNR↑
Qwen2.5-VL-7B	0.919	0.768	0.166	23.902	0.889	0.658	0.193	11.940
InternVL3-8B	0.903	0.728	0.184	22.071	0.917	0.728	0.158	13.457
Gemma-3-27B	0.981	0.920	0.072	53.068	0.869	0.591	0.210	12.174
Qwen2.5-VL-32B	0.926	0.769	0.158	28.290	0.910	0.723	0.162	14.283
Llama-4-Scout	0.964	0.860	0.120	27.852	0.963	0.848	0.119	15.417
Llama-4-Maverick	0.975	0.891	0.099	41.222	0.969	0.855	0.105	16.765
Qwen2.5-VL-72B	0.975	0.888	0.100	42.759	0.957	0.836	0.113	16.771
InternVL3-78B	0.955	0.857	0.105	27.033	0.912	0.705	0.175	13.429
GPT-4o	0.972	0.912	0.073	54.651	0.952	0.819	0.117	18.173
Gemini-2.5-Flash	0.981	0.920	0.072	53.068	0.869	0.591	0.210	12.174
Claude-Sonnet-4	0.991	0.944	0.050	56.741	0.976	0.867	0.097	18.374
InternSVG 8B	0.996	0.959	0.041	69.875	0.952	0.808	0.139	18.100

Overall Editing Performance

Model	DINO↑	SSIM↑	LPIPS↓	PSNR↑	Tokens
Qwen2.5-VL-7B	0.909	0.728	0.192	25.402	1.0k
InternVL3-8B	0.921	0.761	0.170	29.615	1.2k
Gemma-3-27B	0.942	0.815	0.113	54.200	1.3k
Qwen2.5-VL-32B	0.933	0.782	0.148	37.737	1.0k
Llama-4-Scout	0.949	0.825	0.138	34.070	1.3k
Llama-4-Maverick	0.966	0.870	0.109	46.944	1.3k
Qwen2.5-VL-72B	0.961	0.849	0.124	41.006	1.2k
InternVL3-78B	0.958	0.848	0.116	40.533	1.2k
GPT-4o	0.968	0.887	0.088	55.255	1.2k
Gemini-2.5-Flash	0.942	0.815	0.113	54.200	1.3k
Claude-Sonnet-4	0.979	0.915	0.071	57.595	1.3k
InternSVG 8B	0.989	0.952	0.036	77.331	1.4k

Understanding Performance

Model	Overall	Color	Geometry	Quantity	Semantic
Qwen2.5-VL-7B	52.8	69.3	50.4	34.9	56.4
InternVL3-8B	59.5	79.1	59.3	38.2	61.3
Gemma-3-27B	59.5	82.2	67.6	43.6	44.7
Qwen2.5-VL-32B	65.5	82.8	65.5	47.7	66.1
Llama-4-Scout	57.5	82.4	57.0	41.6	49.0
Llama-4-Maverick	64.7	87.5	62.0	47.2	62.3
Qwen2.5-VL-72B	63.4	82.4	65.1	44.6	61.6
InternVL3-78B	65.3	86.4	71.0	48.8	54.9
GPT-4o	71.0	88.2	78.5	47.5	69.6
Gemini-2.5-Flash	73.0	90.1	81.9	53.0	67.2
Claude-Sonnet-4	77.1	91.5	82.4	53.8	80.6
InternSVG 8B	85.1	93.0	85.8	61.9	99.7

SArena-Illustration: Generation Performance

Model	Text-to-SVG					Image-to-SVG
Model	FID ↓	FID-C ↓	CLIP-T2I ↑	CLIP-I2I ↑	Tokens	DINO ↑	SSIM ↑	LPIPS ↓	PSNR ↑	Tokens
Qwen2.5-VL-7B	37.903	28.455	18.069	61.928	756	0.739	0.513	0.413	7.732	1.2k
InternVL3-8B	36.736	25.682	18.493	61.964	493	0.772	0.569	0.397	8.542	716
InternVL3.5-8B	70.837	35.776	18.095	63.357	3.6k	0.721	0.306	0.410	5.283	2.5k
InternVL3.5-14B	65.967	34.912	18.131	63.496	3.5k	0.722	0.296	0.414	5.130	2.8k
Gemma-3-27B	27.838	13.766	21.486	67.255	613	0.824	0.617	0.379	9.920	764
InternVL3.5-30B	68.438	33.285	18.354	63.910	3.8k	0.739	0.331	0.404	5.778	3.0k
Qwen2.5-VL-32B	32.115	17.804	19.773	64.555	779	0.816	0.591	0.382	9.297	828
InternVL3.5-38B	42.172	21.556	18.221	65.511	4.3k	0.755	0.393	0.400	6.540	3.8k
Llama-4-Scout	35.489	18.647	20.299	64.182	524	0.807	0.599	0.360	9.549	574
Llama-4-Maverick	30.835	14.831	21.872	67.366	551	0.839	0.644	0.340	10.469	608
Qwen2.5-VL-72B	29.521	18.407	20.923	65.349	527	0.808	0.628	0.363	9.900	886
InternVL3-78B	30.457	19.195	20.577	64.826	454	0.830	0.638	0.348	9.985	514
InternVL3.5-241B	43.339	23.061	18.191	65.689	2.9k	0.792	0.480	0.378	8.093	3.1k
GPT-4o	28.124	14.150	23.637	70.696	473	0.850	0.663	0.327	10.723	484
Gemini-2.5-Flash	28.865	8.894	24.800	74.796	1.2k	0.829	0.516	0.359	9.091	1.8k
Claude-Sonnet-4	27.294	7.640	23.094	74.525	1.0k	0.901	0.670	0.305	11.731	1.3k
Starvector 8B	--	--	--	--	--	0.650	0.070	0.447	1.990	2.6k
LLM4SVG 7B	48.704	29.568	15.468	62.933	1.2k	0.713	0.494	0.413	6.221	476
OmniSVG 3B	42.756	22.885	16.861	64.815	4.5k	0.797	0.656	0.330	10.433	6.7k
InternSVG 8B	22.397	5.141	21.116	74.662	8.1k	0.924	0.716	0.188	14.644	7.7k

SArena-Chemistry: Generation Performance

Model	Text-to-SVG				Image-to-SVG
Model	FID ↓	FID-C ↓	CLIP-I2I ↑	Tokens	DINO ↑	SSIM ↑	LPIPS ↓	PSNR ↑	Tokens
Qwen2.5-VL-7B	56.248	73.698	51.814	907	0.769	0.468	0.274	7.501	996
InternVL3-8B	33.613	61.675	56.856	910	0.865	0.783	0.203	13.840	805
Gemma-3-27B	29.937	49.967	60.776	776	0.887	0.823	0.190	14.959	683
Qwen2.5-VL-32B	53.047	56.431	58.428	1.2k	0.821	0.570	0.225	10.005	900
Llama-4-Scout	33.781	46.584	62.522	849	0.866	0.734	0.205	12.984	624
Llama-4-Maverick	26.844	31.924	69.643	747	0.908	0.798	0.173	14.977	687
Qwen2.5-VL-72B	32.307	44.540	63.931	620	0.846	0.647	0.215	12.106	716
InternVL3-78B	29.216	40.080	65.969	698	0.911	0.813	0.177	15.375	545
GPT-4o	24.505	19.297	76.599	640	0.920	0.791	0.174	14.673	533
Gemini-2.5-Flash	27.708	21.777	75.897	1.4k	0.934	0.817	0.155	15.539	1.1k
Claude-Sonnet-4	21.252	15.240	78.308	1.2k	0.957	0.871	0.132	17.554	956
Starvector 8B	--	--	--	--	0.977	0.841	0.147	17.419	1.2k
InternSVG 8B	9.974	0.877	93.931	981	0.994	0.873	0.138	17.722	931

SArena-Animation: Generation Performance

Model	Text-to-SANI				Video-to-SANI
Model	FVD ↓	CLIP-T2V ↑	CLIP-V2V ↑	Tokens	DINO ↑	SSIM ↑	LPIPS ↓	PSNR ↑	Tokens
Qwen2.5-VL-7B	214.379	19.118	50.649	296	0.787	0.716	0.273	11.758	423
InternVL3-8B	310.066	17.017	43.856	433	0.780	0.612	0.286	9.883	415
Gemma-3-27B	159.119	21.105	59.309	533	0.824	0.733	0.265	12.290	516
Qwen2.5-VL-32B	128.299	20.535	59.188	537	0.823	0.696	0.273	11.417	505
Llama-4-Scout	167.932	21.014	62.929	505	0.831	0.742	0.259	12.427	426
Llama-4-Maverick	141.470	22.304	67.615	563	0.841	0.754	0.246	12.858	447
Qwen2.5-VL-72B	151.682	20.376	59.454	433	0.834	0.721	0.261	11.931	402
InternVL3-78B	169.159	20.263	60.896	409	0.828	0.704	0.264	11.336	385
GPT-4o	155.393	22.808	70.608	404	0.860	0.743	0.250	12.260	400
Gemini-2.5-Flash	151.983	22.239	66.554	986	0.847	0.701	0.257	12.015	917
Claude-Sonnet-4	169.484	24.070	74.179	907	0.867	0.760	0.240	13.189	866
InternSVG 8B	99.474	22.572	73.162	812	0.876	0.754	0.237	14.168	888

SGP-Bench

To further validate the effectiveness of SAgoge in enhancing model capabilities for SVG modeling, we conduct comparative experiments on SGP-Bench, a benchmark specifically designed to evaluate semantic and structural understanding of symbolic graphic programs.

Model	Semantics ↑	Count ↑	Color ↑	Shape ↑	Reasoning ↑	Overall ↑
Gemma-1.1-2B	32.1	33.3	25.0	35.6	28.7	31.7
InternLM2.5-7B	27.3	31.7	59.8	51.5	28.2	42.1
Keye-VL-8B	41.4	47.5	71.4	54.9	40.6	52.2
GLM-4.1V-9B	41.6	55.6	79.1	61.5	40.0	57.1
InternVL3-8B	33.7	46.5	69.8	59.1	36.1	50.6
Gemma-3-12B	24.8	30.8	47.2	25.7	22.8	30.5
DeepSeek-Coder-V2-16B	30.9	37.9	63.7	54.8	26.8	45.1
InternVL3-14B	38.2	52.9	74.4	54.1	41.7	52.9
Kimi-VL-A3B-2506	31.1	41.5	67.0	47.4	32.4	44.9
Gemma-3-27B	36.7	51.4	76.3	62.1	39.4	54.7
Qwen2.5-VL-32B	40.0	55.7	76.3	61.2	43.9	56.5
InternVL3-38B	40.8	58.7	82.2	63.6	43.9	59.1
GPT-4o	45.9	56.8	87.3	75.2	50.4	64.8
Gemini-2.5-Flash	53.8	57.8	88.1	75.6	55.5	67.6
Claude-Sonnet-4	55.9	67.6	89.5	79.0	58.9	71.5
GLM-4.5V	47.3	63.7	87.3	72.3	55.8	66.1
Qwen2.5-VL-72B	40.2	55.1	80.1	62.0	41.1	57.1
InternVL3-78B	41.0	59.1	84.0	65.2	47.0	60.3
Step3-321B-A38B	35.9	54.0	82.8	63.2	38.6	56.5
InternSVG 8B	54.6	70.7	85.5	82.4	57.5	72.3

Comparison with Baselines

We compare the generated SVGs with those produced by baseline methods to assess visual quality.

Qualitative comparison figures: Text-to-SVG and Image-to-SVG results on SArena-Icon, SArena-Illustration, SArena-Chemistry, and SArena-Animation.

BibTeX

@article{wang2025internsvg,
    title={InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models},
    author={Wang, Haomin and Yin, Jinhui and Wei, Qi and Zeng, Wenguang and Gu, Lixin and Ye, Shenglong and Gao, Zhangwei
    and Wang, Yaohui and Zhang, Yanting and Li, Yuanqi and others},
    journal={arXiv preprint arXiv:2510.11341},
    year={2025}
}