InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models
Overview of our InternSVG family
Abstract
Vector graphics, represented in Scalable Vector Graphics (SVG) format, serve as a core medium for digital design and web rendering. Existing works on SVG tasks often focus on isolated subtasks such as generation, editing, or understanding. In this paper, we propose InternSVG, a unified framework based on multimodal large language models that jointly addresses SVG-related tasks across perception and creation. By representing SVGs as structured sequences and aligning them with textual descriptions and raster renderings, InternSVG enables a generalizable interface for vector reasoning, generation, and manipulation. Extensive experiments demonstrate its versatility and performance across diverse SVG benchmarks.
SAgoge: A Comprehensive Multimodal SVG Dataset
We introduce SAgoge, a large-scale and comprehensive dataset for SVG tasks with more than 16 million training samples spanning icons, illustrations, chemical structures, and animations.
Raw SVGs are gathered from the web and a custom synthesis pipeline, then normalized to a 128 × 128 canvas and simplified to shorten their code. The rendered images or videos, the processed SVG code, and handcrafted prompts are fed to an MLLM to synthesize high-quality training samples for understanding, editing, and generation.
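As a rough sketch of the canvas-normalization step (not the paper's actual pipeline code), the snippet below rewrites an SVG's width/height to 128 × 128 while preserving the original geometry via the viewBox, then renders the result with cairosvg so it could be shown to an MLLM. File names are hypothetical.

```python
# Minimal sketch, assuming cairosvg is installed; not the released SAgoge pipeline.
import xml.etree.ElementTree as ET
import cairosvg

# Keep the default SVG namespace unprefixed when serializing.
ET.register_namespace("", "http://www.w3.org/2000/svg")

def normalize_canvas(svg_text: str, size: int = 128) -> str:
    """Rescale an SVG's canvas to size x size without touching its geometry."""
    root = ET.fromstring(svg_text)
    # Preserve the original drawing coordinates by moving them into the viewBox.
    if "viewBox" not in root.attrib:
        w = root.get("width", str(size)).replace("px", "")
        h = root.get("height", str(size)).replace("px", "")
        root.set("viewBox", f"0 0 {w} {h}")
    root.set("width", str(size))
    root.set("height", str(size))
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    svg = open("icon.svg").read()          # hypothetical input file
    normalized = normalize_canvas(svg)
    # Render the normalized SVG to a raster for the MLLM annotation step.
    cairosvg.svg2png(bytestring=normalized.encode(), write_to="icon_128.png")
```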
InternSVG: A Unified MLLM for SVG Understanding, Editing, and Generation
InternSVG follows the “ViT–MLP–LLM” paradigm, using InternViT-300M as the vision encoder and Qwen2.5-7B as the language model. We further design SVG-specific special tokens and introduce a tailored embedding initialization strategy to incorporate SVG content effectively.
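The paper names the backbones but does not include this step as code; the sketch below, assuming Hugging Face transformers and hypothetical token names, shows one common way SVG special tokens could be registered and their new embedding rows initialized from the mean of the existing vocabulary. It is a stand-in for, not a reproduction of, the tailored initialization strategy described above.

```python
# Minimal sketch (assumptions, not the released training code).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Qwen/Qwen2.5-7B"  # language backbone named in the paper
# Hypothetical SVG structure tokens; the actual token set is defined in the paper.
svg_tokens = ["<svg_begin>", "<svg_end>", "<path_begin>", "<path_end>"]

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Register the new special tokens and grow the embedding matrix accordingly.
num_added = tokenizer.add_special_tokens({"additional_special_tokens": svg_tokens})
model.resize_token_embeddings(len(tokenizer))

with torch.no_grad():
    emb = model.get_input_embeddings().weight
    mean_init = emb[:-num_added].mean(dim=0)
    emb[-num_added:] = mean_init  # new rows start at the average embedding
```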
See InternSVG in Action!
SArena: A Companion Benchmark
To enable systematic evaluation across SVG understanding, editing, and generation, we introduce SArena, a benchmark that aligns with the domains and difficulty spectrum covered by SAgoge and provides standardized tasks and metrics. SArena comprises four sub-benchmarks: icons, illustrations, chemical structures, and animation.
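As a rough illustration of how the raster-space metrics reported below (SSIM, PSNR) can be computed, here is a minimal sketch assuming cairosvg and scikit-image; it is not the official SArena evaluation code, and the file names are hypothetical.

```python
# Minimal sketch: render predicted and reference SVGs, then score them.
import io
import cairosvg
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def render(svg_text: str, size: int = 128) -> np.ndarray:
    """Rasterize an SVG to a size x size RGB array on a white background."""
    png = cairosvg.svg2png(bytestring=svg_text.encode(),
                           output_width=size, output_height=size,
                           background_color="white")
    return np.asarray(Image.open(io.BytesIO(png)).convert("RGB"))

pred = render(open("pred.svg").read())   # hypothetical model output
ref = render(open("ref.svg").read())     # hypothetical ground truth
ssim = structural_similarity(pred, ref, channel_axis=-1, data_range=255)
psnr = peak_signal_noise_ratio(ref, pred, data_range=255)
print(f"SSIM={ssim:.3f}  PSNR={psnr:.2f} dB")
```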
Generation Performance (SArena-Icon)
The first five metric columns report Text-to-SVG (T2S) results; the remaining five report Image-to-SVG (I2S) results.

| Model | FID ↓ | FID-C ↓ | CLIP-T2I ↑ | CLIP-I2I ↑ | Tokens (T2S) | DINO ↑ | SSIM ↑ | LPIPS ↓ | PSNR ↑ | Tokens (I2S) |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama-3.1-8B | 19.428 | 11.247 | 21.863 | 71.859 | 280 | -- | -- | -- | -- | -- |
| Qwen2.5-VL-7B | 24.781 | 15.454 | 21.538 | 71.384 | 249 | 0.781 | 0.506 | 0.378 | 6.534 | 281 |
| Keye-VL-8B | 21.961 | 14.393 | 21.557 | 71.167 | 227 | 0.801 | 0.531 | 0.368 | 6.939 | 286 |
| GLM-4.1V-9B | 22.684 | 10.447 | 22.562 | 73.197 | 269 | 0.820 | 0.539 | 0.345 | 7.329 | 289 |
| InternVL3-8B | 23.061 | 14.303 | 21.897 | 71.450 | 269 | 0.812 | 0.557 | 0.361 | 7.220 | 256 |
| Llama-3.2-11B | 28.156 | 14.345 | 21.711 | 71.485 | 261 | 0.759 | 0.467 | 0.389 | 5.908 | 216 |
| Gemma-3-12B | 17.137 | 10.409 | 22.023 | 71.622 | 290 | 0.821 | 0.576 | 0.352 | 7.632 | 360 |
| InternVL3-14B | 18.996 | 13.224 | 22.066 | 71.493 | 227 | 0.825 | 0.562 | 0.359 | 7.343 | 216 |
| Kimi-VL-A3B | 30.807 | 16.996 | 21.439 | 70.536 | 228 | 0.798 | 0.562 | 0.362 | 7.179 | 245 |
| Gemma-3-27B | 15.145 | 9.303 | 22.526 | 73.277 | 249 | 0.826 | 0.595 | 0.354 | 7.833 | 267 |
| Qwen2.5-VL-32B | 20.043 | 10.393 | 22.783 | 73.228 | 317 | 0.836 | 0.562 | 0.357 | 7.503 | 309 |
| InternVL3-38B | 18.014 | 11.042 | 22.795 | 73.077 | 251 | 0.829 | 0.549 | 0.351 | 7.305 | 230 |
| Grok-3 | 21.967 | 8.694 | 24.122 | 76.797 | 346 | -- | -- | -- | -- | -- |
| Llama-3.1-70B | 18.032 | 8.300 | 22.747 | 73.876 | 255 | -- | -- | -- | -- | -- |
| Llama-3.1-405B | 16.794 | 8.390 | 22.822 | 73.920 | 236 | -- | -- | -- | -- | -- |
| DeepSeek-V3 | 24.990 | 8.803 | 23.790 | 76.470 | 251 | -- | -- | -- | -- | -- |
| GPT-4o | 15.178 | 6.763 | 24.617 | 77.742 | 246 | 0.874 | 0.616 | 0.316 | 8.435 | 231 |
| Gemini-2.5-Flash | 16.720 | 5.208 | 24.658 | 78.218 | 451 | 0.876 | 0.587 | 0.316 | 8.324 | 533 |
| Claude-Sonnet-3.7 | 14.383 | 3.499 | 25.294 | 80.786 | 417 | 0.909 | 0.647 | 0.290 | 9.259 | 389 |
| Claude-Sonnet-4 | 15.840 | 4.291 | 25.421 | 80.579 | 444 | 0.915 | 0.665 | 0.276 | 9.855 | 541 |
| Llama-3.2-90B | 19.309 | 8.550 | 22.841 | 74.006 | 249 | 0.757 | 0.437 | 0.377 | 5.777 | 192 |
| Llama-4-Scout | 17.908 | 9.382 | 22.849 | 73.563 | 256 | 0.844 | 0.582 | 0.346 | 7.736 | 246 |
| Llama-4-Maverick | 14.931 | 6.526 | 23.570 | 75.816 | 265 | 0.863 | 0.596 | 0.329 | 8.027 | 255 |
| GLM-4.5V | 16.641 | 5.093 | 24.450 | 78.349 | 372 | 0.872 | 0.627 | 0.315 | 8.666 | 322 |
| Step3-321B | 20.061 | 9.706 | 23.053 | 74.184 | 308 | 0.834 | 0.555 | 0.340 | 7.516 | 301 |
| Qwen2.5-VL-72B | 15.948 | 9.875 | 22.946 | 73.681 | 275 | 0.837 | 0.584 | 0.346 | 7.834 | 372 |
| InternVL3-78B | 17.580 | 10.596 | 22.805 | 73.123 | 252 | 0.850 | 0.584 | 0.339 | 7.802 | 234 |
| Starvector 8B | -- | -- | -- | -- | -- | 0.871 | 0.623 | 0.206 | 13.595 | 951 |
| LLM4SVG 7B | 21.939 | 8.611 | 19.458 | 70.726 | 705 | 0.748 | 0.472 | 0.409 | 5.375 | 485 |
| OmniSVG 3B | 28.292 | 11.318 | 21.679 | 74.831 | 1.7k | 0.894 | 0.756 | 0.186 | 12.669 | 2.4k |
| InternSVG 8B | 8.715 | 1.876 | 23.916 | 80.911 | 1.0k | 0.949 | 0.811 | 0.127 | 18.226 | 1.3k |
Simple Editing Tasks Performance
Each task is scored with four metrics (DINO↑, SSIM↑, LPIPS↓, PSNR↑); the column groups below correspond, in order, to Low-level Color Editing, Cropping, Flipping, Rotation, Scaling, Adding Stroke, Translation, and Transparency.

| Model | DINO↑ | SSIM↑ | LPIPS↓ | PSNR↑ | DINO↑ | SSIM↑ | LPIPS↓ | PSNR↑ | DINO↑ | SSIM↑ | LPIPS↓ | PSNR↑ | DINO↑ | SSIM↑ | LPIPS↓ | PSNR↑ | DINO↑ | SSIM↑ | LPIPS↓ | PSNR↑ | DINO↑ | SSIM↑ | LPIPS↓ | PSNR↑ | DINO↑ | SSIM↑ | LPIPS↓ | PSNR↑ | DINO↑ | SSIM↑ | LPIPS↓ | PSNR↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 0.958 | 0.892 | 0.061 | 73.123 | 0.870 | 0.673 | 0.270 | 10.087 | 0.852 | 0.636 | 0.313 | 9.683 | 0.919 | 0.803 | 0.152 | 47.833 | 0.902 | 0.653 | 0.262 | 12.466 | 0.917 | 0.728 | 0.180 | 25.767 | 0.908 | 0.634 | 0.295 | 13.257 | 0.966 | 0.889 | 0.073 | 50.893 |
| InternVL3-8B | 0.963 | 0.903 | 0.055 | 75.568 | 0.884 | 0.705 | 0.257 | 10.271 | 0.842 | 0.704 | 0.259 | 23.198 | 0.979 | 0.818 | 0.157 | 48.211 | 0.923 | 0.684 | 0.231 | 12.403 | 0.933 | 0.791 | 0.150 | 35.333 | 0.916 | 0.708 | 0.222 | 27.231 | 0.982 | 0.954 | 0.026 | 67.912 |
| InternVL3.5-8B | 0.999 | 0.992 | 0.007 | 88.473 | 0.881 | 0.761 | 0.195 | 11.376 | 0.905 | 0.704 | 0.241 | 13.358 | 0.886 | 0.697 | 0.246 | 21.118 | 0.932 | 0.710 | 0.234 | 16.638 | 0.936 | 0.721 | 0.162 | 20.350 | 0.917 | 0.660 | 0.276 | 12.508 | 0.989 | 0.967 | 0.024 | 59.713 |
| Gemma-3-27B | 1.000 | 1.000 | 0.000 | 99.057 | 0.885 | 0.619 | 0.297 | 14.116 | 0.995 | 0.982 | 0.008 | 96.554 | 0.991 | 0.945 | 0.041 | 85.314 | 0.943 | 0.846 | 0.100 | 67.280 | 0.968 | 0.857 | 0.116 | 40.216 | 0.962 | 0.896 | 0.045 | 82.705 | 0.883 | 0.687 | 0.141 | 63.444 |
| InternVL3.5-30B | 0.999 | 0.995 | 0.005 | 91.706 | 0.889 | 0.732 | 0.235 | 10.902 | 0.916 | 0.769 | 0.195 | 23.892 | 0.869 | 0.708 | 0.262 | 18.751 | 0.930 | 0.693 | 0.236 | 14.118 | 0.949 | 0.769 | 0.135 | 27.933 | 0.947 | 0.746 | 0.222 | 32.944 | 0.992 | 0.968 | 0.024 | 63.038 |
| Qwen2.5-VL-32B | 0.967 | 0.914 | 0.044 | 88.400 | 0.903 | 0.657 | 0.306 | 9.062 | 0.919 | 0.807 | 0.154 | 35.634 | 0.986 | 0.959 | 0.024 | 90.586 | 0.917 | 0.673 | 0.236 | 19.639 | 0.932 | 0.739 | 0.139 | 33.796 | 0.934 | 0.748 | 0.191 | 31.632 | 0.980 | 0.949 | 0.029 | 80.879 |
| Llama-4-Scout | 0.969 | 0.925 | 0.049 | 87.067 | 0.879 | 0.652 | 0.283 | 9.134 | 0.901 | 0.755 | 0.206 | 21.027 | 0.974 | 0.926 | 0.051 | 80.043 | 0.925 | 0.705 | 0.226 | 18.068 | 0.960 | 0.840 | 0.104 | 38.360 | 0.926 | 0.686 | 0.251 | 18.387 | 0.983 | 0.957 | 0.028 | 66.797 |
| Llama-4-Maverick | 0.998 | 0.996 | 0.006 | 94.874 | 0.903 | 0.677 | 0.301 | 9.404 | 0.955 | 0.914 | 0.074 | 76.565 | 0.989 | 0.967 | 0.024 | 88.142 | 0.927 | 0.776 | 0.194 | 23.361 | 0.970 | 0.886 | 0.073 | 52.249 | 0.956 | 0.741 | 0.226 | 31.710 | 0.996 | 0.991 | 0.006 | 94.987 |
| Qwen2.5-VL-72B | 0.995 | 0.986 | 0.008 | 97.542 | 0.909 | 0.668 | 0.307 | 9.174 | 0.948 | 0.874 | 0.090 | 52.671 | 0.992 | 0.949 | 0.045 | 82.266 | 0.901 | 0.678 | 0.267 | 11.492 | 0.965 | 0.875 | 0.105 | 44.055 | 0.951 | 0.704 | 0.256 | 18.695 | 0.995 | 0.992 | 0.010 | 72.101 |
| InternVL3-78B | 0.995 | 0.987 | 0.008 | 96.985 | 0.909 | 0.682 | 0.299 | 9.599 | 0.936 | 0.833 | 0.129 | 32.765 | 0.994 | 0.974 | 0.017 | 92.534 | 0.931 | 0.695 | 0.238 | 12.792 | 0.947 | 0.790 | 0.145 | 37.317 | 0.957 | 0.831 | 0.134 | 46.221 | 0.992 | 0.984 | 0.015 | 68.573 |
| InternVL3.5-241B | 0.983 | 0.956 | 0.021 | 91.262 | 0.904 | 0.763 | 0.225 | 11.763 | 0.896 | 0.754 | 0.165 | 30.961 | 0.901 | 0.783 | 0.188 | 39.965 | 0.919 | 0.661 | 0.245 | 11.857 | 0.948 | 0.762 | 0.136 | 27.335 | 0.928 | 0.750 | 0.160 | 25.850 | 0.956 | 0.882 | 0.059 | 64.399 |
| GPT-4o | 0.995 | 0.987 | 0.007 | 98.406 | 0.913 | 0.688 | 0.300 | 9.556 | 0.994 | 0.976 | 0.017 | 87.340 | 0.995 | 0.986 | 0.010 | 94.845 | 0.947 | 0.811 | 0.163 | 45.845 | 0.966 | 0.864 | 0.093 | 48.913 | 0.982 | 0.928 | 0.060 | 72.016 | 0.990 | 0.977 | 0.014 | 85.619 |
| Gemini-2.5-Flash | 1.000 | 1.000 | 9.761 | 99.057 | 0.885 | 0.619 | 0.297 | 14.116 | 0.995 | 0.982 | 0.008 | 96.554 | 0.991 | 0.945 | 0.041 | 85.314 | 0.943 | 0.846 | 0.100 | 67.280 | 0.968 | 0.857 | 0.116 | 40.216 | 0.962 | 0.896 | 0.045 | 82.705 | 0.883 | 0.687 | 0.141 | 63.444 |
| Claude-Sonnet-4 | 1.000 | 1.000 | 0.000 | 100.000 | 0.928 | 0.696 | 0.291 | 9.626 | 0.944 | 0.943 | 0.055 | 73.786 | 0.999 | 0.994 | 0.006 | 96.676 | 0.953 | 0.833 | 0.138 | 50.330 | 0.982 | 0.907 | 0.055 | 51.913 | 0.999 | 0.997 | 0.002 | 87.758 | 0.999 | 1.000 | 0.000 | 97.535 |
| InternSVG 8B | 1.000 | 1.000 | 0.000 | 100.000 | 1.000 | 1.000 | 0.000 | 100.000 | 0.996 | 0.987 | 0.005 | 98.672 | 1.000 | 1.000 | 0.000 | 99.692 | 0.999 | 1.000 | 0.000 | 98.655 | 1.000 | 1.000 | 0.000 | 99.488 | 1.000 | 1.000 | 0.000 | 100.000 | 1.000 | 1.000 | 0.000 | 99.968 |
Hard Editing Tasks Performance
| Model | DINO↑ (Color) | SSIM↑ (Color) | LPIPS↓ (Color) | PSNR↑ (Color) | DINO↑ (Style) | SSIM↑ (Style) | LPIPS↓ (Style) | PSNR↑ (Style) |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 0.919 | 0.768 | 0.166 | 23.902 | 0.889 | 0.658 | 0.193 | 11.940 |
| InternVL3-8B | 0.903 | 0.728 | 0.184 | 22.071 | 0.917 | 0.728 | 0.158 | 13.457 |
| Gemma-3-27B | 0.981 | 0.920 | 0.072 | 53.068 | 0.869 | 0.591 | 0.210 | 12.174 |
| Qwen2.5-VL-32B | 0.926 | 0.769 | 0.158 | 28.290 | 0.910 | 0.723 | 0.162 | 14.283 |
| Llama-4-Scout | 0.964 | 0.860 | 0.120 | 27.852 | 0.963 | 0.848 | 0.119 | 15.417 |
| Llama-4-Maverick | 0.975 | 0.891 | 0.099 | 41.222 | 0.969 | 0.855 | 0.105 | 16.765 |
| Qwen2.5-VL-72B | 0.975 | 0.888 | 0.100 | 42.759 | 0.957 | 0.836 | 0.113 | 16.771 |
| InternVL3-78B | 0.955 | 0.857 | 0.105 | 27.033 | 0.912 | 0.705 | 0.175 | 13.429 |
| GPT-4o | 0.972 | 0.912 | 0.073 | 54.651 | 0.952 | 0.819 | 0.117 | 18.173 |
| Gemini-2.5-Flash | 0.981 | 0.920 | 0.072 | 53.068 | 0.869 | 0.591 | 0.210 | 12.174 |
| Claude-Sonnet-4 | 0.991 | 0.944 | 0.050 | 56.741 | 0.976 | 0.867 | 0.097 | 18.374 |
| InternSVG 8B | 0.996 | 0.959 | 0.041 | 69.875 | 0.952 | 0.808 | 0.139 | 18.100 |
Overall Editing Performance
| Model | DINO↑ | SSIM↑ | LPIPS↓ | PSNR↑ | Tokens |
|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 0.909 | 0.728 | 0.192 | 25.402 | 1.0k |
| InternVL3-8B | 0.921 | 0.761 | 0.170 | 29.615 | 1.2k |
| Gemma-3-27B | 0.942 | 0.815 | 0.113 | 54.200 | 1.3k |
| Qwen2.5-VL-32B | 0.933 | 0.782 | 0.148 | 37.737 | 1.0k |
| Llama-4-Scout | 0.949 | 0.825 | 0.138 | 34.070 | 1.3k |
| Llama-4-Maverick | 0.966 | 0.870 | 0.109 | 46.944 | 1.3k |
| Qwen2.5-VL-72B | 0.961 | 0.849 | 0.124 | 41.006 | 1.2k |
| InternVL3-78B | 0.958 | 0.848 | 0.116 | 40.533 | 1.2k |
| GPT-4o | 0.968 | 0.887 | 0.088 | 55.255 | 1.2k |
| Gemini-2.5-Flash | 0.942 | 0.815 | 0.113 | 54.200 | 1.3k |
| Claude-Sonnet-4 | 0.979 | 0.915 | 0.071 | 57.595 | 1.3k |
| InternSVG 8B | 0.989 | 0.952 | 0.036 | 77.331 | 1.4k |
Understanding Performance
| Model | Overall | Color | Geometry | Quantity | Semantic |
|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 52.8 | 69.3 | 50.4 | 34.9 | 56.4 |
| InternVL3-8B | 59.5 | 79.1 | 59.3 | 38.2 | 61.3 |
| Gemma-3-27B | 59.5 | 82.2 | 67.6 | 43.6 | 44.7 |
| Qwen2.5-VL-32B | 65.5 | 82.8 | 65.5 | 47.7 | 66.1 |
| Llama-4-Scout | 57.5 | 82.4 | 57.0 | 41.6 | 49.0 |
| Llama-4-Maverick | 64.7 | 87.5 | 62.0 | 47.2 | 62.3 |
| Qwen2.5-VL-72B | 63.4 | 82.4 | 65.1 | 44.6 | 61.6 |
| InternVL3-78B | 65.3 | 86.4 | 71.0 | 48.8 | 54.9 |
| GPT-4o | 71.0 | 88.2 | 78.5 | 47.5 | 69.6 |
| Gemini-2.5-Flash | 73.0 | 90.1 | 81.9 | 53.0 | 67.2 |
| Claude-Sonnet-4 | 77.1 | 91.5 | 82.4 | 53.8 | 80.6 |
| InternSVG 8B | 85.1 | 93.0 | 85.8 | 61.9 | 99.7 |
Generation Performance (SArena-Illustration)
The first five metric columns report Text-to-SVG (T2S) results; the remaining five report Image-to-SVG (I2S) results.

| Model | FID ↓ | FID-C ↓ | CLIP-T2I ↑ | CLIP-I2I ↑ | Tokens (T2S) | DINO ↑ | SSIM ↑ | LPIPS ↓ | PSNR ↑ | Tokens (I2S) |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 37.903 | 28.455 | 18.069 | 61.928 | 756 | 0.739 | 0.513 | 0.413 | 7.732 | 1.2k |
| InternVL3-8B | 36.736 | 25.682 | 18.493 | 61.964 | 493 | 0.772 | 0.569 | 0.397 | 8.542 | 716 |
| InternVL3.5-8B | 70.837 | 35.776 | 18.095 | 63.357 | 3.6k | 0.721 | 0.306 | 0.410 | 5.283 | 2.5k |
| InternVL3.5-14B | 65.967 | 34.912 | 18.131 | 63.496 | 3.5k | 0.722 | 0.296 | 0.414 | 5.130 | 2.8k |
| Gemma-3-27B | 27.838 | 13.766 | 21.486 | 67.255 | 613 | 0.824 | 0.617 | 0.379 | 9.920 | 764 |
| InternVL3.5-30B | 68.438 | 33.285 | 18.354 | 63.910 | 3.8k | 0.739 | 0.331 | 0.404 | 5.778 | 3.0k |
| Qwen2.5-VL-32B | 32.115 | 17.804 | 19.773 | 64.555 | 779 | 0.816 | 0.591 | 0.382 | 9.297 | 828 |
| InternVL3.5-38B | 42.172 | 21.556 | 18.221 | 65.511 | 4.3k | 0.755 | 0.393 | 0.400 | 6.540 | 3.8k |
| Llama-4-Scout | 35.489 | 18.647 | 20.299 | 64.182 | 524 | 0.807 | 0.599 | 0.360 | 9.549 | 574 |
| Llama-4-Maverick | 30.835 | 14.831 | 21.872 | 67.366 | 551 | 0.839 | 0.644 | 0.340 | 10.469 | 608 |
| Qwen2.5-VL-72B | 29.521 | 18.407 | 20.923 | 65.349 | 527 | 0.808 | 0.628 | 0.363 | 9.900 | 886 |
| InternVL3-78B | 30.457 | 19.195 | 20.577 | 64.826 | 454 | 0.830 | 0.638 | 0.348 | 9.985 | 514 |
| InternVL3.5-241B | 43.339 | 23.061 | 18.191 | 65.689 | 2.9k | 0.792 | 0.480 | 0.378 | 8.093 | 3.1k |
| GPT-4o | 28.124 | 14.150 | 23.637 | 70.696 | 473 | 0.850 | 0.663 | 0.327 | 10.723 | 484 |
| Gemini-2.5-Flash | 28.865 | 8.894 | 24.800 | 74.796 | 1.2k | 0.829 | 0.516 | 0.359 | 9.091 | 1.8k |
| Claude-Sonnet-4 | 27.294 | 7.640 | 23.094 | 74.525 | 1.0k | 0.901 | 0.670 | 0.305 | 11.731 | 1.3k |
| Starvector 8B | -- | -- | -- | -- | -- | 0.650 | 0.070 | 0.447 | 1.990 | 2.6k |
| LLM4SVG 7B | 48.704 | 29.568 | 15.468 | 62.933 | 1.2k | 0.713 | 0.494 | 0.413 | 6.221 | 476 |
| OmniSVG 3B | 42.756 | 22.885 | 16.861 | 64.815 | 4.5k | 0.797 | 0.656 | 0.330 | 10.433 | 6.7k |
| InternSVG 8B | 22.397 | 5.141 | 21.116 | 74.662 | 8.1k | 0.924 | 0.716 | 0.188 | 14.644 | 7.7k |
Generation Performance (SArena-Chemistry)
The first four metric columns report Text-to-SVG (T2S) results; the remaining five report Image-to-SVG (I2S) results.

| Model | FID ↓ | FID-C ↓ | CLIP-I2I ↑ | Tokens (T2S) | DINO ↑ | SSIM ↑ | LPIPS ↓ | PSNR ↑ | Tokens (I2S) |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 56.248 | 73.698 | 51.814 | 907 | 0.769 | 0.468 | 0.274 | 7.501 | 996 |
| InternVL3-8B | 33.613 | 61.675 | 56.856 | 910 | 0.865 | 0.783 | 0.203 | 13.840 | 805 |
| Gemma-3-27B | 29.937 | 49.967 | 60.776 | 776 | 0.887 | 0.823 | 0.190 | 14.959 | 683 |
| Qwen2.5-VL-32B | 53.047 | 56.431 | 58.428 | 1.2k | 0.821 | 0.570 | 0.225 | 10.005 | 900 |
| Llama-4-Scout | 33.781 | 46.584 | 62.522 | 849 | 0.866 | 0.734 | 0.205 | 12.984 | 624 |
| Llama-4-Maverick | 26.844 | 31.924 | 69.643 | 747 | 0.908 | 0.798 | 0.173 | 14.977 | 687 |
| Qwen2.5-VL-72B | 32.307 | 44.540 | 63.931 | 620 | 0.846 | 0.647 | 0.215 | 12.106 | 716 |
| InternVL3-78B | 29.216 | 40.080 | 65.969 | 698 | 0.911 | 0.813 | 0.177 | 15.375 | 545 |
| GPT-4o | 24.505 | 19.297 | 76.599 | 640 | 0.920 | 0.791 | 0.174 | 14.673 | 533 |
| Gemini-2.5-Flash | 27.708 | 21.777 | 75.897 | 1.4k | 0.934 | 0.817 | 0.155 | 15.539 | 1.1k |
| Claude-Sonnet-4 | 21.252 | 15.240 | 78.308 | 1.2k | 0.957 | 0.871 | 0.132 | 17.554 | 956 |
| Starvector 8B | -- | -- | -- | -- | 0.977 | 0.841 | 0.147 | 17.419 | 1.2k |
| InternSVG 8B | 9.974 | 0.877 | 93.931 | 981 | 0.994 | 0.873 | 0.138 | 17.722 | 931 |
Generation Performance (SArena-Animation)
The first four metric columns report Text-to-SANI (T2A) results; the remaining five report Video-to-SANI (V2A) results.

| Model | FVD ↓ | CLIP-T2V ↑ | CLIP-V2V ↑ | Tokens (T2A) | DINO ↑ | SSIM ↑ | LPIPS ↓ | PSNR ↑ | Tokens (V2A) |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 214.379 | 19.118 | 50.649 | 296 | 0.787 | 0.716 | 0.273 | 11.758 | 423 |
| InternVL3-8B | 310.066 | 17.017 | 43.856 | 433 | 0.780 | 0.612 | 0.286 | 9.883 | 415 |
| Gemma-3-27B | 159.119 | 21.105 | 59.309 | 533 | 0.824 | 0.733 | 0.265 | 12.290 | 516 |
| Qwen2.5-VL-32B | 128.299 | 20.535 | 59.188 | 537 | 0.823 | 0.696 | 0.273 | 11.417 | 505 |
| Llama-4-Scout | 167.932 | 21.014 | 62.929 | 505 | 0.831 | 0.742 | 0.259 | 12.427 | 426 |
| Llama-4-Maverick | 141.470 | 22.304 | 67.615 | 563 | 0.841 | 0.754 | 0.246 | 12.858 | 447 |
| Qwen2.5-VL-72B | 151.682 | 20.376 | 59.454 | 433 | 0.834 | 0.721 | 0.261 | 11.931 | 402 |
| InternVL3-78B | 169.159 | 20.263 | 60.896 | 409 | 0.828 | 0.704 | 0.264 | 11.336 | 385 |
| GPT-4o | 155.393 | 22.808 | 70.608 | 404 | 0.860 | 0.743 | 0.250 | 12.260 | 400 |
| Gemini-2.5-Flash | 151.983 | 22.239 | 66.554 | 986 | 0.847 | 0.701 | 0.257 | 12.015 | 917 |
| Claude-Sonnet-4 | 169.484 | 24.070 | 74.179 | 907 | 0.867 | 0.760 | 0.240 | 13.189 | 866 |
| InternSVG 8B | 99.474 | 22.572 | 73.162 | 812 | 0.876 | 0.754 | 0.237 | 14.168 | 888 |
SGP-Bench
To further validate the effectiveness of SAgoge in enhancing model capabilities for SVG modeling, we conduct comparative experiments on SGP-Bench, a benchmark specifically designed to evaluate semantic and structural understanding of symbolic graphic programs.
| Model | Semantics ↑ | Count ↑ | Color ↑ | Shape ↑ | Reasoning ↑ | Overall ↑ |
|---|---|---|---|---|---|---|
| Gemma-1.1-2B | 32.1 | 33.3 | 25.0 | 35.6 | 28.7 | 31.7 |
| InternLM2.5-7B | 27.3 | 31.7 | 59.8 | 51.5 | 28.2 | 42.1 |
| Keye-VL-8B | 41.4 | 47.5 | 71.4 | 54.9 | 40.6 | 52.2 |
| GLM-4.1V-9B | 41.6 | 55.6 | 79.1 | 61.5 | 40.0 | 57.1 |
| InternVL3-8B | 33.7 | 46.5 | 69.8 | 59.1 | 36.1 | 50.6 |
| Gemma-3-12B | 24.8 | 30.8 | 47.2 | 25.7 | 22.8 | 30.5 |
| DeepSeek-Coder-V2-16B | 30.9 | 37.9 | 63.7 | 54.8 | 26.8 | 45.1 |
| InternVL3-14B | 38.2 | 52.9 | 74.4 | 54.1 | 41.7 | 52.9 |
| Kimi-VL-A3B-2506 | 31.1 | 41.5 | 67.0 | 47.4 | 32.4 | 44.9 |
| Gemma-3-27B | 36.7 | 51.4 | 76.3 | 62.1 | 39.4 | 54.7 |
| Qwen2.5-VL-32B | 40.0 | 55.7 | 76.3 | 61.2 | 43.9 | 56.5 |
| InternVL3-38B | 40.8 | 58.7 | 82.2 | 63.6 | 43.9 | 59.1 |
| GPT-4o | 45.9 | 56.8 | 87.3 | 75.2 | 50.4 | 64.8 |
| Gemini-2.5-Flash | 53.8 | 57.8 | 88.1 | 75.6 | 55.5 | 67.6 |
| Claude-Sonnet-4 | 55.9 | 67.6 | 89.5 | 79.0 | 58.9 | 71.5 |
| GLM-4.5V | 47.3 | 63.7 | 87.3 | 72.3 | 55.8 | 66.1 |
| Qwen2.5-VL-72B | 40.2 | 55.1 | 80.1 | 62.0 | 41.1 | 57.1 |
| InternVL3-78B | 41.0 | 59.1 | 84.0 | 65.2 | 47.0 | 60.3 |
| Step3-321B-A38B | 35.9 | 54.0 | 82.8 | 63.2 | 38.6 | 56.5 |
| InternSVG 8B | 54.6 | 70.7 | 85.5 | 82.4 | 57.5 | 72.3 |
Comparison with Baselines
We compare the SVGs generated by InternSVG with those produced by baseline methods to assess visual quality. Qualitative results cover both Text-to-SVG and Image-to-SVG generation on each sub-benchmark: SArena-Icon, SArena-Illustration, SArena-Chemistry, and SArena-Animation.
BibTeX
@article{wang2025internsvg,
title={InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models},
author={Wang, Haomin and Yin, Jinhui and Wei, Qi and Zeng, Wenguang and Gu, Lixin and Ye, Shenglong and Gao, Zhangwei
and Wang, Yaohui and Zhang, Yanting and Li, Yuanqi and others},
journal={arXiv preprint arXiv:2510.11341},
year={2025}
}