
Multimodal Models Utilizing Model Merging as Potential AI Eyes [Paper Summary]


Introduction

Hello, I'm mita, an aspiring AI engineer!

This time, I will summarize and discuss a paper titled
VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models,
which integrates a multimodal model and a code generation model through "model merging" to achieve both visual understanding and code generation.

I'll be sharing what I've caught up on, and I'd be honored if it helps you in some way!

https://arxiv.org/pdf/2508.09945v1

Paper Summary

  • VisCodex proposes a method to inject task vectors from a code-specific LLM into the language backbone of a VLM via linear combination, achieving both capabilities without retraining.
  • The visual side is fixed, and merging is performed only on the language backbone, followed by SFT. The key is adding code capability while maintaining visual grounding.
  • Main Results: VisCodex-8B outperforms similarly sized open-source models and exceeds GPT-4o-mini with an average score of 68.8. VisCodex-33B averages 72.3, approaching GPT-4o (73.3). It is particularly strong on UI/chart-related tasks.

Paper Details

Research Background

  • A typical MLLM consists of three components: Vision Encoder, Projection, and Language Backbone. Qwen2.5-VL is robust to arbitrary resolutions using 2D RoPE within the ViT and is strong at maintaining spatial relationships. It was adopted as the foundation for VisCodex.
  • However, replacing the LLM backbone poses a challenge where the already learned visual grounding breaks. Therefore, code capabilities are "added" via model merging to preserve visual understanding.

What is VisCodex

VisCodex is designed around the following two-step approach (the paper also introduces a new evaluation benchmark, InfiBench-V, covered in item 3).

1. Model Merging

  • Prepare task vectors: τ_task = θ_ft − θ_base, and linearly combine τ_vlm for VLM and τ_code for Code LLM on the language backbone (Vision/Projection are fixed).
  • Implementation: Qwen2.5-VL is used for the VLM, and OpenCodeReasoning-Nemotron-1.1-7B/32B is primarily used for the code side (other models like Qwen2.5-Coder and OpenThinker2 were also compared).
  • Coefficient λ: 0.7 was selected for the 8B model via a sweep on MMCode; for the 33B model, 0.85 was adopted without a full sweep due to resource constraints.
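The merge step above can be sketched in a few lines. The convex combination λ·τ_vlm + (1 − λ)·τ_code is my reading of "linear combination with coefficient λ" — treat the exact weighting scheme as an assumption, and note that plain floats stand in here for real weight tensors:

```python
def merge_backbone(theta_base, theta_vlm, theta_code, lam=0.7):
    """Merge language-backbone weights via task-vector arithmetic.

    For each fine-tuned model, tau = theta_ft - theta_base.
    The merged weights are theta_base + lam * tau_vlm + (1 - lam) * tau_code.
    Dicts of floats stand in for real per-parameter weight tensors;
    Vision/Projection weights are simply not passed in (they stay fixed).
    """
    merged = {}
    for name, base in theta_base.items():
        tau_vlm = theta_vlm[name] - base    # task vector of the VLM
        tau_code = theta_code[name] - base  # task vector of the code LLM
        merged[name] = base + lam * tau_vlm + (1.0 - lam) * tau_code
    return merged
```

For example, with λ = 0.7, a base weight of 1.0, a VLM weight of 2.0, and a code-model weight of 3.0, the merged value is 1.0 + 0.7·1.0 + 0.3·2.0 = 2.3.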

2. Model Training (SFT)

  • After merging, SFT is performed on MCD (the paper's Multimodal Coding Dataset). Vision/Projection are frozen, and only the language backbone is updated.
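The freezing logic can be sketched as below, assuming parameters are iterated as (name, param) pairs in the style of PyTorch's `model.named_parameters()`. The prefix `"model.language_model"` is hypothetical — the real prefix depends on the actual Qwen2.5-VL checkpoint layout:

```python
def freeze_for_sft(named_parameters, trainable_prefix="model.language_model"):
    """Freeze vision encoder and projector; update only the language backbone.

    Works with any iterable of (name, param) pairs where param exposes a
    `requires_grad` attribute (e.g. torch's model.named_parameters()).
    Parameters whose name starts with `trainable_prefix` stay trainable;
    everything else is frozen.
    """
    for name, param in named_parameters:
        param.requires_grad = name.startswith(trainable_prefix)
```

After calling this, only the backbone parameters receive gradient updates during SFT, which is what preserves the visual grounding learned by the frozen vision side.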

3. InfiBench-V (Evaluation Criteria)

  • Real-world programming Q&A items in which the image is indispensable were carefully selected from StackOverflow: narrowed from 1 million → 40,000 → 10,000 image-required questions → 322 questions refined by experts (13 languages, 5 categories).
  • The evaluation rests on three pillars: Keyword Matching / Unit Test / GPT-4o Judge, scoring both accuracy and completeness.
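As a toy illustration of the keyword-matching pillar (the real InfiBench-V scoring is more elaborate, and the unit-test and GPT-4o-judge pillars sit on top of it):

```python
def keyword_match_score(answer, keywords):
    """Fraction of required keywords found in the model's answer,
    via case-insensitive substring matching -- a simplified stand-in
    for InfiBench-V's keyword-matching criterion."""
    if not keywords:
        return 0.0
    ans = answer.lower()
    return sum(kw.lower() in ans for kw in keywords) / len(keywords)
```

For instance, an answer mentioning both `cv2.imread` and `cv2.resize` against those two required keywords would score 1.0, while an answer mentioning neither scores 0.0.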

Experimental Results

  • Main Results (Average): VisCodex-8B 68.8, VisCodex-33B 72.3, GPT-4o 73.3, GPT-4o-mini 65.7. It is particularly strong in UI/Chart tasks (Design2Code: 90.1/90.9 (8B), 90.5/91.1 (33B)).
  • Effectiveness of Merging (Ablation): In the 8B model, metrics consistently improved across the board, such as MMCode pass@1: 6.8 → 11.0 and ChartMimic 73.4 → 74.8.
  • Replacement vs. Merging: Replacing the LLM backbone outright tends to break visual grounding, whereas merging adds code capability while preserving visual understanding.
  • Superiority of MCD: Training on MCD outperforms Web2Code and WebCode2M on both low- and high-level Design2Code metrics (e.g., Block-Match 89.6).

What is Model Merging?

Model merging is a method of combining the weights (parameters) of multiple pre-trained models to obtain a model with new capabilities. The goal is to "add up" the strengths of existing models without performing additional large-scale retraining.


What is VLM (Vision-Language Model)?

VLM is a general term for generative models that can simultaneously handle images ("Vision") and text ("Language"). They can understand images to generate text or reason about image content based on textual instructions.

Personal Thoughts

Paper Interpretation

This study proposes a model called VisCodex, which has strengths in interpreting images and generating code, by merging the multimodal model Qwen2.5-VL with OpenCodeReasoning-Nemotron-1.1, a text-only unimodal model specialized for code generation.

VisCodex is benchmarked against proprietary multimodal models such as GPT-4o.
Since it already outperforms the lightweight GPT-4o-mini, higher-performing models may well emerge as research in this area progresses.

Discussion (Thoughts)

The biggest takeaway from this paper is the effectiveness of model merging. This is a method to boost performance by synthesizing the weights (parameters) of existing pre-trained models, and it is widely used in other research as well. Conceptually, it gives hope that combining multiple domain-specific models could bring us closer to AGI.

However, in reality, the hurdles are high. Model merging is practically predicated on having an "identical architecture," and at least the following must match:

  • Number and order of layers, and the positions of residuals and normalization
  • Shape of each layer (d_model, number of heads/dimensions, FFN intermediate dimensions, presence/absence of bias, etc.)
  • Attention mechanisms (MHA/MQA/GQA), positional embeddings (RoPE/ALiBi), and their settings
  • Vocabulary size, shapes of embeddings/LM heads, and handling of special tokens
  • Parameter keys (state_dict names), storage/parallelization formats, dtype/quantization

In short, weights cannot be added unless conditions such as "same number of layers," "same parameter shape for each layer," "same inter-layer connections," and "same vocabulary size" are met.
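The structural preconditions above boil down to a simple check: identical parameter names and identical shapes. A minimal sketch, representing each model by a dict of name → shape tuples (in PyTorch this could be built as `{k: tuple(v.shape) for k, v in model.state_dict().items()}`):

```python
def mergeable(shapes_a, shapes_b):
    """Minimum conditions for weight-space merging: both models expose
    exactly the same parameter names, and every parameter has the same
    shape in both. Input: dicts mapping parameter name -> shape tuple."""
    if shapes_a.keys() != shapes_b.keys():
        return False  # different layer counts, naming, or extra modules
    return all(shapes_a[k] == shapes_b[k] for k in shapes_a)
```

A vocabulary mismatch, for example, shows up immediately as a differently shaped embedding matrix, so the two checkpoints cannot be added elementwise.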

These are all design-time choices (hyperparameters), so they can be aligned in principle, but matching them across models from different teams or companies would require detailed information sharing, which is rarely realistic. In practice, merging therefore tends to be limited to fine-tuned variants that share the same base model but have different specializations. This limits how quickly and broadly insights from others' new models can be incorporated, and my current feeling is that model merging alone is not a viable path to AGI.

Therefore, as the next option, I am focusing on distillation (transferring teacher behavior to students even across different architectures). Moving forward, I will continue to read papers on distillation that enable integration between heterogeneous models.
