MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

Lin, Bin; Tang, Zhenyu; Ye, Yang; Huang, Jinfa; Zhang, Junwu; Pang, Yatian; Jin, Peng; Ning, Munan; Luo, Jiebo; Yuan, Li

doi:10.1109/tmm.2026.3654458

Article · IEEE Transactions on Multimedia · Jan 1, 2026 · Closed access

Indexed in Crossref

Abstract

Recently, remarkable progress has been made in scaling up Large Language Models (LLMs) through the use of sparse Mixture-of-Experts (MoE) layers without significantly increasing computational cost. However, the transition from a pre-trained LLM to a sparse Large Vision-Language Model (LVLM) with MoE remains an open challenge. Directly fine-tuning an LLM into a sparse LVLM often leads to training collapse, characterized by (1) a large modality feature distribution gap and (2) expert load imbalance. This paper proposes a three-stage decoupled weight training process. In the first two stages, the model learns to adapt the LLM to an LVLM. In the third stage, the FFN weights from the second stage are used as…
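As a rough illustration of the two ideas the abstract relies on, sparse MoE routing and expert load imbalance, the following is a minimal NumPy sketch of a generic top-k routed MoE layer. It is not the paper's implementation: the function names (`softmax`, `moe_forward`), the ReLU two-layer experts, the choice of `top_k=2`, and all sizes are assumptions made for illustration only. Each token is sent to only `top_k` of the experts, which is why compute grows far more slowly than parameter count, and the returned `load` vector shows how unevenly tokens can be spread across experts.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def moe_forward(tokens, router_w, experts, top_k=2):
    """Generic top-k MoE layer (illustrative, hypothetical names).

    tokens:   (n, d) token features
    router_w: (d, E) linear router weights
    experts:  list of E (w1, w2) pairs, each a small two-layer FFN
    Returns the mixed output (n, d) and the per-expert token count (E,).
    """
    probs = softmax(tokens @ router_w)                  # (n, E) routing probabilities
    top_idx = np.argsort(-probs, axis=-1)[:, :top_k]    # (n, k) chosen experts per token
    top_p = np.take_along_axis(probs, top_idx, axis=-1)
    top_p = top_p / top_p.sum(axis=-1, keepdims=True)   # renormalise over the chosen k

    out = np.zeros_like(tokens)
    load = np.zeros(len(experts), dtype=int)
    for e, (w1, w2) in enumerate(experts):
        # Tokens that picked expert e, and which of their k slots it occupies.
        rows, slots = np.nonzero(top_idx == e)
        if rows.size == 0:
            continue
        hidden = np.maximum(tokens[rows] @ w1, 0.0)     # ReLU FFN (assumed activation)
        out[rows] += top_p[rows, slots][:, None] * (hidden @ w2)
        load[e] = rows.size
    return out, load


rng = np.random.default_rng(0)
n, d, hidden_dim, num_experts, top_k = 16, 8, 32, 4, 2
tokens = rng.normal(size=(n, d))
router_w = rng.normal(size=(d, num_experts))
experts = [
    (0.1 * rng.normal(size=(d, hidden_dim)), 0.1 * rng.normal(size=(hidden_dim, d)))
    for _ in range(num_experts)
]

out, load = moe_forward(tokens, router_w, experts, top_k=top_k)
print("tokens routed to each expert:", load)
```

With a randomly initialised router the counts in `load` are typically uneven. A router that sends most tokens to a few experts is the kind of load imbalance the abstract names as a symptom of training collapse.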

Citation impact

Total citations: 8

FWCI: 155.53
Percentile: 100%
References: 0

Authors: 10

Keywords

Initialization
Sparse matrix
Sparse approximation
Feature (linguistics)
Code (set theory)
Baseline (sea)
Lossless compression
