Article · IEEE Transactions on Multimedia · Jan 1, 2026 · Closed access

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

Indexed in Crossref

Abstract

Recently, remarkable progress has been made in scaling up Large Language Models (LLMs) through sparse Mixture-of-Experts (MoE) layers without significantly increasing computational cost. However, the transition from a pre-trained LLM to a sparse Large Vision-Language Model (LVLM) with MoE remains an open challenge. Directly fine-tuning an LLM into a sparse LVLM often leads to training collapse, characterized by (1) a large modality feature distribution gap and (2) expert load imbalance. This paper proposes a three-stage decoupled weight training process. In the first two stages, the model learns to adapt the LLM to an LVLM. In the third stage, the FFN weights from the second stage are used as…
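
To make the terms "sparse MoE layer" and "expert load imbalance" concrete, here is a minimal, generic sketch of top-k routing in plain Python. It is not the paper's implementation: the function names (`route_top_k`, `expert_load`), the softmax router, and the renormalization over the selected experts are all assumptions chosen for illustration, since this record does not describe the routing details.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts for one token (hypothetical helper).

    Returns a list of (expert_index, weight) pairs, with weights
    renormalized so they sum to 1 over the selected experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def expert_load(all_logits, num_experts, k):
    """Fraction of routed token slots assigned to each expert.

    A balanced router gives roughly 1 / num_experts per expert;
    a skewed distribution is what "load imbalance" refers to.
    """
    counts = [0] * num_experts
    for logits in all_logits:
        for i, _ in route_top_k(logits, k):
            counts[i] += 1
    total = sum(counts)
    return [c / total for c in counts]


if __name__ == "__main__":
    # Assumed toy setup: 1000 tokens, 4 experts, top-2 routing,
    # with random router logits standing in for a learned router.
    rng = random.Random(0)
    tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(1000)]
    print(expert_load(tokens, num_experts=4, k=2))
```

Only the k selected experts run for each token, which is why parameter count can grow without a matching growth in per-token compute. If the router favors a few experts, the load fractions drift far from uniform and the remaining experts receive little training signal.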

Citation impact

Total citations: 8
FWCI: 155.53
Percentile: 100%
References: 0

Authors

10

Topics & keywords

Keywords
  • Initialization
  • Sparse matrix
  • Sparse approximation
  • Feature (linguistics)
  • Code (set theory)
  • Baseline (sea)
  • Lossless compression