Article · Jun 16, 2024 · Closed access

GLaMM: Pixel Grounding Large Multimodal Model

Mohamed bin Zayed University of Artificial Intelligence · University of California, Merced

Indexed in Crossref

Abstract

Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual responses. Recently, region-level LMMs have been used to generate visually grounded responses. However, they are limited to only referring to a single object category at a time, require users to specify the regions, or cannot offer dense pixel-wise object grounding. In this work, we present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks. GLaMM not only grounds objects appearing in the conversations but is flexible enough to accept both…

Citation impact

Total citations: 115
FWCI: 25.96
Percentile: 100%
References: 59

Authors

10 authors

Topics & keywords

Keywords
  • Computer science
  • Pixel
  • Ground
  • Computer vision
  • Artificial intelligence
  • Electrical engineering
  • Engineering