GLaMM: Pixel Grounding Large Multimodal Model

Rasheed, Hanoona; Maaz, Muhammad; Shaji, Sahal; Shaker, Abdelrahman; Khan, Salman; Cholakkal, Hisham; Anwer, Rao Muhammad; Xing, Eric P.; Yang, Ming-Hsuan; Khan, Fahad Shahbaz

doi:10.1109/cvpr52733.2024.01236

Article · Jun 16, 2024 · Closed access

Mohamed bin Zayed University of Artificial Intelligence · University of California, Merced

Indexed in Crossref

Abstract

Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual responses. Recently, region-level LMMs have been used to generate visually grounded responses. However, they are limited to only referring to a single object category at a time, require users to specify the regions, or cannot offer dense pixel-wise object grounding. In this work, we present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks. GLaMM not only grounds objects appearing in the conversations but is flexible enough to accept both…

Citation impact

Total citations: 115

FWCI: 25.96
Percentile: 100%
References: 59

Authors (10)

Hanoona Rasheed (corresponding), Mohamed bin Zayed University of Artificial Intelligence
Muhammad Maaz, Mohamed bin Zayed University of Artificial Intelligence
Sahal Shaji, Mohamed bin Zayed University of Artificial Intelligence
Abdelrahman Shaker, Mohamed bin Zayed University of Artificial Intelligence
Salman Khan, Mohamed bin Zayed University of Artificial Intelligence

Topics & keywords

Topics

Keywords

Computer science
Pixel
Ground
Computer vision
Artificial intelligence
Electrical engineering
Engineering
