GLaMM: Pixel Grounding Large Multimodal Model
Mohamed bin Zayed University of Artificial Intelligence · University of California, Merced
Abstract
Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual responses. Recently, region-level LMMs have been used to generate visually grounded responses. However, they are limited to only referring to a single object category at a time, require users to specify the regions, or cannot offer dense pixel-wise object grounding. In this work, we present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks. GLaMM not only grounds objects appearing in the conversations but is flexible enough to accept both…
Citation impact
- FWCI: 25.96
- Percentile: 100%
- References: 59
Authors (10)
- Hanoona Rasheed (corresponding author), Mohamed bin Zayed University of Artificial Intelligence
- Muhammad Maaz, Mohamed bin Zayed University of Artificial Intelligence
- Sahal Shaji, Mohamed bin Zayed University of Artificial Intelligence
- Abdelrahman Shaker, Mohamed bin Zayed University of Artificial Intelligence
- Salman Khan, Mohamed bin Zayed University of Artificial Intelligence
Topics & keywords
- Computer science
- Pixel
- Ground
- Computer vision
- Artificial intelligence
- Electrical engineering
- Engineering