GroupViT: Semantic Segmentation Emerges from Text Supervision

UC San Diego Health System

Indexed in Crossref

Abstract

Grouping and recognition are important components of visual scene understanding, e.g., for object detection and semantic segmentation. With end-to-end deep learning systems, grouping of image regions usually happens implicitly via top-down supervision from pixel-level recognition labels. Instead, in this paper, we propose to bring back the grouping mechanism into deep networks, which allows semantic segments to emerge automatically with only text supervision. We propose a hierarchical Grouping Vision Transformer (GroupViT), which goes beyond the regular grid structure representation and learns to group image regions into progressively larger arbitrary-shaped segments. We train GroupViT jointly with a text…

Citation impact

Total citations: 403
FWCI: 22.42
Percentile: 100%
References: 98

Authors

7

Topics & keywords

Keywords
  • Computer science
  • Pascal (unit)
  • Segmentation
  • Artificial intelligence
  • Encoder
  • Transformer
  • Natural language processing
  • Pattern recognition (psychology)

Funding