GroupViT: Semantic Segmentation Emerges from Text Supervision

Xu, Jiarui; De Mello, Shalini; Liu, Sifei; Byeon, Wonmin; Breuel, Thomas M.; Kautz, Jan; Wang, Xiaolong

doi:10.1109/cvpr52688.2022.01760

Article | 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) | Jun 1, 2022 | Closed access

Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas M. Breuel

UC San Diego Health System

Indexed in Crossref

Abstract

Grouping and recognition are important components of visual scene understanding, e.g., for object detection and semantic segmentation. With end-to-end deep learning systems, grouping of image regions usually happens implicitly via top-down supervision from pixel-level recognition labels. Instead, in this paper, we propose to bring back the grouping mechanism into deep networks, which allows semantic segments to emerge automatically with only text supervision. We propose a hierarchical Grouping Vision Transformer (GroupViT), which goes beyond the regular grid structure representation and learns to group image regions into progressively larger arbitrary-shaped segments. We train GroupViT jointly with a text…
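The abstract excerpt describes grouping image regions into progressively larger segments but does not specify the mechanism. The following is a toy sketch only, assuming that each region is hard-assigned to its most similar group token by dot product and that group features are the mean of their members. The function names `group_step` and `compose`, the token values, and the two-stage setup are all hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: NOT the paper's grouping block.
# Assumption: hard assignment by dot-product similarity, then mean pooling.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def group_step(segments, group_tokens):
    """Assign each segment to its most similar group token, then average members.

    segments: list of feature vectors, one per current segment (or patch).
    group_tokens: list of feature vectors (hypothetical group centres).
    Returns (assignment, merged): assignment[i] is the group index chosen by
    segment i; merged[g] is the mean feature of group g, or the group token
    itself when no segment chose it.
    """
    assignment = [
        max(range(len(group_tokens)), key=lambda g: dot(seg, group_tokens[g]))
        for seg in segments
    ]
    merged = []
    for g, token in enumerate(group_tokens):
        members = [seg for seg, a in zip(segments, assignment) if a == g]
        if members:
            merged.append(
                [sum(m[d] for m in members) / len(members) for d in range(len(token))]
            )
        else:
            merged.append(list(token))
    return assignment, merged

def compose(first, second):
    """Map each original patch to its final segment through two stages."""
    return [second[a] for a in first]

# Four patches with 2-D features (made-up values).
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
stage1_tokens = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # finer grouping
stage2_tokens = [[1.0, 0.2], [0.2, 1.0]]              # coarser grouping

a1, merged1 = group_step(patches, stage1_tokens)
a2, _ = group_step(merged1, stage2_tokens)
patch_to_segment = compose(a1, a2)
print(patch_to_segment)
```

Composing the per-stage assignments gives each patch a final segment label, so segment shapes follow feature similarity rather than a fixed grid. A trained model would learn the group tokens; here they are fixed by hand purely for illustration.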

Citation impact

Total citations: 403

FWCI: 22.42
Percentile: 100%
References: 98

Authors: 7

Topics & keywords

Keywords

Computer science
Pascal (unit)
Segmentation
Artificial intelligence
Encoder
Transformer
Natural language processing
Pattern recognition (psychology)

Funding

National Science Foundation
Award: CCF-2112665 (TILOS)