Cross-view Transformers for real-time Map-view Semantic Segmentation

The University of Texas at Austin

Indexed in Crossref

Abstract

We present cross-view transformers, an efficient attention-based model for map-view semantic segmentation from multiple cameras. Our architecture implicitly learns a mapping from individual camera views into a canonical map-view representation using a camera-aware cross-view attention mechanism. Each camera uses positional embeddings that depend on its intrinsic and extrinsic calibration. These embeddings allow a transformer to learn the mapping across different views without ever explicitly modeling it geometrically. The architecture consists of a convolutional image encoder for each view and cross-view transformer layers to infer a map-view semantic segmentation. Our model is simple, easily parallelizable,…
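The mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the ray-direction form of the positional embedding, the random matrix `w_pos` standing in for a learned projection, and all tensor sizes are hypothetical choices made here. The abstract states only that the embeddings depend on each camera's intrinsic and extrinsic calibration.

```python
import numpy as np

def pixel_directions(K, R, height, width):
    """Unproject every pixel centre to a unit ray direction in the world frame.

    Assumption: K is a 3x3 intrinsic matrix and R a 3x3 world-to-camera
    rotation, so the camera-to-world rotation is R.T.
    """
    ys, xs = np.meshgrid(np.arange(height) + 0.5,
                         np.arange(width) + 0.5, indexing="ij")
    pixels = np.stack([xs.ravel(), ys.ravel(), np.ones(height * width)], axis=0)
    rays = R.T @ np.linalg.inv(K) @ pixels          # shape (3, H*W)
    rays /= np.linalg.norm(rays, axis=0, keepdims=True)
    return rays.T                                   # shape (H*W, 3)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)         # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_attention(map_queries, image_features, cam_embeddings):
    """Map-view queries attend jointly over the features of all cameras.

    map_queries:    (Q, D)     one query per map-view cell
    image_features: (N, P, D)  encoder features, N cameras with P positions each
    cam_embeddings: (N, P, D)  calibration-dependent positional embeddings
    """
    dim = map_queries.shape[-1]
    keys = (image_features + cam_embeddings).reshape(-1, dim)
    values = image_features.reshape(-1, dim)
    weights = softmax(map_queries @ keys.T / np.sqrt(dim), axis=-1)
    return weights @ values, weights

# Toy setup: two cameras sharing intrinsics, the second rotated 90 degrees about y.
rng = np.random.default_rng(0)
n_cams, height, width, dim, n_queries = 2, 4, 6, 8, 5
K = np.array([[4.0, 0.0, 3.0], [0.0, 4.0, 2.0], [0.0, 0.0, 1.0]])
rotations = [np.eye(3),
             np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]])]

w_pos = rng.normal(size=(3, dim))  # hypothetical stand-in for a learned projection
directions = [pixel_directions(K, R, height, width) for R in rotations]
cam_embeddings = np.stack([d @ w_pos for d in directions])
image_features = rng.normal(size=(n_cams, height * width, dim))
map_queries = rng.normal(size=(n_queries, dim))

map_features, weights = cross_view_attention(map_queries, image_features,
                                             cam_embeddings)
```

Because the keys carry calibration-dependent embeddings, the attention weights can differ between cameras even for identical image content, which is what would let a model learn the view-to-map correspondence without an explicit geometric projection step.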

Citation impact

Total citations: 283
FWCI: 15.81
Percentile: 100%
References: 63

Authors

2

Topics & keywords

Keywords
  • Computer science
  • Encoder
  • Segmentation
  • Transformer
  • Inference
  • Artificial intelligence
  • Computer vision
  • Architecture
UN Sustainable Development Goals
  • Sustainable cities and communities
