Cross-view Transformers for real-time Map-view Semantic Segmentation
The University of Texas at Austin
Abstract
We present cross-view transformers, an efficient attention-based model for map-view semantic segmentation from multiple cameras. Our architecture implicitly learns a mapping from individual camera views into a canonical map-view representation using a camera-aware cross-view attention mechanism. Each camera uses positional embeddings that depend on its intrinsic and extrinsic calibration. These embeddings allow a transformer to learn the mapping across different views without ever explicitly modeling it geometrically. The architecture consists of a convolutional image encoder for each view and cross-view transformer layers to infer a map-view semantic segmentation. Our model is simple, easily parallelizable,…
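The cross-view attention idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that the calibration-dependent positional embedding is a linear projection of each pixel's unit ray direction (computed from the intrinsics `K` and a rotation `R`), and it uses plain scaled dot-product attention with random weights standing in for learned parameters. All function names, shapes, and the toy calibration values are hypothetical.

```python
import numpy as np

def pixel_ray_directions(K, R, height, width):
    """Unit ray direction in the ego frame for every pixel of one camera.

    K: (3, 3) intrinsics. R: (3, 3) rotation from ego frame to camera frame.
    Using ray directions as the calibration signal is an assumption of this sketch.
    """
    v, u = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    rays_cam = pixels @ np.linalg.inv(K).T   # K^-1 x for each homogeneous pixel x
    rays_ego = rays_cam @ R                  # R^T r in row-vector form
    return rays_ego / np.linalg.norm(rays_ego, axis=-1, keepdims=True)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_attention(map_queries, features, rays, W_pos):
    """Map-view queries attend over image features from all cameras at once.

    map_queries: (M, D) one query per map-view cell (learned in a real model).
    features:    list of (H*W, D) arrays, one per camera (image encoder output).
    rays:        list of (H*W, 3) arrays from pixel_ray_directions.
    W_pos:       (3, D) projection of ray directions to embeddings (hypothetical).
    """
    # Keys carry both appearance and calibration-dependent position.
    keys = np.concatenate([f + r @ W_pos for f, r in zip(features, rays)], axis=0)
    values = np.concatenate(features, axis=0)
    scores = map_queries @ keys.T / np.sqrt(map_queries.shape[-1])
    weights = softmax(scores, axis=-1)       # each map cell distributes attention over all pixels
    return weights @ values, weights

# Toy setup: two cameras with shared intrinsics and different orientations.
rng = np.random.default_rng(0)
H, W, D, M = 4, 6, 8, 5
K = np.array([[5.0, 0.0, 3.0],
              [0.0, 5.0, 2.0],
              [0.0, 0.0, 1.0]])
R_front = np.eye(3)
R_side = np.array([[0.0, 0.0, -1.0],
                   [0.0, 1.0, 0.0],
                   [1.0, 0.0, 0.0]])         # 90 degree rotation about the y axis

rays = [pixel_ray_directions(K, R, H, W) for R in (R_front, R_side)]
features = [rng.normal(size=(H * W, D)) for _ in rays]  # stand-in for encoder features
map_queries = rng.normal(size=(M, D))
W_pos = rng.normal(size=(3, D))

bev, weights = cross_view_attention(map_queries, features, rays, W_pos)
print(bev.shape, weights.shape)
```

No explicit projection from image to map coordinates appears anywhere in the sketch: geometry enters only through the positional term added to the keys, which is the sense in which the mapping is learned implicitly.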
Citation impact
- FWCI: 15.81
- Percentile: 100%
- References: 63
Authors (2)
Topics & keywords
- Computer science
- Encoder
- Segmentation
- Transformer
- Inference
- Artificial intelligence
- Computer vision
- Architecture
- Sustainable cities and communities