BEVFormer: Learning Bird’s-Eye-View Representation From LiDAR-Camera via Spatiotemporal Transformers
Nanjing University · Shanghai Artificial Intelligence Laboratory · +2 more institutions
Abstract
Multi-modality fusion is currently the de facto most competitive approach to 3D perception tasks. In this work, we present a new framework, termed BEVFormer, which learns unified BEV representations from multi-modality data with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design a spatial cross-attention in which each BEV query extracts spatial features from both point cloud and camera input, thus completing multi-modality information fusion in BEV space. For…
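The spatial step described in the abstract can be illustrated with a minimal sketch. It assumes plain scaled dot-product cross-attention as a stand-in, since this excerpt does not specify the attention variant the authors use. The names `make_bev_queries`, `cross_attention`, `camera_tokens`, and `lidar_tokens` are hypothetical, the features are random toy data, and the queries are random values standing in for learned parameters.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, tokens):
    """Scaled dot-product cross-attention with keys = values = tokens.

    Assumption: a generic stand-in, not the paper's actual attention module.
    queries: list of C-dim vectors, one per BEV grid cell
    tokens:  list of C-dim vectors holding sensor features
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        # Each output is a weighted average of the token vectors.
        outputs.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
    return outputs


def make_bev_queries(height, width, dim, rng):
    """Hypothetical grid-shaped BEV queries: one vector per grid cell.

    In a trained model these would be learned; here they are random.
    """
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(height * width)]


rng = random.Random(0)
H, W, C = 4, 4, 8  # toy grid size and feature width (assumed values)

bev_queries = make_bev_queries(H, W, C, rng)
camera_tokens = [[rng.gauss(0.0, 1.0) for _ in range(C)] for _ in range(6)]
lidar_tokens = [[rng.gauss(0.0, 1.0) for _ in range(C)] for _ in range(5)]

# Fusion in BEV space: every query attends jointly over both modalities.
bev_features = cross_attention(bev_queries, camera_tokens + lidar_tokens)
```

The result `bev_features` holds one fused vector per grid cell, so it can be reshaped to an `H` by `W` map. Concatenating the two token sets before attention is only one possible way to fuse modalities, chosen here for brevity.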
Citation impact
- FWCI: 209.84
- Percentile: 100%
- References: 103
Authors
- 8
Topics & keywords
- Lidar
- Artificial intelligence
- Computer vision
- Computer science
- Transformer
- Representation (politics)
- Pattern recognition (psychology)
- Remote sensing