Rethinking Vision Transformers for MobileNet Size and Speed
Universidad del Noreste · Snap (United States) · +2 more institutions
Abstract
With the success of Vision Transformers (ViTs) in computer vision tasks, recent works try to optimize the performance and complexity of ViTs to enable efficient deployment on mobile devices. Multiple approaches have been proposed to accelerate the attention mechanism, improve inefficient designs, or incorporate mobile-friendly lightweight convolutions to form hybrid architectures. However, ViT and its variants still have higher latency or considerably more parameters than lightweight CNNs, and this holds even against the years-old MobileNet. In practice, latency and size are both crucial for efficient deployment on resource-constrained hardware. In this work, we investigate a central question: can transformer models run as fast as…
Citation impact
- FWCI (Field-Weighted Citation Impact): 29.40
- Percentile: 100%
- References: 108
Authors: 8
Topics & keywords
- Computer science
- Software deployment
- Latency (audio)
- Transformer
- Computer engineering
- Mobile device
- Distributed computing
- Real-time computing