A-ViT: Adaptive Tokens for Efficient Vision Transformer

Indexed in Crossref

Abstract

We introduce A-ViT, a method that adaptively adjusts the inference cost of vision transformers (ViT) for images of different complexity. A-ViT achieves this by automatically reducing the number of tokens that are processed in the network as inference proceeds. We reformulate Adaptive Computation Time (ACT [17]) for this task, extending halting to discard redundant spatial tokens. The appealing architectural properties of vision transformers enable our adaptive token reduction mechanism to speed up inference without modifying the network architecture or inference hardware. We demonstrate that A-ViT requires no extra parameters or sub-network for halting, as we base the learning of…
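
The halting idea in the abstract can be illustrated with a minimal sketch of ACT-style token halting. This is not the authors' implementation: the function name `adaptive_token_halting`, the threshold `eps`, and the use of precomputed per-layer halting scores are assumptions made for illustration, since the abstract does not say how the scores are produced. Each token accumulates a halting score layer by layer and is dropped from further processing once the running total reaches 1 - eps.

```python
import numpy as np

def adaptive_token_halting(halting_scores, eps=0.01):
    """Hypothetical helper: count how many layers each token is processed for.

    halting_scores: array of shape (num_layers, num_tokens), values in [0, 1].
    How these scores are computed is an assumption left outside this sketch.
    """
    num_layers, num_tokens = halting_scores.shape
    cumulative = np.zeros(num_tokens)            # running halting score per token
    active = np.ones(num_tokens, dtype=bool)     # tokens still being processed
    layers_used = np.zeros(num_tokens, dtype=int)

    for layer in range(num_layers):
        # Only tokens that have not halted are processed at this layer.
        layers_used[active] += 1
        cumulative[active] += halting_scores[layer, active]
        # A token halts once its cumulative score reaches 1 - eps.
        active &= cumulative < 1.0 - eps
        if not active.any():
            break                                # every token has halted
    return layers_used
```

In this sketch, a token with large early scores stops after a layer or two, while a token whose scores stay small is carried through every layer; the saving comes from skipping halted tokens in later layers.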

Citation impact

  • Total citations: 295
  • FWCI: 16.11
  • Percentile: 100%
  • References: 83

Authors

6

Topics & keywords

Keywords
  • Computer science
  • Inference
  • Transformer
  • Security token
  • Artificial intelligence
  • Rendering (computer graphics)
  • Computation
  • Regularization (linguistics)