ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification
Carnegie Mellon University · Adobe Systems (United States) · +1 more institution
Abstract
In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video. We do so by integrating state-of-the-art two-stream networks [42] with learnable spatio-temporal feature aggregation [6]. The resulting architecture is end-to-end trainable for whole-video classification. We investigate different strategies for pooling across space and time and combining signals from the different streams. We find that: (i) it is important to pool jointly across space and time, but (ii) appearance and motion streams are best aggregated into their own separate representations. Finally, we show that our representation…
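The joint space-time pooling described in the abstract can be illustrated with a minimal NumPy sketch of VLAD-style aggregation. This is not the authors' implementation: the function name `vlad_aggregate`, the sharpness parameter `alpha`, the distance-based soft assignment, and the two normalization steps are assumptions chosen for illustration, and the anchor points here are fixed random values, whereas the abstract describes an architecture trained end to end. Following finding (ii), the appearance and motion streams would each be passed through such a pooling step separately, with their own anchor points.

```python
import numpy as np


def vlad_aggregate(features, centers, alpha=10.0):
    """Pool local descriptors from all frames and positions into one vector.

    features: array of shape (T, H, W, D), one D-dim descriptor per
              frame (T) and spatial location (H, W).
    centers:  array of shape (K, D), hypothetical anchor points
              (fixed here; learned in a trainable model).
    alpha:    assumed sharpness of the soft assignment.
    Returns a unit-norm vector of length K * D.
    """
    T, H, W, D = features.shape
    # Flatten time and space together so pooling is joint over both.
    x = features.reshape(-1, D)                                   # (N, D)

    # Soft-assign every descriptor to the anchors by squared distance.
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    logits = -alpha * d2
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)                             # (N, K)

    # Weighted sum of residuals: V[k] = sum_n a[n, k] * (x[n] - centers[k]).
    V = a.T @ x - a.sum(axis=0)[:, None] * centers                # (K, D)

    # Normalize each anchor's residual, then the whole vector (assumed steps).
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12
    v = V.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-12)


# Usage with random stand-in data: 3 frames, a 2x2 feature map, 8-dim features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 2, 2, 8))
anchors = rng.normal(size=(4, 8))
video_descriptor = vlad_aggregate(feats, anchors)
print(video_descriptor.shape)  # (32,)
```

Because the time and space axes are flattened before assignment, every local descriptor in the video contributes to the same set of residual sums, which is what distinguishes joint pooling from pooling each frame first and averaging afterwards.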
Citation impact
- FWCI: 26.81
- Percentile: 100%
- References: 79
Authors
- 5
Topics & keywords
- Computer science
- Pooling
- Representation (politics)
- Artificial intelligence
- Margin (machine learning)
- Action recognition
- Pattern recognition (psychology)
- Feature (linguistics)