ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification

Girdhar, Rohit; Ramanan, Deva; Gupta, Abhinav; Šivic, Josef; Russell, Bryan

doi:10.1109/cvpr.2017.337

Preprint · Jul 1, 2017 · Green OA

Carnegie Mellon University · Adobe Systems (United States) · +1 more institution

Indexed in: arXiv, Crossref

Abstract

In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video. We do so by integrating state-of-the-art two-stream networks [42] with learnable spatio-temporal feature aggregation [6]. The resulting architecture is end-to-end trainable for whole-video classification. We investigate different strategies for pooling across space and time and combining signals from the different streams. We find that: (i) it is important to pool jointly across space and time, but (ii) appearance and motion streams are best aggregated into their own separate representations. Finally, we show that our representation…
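The aggregation step the abstract describes can be illustrated with a small sketch of soft-assignment VLAD pooling, in which residuals between local descriptors and a set of anchor points are summed jointly over all frames and spatial positions to give one vector per video. This is a hedged illustration of the general technique, not the authors' implementation: the function names, the `alpha` sharpness parameter, the per-cluster normalization step, and the toy data are all assumptions, and the anchors are fixed here, whereas in a trainable architecture they would be learned end to end.

```python
import math

def soft_assign(x, centers, alpha):
    """Softmax over negative scaled squared distances from x to each center."""
    logits = [-alpha * sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def l2_normalize(v, eps=1e-12):
    """Scale v to (approximately) unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / (norm + eps) for a in v]

def vlad_pool(video_features, centers, alpha=1.0):
    """Pool local descriptors over space and time into one K*D vector.

    video_features: list over frames; each frame is a list of D-dim descriptors.
    centers: K hypothetical anchor points, each a D-dim list.
    """
    k, d = len(centers), len(centers[0])
    residual_sums = [[0.0] * d for _ in range(k)]
    for frame in video_features:      # sum over time
        for x in frame:               # sum over spatial positions
            w = soft_assign(x, centers, alpha)
            for ki in range(k):
                for j in range(d):
                    residual_sums[ki][j] += w[ki] * (x[j] - centers[ki][j])
    # Assumed post-processing: normalize each cluster's residual, then the whole vector.
    flat = [a for row in residual_sums for a in l2_normalize(row)]
    return l2_normalize(flat)

# Toy example: 2 frames, 2-dim descriptors, 2 anchors (all values made up).
centers = [[0.0, 0.0], [1.0, 1.0]]
video = [[[0.1, 0.0], [0.9, 1.0]], [[0.0, 0.2]]]
desc = vlad_pool(video, centers, alpha=10.0)
```

Because the sum runs over every frame and every position before any normalization, the output length depends only on the number of anchors and the descriptor dimension, not on the video length. Pooling the appearance and motion streams separately, as the abstract recommends, would amount to calling `vlad_pool` once per stream with its own anchors.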

Citation impact

Total citations: 519
FWCI: 26.81
Percentile: 100%
References: 79

Authors: 5

Topics & keywords

Keywords

Computer science
Pooling
Representation (politics)
Artificial intelligence
Margin (machine learning)
Action recognition
Pattern recognition (psychology)
Feature (linguistics)
