Convolutional Two-Stream Network Fusion for Video Action Recognition
Graz University of Technology · University of Oxford
Abstract
Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters, (ii) that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class…
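Finding (i) can be illustrated with a minimal sketch, assuming that "fusing at a convolution layer" means stacking the two towers' feature maps along the channel axis and mixing them with a 1x1 convolution. The function name `fuse_conv`, the toy feature maps, and the averaging weights are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch: fuse two feature maps at a convolutional layer.
# Feature maps are nested lists shaped [channels][height][width].

def fuse_conv(spatial, temporal, weights, bias):
    """Stack both towers along channels, then apply a 1x1 convolution.

    weights: [out_channels][in_channels], in_channels = C_spatial + C_temporal
    bias:    [out_channels]
    """
    stacked = spatial + temporal  # channel-wise concatenation
    height, width = len(stacked[0]), len(stacked[0][0])
    if any(len(row) != len(stacked) for row in weights):
        raise ValueError("each filter needs one weight per stacked channel")
    fused = []
    for w_out, b in zip(weights, bias):
        # A 1x1 convolution is a weighted sum over channels at each pixel.
        plane = [[b + sum(w * stacked[c][i][j] for c, w in enumerate(w_out))
                  for j in range(width)]
                 for i in range(height)]
        fused.append(plane)
    return fused


# Toy 1-channel, 2x2 feature maps from each tower (made-up values).
spatial = [[[1.0, 2.0], [3.0, 4.0]]]
temporal = [[[0.5, 0.5], [1.0, 1.0]]]
weights = [[0.5, 0.5]]  # one output channel averaging the two towers
bias = [0.0]
fused = fuse_conv(spatial, temporal, weights, bias)
```

In this sketch the fused map has as many channels as a single tower, so any layers placed after it would need only one set of weights. That is one plausible reading of the parameter saving mentioned in the abstract, not a statement of the authors' exact architecture.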
Citation impact
- FWCI: 151.89
- Percentile: 100%
- References: 49
Authors: 3
Topics & keywords
- Softmax function
- Computer science
- Fuse (electrical)
- Convolutional neural network
- Pooling
- Artificial intelligence
- Convolution (computer science)
- Action recognition
- Sustainable cities and communities