TY - GEN
T1 - R-C3D
T2 - 16th IEEE International Conference on Computer Vision, ICCV 2017
AU - Xu, Huijuan
AU - Das, Abir
AU - Saenko, Kate
N1 - Publisher Copyright:
© 2017 IEEE.
PY - 2017/12/22
Y1 - 2017/12/22
N2 - We address the problem of activity detection in continuous, untrimmed video streams. This is a difficult task that requires extracting meaningful spatio-temporal features to capture activities, accurately localizing the start and end times of each activity. We introduce a new model, Region Convolutional 3D Network (R-C3D), which encodes the video streams using a three-dimensional fully convolutional network, then generates candidate temporal regions containing activities, and finally classifies selected regions into specific activities. Computation is saved due to the sharing of convolutional features between the proposal and the classification pipelines. The entire model is trained end-to-end with jointly optimized localization and classification losses. R-C3D is faster than existing methods (569 frames per second on a single Titan X Maxwell GPU) and achieves state-of-the-art results on THUMOS'14. We further demonstrate that our model is a general activity detection framework that does not rely on assumptions about particular dataset properties by evaluating our approach on ActivityNet and Charades. Our code is available at http://ai.bu.edu/r-c3d/
AB - We address the problem of activity detection in continuous, untrimmed video streams. This is a difficult task that requires extracting meaningful spatio-temporal features to capture activities, accurately localizing the start and end times of each activity. We introduce a new model, Region Convolutional 3D Network (R-C3D), which encodes the video streams using a three-dimensional fully convolutional network, then generates candidate temporal regions containing activities, and finally classifies selected regions into specific activities. Computation is saved due to the sharing of convolutional features between the proposal and the classification pipelines. The entire model is trained end-to-end with jointly optimized localization and classification losses. R-C3D is faster than existing methods (569 frames per second on a single Titan X Maxwell GPU) and achieves state-of-the-art results on THUMOS'14. We further demonstrate that our model is a general activity detection framework that does not rely on assumptions about particular dataset properties by evaluating our approach on ActivityNet and Charades. Our code is available at http://ai.bu.edu/r-c3d/
UR - https://www.scopus.com/pages/publications/85041930119
UR - https://www.scopus.com/pages/publications/85041930119#tab=citedBy
U2 - 10.1109/ICCV.2017.617
DO - 10.1109/ICCV.2017.617
M3 - Conference contribution
AN - SCOPUS:85041930119
T3 - Proceedings of the IEEE International Conference on Computer Vision
SP - 5794
EP - 5803
BT - Proceedings - 2017 IEEE International Conference on Computer Vision, ICCV 2017
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 22 October 2017 through 29 October 2017
ER -