TY - GEN
T1 - Multilevel language and vision integration for text-to-clip retrieval
AU - Xu, Huijuan
AU - He, Kun
AU - Plummer, Bryan A.
AU - Sigal, Leonid
AU - Sclaroff, Stan
AU - Saenko, Kate
N1 - Funding Information:
Supported in part by IARPA (contract number D17PC00344) and DARPA's XAI program.
Publisher Copyright:
© 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2019
Y1 - 2019
N2 - We address the problem of text-based activity retrieval in video. Given a sentence describing an activity, our task is to retrieve matching clips from an untrimmed video. To capture the inherent structures present in both text and video, we introduce a multilevel model that integrates vision and language features earlier and more tightly than prior work. First, we inject text features early on when generating clip proposals, to help eliminate unlikely clips and thus speed up processing and boost performance. Second, to learn a fine-grained similarity metric for retrieval, we use visual features to modulate the processing of query sentences at the word level in a recurrent neural network. A multi-task loss is also employed by adding query re-generation as an auxiliary task. Our approach significantly outperforms prior work on two challenging benchmarks: Charades-STA and ActivityNet Captions.
AB - We address the problem of text-based activity retrieval in video. Given a sentence describing an activity, our task is to retrieve matching clips from an untrimmed video. To capture the inherent structures present in both text and video, we introduce a multilevel model that integrates vision and language features earlier and more tightly than prior work. First, we inject text features early on when generating clip proposals, to help eliminate unlikely clips and thus speed up processing and boost performance. Second, to learn a fine-grained similarity metric for retrieval, we use visual features to modulate the processing of query sentences at the word level in a recurrent neural network. A multi-task loss is also employed by adding query re-generation as an auxiliary task. Our approach significantly outperforms prior work on two challenging benchmarks: Charades-STA and ActivityNet Captions.
UR - http://www.scopus.com/inward/record.url?scp=85090804599&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85090804599&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85090804599
T3 - 33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019
SP - 9062
EP - 9069
BT - 33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019
PB - AAAI press
T2 - 33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Annual Conference on Innovative Applications of Artificial Intelligence, IAAI 2019 and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019
Y2 - 27 January 2019 through 1 February 2019
ER -