Weakly-Supervised Action Localization with Expectation-Maximization Multi-Instance Learning

Zhekun Luo, Devin Guillory, Baifeng Shi, Wei Ke, Fang Wan, Trevor Darrell, Huijuan Xu

Research output: Chapter in Book/Report/Conference proceedingConference contribution

56 Scopus citations


Weakly-supervised action localization requires training a model to localize the action segments in the video given only video level action label. It can be solved under the Multiple Instance Learning (MIL) framework, where a bag (video) contains multiple instances (action segments). Since only the bag’s label is known, the main challenge is assigning which key instances within the bag to trigger the bag’s label. Most previous models use attention-based approaches applying attentions to generate the bag’s representation from instances, and then train it via the bag’s classification. These models, however, implicitly violate the MIL assumption that instances in negative bags should be uniformly negative. In this work, we explicitly model the key instances assignment as a hidden variable and adopt an Expectation-Maximization (EM) framework. We derive two pseudo-label generation schemes to model the E and M process and iteratively optimize the likelihood lower bound. We show that our EM-MIL approach more accurately models both the learning objective and the MIL assumptions. It achieves state-of-the-art performance on two standard benchmarks, THUMOS14 and ActivityNet1.2.

Original languageEnglish (US)
Title of host publicationComputer Vision – ECCV 2020 - 16th European Conference, Proceedings
EditorsAndrea Vedaldi, Horst Bischof, Thomas Brox, Jan-Michael Frahm
PublisherSpringer Science and Business Media Deutschland GmbH
Number of pages17
ISBN (Print)9783030585259
StatePublished - 2020
Event16th European Conference on Computer Vision, ECCV 2020 - Glasgow, United Kingdom
Duration: Aug 23 2020Aug 28 2020

Publication series

NameLecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume12374 LNCS
ISSN (Print)0302-9743
ISSN (Electronic)1611-3349


Conference16th European Conference on Computer Vision, ECCV 2020
Country/TerritoryUnited Kingdom

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • General Computer Science


Dive into the research topics of 'Weakly-Supervised Action Localization with Expectation-Maximization Multi-Instance Learning'. Together they form a unique fingerprint.

Cite this