Ask, attend and answer: Exploring question-guided spatial attention for visual question answering

Huijuan Xu, Kate Saenko

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

404 Scopus citations


We address the problem of Visual Question Answering (VQA), which requires joint image and language understanding to answer a question about a given photograph. Recent approaches have applied deep image captioning methods based on convolutional-recurrent networks to this problem, but have failed to model spatial inference. To remedy this, we propose a model we call the Spatial Memory Network and apply it to the VQA task. Memory networks are recurrent neural networks with an explicit attention mechanism that selects certain parts of the information stored in memory. Our Spatial Memory Network stores neuron activations from different spatial regions of the image in its memory, and uses attention to choose regions relevant for computing the answer. We propose a novel question-guided spatial attention architecture that looks for regions relevant to either individual words or the entire question, repeating the process over multiple recurrent steps, or “hops”. To better understand the inference process learned by the network, we design synthetic questions that specifically require spatial inference and visualize the network’s attention. We evaluate our model on two available visual question answering datasets and obtain improved results.
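The attention-over-regions idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' architecture: the dot-product scoring, the additive query update, and the names `spatial_attention_hops`, `regions`, and `question` are all assumptions chosen for illustration, and the sketch covers only whole-question attention, not the word-level variant. It shows one question vector attending over a memory of region features for a fixed number of hops.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()

def spatial_attention_hops(regions, question, n_hops=2):
    """Illustrative multi-hop attention (assumed form, not the paper's exact model).

    regions:  (R, d) array, one feature vector per spatial region (the "memory").
    question: (d,) array, an embedding of the whole question.
    Returns the final query vector and the attention weights from each hop.
    """
    query = question
    weights_per_hop = []
    for _ in range(n_hops):
        scores = regions @ query        # (R,) relevance of each region to the query
        weights = softmax(scores)       # attention distribution over regions
        attended = weights @ regions    # (d,) weighted sum of region features
        query = query + attended        # assumed update: add evidence to the query
        weights_per_hop.append(weights)
    return query, weights_per_hop

# Hypothetical sizes: a 7x7 grid of regions with 8-dimensional features.
rng = np.random.default_rng(0)
regions = rng.normal(size=(49, 8))
question = rng.normal(size=8)
final_query, hops = spatial_attention_hops(regions, question, n_hops=2)
print(hops[0].shape)  # one weight per region
```

Reshaping each hop's weight vector back to the region grid (7x7 in this hypothetical setup) gives a map of where the model is looking, which is the kind of attention visualization the abstract mentions.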

Original language: English (US)
Title of host publication: Computer Vision - 14th European Conference, ECCV 2016, Proceedings
Editors: Bastian Leibe, Jiri Matas, Nicu Sebe, Max Welling
Publisher: Springer Verlag
Number of pages: 16
ISBN (Print): 9783319464770
State: Published - 2016
Event: 14th European Conference on Computer Vision, ECCV 2016 - Amsterdam, Netherlands
Duration: Oct 8 2016 to Oct 16 2016

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9911 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Other: 14th European Conference on Computer Vision, ECCV 2016

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • General Computer Science

