The general theme running through these papers is grounding language in vision, across a range of tasks, datasets, environments, and the downstream reasons we might want this capability in the first place.
Here’s the list. Happy reading!
Learning Visually Grounded Sentence Representations (D. Kiela, A. Conneau, A. Jabri, M. Nickel)
Learning Robust Visual-Semantic Embeddings (Yao-Hung Hubert Tsai, Liang-Kang Huang, Ruslan Salakhutdinov)
Multimodal Learning with Deep Boltzmann Machines (Nitish Srivastava, Ruslan Salakhutdinov)
Learning Representations by Maximizing Mutual Information Across Views (Philip Bachman, R Devon Hjelm, William Buchwalter)
Do Neural Network Cross-Modal Mappings Really Bridge Modalities? (Guillem Collell, Marie-Francine Moens)
Probabilistic Neural-symbolic Models for Interpretable Visual Question Answering (Ramakrishna Vedantam, Karan Desai, Stefan Lee, Marcus Rohrbach, Dhruv Batra, Devi Parikh)
Learning to Count Objects in Natural Images for Visual Question Answering (Yan Zhang, Jonathon Hare, Adam Prügel-Bennett)
Overcoming Language Priors in Visual Question Answering with Adversarial Regularization (Sainandan Ramakrishnan, Aishwarya Agrawal, Stefan Lee)
Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering (Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang)
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning (Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick)
VQA: Visual Question Answering (Aishwarya Agrawal, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Dhruv Batra, Devi Parikh)