About Research+Code Blog

NLP Papers

The general theme these papers follow is tying language meaning to language use, specifically in reinforcement learning settings for task solving and navigation. The complexity of environments varies (e.g., complex visual information and natural language utterances as opposed to discrete grid-world environments and synthetic language sampled from a grammar). That said, there's a lot to learn from both settings, and we're still a long way off from solving many of the hard problems in either scenario.

Here’s the list. Happy reading!



Speaker-Follower Models for Vision-and-Language Navigation
Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, Trevor Darrell

This paper was a very cool attempt at fitting instruction following into the RSA setup. Essentially: build a base listener trained to follow instructions, a base speaker trained to generate instructions, and then a pragmatic listener agent that conditions on both. An agent that tries to follow natural language instructions while keeping in mind the speaker's beliefs over routes in the environment can outperform an L0 listener agent that only attempts to explore the environment without this additional reasoning. Big gains in performance (possibly also because of data augmentation from the speaker) and generally a very cool practical use of Goodman et al.'s RSA framework for navigation in environments with complex, panoramic state information. Nit: while the speaker is trained to generate natural language, the routes it is trained on are exhaustively sampled shortest-path routes in different environments. It is likely that the gains in performance come from this additional exploration during training rather than from any communicative agent belief reasoning... but still a noteworthy approach.
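A minimal sketch of the pragmatic-rescoring idea, just to make it concrete. The function names, the scoring interfaces, and the mixing weight `alpha` are illustrative placeholders, not the paper's actual implementation (which also has its own candidate-route search):

```python
import numpy as np

def pragmatic_rescore(instruction, candidate_routes, listener_logprob, speaker_logprob, alpha=0.5):
    """Rank candidate routes by a weighted combination of follower (listener)
    and speaker log-probabilities, in the spirit of the speaker-follower model.
    listener_logprob(route, instruction) ~ log P_listener(route | instruction)
    speaker_logprob(instruction, route)  ~ log P_speaker(instruction | route)"""
    scores = []
    for route in candidate_routes:
        l0 = listener_logprob(route, instruction)
        s0 = speaker_logprob(instruction, route)
        scores.append((1 - alpha) * l0 + alpha * s0)   # pragmatic combination
    return candidate_routes[int(np.argmax(scores))]
```

The intuition: a route the base listener likes but that no reasonable speaker would have described with this instruction gets down-weighted.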



Learning to Win by Reading Manuals in a Monte-Carlo Framework
S.R.K. Branavan, David Silver, Regina Barzilay

This (in my mind) revolutionary paper from MIT in 2011 attempts to do things (akin to DQN, language-conditioned RL, skill learning) way before its time. Game environments (in their case Civilization, but this could extend to Minecraft and the like) often come with manuals. It is natural for a human to first read a manual to understand the game dynamics and then successfully learn to win the game. This paper attempts to do that in a somewhat primitive way --- instructions in manuals are parsed to learn useful relations and skills (e.g., axes->tools->wood->fire and wood->planks->bridge) that are fed as features to a Q-function, which therefore learns to attribute useful language information to states and can transfer to other scenarios where these skills can be used. They use Monte-Carlo updates for Q-learning and show that this actually leads to better behavior. Lots of things left to explore in this area: can we learn dynamically from text, from interaction, by transferring and chaining skills sequentially, by incorporating priors over states, and by using pretrained language information?
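A rough sketch of a Monte-Carlo update on a Q-function with text-derived features, to make the idea concrete. I use a simple linear Q-function here for brevity; the paper's actual model is a neural action-value function over state, action, and manual-text features, and all names below are placeholders:

```python
import numpy as np

def mc_q_update(theta, episodes, feature_fn, gamma=1.0, lr=0.01):
    """Monte-Carlo update of a linear Q-function Q(s, a) = theta . phi(s, a),
    where phi(s, a) concatenates game-state features with features of manual
    sentences deemed relevant to (s, a)."""
    for episode in episodes:
        # episode: list of (state, action, reward) tuples from one simulated game
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G                  # Monte-Carlo return from this step on
            phi = feature_fn(state, action)         # includes text-derived features
            q = float(np.dot(theta, phi))
            theta += lr * (G - q) * phi             # move Q(s, a) toward the observed return
    return theta
```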



Points, Paths, and Playscapes: Large-scale Spatial Language Understanding Tasks Set in the Real World
Jason Baldridge, Tania Bedrax-Weiss, Daphne Luong, Srini Narayanan, Bo Pang, Fernando Pereira, Radu Soricut, Michael Tseng, Yuan Zhang

This is mostly a survey paper but an interesting read (and was brilliantly presented at a NAACL workshop). Spatial language understanding has so far been explored in 2D realms and entirely from language; the next big advances require real-world interactive datasets, where a bot can experience everything that a human would, interact with people, and learn to survive in the environment. This requires understanding the happenings occurring around it and the ability to perceive complex sensorimotor information. For the first time, the agent has access to nearly all the modes of information that a human would, as opposed to only text and images. This survey/proposal argues for creating richer data that requires scene understanding, is in first-person perspective, is at a much larger scale (e.g., Google Maps data around the world), and has more natural language.



Semantic Parsing with Semi-Supervised Sequential Autoencoders
Tomáš Kočiský, Gábor Melis, Edward Grefenstette, Chris Dyer, Wang Ling, Phil Blunsom, Karl Moritz Hermann

Interesting approach to meaning representations (more detail in the other, correct section), which applies semi-supervision to a navigation task by sampling new environments and maps (in synthetic domains without vision) and training an autoencoder to reconstruct routes, using language as a latent variable.
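A loss-level sketch of the semi-supervised objective as I read it. All names are placeholders, and I gloss over how the discrete language latent is actually handled (sampling/inference is left to the hypothetical `model`):

```python
def semi_supervised_loss(model, labeled_pairs, unlabeled_routes, lam=1.0):
    """Sketch: on unlabeled routes, encode the route into a latent language string
    and score the reconstruction of the route from it; on labeled (instruction, route)
    pairs, score the supervised mapping directly."""
    loss = 0.0
    for route in unlabeled_routes:
        latent_instruction = model.encode(route)                  # route -> language (latent)
        loss += -model.route_logprob(route, latent_instruction)   # reconstruction term
    for instruction, route in labeled_pairs:
        loss += -lam * model.route_logprob(route, instruction)    # supervised term
    return loss
```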



Listen, Attend, and Walk: Neural Mapping of Navigational Instructions to Action Sequences
Hongyuan Mei, Mohit Bansal and Matthew R. Walter

The first seq2seq! For RL! That works! And on SAIL, too! More concretely --- one of the first attempts to go directly from language → policy on Ray Mooney's benchmark instruction-following task, and they convincingly show improved SOTA performance on a pretty hard task (yes, SAIL is a grid world, and a pretty small one, but it has a complex array of objects and state features and pretty variable natural language instructions), which is interesting and promising to see.
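A skeletal PyTorch version of the encoder-decoder idea, assuming hypothetical dimensions and omitting the per-step world-state features that the real model conditions on (represented here by a zero placeholder):

```python
import torch
import torch.nn as nn

class InstructionToActions(nn.Module):
    """Minimal sketch in the spirit of 'Listen, Attend, and Walk': encode the
    instruction with an LSTM, then decode one action per step while attending
    over the encoded instruction."""
    def __init__(self, vocab_size, n_actions, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.decoder = nn.LSTMCell(hidden + hidden, hidden)
        self.attn = nn.Linear(hidden, hidden)
        self.out = nn.Linear(hidden, n_actions)

    def forward(self, instruction_ids, max_steps=20):
        enc, _ = self.encoder(self.embed(instruction_ids))        # (B, T, H)
        h = torch.zeros(enc.size(0), enc.size(2))
        c = torch.zeros_like(h)
        world_feat = torch.zeros_like(h)                          # placeholder for world-state features
        action_logits = []
        for _ in range(max_steps):
            scores = torch.bmm(enc, self.attn(h).unsqueeze(2))    # attention over instruction tokens
            context = (enc * torch.softmax(scores, dim=1)).sum(1)
            h, c = self.decoder(torch.cat([context, world_feat], -1), (h, c))
            action_logits.append(self.out(h))                     # one action distribution per step
        return torch.stack(action_logits, dim=1)
```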



Language as an Abstraction for Hierarchical Deep Reinforcement Learning
Yiding Jiang, Shixiang Gu, Kevin Murphy, Chelsea Finn

Very cool attempt to use language to instruct a two-layer hierarchical policy for sorting / rearrangement tasks. Instead of other forms of abstraction, they use *language* as the abstraction for HRL (so atomic units can be composed together like subtasks). Train low-level policies that take in language and learn to solve the task, then train high-level policies that instruct the low-level policies (through language), which allows reuse of language components (i.e., abstractions and sub-units) across tasks. They evaluate on object sorting and multi-object rearrangement and also introduce a new RL environment for these tasks. They mostly show that making high-level policies produce "language" (i.e., high-level actions in the space of possible language strings) does better than a vanilla DQN-esque thing with a flat policy.
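A control-loop sketch of the two-level setup, assuming hypothetical `env`, `high_policy`, and `low_policy` objects (the paper also pairs this with hindsight relabeling of instructions, which this sketch omits):

```python
def run_hierarchical_episode(env, high_policy, low_policy, instructions, k=10):
    """Two-level control loop: the high-level policy selects a language instruction
    (one of `instructions`); the low-level policy executes primitive actions
    conditioned on that instruction for up to k steps, then control returns upward."""
    obs, done, total_reward = env.reset(), False, 0.0
    while not done:
        goal = instructions[high_policy.select(obs)]   # high-level "action" is a sentence
        for _ in range(k):                             # low-level acts toward the instruction
            action = low_policy.act(obs, goal)
            obs, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                break
    return total_reward
```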



Automatically Composing Representation Transformations as a Means for Generalization
Michael B. Chang, Abhishek Gupta, Sergey Levine, Thomas L. Griffiths

Compositionality is key to being able to reuse previous experiences and generalise to complex tasks composed of subtasks. This paper wants to relate tasks through a compositional problem graph (where tasks of different complexity, i.e., number of subtasks, can be related). They introduce a compositional recursive learner that reasons about what computation to execute by making analogies to previously seen problems. So they recast the problem of generalisation as a problem of learning algorithmic procedures over representation transformations → a solution to a problem is a transformation between its input and output representations. They experiment on multilingual arithmetic problem solving and recognition of spatially transformed digits in MNIST. And it works kind of well!
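A loop-level sketch of the recursive learner idea, with a hypothetical `controller` and set of transformation `modules`:

```python
def compositional_solve(x, controller, modules, max_steps=10):
    """Sketch: a controller inspects the current representation and either halts or
    picks one of a set of reusable transformation modules to apply, so harder problems
    become compositions of transformations already learned on simpler ones."""
    for _ in range(max_steps):
        choice = controller.choose(x)     # index of a module to apply, or None to halt
        if choice is None:
            break
        x = modules[choice](x)            # apply the chosen representation transformation
    return x
```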



Weakly Supervised Learning of Semantic Parsers for Mapping Instructions to Actions.
Yoav Artzi and Luke Zettlemoyer.

As needed for most planning / RL / robotics settings, we want to ground language to symbolic meaning (CCG) representations that you can use to plan. But can we do this without having to annotate paired language → CCG data? This paper builds a feedback mechanism that rewards learned logical forms using trajectories in an environment --- therefore never (with some caveats) seeing ground truth logical forms during training. Their feedback mechanism is somewhat inefficient --- each produced logical form has to be executed in the environment to see whether it reaches the goal state (this is a somewhat simplified explanation, read the paper for more detail). One of the first weakly supervised approaches for instruction following, and it has been foundational for subsequent attempts to do this.
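The validation-by-execution loop, roughly. The parser API and `executor` below are placeholders, and the real paper's feedback and update rules are richer than a simple goal-state check:

```python
def weakly_supervised_update(parser, instruction, start_state, goal_state, executor, k=10):
    """Sketch: instead of gold logical forms, score each candidate parse by whether
    executing it from the start state reaches the annotated goal state, and update
    the parser toward the successful candidates."""
    candidates = parser.beam_parse(instruction, k=k)        # candidate logical forms
    positives = [lf for lf in candidates
                 if executor(lf, start_state) == goal_state]
    if positives:
        parser.update(instruction, positives, candidates)   # e.g. a perceptron-style update
    return bool(positives)
```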



Modular Multitask RL with Policy Sketches.
Jacob Andreas, Dan Klein and Sergey Levine.

This paper attempts to tie in information about the structure of a policy by annotating tasks with "sketches". Each sketch is a sequence of symbolic names of high-level sub-plans, analogous to options. Each modular sub-policy (associated with a task symbol in the sketch) is jointly learned by maximising reward over the task-specific policies (which can share these sub-policies). They use an A2C method to optimise the whole thing, compare to baselines (option-critic and untied sub-policies), and show good generalisation that allows reusing the learned sub-policies for different tasks. This doesn't actually do anything with natural language, but it was a pretty cool attempt to show that giving a small amount of structured information --- here in the form of symbolic information about option chains --- can go a long way.
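A rollout-level sketch of how sketches tie tasks to shared sub-policies, with placeholder names:

```python
STOP = "STOP"

def sketch_rollout(env, sub_policies, sketch, max_steps=200):
    """Sketch: a task's policy is just its sketch (a sequence of symbolic subtask names);
    the shared sub-policy for the current symbol acts until it emits STOP, then control
    passes to the next symbol in the sketch."""
    obs = env.reset()
    total_reward, steps = 0.0, 0
    for symbol in sketch:                      # e.g. ["get_wood", "make_plank"]
        policy = sub_policies[symbol]          # sub-policies are shared across tasks
        while steps < max_steps:
            action = policy.act(obs)
            steps += 1
            if action == STOP:                 # sub-policy decides its subtask is done
                break
            obs, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                return total_reward
    return total_reward
```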


Learning and using language via recursive pragmatic reasoning about other agents.
Smith, Nathaniel J.; Noah D. Goodman; and Michael C. Frank. 2013.

This paper (to the best of my knowledge) serves as the first proper introduction of the RSA model, i.e., a setting in which interacting agents or language learners approximate a shared lexicon that they jointly reason over. They interact in a goal-oriented fashion, i.e., iteratively pass knowledge to one another. Later work builds on this in several different ways; while fixed RSA deals with a fixed literal semantic lexicon that agents share, learned RSA allows more flexibility. Here, what agents share are model parameters, which they then use to optimise what they are interested in, i.e., the speaker wants the best utterance given the target and context and the listener wants the best object given the utterance, but now both are parameterised by the same learned variable. This model therefore captures several phenomena and key features of language, e.g., pragmatic inference, communicative conversation, word meaning, and referring expressions. Important elements: covers the fundamental RSA equations relating speakers, listeners, and the lexicon; shows how learning, reasoning, and inference take place and ties this in with phenomena like scalar implicature and Horn implicature; explains the basic reference game layout.
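For reference, the standard fixed-lexicon RSA recursion that this line of work builds on (notation mine: u an utterance, o an object/interpretation, [[u]](o) the literal lexicon, alpha a rationality parameter):

```latex
\begin{align*}
L_0(o \mid u) &\propto [\![u]\!](o)\, P(o) && \text{literal listener} \\
S_1(u \mid o) &\propto \exp\!\big(\alpha \log L_0(o \mid u) - \mathrm{cost}(u)\big) && \text{pragmatic speaker} \\
L_1(o \mid u) &\propto S_1(u \mid o)\, P(o) && \text{pragmatic listener}
\end{align*}
```

The learned-RSA variant discussed above replaces the fixed lexicon with shared learned parameters that both agents update from interaction.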



Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation
Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, Lei Zhang



Self-Monitoring Navigation Agent via Auxiliary Progress Estimation
Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, Caiming Xiong



Deep Transfer in Reinforcement Learning by Language Grounding
Karthik Narasimhan, Regina Barzilay and Tommi Jaakkola



Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout
Hao Tan, Licheng Yu, Mohit Bansal



Hierarchical Decision Making by Generating and Following Natural Language Instructions
Hengyuan Hu, Denis Yarats, Qucheng Gong, Yuandong Tian, Mike Lewis



From Language to Goals: Inverse Reinforcement Learning for Vision-Based Instruction Following
Justin Fu, Anoop Korattikara, Sergey Levine, Sergio Guadarrama



Tactical Rewind: Self-Correction via Backtracking in Vision-and-Language Navigation
Liyiming Ke, Xiujun Li, Yonatan Bisk, Ari Holtzman, Zhe Gan, Jingjing Liu, Jianfeng Gao, Yejin Choi, Siddhartha Srinivasa



Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction
Dipendra Misra, Andrew Bennett, Valts Blukis, Eyvind Niklasson, Max Shatkhin, Yoav Artzi