LACMA: Language-Aligning Contrastive Learning with Meta-Actions for Embodied Instruction Following

Cheng-Fu Yang, Yen-Chun Chen, Jianwei Yang, Xiyang Dai, Lu Yuan, Yu-Chiang Frank Wang, and Kai-Wei Chang, in EMNLP, 2023.

Abstract

End-to-end Transformers have demonstrated impressive success rates for Embodied Instruction Following when the environment has been seen during training. However, they tend to struggle when deployed in a new environment. We find that this lack of generalizability stems from the agent ignoring the natural language instructions. To mitigate this, we first propose to explicitly align the agent’s hidden states to the instructions via contrastive learning. Nevertheless, the semantic gap between high-level language instructions and the agent’s low-level action space remains an obstacle. We further bridge this gap via a novel concept of meta-actions. Meta-actions are ubiquitous action patterns that can be parsed from the original action sequence. These patterns represent higher-level semantics that are intuitively closer to the instructions. When meta-actions are further applied as additional training signals, the agent generalizes even better to unseen environments. Compared to a strong multi-modal Transformer baseline, we achieve a significant 4.5% absolute gain in success rate in the unseen environments of ALFRED Embodied Instruction Following. Additional analysis shows that the contrastive objective and meta-actions are complementary for achieving the best result, and that the resulting agent better aligns its states to the corresponding instructions and is therefore more favorable for real-world embodied agents.
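
To make the idea of aligning hidden states to instructions via contrastive learning more concrete, here is a minimal sketch of a generic InfoNCE-style objective. The abstract does not specify the exact loss, so everything here is an assumption for illustration: the function names `cosine` and `info_nce`, the `temperature` value, the use of cosine similarity, and the toy two-dimensional vectors are hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(states, instructions, temperature=0.1):
    """Mean InfoNCE loss, assuming states[i] is paired with instructions[i].

    Each state is pulled toward its own instruction (the positive) and
    pushed away from the other instructions in the batch (the negatives).
    """
    losses = []
    for i, state in enumerate(states):
        logits = [cosine(state, instr) / temperature for instr in instructions]
        # Numerically stable log-sum-exp over all candidate instructions.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the matching instruction.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): two agent states and two instructions.
states = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(states, [[1.0, 0.0], [0.0, 1.0]])
misaligned = info_nce(states, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup the loss is near zero when each state points in the same direction as its paired instruction and large when the pairing is swapped, which is the behavior a language-aligning objective of this kind relies on.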

Bib Entry

@inproceedings{yang2023lacma,
  title = {LACMA: Language-Aligning Contrastive Learning with Meta-Actions for Embodied Instruction Following},
  author = {Yang, Cheng-Fu and Chen, Yen-Chun and Yang, Jianwei and Dai, Xiyang and Yuan, Lu and Wang, Yu-Chiang Frank and Chang, Kai-Wei},
  booktitle = {EMNLP},
  year = {2023}
}