Vision-Language Navigation (VLN)

🚩 Important

This work, NaVILA, is proposed by Cheng et al. and presents a vision-language-action (VLA) model for legged robot navigation. Please refer to their website for more details.

Citation

@inproceedings{cheng2024navila,
    title     = {NaVILA: Legged Robot Vision-Language-Action Model for Navigation},
    author    = {Cheng, An-Chieh and Ji, Yandong and Yang, Zhaojing and Zou, Xueyan and Kautz, Jan and Biyik, Erdem and Yin, Hongxu and Liu, Sifei and Wang, Xiaolong},
    booktitle = {RSS},
    year      = {2025},
}

Introduction

We demonstrate the deployment of NaVILA on a Unitree A1 quadruped performing indoor navigation tasks. The VLA model runs on an RTX 5090 server, and commands are relayed to the robot over UDP. A detailed setup guide is provided for RTX 50-series GPUs and newer Ubuntu environments.
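As a rough illustration of the server-to-robot link, the sketch below sends velocity commands as JSON-encoded UDP datagrams. The address, port, and message fields (`vx`, `vy`, `yaw_rate`) are placeholders for illustration, not NaVILA's actual wire format; refer to the deployment code for the real schema.

```python
import json
import socket

# Hypothetical address of the onboard controller; the real IP, port,
# and message schema are defined by the deployment code.
ROBOT_ADDR = ("192.168.123.161", 9000)


def send_velocity_command(sock: socket.socket,
                          vx: float, vy: float, yaw_rate: float) -> None:
    """Serialize one velocity command and send it as a single UDP datagram."""
    payload = json.dumps({"vx": vx, "vy": vy, "yaw_rate": yaw_rate}).encode("utf-8")
    sock.sendto(payload, ROBOT_ADDR)


def receive_loop(port: int = 9000) -> None:
    """Robot-side counterpart: block on incoming datagrams and decode commands."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("0.0.0.0", port))
        while True:
            data, _addr = sock.recvfrom(1024)
            cmd = json.loads(data)
            print(cmd["vx"], cmd["vy"], cmd["yaw_rate"])


if __name__ == "__main__":
    # UDP is connectionless: one socket, fire-and-forget datagrams,
    # which keeps per-command latency low on the server-to-robot link.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        send_velocity_command(sock, vx=0.3, vy=0.0, yaw_rate=0.1)
```

UDP's fire-and-forget semantics suit high-rate command streaming, where a stale command is better dropped than retransmitted; any reliability or ordering guarantees would need to be layered on top.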

Installation Guidance

Please follow these instructions (provided as a markdown file).
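After installation, a quick sanity check can confirm the GPU is usable. This is a generic snippet, assuming a PyTorch-based environment; RTX 50-series (Blackwell) cards generally require a recent PyTorch build with CUDA 12.8+ support, and older wheels will report the GPU as unsupported.

```python
import torch

# Verify the environment sees the GPU before launching the VLA model.
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))
```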