Autoregressive Reasoning and Diffusion Policies for Generalizable Vision-Language-Action Models

Junjie Wen*,1,2 Minjie Zhu*1,2 Yichen Zhu*,†,1 Zhibin Tang1 Jinming Li1,3 Chengmeng Li1,3 Zhongyi Zhou1,2 Xiaoyu Liu1,3
Chaomin Shen2 Yaxin Peng3 Feifei Feng1
1. Midea Group 2. East China Normal University 3. Shanghai University
*Equal Contribution.

†Corresponding author.

Abstract

In this paper, we present DiVLA, a novel framework that seamlessly combines an autoregressive model with a diffusion model for learning visuomotor policies. Central to our approach is a next-token prediction objective, which enables the model to reason over the user's query in the context of the current observation. A diffusion model is then attached to generate robust action outputs. To enhance policy learning through self-reasoning, we introduce a reasoning injection module that integrates the generated reasoning phrases directly into the policy learning process. The whole framework is simple and flexible, making it easy to deploy and upgrade.
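To make the framework concrete, below is a minimal, illustrative PyTorch sketch of the two-part objective: an autoregressive next-token loss on the reasoning text plus a diffusion head that denoises an action chunk conditioned on the injected reasoning. The module names, the FiLM-style injection, the MLP noise predictor, and the toy noise schedule are simplifying assumptions for exposition, not the released DiVLA implementation; the VLM backbone is abstracted away as precomputed text logits and pooled features.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ReasoningInjection(nn.Module):
        """FiLM-style conditioning (an illustrative choice): modulate the policy
        condition with a pooled embedding of the generated reasoning phrase."""
        def __init__(self, dim: int):
            super().__init__()
            self.to_scale_shift = nn.Linear(dim, 2 * dim)

        def forward(self, policy_feat, reasoning_emb):
            scale, shift = self.to_scale_shift(reasoning_emb).chunk(2, dim=-1)
            return policy_feat * (1 + scale) + shift

    class DiffusionActionHead(nn.Module):
        """Predicts the noise added to an action chunk, conditioned on the fused
        observation/reasoning features and the diffusion timestep."""
        def __init__(self, action_dim=7, horizon=16, cond_dim=1024, hidden=512):
            super().__init__()
            self.horizon, self.action_dim = horizon, action_dim
            self.net = nn.Sequential(
                nn.Linear(action_dim * horizon + cond_dim + 1, hidden),
                nn.GELU(),
                nn.Linear(hidden, action_dim * horizon),
            )

        def forward(self, noisy_actions, timesteps, cond):
            x = torch.cat([noisy_actions.flatten(1), cond,
                           timesteps.float().unsqueeze(-1)], dim=-1)
            return self.net(x).view(-1, self.horizon, self.action_dim)

    def training_losses(vlm_logits, reasoning_targets, obs_feat, reasoning_feat,
                        actions, inject, head, num_train_steps=100):
        """Combine (1) next-token prediction on the reasoning text with
        (2) a DDPM-style noise-prediction loss on the action chunk."""
        # (1) Autoregressive reasoning loss over the VLM's text logits.
        ar_loss = F.cross_entropy(vlm_logits.flatten(0, 1), reasoning_targets.flatten())

        # (2) Diffusion loss: corrupt the ground-truth action chunk with noise and
        # train the head to recover the noise, conditioned on the injected reasoning.
        b = actions.shape[0]
        t = torch.randint(0, num_train_steps, (b,), device=actions.device)
        noise = torch.randn_like(actions)
        alpha_bar = torch.cos(t / num_train_steps * torch.pi / 2).pow(2)  # toy cosine schedule
        noisy = (alpha_bar.sqrt()[:, None, None] * actions
                 + (1 - alpha_bar).sqrt()[:, None, None] * noise)
        cond = inject(obs_feat, reasoning_feat)
        diff_loss = F.mse_loss(head(noisy, t, cond), noise)

        return ar_loss + diff_loss

At inference time, the head would be run iteratively from pure noise with a standard sampler (e.g., DDIM); that loop is omitted from the sketch.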

We conduct extensive experiments on multiple real robots to validate the effectiveness of DiVLA. Our tests include a challenging factory sorting task, in which DiVLA successfully categorizes objects, including ones not seen during training. We observe that the reasoning module improves interpretability, allowing observers to follow the model's thought process and identify potential causes of policy failures. We also test DiVLA on a zero-shot bin-picking task, achieving 63.7% accuracy on 102 previously unseen objects. Our method is robust to visual changes, such as distractors and new backgrounds, and adapts easily to new embodiments. Furthermore, DiVLA can follow novel instructions and retains conversational ability. Notably, DiVLA is data-efficient and fast at inference: our smallest model, DiVLA-2B, runs at 82 Hz on a single A6000 GPU and can be trained from scratch on fewer than 50 demonstrations for a complex task. Finally, we scale the model from 2B to 72B parameters, showing improved generalization with increasing model size.

Demos on Challenging Tasks

1. Factory Sorting

The instruction for factory sorting is: "Sort all the items on the right panel." Blue text denotes reasoning generated by our model.

2. Bussing Table

The instruction for bussing the table is: "Sort the tableware and rubbish on the table." Blue text denotes reasoning generated by our model.

Demos on Multi-Tasks & Visual Generalization

Red text denotes instructions. Blue text denotes reasoning generated by our model.

Environmental Setup

Setup for the Franka Robot and Experimental Configuration for Factory Sorting. Left: for the factory sorting task, (a) the target sorting box is divided into four distinct sectors, each designated for one of the following categories: stuffed toys, hex keys, knit gloves, and toy cars; (c) the seen objects in the training data; (d) a mix of seen and unseen objects for evaluation; (e) a cluttered scene with seen objects; (f) a cluttered scene with mixed seen and unseen objects. Middle: we use a Franka robot arm equipped with two external ZED cameras and a RealSense 435i wrist camera. Right: the setup for zero-shot bin picking.

(a) Environmental setup for the bimanual robot, featuring three camera views; (b) Table bussing setup, with a trash bin positioned on the right side and a panel on the left. The task requires placing all tableware on the panel and all trash in the trash bin; (c-f) All tableware and trash items used in the bussing table task evaluation.

Experiments Results

Experimental Results for Factory Sorting. We compare DiVLA with Diffusion Policy, Octo, TinyVLA, and OpenVLA. DiVLA achieves the highest average success rate, outperforming the runner-up OpenVLA by 20.9%. Furthermore, DiVLA exhibits strong zero-shot bin-picking capability, handling objects with varying shapes, heights, and orientations.

Zero-shot Bin Picking on 102 Unseen Objects. Our method outperforms state-of-the-art robot foundation models by a large margin.

Experimental Results for Multi-Task Learning on a Real Robot. We report the number of pre-training trajectories, as well as the average success rate for evaluation in both in-distribution and out-of-distribution settings. Task 1: Select the appropriate object based on the user's intent. Task 2: Upright the tipped-over pot. Task 3: Pick up the cube and place it into the [yellow/blue] box. Task 4: Place the cup onto the plate. Task 5: Place the cube into the box.

BibTeX

    @article{wen2024diffusionvla,
      title={DiffusionVLA: Scaling Robot Foundation Models via Unified Diffusion and Autoregression},
      author={Wen, Junjie and Zhu, Minjie and Zhu, Yichen and Tang, Zhibin and Li, Jinming and Zhou, Zhongyi and Li, Chengmeng and Liu, Xiaoyu and Peng, Yaxin and Shen, Chaomin and Feng, Feifei},
      journal={arXiv preprint arXiv:None},
      year={2024}
    }
    @article{wen2024tinyvla,
      title={TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation},
      author={Wen, Junjie and Zhu, Yichen and Li, Jinming and Zhu, Minjie and Wu, Kun and Xu, Zhiyuan and Cheng, Ran and Shen, Chaomin and Peng, Yaxin and Feng, Feifei and others},
      journal={arXiv preprint arXiv:2409.12514},
      year={2024}
    }