Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models

Haohan Chi*,1, Huan-ang Gao*,1, Ziming Liu†,2, Jianing Liu1,
Chenyu Liu1, Jinwei Li1, Kaisen Yang1, Yangcheng Yu1, Zeda Wang1, Wenyi Li1,
Leichen Wang2, Xingtao Hu2, Hao Sun2, Hang Zhao3, Hao Zhao1,†

1AIR, Tsinghua University    2Bosch Research    3IIIS, Tsinghua University
*Equal contribution    †Corresponding author

A short introduction to Impromptu VLA.

Abstract

Vision-Language-Action (VLA) models for autonomous driving show promise but falter in unstructured corner-case scenarios, largely due to a scarcity of targeted benchmarks. To address this, we introduce Impromptu VLA. Our core contribution is the Impromptu VLA Dataset: over 80,000 meticulously curated video clips, distilled from over 2M source clips drawn from 8 open-source, large-scale datasets. This dataset is built upon our novel taxonomy of four challenging unstructured categories and features rich, planning-oriented question-answering annotations and action trajectories. Crucially, experiments demonstrate that VLAs trained with our dataset achieve substantial performance gains on established benchmarks: closed-loop NeuroNCAP scores and collision rates improve, and open-loop nuScenes trajectory prediction reaches near state-of-the-art L2 accuracy. Furthermore, our Q&A suite serves as an effective diagnostic, revealing clear VLM improvements in perception, prediction, and planning. Our code, data, and models are available at https://github.com/ahydchh/Impromptu-VLA.
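For a concrete sense of what the dataset provides, the snippet below sketches how one curated clip, its planning-oriented Q&A annotation, and its action trajectory might be represented. All field names and values here are hypothetical placeholders for illustration; the actual schema and file layout are those published in the repository.

```python
import json

# Hypothetical layout of a single Impromptu VLA record (illustrative only; the
# real schema is defined by the released dataset, not by this sketch).
sample = {
    "clip_id": "example_000001",                      # placeholder identifier
    "category": "<one of the four unstructured categories>",
    "frames": ["frames/000.jpg", "frames/001.jpg"],   # camera frames for the clip
    "qa": {                                           # planning-oriented Q&A annotation
        "question": "What should the ego vehicle do next, and why?",
        "answer": "Slow down and keep to the right; the lane ahead is partially blocked.",
    },
    "trajectory": [[1.2, 0.0], [2.5, 0.1], [3.9, 0.3]],  # future (x, y) waypoints, ego frame, metres
}

# A VLA fine-tuning pipeline would typically turn each record into an
# image+text prompt with the waypoints serialized as the action target.
print(json.dumps(sample, indent=2))
```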

Results

Open-loop trajectory prediction L2 errors (m) on the nuScenes dataset.
| Method | 1s | 2s | 3s | Avg. |
|---|---|---|---|---|
| **Closed-source API-only Models** | | | | |
| GPT-4o [1] | 0.28 | 0.93 | 2.02 | 1.07 |
| Claude-3.5-Sonnet [1] | 0.29 | 0.98 | 2.12 | 1.13 |
| Claude-3.7-Sonnet [1] | 0.28 | 0.94 | 2.04 | 1.09 |
| Gemini-2.0-Flash [1] | 0.31 | 1.08 | 2.36 | 1.25 |
| Gemini-2.5-Pro [1] | 0.37 | 1.35 | 2.96 | 1.56 |
| **Open-source Generalist VLMs** | | | | |
| LLaVA-1.6-Mistral-7B [2] | 1.49 | 3.38 | 4.09 | 2.98 |
| Llama-3.2-11B-Vision-Instruct [2] | 1.54 | 3.31 | 3.91 | 2.92 |
| Qwen2-VL-7B-Instruct [2] | 1.45 | 3.21 | 3.76 | 2.81 |
| DeepSeek-VL2-16B [1] | 0.66 | 1.68 | 2.92 | 1.75 |
| DeepSeek-VL2-28B [1] | 0.37 | 1.35 | 2.96 | 1.56 |
| LLaMA-3.2-11B-Vision-Instruct [1] | 0.52 | 1.42 | 2.68 | 1.54 |
| LLaMA-3.2-90B-Vision-Instruct [1] | 0.66 | 1.71 | 3.01 | 1.79 |
| Qwen-2.5-VL-7B-Instruct [1] | 0.46 | 1.33 | 2.55 | 1.45 |
| **Training-based Driving Specialists (Existing Methods)** | | | | |
| UniAD [3] | 0.42 | 0.64 | 0.91 | 0.66 |
| VAD [3] | 0.17 | 0.34 | 0.60 | 0.37 |
| BEV-Planner [3] | 0.16 | 0.32 | 0.57 | 0.35 |
| Ego-MLP [3]* | 0.15 | 0.32 | 0.59 | 0.35 |
| **Ours and Key Competitors (Specialized Driving Models)** | | | | |
| DriveVLM [3] | 0.18 | 0.34 | 0.68 | 0.40 |
| OmniDrive [3] | 0.14 | 0.29 | 0.55 | 0.33 |
| DriveVLM-Dual [3] | 0.15 | 0.29 | 0.48 | 0.31 |
| EMMA (random init) [3] | 0.15 | 0.33 | 0.63 | 0.37 |
| EMMA [3] | 0.14 | 0.29 | 0.54 | 0.32 |
| EMMA+ [3] | 0.13 | 0.27 | 0.48 | 0.29 |
| 3B Base+nuScenes | 0.14 | 0.30 | 0.58 | 0.34 |
| 3B Base+Impromptu+nuScenes | 0.13 | 0.27 | 0.52 | 0.30 |
| 7B Base+nuScenes | 0.13 | 0.28 | 0.55 | 0.32 |
| 7B Base+Impromptu+nuScenes | 0.13 | 0.27 | 0.53 | 0.30 |

Note: Best results within each category are in bold, second best are underlined. [1] results from LightEMMA, [2] from OpenEMMA, [3] from EMMA.
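For context on how these numbers are obtained: the open-loop L2 error is the Euclidean distance between the predicted and ground-truth ego positions at each future time step. The sketch below is a minimal illustration under the common nuScenes setup of 2 Hz waypoints over a 3 s horizon; the averaging convention (error at the instant vs. mean over all waypoints up to that horizon) varies between papers, so exact numbers depend on each source's protocol.

```python
import numpy as np

def l2_errors(pred, gt, horizons_s=(1.0, 2.0, 3.0), dt=0.5):
    """L2 error (metres) between predicted and ground-truth ego waypoints.

    pred, gt: arrays of shape (T, 2) with future (x, y) positions in the ego
    frame, sampled every `dt` seconds. Returns the error at each horizon plus
    their mean. (Sketch only; averaging conventions vary across papers.)
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    dists = np.linalg.norm(pred - gt, axis=-1)               # per-waypoint L2 distance
    errs = {f"{h:.0f}s": float(dists[int(h / dt) - 1]) for h in horizons_s}
    errs["avg"] = float(np.mean(list(errs.values())))
    return errs

# Example with a 3-second horizon sampled at 2 Hz (6 waypoints):
pred = [[1.0, 0.0], [2.1, 0.1], [3.0, 0.2], [4.2, 0.2], [5.1, 0.4], [6.3, 0.5]]
gt   = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.1], [4.0, 0.2], [5.0, 0.3], [6.0, 0.4]]
print(l2_errors(pred, gt))
```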
Results on NeuroNCAP
| Source | Method | Score ↑ Avg. | Score ↑ Stat. | Score ↑ Frontal | Score ↑ Side | CR ↓ Avg. | CR ↓ Stat. | CR ↓ Frontal | CR ↓ Side |
|---|---|---|---|---|---|---|---|---|---|
| CVPR 2023 | UniAD [2] | 0.73 | 0.84 | 0.10 | 1.26 | 88.6 | 87.8 | 98.4 | 79.6 |
| ICCV 2023 | VAD [2] | 0.66 | 0.47 | 0.04 | 1.45 | 92.5 | 96.2 | 99.6 | 81.6 |
| ICRA 2025 | SparseDrive [1] | 0.92 | - | - | - | 93.9 | - | - | - |
| CVPR 2025 | BridgeAD-S [1] | 1.52 | - | - | - | 76.2 | - | - | - |
| CVPR 2025 | BridgeAD-B [1] | 1.60 | - | - | - | 72.6 | - | - | - |
| - | Base+nuScenes | 1.77 | 1.80 | 1.67 | 1.75 | 72.5 | 68.0 | 73.0 | 71.5 |
| - | Base+Impromptu+nuScenes | 2.15 | 1.77 | 2.31 | 2.10 | 65.5 | 70.0 | 59.0 | 65.0 |

Note: Score = NeuroNCAP score (higher is better); CR = collision rate in % (lower is better); Stat. = stationary scenarios. Best scores in each category are in bold, second best are underlined. [1] results from BridgeAD, [2] from NeuRAD.
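To make the table's aggregation concrete, the snippet below shows one plausible way per-run outcomes could be rolled up into per-scenario collision rates and average scores. The per-run NeuroNCAP score itself comes from the official evaluation harness (roughly full marks when the collision is avoided, lower scores as the impact speed increases; see the NeuroNCAP paper for the exact definition); the runs and pooling below are made up for illustration, and the published averages may weight scenario types differently.

```python
from statistics import mean

# Hypothetical per-run results for one model, grouped by scenario type.
# Each run carries a collision flag and a precomputed NeuroNCAP score.
runs = {
    "stationary": [(False, 5.0), (True, 0.8), (False, 5.0), (True, 0.0)],
    "frontal":    [(True, 1.2), (True, 0.0), (False, 5.0), (True, 0.4)],
    "side":       [(False, 5.0), (True, 0.6), (True, 0.2), (False, 5.0)],
}

for scenario, results in runs.items():
    collided = [c for c, _ in results]
    scores = [s for _, s in results]
    print(f"{scenario:>10s}: score={mean(scores):.2f}, "
          f"collision rate={100 * sum(collided) / len(collided):.1f}%")

# Overall averages pooled over all runs (reported averages may instead weight
# scenario types by the number of scenarios per type):
all_runs = [r for rs in runs.values() for r in rs]
print(f"overall score={mean(s for _, s in all_runs):.2f}, "
      f"collision rate={100 * sum(c for c, _ in all_runs) / len(all_runs):.1f}%")
```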
The improvement in the overall NeuroNCAP score and, crucially, the reduction in collision rates suggest that our dataset helps the model develop a more nuanced understanding of complex road interactions, leading to more robust and safer driving policies.

Video Gallery

The videos compare the driving behavior of the two models in three representative challenging scenarios: stationary, side, and frontal. In each scenario, the left column shows the base model, fine-tuned on nuScenes only; the right column shows the model first trained on a subset of our proposed dataset and then fine-tuned on nuScenes. Compared to the base model, the model trained with our data avoids other vehicles more reliably, for example by steering around them or slowing down.


Stationary

Base+nuScenes

Base+Impromptu+nuScenes


Side

Base+nuScenes

Base+Impromptu+nuScenes


Frontal

Base+nuScenes

Base+Impromptu+nuScenes