Vision-Language-Action (VLA) models for autonomous driving show promise but falter in unstructured corner-case scenarios, largely due to a scarcity of targeted benchmarks. To address this, we introduce Impromptu VLA. Our core contribution is the Impromptu VLA Dataset: over 80,000 meticulously curated video clips, distilled from more than 2M source clips drawn from 8 open-source large-scale datasets. The dataset is built upon our novel taxonomy of four challenging unstructured categories and features rich, planning-oriented question-answering annotations and action trajectories. Crucially, experiments demonstrate that VLAs trained with our dataset achieve substantial performance gains on established benchmarks: they improve closed-loop NeuroNCAP scores, reduce collision rates, and reach near state-of-the-art L2 accuracy in open-loop nuScenes trajectory prediction. Furthermore, our Q&A suite serves as an effective diagnostic, revealing clear VLM improvements in perception, prediction, and planning. Our code, data, and models are available at https://github.com/ahydchh/Impromptu-VLA.
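For orientation, below is a minimal, hypothetical sketch of what a single Impromptu VLA record might contain, reflecting the taxonomy label, planning-oriented Q&A annotations, and action trajectory described above; all field names and values are illustrative assumptions, not the released schema (see the repository for the actual format).

```python
# Hypothetical illustration of one Impromptu VLA record.
# Field names and values are assumptions for exposition only; the released
# dataset in the repository defines the actual schema.
example_record = {
    "clip_id": "example_000001",
    "source_dataset": "<one of the 8 open-source source datasets>",
    "category": "<one of the four unstructured corner-case categories>",
    "qa_pairs": [  # planning-oriented question-answering annotations
        {
            "question": "What should the ego vehicle do about the obstacle ahead?",
            "answer": "Slow down and keep a safe lateral gap while passing.",
        },
    ],
    "trajectory": [(0.0, 0.0), (1.2, 0.1), (2.5, 0.3)],  # future ego waypoints (x, y), assumed in meters
}
```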
| Method | L2 1s (m) ↓ | L2 2s (m) ↓ | L2 3s (m) ↓ | L2 Avg. (m) ↓ |
|---|---|---|---|---|
| **Closed-source API-only Models** | | | | |
| GPT-4o¹ | 0.28 | 0.93 | 2.02 | 1.07 |
| Claude-3.5-Sonnet¹ | 0.29 | 0.98 | 2.12 | 1.13 |
| Claude-3.7-Sonnet¹ | 0.28 | 0.94 | 2.04 | 1.09 |
| Gemini-2.0-Flash¹ | 0.31 | 1.08 | 2.36 | 1.25 |
| Gemini-2.5-Pro¹ | 0.37 | 1.35 | 2.96 | 1.56 |
| **Open-source Generalist VLMs** | | | | |
| LLaVA-1.6-Mistral-7B² | 1.49 | 3.38 | 4.09 | 2.98 |
| Llama-3.2-11B-Vision-Instruct² | 1.54 | 3.31 | 3.91 | 2.92 |
| Qwen2-VL-7B-Instruct² | 1.45 | 3.21 | 3.76 | 2.81 |
| DeepSeek-VL2-16B¹ | 0.66 | 1.68 | 2.92 | 1.75 |
| DeepSeek-VL2-28B¹ | 0.37 | 1.35 | 2.96 | 1.56 |
| LLaMA-3.2-11B-Vision-Instruct¹ | 0.52 | 1.42 | 2.68 | 1.54 |
| LLaMA-3.2-90B-Vision-Instruct¹ | 0.66 | 1.71 | 3.01 | 1.79 |
| Qwen-2.5-VL-7B-Instruct¹ | 0.46 | 1.33 | 2.55 | 1.45 |
| **Training-based Driving Specialists (Existing Methods)** | | | | |
| UniAD³ | 0.42 | 0.64 | 0.91 | 0.66 |
| VAD³ | 0.17 | 0.34 | 0.60 | 0.37 |
| BEV-Planner³ | 0.16 | 0.32 | 0.57 | 0.35 |
| Ego-MLP³* | 0.15 | 0.32 | 0.59 | 0.35 |
| **Ours and Key Competitors (Specialized Driving Models)** | | | | |
| DriveVLM³ | 0.18 | 0.34 | 0.68 | 0.40 |
| OmniDrive³ | 0.14 | 0.29 | 0.55 | 0.33 |
| DriveVLM-Dual³ | 0.15 | 0.29 | 0.48 | 0.31 |
| EMMA (random init)³ | 0.15 | 0.33 | 0.63 | 0.37 |
| EMMA³ | 0.14 | 0.29 | 0.54 | 0.32 |
| EMMA+³ | 0.13 | 0.27 | 0.48 | 0.29 |
| 3B Base+nuScenes | 0.14 | 0.30 | 0.58 | 0.34 |
| 3B Base+Impromptu+nuScenes | 0.13 | 0.27 | 0.52 | 0.30 |
| 7B Base+nuScenes | 0.13 | 0.28 | 0.55 | 0.32 |
| 7B Base+Impromptu+nuScenes | 0.13 | 0.27 | 0.53 | 0.30 |
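The table above reports the L2 distance (in meters) between predicted and ground-truth ego trajectories at 1 s, 2 s, and 3 s horizons on nuScenes. Conventions differ slightly across papers (error at the horizon waypoint vs. averaged up to it); the sketch below is only an illustrative assumption of the setup, taking the error at the horizon waypoint with 2 Hz sampling.

```python
import numpy as np

def l2_at_horizons(pred, gt, hz=2, horizons_s=(1, 2, 3)):
    """L2 error (m) between predicted and ground-truth ego waypoints.

    pred, gt: arrays of shape (T, 2) holding (x, y) waypoints in meters,
    sampled at `hz` Hz. Returns the error at each horizon plus their mean.
    Note: some papers instead average the errors up to each horizon.
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    dists = np.linalg.norm(pred - gt, axis=-1)            # per-waypoint L2
    errs = {f"{h}s": float(dists[h * hz - 1]) for h in horizons_s}
    errs["Avg."] = float(np.mean(list(errs.values())))
    return errs

# Example: a straight ground-truth path with a slowly drifting prediction
gt   = [(t * 0.5, 0.0) for t in range(1, 7)]              # 2 Hz waypoints over 3 s
pred = [(x, 0.1 * i) for i, (x, _) in enumerate(gt, 1)]   # small lateral drift
print(l2_at_horizons(pred, gt))
```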
| Source | Method | NeuroNCAP Score Avg. ↑ | Stat. ↑ | Frontal ↑ | Side ↑ | Collision rate Avg. (%) ↓ | Stat. (%) ↓ | Frontal (%) ↓ | Side (%) ↓ |
|---|---|---|---|---|---|---|---|---|---|
| CVPR 2023 | UniAD² | 0.73 | 0.84 | 0.10 | 1.26 | 88.6 | 87.8 | 98.4 | 79.6 |
| ICCV 2023 | VAD² | 0.66 | 0.47 | 0.04 | 1.45 | 92.5 | 96.2 | 99.6 | 81.6 |
| ICRA 2025 | SparseDrive¹ | 0.92 | - | - | - | 93.9 | - | - | - |
| CVPR 2025 | BridgeAD-S¹ | 1.52 | - | - | - | 76.2 | - | - | - |
| CVPR 2025 | BridgeAD-B¹ | 1.60 | - | - | - | 72.6 | - | - | - |
| - | Base+nuScenes | 1.77 | 1.80 | 1.67 | 1.75 | 72.5 | 68.0 | 73.0 | 71.5 |
| - | Base+Impromptu+nuScenes | 2.15 | 1.77 | 2.31 | 2.10 | 65.5 | 70.0 | 59.0 | 65.0 |
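Each entry above aggregates many closed-loop runs per scenario type (stationary, frontal, side): the mean NeuroNCAP score (higher is better; the per-run scoring rule is defined by the NeuroNCAP benchmark) and the percentage of runs that ended in a collision. Below is a minimal aggregation sketch under those assumptions, with hypothetical per-run records.

```python
from collections import defaultdict
from statistics import mean

def aggregate(runs):
    """Turn per-run closed-loop results into per-scenario table columns.

    runs: iterable of dicts such as
        {"scenario": "frontal", "score": 3.1, "collided": False}
    where "score" is the per-run NeuroNCAP score and "collided" flags a crash.
    """
    by_type = defaultdict(list)
    for r in runs:
        by_type[r["scenario"]].append(r)

    table = {
        scen: {
            "score": mean(r["score"] for r in rs),
            "collision_rate_%": 100.0 * mean(r["collided"] for r in rs),
        }
        for scen, rs in by_type.items()
    }
    # Macro-average over scenario types (assumed to correspond to the "Avg." column).
    table["Avg."] = {
        "score": mean(v["score"] for v in table.values()),
        "collision_rate_%": mean(v["collision_rate_%"] for v in table.values()),
    }
    return table

# Usage with a few hypothetical runs
runs = [
    {"scenario": "stationary", "score": 2.0, "collided": False},
    {"scenario": "stationary", "score": 0.0, "collided": True},
    {"scenario": "frontal", "score": 1.5, "collided": True},
    {"scenario": "side", "score": 4.0, "collided": False},
]
print(aggregate(runs))
```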
The videos compare the driving behavior of the two models in three representative challenging scenarios: stationary, frontal, and side. In each scenario, the left column shows the base model fine-tuned only on nuScenes, while the right column shows the model trained on a subset of our proposed dataset and then fine-tuned on nuScenes. Compared with the base model, the model trained with our data avoids other vehicles more reliably, for example by steering around them or slowing down.