Vision-Language-Action (VLA) models for autonomous driving show promise but falter in unstructured corner-case scenarios, largely due to a scarcity of targeted benchmarks. To address this, we introduce Impromptu VLA. Our core contribution is the Impromptu VLA Dataset: over 80,000 meticulously curated video clips, distilled from more than 2M source clips drawn from 8 open-source large-scale datasets. The dataset is built upon our novel taxonomy of four challenging unstructured categories and features rich, planning-oriented question-answering annotations and action trajectories. Crucially, experiments demonstrate that VLAs trained with our dataset achieve substantial performance gains on established benchmarks: they improve closed-loop NeuroNCAP scores, reduce collision rates, and reach near state-of-the-art L2 accuracy in open-loop nuScenes trajectory prediction. Furthermore, our Q&A suite serves as an effective diagnostic, revealing clear VLM improvements in perception, prediction, and planning. Our code, data, and models are available at https://github.com/ahydchh/Impromptu-VLA.
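For orientation, below is a minimal, hypothetical sketch of what a single Impromptu VLA record might contain, reflecting the taxonomy label, planning-oriented Q&A annotations, and action trajectory described above; all field names and values are illustrative assumptions, not the released schema (see the repository for the actual format).

```python
# Hypothetical illustration of one Impromptu VLA record.
# Field names and values are assumptions for exposition only; the released
# dataset in the repository defines the actual schema.
example_record = {
    "clip_id": "example_000001",
    "source_dataset": "<one of the 8 open-source source datasets>",
    "category": "<one of the four unstructured corner-case categories>",
    "qa_pairs": [  # planning-oriented question-answering annotations
        {
            "question": "What should the ego vehicle do about the obstacle ahead?",
            "answer": "Slow down and keep a safe lateral gap while passing.",
        },
    ],
    "trajectory": [(0.0, 0.0), (1.2, 0.1), (2.5, 0.3)],  # future ego waypoints (x, y), assumed in meters
}
```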
| Method | L2 1s (m) ↓ | L2 2s (m) ↓ | L2 3s (m) ↓ | L2 Avg. (m) ↓ |
|---|---|---|---|---|
| **Closed-source API-only Models** | | | | |
| GPT-4o¹ | 0.28 | 0.93 | 2.02 | 1.07 |
| Claude-3.5-Sonnet¹ | 0.29 | 0.98 | 2.12 | 1.13 |
| Claude-3.7-Sonnet¹ | 0.28 | 0.94 | 2.04 | 1.09 |
| Gemini-2.0-Flash¹ | 0.31 | 1.08 | 2.36 | 1.25 |
| Gemini-2.5-Pro¹ | 0.37 | 1.35 | 2.96 | 1.56 |
| **Open-source Generalist VLMs** | | | | |
| LLaVA-1.6-Mistral-7B² | 1.49 | 3.38 | 4.09 | 2.98 |
| Llama-3.2-11B-Vision-Instruct² | 1.54 | 3.31 | 3.91 | 2.92 |
| Qwen2-VL-7B-Instruct² | 1.45 | 3.21 | 3.76 | 2.81 |
| DeepSeek-VL2-16B¹ | 0.66 | 1.68 | 2.92 | 1.75 |
| DeepSeek-VL2-28B¹ | 0.37 | 1.35 | 2.96 | 1.56 |
| LLaMA-3.2-11B-Vision-Instruct¹ | 0.52 | 1.42 | 2.68 | 1.54 |
| LLaMA-3.2-90B-Vision-Instruct¹ | 0.66 | 1.71 | 3.01 | 1.79 |
| Qwen-2.5-VL-7B-Instruct¹ | 0.46 | 1.33 | 2.55 | 1.45 |
| **Training-based Driving Specialists (Existing Methods)** | | | | |
| UniAD³ | 0.42 | 0.64 | 0.91 | 0.66 |
| VAD³ | 0.17 | 0.34 | 0.60 | 0.37 |
| BEV-Planner³ | 0.16 | 0.32 | 0.57 | 0.35 |
| Ego-MLP³* | 0.15 | 0.32 | 0.59 | 0.35 |
| **Ours and Key Competitors (Specialized Driving Models)** | | | | |
| DriveVLM³ | 0.18 | 0.34 | 0.68 | 0.40 |
| OmniDrive³ | 0.14 | 0.29 | 0.55 | 0.33 |
| DriveVLM-Dual³ | 0.15 | 0.29 | 0.48 | 0.31 |
| EMMA (random init)³ | 0.15 | 0.33 | 0.63 | 0.37 |
| EMMA³ | 0.14 | 0.29 | 0.54 | 0.32 |
| EMMA+³ | 0.13 | 0.27 | 0.48 | 0.29 |
| 3B Base+nuScenes | 0.14 | 0.30 | 0.58 | 0.34 |
| 3B Base+Impromptu+nuScenes | 0.13 | 0.27 | 0.52 | 0.30 |
| 7B Base+nuScenes | 0.13 | 0.28 | 0.55 | 0.32 |
| 7B Base+Impromptu+nuScenes | 0.13 | 0.27 | 0.53 | 0.30 |
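The table above reports the L2 distance (in meters) between predicted and ground-truth ego trajectories at 1 s, 2 s, and 3 s horizons on nuScenes. Conventions differ slightly across papers (error at the horizon waypoint vs. averaged up to it); the sketch below is only an illustrative assumption of the setup, taking the error at the horizon waypoint with 2 Hz sampling.

```python
import numpy as np

def l2_at_horizons(pred, gt, hz=2, horizons_s=(1, 2, 3)):
    """L2 error (m) between predicted and ground-truth ego waypoints.

    pred, gt: arrays of shape (T, 2) holding (x, y) waypoints in meters,
    sampled at `hz` Hz. Returns the error at each horizon plus their mean.
    Note: some papers instead average the errors up to each horizon.
    """
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    dists = np.linalg.norm(pred - gt, axis=-1)            # per-waypoint L2
    errs = {f"{h}s": float(dists[h * hz - 1]) for h in horizons_s}
    errs["Avg."] = float(np.mean(list(errs.values())))
    return errs

# Example: a straight ground-truth path with a slowly drifting prediction
gt   = [(t * 0.5, 0.0) for t in range(1, 7)]              # 2 Hz waypoints over 3 s
pred = [(x, 0.1 * i) for i, (x, _) in enumerate(gt, 1)]   # small lateral drift
print(l2_at_horizons(pred, gt))
```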
| Source | Method | NeuroNCAP Score Avg. ↑ | Stat. ↑ | Frontal ↑ | Side ↑ | Collision rate Avg. (%) ↓ | Stat. (%) ↓ | Frontal (%) ↓ | Side (%) ↓ |
|---|---|---|---|---|---|---|---|---|---|
| CVPR 2023 | UniAD² | 0.73 | 0.84 | 0.10 | 1.26 | 88.6 | 87.8 | 98.4 | 79.6 |
| ICCV 2023 | VAD² | 0.66 | 0.47 | 0.04 | 1.45 | 92.5 | 96.2 | 99.6 | 81.6 |
| ICRA 2025 | SparseDrive¹ | 0.92 | - | - | - | 93.9 | - | - | - |
| CVPR 2025 | BridgeAD-S¹ | 1.52 | - | - | - | 76.2 | - | - | - |
| CVPR 2025 | BridgeAD-B¹ | 1.60 | - | - | - | 72.6 | - | - | - |
| - | Base+nuScenes | 1.77 | 1.80 | 1.67 | 1.75 | 72.5 | 68.0 | 73.0 | 71.5 |
| - | Base+Impromptu+nuScenes | 2.15 | 1.77 | 2.31 | 2.10 | 65.5 | 70.0 | 59.0 | 65.0 |
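Each entry above aggregates many closed-loop runs per scenario type (stationary, frontal, side): the mean NeuroNCAP score (higher is better; the per-run scoring rule is defined by the NeuroNCAP benchmark) and the percentage of runs that ended in a collision. Below is a minimal aggregation sketch under those assumptions, with hypothetical per-run records.

```python
from collections import defaultdict
from statistics import mean

def aggregate(runs):
    """Turn per-run closed-loop results into per-scenario table columns.

    runs: iterable of dicts such as
        {"scenario": "frontal", "score": 3.1, "collided": False}
    where "score" is the per-run NeuroNCAP score and "collided" flags a crash.
    """
    by_type = defaultdict(list)
    for r in runs:
        by_type[r["scenario"]].append(r)

    table = {
        scen: {
            "score": mean(r["score"] for r in rs),
            "collision_rate_%": 100.0 * mean(r["collided"] for r in rs),
        }
        for scen, rs in by_type.items()
    }
    # Macro-average over scenario types (assumed to correspond to the "Avg." column).
    table["Avg."] = {
        "score": mean(v["score"] for v in table.values()),
        "collision_rate_%": mean(v["collision_rate_%"] for v in table.values()),
    }
    return table

# Usage with a few hypothetical runs
runs = [
    {"scenario": "stationary", "score": 2.0, "collided": False},
    {"scenario": "stationary", "score": 0.0, "collided": True},
    {"scenario": "frontal", "score": 1.5, "collided": True},
    {"scenario": "side", "score": 4.0, "collided": False},
]
print(aggregate(runs))
```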
The videos compare the driving behavior of the two models in three representative challenging scenarios: stationary, frontal, and side. In each scenario, the left column shows the base model fine-tuned only on nuScenes, while the right column shows the model trained on a subset of our proposed dataset and then fine-tuned on nuScenes. Compared with the base model, the model trained with our data avoids other vehicles more reliably, for example by steering around them or slowing down.