Vision-Language-Action Models
Vision-Language-Action (VLA) models are multimodal foundation models that integrate visual perception, natural language understanding, and robot control into a single neural network. Given a camera image and a text instruction like “pick up the red cup,” a VLA directly outputs executable robot actions in one forward pass.
How VLAs Work
```
┌─────────────────────────────────────────────────────────────────┐
│                            VLA Model                            │
│  ┌───────────┐    ┌───────────┐    ┌───────────────────────┐    │
│  │  Camera   │───▶│  Vision   │───▶│                       │    │
│  │  Image    │    │  Encoder  │    │     LLM Backbone      │    │
│  └───────────┘    └───────────┘    │  (Llama, PaLM, etc.)  │───▶ Actions
│                                    │                       │    │
│  ┌───────────┐                     │                       │    │
│  │ "Pick up  │────────────────────▶│                       │    │
│  │ the cup"  │                     └───────────────────────┘    │
│  └───────────┘                                                  │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
               [x, y, z, roll, pitch, yaw, gripper]
                        7-DoF Robot Action
```

Key insight: VLAs treat robot actions as another form of “language” — predicting action tokens just like predicting the next word in a sentence.
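To make the actions-as-tokens idea concrete, the sketch below discretizes a continuous 7-DoF action into 256 bins per dimension and maps each bin to a token ID, the same basic scheme RT-2 and OpenVLA describe. The bounds, bin count, and helper names here are illustrative, not the normalization statistics of any particular model.

```python
import numpy as np

# Illustrative action bounds; real models use per-dataset normalization statistics.
ACTION_LOW = np.array([-0.05, -0.05, -0.05, -0.3, -0.3, -0.3, 0.0])   # [x, y, z, roll, pitch, yaw, gripper]
ACTION_HIGH = np.array([0.05, 0.05, 0.05, 0.3, 0.3, 0.3, 1.0])
NUM_BINS = 256  # each action dimension becomes one of 256 discrete tokens

def action_to_tokens(action: np.ndarray) -> np.ndarray:
    """Map a continuous 7-DoF action to 7 discrete token IDs in [0, 255]."""
    normalized = (action - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)   # scale to [0, 1]
    return np.clip((normalized * (NUM_BINS - 1)).round(), 0, NUM_BINS - 1).astype(int)

def tokens_to_action(tokens: np.ndarray) -> np.ndarray:
    """Invert the binning: token IDs back to approximate continuous values."""
    normalized = tokens / (NUM_BINS - 1)
    return ACTION_LOW + normalized * (ACTION_HIGH - ACTION_LOW)

action = np.array([0.01, -0.02, 0.03, 0.0, 0.1, -0.1, 1.0])
tokens = action_to_tokens(action)      # seven token IDs the LLM can predict like words
recovered = tokens_to_action(tokens)   # close to the original, up to bin resolution
print(tokens, recovered)
```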
Evolution of VLA Models
| Year | Model | Contribution |
|---|---|---|
| 2023 | RT-2 | Pioneered VLA concept — actions as text tokens |
| 2024 | OpenVLA | Open-source 7B model, Apache 2.0 license |
| 2024 | π0 | Flow matching for 50Hz continuous control |
| 2025 | GR00T N1.5 | Dual-system architecture for humanoids |
| 2025 | Gemini Robotics | Highly dexterous manipulation (origami folding) |
| 2025 | SmolVLA | 450M params — runs on consumer hardware |
| 2026 | GR00T N1.6 | Cosmos Reason VLM + whole-body humanoid control |
Architecture Patterns
Single-Model Architecture
Used by RT-2, OpenVLA, and π0. Simpler and lower-latency: a single network maps the observation and instruction straight to actions.
```
Image + Text → [Vision Encoder] → [LLM] → Action Tokens → Robot
                      │             │
               Visual features   Single forward pass
```
Dual-System Architecture
Used by GR00T N1 and Helix. Mirrors human cognition with “fast” and “slow” thinking.
```
┌─────────────────────────────────────────────────────────┐
│  System 2 (Slow): Vision-Language Model                 │
│  • Scene understanding                                  │
│  • Language comprehension                               │
│  • High-level reasoning (~100ms)                        │
└────────────────────────┬────────────────────────────────┘
                         ▼
┌─────────────────────────────────────────────────────────┐
│  System 1 (Fast): Diffusion/Flow Action Generator       │
│  • Real-time motor control (~10ms)                      │
│  • Smooth action trajectories                           │
│  • 24-50Hz control frequency                            │
└─────────────────────────────────────────────────────────┘
```
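A minimal sketch of the two-rate loop the diagram describes, assuming a slow planner refreshed a few times per second and a fast policy queried at control rate; the module names, rates, and stand-in functions are illustrative, not GR00T's or Helix's actual interfaces.

```python
import numpy as np

CONTROL_HZ = 50      # System 1: fast action generator
PLANNER_HZ = 5       # System 2: slow vision-language reasoning (illustrative rates)

def system2_plan(image: np.ndarray, instruction: str) -> np.ndarray:
    """Stand-in for the VLM: returns a latent 'plan' embedding (~100 ms budget)."""
    return np.random.randn(512)

def system1_act(latent_plan: np.ndarray, proprio: np.ndarray) -> np.ndarray:
    """Stand-in for the diffusion/flow policy: returns a 7-DoF action (~10 ms budget)."""
    return np.tanh(latent_plan[:7] + 0.1 * proprio)

latent_plan = None
last_plan_time = 0.0
for step in range(200):                      # ~4 seconds of control at 50 Hz
    now = step / CONTROL_HZ
    if latent_plan is None or now - last_plan_time >= 1.0 / PLANNER_HZ:
        image = np.zeros((224, 224, 3))      # placeholder camera frame
        latent_plan = system2_plan(image, "pick up the cup")
        last_plan_time = now
    proprio = np.zeros(7)                    # placeholder joint state
    action = system1_act(latent_plan, proprio)
    # send `action` to the robot here; the fast loop never blocks on System 2
```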
Action Generation Methods
Different approaches to converting model outputs into robot commands:
| Method | Models | Speed | Notes |
|---|---|---|---|
| Autoregressive tokens | RT-2, OpenVLA | Slow | Actions as 256-bin text tokens |
| Flow matching | π0, GR00T N1 | Fast | Continuous prediction, 50Hz possible |
| Diffusion policy | GR00T N1, Helix | Fast | Smooth multi-modal trajectories |
| FAST tokenization | π0-FAST, OpenVLA-FAST | 5-15x faster | DCT compression of action sequences |
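To illustrate the flow-matching row, the sketch below samples an action chunk by starting from Gaussian noise and integrating a learned velocity field with a few Euler steps, which is the inference pattern π0-style models use. The network, chunk length, step count, and conditioning vector are stand-ins, not any model's real architecture.

```python
import torch

ACTION_DIM, CHUNK_LEN, NUM_STEPS = 7, 50, 10   # illustrative sizes

class VelocityNet(torch.nn.Module):
    """Stand-in for the learned velocity field v_theta(a_t, t, context)."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(ACTION_DIM * CHUNK_LEN + 1 + 128, 256),
            torch.nn.ReLU(),
            torch.nn.Linear(256, ACTION_DIM * CHUNK_LEN),
        )

    def forward(self, actions, t, context):
        flat = torch.cat([actions.flatten(1), t, context], dim=-1)
        return self.net(flat).view_as(actions)

@torch.no_grad()
def sample_action_chunk(model, context):
    """Integrate da/dt = v_theta from t=0 (noise) to t=1 (actions) with Euler steps."""
    actions = torch.randn(1, CHUNK_LEN, ACTION_DIM)        # start from pure noise
    dt = 1.0 / NUM_STEPS
    for i in range(NUM_STEPS):
        t = torch.full((1, 1), i * dt)
        actions = actions + dt * model(actions, t, context)
    return actions                                         # (1, 50, 7) action chunk

context = torch.randn(1, 128)   # stand-in for VLM features of the image + instruction
chunk = sample_action_chunk(VelocityNet(), context)
print(chunk.shape)
```

Because the whole chunk is produced in a handful of denoising steps rather than token by token, this style of generator can keep up with 50Hz control loops.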
Code Example
Basic VLA inference with OpenVLA:
```python
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch

# Load model (requires ~14-19 GB memory)
processor = AutoProcessor.from_pretrained(
    "openvla/openvla-7b",
    trust_remote_code=True
)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
).to("cuda")

# Get observation and instruction
image = Image.open("camera_observation.jpg")
instruction = "Pick up the red block and place it on the blue plate"

# Format prompt (OpenVLA expects this specific format)
prompt = f"In: What action should the robot take to {instruction.lower()}?\nOut:"

# Generate action (move inputs to the GPU and match the model's bfloat16 weights)
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", dtype=torch.bfloat16)
with torch.no_grad():
    action_tokens = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False
    )

# Decode to 7-DoF action: [x, y, z, roll, pitch, yaw, gripper]
action = processor.decode(action_tokens[0], skip_special_tokens=True)
print(f"Predicted action: {action}")
```
Major VLA Models
OpenVLA (Stanford, 2024)
- Size: 7B parameters (SigLIP + DINOv2 + Llama 2)
- Training: 970k real demos from Open X-Embodiment
- Performance: 16.5% higher absolute task success rate than RT-2-X, with 7x fewer parameters
- License: Apache 2.0
GR00T N1.6 (NVIDIA, 2026)
- Architecture: Cosmos Reason VLM (2B) + 32-layer DiT
- Training: 10k+ hours of robot data
- Target: Whole-body humanoid control
- Deployment: Jetson Thor
π0 (Physical Intelligence, 2024)
- Size: 3.3B parameters (PaliGemma backbone)
- Action: Flow matching for 50Hz continuous control
- Training: 7 robot platforms, 68 tasks
SmolVLA (Hugging Face, 2025)
- Size: 450M parameters (compact)
- Training: LeRobot community data (10M frames)
- Target: Consumer hardware deployment
NVIDIA Ecosystem Integration
The GR00T stack for VLA development and deployment:
```
┌──────────────────────────────────────────┐
│             Isaac GR00T N1.6             │
│    (VLA Model + Cosmos Reason VLM)       │
└──────────────────┬───────────────────────┘
                   ▼
┌──────────────────────────────────────────┐
│      Isaac Lab / Isaac Lab-Arena         │
│   (Training, Evaluation, Sim-to-Real)    │
└──────────────────┬───────────────────────┘
                   ▼
┌──────────────────────────────────────────┐
│             Isaac Sim 5.1                │
│  (Physics simulation, synthetic data)    │
└──────────────────┬───────────────────────┘
                   ▼
┌──────────────────────────────────────────┐
│      TensorRT / TensorRT Edge-LLM        │
│    (Optimized inference, FP4 quant)      │
└──────────────────┬───────────────────────┘
                   ▼
┌──────────────────────────────────────────┐
│     Jetson Thor / T4000 / AGX Orin       │
│      (Edge deployment hardware)          │
└──────────────────────────────────────────┘
```
Hardware Requirements
| Model | Memory | Suitable Hardware |
|---|---|---|
| SmolVLA (450M) | ~2 GB | Jetson Orin Nano, consumer GPU |
| X-VLA (0.9B) | ~4 GB | Jetson AGX Orin |
| GR00T N1.6 | ~8-16 GB | Jetson Thor, T4000 |
| OpenVLA (7B) | 14-19 GB | Jetson Thor, datacenter GPU |
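The memory column can be sanity-checked with simple arithmetic: weights alone take roughly parameter count times bytes per parameter, and activations, KV cache, and framework overhead add headroom on top. A hedged sketch of that estimate:

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1, "fp4": 0.5}

def weight_memory_gb(num_params: float, dtype: str = "bf16") -> float:
    """Approximate memory for model weights alone (no activations or KV cache)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for name, params in [("SmolVLA", 0.45e9), ("X-VLA", 0.9e9), ("OpenVLA", 7e9)]:
    print(f"{name}: ~{weight_memory_gb(params):.1f} GB of bf16 weights")
# OpenVLA: ~14 GB of weights; runtime overhead pushes real usage toward the 14-19 GB range
```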
Training Data
Open X-Embodiment Dataset
The foundation for most VLA training:
- 1M+ real robot trajectories
- 22 robot embodiments
- 527 skills, 160k+ tasks
- 60 datasets from 34 research labs
Synthetic Data
NVIDIA Isaac Sim enables rapid data generation:
- GR00T-Dreams: 36 hours vs 3 months manual collection
- Domain randomization for sim-to-real transfer (see the sketch below)
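Domain randomization is simple to express in code: sample a fresh set of simulator parameters for every episode so the policy cannot overfit a single rendering or physics configuration. The parameter names and ranges below are generic placeholders, not Isaac Sim's actual API:

```python
import random

def sample_randomized_sim_params() -> dict:
    """Draw one episode's worth of randomized simulation parameters (illustrative ranges)."""
    return {
        "table_friction": random.uniform(0.4, 1.2),
        "object_mass_kg": random.uniform(0.05, 0.5),
        "light_intensity": random.uniform(300, 1500),
        "camera_jitter_m": [random.uniform(-0.02, 0.02) for _ in range(3)],
        "texture_id": random.randrange(100),      # pick from a texture library
    }

for episode in range(3):
    params = sample_randomized_sim_params()
    # apply `params` to the simulator, roll out a demonstration, and log it as training data
    print(episode, params)
```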
Practical Considerations
Strengths:
- End-to-end learning — no manual feature engineering
- Language-conditioned — flexible task specification
- Transfer learning — pre-trained models adapt quickly (see the fine-tuning sketch after these lists)
Limitations:
- Inference latency — autoregressive models can be slow
- Memory requirements — large models need powerful hardware
- Data hungry — best results need thousands of demonstrations
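As a concrete example of the transfer-learning strength above, a common adaptation recipe is parameter-efficient fine-tuning with LoRA on a small set of in-domain demonstrations. The sketch below attaches LoRA adapters to OpenVLA via the Hugging Face peft library; the rank, alpha, and target-module settings are illustrative defaults rather than officially recommended values.

```python
import torch
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

# Base VLA checkpoint (same one used in the inference example above)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Attach low-rank adapters to every linear layer; only the adapters are trained
lora_cfg = LoraConfig(r=32, lora_alpha=16, lora_dropout=0.0, target_modules="all-linear")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()   # a small fraction of the 7B parameters

# Training loop (not shown): iterate over (image, instruction, action-token) batches
# from your own demonstrations and optimize with a standard next-token loss.
```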
Sources
- RT-2 Paper — Original VLA concept from Google DeepMind
- OpenVLA Paper — Open-source VLA architecture
- OpenVLA GitHub — Official implementation
- GR00T N1 Paper — NVIDIA humanoid foundation model
- Isaac GR00T GitHub — Official NVIDIA repository
- π0 Paper — Physical Intelligence VLA
- SmolVLA Blog — Compact VLA for consumer hardware
- Open X-Embodiment — Cross-embodiment training dataset
- VLA Survey — Comprehensive review of VLA models