
Vision-Language-Action Models

Deep Dive

Vision-Language-Action (VLA) models are multimodal foundation models that integrate visual perception, natural language understanding, and robot control into a single neural network. Given a camera image and a text instruction like “pick up the red cup,” a VLA directly outputs executable robot actions in one forward pass.

How VLAs Work

Camera Image ──▶ Vision Encoder ──┐
                                  ├──▶ LLM Backbone (Llama, PaLM, etc.) ──▶ Actions
"Pick up the cup" ────────────────┘

Action output: [x, y, z, roll, pitch, yaw, gripper]   (7-DoF robot action)

Key insight: VLAs treat robot actions as another form of “language” — predicting action tokens just like predicting the next word in a sentence.
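For autoregressive VLAs such as RT-2 and OpenVLA, that means discretizing each action dimension into a fixed number of bins (commonly 256) and treating the bin indices as tokens in the model's vocabulary. Below is a minimal sketch of that discretization round trip; the bin count and normalization range are illustrative, not any specific model's values:

import numpy as np

N_BINS = 256           # illustrative; RT-2/OpenVLA-style models use 256 bins
LOW, HIGH = -1.0, 1.0  # assume actions are normalized to [-1, 1]

def action_to_tokens(action: np.ndarray) -> np.ndarray:
    """Map a continuous 7-DoF action to per-dimension bin indices (token IDs)."""
    clipped = np.clip(action, LOW, HIGH)
    bins = np.round((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1))
    return bins.astype(np.int64)

def tokens_to_action(tokens: np.ndarray) -> np.ndarray:
    """Invert the discretization back to (approximate) continuous values."""
    return tokens.astype(np.float64) / (N_BINS - 1) * (HIGH - LOW) + LOW

action = np.array([0.02, -0.15, 0.30, 0.0, 0.0, 0.1, 1.0])  # [x, y, z, r, p, y, grip]
tokens = action_to_tokens(action)
print(tokens)                    # one token ID per action dimension
print(tokens_to_action(tokens))  # close to the original action (quantization error only)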

Evolution of VLA Models

Year   Model              Contribution
2023   RT-2               Pioneered the VLA concept: actions as text tokens
2024   OpenVLA            Open-source 7B model, Apache 2.0 license
2024   π0                 Flow matching for 50Hz continuous control
2025   GR00T N1.5         Dual-system architecture for humanoids
2025   Gemini Robotics    Highly dexterous manipulation (origami folding)
2025   SmolVLA            450M params; runs on consumer hardware
2026   GR00T N1.6         Cosmos Reason VLM + whole-body humanoid control

Architecture Patterns

Single-Model Architecture

Used by RT-2, OpenVLA, and π0. A single network maps the image and instruction directly to actions, which keeps the design simple and the inference path short.

Image + Text → [Vision Encoder] → [LLM] → Action Tokens → Robot
                      │             │
               Visual features  Single forward pass

Dual-System Architecture

Used by GR00T N1 and Helix. Mirrors human cognition with “fast” and “slow” thinking.

┌────────────────────────────────────────────────────┐
│  System 2 (Slow): Vision-Language Model            │
│  • Scene understanding                             │
│  • Language comprehension                          │
│  • High-level reasoning (~100ms)                   │
└─────────────────────────┬──────────────────────────┘
                          ▼
┌────────────────────────────────────────────────────┐
│  System 1 (Fast): Diffusion/Flow Action Generator  │
│  • Real-time motor control (~10ms)                 │
│  • Smooth action trajectories                      │
│  • 24-50Hz control frequency                       │
└────────────────────────────────────────────────────┘
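In code, the split amounts to two loops running at different rates: the slow system refreshes a high-level plan a few times per second, and the fast system emits an action on every control tick. The sketch below shows that structure only; the function names (read_camera, run_vlm_planner, run_action_policy, send_to_robot) and the rates are illustrative placeholders, not any framework's API:

import numpy as np

CONTROL_HZ = 50      # fast-loop rate (System 1)
PLANNER_EVERY = 5    # refresh the plan every 5 ticks -> ~10 Hz (System 2)

def read_camera():
    # Placeholder: return a dummy RGB observation.
    return np.zeros((224, 224, 3), dtype=np.uint8)

def run_vlm_planner(image, instruction):
    # Placeholder for System 2: a VLM producing a latent plan / subgoal embedding.
    return np.random.randn(64).astype(np.float32)

def run_action_policy(image, latent_plan):
    # Placeholder for System 1: a fast policy head that outputs a 7-DoF action.
    return np.tanh(np.random.randn(7)).astype(np.float32)

def send_to_robot(action):
    pass  # placeholder: forward the action to the robot controller

instruction = "Pick up the cup"
latent_plan = None
for tick in range(200):                    # ~4 seconds of control at 50 Hz
    image = read_camera()
    if tick % PLANNER_EVERY == 0:          # slow path: update the plan
        latent_plan = run_vlm_planner(image, instruction)
    action = run_action_policy(image, latent_plan)  # fast path: every tick
    send_to_robot(action)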

Action Generation Methods

Different approaches to converting model outputs into robot commands:

Method                  Models                  Speed          Notes
Autoregressive tokens   RT-2, OpenVLA           Slow           Actions as 256-bin text tokens
Flow matching           π0, GR00T N1            Fast           Continuous prediction, 50Hz possible
Diffusion policy        GR00T N1, Helix         Fast           Smooth multi-modal trajectories
FAST tokenization       π0-FAST, OpenVLA-FAST   5-15x faster   DCT compression of action sequences
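The idea behind FAST-style tokenization can be illustrated with a discrete cosine transform: a chunk of actions is transformed along the time axis, only the low-frequency coefficients are kept, and the chunk is reconstructed from those few values. This is a minimal sketch of that compression idea using scipy, not the actual FAST tokenizer (which additionally quantizes and byte-pair encodes the coefficients):

import numpy as np
from scipy.fft import dct, idct

# A chunk of 32 consecutive 7-DoF actions (rows = timesteps, cols = action dims).
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 32)[:, None]
chunk = np.sin(2 * np.pi * t * np.arange(1, 8)) + 0.01 * rng.standard_normal((32, 7))

# DCT along the time axis: smooth trajectories concentrate their energy
# in the low-frequency coefficients.
coeffs = dct(chunk, type=2, norm="ortho", axis=0)

# Keep only the first K frequency components per dimension (lossy compression).
K = 8
compressed = np.zeros_like(coeffs)
compressed[:K] = coeffs[:K]

# Reconstruct the action chunk from the retained coefficients.
recon = idct(compressed, type=2, norm="ortho", axis=0)
print("compression ratio:", chunk.shape[0] / K)               # 4x fewer values per dim
print("max reconstruction error:", np.abs(recon - chunk).max())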

Code Example

Basic VLA inference with OpenVLA:

from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch

# Load model (requires ~14-19 GB of GPU memory in bfloat16)
processor = AutoProcessor.from_pretrained(
    "openvla/openvla-7b",
    trust_remote_code=True,
)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda")

# Get observation and instruction
image = Image.open("camera_observation.jpg")
instruction = "Pick up the red block and place it on the blue plate"

# Format prompt (OpenVLA expects this specific template)
prompt = f"In: What action should the robot take to {instruction.lower()}?\nOut:"

# Predict action; unnorm_key selects the dataset statistics used to
# un-normalize the output (here: BridgeData V2). Use the key matching your robot setup.
inputs = processor(prompt, image).to("cuda", dtype=torch.bfloat16)
with torch.no_grad():
    action = model.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)

# action is a 7-DoF vector: [x, y, z, roll, pitch, yaw, gripper]
print(f"Predicted action: {action}")

Major VLA Models

OpenVLA (Stanford, 2024)

  • Size: 7B parameters (SigLIP + DINOv2 vision encoders, Llama 2 backbone)
  • Training: 970k real robot demonstrations from Open X-Embodiment
  • Performance: 16.5% higher absolute task success rate than RT-2-X, with 7x fewer parameters
  • License: Apache 2.0

GR00T N1.6 (NVIDIA, 2026)

  • Architecture: Cosmos Reason VLM (2B) + 32-layer DiT
  • Training: 10k+ hours of robot data
  • Target: Whole-body humanoid control
  • Deployment: Jetson Thor

π0 (Physical Intelligence, 2024)

  • Size: 3.3B parameters (PaliGemma backbone)
  • Action: Flow matching for 50Hz continuous control (see the sketch after this list)
  • Training: 7 robot platforms, 68 tasks
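Flow matching generates a continuous action (or action chunk) at inference time by integrating a learned velocity field from Gaussian noise toward the action distribution, conditioned on the observation. The sketch below shows only that integration loop; the velocity network is a tiny stand-in and the conditioning vector is a placeholder, not π0's actual architecture:

import torch
import torch.nn as nn

ACTION_DIM = 7    # [x, y, z, roll, pitch, yaw, gripper]
COND_DIM = 128    # stand-in for the VLM's observation embedding

# Stand-in velocity network v_theta(a_t, t, obs) -> da/dt.
# In a real model this is a large transformer "action expert".
velocity_net = nn.Sequential(
    nn.Linear(ACTION_DIM + 1 + COND_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, ACTION_DIM),
)

@torch.no_grad()
def generate_action(obs_embedding: torch.Tensor, num_steps: int = 10) -> torch.Tensor:
    """Euler-integrate the flow from noise (t=0) to an action sample (t=1)."""
    a = torch.randn(1, ACTION_DIM)             # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((1, 1), i * dt)
        v = velocity_net(torch.cat([a, t, obs_embedding], dim=-1))
        a = a + dt * v                          # one Euler step along the flow
    return a

obs_embedding = torch.randn(1, COND_DIM)        # placeholder observation features
print(generate_action(obs_embedding))           # a 7-DoF action sample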

SmolVLA (Hugging Face, 2025)

  • Size: 450M parameters (compact)
  • Training: LeRobot community data (10M frames)
  • Target: Consumer hardware deployment

NVIDIA Ecosystem Integration

The GR00T stack for VLA development and deployment:

┌────────────────────────────────────────┐
│            Isaac GR00T N1.6            │
│    (VLA Model + Cosmos Reason VLM)     │
└───────────────────┬────────────────────┘
                    ▼
┌────────────────────────────────────────┐
│      Isaac Lab / Isaac Lab-Arena       │
│  (Training, Evaluation, Sim-to-Real)   │
└───────────────────┬────────────────────┘
                    ▼
┌────────────────────────────────────────┐
│             Isaac Sim 5.1              │
│  (Physics simulation, synthetic data)  │
└───────────────────┬────────────────────┘
                    ▼
┌────────────────────────────────────────┐
│      TensorRT / TensorRT Edge-LLM      │
│    (Optimized inference, FP4 quant)    │
└───────────────────┬────────────────────┘
                    ▼
┌────────────────────────────────────────┐
│     Jetson Thor / T4000 / AGX Orin     │
│       (Edge deployment hardware)       │
└────────────────────────────────────────┘

Hardware Requirements

Model            Memory     Suitable Hardware
SmolVLA (450M)   ~2 GB      Jetson Orin Nano, consumer GPU
X-VLA (0.9B)     ~4 GB      Jetson AGX Orin
GR00T N1.6       ~8-16 GB   Jetson Thor, T4000
OpenVLA (7B)     14-19 GB   Jetson Thor, datacenter GPU
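These figures follow roughly from parameter count times bytes per parameter, plus overhead for activations and the KV cache. A quick back-of-the-envelope check (the dtype sizes are exact; treating the remainder as overhead is an assumption):

BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1, "fp4": 0.5}

def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for name, params in [("SmolVLA", 0.45e9), ("OpenVLA", 7e9)]:
    gb = weight_memory_gb(params, "bf16")
    print(f"{name}: ~{gb:.1f} GB of weights in bf16 (more with KV cache/activations)")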

Training Data

Open X-Embodiment Dataset

The foundation for most VLA training:

  • 1M+ real robot trajectories
  • 22 robot embodiments
  • 527 skills, 160k+ tasks
  • 60 datasets from 34 research labs
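Most of these trajectories are stored episodically: each episode carries a language instruction and a sequence of steps, and each step holds a camera observation plus the action that was executed. The structure below is a hand-built illustration of that layout, not code for loading the real dataset (field names vary across the 60 constituent datasets):

import numpy as np

rng = np.random.default_rng(0)

def make_step():
    """One timestep: camera image plus the 7-DoF action that was executed."""
    return {
        "observation": {"image": rng.integers(0, 256, (224, 224, 3), dtype=np.uint8)},
        "action": rng.uniform(-1, 1, 7).astype(np.float32),  # [dx, dy, dz, droll, dpitch, dyaw, gripper]
    }

# One illustrative episode; real trajectories are typically tens to hundreds of steps.
episode = {
    "language_instruction": "pick up the red cup",
    "steps": [make_step() for _ in range(50)],
}

print(len(episode["steps"]), episode["steps"][0]["action"])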

Synthetic Data

NVIDIA Isaac Sim enables rapid data generation:

  • GR00T-Dreams: ~36 hours of generated data versus ~3 months of manual collection
  • Domain randomization for sim-to-real transfer
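Domain randomization re-samples nuisance parameters of the simulated scene (lighting, friction, masses, camera pose, and so on) for every episode, so a policy trained in simulation does not overfit to a single rendering of the world. A minimal sketch of the idea; the parameter names and ranges are illustrative, not Isaac Sim's API:

import random

# Illustrative randomization ranges; real setups randomize many more properties.
RANDOMIZATION_RANGES = {
    "light_intensity": (300.0, 3000.0),   # lux
    "table_friction": (0.4, 1.2),
    "object_mass_kg": (0.05, 0.5),
    "camera_yaw_deg": (-10.0, 10.0),
}

def sample_episode_params(seed=None):
    """Draw one set of simulator parameters for the next training episode."""
    rng = random.Random(seed)
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANDOMIZATION_RANGES.items()}

for episode in range(3):
    params = sample_episode_params(seed=episode)
    # In practice these values would be applied to the simulator scene
    # before collecting that episode's synthetic demonstrations.
    print(params)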

Practical Considerations

Strengths:

  • End-to-end learning — no manual feature engineering
  • Language-conditioned — flexible task specification
  • Transfer learning — pre-trained models adapt quickly

Limitations:

  • Inference latency — autoregressive models can be slow
  • Memory requirements — large models need powerful hardware
  • Data hungry — best results need thousands of demonstrations

Sources