
Vision-Language-Action Models

Deep Dive

Vision-Language-Action (VLA) models are multimodal foundation models that integrate visual perception, natural language understanding, and robot control into a single neural network. Given a camera image and a text instruction like “pick up the red cup,” a VLA directly outputs executable robot actions in one forward pass.

How VLAs Work

Camera Image ──▶ Vision Encoder ──┐
                                  ├──▶ LLM Backbone (Llama, PaLM, etc.) ──▶ Actions
"Pick up the cup" ────────────────┘

Action output: [x, y, z, roll, pitch, yaw, gripper]   (7-DoF robot action)

Key insight: VLAs treat robot actions as another form of “language” — predicting action tokens just like predicting the next word in a sentence.
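For autoregressive VLAs such as RT-2 and OpenVLA, that means discretizing each action dimension into a fixed number of bins (commonly 256) and treating the bin indices as tokens in the model's vocabulary. Below is a minimal sketch of that discretization round trip; the bin count and normalization range are illustrative, not any specific model's values:

import numpy as np

N_BINS = 256           # illustrative; RT-2/OpenVLA-style models use 256 bins
LOW, HIGH = -1.0, 1.0  # assume actions are normalized to [-1, 1]

def action_to_tokens(action: np.ndarray) -> np.ndarray:
    """Map a continuous 7-DoF action to per-dimension bin indices (token IDs)."""
    clipped = np.clip(action, LOW, HIGH)
    bins = np.round((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1))
    return bins.astype(np.int64)

def tokens_to_action(tokens: np.ndarray) -> np.ndarray:
    """Invert the discretization back to (approximate) continuous values."""
    return tokens.astype(np.float64) / (N_BINS - 1) * (HIGH - LOW) + LOW

action = np.array([0.02, -0.15, 0.30, 0.0, 0.0, 0.1, 1.0])  # [x, y, z, r, p, y, grip]
tokens = action_to_tokens(action)
print(tokens)                    # one token ID per action dimension
print(tokens_to_action(tokens))  # close to the original action (quantization error only)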

Evolution of VLA Models

Year   Model              Contribution
2023   RT-2               Pioneered the VLA concept: actions as text tokens
2024   OpenVLA            Open-source 7B model, Apache 2.0 license
2024   π0                 Flow matching for 50Hz continuous control
2025   GR00T N1.5         Dual-system architecture for humanoids
2025   Gemini Robotics    Highly dexterous manipulation (origami folding)
2025   SmolVLA            450M params; runs on consumer hardware
2026   GR00T N1.6         Cosmos Reason VLM + whole-body humanoid control

Architecture Patterns

Single-Model Architecture

Used by RT-2, OpenVLA, and π0. A single network maps the image and instruction directly to actions, which keeps the design simple and the inference path short.

Image + Text → [Vision Encoder] → [LLM] → Action Tokens → Robot
                      │             │
               Visual features  Single forward pass

Dual-System Architecture

Used by GR00T N1 and Helix. Mirrors human cognition with “fast” and “slow” thinking.

┌────────────────────────────────────────────────────┐
│  System 2 (Slow): Vision-Language Model            │
│  • Scene understanding                             │
│  • Language comprehension                          │
│  • High-level reasoning (~100ms)                   │
└─────────────────────────┬──────────────────────────┘
                          ▼
┌────────────────────────────────────────────────────┐
│  System 1 (Fast): Diffusion/Flow Action Generator  │
│  • Real-time motor control (~10ms)                 │
│  • Smooth action trajectories                      │
│  • 24-50Hz control frequency                       │
└────────────────────────────────────────────────────┘
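In code, the split amounts to two loops running at different rates: the slow system refreshes a high-level plan a few times per second, and the fast system emits an action on every control tick. The sketch below shows that structure only; the function names (read_camera, run_vlm_planner, run_action_policy, send_to_robot) and the rates are illustrative placeholders, not any framework's API:

import numpy as np

CONTROL_HZ = 50      # fast-loop rate (System 1)
PLANNER_EVERY = 5    # refresh the plan every 5 ticks -> ~10 Hz (System 2)

def read_camera():
    # Placeholder: return a dummy RGB observation.
    return np.zeros((224, 224, 3), dtype=np.uint8)

def run_vlm_planner(image, instruction):
    # Placeholder for System 2: a VLM producing a latent plan / subgoal embedding.
    return np.random.randn(64).astype(np.float32)

def run_action_policy(image, latent_plan):
    # Placeholder for System 1: a fast policy head that outputs a 7-DoF action.
    return np.tanh(np.random.randn(7)).astype(np.float32)

def send_to_robot(action):
    pass  # placeholder: forward the action to the robot controller

instruction = "Pick up the cup"
latent_plan = None
for tick in range(200):                    # ~4 seconds of control at 50 Hz
    image = read_camera()
    if tick % PLANNER_EVERY == 0:          # slow path: update the plan
        latent_plan = run_vlm_planner(image, instruction)
    action = run_action_policy(image, latent_plan)  # fast path: every tick
    send_to_robot(action)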

Action Generation Methods

Different approaches to converting model outputs into robot commands:

Method                  Models                  Speed          Notes
Autoregressive tokens   RT-2, OpenVLA           Slow           Actions as 256-bin text tokens
Flow matching           π0, GR00T N1            Fast           Continuous prediction, 50Hz possible
Diffusion policy        GR00T N1, Helix         Fast           Smooth multi-modal trajectories
FAST tokenization       π0-FAST, OpenVLA-FAST   5-15x faster   DCT compression of action sequences
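The idea behind FAST-style tokenization can be illustrated with a discrete cosine transform: a chunk of actions is transformed along the time axis, only the low-frequency coefficients are kept, and the chunk is reconstructed from those few values. This is a minimal sketch of that compression idea using scipy, not the actual FAST tokenizer (which additionally quantizes and byte-pair encodes the coefficients):

import numpy as np
from scipy.fft import dct, idct

# A chunk of 32 consecutive 7-DoF actions (rows = timesteps, cols = action dims).
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 32)[:, None]
chunk = np.sin(2 * np.pi * t * np.arange(1, 8)) + 0.01 * rng.standard_normal((32, 7))

# DCT along the time axis: smooth trajectories concentrate their energy
# in the low-frequency coefficients.
coeffs = dct(chunk, type=2, norm="ortho", axis=0)

# Keep only the first K frequency components per dimension (lossy compression).
K = 8
compressed = np.zeros_like(coeffs)
compressed[:K] = coeffs[:K]

# Reconstruct the action chunk from the retained coefficients.
recon = idct(compressed, type=2, norm="ortho", axis=0)
print("compression ratio:", chunk.shape[0] / K)               # 4x fewer values per dim
print("max reconstruction error:", np.abs(recon - chunk).max())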

Code Example

Basic VLA inference with OpenVLA:

from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch

# Load model (requires ~14-19 GB of GPU memory in bfloat16)
processor = AutoProcessor.from_pretrained(
    "openvla/openvla-7b",
    trust_remote_code=True,
)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).to("cuda")

# Get observation and instruction
image = Image.open("camera_observation.jpg")
instruction = "Pick up the red block and place it on the blue plate"

# Format prompt (OpenVLA expects this specific template)
prompt = f"In: What action should the robot take to {instruction.lower()}?\nOut:"

# Predict action; unnorm_key selects the dataset statistics used to
# un-normalize the output (here: BridgeData V2). Use the key matching your robot setup.
inputs = processor(prompt, image).to("cuda", dtype=torch.bfloat16)
with torch.no_grad():
    action = model.predict_action(**inputs, unnorm_key="bridge_orig", do_sample=False)

# action is a 7-DoF vector: [x, y, z, roll, pitch, yaw, gripper]
print(f"Predicted action: {action}")

Major VLA Models

OpenVLA (Stanford, 2024)

  • Size: 7B parameters (SigLIP + DINOv2 vision encoders, Llama 2 backbone)
  • Training: 970k real robot demonstrations from Open X-Embodiment
  • Performance: 16.5% higher absolute task success rate than RT-2-X, with 7x fewer parameters
  • License: Apache 2.0

GR00T N1.6 (NVIDIA, 2026)

  • Architecture: Cosmos Reason VLM (2B) + 32-layer DiT
  • Training: 10k+ hours of robot data
  • Target: Whole-body humanoid control
  • Deployment: Jetson Thor

π0 (Physical Intelligence, 2024)

  • Size: 3.3B parameters (PaliGemma backbone)
  • Action: Flow matching for 50Hz continuous control (see the sketch after this list)
  • Training: 7 robot platforms, 68 tasks
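Flow matching generates a continuous action (or action chunk) at inference time by integrating a learned velocity field from Gaussian noise toward the action distribution, conditioned on the observation. The sketch below shows only that integration loop; the velocity network is a tiny stand-in and the conditioning vector is a placeholder, not π0's actual architecture:

import torch
import torch.nn as nn

ACTION_DIM = 7    # [x, y, z, roll, pitch, yaw, gripper]
COND_DIM = 128    # stand-in for the VLM's observation embedding

# Stand-in velocity network v_theta(a_t, t, obs) -> da/dt.
# In a real model this is a large transformer "action expert".
velocity_net = nn.Sequential(
    nn.Linear(ACTION_DIM + 1 + COND_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, ACTION_DIM),
)

@torch.no_grad()
def generate_action(obs_embedding: torch.Tensor, num_steps: int = 10) -> torch.Tensor:
    """Euler-integrate the flow from noise (t=0) to an action sample (t=1)."""
    a = torch.randn(1, ACTION_DIM)             # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((1, 1), i * dt)
        v = velocity_net(torch.cat([a, t, obs_embedding], dim=-1))
        a = a + dt * v                          # one Euler step along the flow
    return a

obs_embedding = torch.randn(1, COND_DIM)        # placeholder observation features
print(generate_action(obs_embedding))           # a 7-DoF action sample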

SmolVLA (Hugging Face, 2025)

  • Size: 450M parameters (compact)
  • Training: LeRobot community data (10M frames)
  • Target: Consumer hardware deployment

NVIDIA Ecosystem Integration

The GR00T stack for VLA development and deployment:

┌────────────────────────────────────────┐
│            Isaac GR00T N1.6            │
│    (VLA Model + Cosmos Reason VLM)     │
└───────────────────┬────────────────────┘
                    ▼
┌────────────────────────────────────────┐
│      Isaac Lab / Isaac Lab-Arena       │
│  (Training, Evaluation, Sim-to-Real)   │
└───────────────────┬────────────────────┘
                    ▼
┌────────────────────────────────────────┐
│             Isaac Sim 5.1              │
│  (Physics simulation, synthetic data)  │
└───────────────────┬────────────────────┘
                    ▼
┌────────────────────────────────────────┐
│      TensorRT / TensorRT Edge-LLM      │
│    (Optimized inference, FP4 quant)    │
└───────────────────┬────────────────────┘
                    ▼
┌────────────────────────────────────────┐
│     Jetson Thor / T4000 / AGX Orin     │
│       (Edge deployment hardware)       │
└────────────────────────────────────────┘

Hardware Requirements

Model            Memory     Suitable Hardware
SmolVLA (450M)   ~2 GB      Jetson Orin Nano, consumer GPU
X-VLA (0.9B)     ~4 GB      Jetson AGX Orin
GR00T N1.6       ~8-16 GB   Jetson Thor, T4000
OpenVLA (7B)     14-19 GB   Jetson Thor, datacenter GPU
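These figures follow roughly from parameter count times bytes per parameter, plus overhead for activations and the KV cache. A quick back-of-the-envelope check (the dtype sizes are exact; treating the remainder as overhead is an assumption):

BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1, "fp4": 0.5}

def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for name, params in [("SmolVLA", 0.45e9), ("OpenVLA", 7e9)]:
    gb = weight_memory_gb(params, "bf16")
    print(f"{name}: ~{gb:.1f} GB of weights in bf16 (more with KV cache/activations)")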

Training Data

Open X-Embodiment Dataset

The foundation for most VLA training:

  • 1M+ real robot trajectories
  • 22 robot embodiments
  • 527 skills, 160k+ tasks
  • 60 datasets from 34 research labs
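Most of these trajectories are stored episodically: each episode carries a language instruction and a sequence of steps, and each step holds a camera observation plus the action that was executed. The structure below is a hand-built illustration of that layout, not code for loading the real dataset (field names vary across the 60 constituent datasets):

import numpy as np

rng = np.random.default_rng(0)

def make_step():
    """One timestep: camera image plus the 7-DoF action that was executed."""
    return {
        "observation": {"image": rng.integers(0, 256, (224, 224, 3), dtype=np.uint8)},
        "action": rng.uniform(-1, 1, 7).astype(np.float32),  # [dx, dy, dz, droll, dpitch, dyaw, gripper]
    }

# One illustrative episode; real trajectories are typically tens to hundreds of steps.
episode = {
    "language_instruction": "pick up the red cup",
    "steps": [make_step() for _ in range(50)],
}

print(len(episode["steps"]), episode["steps"][0]["action"])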

Synthetic Data

NVIDIA Isaac Sim enables rapid data generation:

  • GR00T-Dreams: ~36 hours of generated data versus ~3 months of manual collection
  • Domain randomization for sim-to-real transfer
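Domain randomization re-samples nuisance parameters of the simulated scene (lighting, friction, masses, camera pose, and so on) for every episode, so a policy trained in simulation does not overfit to a single rendering of the world. A minimal sketch of the idea; the parameter names and ranges are illustrative, not Isaac Sim's API:

import random

# Illustrative randomization ranges; real setups randomize many more properties.
RANDOMIZATION_RANGES = {
    "light_intensity": (300.0, 3000.0),   # lux
    "table_friction": (0.4, 1.2),
    "object_mass_kg": (0.05, 0.5),
    "camera_yaw_deg": (-10.0, 10.0),
}

def sample_episode_params(seed=None):
    """Draw one set of simulator parameters for the next training episode."""
    rng = random.Random(seed)
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANDOMIZATION_RANGES.items()}

for episode in range(3):
    params = sample_episode_params(seed=episode)
    # In practice these values would be applied to the simulator scene
    # before collecting that episode's synthetic demonstrations.
    print(params)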

Practical Considerations

Strengths:

  • End-to-end learning — no manual feature engineering
  • Language-conditioned — flexible task specification
  • Transfer learning — pre-trained models adapt quickly

Limitations:

  • Inference latency — autoregressive models can be slow
  • Memory requirements — large models need powerful hardware
  • Data hungry — best results need thousands of demonstrations

Sources