
Vision-Language-Action Models

Deep Dive

Vision-Language-Action (VLA) models are foundation models that take visual observations and language instructions as input and directly output robot actions. They represent a paradigm shift from hand-engineered perception-planning-control pipelines to end-to-end learned policies that can generalize across tasks, environments, and even robot embodiments.


How VLAs Work

┌──────────────────────────────────────────────────────────┐
│ VLA Model │
│ │
│ Camera Images ─────┐ │
│ ├──► Vision ┌─────────────┐ │
│ Depth (optional) ──┘ Encoder │ │ │
│ │ │ Action │ │
│ ▼ │ Decoder │──► Joint Commands
│ "Pick up the Language │ │ Gripper Actions
│ red cup" ──────────► Encoder │ (diffusion │ Velocities
│ │ │ or auto- │ │
│ Joint Positions ────► Proprio- │ regressive│ │
│ Gripper State ception │ or flow) │ │
│ │ └─────────────┘ │
│ ▼ │
│ Fused Representation │
└──────────────────────────────────────────────────────────┘

The key insight: instead of decomposing robotics into separate perception, planning, and control modules, VLAs learn the entire mapping end-to-end from raw sensory data to motor commands.
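That end-to-end mapping can be caricatured in a few lines. In this sketch, random matrices stand in for the learned encoders and action decoder, and every dimension is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only, not from any real VLA)
IMG_DIM, TXT_DIM, PROPRIO_DIM, FUSED_DIM, ACTION_DIM = 64, 32, 14, 128, 14

# Randomly initialised matrices stand in for learned networks
W_img = rng.standard_normal((IMG_DIM, FUSED_DIM)) * 0.1
W_txt = rng.standard_normal((TXT_DIM, FUSED_DIM)) * 0.1
W_pro = rng.standard_normal((PROPRIO_DIM, FUSED_DIM)) * 0.1
W_act = rng.standard_normal((FUSED_DIM, ACTION_DIM)) * 0.1

def vla_forward(image_feat, text_feat, proprio):
    # Fuse all modalities into one representation ...
    fused = np.tanh(image_feat @ W_img + text_feat @ W_txt + proprio @ W_pro)
    # ... then decode it directly into motor commands
    return fused @ W_act

action = vla_forward(rng.standard_normal(IMG_DIM),
                     rng.standard_normal(TXT_DIM),
                     rng.standard_normal(PROPRIO_DIM))
print(action.shape)  # one command per joint
```

The point of the sketch: there is no separate planner or controller anywhere; one learned function maps fused sensory input straight to actions.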

The VLA Landscape (2023–2026)

2023 2024 2025 2026
│ │ │ │
RT-2 ACT π₀ π*₀.₆
│ Octo π₀.₅ SmolVLA
│ OpenVLA GR00T N1 Scaling
│ RT-H ↓
│ Production
│ Deployments

Key Models

ACT (Action Chunking with Transformers)

The lightweight pioneer that showed predicting chunks of future actions (not just one step) dramatically improves imitation learning.

| Aspect | Detail |
| --- | --- |
| Parameters | ~80M |
| Architecture | CVAE encoder-decoder with Transformer |
| Input | RGB images + proprioception |
| Output | Action chunks (k future timesteps) |
| Training | Imitation learning from demonstrations |
| Benchmark | Bimanual tasks: ~90% success on ALOHA benchmarks |
| Open source | Yes, via LeRobot |

How action chunking works:

Standard policy: t₁ → a₁, t₂ → a₂, t₃ → a₃ ...
(each step can compound errors)
ACT (chunk size 4): t₁ → [a₁, a₂, a₃, a₄]
(predict sequence, reduce horizon by 4x)

ACT also uses temporal ensembling — querying the policy faster than the chunk duration and averaging overlapping predictions for smoother execution.
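The ensembling step can be sketched as follows; the dictionary-of-chunks bookkeeping and the exponential weighting constant `m` are our assumptions, loosely following ACT's scheme rather than reproducing its code:

```python
import numpy as np

def temporal_ensemble(chunks, t, m=0.1):
    """Blend every chunk prediction that covers timestep t.

    chunks maps the timestep a chunk was predicted at to its
    (chunk_size, action_dim) array. Exponential weights follow the
    spirit of ACT's ensembling; m is an assumed smoothing constant.
    """
    preds, weights = [], []
    for t0, chunk in chunks.items():
        offset = t - t0                      # how old this prediction is
        if 0 <= offset < len(chunk):         # chunk still covers timestep t
            preds.append(chunk[offset])
            weights.append(np.exp(-m * offset))
    w = np.array(weights) / np.sum(weights)  # normalise the weights
    return np.sum(np.array(preds) * w[:, None], axis=0)

# Two overlapping chunks disagree about timestep 1; the ensemble blends them.
chunks = {0: np.full((4, 1), 1.0), 1: np.full((4, 1), 3.0)}
print(temporal_ensemble(chunks, t=1))
```

Because several chunks vote on each timestep, a single bad prediction is smoothed out instead of being executed verbatim.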

Best for: Fine-grained manipulation, bimanual tasks, low-cost hardware (ALOHA), getting started with imitation learning.

```python
# Training ACT with LeRobot
from lerobot.common.policies.act.configuration_act import ACTConfig
from lerobot.common.policies.act.modeling_act import ACTPolicy

config = ACTConfig(
    chunk_size=100,          # Predict 100 future steps
    n_action_steps=100,
    input_shapes={"observation.image": [3, 480, 640]},
    output_shapes={"action": [14]},  # 14-DOF bimanual
)
policy = ACTPolicy(config)
```

π₀ (Pi Zero) — Physical Intelligence

The model that proved VLAs can handle truly dexterous tasks. π₀ uses flow matching (a continuous-time generative technique closely related to diffusion) instead of autoregressive decoding, enabling high-frequency (50 Hz) continuous action generation.

| Aspect | Detail |
| --- | --- |
| Parameters | 3.3B |
| Architecture | VLM backbone (PaliGemma) + flow matching action head |
| Input | RGB images + language + proprioception |
| Output | Continuous action trajectories via flow matching |
| Training | Pre-trained on web data, fine-tuned on robot data |
| Benchmark | Outperforms Octo and OpenVLA on dexterous manipulation |
| Open source | π₀-FAST variant available |

Why flow matching?

Autoregressive models predict one token at a time — too slow and jerky for dexterous manipulation. Flow matching generates entire trajectories as continuous flows, producing smooth 50 Hz actions naturally.

Autoregressive: [token₁] → [token₂] → [token₃] → ... (discrete, slow)
Flow matching: noise ═══smooth flow═══► action trajectory (continuous, fast)
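Sampling from a flow-matching policy amounts to integrating a learned velocity field from noise (t = 0) to an action trajectory (t = 1). A toy sketch, with a hand-written field standing in for the trained network and invented dimensions; this is Euler integration, not π₀'s actual sampler:

```python
import numpy as np

def sample_flow(velocity_field, action_dim, n_steps=10, seed=0):
    """Generate one action vector by integrating a velocity field.

    Starts from Gaussian noise and takes Euler steps from t=0 to t=1,
    as in flow-matching samplers (toy sketch).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(action_dim)    # pure noise at t=0
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * velocity_field(x, t)  # one Euler step along the flow
    return x

# A toy "learned" field that flows every sample toward a fixed target
target = np.linspace(-1.0, 1.0, 14)        # pretend 14-DoF action
v = lambda x, t: (target - x) / (1.0 - t + 1e-8)
traj = sample_flow(v, action_dim=14)
print(np.allclose(traj, target, atol=1e-3))
```

Note that the whole trajectory emerges from a handful of integration steps rather than one network call per token, which is where the speed advantage over autoregressive decoding comes from.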

π₀.₅ — Open-World Generalization

Builds on π₀ with co-training on heterogeneous data: multiple robot types, human videos, synthetic data, and web knowledge.

| Aspect | Detail |
| --- | --- |
| Parameters | 3.3B |
| Key advance | Cross-embodiment generalization via co-training |
| Training stages | Pre-training (diverse tasks) → Post-training (specialization) |
| Benchmark | Multi-step mobile manipulation: 60–80% success in novel homes |
| Open source | Yes, model weights available |

Two-stage training:

Stage 1 (Pre-training): Stage 2 (Post-training):
Multiple robot types Specific embodiment
+ Human videos + Target tasks
+ Web knowledge + Domain-specific data
+ Synthetic data
↓ ↓
General robot Specialized mobile
understanding manipulation policy
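At its simplest, co-training reduces to weighted sampling across heterogeneous datasets during pre-training. A toy sketch; the dataset names and mixture weights here are invented for illustration, not Physical Intelligence's actual recipe:

```python
import random

# Hypothetical pre-training mixture (illustrative weights)
mixture = {
    "robot_arm_teleop": 0.4,
    "mobile_manipulator": 0.3,
    "human_videos": 0.2,
    "web_knowledge": 0.1,
}

def sample_batch_sources(n, seed=0):
    """Pick which dataset each example in a training batch comes from."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[k] for k in names]
    return rng.choices(names, weights=weights, k=n)

batch = sample_batch_sources(8)
print(batch)
```

Post-training then repeats the same loop with the mixture re-weighted toward the target embodiment and tasks.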

π*₀.₆ — Learning from Experience

The latest from Physical Intelligence. The asterisk in π*₀.₆ denotes a “self-improving” variant — the first VLA that gets better through real-world deployment. It introduces RECAP (RL with Experience and Corrections via Advantage-conditioned Policies).

| Aspect | Detail |
| --- | --- |
| Parameters | 3.3B |
| Key advance | Self-improvement via RL from real-world experience |
| Method | RECAP — combines demonstrations + on-policy data + expert corrections |
| Benchmark | 40–80% improvement over π₀.₅ on complex household tasks |
| Significance | First VLA that improves from deployment experience |

Why RECAP matters:

Most VLAs are static after training — they can only be as good as their demonstration data. π*₀.₆ can improve from its own mistakes in the real world:

┌─────────┐ ┌──────────┐ ┌──────────────┐ ┌──────────┐
│ Deploy │ → │ Collect │ → │ Expert │ → │ Retrain │
│ policy │ │ on-policy│ │ corrections │ │ with │
│ │ │ data │ │ when stuck │ │ RECAP │
└─────────┘ └──────────┘ └──────────────┘ └──────────┘
↑ │
└──────────── Improved policy ◄────────────────────┘

Results: folds laundry in real homes, assembles boxes, makes espresso — tasks that pure imitation learning struggles with.
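RECAP's exact algorithm is beyond this page, but the core idea of advantage conditioning can be loosely illustrated: label each trajectory by whether it beat the policy's average return, train the policy conditioned on that label, and always condition on the "good" label at deployment. A toy sketch with made-up numbers (ours, not the RECAP implementation):

```python
import numpy as np

# Mixed data sources, as in RECAP: demos, rollouts, corrections
trajectories = [
    {"obs": 0.1, "action": 0.2, "ret": 1.0},   # demonstration
    {"obs": 0.4, "action": 0.9, "ret": -1.0},  # failed on-policy rollout
    {"obs": 0.5, "action": 0.3, "ret": 2.0},   # expert correction
]

# Advantage label: did this trajectory beat the average return?
baseline = np.mean([t["ret"] for t in trajectories])
for t in trajectories:
    t["advantage_label"] = 1 if t["ret"] > baseline else 0

# A real system would now train pi(action | obs, advantage_label) on ALL
# data (failures included) and set advantage_label = 1 at deployment.
good = [t for t in trajectories if t["advantage_label"] == 1]
print(len(good))
```

The key property: failed rollouts are not discarded; they teach the model what "below average" looks like, which is what lets the policy improve from its own mistakes.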


GR00T N1 — NVIDIA

NVIDIA’s open foundation model for humanoid robots, announced at GTC 2025. Uses a dual-system architecture inspired by human cognition.

| Aspect | Detail |
| --- | --- |
| Parameters | 2B |
| Architecture | Dual-system: VLM (System 2) + Diffusion Transformer (System 1) |
| Input | Multi-camera RGB + language instructions + proprioception |
| Output | Whole-body humanoid actions |
| Training | Real robot data + human videos + synthetic data (50K H100 GPU hours) |
| Benchmark | State-of-the-art on humanoid manipulation and locomotion |
| Open source | Yes, weights on Hugging Face |

Dual-system design:

System 2 (Slow Thinking): System 1 (Fast Acting):
Vision-Language Model Diffusion Transformer
"What do I see?" "How do I move?"
"What should I do?" Continuous motor commands
Deliberate reasoning Fluid real-time actions
│ │
└───── Coupled & jointly trained ───────┘
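The essential timing trick is that the two systems run at different rates: the slow VLM refreshes a latent plan a few times per second while the fast action head emits motor commands every control tick using the most recent plan. A sketch of that decoupling (the 4 Hz / 120 Hz numbers are assumptions for illustration, not NVIDIA's published rates):

```python
# Assumed rates for illustration only
SLOW_HZ, FAST_HZ = 4, 120

def run(ticks):
    log = []
    latent = None
    for t in range(ticks):
        if t % (FAST_HZ // SLOW_HZ) == 0:
            latent = f"plan@{t}"     # System 2: deliberate replanning
        log.append((t, latent))      # System 1: acts every tick with the
    return log                       # most recent latent plan

steps = run(60)
print(steps[0], steps[29], steps[30])
```

Every control tick has an action, but the expensive vision-language reasoning only runs once per 30 ticks.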

Adopters: 1X, Agility Robotics, Boston Dynamics, Mentee Robotics, NEURA Robotics.


SmolVLA — Hugging Face

A compact, community-driven VLA that proves you don’t need billions of parameters.

| Aspect | Detail |
| --- | --- |
| Parameters | 450M |
| Architecture | Trimmed SmolVLM-2 + Transformer action expert |
| Training data | 10M frames from 487 open-source community datasets |
| Benchmark | 87.3% on LIBERO (matches models 7× its size) |
| Hardware | Trains on 1 GPU, runs on a MacBook |
| Open source | Yes, via LeRobot |

Why SmolVLA matters for the community:

Performance
π₀ (3.3B) ─────── ●│
SmolVLA (0.45B) ── ●│ ← Close performance at 1/7 the size
ACT (0.08B) ────── ●│ ← Still strong for specific tasks
└──────────────► Parameters

Architecture Comparison

| Model | Params | Action Decoder | Language | Open Source | Best For |
| --- | --- | --- | --- | --- | --- |
| ACT | 80M | CVAE + chunking | No | Yes | Fine-grained manipulation |
| π₀ | 3.3B | Flow matching | Yes | Partial | Dexterous tasks |
| π₀.₅ | 3.3B | Flow matching | Yes | Yes | Open-world mobile manipulation |
| π*₀.₆ | 3.3B | Flow matching + RL | Yes | No | Self-improving deployment |
| GR00T N1 | 2B | Diffusion Transformer | Yes | Yes | Humanoid robots |
| SmolVLA | 450M | Transformer expert | Yes | Yes | Low-cost, community |
| Octo | 93M | Diffusion | Yes | Yes | Cross-embodiment research |
| OpenVLA | 7B | Autoregressive | Yes | Yes | General manipulation |
| RT-2 | 55B | Autoregressive | Yes | No | Google’s large-scale VLA |

Action Representations

How VLAs output actions varies significantly.

Autoregressive decoding

Predict action tokens one at a time (like GPT generates text).

Models: RT-2, OpenVLA

Pros: Simple, leverages LLM pre-training

Cons: Slow inference, discrete actions, hard to do high-frequency control

[img] [text] → [a₁] → [a₂] → [a₃] → ... (sequential)
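To emit discrete tokens at all, autoregressive VLAs first discretize continuous actions; RT-2 and OpenVLA map each action dimension to one of 256 uniform bins. A simplified sketch (the [-1, 1] range and bin count follow the commonly cited scheme, not either model's exact code):

```python
import numpy as np

N_BINS = 256
LOW, HIGH = -1.0, 1.0

def action_to_tokens(action):
    """Map continuous actions in [LOW, HIGH] to integer token ids."""
    norm = (np.clip(action, LOW, HIGH) - LOW) / (HIGH - LOW)
    return np.minimum((norm * N_BINS).astype(int), N_BINS - 1)

def tokens_to_action(tokens):
    """Decode token ids back to bin-centre continuous actions."""
    return LOW + (tokens + 0.5) / N_BINS * (HIGH - LOW)

a = np.array([-1.0, -0.25, 0.0, 0.7, 1.0])
recon = tokens_to_action(action_to_tokens(a))
print(np.max(np.abs(recon - a)) < (HIGH - LOW) / N_BINS)
```

The round-trip error is bounded by half a bin width, which is why coarse binning is acceptable for pick-and-place but limits high-frequency dexterous control.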

Getting Started

Easiest Path: LeRobot + ACT or SmolVLA

```bash
# Install LeRobot
pip install lerobot

# Fine-tune SmolVLA on your data
python lerobot/scripts/train.py \
  --policy.type=smolvla \
  --dataset.repo_id=your_username/your_dataset \
  --output_dir=outputs/smolvla_finetuned \
  --steps=20000
```

For Humanoid Robots: GR00T N1

```python
# Download from Hugging Face
from transformers import AutoModel

model = AutoModel.from_pretrained("nvidia/GR00T-N1-2B")

# Fine-tune for your embodiment
# See: https://developer.nvidia.com/isaac-gr00t
```

Hardware Requirements

| Model | Training | Inference |
| --- | --- | --- |
| ACT | 1× consumer GPU | Jetson Orin Nano |
| SmolVLA | 1× A100 (4 hrs) | Jetson Orin NX / MacBook |
| GR00T N1 | Multi-GPU cluster | Jetson Thor |
| π₀ | Large cluster | Jetson Thor / high-end GPU |

What’s Next for VLAs?

The field is moving fast. Key trends for 2026+:

  • Self-improvement: Models that get better from deployment (π*₀.₆ / RECAP)
  • Smaller, faster: Efficient models for edge deployment (SmolVLA direction)
  • Sim-to-real: Better transfer from simulated training (Isaac Lab)
  • Multi-embodiment: One model, many robots (π₀.₅ direction)
  • Safety: Constrained policies that avoid dangerous actions

Learn More