Action Chunking

Conceptual

Action chunking is a control strategy in which a robot policy predicts a sequence of N future actions at once — a “chunk” — executes them open-loop without re-observing, and only then takes a new observation and predicts the next chunk. Rather than asking “what is the single best action right now?”, the policy asks “what are the best 50 actions starting from right now?” and commits to that trajectory for the chunk duration.

Action chunking is the core innovation of the ACT (Action Chunking with Transformers) model and has since become a standard technique in imitation learning for dexterous manipulation.

The Problem It Solves

The Naive One-Action Loop

A straightforward imitation learning controller runs a tight observe-predict-execute cycle:

observe → predict a₁ → execute a₁ → observe → predict a₂ → execute a₂ → ...

Each prediction is made independently from a fresh observation. This causes two compounding problems:

  1. Jitter from micro-corrections. Every observation carries sensor noise. Predicting one action at a time means the policy is constantly micro-correcting against noise rather than following a smooth intended trajectory.
  2. Temporal inconsistency. Consecutive predictions for adjacent timesteps are not guaranteed to be jointly consistent. The policy at t+1 has “forgotten” the context of t. The result is jerky, discontinuous motion that looks nothing like the smooth human demonstrations the model was trained on.
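To make the failure mode concrete, here is a minimal sketch of the naive one-action loop. All names (`observe`, `predict_one_action`, `execute`) and the noise model are hypothetical stand-ins, not any particular library's API; the point is that every step re-reads noisy sensors, so every action carries fresh noise:

```python
import numpy as np

rng = np.random.default_rng(0)

def observe():
    """Stand-in sensor read: the true state plus per-step sensor noise."""
    return np.zeros(7) + rng.normal(0.0, 0.01, size=7)

def predict_one_action(obs):
    """Stand-in policy: one action per fresh observation (hypothetical rule)."""
    return obs * 0.5

def execute(action):
    pass  # a real controller would command the robot here

# Tight observe → predict → execute cycle: each prediction is made
# independently, so consecutive actions micro-correct against noise.
trajectory = []
for _ in range(10):
    obs = observe()
    action = predict_one_action(obs)
    execute(action)
    trajectory.append(action)

trajectory = np.stack(trajectory)  # (10, 7): one 7-DoF action per step
```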

The Chunked Version

Action chunking breaks the loop into larger, committed segments:

observe → predict [a₁, a₂, ..., a₅₀] → execute all 50 → observe → predict [a₅₁, ..., a₁₀₀] → ...

The policy commits to a smooth trajectory for the entire chunk duration. Within a chunk, no re-observation or re-planning occurs. At the chunk boundary, the robot takes a fresh observation and the cycle repeats. The result:

  • Smooth intra-chunk motion (no micro-jitter)
  • Coherent action sequences that match the temporal structure of human demonstrations
  • Reduced sensitivity to observation noise within a chunk
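The chunked loop above can be sketched as follows. This is a minimal illustration, not a real policy: `predict_chunk` is a hypothetical stand-in for a model (such as ACT) that emits all N actions from a single observation:

```python
import numpy as np

N = 50  # chunk size

def observe():
    return np.zeros(7)  # stand-in observation (7-DoF joint state)

def predict_chunk(obs, n=N):
    """Stand-in policy head: returns n future actions from one observation."""
    # A real policy (e.g. ACT) produces these in a single forward pass.
    return np.tile(obs, (n, 1)) + np.linspace(0.0, 1.0, n)[:, None] * 0.01

def execute(action):
    pass  # a real controller would command the robot here

# Observe once, then commit to the whole chunk open-loop; re-observe
# only at the chunk boundary.
executed = 0
for _ in range(2):            # two consecutive chunks
    obs = observe()
    chunk = predict_chunk(obs)
    for action in chunk:      # no re-observation inside the chunk
        execute(action)
        executed += 1
```

Because every action in the chunk comes from the same observation, intra-chunk motion is as smooth as the model's predicted trajectory, with no per-step noise injected.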

Chunk Size Trade-offs

The chunk size N is a key hyperparameter. Choosing it requires balancing smoothness against the need to re-observe and correct.

Chunk Size | Duration at 30 fps | Behavior                                                     | When to Use
-----------|--------------------|--------------------------------------------------------------|------------
1–5        | 0.03–0.17 s        | Jittery; effectively the same as one-action-at-a-time        | Avoid for dexterous tasks
25–100     | 0.83–3.33 s        | Smooth motion; re-observes often enough to correct mid-task  | Default range for tabletop manipulation
200+       | 6.7+ s             | Very smooth but cannot correct for obstacles or disturbances | Only suitable for fixed, fully predictable environments

For a 30 fps control loop, a chunk size of 50 corresponds to 1.67 seconds of committed action before re-observing. This is roughly the duration of a single dexterous sub-movement (a reach, grasp, or place), which aligns well with the temporal structure of human demonstrations.
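The arithmetic is simply chunk size divided by control frequency. A tiny helper (the function name is illustrative):

```python
def chunk_duration(chunk_size, fps=30):
    """Seconds of open-loop commitment before the next observation."""
    return chunk_size / fps

# chunk_duration(50) at 30 fps is about 1.67 s, roughly one
# reach / grasp / place sub-movement.
```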

Temporal Ensemble

Temporal ensemble is a refinement on basic action chunking that eliminates the hard discontinuity at chunk boundaries. Instead of switching abruptly from one chunk to the next, multiple overlapping chunk predictions are made and blended together.

How It Works

At each timestep, the policy predicts a full chunk of N actions starting from the current observation. Rather than discarding the predictions from previous timesteps, all overlapping predictions for the current timestep are averaged with exponentially decaying weights — older predictions contribute less than the freshest one.

Timestep: t t+1 t+2 t+3 t+4 t+5 t+6
┌────────────────────────────────┐
Chunk @t: │ a₀ a₁ a₂ a₃ a₄ a₅ │
└────────────────────────────────┘
┌────────────────────────────┐
Chunk @t+1: │ a₀ a₁ a₂ a₃ a₄ a₅ │
└────────────────────────────┘
┌────────────────────────┐
Chunk @t+2: │ a₀ a₁ a₂ a₃ a₄ │
└────────────────────────┘
Executed at t+2: blend(Chunk@t[2], Chunk@t+1[1], Chunk@t+2[0])
weights: low medium high

At timestep t+2 in the example above, three overlapping predictions cover the current moment. The action executed is a weighted average, with the most recent prediction (Chunk@t+2) given the highest weight because it was computed from the freshest observation.
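The blend above can be sketched with numpy, following the weighting convention described in this section (the freshest prediction gets the highest weight). The decay rate `m` and the function name are illustrative assumptions, not a fixed API:

```python
import numpy as np

def ensemble_action(predictions, m=0.1):
    """Blend overlapping predictions for the current timestep.

    `predictions` is ordered oldest → newest. Weights decay
    exponentially with prediction age, so the newest prediction
    (computed from the freshest observation) dominates.
    """
    preds = np.asarray(predictions, dtype=float)
    ages = np.arange(len(preds) - 1, -1, -1)   # newest has age 0
    weights = np.exp(-m * ages)                # exponential decay with age
    weights /= weights.sum()                   # normalize to sum to 1
    return weights @ preds                     # weighted average

# Three overlapping predictions for the same timestep (e.g. t+2),
# oldest first: Chunk@t[2], Chunk@t+1[1], Chunk@t+2[0].
blended = ensemble_action([[1.0], [1.1], [1.2]], m=1.0)
```

Larger `m` biases the blend more strongly toward the freshest prediction; `m = 0` reduces to a plain average.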

Effect

Temporal ensemble smooths the transitions at chunk boundaries, eliminating the brief jitter that can occur when the robot switches from one planned trajectory to the next. The trade-off is increased inference cost — the policy must run at every timestep rather than once per chunk — which may be a constraint on resource-limited edge devices.

In ACT

ACT (Action Chunking with Transformers) is the model that introduced action chunking for robot manipulation. Its architecture is built around producing all N actions simultaneously rather than autoregressively.

Model class: Conditional VAE (CVAE) with a Transformer decoder.

Inputs at inference:

  • Camera image(s) — encoded by a ResNet-18 or ViT backbone
  • Current joint state — a vector of joint positions
  • Latent variable z — sampled from the learned prior (zero at test time)

Output:

  • A sequence of N joint position deltas: [Δjoint₁, Δjoint₂, ..., Δjoint_N]

All N outputs are produced in a single forward pass through the Transformer decoder. There is no autoregressive decoding step. The Transformer’s self-attention across the N output tokens is what gives the actions their temporal coherence — the model learns correlations between the actions at different positions in the chunk.
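A toy sketch of this single-pass output structure, in numpy. This is not the real ACT architecture (which uses a full Transformer decoder with cross-attention over image features); the linear maps, the model width, and the latent dimension here are illustrative stand-ins. What it shows is the shape of the computation: N learned output queries, one shared conditioning context, and all N action deltas emitted in one forward pass with no autoregressive loop:

```python
import numpy as np

rng = np.random.default_rng(0)

N, D_ACT = 50, 7                  # chunk size, action dimension
D_OBS = 512 + 7 + 32              # image feature + joint state + latent z
D_MODEL = 64                      # hypothetical model width

# Hypothetical random stand-in weights; real ACT learns a Transformer.
W_in  = rng.normal(0, 0.02, (D_OBS, D_MODEL))
W_pos = rng.normal(0, 0.02, (N, D_MODEL))   # N learned position queries
W_out = rng.normal(0, 0.02, (D_MODEL, D_ACT))

def act_forward(image_feat, joint_state, z):
    """One forward pass → all N action deltas; no autoregressive decoding."""
    obs = np.concatenate([image_feat, joint_state, z])  # (D_OBS,)
    ctx = obs @ W_in                                    # shared context
    tokens = W_pos + ctx                                # N output tokens
    return tokens @ W_out                               # (N, D_ACT) deltas

deltas = act_forward(
    image_feat=np.zeros(512),   # e.g. a ResNet-18 global feature
    joint_state=np.zeros(7),
    z=np.zeros(32),             # latent set to zero at test time
)
```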

Training configuration note: In ACT, chunk_size and n_action_steps must be equal. Setting them to different values produces a mismatch between training and inference and results in degraded performance.

chunk_size = n_action_steps = 50 ← standard for 30fps tabletop tasks
