Tutorial: policy replay¶
This is the original replay path in the repo: record a run, then replay the recorded joint trajectory back in sim.
Looking for the layered LeRobot workflow?
This page is the original umbrella tutorial for record → replay in sim. For the three-step workflow that covers exporting to LeRobot datasets, driving a pre-trained checkpoint in sim, and handing a sim-validated skill off to real hardware, start here:
- LeRobot Export — record an episode and export it into a LeRobot v3 dataset; inspect the parquet metadata with standard tooling.
- LeRobot Policy Replay — run a public ACT checkpoint through `LeRobotPolicyAdapter` + `run_policy` on a non-bundled robot.
- Sim-to-Real Handoff — what carries over to real hardware, what doesn't, and a concrete SO-101 backend skeleton.
The page below stays useful for the `ReplayTrajectoryPolicy` path: given an `events.jsonl` from `LocalRecorder`, it open-loop replays the recorded joint trajectory. If you want the simplest possible policy path, this is it: no training, no checkpoint loading, just replaying the trajectory you already recorded.
1. Record an episode¶
Recording with `LocalRecorder` writes a run directory such as `runs/20260418-094533-1a2b3c4d/` containing `events.jsonl`. See recording & export for the `events.jsonl` schema.
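The `events.jsonl` file is line-delimited JSON, one event per line in recording order. A minimal sketch of writing and reading such a trace — the field names `t`, `robot_joints`, and `gripper` are assumptions here; the authoritative schema is in the recording & export docs:

```python
import json
import tempfile
from pathlib import Path

# Two illustrative events; field names are assumptions, not the real schema.
events_in = [
    {"t": 0.000, "robot_joints": [0.0, -0.5, 0.3], "gripper": 0.0},
    {"t": 0.005, "robot_joints": [0.0, -0.49, 0.31], "gripper": 0.0},
]

run_dir = Path(tempfile.mkdtemp())
with open(run_dir / "events.jsonl", "w") as f:
    for event in events_in:
        f.write(json.dumps(event) + "\n")

# Reading it back: one JSON object per line, in recording order.
events = [json.loads(line)
          for line in (run_dir / "events.jsonl").read_text().splitlines()]
print(len(events))  # 2
```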
2. (Optional) Export to LeRobot v3¶
uv pip install -e 'packages/robosandbox-core[lerobot]'
robo-sandbox export-lerobot \
runs/20260418-094533-1a2b3c4d \
/tmp/my_dataset
This writes a LeRobot v3 dataset at `/tmp/my_dataset/` with `data/chunk-000/episode_000000.parquet` + `meta/` + `videos/`. Pass this to any LeRobot-compatible training loop.
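If you want to sanity-check the export before handing it to a training loop, a quick layout check over the directories listed above is enough. `validate_layout` below is a hypothetical helper, not part of robosandbox; the temp-dir setup just simulates an exported dataset for illustration:

```python
import tempfile
from pathlib import Path

# Directories this command is documented to produce.
EXPECTED = ["data/chunk-000", "meta", "videos"]

def validate_layout(root: Path) -> bool:
    """Return True if the LeRobot v3 top-level layout is present."""
    return all((root / sub).is_dir() for sub in EXPECTED)

# Simulate the exported dataset in a temp dir for illustration.
root = Path(tempfile.mkdtemp())
for sub in EXPECTED:
    (root / sub).mkdir(parents=True)
(root / "data/chunk-000/episode_000000.parquet").touch()

print(validate_layout(root))  # True
```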
3. Replay the trajectory¶
The bundled `ReplayTrajectoryPolicy` treats `events.jsonl` as an open-loop action trace and drives the sim through it tick by tick.
From the CLI:
robo-sandbox run --policy runs/20260418-094533-1a2b3c4d \
--task pick_cube_franka \
--max-steps 1000
What happens under the hood:
1. `load_policy(path)` inspects the directory. An `events.jsonl` present → wraps in `ReplayTrajectoryPolicy`.
2. The task's scene is loaded into `MuJoCoBackend` and settled under gravity.
3. `run_policy(sim, policy, max_steps, success=task.success)` loops observe → act → step.
4. The task's success criterion runs against the final observation and is printed at the end.
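The steps above can be sketched as a plain loop. The stub classes stand in for `MuJoCoBackend` and a real policy, and `run_policy_sketch` paraphrases the documented behavior — it is not robosandbox's actual implementation:

```python
import numpy as np

class StubSim:
    """Stand-in for MuJoCoBackend: holds joint state, steps on an action."""
    def __init__(self, n_dof=7):
        self.joints = np.zeros(n_dof)

    def observe(self):
        return {"robot_joints": self.joints.copy()}

    def step(self, action):
        self.joints = action[:-1]  # last entry is the gripper command

class ZeroPolicy:
    """Trivial policy: command all joints (and gripper) to zero."""
    def act(self, obs):
        return np.zeros(len(obs["robot_joints"]) + 1)

def run_policy_sketch(sim, policy, max_steps, success=None):
    obs = sim.observe()
    for _ in range(max_steps):
        sim.step(policy.act(obs))  # observe -> act -> step
        obs = sim.observe()
    verdict = success(obs) if success else None
    return {"success": verdict, "steps": max_steps, "final_obs": obs}

result = run_policy_sketch(
    StubSim(), ZeroPolicy(), max_steps=10,
    success=lambda obs: bool(np.all(obs["robot_joints"] == 0)))
print(result["success"], result["steps"])  # True 10
```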
Example output:
[run --policy] task: pick_cube_franka
[run --policy] policy: runs/20260418-094533-1a2b3c4d
[run --policy] verdict: success
[run --policy] steps: 1000
[run --policy] final_reason: policy_completed_1000_steps
[run --policy] wall: 18.3s
4. Wire your own policy¶
Anything with `act(obs: Observation) -> np.ndarray` of shape `(n_dof + 1,)` (joints + gripper in `[0, 1]`) satisfies the `Policy` protocol:
import numpy as np

from robosandbox.policy import Policy, run_policy

class MyAwesomePolicy:
    def __init__(self, checkpoint: str):
        self._model = load_my_model(checkpoint)

    def act(self, obs):
        joints, gripper = self._model.infer(obs.rgb, obs.robot_joints)
        return np.concatenate([joints, [gripper]])

result = run_policy(sim, MyAwesomePolicy("ckpt.pt"),
                    max_steps=1000, success=task.success)
# {"success": True, "steps": 1000, "initial_obs": ..., "final_obs": ...}
If you want the CLI to understand your own checkpoint directory, extend
robosandbox.policy.load_policy to dispatch on your checkpoint
format (LeRobot, torchscript, onnx, whatever):
# in your own package
from pathlib import Path

from robosandbox.policy import load_policy as _core_load_policy

def load_policy(path):
    p = Path(path)
    if (p / "config.json").exists():
        return MyAwesomePolicy(p)
    return _core_load_policy(p)  # fall through to replay
policy.json alternative¶
If a directory does not auto-match, add a `policy.json`:
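A minimal sketch of what such a file might contain — `action_lookahead` is the only key documented on this page, so treat this fragment as illustrative rather than a complete schema:

```json
{
  "action_lookahead": 2
}
```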
`action_lookahead > 1` skips ahead that many recorded rows per `act()` call — useful to replay a 200 Hz recording at 100 Hz.
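How a lookahead of 2 halves the replay rate can be shown with a plain list standing in for the recorded rows — this slicing is a paraphrase of the documented behavior, not robosandbox's code:

```python
# Ten recorded actions; advance by `action_lookahead` rows per act() call.
trajectory = list(range(10))
action_lookahead = 2

replayed = trajectory[::action_lookahead]
print(replayed)  # [0, 2, 4, 6, 8]
```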
Action semantics¶
`Policy.act(obs)` returns a flat `(n_dof + 1,)` vector:

- first `n_dof` entries — target joint positions
- last entry — gripper in `[0, 1]` (0 = open, 1 = closed)

This matches `MuJoCoBackend.step(target_joints=..., gripper=...)`. Values outside this range are clamped by the sim, not the policy.
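Assembling that flat action vector, with `np.clip` mirroring the clamping the sim applies (the policy itself is free to emit out-of-range values):

```python
import numpy as np

n_dof = 7
target_joints = np.zeros(n_dof)
gripper = 1.4  # out of range on purpose

action = np.concatenate([target_joints, [gripper]])
assert action.shape == (n_dof + 1,)

# The sim, not the policy, clamps the gripper command into [0, 1].
clamped_gripper = float(np.clip(action[-1], 0.0, 1.0))
print(clamped_gripper)  # 1.0
```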
Tips¶
- `verdict: unknown` in the CLI means the task didn't declare a success criterion. That's fine for free-form exploratory runs.
- Policy runs forever — `ReplayTrajectoryPolicy` repeats its last action after the trajectory ends. Use `--max-steps` to cap it.
- Sim lag — `run_policy` does one sim step per `act()` call. At a 200 Hz sim timestep, 1000 steps = 5 sim seconds.
See also¶
- Recording & export — `LocalRecorder` layout + `events.jsonl` schema.
- Real-robot bridge — the same `Policy` protocol runs against `RealRobotBackend`.
- CLI: `robo-sandbox run --policy`.