Running the agent

There are three main ways to run RoboSandbox. They share the same agent loop; the differences are in how interactive they are and what they write to disk.

The three ways to run

| Entry point | When to reach for it | What it writes |
| --- | --- | --- |
| `robo-sandbox-bench` | Reproducible test runs with success tracking | `benchmark_results.json` |
| `robo-sandbox viewer` | Interactive exploration + recording on demand | `runs/<id>/{video.mp4, events.jsonl, result.json}` (when Record is on) |
| `robo-sandbox run` | One-off scripted run from the CLI | `runs/<id>/{video.mp4, events.jsonl, result.json}` |

All three use the same Agent loop and the same skills.

The fastest successful pick

uv run robo-sandbox-bench --tasks pick_cube_franka --seeds 1

This is the quickest way to confirm the stack is working. A successful run prints something like:

TASK               SEED  RESULT   SECS  REPLANS DETAIL
------------------------------------------------------------------------
pick_cube_franka   0     OK        1.6        0 dz_mm=166.905, min_mm=50.000

SUMMARY: 1/1 successful

Bench runs are headless: no window, no video. Use the viewer or `robo-sandbox run` if you want recorded artifacts.
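
If you want to check bench results from a script, one option is to parse the printed table. The sketch below assumes the column layout shown in the sample output above (TASK, SEED, RESULT, SECS, REPLANS, DETAIL) and the `SUMMARY: N/M successful` line; `parse_bench_output` is a hypothetical helper, not part of RoboSandbox, and the real output may differ between versions.

```python
import re

# Assumed row layout, copied from the sample output in this guide:
# TASK  SEED  RESULT  SECS  REPLANS  DETAIL
ROW = re.compile(
    r"^(?P<task>\S+)\s+(?P<seed>\d+)\s+(?P<result>\S+)\s+"
    r"(?P<secs>\d+(?:\.\d+)?)\s+(?P<replans>\d+)\s*(?P<detail>.*)$"
)
SUMMARY = re.compile(r"^SUMMARY:\s+(\d+)/(\d+)\s+successful")


def parse_bench_output(text):
    """Return (rows, (passed, total)) parsed from captured bench stdout."""
    rows, summary = [], None
    for line in text.splitlines():
        line = line.strip()
        m = SUMMARY.match(line)
        if m:
            summary = (int(m.group(1)), int(m.group(2)))
            continue
        m = ROW.match(line)
        if m:  # header and separator lines do not match and are skipped
            row = m.groupdict()
            row["seed"] = int(row["seed"])
            row["secs"] = float(row["secs"])
            row["replans"] = int(row["replans"])
            rows.append(row)
    return rows, summary


sample = """\
TASK               SEED  RESULT   SECS  REPLANS DETAIL
------------------------------------------------------------------------
pick_cube_franka   0     OK        1.6        0 dz_mm=166.905, min_mm=50.000

SUMMARY: 1/1 successful
"""
rows, summary = parse_bench_output(sample)
print(rows[0]["task"], rows[0]["result"], summary)
```

Parsing stdout is fragile by nature; if the layout of `benchmark_results.json` is documented for your version, reading that file is likely the sturdier choice.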

Running from the viewer

uv run robo-sandbox viewer --task pick_cube_franka
# open http://localhost:8000

If this is your first time using the repo, the viewer is the best place to start. Click Record, click Run, and then use the Inspector slider to scrub back through the episode.

What's in `runs/<id>/`

runs/20260418-230155-4471d060/
├── episode.json      # 4 lines: episode_id, task, started_at, sim_dt
├── events.jsonl      # one line per sim step: joints, ee_pose, objects, gripper
├── result.json       # verdict: success, frames, wall, reason
└── video.mp4         # 30 fps render of the agent's camera

`result.json` is the high-level summary:

{
  "episode_id": "4471d060",
  "success": true,
  "ended_at": "2026-04-18T23:02:30.623",
  "frames": 1656,
  "task": "pick_cube_franka",
  "wall": 26.29,
  "reason": "plan_complete"
}

`events.jsonl` is the raw per-tick stream that policy and export code consume. Each line is one sim tick (dt = 0.005 s, i.e. 200 Hz):

- `t`: sim time in seconds
- `frame_idx`: zero-based
- `robot_joints`: full DoF vector
- `ee_pose`: `{xyz, quat_xyzw}` in world frame
- `gripper_width`: meters between fingertips
- `objects`: every scene object's current pose
- `action`: what the skill commanded at that tick (if any)
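
As a quick sanity check on a recorded episode, the fields above can be read back with the standard library. This is a minimal sketch: `summarize_events` is a hypothetical helper, and it assumes `ee_pose["xyz"]` is a three-element `[x, y, z]` list in meters, which the field list suggests but does not spell out.

```python
import json
from pathlib import Path


def summarize_events(path):
    """Summarize one events.jsonl file: tick count, sim duration, end-effector lift."""
    ticks = 0
    t_first = t_last = None
    z_min = z_max = None
    with Path(path).open() as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            ev = json.loads(line)
            ticks += 1
            t = ev["t"]
            t_first = t if t_first is None else t_first
            t_last = t
            # Assumption: ee_pose["xyz"] is [x, y, z] in meters, world frame.
            z = ev["ee_pose"]["xyz"][2]
            z_min = z if z_min is None else min(z_min, z)
            z_max = z if z_max is None else max(z_max, z)
    if ticks == 0:
        return {"ticks": 0, "duration_s": 0.0, "ee_z_range_m": 0.0}
    return {
        "ticks": ticks,
        "duration_s": t_last - t_first,
        "ee_z_range_m": z_max - z_min,
    }
```

Calling `summarize_events("runs/<id>/events.jsonl")` on a successful pick should show a non-trivial `ee_z_range_m`, since the end effector has to descend to the object and lift it.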

If you want to train from it, convert the run to a LeRobot dataset:

uv run robo-sandbox export-lerobot runs/<id> datasets/mypolicy

Swapping the planner

`robo-sandbox run` and the viewer both support `--vlm-provider`:

# regex planner, zero deps (default)
robo-sandbox run "pick up the red cube" --vlm-provider stub

# local Ollama with a vision model
ollama pull llama3.2-vision
ollama serve &
robo-sandbox run "pick up the red cube" --vlm-provider ollama

# OpenAI
export OPENAI_API_KEY=sk-...
robo-sandbox run "pick up the red cube" --vlm-provider openai --model gpt-4o-mini

# any OpenAI-compatible endpoint (vLLM, together.ai, groq, ...)
robo-sandbox run "..." --vlm-provider custom --base-url https://... --api-key-env TOGETHER_API_KEY

The agent loop stays the same across all four. Only the Planner instance changes. If you want to see exactly what the model sees, the VLM tool-calling guide walks through it.
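
To make "only the Planner instance changes" concrete, here is a toy sketch of the idea. Everything in it is hypothetical: the `Planner` protocol, the `plan` method, `RegexStubPlanner`, and `run_once` are illustrative names, not RoboSandbox's actual classes or signatures. It only shows how a loop written against an interface can accept a regex stub or a VLM-backed planner interchangeably.

```python
import re
from typing import Protocol

Step = tuple[str, dict]  # (skill name, keyword arguments)


class Planner(Protocol):
    """Hypothetical interface; the real Planner class may look different."""

    def plan(self, task: str) -> list[Step]: ...


class RegexStubPlanner:
    """Toy stand-in for a regex planner: handles 'pick up the <color> <noun>'."""

    PICK = re.compile(r"pick up the (\w+) (\w+)", re.IGNORECASE)

    def plan(self, task: str) -> list[Step]:
        m = self.PICK.search(task)
        if not m:
            return []  # nothing recognized: empty plan
        return [("pick", {"object": f"{m.group(1)}_{m.group(2)}".lower()})]


def run_once(planner: Planner, task: str) -> list[str]:
    """Same loop for any planner: ask for a plan, then 'execute' each step."""
    return [f"{skill}({args})" for skill, args in planner.plan(task)]


print(run_once(RegexStubPlanner(), "pick up the red cube"))
```

A VLM-backed planner would implement the same `plan` method by calling a model instead of matching a regex; `run_once` would not need to change.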

Watching the phases

PLAN and EXECUTE log lines make the control loop visible:

PLAN:    task='pick up the red cube' replan=0
EXECUTE: pick({'object': 'red_cube'})
TASK               SEED  RESULT   SECS  REPLANS DETAIL
---------------------------------------------------------
pick_cube_franka   0     OK        1.1        0 dz_mm=166.905

If a skill fails, you will see PLAN again with a larger replan=N counter. That is the recovery loop.
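
The recovery loop can be sketched in a few lines. This is an illustration under stated assumptions, not the agent's actual implementation: `run_agent`, `plan_fn`, and `execute_fn` are hypothetical names, the default `max_replans=3` is made up, and the returned reason strings are borrowed from the `result.json` values documented on this page. Treating `already_done` as a success is also an assumption.

```python
def run_agent(plan_fn, execute_fn, task, max_replans=3):
    """Plan, execute, and replan on failure. Returns (success, reason, replans)."""
    replans = 0
    while True:
        print(f"PLAN:    task={task!r} replan={replans}")
        plan = plan_fn(task, replans)
        if not plan and replans == 0:
            # Empty plan on the first call: nothing to do (assumed to count as success).
            return True, "already_done", replans
        ok = True
        for skill, args in plan:
            print(f"EXECUTE: {skill}({args})")
            if not execute_fn(skill, args):
                ok = False
                break  # abandon the rest of this plan and replan
        if ok:
            # Simplification: an empty plan after a replan also lands here.
            return True, "plan_complete", replans
        if replans >= max_replans:
            return False, "replan_exhausted", replans
        replans += 1


attempts = {"n": 0}


def flaky_pick(skill, args):
    """Demo skill that fails on its first call and succeeds afterwards."""
    attempts["n"] += 1
    return attempts["n"] > 1


result = run_agent(
    lambda task, replans: [("pick", {"object": "red_cube"})],
    flaky_pick,
    "pick up the red cube",
)
print(result)
```

In the demo, the first `pick` fails, so `PLAN` is printed a second time with `replan=1` before the run finishes.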

Reading `result.json` programmatically

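import json
from pathlib import Path

# Print a one-line summary for every successful run under runs/.
for run_dir in sorted(Path("runs").iterdir()):
    result_path = run_dir / "result.json"
    if not result_path.is_file():
        continue  # skip stray files and directories without a result.json
    r = json.loads(result_path.read_text())
    if r.get("success"):
        print(f"{run_dir.name}  {r['task']}  {r['wall']:.1f}s  {r['frames']} frames")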

Common values of the `reason` field in `result.json`:

| reason | meaning |
| --- | --- |
| `plan_complete` | every skill in the plan succeeded |
| `already_done` | planner returned empty plan on the first call |
| `replan_exhausted` | `max_replans` hit; see the final skill's detail |
| `vlm_transport` | VLM API error (timeout, auth, rate limit) |
| `stopped_by_user` | viewer Record stopped mid-episode |
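
To see how often each reason occurs across many runs, a small tally is usually enough. The sketch below assumes the `runs/<id>/result.json` layout described earlier; `tally_reasons` is a hypothetical helper, and runs whose `result.json` lacks a `reason` key are counted as `unknown`.

```python
import json
from collections import Counter
from pathlib import Path


def tally_reasons(runs_root="runs"):
    """Count result.json 'reason' values across run directories."""
    counts = Counter()
    # Only directories that actually contain a result.json are considered.
    for result_path in sorted(Path(runs_root).glob("*/result.json")):
        result = json.loads(result_path.read_text())
        counts[result.get("reason", "unknown")] += 1
    return counts
```

For example, `tally_reasons().most_common()` lists the reasons from most to least frequent, which makes a spike in `vlm_transport` or `replan_exhausted` easy to spot.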

What's next

Bring your own task — author a YAML, run it through the same path.
Replan loop — trace a deliberately failing run.
VLM tool-calling — what each provider actually sees.