Running the agent
There are three main ways to run RoboSandbox. They share the same agent loop; the differences are in how interactive they are and what they write to disk.

The three ways to run
| Entry point | When to reach for it | What it writes |
|---|---|---|
| robo-sandbox-bench | Reproducible test runs with success tracking | benchmark_results.json |
| robo-sandbox viewer | Interactive exploration + recording on demand | runs/<id>/{video.mp4, events.jsonl, result.json} (when Record is on) |
| robo-sandbox run | One-off scripted run from the CLI | runs/<id>/{video.mp4, events.jsonl, result.json} |
All three use the same Agent loop and the same skills.
The fastest successful pick
This is the quickest way to confirm the stack is working. Run a single task through robo-sandbox-bench; a passing run prints a summary like this:
TASK SEED RESULT SECS REPLANS DETAIL
------------------------------------------------------------------------
pick_cube_franka 0 OK 1.6 0 dz_mm=166.905, min_mm=50.000
SUMMARY: 1/1 successful
Bench runs are headless: no window, no video. Use the viewer or
robo-sandbox run if you want recorded artifacts.
Running from the viewer
If this is your first time using the repo, the viewer is the best place to start. Click Record, click Run, and then use the Inspector slider to scrub back through the episode.
What's in runs/<id>/

runs/20260418-230155-4471d060/
├── episode.json # 4 lines: episode_id, task, started_at, sim_dt
├── events.jsonl # one line per sim step: joints, ee_pose, objects, gripper
├── result.json # verdict: success, frames, wall, reason
└── video.mp4 # 30 fps render of the agent's camera
result.json is the high-level summary:
{
  "episode_id": "4471d060",
  "success": true,
  "ended_at": "2026-04-18T23:02:30.623",
  "frames": 1656,
  "task": "pick_cube_franka",
  "wall": 26.29,
  "reason": "plan_complete"
}
events.jsonl is the raw per-tick stream that policy and export code
consume. Each line is one sim tick (dt=0.005s, about 200 Hz):
- t: sim time in seconds
- frame_idx: zero-based
- robot_joints: full DoF vector
- ee_pose: {xyz, quat_xyzw} in world frame
- gripper_width: meters between fingertips
- objects: every scene object's current pose
- action: what the skill commanded at that tick (if any)
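As a rough illustration of consuming that stream, the sketch below parses events.jsonl lines and extracts the tick interval and the peak end-effector height. It assumes each line is a JSON object with the fields listed above and that ee_pose.xyz is a three-element list in meters. The helper names (parse_events, summarize) and the sample values are hypothetical and not part of RoboSandbox.

```python
import json
from typing import Iterable


def parse_events(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL text: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def summarize(events: list[dict]) -> dict:
    """Report tick count, mean tick interval, and peak end-effector height.

    Assumes ee_pose["xyz"] is [x, y, z] in meters (an assumption about the schema).
    """
    times = [e["t"] for e in events]
    heights = [e["ee_pose"]["xyz"][2] for e in events]
    mean_dt = (times[-1] - times[0]) / (len(times) - 1) if len(times) > 1 else 0.0
    return {"ticks": len(events), "mean_dt": mean_dt, "max_ee_z": max(heights)}


# Made-up sample lines standing in for a real events.jsonl.
sample = [
    '{"t": 0.000, "frame_idx": 0, "ee_pose": {"xyz": [0.4, 0.0, 0.30]}}',
    '{"t": 0.005, "frame_idx": 1, "ee_pose": {"xyz": [0.4, 0.0, 0.31]}}',
    '{"t": 0.010, "frame_idx": 2, "ee_pose": {"xyz": [0.4, 0.0, 0.33]}}',
]
summary = summarize(parse_events(sample))
print(summary)
```

For a real run, an open file handle on the run's events.jsonl can be passed to parse_events in place of the sample list, since it only needs an iterable of lines.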
If you want to train from it, convert the run to a LeRobot dataset:
Swapping the planner
robo-sandbox run and the viewer both support --vlm-provider:
# regex planner, zero deps (default)
robo-sandbox run "pick up the red cube" --vlm-provider stub
# local Ollama with a vision model
ollama pull llama3.2-vision
ollama serve &
robo-sandbox run "pick up the red cube" --vlm-provider ollama
# OpenAI
export OPENAI_API_KEY=sk-...
robo-sandbox run "pick up the red cube" --vlm-provider openai --model gpt-4o-mini
# any OpenAI-compatible endpoint (vLLM, together.ai, groq, ...)
robo-sandbox run "..." --vlm-provider custom --base-url https://... --api-key-env TOGETHER_API_KEY
The agent loop stays the same across all four. Only the Planner
instance changes. If you want to see exactly what the model sees, the
VLM tool-calling guide walks through it.
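To make the idea of a zero-dependency planner concrete, here is a minimal sketch of how a regex-based planner might turn a prompt into skill calls. The class name RegexPlanner, the plan method, and the (skill, args) tuple format are illustrative assumptions; the actual stub planner in RoboSandbox may be structured quite differently.

```python
import re

# Matches prompts such as "pick up the red cube" or "pick the blue block".
_PICK = re.compile(
    r"^\s*pick(?:\s+up)?\s+(?:the\s+)?(?P<obj>[\w\s]+?)\s*$",
    re.IGNORECASE,
)


class RegexPlanner:
    """Hypothetical planner: maps a task string to a list of (skill, args) steps."""

    def plan(self, task: str) -> list[tuple[str, dict]]:
        m = _PICK.match(task)
        if m is None:
            return []  # nothing recognized, so return an empty plan
        # "red cube" -> "red_cube" (assumed object-id convention)
        obj = "_".join(m.group("obj").lower().split())
        return [("pick", {"object": obj})]


steps = RegexPlanner().plan("pick up the red cube")
print(steps)
```

Any other planner, VLM-backed or not, would only need to offer the same plan method for the surrounding loop to stay unchanged, which is the point of the provider switch.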
Watching the phases
PLAN and EXECUTE log lines make the control loop visible:

PLAN: task='pick up the red cube' replan=0
EXECUTE: pick({'object': 'red_cube'})
TASK SEED RESULT SECS REPLANS DETAIL
---------------------------------------------------------
pick_cube_franka 0 OK 1.1 0 dz_mm=166.905
If a skill fails, you will see PLAN again with a larger replan=N
counter. That is the recovery loop.
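The recovery loop can be pictured roughly as follows. This is a simplified sketch, not the RoboSandbox Agent: run_episode, the plan and execute callables, and the max_replans default are hypothetical stand-ins. Only the log line format and the reason strings are taken from this page.

```python
from typing import Callable

Step = tuple[str, dict]


def run_episode(
    task: str,
    plan: Callable[[str], list[Step]],
    execute: Callable[[Step], bool],
    max_replans: int = 3,
) -> dict:
    """Plan, execute each step, and replan from scratch when a step fails."""
    replan = 0
    while True:
        print(f"PLAN: task={task!r} replan={replan}")
        steps = plan(task)
        if not steps and replan == 0:
            # Empty plan on the first call; assumed here to count as success.
            return {"success": True, "reason": "already_done", "replans": replan}
        failed = False
        for skill, args in steps:
            print(f"EXECUTE: {skill}({args})")
            if not execute((skill, args)):
                failed = True
                break
        if not failed:
            return {"success": True, "reason": "plan_complete", "replans": replan}
        if replan >= max_replans:
            return {"success": False, "reason": "replan_exhausted", "replans": replan}
        replan += 1


attempts = {"n": 0}


def flaky_execute(step: Step) -> bool:
    """Fail the first attempt, succeed on every later one."""
    attempts["n"] += 1
    return attempts["n"] > 1


result = run_episode(
    "pick up the red cube",
    lambda task: [("pick", {"object": "red_cube"})],
    flaky_execute,
)
print(result)
```

In this toy run the first EXECUTE fails, so PLAN appears a second time with replan=1 before the episode finishes.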
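Reading result.json programmatically
import json
from pathlib import Path

# Print a one-line summary for every successful run under runs/.
for run_dir in sorted(Path("runs").iterdir()):
    result_path = run_dir / "result.json"
    if not result_path.is_file():
        continue  # skip stray files and runs that never wrote a result
    r = json.loads(result_path.read_text())
    if r.get("success"):
        print(f"{run_dir.name} {r['task']} {r['wall']:.1f}s {r['frames']} frames")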
Common values in result.json.reason:
| reason | meaning |
|---|---|
| plan_complete | every skill in the plan succeeded |
| already_done | planner returned empty plan on the first call |
| replan_exhausted | max_replans hit; see the final skill's detail |
| vlm_transport | VLM API error (timeout, auth, rate limit) |
| stopped_by_user | viewer Record stopped mid-episode |
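To see how a batch of runs ended, the reason values can be tallied across run directories. This is a small sketch, assuming every result.json carries the reason field shown above; tally_reasons is a hypothetical helper, not part of RoboSandbox.

```python
import json
from collections import Counter
from pathlib import Path


def tally_reasons(runs_root: Path) -> Counter:
    """Count result.json 'reason' values across run directories under runs_root."""
    counts: Counter = Counter()
    for result_path in sorted(runs_root.glob("*/result.json")):
        result = json.loads(result_path.read_text())
        counts[result.get("reason", "unknown")] += 1
    return counts


# Only report if a runs/ directory exists in the current working directory.
if Path("runs").is_dir():
    for reason, n in tally_reasons(Path("runs")).most_common():
        print(f"{n:4d}  {reason}")
```

A spike in replan_exhausted or vlm_transport in this tally is usually a better starting point for debugging than reading individual runs one by one.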
What's next
- Bring your own task — author a YAML, run it through the same path.
- Replan loop — trace a deliberately failing run.
- VLM tool-calling — what each provider actually sees.