Skip to content

Skills & agents

Most of the control flow in RoboSandbox lives in three abstractions:

  1. Skills — the action vocabulary (pick, place_on, …). Each is a callable with a JSON schema.
  2. Planner — turns a natural-language task + current observation into a list of SkillCalls.
  3. Agent — ReAct loop: plan → execute → (on failure) replan.

Skill protocol

class Skill(Protocol):
    name: str
    description: str
    parameters_schema: dict            # JSON schema
    def __call__(self, ctx, **kwargs) -> SkillResult: ...

There is no base class requirement here. Any object matching the shape works. The VLMPlanner converts parameters_schema directly to an OpenAI tool definition; the stub planner dispatches by name.

ctx is an AgentContext carrying sim, perception, grasp, motion, and (optionally) recorder. Skills do I/O through ctx.

Skills shipped in v0.1

Skill Signature What it does
pick pick(object: str) Locate object, plan top-down grasp, execute, close gripper.
place_on place_on(target: str) Move above target, release.
push push(object: str, direction: str) Cartesian push in forward / back / left / right.
home home() Return to the home pose from the robot sidecar.
pour pour(target: str) Tilt end-effector over target.
tap tap(object: str) Touch the top of an object with the fingertip.
open_drawer open_drawer(drawer: str) Grasp handle, pull toward base.
close_drawer close_drawer(drawer: str) Push drawer back in.
stack stack(object: str, target: str) Pick + place in one.

Every skill returns SkillResult(success, reason, reason_detail, artifacts). Failure reasons are short structured strings such as unreachable, not_found, or missed_grasp, so the replan loop has something useful to work with.

Skills register at the robosandbox.skills entry point — see packages/robosandbox-core/pyproject.toml. A plugin package can add new skills without core changes.

Planner protocol

class Planner(Protocol):
    def plan(
        self,
        task: str,
        obs: Observation,
        prior_attempts: list[dict],
    ) -> tuple[list[SkillCall], int]:
        """Returns (plan, n_model_calls). Empty plan == 'already done'."""

Two implementations ship with core:

VLMPlanner

Calls an OpenAI-compatible chat endpoint with tool-calling + image input. Converts each Skill's parameters_schema into a tool definition. Passes an RGB frame of the current observation plus a text summary of scene_objects keys/xyz.

Used by robo-sandbox run --vlm-provider {openai,ollama,custom} and the llm_guided.py example.

If the model responds with prose instead of tool calls, VLMPlanner retries once with a "tool calls only" nudge.

StubPlanner

This is the deterministic, zero-dependency planner. It is just enough to exercise the agent loop without involving a model. It handles:

  • pick (up) the <obj>
  • pick (up) the <obj> (and|then|,) (put|place) (it) on (the) <obj2>
  • stack <obj> on (top of) <obj2>
  • push the <obj> forward|back|left|right
  • pour <obj> into <obj2>
  • tap/press/poke/touch the <obj>
  • open/close the <drawer>
  • (go) home

Object names are fuzzy-matched against scene object IDs (exact → substring → word-overlap).

Used by the browser viewer (no API key required), the benchmark runner, and tests. If your prompt hits the pattern grammar it works immediately; for anything outside that grammar, switch to VLMPlanner.

Agent loop

IDLE → PLAN → EXECUTE (one skill at a time) → EVALUATE →
                   │ success                      │ failure
                   ▼                              ▼
                 next in plan                   REPLAN ─► (max N times)
                   │                              │
                   ▼                              ▼
                 DONE                           FAILED
from robosandbox.agent.agent import Agent
from robosandbox.agent.context import AgentContext
from robosandbox.agent.planner import StubPlanner
from robosandbox.skills.pick import Pick
from robosandbox.skills.home import Home

skills = [Pick(), Home()]
agent = Agent(ctx=ctx, skills=skills, planner=StubPlanner(skills))
episode = agent.run("pick up the red cube")
# episode.success, episode.steps, episode.replans, episode.plan, ...

The replan behaviour is:

  • On any SkillResult(success=False), the agent collects the failure (step_idx, skill, args, reason, reason_detail) into prior_attempts.
  • The planner is called again with prior_attempts so a VLM can avoid repeating the same move. The stub planner ignores it (it's deterministic).
  • The loop terminates on replans >= max_replans with final_reason="replan_exhausted".

Composing your own

Custom planner: any object with a plan(task, obs, prior_attempts) -> (list[SkillCall], int) method. See examples/custom_skill.py for a TapStub that emits one skill call unconditionally.

Custom skill: tutorial. The Tap example in the tutorial is under 40 lines.