# Perception & grasping
Between "pick up the red cube" and an actual joint trajectory, there are three main pieces:
- `Perception` — text + observation → `DetectedObject` list.
- `GraspPlanner` — observation + target → candidate `Grasp` list.
- `MotionPlanner` — start joints + target pose → `JointTrajectory`.
Each one is a narrow protocol, so swapping implementations is meant to be straightforward.
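Wired together, the flow looks like this. A minimal sketch using only the protocol methods on this page — the `pick_pipeline` wrapper itself is illustrative, not a shipped helper:

```python
def pick_pipeline(perception, grasp_planner, motion_planner, sim, query: str):
    """Illustrative wiring of the three protocols (not part of the library)."""
    obs = sim.observe()
    detections = perception.locate(query, obs)       # text + observation -> DetectedObject list
    if not detections:
        raise LookupError(f"nothing matched {query!r}")
    grasps = grasp_planner.plan(obs, detections[0])  # observation + target -> candidate Grasps
    # start joints + target pose -> JointTrajectory
    return motion_planner.plan(
        sim,
        start_joints=obs.robot_joints,
        target_pose=grasps[0].pose,
        constraints={"orientation": "z_down"},
    )
```

Because each argument is only used through its protocol surface, any conforming implementation can be dropped in.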
## Perception
```python
class Perception(Protocol):
    def locate(self, query: str, obs: Observation) -> list[DetectedObject]: ...
```
Two implementations ship in core:
### GroundTruthPerception
Substring match against `obs.scene_objects`. Used in the benchmark runner and by all Tier-3 skill tests, so reliability numbers measure the sandbox, not the VLM.
```python
from robosandbox.perception.ground_truth import GroundTruthPerception

perception = GroundTruthPerception()
perception.locate("red cube", obs)
# [DetectedObject(label='red_cube', pose_3d=Pose(xyz=(0.4, 0.0, 0.05), ...), confidence=1.0)]
```
Use this when you want to debug the rest of the stack without involving the model.
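The matching itself amounts to a few lines. A sketch of the idea — not the shipped code — with a simplified `DetectedObject`, and assuming the `"red cube"` → `"red_cube"` label convention shown above:

```python
from dataclasses import dataclass

@dataclass
class DetectedObject:  # simplified stand-in for the real dataclass
    label: str
    confidence: float = 1.0

def locate_by_substring(query: str, scene_labels: list[str]) -> list[DetectedObject]:
    # Ground-truth lookup: normalise the query, then substring-match scene labels.
    needle = query.strip().lower().replace(" ", "_")
    return [DetectedObject(label=l) for l in scene_labels if needle in l.lower()]
```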
### VLMPointer
Default perception in the agentic flow. Calls an OpenAI-compatible chat endpoint, passes the RGB frame plus camera intrinsics and extrinsics, asks the VLM for pixel coordinates and a bounding box, and back-projects through depth into the world frame.
Output shape the VLM must produce:
```
{"objects": [
  {"label": "...", "bbox": [x1, y1, x2, y2], "point": [cx, cy], "confidence": 0.0-1.0}
]}
```
`VLMOutputError` is raised on malformed output; the caller can choose to retry. `vlm.json_recovery.parse_json_loose` tolerates code-fenced or prose-wrapped responses.
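A caller-side retry might look like this. `VLMOutputError` is the exception named above, redefined here as a stand-in so the sketch is self-contained; the helper itself is hypothetical:

```python
class VLMOutputError(Exception):
    """Stand-in for the library's exception, used only to keep the sketch runnable."""

def locate_with_retry(perception, query, obs, attempts=3):
    # Retry locate() when the VLM returns malformed output; re-raise after the
    # final attempt so the skill can fail with a clear reason.
    for attempt in range(attempts):
        try:
            return perception.locate(query, obs)
        except VLMOutputError:
            if attempt == attempts - 1:
                raise
```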
### When to use which
| Use case | Perception |
|---|---|
| Running the benchmark, regression-testing sim behaviour | `GroundTruthPerception` |
| Running the agent loop with a real VLM | `VLMPointer` |
| Deterministic tests that exercise the skill (not the model) | `GroundTruthPerception` |
A `CassetteVLMClient` (see `tests/cassettes/`) records and replays VLM responses, letting `VLMPointer` participate in CI without network access.
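The record/replay idea fits in a few lines. This `DictCassette` is illustrative only — the real `CassetteVLMClient` API and on-disk format may differ:

```python
import json
from pathlib import Path

class DictCassette:
    # Record/replay sketch: replay a stored response when one exists,
    # otherwise ask the live client once and persist the answer.
    def __init__(self, path: Path, live_client=None):
        self.path, self.live = path, live_client
        self.tape = json.loads(path.read_text()) if path.exists() else {}

    def chat(self, prompt: str) -> str:
        if prompt in self.tape:          # replay: no network needed in CI
            return self.tape[prompt]
        reply = self.live.chat(prompt)   # record: needs a live client once
        self.tape[prompt] = reply
        self.path.write_text(json.dumps(self.tape))
        return reply
```

After one recording run, the tape file can be committed and replayed without a live endpoint.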
## Grasping
```python
class GraspPlanner(Protocol):
    def plan(self, obs: Observation, target: DetectedObject) -> list[Grasp]: ...
```
There is one implementation in v0.1:
### AnalyticTopDown
Assumes the gripper approaches along -Z, palm down, jaws along X.
It works well enough for cubes, short cylinders, cans, and mugs picked
from above. It is not the right tool for flatter or more awkward shapes
that really need side grasps.
Key parameters:
- `default_object_width=0.04` — fallback when object bounds are unknown.
- `grasp_height_offset=0.013` — added above the object centroid when targeting the end-effector site. Tuned for v0.1's 4 cm fingers.
Output is `Grasp(pose, gripper_width, approach_offset=0.08, score)`:

- `pose` — gripper pose when closed on the object.
- `gripper_width` — width at which to close.
- `approach_offset` — distance above `pose` to approach from.
The planner returns one candidate today, and `Pick` executes it.
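The geometry behind that candidate is simple arithmetic. A sketch using the numbers above — the tuple-based pose substitute is an assumption; the real planner returns full poses with a palm-down orientation:

```python
GRASP_HEIGHT_OFFSET = 0.013   # above the centroid; tuned for v0.1's 4 cm fingers
DEFAULT_OBJECT_WIDTH = 0.04   # fallback when object bounds are unknown

def top_down_grasp(centroid_xyz, object_width=None, approach_offset=0.08):
    # Palm-down grasp: target the EE site slightly above the centroid, and
    # start the approach `approach_offset` metres above that along +Z.
    x, y, z = centroid_xyz
    grasp_xyz = (x, y, z + GRASP_HEIGHT_OFFSET)
    approach_xyz = (x, y, grasp_xyz[2] + approach_offset)
    width = DEFAULT_OBJECT_WIDTH if object_width is None else object_width
    return grasp_xyz, approach_xyz, width
```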
## Motion planning
```python
class MotionPlanner(Protocol):
    def plan(
        self,
        sim: SimBackend,
        start_joints: np.ndarray,
        target_pose: Pose,
        constraints: dict | None = None,
    ) -> JointTrajectory: ...
```
There is one implementation in v0.1:
### DLSMotionPlanner
Damped least-squares Jacobian IK + Cartesian interpolation.
- Interpolates `n_waypoints` Cartesian poses from the current EE pose to `target_pose`.
- Solves IK at each waypoint; `UnreachableError` is raised if the solver diverges.
- Supports a `{"orientation": "z_down"}` constraint that locks the palm-down quaternion — used by every pick/place skill.
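The core update is the standard damped least-squares step, `dq = Jᵀ (J Jᵀ + λ² I)⁻¹ e`. A minimal sketch — the damping value here is illustrative, not the planner's actual setting:

```python
import numpy as np

def dls_step(J: np.ndarray, error: np.ndarray, damping: float = 0.05) -> np.ndarray:
    # Damped least-squares: like the pseudo-inverse, but the lambda^2 * I term
    # keeps the solve well-conditioned near singularities instead of diverging.
    JJt = J @ J.T
    reg = (damping ** 2) * np.eye(JJt.shape[0])
    return J.T @ np.linalg.solve(JJt + reg, error)
```

Iterating this step at each interpolated waypoint, with the error taken between the waypoint pose and the current EE pose, is the essence of the planner.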
Construction:
```python
from robosandbox.motion.ik import DLSMotionPlanner

motion = DLSMotionPlanner(n_waypoints=160, dt=0.005)  # default values
```
`dt` is the per-step time; total trajectory duration is `n_waypoints * dt` (0.8 s with the defaults above). Skills call `motion.plan(...)` once per Cartesian segment.
If a GPU motion planner gets added later, this is the seam where it fits.
## Typical skill shape
Most pick/place skills follow this shape, and `Pick` is the clearest example:
```python
def __call__(self, ctx, object):
    obs = ctx.sim.observe()
    detections = ctx.perception.locate(object, obs)
    if not detections:
        return SkillResult(False, reason="not_found", reason_detail=object)
    grasps = ctx.grasp.plan(obs, detections[0])
    grasp = grasps[0]
    # approach → descend → close → lift (grasp.pose appears twice: once to
    # descend onto the object, once to close the gripper at the same pose)
    for pose in (approach_pose, grasp.pose, grasp.pose, lift_pose):
        traj = ctx.motion.plan(
            ctx.sim,
            start_joints=ctx.sim.observe().robot_joints,
            target_pose=pose,
            constraints={"orientation": "z_down"},
        )
        execute_trajectory(ctx, traj, gripper=...)
    return SkillResult(True, reason="picked", artifacts={"grasp": grasp})
```
The shared `skills._common.execute_trajectory` helper streams waypoints to the sim at `traj.dt`.
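In spirit the helper is just a timed loop. A sketch — the real signature and sim calls may differ, and `send` stands in for whatever pushes one joint target into the backend:

```python
def stream_waypoints(waypoints, dt, send):
    # Stream joint waypoints at a fixed per-step dt; returns the total time
    # represented, i.e. len(waypoints) * dt.
    t = 0.0
    for q in waypoints:
        send(q, t)   # e.g. set joint targets, then step the sim by dt
        t += dt
    return t
```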
## Related
- Skills & agents — who calls Perception / Grasp / Motion.
- Scenes & objects — what `DetectedObject.label` refers back to.