Teleoperation

Teleoperation is the control of a robot from a distance — in the context of imitation learning, it means a human operator physically manipulates a leader arm while a follower robot mirrors those movements in real time, recording every joint angle and camera frame as a training demonstration. The goal is not task completion for its own sake: it is capturing high-quality, consistent human behavior that a policy can later learn to imitate.

Leader-Follower Setup

The most common physical setup for imitation learning uses mechanically identical arms in a leader-follower configuration:

Human grips leader arm
        │
        ▼
Joint angles sampled at 30 Hz via serial bus
        │
        ▼
Follower arm mirrors movements in real time ──────► Cameras record scene
        │                                                  │
        └────────────────────────┬─────────────────────────┘
                                 ▼
        All data written to training dataset (Parquet + video)

Why physically coupled over a joystick?

A joystick or spacemouse requires the operator to mentally translate “move joystick left” into “rotate wrist joint 3.” A leader arm eliminates that translation layer entirely — the human moves naturally and the follower copies exactly. This produces demonstrations with the same dynamic profile (velocities, accelerations, gripper force timing) that the policy will need to reproduce. Joystick control introduces an operator-specific encoding step that the model also has to learn to undo.
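The mirror-and-record loop described above can be sketched in a few lines. This is a minimal illustration, not any specific library's API: `leader`, `follower`, `cameras`, and `dataset` are hypothetical driver objects, and the method names (`read_joint_positions`, `set_joint_targets`, `add_frame`) are assumptions standing in for whatever stack you use.

```python
import time

CONTROL_HZ = 30  # matches the 30 Hz sampling rate from the diagram above

def teleop_loop(leader, follower, cameras, dataset, duration_s=10.0):
    """Mirror leader joint angles onto the follower and log every frame.

    All four arguments are hypothetical interfaces; only the loop
    structure (read -> command -> record -> hold the control rate)
    is the point of this sketch.
    """
    period = 1.0 / CONTROL_HZ
    t_end = time.monotonic() + duration_s
    while time.monotonic() < t_end:
        t0 = time.monotonic()
        joints = leader.read_joint_positions()  # 6 floats, radians
        follower.set_joint_targets(joints)      # follower copies the human's motion
        dataset.add_frame({
            "observation.state": follower.read_joint_positions(),
            "action": joints,  # what the leader arm commanded
            **{f"observation.images.{name}": cam.read()
               for name, cam in cameras.items()},
        })
        # Sleep off the remainder of the control period to hold 30 Hz.
        time.sleep(max(0.0, period - (time.monotonic() - t0)))
```

Because the leader's joint readings are written directly as the `action`, the dataset preserves the human's velocities and accelerations with no intermediate encoding for the model to undo.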

What Gets Recorded

Each frame in a demonstration captures four streams simultaneously:

| Field | Contents | Size per frame |
|---|---|---|
| `observation.state` | Joint positions — 6 floats in radians (shoulder_pan, shoulder_lift, elbow_flex, wrist_flex, wrist_roll, gripper) | ~48 bytes |
| `action` | Target joint positions — same 6 floats, representing what the leader arm commanded | ~48 bytes |
| `observation.images.overhead` | 640×480 RGB frame from the overhead camera | ~900 KB uncompressed |
| `observation.images.side` | 640×480 RGB frame from the side camera (optional but strongly recommended) | ~900 KB uncompressed |

Storage estimate: One 10-second demonstration at 30 fps = 300 frames = approximately 540 MB of uncompressed video (two cameras) plus about 28 KB of joint data (state and action streams combined). In practice, video is encoded to H.264 or AV1 at recording time, shrinking each episode to 5–15 MB.
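The estimate above is easy to verify from the per-frame sizes in the table. A quick back-of-the-envelope check (the ~540 MB figure in the text comes from rounding a raw 640×480×3 frame down to ~900 KB):

```python
FPS = 30
DURATION_S = 10
FRAMES = FPS * DURATION_S              # 300 frames per episode

# Per-frame sizes from the table above.
JOINT_BYTES = 6 * 8                    # 6 float64 values = 48 bytes
RAW_RGB_BYTES = 640 * 480 * 3          # one uncompressed RGB frame, ~900 KB
CAMERAS = 2                            # overhead + side

raw_video_mb = FRAMES * CAMERAS * RAW_RGB_BYTES / 1e6
joint_kb = FRAMES * 2 * JOINT_BYTES / 1e3   # state + action streams

print(f"{raw_video_mb:.0f} MB raw video")   # ~553 MB (≈540 MB with 900 KB rounding)
print(f"{joint_kb:.1f} KB joint data")      # 28.8 KB
```

The four-orders-of-magnitude gap between video and joint data is why encoding to H.264/AV1 at recording time matters: it is the video streams, not the joint logs, that dominate storage.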

Data Collection Rules

These rules are not suggestions. Each one was learned by recording a batch of episodes, training on them, watching the model fail at evaluation, and tracing the failure back to a recording mistake.

  1. Watch the camera feeds, not the robot. The model sees what the cameras see. If your eye is on the physical arm instead of the monitor, your demonstrations will not match the camera perspective the model will use at inference time.

  2. Start moving before pressing record. Idle frames at the start of an episode teach the model that the correct initial action is to hold still. This creates a “frozen moment” bias where the deployed policy hesitates before starting a task. Begin the motion, then trigger recording.

  3. Stop recording immediately when the task is complete. Idle frames at the end are equally damaging — the model learns that a correct completion involves holding position. Stop the recording the instant the gripper releases or the task goal is met.

  4. Vary the object position across spatial zones. A model trained on 50 episodes where the target object is always in the center of the workspace will fail the moment the object moves to the left edge. Divide the workspace into a grid (near/far × left/center/right = 6 zones) and target at least 8 episodes per zone.

  5. Complete the full task on every episode. A demonstration that shows the robot reaching for an object but not grasping it teaches the policy that reaching is the goal. Every episode must include the complete sequence: approach, grasp, transport, and release (or whatever the full task is). Discard any episode that stops short.
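Rule 4 is the easiest to drift on over a long recording session, so it helps to track coverage explicitly. A minimal sketch of the 2×3 zone checklist (the zone names and the `remaining_episodes` helper are illustrative, not part of any recording tool):

```python
from itertools import product

# near/far depth bands x left/center/right lateral bands = 6 zones,
# with a target of at least 8 episodes per zone (rule 4 above).
ZONES = [f"{d}-{s}" for d, s in product(("near", "far"),
                                        ("left", "center", "right"))]
TARGET_PER_ZONE = 8

def remaining_episodes(recorded: dict) -> dict:
    """Given per-zone episode counts, return how many each zone still needs."""
    return {z: max(0, TARGET_PER_ZONE - recorded.get(z, 0)) for z in ZONES}
```

For example, `remaining_episodes({"near-center": 10, "far-left": 3})` reports zero remaining for `near-center` (already over target), five for `far-left`, and the full eight for every zone not yet recorded, which makes the "object always in the center" failure mode visible before training rather than at evaluation.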

Hardware Options

| Method | Example hardware | Cost | Demonstration quality |
|---|---|---|---|
| Leader-follower arms | SO-101, ALOHA, Koch | $300–$20K | Best — natural human dynamics, no translation lag |
| VR controllers | Meta Quest 3, HTC Vive | $300–$1K | Good for bimanual tasks; some latency and drift |
| Joystick / SpaceMouse | 3Dconnexion SpaceMouse | $150–$400 | Easiest setup; lower quality due to operator encoding overhead |
| Kinesthetic teaching | No extra hardware | $0 | Requires gravity compensation on the robot; good for slow, precise tasks |
