Teleoperation

Teleoperation is the control of a robot from a distance — in the context of imitation learning, it means a human operator physically manipulates a leader arm while a follower robot mirrors those movements in real time, recording every joint angle and camera frame as a training demonstration. The goal is not task completion for its own sake: it is capturing high-quality, consistent human behavior that a policy can later learn to imitate.

Leader-Follower Setup

The most common physical setup for imitation learning uses mechanically identical arms in a leader-follower configuration:

Human grips leader arm
        │
        ▼
Joint angles sampled at 30 Hz via serial bus
        │
        ▼
Follower arm mirrors movements in real time ──────► Cameras record scene
        │                                                  │
        └────────────────────────┬─────────────────────────┘
                                 ▼
        All data written to training dataset (Parquet + video)

Why physically coupled over a joystick?

A joystick or spacemouse requires the operator to mentally translate “move joystick left” into “rotate wrist joint 3.” A leader arm eliminates that translation layer entirely — the human moves naturally and the follower copies exactly. This produces demonstrations with the same dynamic profile (velocities, accelerations, gripper force timing) that the policy will need to reproduce. Joystick control introduces an operator-specific encoding step that the model also has to learn to undo.
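The mirror-and-record loop described above can be sketched in a few lines. This is a minimal illustration, not any specific library's API: `leader`, `follower`, `cameras`, and `dataset` are hypothetical driver objects, and the method names (`read_joint_positions`, `set_joint_targets`, `add_frame`) are assumptions standing in for whatever stack you use.

```python
import time

CONTROL_HZ = 30  # matches the 30 Hz sampling rate from the diagram above

def teleop_loop(leader, follower, cameras, dataset, duration_s=10.0):
    """Mirror leader joint angles onto the follower and log every frame.

    All four arguments are hypothetical interfaces; only the loop
    structure (read -> command -> record -> hold the control rate)
    is the point of this sketch.
    """
    period = 1.0 / CONTROL_HZ
    t_end = time.monotonic() + duration_s
    while time.monotonic() < t_end:
        t0 = time.monotonic()
        joints = leader.read_joint_positions()  # 6 floats, radians
        follower.set_joint_targets(joints)      # follower copies the human's motion
        dataset.add_frame({
            "observation.state": follower.read_joint_positions(),
            "action": joints,  # what the leader arm commanded
            **{f"observation.images.{name}": cam.read()
               for name, cam in cameras.items()},
        })
        # Sleep off the remainder of the control period to hold 30 Hz.
        time.sleep(max(0.0, period - (time.monotonic() - t0)))
```

Because the leader's joint readings are written directly as the `action`, the dataset preserves the human's velocities and accelerations with no intermediate encoding for the model to undo.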

What Gets Recorded

Each frame in a demonstration captures four streams simultaneously:

| Field | Contents | Size per frame |
|---|---|---|
| `observation.state` | Joint positions — 6 floats in radians (shoulder_pan, shoulder_lift, elbow_flex, wrist_flex, wrist_roll, gripper) | ~48 bytes |
| `action` | Target joint positions — same 6 floats, representing what the leader arm commanded | ~48 bytes |
| `observation.images.overhead` | 640×480 RGB frame from the overhead camera | ~900 KB uncompressed |
| `observation.images.side` | 640×480 RGB frame from the side camera (optional but strongly recommended) | ~900 KB uncompressed |

Storage estimate: One 10-second demonstration at 30 fps = 300 frames = approximately 540 MB of uncompressed video (two cameras) plus about 28 KB of joint data (state and action streams combined). In practice, video is encoded to H.264 or AV1 at recording time, shrinking each episode to 5–15 MB.
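The estimate above is easy to verify from the per-frame sizes in the table. A quick back-of-the-envelope check (the ~540 MB figure in the text comes from rounding a raw 640×480×3 frame down to ~900 KB):

```python
FPS = 30
DURATION_S = 10
FRAMES = FPS * DURATION_S              # 300 frames per episode

# Per-frame sizes from the table above.
JOINT_BYTES = 6 * 8                    # 6 float64 values = 48 bytes
RAW_RGB_BYTES = 640 * 480 * 3          # one uncompressed RGB frame, ~900 KB
CAMERAS = 2                            # overhead + side

raw_video_mb = FRAMES * CAMERAS * RAW_RGB_BYTES / 1e6
joint_kb = FRAMES * 2 * JOINT_BYTES / 1e3   # state + action streams

print(f"{raw_video_mb:.0f} MB raw video")   # ~553 MB (≈540 MB with 900 KB rounding)
print(f"{joint_kb:.1f} KB joint data")      # 28.8 KB
```

The four-orders-of-magnitude gap between video and joint data is why encoding to H.264/AV1 at recording time matters: it is the video streams, not the joint logs, that dominate storage.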

Data Collection Rules

These rules are not suggestions. Each one was learned by recording a batch of episodes, training on them, watching the model fail at evaluation, and tracing the failure back to a recording mistake.

  1. Watch the camera feeds, not the robot. The model sees what the cameras see. If your eye is on the physical arm instead of the monitor, your demonstrations will not match the camera perspective the model will use at inference time.

  2. Start moving before pressing record. Idle frames at the start of an episode teach the model that the correct initial action is to hold still. This creates a “frozen moment” bias where the deployed policy hesitates before starting a task. Begin the motion, then trigger recording.

  3. Stop recording immediately when the task is complete. Idle frames at the end are equally damaging — the model learns that a correct completion involves holding position. Stop the recording the instant the gripper releases or the task goal is met.

  4. Vary the object position across spatial zones. A model trained on 50 episodes where the target object is always in the center of the workspace will fail the moment the object moves to the left edge. Divide the workspace into a grid (near/far × left/center/right = 6 zones) and target at least 8 episodes per zone.

  5. Complete the full task on every episode. A demonstration that shows the robot reaching for an object but not grasping it teaches the policy that reaching is the goal. Every episode must include the complete sequence: approach, grasp, transport, and release (or whatever the full task is). Discard any episode that stops short.
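Rule 4 is the easiest to drift on over a long recording session, so it helps to track coverage explicitly. A minimal sketch of the 2×3 zone checklist (the zone names and the `remaining_episodes` helper are illustrative, not part of any recording tool):

```python
from itertools import product

# near/far depth bands x left/center/right lateral bands = 6 zones,
# with a target of at least 8 episodes per zone (rule 4 above).
ZONES = [f"{d}-{s}" for d, s in product(("near", "far"),
                                        ("left", "center", "right"))]
TARGET_PER_ZONE = 8

def remaining_episodes(recorded: dict) -> dict:
    """Given per-zone episode counts, return how many each zone still needs."""
    return {z: max(0, TARGET_PER_ZONE - recorded.get(z, 0)) for z in ZONES}
```

For example, `remaining_episodes({"near-center": 10, "far-left": 3})` reports zero remaining for `near-center` (already over target), five for `far-left`, and the full eight for every zone not yet recorded, which makes the "object always in the center" failure mode visible before training rather than at evaluation.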

Hardware Options

| Method | Example hardware | Cost | Demonstration quality |
|---|---|---|---|
| Leader-follower arms | SO-101, ALOHA, Koch | $300–$20K | Best — natural human dynamics, no translation lag |
| VR controllers | Meta Quest 3, HTC Vive | $300–$1K | Good for bimanual tasks; some latency and drift |
| Joystick / SpaceMouse | 3Dconnexion SpaceMouse | $150–$400 | Easiest setup; lower quality due to operator encoding overhead |
| Kinesthetic teaching | No extra hardware | $0 | Requires gravity compensation on the robot; good for slow, precise tasks |
