Robot Training Dataset
A robot training dataset is the collection of recorded teleoperation demonstrations used to train an imitation learning policy. It’s the bridge between human expertise and robot capability — every successful policy starts with a human showing the robot what to do, and the dataset is the permanent record of those demonstrations.
Quality beats quantity. A small set of clean, diverse demonstrations consistently outperforms a large set of mixed-quality recordings.
Structure: Episodes
A dataset is organized as a collection of episodes — one recording per demonstration attempt.
```
dataset/
  data/chunk-000/
    episode_000000.parquet      ← 300 rows (10s at 30fps)
    episode_000001.parquet
    ...
  videos/chunk-000/
    observation.images.overhead_episode_000000.mp4
    observation.images.side_episode_000000.mp4
    ...
  meta/
    info.json     ← dataset schema, camera configs
    stats.json    ← per-column mean/std (for normalization)
```

Each row in a Parquet file represents one timestep at 30 fps (~33 ms per frame). Key columns:
| Column | Description |
|---|---|
| `observation.state` | Robot joint positions at this timestep |
| `action` | Joint positions commanded at this timestep |
| `episode_index` | Which episode this row belongs to |
| `frame_index` | Frame number within the episode |
| `timestamp` | Time in seconds from episode start |
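As a concrete illustration, the per-column statistics stored in `meta/stats.json` can be reproduced with pandas and NumPy. The snippet below builds a synthetic episode matching the schema above instead of reading a real file (in practice you would call `pd.read_parquet` on an actual `episode_*.parquet`); the 300-row, 6-joint shape is an assumption for illustration only:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one episode's Parquet file: 300 rows = 10 s at 30 fps.
# Real usage: df = pd.read_parquet("dataset/data/chunk-000/episode_000000.parquet")
n_frames, n_joints = 300, 6
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "observation.state": list(rng.normal(size=(n_frames, n_joints))),
    "action": list(rng.normal(size=(n_frames, n_joints))),
    "episode_index": 0,
    "frame_index": np.arange(n_frames),
    "timestamp": np.arange(n_frames) / 30.0,  # seconds from episode start
})

# Per-column mean/std of the kind kept in meta/stats.json, used to
# normalize observations and actions before training
states = np.stack(df["observation.state"].to_numpy())
state_mean, state_std = states.mean(axis=0), states.std(axis=0)
print(state_mean.shape)  # one value per joint
```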
Episode Quality Criteria
Every episode is either clean training signal or noise. There is no middle ground.
| Criterion | Why It Matters |
|---|---|
| Complete task | An incomplete demonstration teaches the model a partial behavior. For pick-and-place, the robot must complete the full grasp-and-place sequence, not just reach. |
| No idle frames at start | Frames where the robot is stationary before moving teach the model to hesitate. Start recording only after the robot is already in motion. |
| No idle frames at end | Frames after task completion teach the model to hold still once done — which is wrong in a looping policy. Stop recording the instant the task completes. |
| Consistent technique | Varying your grasp angle or approach direction across episodes creates contradictory training signal. Pick one technique and repeat it. |
| Full gripper cycle (pick tasks) | For any pick-and-place task: the gripper must open → close → open within the episode. A recording where the gripper never closes means the robot never grasped — that demonstration is training the model to fail. |
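Two of these criteria can be checked automatically. The sketch below assumes joint positions are stored as one `(T, D)` array per episode and that the gripper is one dimension of the state vector with low values meaning "closed" — both are assumptions about your layout, and the thresholds are illustrative:

```python
import numpy as np

def trim_idle_frames(states: np.ndarray, motion_thresh: float = 1e-3) -> np.ndarray:
    """Drop stationary frames at the start and end of an episode.

    states: (T, D) array of joint positions, one row per frame.
    A frame counts as 'moving' if any joint changed by more than
    motion_thresh since the previous frame.
    """
    deltas = np.abs(np.diff(states, axis=0)).max(axis=1)
    moving = np.flatnonzero(deltas > motion_thresh)
    if moving.size == 0:
        return states[:0]          # entire episode is idle
    return states[moving[0] : moving[-1] + 2]

def gripper_cycled(states: np.ndarray, gripper_dim: int = -1,
                   closed_thresh: float = 0.2) -> bool:
    """Check the open -> close -> open cycle for a pick task.

    Assumes the gripper joint is one column of the state vector and
    that values below closed_thresh mean 'closed'.
    """
    g = states[:, gripper_dim]
    closed = np.flatnonzero(g < closed_thresh)
    if closed.size == 0:
        return False               # gripper never closed: no grasp happened
    # Require an open phase both before the first close and after the last
    opened_before = g[: closed[0]].max(initial=-np.inf) >= closed_thresh
    opened_after = g[closed[-1] + 1 :].max(initial=-np.inf) >= closed_thresh
    return bool(opened_before and opened_after)
```

An episode failing either check is a candidate for deletion rather than repair: trimming can salvage idle frames, but a missing grasp cannot be fixed after the fact.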
Spatial Diversity
A policy trained on a single object position will only work at that position. Divide your workspace into zones and collect roughly equal coverage across all of them.
```
┌────────┬────────┬───────┐
│   Z1   │   Z2   │  Z3   │  ← far (25cm from base)
├────────┼────────┼───────┤
│   Z4   │   Z5   │  Z6   │  ← near (15cm from base)
└────────┴────────┴───────┘
   left    center    right
```

Aim for ~8 episodes per zone for a 50-episode dataset. Also vary object orientation (±45°) within each zone — a policy that only saw objects aligned with the camera axis will fail when the object is rotated.
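Zone coverage is easy to track if you log the object's position for each episode. A minimal sketch, assuming `x` is the lateral offset from the base centerline in cm (negative = left) and `y` is the distance from the base in cm; the zone boundaries here are hypothetical and should be tuned to your actual workspace:

```python
from collections import Counter

def assign_zone(x_cm: float, y_cm: float) -> str:
    """Map an object position to one of the six zones in the grid above.

    Illustrative boundaries: y >= 20 cm is the 'far' row (Z1-Z3),
    otherwise 'near' (Z4-Z6); |x| <= 5 cm is the center column.
    """
    row = 0 if y_cm >= 20 else 1
    col = 0 if x_cm < -5 else (1 if x_cm <= 5 else 2)
    return f"Z{row * 3 + col + 1}"

def coverage(positions) -> Counter:
    """Count episodes per zone to spot under-covered regions."""
    return Counter(assign_zone(x, y) for x, y in positions)
```

After each collection session, a quick look at `coverage(...)` shows which zones are falling behind the ~8-episodes-per-zone target.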
How Many Episodes?
| Task Type | Minimum | Reliable |
|---|---|---|
| Simple pick-place, fixed position | 20 | 50 |
| Pick-place with spatial diversity | 50 | 150+ |
| Complex manipulation | 100 | 300+ |
| Bimanual coordination | 200 | 500+ |
These are rough guidelines. More important than raw count is that every episode passes the quality criteria above: 50 clean episodes beat 200 mixed-quality ones.
Dataset Formats
| Format | Storage | Notes |
|---|---|---|
| LeRobot v3 | Parquet + MP4 | HuggingFace-native format, used by most modern imitation learning research |
| RLDS | TensorFlow Datasets | Common in academic robotics, used by RT-2 and Open X-Embodiment |
| MCAP | Binary (ROS 2) | Raw recording format before conversion. Captures everything including topics not needed for training. |
| HDF5 | Binary | Legacy format from the original ACT paper; still used by some labs |
Sources
- LeRobot Dataset Format — Official documentation for the LeRobot v3 Parquet + video format
- ACT: Action Chunking with Transformers — Original paper introducing the HDF5 dataset format for bimanual manipulation