
Robot Training Dataset


A robot training dataset is the collection of recorded teleoperation demonstrations used to train an imitation learning policy. It’s the bridge between human expertise and robot capability — every successful policy starts with a human showing the robot what to do, and the dataset is the permanent record of those demonstrations.

Quality beats quantity. A small set of clean, diverse demonstrations consistently outperforms a large set of mixed-quality recordings.

Structure: Episodes

A dataset is organized as a collection of episodes — one recording per demonstration attempt.

dataset/
├── data/chunk-000/
│   ├── episode_000000.parquet        ← 300 rows (10 s at 30 fps)
│   ├── episode_000001.parquet
│   └── ...
├── videos/chunk-000/
│   ├── observation.images.overhead_episode_000000.mp4
│   ├── observation.images.side_episode_000000.mp4
│   └── ...
└── meta/
    ├── info.json                     ← dataset schema, camera configs
    └── stats.json                    ← per-column mean/std (for normalization)
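The mean/std pairs in stats.json exist so training code can normalize inputs consistently. A minimal sketch of that use (the paths follow the layout above; the exact JSON field names `mean`/`std` and the epsilon are assumptions, not a spec):

```python
import json
from pathlib import Path

import numpy as np


def load_stats(dataset_root: str) -> dict:
    """Read per-column statistics from meta/stats.json (path per the layout above)."""
    with open(Path(dataset_root) / "meta" / "stats.json") as f:
        return json.load(f)


def normalize(values: np.ndarray, stats: dict, column: str) -> np.ndarray:
    """Standard-score one column using the dataset-level mean/std."""
    mean = np.asarray(stats[column]["mean"], dtype=np.float64)
    std = np.asarray(stats[column]["std"], dtype=np.float64)
    return (values - mean) / (std + 1e-8)  # epsilon guards zero-variance joints
```

The same statistics must be applied at inference time, or the policy sees inputs on a different scale than it was trained on.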

Each row in a Parquet file represents one timestep at 30 fps (≈33 ms per frame). Key columns:

| Column | Description |
| --- | --- |
| `observation.state` | Robot joint positions at this timestep |
| `action` | Joint positions commanded at this timestep |
| `episode_index` | Which episode this row belongs to |
| `frame_index` | Frame number within the episode |
| `timestamp` | Time in seconds from episode start |
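A quick sanity check on an episode file is to confirm that `frame_index` counts from zero and `timestamp` advances at the 30 fps rate. A sketch, assuming the columns above and pandas for Parquet I/O:

```python
import numpy as np
import pandas as pd

FPS = 30
DT = 1.0 / FPS  # ~33 ms per frame


def check_episode(df: pd.DataFrame, tol: float = 1e-3) -> bool:
    """Verify frame_index is contiguous from 0 and timestamps match the fps pacing."""
    frames = df["frame_index"].to_numpy()
    times = df["timestamp"].to_numpy()
    contiguous = np.array_equal(frames, np.arange(len(frames)))
    paced = np.max(np.abs(times - frames * DT)) <= tol
    return bool(contiguous and paced)


# Hypothetical usage on one episode file:
# df = pd.read_parquet("dataset/data/chunk-000/episode_000000.parquet")
# assert check_episode(df)
```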

Episode Quality Criteria

Every episode is either clean training signal or noise. There is no middle ground.

| Criterion | Why It Matters |
| --- | --- |
| Complete task | An incomplete demonstration teaches the model a partial behavior. For pick-and-place, the robot must complete the full grasp-and-place sequence, not just reach. |
| No idle frames at start | Frames where the robot is stationary before moving teach the model to hesitate. Start recording only after the robot is already in motion. |
| No idle frames at end | Frames after task completion teach the model to hold still once done, which is wrong in a looping policy. Stop recording the instant the task completes. |
| Consistent technique | Varying your grasp angle or approach direction across episodes creates contradictory training signal. Pick one technique and repeat it. |
| Full gripper cycle (pick tasks) | For any pick-and-place task, the gripper must open → close → open within the episode. A recording where the gripper never closes means the robot never grasped; that demonstration is training the model to fail. |
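The idle-frame and gripper-cycle criteria are mechanical enough to check automatically. A sketch, assuming per-frame joint state and a single normalized gripper channel where low values mean "closed" (the column layout and thresholds are illustrative, not part of any format):

```python
import numpy as np


def leading_idle_frames(state: np.ndarray, eps: float = 1e-3) -> int:
    """Count stationary frames at the episode start (joint deltas below eps)."""
    deltas = np.linalg.norm(np.diff(state, axis=0), axis=1)
    moving = deltas > eps
    return int(np.argmax(moving)) if moving.any() else len(deltas)


def gripper_cycles(gripper: np.ndarray, closed_below: float = 0.3) -> bool:
    """Check the gripper opens, closes, and opens again within one episode.

    Assumes `gripper` is the per-frame gripper command (e.g. the last action
    dimension), scaled so values under `closed_below` mean closed -- both the
    channel and the threshold are assumptions for illustration.
    """
    closed = gripper < closed_below
    first_close = np.argmax(closed) if closed.any() else -1
    if first_close <= 0:  # never closed, or already closed at frame 0
        return False
    return bool((~closed[first_close:]).any())  # reopened after the grasp
```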

Spatial Diversity

A policy trained on a single object position will only work at that position. Divide your workspace into zones and collect roughly equal coverage across all of them.

┌───────┬────────┬───────┐
│  Z1   │   Z2   │  Z3   │  ← far (25 cm from base)
├───────┼────────┼───────┤
│  Z4   │   Z5   │  Z6   │  ← near (15 cm from base)
└───────┴────────┴───────┘
  left    center   right

Aim for ~8 episodes per zone for a 50-episode dataset. Also vary object orientation (±45°) within each zone — a policy that only saw objects aligned with the camera axis will fail when the object is rotated.
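Zone coverage can be tallied from recorded object positions, which makes gaps obvious before training. A sketch, with zone boundaries chosen to match the diagram above (the coordinate frame, 10 cm column bands, and 20 cm near/far split are illustrative assumptions):

```python
from collections import Counter


def zone_of(x_cm: float, y_cm: float) -> str:
    """Map an object position to one of the six zones in the diagram.

    Assumes x is the left/right offset (-15..+15 cm, three 10 cm bands)
    and y is the distance from the robot base.
    """
    col = min(int((x_cm + 15) // 10), 2)  # 0 = left, 1 = center, 2 = right
    row = 0 if y_cm >= 20 else 1          # 0 = far (Z1-Z3), 1 = near (Z4-Z6)
    return f"Z{row * 3 + col + 1}"


def coverage(positions: list[tuple[float, float]]) -> Counter:
    """Tally episodes per zone so uneven spatial coverage is visible."""
    return Counter(zone_of(x, y) for x, y in positions)
```

Running `coverage` over a planned 50-episode dataset and eyeballing the counts is usually enough to spot a zone that is under-represented.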

How Many Episodes?

| Task Type | Minimum | Reliable |
| --- | --- | --- |
| Simple pick-place, fixed position | 20 | 50 |
| Pick-place with spatial diversity | 50 | 150+ |
| Complex manipulation | 100 | 300+ |
| Bimanual coordination | 200 | 500+ |

These are rough guidelines. More important than raw count: all episodes must pass quality criteria. 50 clean episodes beat 200 mixed-quality ones.

Dataset Formats

| Format | Storage | Notes |
| --- | --- | --- |
| LeRobot v3 | Parquet + MP4 | HuggingFace-native format, used by most modern imitation learning research |
| RLDS | TensorFlow Datasets | Common in academic robotics, used by RT-2 and Open X-Embodiment |
| MCAP | Binary (ROS 2) | Raw recording format before conversion. Captures everything, including topics not needed for training. |
| HDF5 | Binary | Legacy format from the original ACT paper; still used by some labs |
