Model Checkpoint

Practical

A model checkpoint is a saved snapshot of a neural network’s weights at a specific point during training. For robot policies, it is the artifact you actually deploy — it encodes everything the model learned from the training dataset up to that moment. When a robot runs a trained policy, it is running a checkpoint.

What’s in a Checkpoint

A checkpoint is a directory, not a single file. For a LeRobot ACT policy trained to 100K steps, it looks like this:

```
checkpoints/
└── 100000/
    └── pretrained_model/
        ├── config.json        ← model architecture + hyperparameters
        ├── model.safetensors  ← weight values (~200MB for ACT)
        └── stats.json         ← dataset normalization statistics (mean/std per channel)
```

model.safetensors stores the weight tensors in a safe, fast-loading binary format (an alternative to PyTorch’s .pt files that avoids arbitrary code execution on load). config.json records the architecture decisions — input dimensions, number of transformer layers, chunk size — so the model can be reconstructed exactly. Neither file is useful without the other.
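To make the "neither file is useful without the other" point concrete, here is a minimal sketch of the reconstruction step: read `config.json` to recover the architecture, read `stats.json` to recover the normalization, then apply both before inference. The field names (`chunk_size`, `input_shapes`, the shape of the stats entries) are illustrative assumptions, not the exact LeRobot schema, and the weights themselves would come from `model.safetensors` via the safetensors library.

```python
import json
import tempfile
from pathlib import Path

# Build a toy checkpoint directory for illustration. Field names here are
# assumptions for the sketch, not the exact LeRobot schema.
ckpt = Path(tempfile.mkdtemp()) / "checkpoints" / "100000" / "pretrained_model"
ckpt.mkdir(parents=True)

(ckpt / "config.json").write_text(json.dumps({
    "chunk_size": 100,                              # actions per forward pass
    "input_shapes": {"observation.state": [14]},    # joint-state dimension
}))
(ckpt / "stats.json").write_text(json.dumps({
    "observation.state": {"mean": [0.1] * 14, "std": [0.5] * 14},
}))

# Reconstruction starts by reading both files back; model.safetensors would
# then be loaded against the architecture that config.json describes.
config = json.loads((ckpt / "config.json").read_text())
stats = json.loads((ckpt / "stats.json").read_text())

# Observations are normalized with the stored dataset statistics before
# being fed to the policy -- this is why stats.json must ship with the weights.
raw = [0.6] * config["input_shapes"]["observation.state"][0]
mean = stats["observation.state"]["mean"]
std = stats["observation.state"]["std"]
normalized = [(x - m) / s for x, m, s in zip(raw, mean, std)]
print(normalized[0])  # (0.6 - 0.1) / 0.5 = 1.0
```

A checkpoint trained against one set of dataset statistics will misbehave if deployed with another, which is why the stats file is versioned alongside the weights rather than recomputed at load time.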

Training Curve and Checkpoint Selection

Loss decreases over the course of training, but not linearly. A typical ACT run on a 56-episode manipulation dataset:

Step 10K → loss 0.180 (model is barely tracking structure)
Step 50K → loss 0.072 (grasping emerges)
Step 80K → loss 0.058 (reliable on common positions)
Step 100K → loss 0.049 (diminishing returns begin)
Step 120K → loss 0.051 (slight overfit — worse than 100K)

The checkpoint at lowest validation loss is the best candidate, but evaluation on the real robot is the ground truth. A checkpoint at step 80K may generalize better than 100K if the dataset is small — the lower loss can reflect memorization of the training positions rather than genuine task understanding.
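The "lowest validation loss" heuristic from the table above is a one-liner; the point is that it only nominates a candidate, which physical evaluation must then confirm or overrule:

```python
# Validation losses per training step, from the run described above.
val_loss = {
    10_000: 0.180,
    50_000: 0.072,
    80_000: 0.058,
    100_000: 0.049,
    120_000: 0.051,
}

# Lowest validation loss nominates the default candidate -- here step 100K,
# since 120K has drifted back up (the slight-overfit signal).
best_step = min(val_loss, key=val_loss.get)
print(best_step)  # 100000
```

On a small dataset, it can be worth carrying the two or three lowest-loss checkpoints into real-robot trials rather than committing to the single minimum.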

Checkpoint Lifecycle

Once training completes, a checkpoint follows a predictable path before reaching the robot:

Training completes
Checkpoint saved to disk
Sim validation (optional — reject if below success threshold)
Deploy to robot via OTA update
Physical evaluation (N trials across M positions)
If regression: rollback to previous checkpoint

The rollback step is important in production. Because checkpoints are versioned snapshots, reverting to a known-good deployment is straightforward — the robot just loads the previous checkpoint file. This is qualitatively different from a software service rollback: the failure mode is behavioral (the robot picks up objects less reliably) rather than a crash, and it may not be obvious until enough trials have accumulated.
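The evaluate-then-rollback decision above can be sketched as a threshold check on measured success rates. The function name, the margin parameter, and the trial counts are illustrative assumptions, not a LeRobot or deployment-system API:

```python
def checkpoint_decision(candidate_rate: float, baseline_rate: float,
                        margin: float = 0.05) -> str:
    """Promote the new checkpoint unless it regresses past a tolerance margin.

    candidate_rate / baseline_rate are success fractions measured over the
    same N trials x M positions grid; the margin absorbs trial-to-trial noise.
    """
    if candidate_rate + margin < baseline_rate:
        return "rollback"   # behavioral regression: reload previous checkpoint
    return "promote"

# 13/20 successes for the new checkpoint vs. 18/20 for the deployed one:
# a clear regression, so the robot reverts to the known-good snapshot.
print(checkpoint_decision(13 / 20, 18 / 20))  # rollback
print(checkpoint_decision(19 / 20, 18 / 20))  # promote
```

Because the failure mode is a degraded success rate rather than a crash, the margin and the number of trials behind each rate matter: too few trials and a healthy checkpoint can be rolled back on noise alone.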

Sources

  • LeRobot checkpoint format — HuggingFace’s open-source library for robot imitation learning; defines the pretrained_model/ directory structure used above
  • HuggingFace safetensors — the safe, fast tensor serialization format used for model.safetensors; avoids the arbitrary code execution risk of Python pickle-based .pt files