OTA Model Update
OTA (Over-the-Air) Model Update is the practice of pushing new neural network weights to a deployed robot remotely — without physical access or manual file transfer. When a training run completes and produces a better checkpoint, OTA infrastructure delivers it to every robot in the fleet automatically. This is what makes continuous improvement practical at fleet scale.
Why It Matters
Without OTA, updating a robot model looks like this:
- Training completes — checkpoint ready on the training server
- Copy checkpoint to a USB drive
- Walk to the robot (or SSH in individually)
- Stop the running policy
- Replace the checkpoint directory
- Restart the policy
- Verify it loaded correctly
Multiply that by 10 robots and you have a full day’s work just delivering a model update. OTA collapses this to a single operation:
Training completes → push to fleet → robots hot-swap automatically
The improvement loop — record failures, retrain, redeploy — only closes if deployment is fast and cheap. OTA is what makes that loop practical.
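The push step can be sketched as a small server-side helper. This is a minimal sketch, not the actual protocol: the `MODEL_UPDATE` message name comes from the flow described below, but the exact field names (`version`, `sha256`, `files`) are illustrative assumptions. The hash lets each robot verify its download before swapping.

```python
import hashlib
import json

def build_update_message(checkpoint_bytes: bytes, version: str) -> str:
    """Build a hypothetical MODEL_UPDATE message for broadcast to the fleet.

    The SHA256 digest lets each robot verify the downloaded checkpoint
    before hot-swapping it into the control loop.
    """
    digest = hashlib.sha256(checkpoint_bytes).hexdigest()
    return json.dumps({
        "type": "MODEL_UPDATE",
        "version": version,
        "sha256": digest,
        # Checkpoint layout assumed here: weights plus normalization stats.
        "files": ["model.safetensors", "stats.json"],
    })
```

A fleet server would send this string over each robot's WebSocket connection; the robot then downloads the listed files and checks the digest.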
Hot-Swapping
The most important property of a production OTA system is that robots do not stop moving during an update. This requires hot-swapping: loading new weights into memory and switching the active model without halting the control loop.
Training service pushes checkpoint (safetensors + stats.json)
        │
        ▼
Robot receives MODEL_UPDATE via WebSocket
        │
        ▼
Downloads checkpoint + verifies SHA256 hash
        │
        ▼
Loads new weights into memory (thread-safe)
        │
        ▼
Control loop uses new model on next inference cycle
        │
        ▼
Old weights garbage collected

The transition happens between inference cycles, typically at a chunk boundary, so there is no interruption to robot motion. The robot completes its current action chunk, then begins predicting with the updated model.
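The thread-safe swap step can be sketched with a lock-guarded model reference. This is an illustrative pattern, not a specific framework's API: the `ModelSlot` class and `control_loop_step` function are hypothetical names, and the lock only guards the reference swap, so the control loop never blocks for more than a pointer assignment.

```python
import threading

class ModelSlot:
    """Holds the active model so new weights can be installed while
    the control loop keeps running."""

    def __init__(self, model):
        self._lock = threading.Lock()
        self._model = model

    def swap(self, new_model):
        """Install new weights; returns the old model."""
        with self._lock:
            old, self._model = self._model, new_model
        # Once the caller drops `old`, Python's GC reclaims the weights.
        return old

    def get(self):
        with self._lock:
            return self._model

def control_loop_step(slot: ModelSlot, observation):
    # Re-read the active model at each chunk boundary, so a swap that
    # arrived mid-chunk takes effect on the next inference cycle.
    model = slot.get()
    return model(observation)
```

Because the swap happens between `get()` calls, an in-flight action chunk always finishes under the weights that predicted it.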
Rollback
Deploying a bad model to a production robot is a real risk. OTA systems need three things to handle it safely:
- Retain N recent checkpoints on the robot — so rollback does not require another download
- A success metric that can be evaluated autonomously — task completion rate, anomaly score, or a simple watchdog that detects the robot has entered an unexpected state
- A remote rollback command — the same channel used to push new models should be able to instruct a robot to revert to the previous checkpoint
Without automated rollback, a bad deployment requires the same manual intervention OTA was meant to eliminate.
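The first and third requirements can be sketched together: a small on-robot store that retains the last N checkpoints and reverts on command. The class and method names are hypothetical; a real store would track file paths rather than version strings.

```python
from collections import deque

class CheckpointStore:
    """Retains the last N checkpoints on the robot so a rollback
    needs no new download."""

    def __init__(self, keep: int = 3):
        # deque(maxlen=keep) silently drops the oldest checkpoint
        # once the retention limit is reached.
        self._history = deque(maxlen=keep)

    def install(self, version: str) -> str:
        self._history.append(version)
        return version

    def rollback(self) -> str:
        """Revert to the previous retained checkpoint (remote command)."""
        if len(self._history) < 2:
            raise RuntimeError("no previous checkpoint retained")
        self._history.pop()       # discard the bad checkpoint
        return self._history[-1]  # previous one becomes active again
```

The same WebSocket channel that delivers `MODEL_UPDATE` would deliver the rollback command that calls `rollback()`.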
Fleet Deployment
For fleets larger than a handful of robots, staged rollouts reduce the blast radius of a bad update:
Training completes → new checkpoint available
        │
        ▼
Deploy to 1 robot (canary)
        │
        ▼
Evaluate for 24 hours
        │
  ┌─────┴─────┐
  │           │
Success       Failure
rate OK?      rate high?
  │           │
  ▼           ▼
Deploy to     Rollback canary
full fleet    Investigate

The canary period catches regressions before they affect the whole fleet. The evaluation window depends on task frequency: a robot that runs a task 100 times per hour generates signal faster than one that runs it 5 times per day.
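The decision at the end of the canary window can be sketched as a simple gate. The thresholds here are illustrative assumptions, not recommended values; in practice they would be set from the task's baseline success rate and frequency.

```python
def canary_gate(successes: int, attempts: int,
                min_success_rate: float = 0.95,
                min_attempts: int = 50) -> str:
    """Decide the next rollout step after the canary evaluation window.

    A robot running a task 100 times per hour clears min_attempts in
    under an hour; one running it 5 times per day needs a multi-day window.
    """
    if attempts < min_attempts:
        return "extend_window"    # not enough signal yet
    rate = successes / attempts
    if rate >= min_success_rate:
        return "deploy_fleet"     # canary looks healthy
    return "rollback_canary"      # regression: revert and investigate
```

Returning `"extend_window"` when the sample is too small matches the point above: the evaluation window must scale with how rarely the task runs.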
Sources
- AWS IoT Greengrass OTA Updates — AWS’s documentation on over-the-air deployment patterns for edge devices, covering delta updates, rollback, and staged rollouts
- NVIDIA Isaac ROS — NVIDIA’s ROS 2 packages for robot autonomy; the broader Isaac platform includes OTA update infrastructure for Jetson-based robots