OTA Model Update

OTA (Over-the-Air) Model Update is the practice of pushing new neural network weights to a deployed robot remotely — without physical access or manual file transfer. When a training run completes and produces a better checkpoint, OTA infrastructure delivers it to every robot in the fleet automatically. This is what makes continuous improvement practical at fleet scale.

Why It Matters

Without OTA, updating a robot model looks like this:

  1. Training completes — checkpoint ready on the training server
  2. Copy checkpoint to a USB drive
  3. Walk to the robot (or SSH in individually)
  4. Stop the running policy
  5. Replace the checkpoint directory
  6. Restart the policy
  7. Verify it loaded correctly

Multiply that by 10 robots and you have a full day’s work just delivering a model update. OTA collapses this to a single operation:

Training completes → push to fleet → robots hot-swap automatically

The improvement loop — record failures, retrain, redeploy — only closes if deployment is fast and cheap. OTA is what makes that loop practical.

Hot-Swapping

The most important property of a production OTA system is that robots do not stop moving during an update. This requires hot-swapping: loading new weights into memory and switching the active model without halting the control loop.

  1. Training service pushes a checkpoint (safetensors + stats.json)
  2. Robot receives a MODEL_UPDATE message via WebSocket
  3. Robot downloads the checkpoint and verifies its SHA-256 hash
  4. New weights are loaded into memory (thread-safe)
  5. Control loop uses the new model on the next inference cycle
  6. Old weights are garbage collected

The transition happens between inference cycles — typically at a chunk boundary — so there is no interruption to robot motion. The robot completes its current action chunk, then begins predicting with the updated model.
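The swap step can be sketched as follows. This is a minimal illustration, not a reference implementation: the `HotSwapModel` and `verify_sha256` names are invented here, and a real system would load safetensors weights rather than plain callables.

```python
import hashlib
import threading


class HotSwapModel:
    """Holds the active model behind a lock so an OTA handler can
    swap in new weights between inference cycles without halting
    the control loop."""

    def __init__(self, model):
        self._lock = threading.Lock()
        self._model = model

    def predict(self, obs):
        # The control loop calls this every cycle; it always sees a
        # fully loaded model, never a half-written one.
        with self._lock:
            model = self._model
        return model(obs)

    def swap(self, new_model):
        # Called by the OTA handler after download + verification.
        # The swap is a single reference assignment under the lock,
        # so it completes between inference cycles.
        with self._lock:
            self._model = new_model
        # The old weights are garbage collected once no in-flight
        # inference still holds a reference to them.


def verify_sha256(path, expected_hex):
    """Verify a downloaded checkpoint against its published hash
    before loading it."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_hex
```

The lock is held only for a reference read or assignment, never during inference itself, so the control loop's cycle time is unaffected by an update arriving mid-task.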

Rollback

Deploying a bad model to a production robot is a real risk. OTA systems need three things to handle it safely:

  • Retain N recent checkpoints on the robot — so rollback does not require another download
  • A success metric that can be evaluated autonomously — task completion rate, anomaly score, or a simple watchdog that detects the robot has entered an unexpected state
  • A remote rollback command — the same channel used to push new models should be able to instruct a robot to revert to the previous checkpoint

Without automated rollback, a bad deployment requires the same manual intervention OTA was meant to eliminate.
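The "retain N checkpoints" requirement amounts to a small bounded history on the robot. A sketch, with invented names (`CheckpointStore`, `install`, `rollback`) standing in for whatever the real OTA agent exposes:

```python
from collections import deque


class CheckpointStore:
    """Keeps the N most recent checkpoint identifiers on-device so
    a remote rollback command never requires a fresh download."""

    def __init__(self, keep=3):
        self._history = deque(maxlen=keep)  # oldest .. newest
        self.active = None

    def install(self, checkpoint_id):
        # Called after a verified OTA download succeeds.
        self._history.append(checkpoint_id)
        self.active = checkpoint_id

    def rollback(self):
        # Revert to the previous retained checkpoint, if one exists.
        # Returns the checkpoint now active, or None if there is
        # nothing to fall back to.
        if len(self._history) < 2:
            return None
        self._history.pop()  # discard the bad checkpoint
        self.active = self._history[-1]
        return self.active
```

Because `deque(maxlen=keep)` evicts the oldest entry automatically, disk usage stays bounded no matter how many updates the robot receives.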

Fleet Deployment

For fleets larger than a handful of robots, staged rollouts reduce the blast radius of a bad update:

Training completes → new checkpoint available
Deploy to 1 robot (canary)
Evaluate for 24 hours

  • Success rate OK? → Deploy to the full fleet
  • Failure rate high? → Rollback the canary and investigate

The canary period catches regressions before they affect the whole fleet. The evaluation window depends on task frequency — a robot that runs a task 100 times per hour generates signal faster than one that runs it 5 times per day.
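The promote/rollback decision above reduces to a simple threshold check. A hedged sketch, where the function name, the 5% margin, and the minimum-attempts gate are all illustrative assumptions rather than established values:

```python
def canary_decision(successes, attempts, baseline_rate,
                    margin=0.05, min_attempts=50):
    """Decide the fate of a canary deployment.

    Returns "extend" while there is too little signal, "promote"
    if the canary's success rate is within `margin` of the fleet
    baseline, and "rollback" otherwise.
    """
    if attempts < min_attempts:
        # A robot running 5 tasks/day needs a longer window than
        # one running 100 tasks/hour to reach min_attempts.
        return "extend"
    rate = successes / attempts
    if rate >= baseline_rate - margin:
        return "promote"
    return "rollback"
```

The `min_attempts` gate captures the point about task frequency: the evaluation window is really a sample-size requirement, and wall-clock duration is just how long it takes a given robot to reach it.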

Sources

  • AWS IoT Greengrass OTA Updates — AWS’s documentation on over-the-air deployment patterns for edge devices, covering delta updates, rollback, and staged rollouts
  • NVIDIA Isaac ROS — NVIDIA’s ROS 2 packages for robot autonomy; the broader Isaac platform includes OTA update infrastructure for Jetson-based robots