Computer Vision

Conceptual

Computer vision is the set of algorithms and techniques that extract useful information from images and video. For robots, it answers the questions that make autonomous operation possible: Where are the objects? How far away are they? What is this thing in front of me?

Cameras are cheap and information-dense — a single RGB frame contains millions of data points about the environment. But raw pixels mean nothing without processing. Computer vision is the translation layer between sensor data and robot understanding.

The Vision Pipeline

Raw image (pixels)
└─ Preprocessing (resize, normalize, denoise)
└─ Feature extraction (edges, keypoints, semantic regions)
└─ Interpretation (depth, object detection, segmentation)
└─ Robot action (grasp pose, navigation waypoint, avoidance)

Each stage reduces noise and increases abstraction. The robot never reasons about pixels directly — it reasons about objects, distances, and spatial relationships derived from them.
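To make the first stage concrete, here is a minimal sketch of a preprocessing step (resize plus normalize) using only numpy; the function name, nearest-neighbour resizing, and target size are illustrative assumptions, not a prescribed implementation:

```python
import numpy as np

def preprocess(frame: np.ndarray, size=(224, 224)) -> np.ndarray:
    """Illustrative preprocessing: nearest-neighbour resize to `size`
    and normalization of pixel values to [0, 1]."""
    h, w = frame.shape[:2]
    ys = np.linspace(0, h - 1, size[0]).astype(int)   # row indices to sample
    xs = np.linspace(0, w - 1, size[1]).astype(int)   # column indices to sample
    resized = frame[np.ix_(ys, xs)]                   # keeps the channel axis
    return resized.astype(np.float32) / 255.0
```

Real pipelines typically use a library resize (bilinear or area interpolation) and per-channel mean/std normalization, but the shape of the stage is the same: pixels in, a fixed-size normalized tensor out.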

Key Tasks in Robotics Vision

Task                    | What it does                          | Common use
Depth estimation        | Per-pixel distance from camera        | Obstacle avoidance, grasp planning
Object detection        | Bounding boxes around known classes   | "Find the cup"
Semantic segmentation   | Per-pixel class labels                | Floor vs. obstacle vs. table
Keypoint detection      | Specific points on objects or humans  | Pose estimation, manipulation
Optical flow            | Pixel motion between frames           | Visual odometry, tracking
Feature matching        | Same point across frames              | SLAM, relocalization

No single task is sufficient on its own. A manipulation robot typically needs detection (find the object), depth estimation (how far), and keypoint detection (where to grasp) working together.

Depth from Cameras

Two approaches dominate in robotics:

Stereo cameras use two cameras at a known horizontal offset (the baseline). Matching the same point in both images gives a disparity value; disparity converts to depth via the focal length and baseline (Z = f · B / d). No active illumination is required, the approach works well outdoors, and effective range reaches 20–40 m depending on the baseline length.
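The disparity-to-depth conversion is a one-liner worth seeing in code. This sketch assumes a rectified stereo pair, a focal length in pixels, and a baseline in metres (the function name is illustrative):

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Pinhole stereo geometry: Z = f * B / d.
    Zero disparity means the point is at infinity (or unmatched)."""
    d = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full_like(d, np.inf)       # default: no valid match
    valid = d > 0
    depth[valid] = focal_px * baseline_m / d[valid]
    return depth
```

With a 700 px focal length and a 12 cm baseline, a disparity of 8.4 px corresponds to a depth of 10 m; note how small disparities map to large depths, which is why depth precision degrades quadratically with range.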

RGB-D cameras (Intel RealSense, Microsoft Azure Kinect) emit structured light or measure time-of-flight to produce a dense depth map alongside the color image. They work well indoors at ranges up to ~5 m and require no stereo matching computation, but they are sensitive to sunlight interference.
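Whichever sensor produces the depth map, the robot usually needs 3D points in the camera frame, obtained by back-projecting a pixel through the pinhole model. A minimal sketch, assuming known intrinsics (fx, fy in pixels, principal point cx, cy); the function name is hypothetical:

```python
import numpy as np

def deproject(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection: pixel (u, v) with depth Z
    maps to the camera-frame point (X, Y, Z)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])
```

A pixel at the principal point always maps to a point straight ahead on the optical axis; vendor SDKs (e.g. RealSense) provide equivalent deprojection functions that also account for lens distortion.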

Deep Learning for Vision

Modern robotics vision is dominated by two neural network architectures:

Convolutional Neural Networks (CNNs) are the standard architecture for image classification and object detection. Stacked convolutional layers detect increasingly abstract features: early layers find edges and textures, deeper layers find object parts and whole objects. CNNs are computationally efficient and well-suited to local spatial patterns.
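The core operation of a convolutional layer is a small kernel slid across the image. This numpy sketch applies a Sobel-style kernel, the kind of vertical-edge detector that early CNN layers learn on their own (a plain loop implementation for clarity, not an efficient one; strictly speaking it computes cross-correlation, as deep learning frameworks do):

```python
import numpy as np

# Hand-crafted vertical-edge kernel; trained CNNs learn similar filters.
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float32)

def conv2d(image, kernel):
    """'Valid' 2-D cross-correlation: slide the kernel over the image
    and take the weighted sum at each position."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1), dtype=np.float32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out
```

Running this on an image with a sharp left/right brightness step produces a strong response only at the edge columns and zero in flat regions, which is exactly the "early layers find edges" behavior described above.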

Vision Transformers (ViTs) treat an image as a sequence of fixed-size patches and apply self-attention across them. ViTs are better than CNNs at capturing global context — relationships between distant regions of an image. They are the backbone of modern Vision-Language-Action (VLA) models, where visual understanding must integrate with language and motor control.
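The patch-sequence idea is easy to show concretely. This sketch performs the ViT "tokenization" step, splitting an image into non-overlapping patches and flattening each one (in a real ViT each flattened patch is then linearly projected and given a position embedding, which is omitted here):

```python
import numpy as np

def image_to_patches(img, patch=16):
    """Split an H x W x C image into a sequence of flattened patches.
    H and W must be multiples of `patch`."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0
    grid = img.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)        # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)  # (num_patches, patch_dim)
```

A 224 x 224 RGB image with 16-pixel patches becomes a sequence of 196 tokens of dimension 768; self-attention then lets every patch attend to every other, which is where the global-context advantage over CNNs comes from.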

NVIDIA Acceleration

Isaac ROS includes GPU-accelerated vision packages that run as CUDA-accelerated ROS 2 nodes on Jetson:

  • RTMDet: Real-time object detection and instance segmentation
  • Pose estimation: 6-DoF object pose from a single RGB image
  • ESS (Efficient Stereo Depth): Deep learning stereo depth using DNN rather than classical disparity matching
  • AprilTag detection: Fiducial marker detection for localization and camera calibration

All packages expose standard ROS 2 interfaces, making them composable with SLAM, navigation, and manipulation pipelines.

Sources