TensorRT

TensorRT is NVIDIA’s high-performance deep learning inference SDK that optimizes trained neural networks for deployment on NVIDIA GPUs. It compiles models into optimized engines that deliver low latency and high throughput through layer fusion, kernel auto-tuning, and precision calibration.

Why TensorRT Matters

  • Inference speed: Up to 40x faster than CPU inference, 18x faster than TensorFlow
  • Memory efficiency: FP16 uses roughly 50% of FP32 memory, INT8 roughly 25%
  • Edge deployment: Critical for real-time inference on Jetson platforms
  • Isaac ROS integration: Powers perception pipelines (detection, segmentation, depth estimation)
  • Production ready: Serialized engines load instantly without runtime compilation

Optimization Pipeline

┌────────────────────────────────────────────────────────────────────┐
│                   TensorRT Optimization Pipeline                    │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────────┐  │
│  │ Trained  │───►│   ONNX   │───►│ TensorRT │───►│  Optimized   │  │
│  │  Model   │    │  Export  │    │ Builder  │    │   Engine     │  │
│  │(PyTorch) │    │          │    │          │    │    (.trt)    │  │
│  └──────────┘    └──────────┘    └────┬─────┘    └──────────────┘  │
│                                       │                            │
│                          ┌────────────┼────────────┐               │
│                          │            │            │               │
│                     ┌────▼────┐ ┌─────▼────┐  ┌────▼─────┐         │
│                     │  Layer  │ │  Kernel  │  │Precision │         │
│                     │ Fusion  │ │Auto-Tune │  │  Calib.  │         │
│                     └─────────┘ └──────────┘  └──────────┘         │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘
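
The first stage of that pipeline is exporting the trained model to ONNX. A minimal sketch for a PyTorch model (the ResNet-18 example, tensor names, file name, and opset version are illustrative choices, not requirements):

import torch
import torchvision

# Hypothetical example: export a torchvision classifier to ONNX.
model = torchvision.models.resnet18(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)   # NCHW shape the engine will expect

torch.onnx.export(
    model,
    dummy_input,
    "resnet18.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},  # allow dynamic batch
    opset_version=17,
)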

Core Optimizations

Layer Fusion

Combines sequential operations into single GPU kernels:

Before: Conv → Bias → ReLU (3 kernel launches, 3 memory transfers)
After: CBR (1 kernel launch, 1 memory transfer)

Kernel Auto-Tuning

Benchmarks multiple kernel implementations during build time and selects the fastest for your specific GPU and tensor dimensions.

Precision Calibration

Precision   Memory     Speed      Use Case
FP32        Baseline   Baseline   Training, debugging
FP16        ~50%       ~2x        General deployment
INT8        ~25%       ~4x        Edge devices (needs calibration)
FP8         ~25%       ~4x        Ada/Hopper GPUs
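
INT8 engines need a calibration pass over representative inputs so TensorRT can pick per-tensor dynamic ranges. A minimal calibrator sketch, assuming the calibration set arrives as a list of NCHW float32 NumPy arrays and that pycuda is installed; the class name, cache file, and single-input buffer handling are illustrative:

import os
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, batches, cache_file="calibration.cache"):
        super().__init__()
        self.batches = iter(batches)
        self.cache_file = cache_file
        self.batch_size = batches[0].shape[0]
        self.device_input = cuda.mem_alloc(batches[0].nbytes)  # reusable GPU buffer

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = np.ascontiguousarray(next(self.batches), dtype=np.float32)
        except StopIteration:
            return None                       # no more data: calibration is done
        cuda.memcpy_htod(self.device_input, batch)
        return [int(self.device_input)]       # one device pointer per input binding

    def read_calibration_cache(self):
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

# During the build, attach it to the builder config:
#   config.set_flag(trt.BuilderFlag.INT8)
#   config.int8_calibrator = EntropyCalibrator(calibration_batches)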

Building Engines

The fastest way to convert an ONNX model into a serialized engine is the trtexec command-line tool:

# Basic FP16 conversion
trtexec --onnx=yolov8n.onnx --saveEngine=yolov8n.trt --fp16

# With dynamic batch size
trtexec --onnx=model.onnx \
    --fp16 \
    --minShapes=input:1x3x224x224 \
    --optShapes=input:8x3x224x224 \
    --maxShapes=input:16x3x224x224 \
    --saveEngine=model_dynamic.trt

# INT8 with calibration cache
trtexec --onnx=model.onnx --int8 --calib=calibration.cache \
    --saveEngine=model_int8.trt
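
When the build needs to be scripted rather than run through trtexec, the TensorRT Python API covers the same flow. A minimal sketch, assuming an explicit-batch ONNX model; the file names, FP16 flag, and 1 GiB workspace limit are illustrative choices:

import tensorrt as trt

# Parse the ONNX model into a TensorRT network definition.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parsing failed")

# Configure the build: FP16 kernels, 1 GiB workspace for tactic selection.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)

# Build and serialize the engine so it can be reloaded without rebuilding.
serialized_engine = builder.build_serialized_network(network, config)
with open("model.trt", "wb") as f:
    f.write(serialized_engine)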

ROS 2 Integration

Isaac ROS provides TensorRT nodes for perception pipelines:

┌───────────────────────────────────────────────────────────────┐
│                   ROS 2 + TensorRT Pipeline                    │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌──────────┐    ┌──────────────┐    ┌─────────────────────┐  │
│  │  Camera  │───►│    Image     │───►│    TensorRT Node    │  │
│  │  Driver  │    │    Resize    │    │ (isaac_ros_tensor)  │  │
│  └──────────┘    └──────────────┘    └──────────┬──────────┘  │
│                                                 │             │
│                                          ┌──────▼──────┐      │
│                                          │ Detection/  │      │
│                                          │ Segmentation│      │
│                                          │   Results   │      │
│                                          └─────────────┘      │
│                                                               │
└───────────────────────────────────────────────────────────────┘

Isaac ROS Packages

Package                    Purpose
isaac_ros_tensor_rt        TensorRT inference node
isaac_ros_triton           Triton inference server integration
isaac_ros_dnn_inference    DNN preprocessing utilities

Launch Configuration

tensor_rt_node:
  ros__parameters:
    model_file_path: "/models/detection.onnx"
    engine_file_path: "/models/detection.trt"
    input_tensor_names: ["input"]
    output_tensor_names: ["output"]
    input_binding_names: ["input"]
    output_binding_names: ["output"]
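
A Python launch file that loads an equivalent parameter set into a composable-node container might look like the sketch below; the plugin name and container setup follow common Isaac ROS conventions and should be checked against the isaac_ros_tensor_rt documentation for your release:

from launch import LaunchDescription
from launch_ros.actions import ComposableNodeContainer
from launch_ros.descriptions import ComposableNode

def generate_launch_description():
    tensor_rt_node = ComposableNode(
        package='isaac_ros_tensor_rt',
        # Plugin name assumed from Isaac ROS conventions; verify for your version.
        plugin='nvidia::isaac_ros::dnn_inference::TensorRTNode',
        name='tensor_rt_node',
        parameters=[{
            'model_file_path': '/models/detection.onnx',
            'engine_file_path': '/models/detection.trt',
            'input_tensor_names': ['input'],
            'output_tensor_names': ['output'],
            'input_binding_names': ['input'],
            'output_binding_names': ['output'],
        }],
    )
    container = ComposableNodeContainer(
        name='tensorrt_container',
        namespace='',
        package='rclcpp_components',
        executable='component_container_mt',
        composable_node_descriptions=[tensor_rt_node],
        output='screen',
    )
    return LaunchDescription([container])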

Jetson Deployment

TensorRT is the primary inference runtime for Jetson platforms:

Platform       Performance              Notes
Orin Nano      YOLOv8n @ 47 FPS         Entry-level edge AI
Orin NX        2x Orin Nano             Mid-range applications
AGX Orin       4x Orin Nano             High-performance edge
Jetson Thor    With TensorRT Edge-LLM   LLM/VLM on edge

Hardware Requirements

  • GPU: NVIDIA with Compute Capability 7.5+ (Turing and newer)
  • FP8 support: Ada Lovelace or Hopper architecture
  • Software: CUDA Toolkit 12.x, Python 3.8+
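
A quick way to confirm a machine meets these requirements, assuming the tensorrt and torch Python packages are installed:

import tensorrt as trt
import torch

print("TensorRT version:", trt.__version__)
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")                 # needs 7.5 or newer
print("FP8-capable (Ada/Hopper):", (major, minor) >= (8, 9))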
