TensorRT
TensorRT is NVIDIA’s high-performance deep learning inference SDK that optimizes trained neural networks for deployment on NVIDIA GPUs. It compiles models into optimized engines that deliver low latency and high throughput through layer fusion, kernel auto-tuning, and precision calibration.
Prerequisites
- Neural Networks: Deep learning model fundamentals
- Isaac ROS: NVIDIA GPU-accelerated ROS ecosystem
Why TensorRT Matters
- Inference speed: Up to 40x faster than CPU inference, 18x faster than TensorFlow
- Memory efficiency: FP16 uses ~50% memory, INT8 uses ~25% memory
- Edge deployment: Critical for real-time inference on Jetson platforms
- Isaac ROS integration: Powers perception pipelines (detection, segmentation, depth estimation)
- Production ready: Serialized engines load instantly without runtime compilation
Optimization Pipeline
TensorRT Optimization Pipeline:

Trained Model (PyTorch) ──► ONNX Export ──► TensorRT Builder ──► Optimized Engine (.trt)

During the builder stage, three optimizations run together: layer fusion, kernel auto-tuning, and precision calibration.

Core Optimizations
Layer Fusion
Combines sequential operations into single GPU kernels:
Before: Conv → Bias → ReLU (3 kernel launches, 3 memory transfers)
After: CBR (1 kernel launch, 1 memory transfer)

Kernel Auto-Tuning
Benchmarks multiple kernel implementations during build time and selects the fastest for your specific GPU and tensor dimensions.
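Because this benchmarking happens at build time, rebuilding the same model can be slow. A minimal sketch of reusing a timing cache between builds, assuming TensorRT's timing-cache API on the builder config; the cache file name is hypothetical:

import os
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Load a previously saved timing cache if one exists, otherwise start empty
cache_path = "timing.cache"
existing = open(cache_path, "rb").read() if os.path.exists(cache_path) else b""
cache = config.create_timing_cache(existing)
config.set_timing_cache(cache, ignore_mismatch=False)

# ... build the engine as usual, then persist the updated cache ...
with open(cache_path, "wb") as f:
    f.write(cache.serialize())

Subsequent builds on the same GPU can then skip most of the kernel benchmarking.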
Precision Calibration
| Precision | Memory | Speed | Use Case |
|---|---|---|---|
| FP32 | Baseline | Baseline | Training, debugging |
| FP16 | ~50% | ~2x | General deployment |
| INT8 | ~25% | ~4x | Edge devices (needs calibration) |
| FP8 | ~25% | ~4x | Ada/Hopper GPUs |
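INT8 calibration runs representative inputs through the network to compute per-tensor scale factors. Below is a minimal sketch of an entropy calibrator, assuming preprocessed calibration samples already loaded as a NumPy array and PyCUDA for the device buffer; names such as EntropyCalibrator and calibration.cache are illustrative:

import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, calib_data, batch_size, cache_file="calibration.cache"):
        super().__init__()
        self.data = calib_data.astype(np.float32)
        self.batch_size = batch_size
        self.cache_file = cache_file
        self.index = 0
        # One device buffer large enough for a single calibration batch
        self.device_input = cuda.mem_alloc(self.data[0].nbytes * batch_size)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        if self.index + self.batch_size > len(self.data):
            return None  # no more data: calibration is finished
        batch = np.ascontiguousarray(self.data[self.index:self.index + self.batch_size])
        cuda.memcpy_htod(self.device_input, batch)
        self.index += self.batch_size
        return [int(self.device_input)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

The calibrator is attached to the builder config via config.int8_calibrator, together with the trt.BuilderFlag.INT8 flag.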
Building Engines
The fastest way to convert ONNX models is the trtexec command-line tool:
# Basic FP16 conversion
trtexec --onnx=yolov8n.onnx --saveEngine=yolov8n.trt --fp16
# With dynamic batch size
trtexec --onnx=model.onnx \
    --fp16 \
    --minShapes=input:1x3x224x224 \
    --optShapes=input:8x3x224x224 \
    --maxShapes=input:16x3x224x224 \
    --saveEngine=model_dynamic.trt
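The same min/opt/max shape ranges can also be declared programmatically with an optimization profile; a brief sketch, assuming the "input" tensor name used above:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# Shape ranges mirroring --minShapes/--optShapes/--maxShapes above
profile = builder.create_optimization_profile()
profile.set_shape("input", (1, 3, 224, 224), (8, 3, 224, 224), (16, 3, 224, 224))
config.add_optimization_profile(profile)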
# INT8 with calibration cache
trtexec --onnx=model.onnx --int8 --calib=calibration.cache \
    --saveEngine=model_int8.trt

For programmatic engine building:
import tensorrt as trt
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

engine = builder.build_serialized_network(network, config)
with open("model.trt", "wb") as f:
    f.write(engine)
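Once serialized, the engine can be deserialized at startup without rebuilding, which is what makes engine files load quickly in production. A minimal sketch, assuming the model.trt file written above; only deserialization is shown because the inference call differs across TensorRT versions:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

# Deserialize the pre-built engine; no optimization or compilation happens here
with open("model.trt", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# Create an execution context, then bind device buffers and run inference
# (execute_v2 in TensorRT 8.x, execute_async_v3 in TensorRT 10.x)
context = engine.create_execution_context()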
Direct PyTorch integration:

import torch
import torch_tensorrt

model = YourModel().eval().cuda()
optimized = torch.compile(model, backend="tensorrt")

# First call compiles, subsequent calls are fast
output = optimized(input_tensor)

ROS 2 Integration
Isaac ROS provides TensorRT nodes for perception pipelines:
ROS 2 + TensorRT Pipeline:

Camera Driver ──► Image Resize ──► TensorRT Node (isaac_ros_tensor) ──► Detection/Segmentation Results

Isaac ROS Packages
| Package | Purpose |
|---|---|
| isaac_ros_tensor_rt | TensorRT inference node |
| isaac_ros_triton | Triton inference server integration |
| isaac_ros_dnn_inference | DNN preprocessing utilities |
Launch Configuration
tensor_rt_node:
  ros__parameters:
    model_file_path: "/models/detection.onnx"
    engine_file_path: "/models/detection.trt"
    input_tensor_names: ["input"]
    output_tensor_names: ["output"]
    input_binding_names: ["input"]
    output_binding_names: ["output"]
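As a rough sketch of how these parameters reach the node, the Isaac ROS TensorRT component can be composed from a Python launch file. The plugin name and file paths below are assumptions to verify against your Isaac ROS version:

from launch import LaunchDescription
from launch_ros.actions import ComposableNodeContainer
from launch_ros.descriptions import ComposableNode

def generate_launch_description():
    # Parameters mirror the YAML above; plugin name as published by the
    # Isaac ROS DNN Inference package (verify against your installed version)
    tensor_rt_node = ComposableNode(
        package="isaac_ros_tensor_rt",
        plugin="nvidia::isaac_ros::dnn_inference::TensorRTNode",
        name="tensor_rt_node",
        parameters=[{
            "model_file_path": "/models/detection.onnx",
            "engine_file_path": "/models/detection.trt",
            "input_tensor_names": ["input"],
            "output_tensor_names": ["output"],
            "input_binding_names": ["input"],
            "output_binding_names": ["output"],
        }],
    )
    container = ComposableNodeContainer(
        name="tensor_rt_container",
        namespace="",
        package="rclcpp_components",
        executable="component_container_mt",
        composable_node_descriptions=[tensor_rt_node],
    )
    return LaunchDescription([container])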
Jetson Deployment

TensorRT is the primary inference runtime for Jetson platforms:
| Platform | Performance | Notes |
|---|---|---|
| Orin Nano | YOLOv8n @ 47 FPS | Entry-level edge AI |
| Orin NX | 2x Orin Nano | Mid-range applications |
| AGX Orin | 4x Orin Nano | High-performance edge |
| Jetson Thor | With TensorRT Edge-LLM | LLM/VLM on edge |
Hardware Requirements
- GPU: NVIDIA with Compute Capability 7.5+ (Turing and newer)
- FP8 support: Ada Lovelace or Hopper architecture
- Software: CUDA Toolkit 12.x, Python 3.8+
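A quick way to sanity-check an environment against these requirements, assuming PyTorch is installed alongside TensorRT:

import tensorrt as trt
import torch

# Report the installed TensorRT version and the GPU's compute capability
major, minor = torch.cuda.get_device_capability()
print(f"TensorRT {trt.__version__}, CUDA compute capability {major}.{minor}")

# Turing (7.5) or newer is required; FP8 additionally needs Ada (8.9) or Hopper (9.0)
assert (major, minor) >= (7, 5), "GPU older than Turing is not supported"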
Related Terms
- Isaac ROS: NVIDIA GPU-accelerated ROS ecosystem
- cuMotion: GPU-accelerated motion planning
- nvblox: Real-time 3D reconstruction
- Jetson Orin: NVIDIA edge AI compute platform
Learn More
- TensorRT Quick Start Guide — Official getting started tutorial
- Torch-TensorRT Documentation — PyTorch integration guide
- Isaac ROS DNN Inference — ROS 2 integration
Sources
- TensorRT Documentation — Official SDK reference
- TensorRT GitHub Repository — Source code and samples
- TensorRT Release Notes 10.14.1 — Latest version features
- Isaac ROS DNN Inference — ROS 2 package source
- TensorRT Best Practices — Performance optimization guide