3 Numbers That Tell You Exactly How to Optimize Your AI Model
Before you spend a week trying FP4 quantization, compute these first.
We spent 30 hours trying optimization techniques on a real-time speech AI model. Eight of them failed. Every single failure was predictable from three numbers we could have computed in five minutes.
This post shares the framework we built after that experience. It works for any model on any hardware. You need a calculator and your model’s spec sheet.
The Problem: Optimization by Vibes
Here’s how most teams optimize AI inference:
- Model is too slow
- Try quantization (because everyone says to)
- Try TensorRT (because NVIDIA says to)
- Try kernel fusion, operator rewriting, graph optimization
- Some things help, most don’t, nobody knows why
The issue isn’t effort — it’s that nobody asks why the model is slow before trying to make it fast. Is it bottlenecked on memory bandwidth? Compute? Python overhead? The answer determines everything.
Quantization helps a bandwidth-bottlenecked model. It does nothing for an overhead-bottlenecked one, or worse, hurts it. TensorRT helps a compute-bottlenecked model. It’s irrelevant at batch size 1 when memory bandwidth is the constraint.
You can figure out which one you’re dealing with before touching a line of code.
The Three Numbers
Every inference step on a GPU has three physical constraints. Think of them as speed limits — your model literally cannot run faster than the slowest one.
R₁: The Weight-Read Floor
Every inference step streams the entire model’s weights from memory to the GPU compute units. At batch size 1, each weight is read exactly once (the model is too large to cache between layers). The time to read all those weights at maximum bandwidth is:
R₁ = (N × b) / BW

where N is the parameter count, b is bytes per weight, and BW is memory bandwidth.
Example: A 7B parameter model in bf16 (2 bytes per weight) on a Jetson Thor (200 GB/s bandwidth):

R₁ = 7 × 10⁹ × 2 bytes / 200 GB/s = 70 ms
That’s it. No matter how clever your kernels are, you cannot read 14 GB of weights in less than 70 ms at 200 GB/s. This is physics.
Key insight: R₁ scales linearly with bytes per weight. FP8 (1 byte) cuts R₁ in half. FP4 (0.5 bytes) cuts it in half again. This is why quantization is the first thing to try on a bandwidth-bound model.
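The floor is simple enough to compute in one line. A minimal Python sketch (the function name and example numbers are illustrative, not from any library):

```python
def weight_read_floor_ms(params, bytes_per_weight, bandwidth_gb_s):
    """R1: minimum time to stream every weight from memory once, in ms.
    bytes / (GB/s) yields nanoseconds, so divide by 1e6 to get ms."""
    return params * bytes_per_weight / bandwidth_gb_s / 1e6

# 7B model in bf16 (2 bytes/weight) on ~200 GB/s of memory bandwidth
print(weight_read_floor_ms(7e9, 2, 200))   # → 70.0 (ms)
print(weight_read_floor_ms(7e9, 1, 200))   # FP8: half the floor → 35.0
```

The linear scaling with bytes per weight is visible directly: halve the bytes, halve the floor.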
R₂: The KV-Cache Floor
For autoregressive models (LLMs, speech transformers), each token generation step also reads the stored key-value attention cache for all previous tokens:

R₂ = (layers × 2 × heads × head_dim × context × bytes_per_kv) / BW
For short contexts, R₂ is small compared to R₁ (weights dominate). For long contexts, R₂ can exceed R₁.
Non-autoregressive models (vision encoders, diffusion models, action policies): R₂ = 0. Skip this.
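A companion sketch for R₂, using the cache-size formula above. The layer/head dimensions in the example are hypothetical, chosen only to resemble a 7B-class transformer:

```python
def kv_cache_floor_ms(layers, heads, head_dim, context, bytes_per_kv, bandwidth_gb_s):
    """R2: time to stream the whole KV cache once per generated token.
    The factor of 2 is one K and one V tensor per layer."""
    cache_bytes = layers * 2 * heads * head_dim * context * bytes_per_kv
    return cache_bytes / bandwidth_gb_s / 1e6   # bytes / (GB/s) = ns → ms

# Hypothetical 7B-class dims: 32 layers, 32 heads, head_dim 128, bf16 cache,
# 3000-token context, 200 GB/s bandwidth
print(kv_cache_floor_ms(32, 32, 128, 3000, 2, 200))   # ≈ 7.9 ms
```

At a few thousand tokens the cache is ~1.5 GB, an order of magnitude below the 14 GB of bf16 weights, which is why R₂ is a second-order term at short contexts.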
R₃: The Compute Floor

The minimum time to perform all the arithmetic:

R₃ = FLOPs / peak_compute

For batch-1 transformer inference, total FLOPs ≈ 2 × parameter count (one multiply-add per parameter).
On current hardware at batch size 1, R₃ is almost always much smaller than R₁. GPUs have way more compute than bandwidth. This is why most single-sample inference is bandwidth-bound, not compute-bound.
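The compute floor in the same style. One caveat: datasheet TOPS are peak (often sparse) numbers, so this is an extremely optimistic lower bound; sustained throughput is a fraction of peak, and a realistic R₃ is correspondingly higher. The function name is illustrative:

```python
def compute_floor_ms(params, peak_tops):
    """R3: lower bound on arithmetic time for one token at batch 1,
    assuming ~2 FLOPs (one multiply-add) per parameter.
    peak_tops is the datasheet peak in TOPS (1 TOPS = 1e12 ops/s)."""
    flops = 2 * params
    return flops / (peak_tops * 1e12) * 1e3   # seconds → ms

print(compute_floor_ms(7e9, 2000))   # paper floor: ~0.007 ms
```

At 7B parameters and 2000 peak TOPS this gives microseconds, four orders of magnitude below the 70 ms weight-read floor. Even after discounting peak numbers heavily, the conclusion survives: batch-1 inference is bandwidth-bound, not compute-bound.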
The Diagnostic: One Ratio Tells You Everything
Measure your actual inference latency (τ), then compute the overhead ratio:

overhead ratio = τ / (R₁ + R₂)
This single number classifies your workload:
| Ratio | Regime | What It Means | What To Do |
|---|---|---|---|
| < 1.5× | Bandwidth-bound | Memory bandwidth is the bottleneck | Quantize weights, reduce KV cache |
| > 2.0× | Overhead-bound | System overhead dominates | CUDA graphs, torch.compile, C++ runtime |
| τ/R₃ < 1.5× | Compute-bound | Arithmetic is the bottleneck | TensorRT, batching, pruning |
That’s the whole framework. Three numbers, one ratio, one lookup table.
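The lookup table translates directly into a few lines of code. A minimal sketch (function name mine; thresholds taken from the table above):

```python
def classify_regime(tau_ms, r1_ms, r2_ms, r3_ms):
    """Return (overhead ratio, regime label) for one measured inference step."""
    ratio = tau_ms / (r1_ms + r2_ms)
    if tau_ms / r3_ms < 1.5:
        return ratio, "compute-bound"      # TensorRT, batching, pruning
    if ratio < 1.5:
        return ratio, "bandwidth-bound"    # quantize weights, shrink KV cache
    if ratio > 2.0:
        return ratio, "overhead-bound"     # CUDA graphs, torch.compile
    return ratio, "transitioning"

# Example: τ = 114 ms against floors R1 = 83.7, R2 = 7.5, R3 = 8.4 ms
print(classify_regime(114, 83.7, 7.5, 8.4))   # bandwidth-bound at ~1.25×
```

The 1.5–2.0 band has no row in the table; the "transitioning" label covers it, since a model in that band is usually mid-shift between regimes.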
Real Example: Getting a 7B Speech Model Under 80 ms
Let’s walk through a real optimization. PersonaPlex is an 8.37B parameter speech-to-speech transformer (based on Moshi) that must produce one audio frame every 80 ms for real-time conversation. We deployed it on NVIDIA Jetson Thor.
Step 1: Compute the floors
Hardware: Jetson Thor, 200 GB/s bandwidth, ~2000 TOPS (FP8)
```
R1 (bf16)         = 8.37B × 2 / 200 GB/s = 83.7 ms   ← weight reads
R2 (3000 context) = 7.5 ms                           ← KV cache reads
R3                = ~8.4 ms                          ← compute
R1 + R2           = 91.2 ms
```

Step 2: Measure and classify

```
Measured τ = 114 ms (lm_step)
Overhead ratio = 114 / 91.2 = 1.25× → BANDWIDTH-BOUND
```

At 1.25× the bandwidth floor, this model is running efficiently — it spends most of its time just reading weights from memory. The optimization path is clear: reduce the bytes we need to read.
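The measurement step is where most people get wrong numbers: without a device sync you time kernel launches, not kernel execution. A framework-agnostic timing sketch (names are mine; on CUDA, pass torch.cuda.synchronize as sync):

```python
import time

def measure_tau_ms(step_fn, sync=None, warmup=10, iters=50):
    """Mean per-step latency in ms. `sync` should block until all queued
    device work has finished (e.g. torch.cuda.synchronize on CUDA)."""
    for _ in range(warmup):      # warm up caches, JIT compilation, autotuners
        step_fn()
    if sync:
        sync()                   # drain queued work before starting the clock
    t0 = time.perf_counter()
    for _ in range(iters):
        step_fn()
    if sync:
        sync()                   # wait for the last kernel to actually finish
    return (time.perf_counter() - t0) / iters * 1e3
```

On GPU, calling this without the sync argument can under-report latency by an order of magnitude, because kernel launches return before the work completes.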
Step 3: Predict the impact
FP8 quantization converts most weights from 2 bytes to 1 byte, roughly halving R₁:
```
Predicted: R1 drops from 83.7 to ~33 ms → τ should drop ~40 ms
Measured:  τ dropped from 114 to 74 ms  → exactly 40 ms reduction
```

Prediction accuracy: within 5%. Not a coincidence — this is what happens when you understand the bottleneck before optimizing.
Step 4: Check the ratio again
Here’s where it gets interesting:
```
After FP8:
R1 + R2 = 33 + 7.5 = 40.5 ms
Measured τ = 74 ms
Overhead ratio = 74 / 40.5 = 1.83× → TRANSITIONING
```

The ratio jumped from 1.25× to 1.83×. We reduced the bandwidth floor, but the non-bandwidth overhead (kernel dispatch, CPU-GPU sync, Python framework) stayed constant. The model is shifting from bandwidth-bound toward overhead-bound.
This is the most important insight in the entire framework.
The Trap: When Quantization Stops Helping
After FP8, the intuitive next step is “keep quantizing — try FP4!” Here’s what the framework predicts:
```
FP4 prediction:
R1 + R2 = ~16.5 + 7.5 = ~24 ms
But τ ≈ 74 ms (overhead won't shrink)
Predicted overhead ratio = 74 / 24 = 3.1× → OVERHEAD-BOUND
```

Further quantization shrinks R₁ (the bandwidth floor), but τ doesn’t follow because the bottleneck has shifted to overhead — kernel dispatch, CUDA graph replay, CPU-GPU synchronization. You’re optimizing the wrong thing.
We confirmed this the hard way. NVFP4 quantization on this model was 5.8–10.6× slower, not faster. The FP4 dequantization overhead vastly exceeded the bandwidth savings.
We also confirmed it cross-platform. On DGX Spark (273 GB/s — 36% more bandwidth than Jetson Thor), the lm_step at FP8 was… 70.0 ms. Virtually identical to Thor’s 70.2 ms. If the model were still bandwidth-bound, more bandwidth should have helped. It didn’t. The bottleneck is overhead.
The full scorecard
Here are all the optimization techniques we tried, and what the framework would have told us in advance:
| Technique | Framework Prediction | Actual Result | Hours Spent |
|---|---|---|---|
| FP8 quantization | R₁ halved → large improvement | 40 ms faster | Did this first |
| NVFP4 quantization | Overhead-bound; dequant adds overhead | 5.8–10.6× slower | ~8 hours wasted |
| TensorRT FP16 | FP16 doubles R₁ vs FP8 at batch=1 | 16% slower | ~4 hours wasted |
| Static activation scaling | CUDA graphs already hide launch overhead | 0 ms improvement | ~3 hours wasted |
| FP8 KV cache | R₂ is only ~10% of R₁; SDPA rejects FP8 | Not implementable | ~3 hours wasted |
| FP8 depformer | Small matrices: overhead > BW savings | 0 ms improvement | ~4 hours wasted |
| bitsandbytes NF4 | F.linear bypasses module hooks | Doesn’t activate | ~2 hours wasted |
| torchao INT4 | Packed tensor incompatible | Crash | ~2 hours wasted |
Eight failed techniques. ~30 hours wasted. All predictable from three numbers.
The framework told us to stop quantizing after FP8 and focus on overhead reduction (CUDA graphs, torch.compile, eliminating unnecessary compute). The optimizations that actually worked after FP8:
- torch.compile on mimi codec: 2.5 ms (kernel fusion — overhead reduction)
- Skip redundant mimi decoder: 9.6 ms (dead compute elimination)
- MAXN power mode + locked clocks: 9 ms (hardware un-throttling)
- Pinned memory for DtoH transfers: 0.7 ms (transfer optimization)
Final result: 141 ms → 78.3 ms (44% reduction). 1.7 ms under the 80 ms budget.
It Gets More Interesting with Pipelines
Real systems don’t run one model — they chain several. We deploy a voice+vision assistant on Jetson Thor with three sequential models:
| Stage | Model | Floor | Measured | Ratio | Regime |
|---|---|---|---|---|---|
| STT | Nemotron 0.6B (bf16) | 6.0 ms | 76 ms | 12.7× | Overhead-bound |
| VLM | Qwen2.5-VL 7B (NVFP4 via vLLM) | ~21 ms/tok | ~34.2 ms/tok | 1.62× | Transitioning |
| TTS | Kokoro 82M (CPU) | — | 59 ms | — | CPU-bound |
| Total | | | ~899 ms | | Mixed (28% over budget) |
Three models. Three completely different regimes. In the same pipeline.
The STT model is so small (0.6B params → R₁ = 6 ms) that NeMo API overhead and Python dispatch dominate — the GPU forward pass takes ~6 ms but the full transcribe() call takes 76 ms. Quantizing it would save fractions of a millisecond. The fix is bypassing the batch API (our fast path saves 20 ms) or TensorRT/CUDA graphs.
The VLM is at 1.62× its theoretical bandwidth floor — in the transition zone between bandwidth-bound and overhead-bound. Roughly 13 ms/token of the 34.2 ms/token is overhead from vLLM’s HTTP API, JSON serialization, and scheduler. Per-token quantization won’t help much. The practical optimization is reducing how many tokens you generate (shorter responses, tighter prompts) and reducing API overhead.
The TTS runs on CPU and takes only 59 ms for a typical 44-character response (~0.5 ms/char). GPU optimizations are irrelevant.
No single optimization technique addresses all three stages. End-to-end profiling would show you the total is 899 ms. The framework tells you why and where to focus.
How to Use This for Your Model
Quick-start recipe
- Look up your hardware specs: memory bandwidth (GB/s) and peak compute (TOPS/TFLOPS)
- Count your model’s parameters and note the precision (bf16 = 2 bytes, FP8 = 1 byte, FP4 = 0.5 bytes)
- Compute R₁ = params × bytes_per_weight / bandwidth
- Compute R₂ (if autoregressive) = layers × 2 × heads × context × head_dim × bytes_per_kv / bandwidth
- Measure τ with proper CUDA synchronization (torch.cuda.synchronize() before and after)
- Compute the ratio = τ / (R₁ + R₂)
- Follow the decision tree:
```
           Measure τ, Compute R1, R2, R3
                         │
         ┌───────────────┼───────────────┐
         │               │               │
    ratio < 1.5     ratio > 2.0     τ/R3 < 1.5
         │               │               │
  BANDWIDTH-BOUND  OVERHEAD-BOUND  COMPUTE-BOUND
         │               │               │
    ┌────┴────┐     ┌────┴────┐     ┌────┴────┐
 R1 >> R2  R2 >> R1  CUDA    torch  TensorRT  Batching
    │         │      graphs  compile
 Quantize  Shrink KV
 weights   context

 After quantizing: re-measure ratio
 If ratio > 2.0 → STOP QUANTIZING
 Switch to overhead reduction
```

Common hardware specs
| Platform | Memory Bandwidth | Compute (FP8) |
|---|---|---|
| Jetson Orin NX 16GB | 102 GB/s | ~100 TOPS |
| Jetson Orin AGX 64GB | 205 GB/s | ~275 TOPS |
| Jetson Thor 128GB | ~200 GB/s | ~2000 TOPS |
| DGX Spark 128GB | ~273 GB/s | — |
| RTX 4090 | 1008 GB/s | 660 TOPS |
| A100 80GB | 2039 GB/s | 624 TOPS |
| H100 SXM | 3350 GB/s | 1979 TOPS |
Rules of thumb
- Batch size 1 is almost always bandwidth-bound. Compute matters at larger batches.
- FP8 is usually the sweet spot. After FP8, most models shift from bandwidth-bound to overhead-bound on current hardware. FP4/INT4 rarely helps at batch=1.
- Small models (< 1B params) are often overhead-bound regardless of precision. R₁ is so low that dispatch overhead dominates. Skip quantization, go straight to TensorRT/CUDA graphs.
- R₂ only matters at long contexts. At 1K tokens, R₂ is typically < 10% of R₁. At 100K tokens, R₂ can dominate.
- Re-check the ratio after every optimization. The regime can shift — what was bandwidth-bound at bf16 may be overhead-bound at FP8.
Try It Yourself
We built an interactive calculator that computes R₁, R₂, and R₃, classifies your regime, and shows the optimization decision tree for any model-hardware combination. It includes presets for common configurations (GPT-2, LLaMA-7B, PersonaPlex, multi-stage pipelines) with our measured validation data.
Open the Latency Floor Calculator →
The framework and all measured data come from deploying real-time AI models on NVIDIA Jetson hardware for DeviceNexus.ai — infrastructure for physical AI.
— Amar Balutkar, DeviceNexus.ai