TensorRT vs DLA on Jetson Orin: When to Use Each
If you’re deploying a computer vision model on Jetson Orin, you’ll hit the TensorRT vs DLA question within the first few days of optimization work. The short answer: use TensorRT GPU inference when latency is your primary constraint and your model uses operators outside the DLA-supported set. Use DLA when you need to run inference continuously at low power, or when you’re running multiple models concurrently and need to keep the GPU free for other workloads. Most production deployments on Orin end up splitting inference across both.
Key Insights
- DLA (Deep Learning Accelerator) is not a slower GPU. It’s a fixed-function hardware block optimized for a specific layer subset. It draws 2-5W per core vs 10-25W for GPU inference — but only for models where DLA-compatible layers dominate.
- TensorRT GPU inference gives the lowest single-stream latency and the broadest operator coverage, including custom plugins. DLA gives lower power at the cost of 2-4x higher latency and layer support restrictions.
- Not every Jetson Orin module includes DLA. AGX Orin (64GB and 32GB) and Orin NX 16GB have 2 DLA cores, Orin NX 8GB has 1, and Orin Nano (8GB and 4GB) has none.
- The TensorRT build process handles DLA fallback transparently. Use trtexec --useDLACore=0 --allowGPUFallback and inspect the build log to see which layers were assigned to DLA vs GPU.
- Transformer-based models (ViT, attention layers) do not run on DLA as of JetPack 6.2. Classic CNN architectures (YOLO, MobileNet, ResNet) are largely DLA-compatible.
What Is DLA and How Does It Fit Into the Orin Architecture?
The Jetson Orin SoC (T234) includes a dedicated Deep Learning Accelerator alongside the Ampere GPU. DLA is not a stripped-down GPU; it’s a purpose-built fixed-function inference engine designed to run common DNN operations at very low power. It executes independently of the GPU’s compute units, which is the key reason it’s useful for concurrent workloads, although both accelerators draw on the same shared DRAM bandwidth.
The DLA core count varies by module:
| Module | DLA Cores | GPU (CUDA Cores) | DLA INT8 TOPS (sparse) | Module INT8 TOPS (sparse, GPU + DLA) |
|---|---|---|---|---|
| AGX Orin 64GB | 2 | 2048 | 105 | 275 |
| AGX Orin 32GB | 2 | 1792 | 92 | 200 |
| Orin NX 16GB | 2 | 1024 | 40 | 100 |
| Orin NX 8GB | 1 | 1024 | 20 | 70 |
| Orin Nano 8GB | 0 | 1024 | n/a | 40 |
| Orin Nano 4GB | 0 | 512 | n/a | 20 |
DLA on Orin supports FP16 and INT8 precision. INT8 is the right target for production deployments: it requires a calibration dataset during the TensorRT build step, but the power and throughput numbers quoted in this article assume INT8.
When to Use TensorRT GPU Inference
GPU inference via TensorRT is the right choice when:
- Single-stream latency is the hard requirement. Examples: a robotics manipulator that needs a pick decision in under 30ms, a medical imaging device with a real-time feedback loop, or an inspection system where a missed detection has immediate downstream consequences. GPU at INT8 on AGX Orin can run YOLOv8m inference in under 5ms. DLA will run the same model in 15-25ms.
- Your model uses operators outside the DLA-supported set. This covers transformer attention, dynamic shapes, custom CUDA plugins, LSTM/GRU layers, and any operation that needs precision beyond FP16 or INT8. These layers fall back to GPU when you specify DLA, which defeats much of the power benefit and adds synchronization and data-reformatting overhead at every DLA-to-GPU hand-off. (Both accelerators sit on the same SoC, so no PCIe transfer is involved.)
- You’re prototyping and need iteration speed. The DLA build path requires a fixed-shape engine in FP16 or INT8, and INT8 adds a calibration step. GPU inference builds faster, accepts FP32/FP16/INT8, and handles dynamic shapes. Build with GPU during development, then evaluate DLA for production once the model is frozen.
- Your thermal budget is unconstrained. A device with active cooling and a 30W or 60W power budget can run GPU inference continuously without thermal throttling. DLA’s power advantage stops mattering if your system can sustain the GPU load without a problem.
When to Use DLA on Jetson Orin
DLA is the right choice when:
- Power budget is the constraint. Think battery-powered drones, autonomous agricultural equipment, and wearable devices where every watt matters. Running a MobileNetV3 classification model on DLA at 30fps draws roughly 2-3W vs 8-12W on GPU. Over an 8-hour deployment, that’s the difference between a battery that lasts the day and one that doesn’t.
- You’re running multiple models concurrently. On AGX Orin, you can run one model on DLA core 0, a second model on DLA core 1, and a third on the GPU simultaneously, all without the GPU being the bottleneck. A multi-camera pipeline processing four streams with classification on each can run the detection model on GPU and the classification model on DLA, keeping both accelerators occupied.
- Your model is a classic CNN architecture at a fixed input size. MobileNet, EfficientDet, YOLO (v5 through v8 with standard heads), and ResNet variants are well within DLA’s supported operator set. If you’re running a model the team has been deploying for more than a year and it hasn’t changed architecture, it’s a good DLA candidate.
- You need continuous inference with strict thermal limits. Edge devices that run 24/7 without active cooling need to stay within a tight thermal envelope. DLA inference generates significantly less heat than GPU inference, which can be the difference between reliable operation and thermal throttling in a sealed enclosure.
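To make the battery argument concrete, here is a back-of-the-envelope sketch of the energy arithmetic. It assumes the midpoints of the draw ranges quoted above (2.5 W for DLA, 10 W for GPU) and a constant draw over an 8-hour shift; the function name and constants are illustrative, and real figures should come from measurements on the device.

```python
def inference_energy_wh(avg_power_w: float, hours: float) -> float:
    """Energy in watt-hours consumed at a constant average power draw."""
    return avg_power_w * hours


# Midpoints of the ranges quoted in the text (assumed, not measured).
DLA_POWER_W = 2.5   # midpoint of 2-3 W
GPU_POWER_W = 10.0  # midpoint of 8-12 W
SHIFT_HOURS = 8.0

dla_wh = inference_energy_wh(DLA_POWER_W, SHIFT_HOURS)
gpu_wh = inference_energy_wh(GPU_POWER_W, SHIFT_HOURS)
saved_wh = gpu_wh - dla_wh

print(f"DLA: {dla_wh:.0f} Wh, GPU: {gpu_wh:.0f} Wh, saved: {saved_wh:.0f} Wh")
# prints "DLA: 20 Wh, GPU: 80 Wh, saved: 60 Wh"
```

Under these assumptions the inference workload alone accounts for a 60 Wh difference per shift, which is a meaningful share of a small battery pack.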
How to Profile DLA Compatibility for Your Model
The fastest way to evaluate DLA compatibility is with trtexec on the target hardware. Build TensorRT engines on the device you will deploy to, not on your development machine: an engine is tied to the TensorRT version and hardware it was built on, so one built on an AGX Orin should not be assumed to load or perform correctly on an Orin NX.
On the Jetson device:
# Evaluate DLA compatibility with GPU fallback for unsupported layers.
# If your trtexec rejects --buildOnly, try --skipInference (its name in newer TensorRT releases).
trtexec \
  --onnx=model.onnx \
  --useDLACore=0 \
  --allowGPUFallback \
  --int8 \
  --calib=calibration.cache \
  --saveEngine=model_dla.engine \
  --buildOnly \
  --verbose 2>&1 | grep -E "DLA|GPU|Layer"
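The verbose build log reports where each layer was placed. The exact wording depends on the TensorRT version (recent releases group layers under "Layers Running on DLA" and "Layers Running on GPU" headings), so treat the following as an illustrative excerpt: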
[08/15/2025-14:22:31] [I] [TRT] DLA Compatibility Check: Convolution_0 -> DLA
[08/15/2025-14:22:31] [I] [TRT] DLA Compatibility Check: Pooling_1 -> DLA
[08/15/2025-14:22:31] [I] [TRT] DLA Compatibility Check: Attention_2 -> GPU (unsupported)
The decision rule is practical: if more than 70-80% of the compute graph (by FLOPs) is assigned to DLA, the power and throughput benefits are real. If the GPU fallback layer count is high — especially if the fallback layers sit in the middle of the graph, forcing frequent DLA-to-GPU transfers — the efficiency benefit largely evaporates and you’d be better off running fully on GPU.
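The rule above can be sketched as a small check. This is a minimal illustration, not a TensorRT API: the `LayerPlacement` record, the per-layer FLOP estimates, and the `max_transitions` threshold are all assumptions you would fill in from your own build log and model profile. It also treats the graph as a simple linear sequence of layers, which real models only approximate.

```python
from dataclasses import dataclass


@dataclass
class LayerPlacement:
    name: str
    device: str   # "DLA" or "GPU", as reported by the build log
    flops: float  # estimated FLOPs for this layer (from your own profiling)


def dla_flop_share(layers):
    """Fraction of total FLOPs assigned to DLA (0.0 for an empty list)."""
    total = sum(layer.flops for layer in layers)
    if total == 0:
        return 0.0
    return sum(layer.flops for layer in layers if layer.device == "DLA") / total


def device_transitions(layers):
    """Number of DLA<->GPU hand-offs when layers run in list order."""
    return sum(1 for a, b in zip(layers, layers[1:]) if a.device != b.device)


def worth_running_on_dla(layers, min_share=0.7, max_transitions=2):
    """Apply the 70% FLOP-share rule plus an assumed cap on hand-offs."""
    return (dla_flop_share(layers) >= min_share
            and device_transitions(layers) <= max_transitions)


# Hypothetical placement: one GPU fallback layer in the middle of the graph.
layers = [
    LayerPlacement("conv_0", "DLA", 400.0),
    LayerPlacement("pool_1", "DLA", 10.0),
    LayerPlacement("attention_2", "GPU", 100.0),
    LayerPlacement("conv_3", "DLA", 490.0),
]
share = dla_flop_share(layers)
transitions = device_transitions(layers)
print(f"DLA share: {share:.0%}, transitions: {transitions}, "
      f"use DLA: {worth_running_on_dla(layers)}")
```

Counting hand-offs separately from FLOP share matters because a model can clear the 70% bar and still lose most of the benefit if fallback layers are scattered through the graph.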
Splitting Inference Across DLA and GPU
The most powerful Orin deployment pattern isn’t DLA-only or GPU-only — it’s splitting a multi-model pipeline across both accelerators.
A practical example from a multi-camera agricultural inspection deployment:
Camera 0-3 (GMSL2) → GStreamer pipeline
└─► Detection model (YOLO) → TensorRT GPU (latency-critical, 5ms/frame)
└─► Classification model (MNet) → TensorRT DLA (throughput, 20ms/frame)
└─► Anomaly scoring (ResNet) → TensorRT DLA (power-sensitive, 25ms/frame)
This configuration keeps the GPU dedicated to the latency-critical detection step while DLA handles the classification and anomaly models concurrently, freeing GPU cycles for GStreamer pipeline management and any post-processing work.
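A quick capacity check helps confirm that a split like this can keep up with the cameras. The sketch below uses the per-frame latencies from the diagram and assumes four streams, one frame processed at a time per accelerator (no batching), every frame passing through every model, and each DLA model pinned to its own core. The core assignment and the helper name are assumptions for illustration.

```python
def max_fps_per_stream(latency_ms: float, streams: int) -> float:
    """Highest per-stream frame rate one accelerator can sustain when it
    processes frames sequentially with no batching or overlap."""
    return 1000.0 / (latency_ms * streams)


STREAMS = 4

# Latencies from the diagram; the DLA core assignment is an assumption.
pipeline = {
    "detection (GPU)": 5.0,
    "classification (DLA core 0)": 20.0,
    "anomaly scoring (DLA core 1)": 25.0,
}

limits = {stage: max_fps_per_stream(ms, STREAMS) for stage, ms in pipeline.items()}
bottleneck = min(limits, key=limits.get)

for stage, fps in limits.items():
    print(f"{stage}: {fps:.1f} fps per stream")
print(f"bottleneck: {bottleneck}")
```

With these numbers the GPU has plenty of headroom (50 fps per stream) while the anomaly model caps the pipeline at 10 fps per stream, so that stage is the first candidate for subsampling frames or a smaller model.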
To run on DLA core 1 (the second core, available on AGX Orin and Orin NX 16GB):
import tensorrt as trt

# Create the builder and a config that targets DLA
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
config = builder.create_builder_config()
config.default_device_type = trt.DeviceType.DLA  # place layers on DLA by default
config.DLA_core = 1  # 0 or 1 on AGX Orin / Orin NX 16GB
config.set_flag(trt.BuilderFlag.GPU_FALLBACK)  # unsupported layers fall back to GPU
config.set_flag(trt.BuilderFlag.INT8)  # DLA needs INT8 or FP16
DLA on Orin requires a reduced-precision flag because FP32 is not supported on DLA. FP16 works, but INT8 gives materially better throughput and power numbers.
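If you prefer to build with trtexec rather than the Python API, the per-core builds can be scripted. The following is a sketch that only assembles command lines using the trtexec flags already shown in this article plus --fp16; the helper name and the model and engine file names are hypothetical, and you would run the resulting commands on the Jetson itself.

```python
import shlex


def trtexec_dla_command(onnx_path, engine_path, dla_core,
                        precision="int8", calib_cache=None):
    """Assemble a trtexec command line that targets one DLA core."""
    if precision not in ("int8", "fp16"):
        raise ValueError("DLA supports only INT8 and FP16")
    cmd = [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--useDLACore={dla_core}",
        "--allowGPUFallback",
        f"--{precision}",
        f"--saveEngine={engine_path}",
    ]
    # A calibration cache only applies to INT8 builds.
    if precision == "int8" and calib_cache:
        cmd.append(f"--calib={calib_cache}")
    return cmd


# Hypothetical models: one engine per DLA core.
for core, model in enumerate(["classifier.onnx", "anomaly.onnx"]):
    engine = model.replace(".onnx", f"_dla{core}.engine")
    cmd = trtexec_dla_command(model, engine, core, calib_cache="calibration.cache")
    print(shlex.join(cmd))
```

Building one engine per core keeps the core assignment explicit in the file name, which makes it harder to load both models onto the same core by mistake.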
The Decision Framework: Which to Use
Apply this in order:
- Does your model have unsupported layers (attention, LSTM, custom CUDA, dynamic shapes)? If yes, use GPU inference. Check with trtexec --useDLACore=0 --allowGPUFallback --buildOnly and inspect the fallback count.
- Is single-stream latency under 10ms a hard requirement? If yes, use GPU. DLA latency on most vision models is 15-40ms depending on model size.
- Are you running multiple models concurrently? If yes, assign the less latency-critical models to DLA and reserve the GPU for your most time-sensitive workload.
- Is power or thermal budget the binding constraint? If yes, profile both GPU and DLA at production throughput, measure actual system draw, and choose based on measured numbers, not specs.
- Is the model a frozen, fixed-shape CNN (YOLO, MobileNet, ResNet family)? If yes, DLA is almost certainly compatible. Run the compatibility check to confirm.
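The ordered questions above translate directly into a first-match function. This is only a sketch of the checklist: the parameter names and return strings are illustrative, and the final default (GPU when no question applies) is an assumption the checklist itself does not state.

```python
def choose_accelerator(has_unsupported_layers: bool,
                       needs_sub_10ms_latency: bool,
                       runs_multiple_models: bool,
                       power_constrained: bool,
                       frozen_fixed_shape_cnn: bool) -> str:
    """Walk the checklist in order and return the first matching advice."""
    if has_unsupported_layers:
        return "GPU"
    if needs_sub_10ms_latency:
        return "GPU"
    if runs_multiple_models:
        return "split: GPU for the latency-critical model, DLA for the rest"
    if power_constrained:
        return "profile GPU and DLA, choose on measured draw"
    if frozen_fixed_shape_cnn:
        return "DLA (confirm with the compatibility check)"
    return "GPU"  # assumed default; not part of the original checklist


# A frozen MobileNet classifier on a battery-powered device:
print(choose_accelerator(False, False, False, True, True))
```

Because the questions are evaluated in order, an earlier "yes" always wins: a power-constrained device whose model contains attention layers still lands on GPU.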
Frequently Asked Questions
What is the difference between TensorRT GPU inference and DLA inference on Jetson Orin?
TensorRT GPU inference runs on the Orin’s Ampere GPU cores using CUDA. It supports the full range of operators and layer types and gives the lowest single-stream latency. DLA is a fixed-function hardware block that runs a subset of common DNN layers at very low power. DLA frees up the GPU for other workloads, but it only supports a specific layer set and runs at roughly 2-4x higher latency per model than GPU inference.
Which Jetson Orin modules have DLA and how many cores?
Not all of them. AGX Orin (64GB and 32GB) has 2 DLA cores. Orin NX 16GB has 2 DLA cores and Orin NX 8GB has 1. Orin Nano (8GB and 4GB) has no DLA. On AGX Orin with 2 DLA cores, you can run two models concurrently on DLA while the GPU handles a third.
What layer types are not supported on Jetson Orin DLA?
DLA does not support custom CUDA plugins, dynamic shapes (fixed batch size and input dimensions required), recurrent layers (LSTM, GRU), transformer attention layers (as of JetPack 6.2), most element-wise operations beyond simple add or multiply, and operations requiring precision beyond INT8 or FP16. If your model uses any of these, those layers will fall back to GPU when running in DLA mode.
How do I profile which layers in my model are compatible with DLA?
Use trtexec with the --useDLACore=0 and --allowGPUFallback flags: trtexec --onnx=model.onnx --useDLACore=0 --allowGPUFallback --buildOnly. The build output lists each layer and its assigned device. If more than 20-30% of your compute falls back to GPU, the DLA efficiency benefit is largely lost.
What is the power saving from DLA vs GPU inference on Jetson Orin?
DLA inference draws roughly 2-5W per DLA core for a typical vision model like YOLOv8n or MobileNetV3, compared to 10-25W for equivalent GPU inference. A well-quantized INT8 model on DLA running at 30fps might save 8-15W system-wide compared to GPU. This matters significantly for battery-powered devices or multi-camera deployments where thermal headroom is the constraint.
ProventusNova deploys production-grade inference pipelines on Jetson Orin — TensorRT optimization, DLA profiling, GStreamer integration, and full IP transfer. See our EdgeAI Model Deployment service or schedule a scoping call.
Written by
Andrés Campos, Co-Founder & CTO · ProventusNova
8 years deep in embedded systems — from underwater ROVs to edge AI. Andrés leads every technical delivery personally.