GStreamer pipeline performance on Jetson: how to find the bottleneck

Andrés Campos

GStreamer pipeline performance on Jetson depends heavily on how well the pipeline uses NVIDIA hardware-accelerated elements. A pipeline with the wrong element choices can run at the same throughput on Jetson as on a generic ARM board, with none of the hardware acceleration in use. This post covers how to build pipelines that actually use the hardware.

How to reduce end-to-end latency in a Jetson GStreamer pipeline

The dominant source of latency in a Jetson GStreamer pipeline is almost always a memory copy between NVMM and system RAM, not compute. CPU-based elements like videoconvert and appsink cannot read NVMM buffers, so every frame that reaches them has to be copied into system memory first. On a 1080p30 stream that copy costs ~5-15ms per frame and eats into memory bandwidth.

The five most effective latency reductions, in order of impact:

  1. Replace videoconvert with nvvidconv everywhere — VIC hardware handles format conversion in NVMM without CPU involvement
  2. Add sync=false on all sinks (e.g., nv3dsink sync=false, appsink sync=false) — disabling clock synchronization eliminates the presentation timestamp delay
  3. Use queue leaky=2 max-size-buffers=1 before slow elements — drop old frames instead of buffering them, keeping end-to-end delay at one frame
  4. Replace avdec_h264 with nvv4l2decoder for decode — hardware decode adds ~2ms vs ~20ms for software decode at 1080p
  5. Process frames in a GLib idle callback or separate thread from appsink — a slow Python callback in on_new_sample blocks the entire pipeline

A camera-to-display pipeline with all five applied typically achieves under 50ms end-to-end latency on Jetson Orin at 1080p30.
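
As a rough illustration, the camera-to-display case can be assembled as a pipeline description in Python. This is a minimal sketch, not a tuned configuration: the function name and its defaults are made up for this example, and only items 1 to 3 from the list apply here because there is no decoder or appsink in a camera-to-display path. The resulting string is in the form accepted by Gst.parse_launch(); on a shell command line the caps would need quoting.

```python
def low_latency_camera_pipeline(width=1920, height=1080, fps=30, sink="nv3dsink"):
    """Return a gst-launch style description for a camera-to-display pipeline.

    Hypothetical helper for illustration; element names come from the list above.
    """
    caps = (
        f"video/x-raw(memory:NVMM),width={width},height={height},"
        f"framerate={fps}/1"
    )
    elements = [
        "nvarguscamerasrc",                  # frames start out in NVMM
        caps,
        "nvvidconv",                         # item 1: conversion stays on the VIC
        "queue leaky=2 max-size-buffers=1",  # item 3: keep only the newest frame
        f"{sink} sync=false",                # item 2: no clock synchronization
    ]
    return " ! ".join(elements)


if __name__ == "__main__":
    print(low_latency_camera_pipeline())
```

Keeping the description in one function makes it easy to swap the sink or resolution while testing which change actually moves the latency number.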

Key Insights

  • The main bottleneck on Jetson GStreamer pipelines is almost always memory copies between system RAM and NVMM, not compute — keep data in NVMM from camera to output
  • Replace videoconvert with nvvidconv everywhere in a Jetson pipeline; the VIC hardware handles format conversion without touching the CPU
  • Use nvv4l2decoder instead of avdec_h264 for decode — hardware decode on Jetson handles 4K at a fraction of the CPU cost of software decode
  • appsink with a slow Python callback is the most common hidden bottleneck — the pipeline backs up waiting for your callback to return
  • GST_DEBUG_DUMP_DOT_DIR lets you visualize the full pipeline graph and spot where formats are negotiated incorrectly

The unified memory architecture, and why it matters for pipelines

Jetson uses unified memory — the CPU and GPU share the same physical DRAM. This is different from a discrete GPU setup where CPU memory and GPU VRAM are separate. On Jetson, zero-copy between CPU and GPU is theoretically possible, but only when both sides use the right memory type.

NVIDIA’s GStreamer plugins use NVMM buffers: hardware-accessible buffers that the GPU and the dedicated engines (VIC, NVDEC, NVENC, DLA) can access without a DMA copy. The problem comes when you insert a CPU-based element into a pipeline that was otherwise entirely NVMM.

nvarguscamerasrc → nvvidconv → nvinfer → nvvidconv → nveglglessink
      NVMM          NVMM        NVMM       NVMM          NVMM

This is a zero-copy pipeline. Everything stays in NVMM.

nvarguscamerasrc → videoconvert → appsink
      NVMM           ← COPY →    system RAM

This copies every frame from NVMM to system RAM at videoconvert. On a 4K 30fps stream, that’s roughly 750MB/s of unnecessary memory bandwidth for 3-byte BGR frames (3840 × 2160 pixels × 3 bytes × 30 frames per second).
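
The bandwidth cost is easy to sanity-check with a few lines of arithmetic. The sketch below assumes uncompressed frames and one full-frame copy per frame; the function name is illustrative. At 3 bytes per pixel the result lands in the same ballpark as the figure quoted above.

```python
def copy_bandwidth_mb_s(width, height, fps, bytes_per_pixel):
    """Bytes written per second by one full-frame copy, in MB/s (1 MB = 10**6 bytes)."""
    return width * height * bytes_per_pixel * fps / 1e6


if __name__ == "__main__":
    # Bytes per pixel: NV12 is 12 bits (1.5 bytes), BGR is 3, BGRx is 4.
    for name, bpp in (("NV12", 1.5), ("BGR", 3), ("BGRx", 4)):
        rate = copy_bandwidth_mb_s(3840, 2160, 30, bpp)
        print(f"4K30 {name}: {rate:.0f} MB/s")
```

Keep in mind that a copy both reads and writes every byte, so the DRAM traffic it generates is roughly twice the number printed here.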

Identify the element causing the bottleneck

The fastest diagnostic is the GST pipeline graph. Set this before running your pipeline:

export GST_DEBUG_DUMP_DOT_DIR=/tmp

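Then run your pipeline with gst-launch-1.0. It writes one .dot file to /tmp per state change (NULL_READY, READY_PAUSED, PAUSED_PLAYING), so you get files even if the pipeline fails partway through startup. In your own application, call Gst.debug_bin_to_dot_file() to write one. Convert the dump you care about to PNG:

sudo apt install graphviz
# gst-launch-1.0 names its dumps <timestamp>-gst-launch.<OLD>_<NEW>.dot;
# clear old dumps first so the glob matches a single file
dot -Tpng /tmp/*PAUSED_PLAYING.dot -o /tmp/pipeline.png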

Open the PNG. Look for:

  • Elements showing video/x-raw (no memory:NVMM) — these are CPU-path elements causing copies
  • Caps negotiation failures — elements trying to send NVMM to a CPU-only sink
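
For large graphs, a quick text scan of the .dot file can flag the same thing before you open the image. This is a rough sketch: it assumes caps strings appear literally in the dump (as they do in the graph labels), and the function name is invented for this example. Counts can include the same link more than once, so treat a non-zero system-memory count as a hint to go look, not as a measurement.

```python
import re

# Matches a raw-video caps name and, if present, the NVMM memory feature.
_RAW_CAPS = re.compile(r"video/x-raw(\(memory:NVMM\))?")


def count_raw_caps(dot_text):
    """Return (nvmm, system) counts of video/x-raw caps found in a .dot dump."""
    nvmm = system = 0
    for match in _RAW_CAPS.finditer(dot_text):
        if match.group(1):
            nvmm += 1      # video/x-raw(memory:NVMM): hardware path
        else:
            system += 1    # plain video/x-raw: system-memory path
    return nvmm, system
```

To use it, read a dump with open(path).read() and pass the text in; a pipeline that is meant to be fully hardware-accelerated should report system-memory caps only at the point where frames are handed to the CPU.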

For runtime inspection, raise the GST_DEBUG level for specific debug categories (usually named after the element):

GST_DEBUG=nvvidconv:4,nvinfer:4,appsink:4 gst-launch-1.0 \
  nvarguscamerasrc ! nvvidconv ! nvinfer config-file-path=det.cfg ! \
  nvvidconv ! nvdrmvideosink

The log lines are timestamped, so you can follow buffers through each of those elements and see where frames are waiting.

The 4 patterns that kill throughput

Pattern 1: videoconvert in a hardware pipeline

The symptom is high CPU usage (70%+) on a pipeline that should be GPU-accelerated. The cause is one videoconvert element in the middle, forcing everything before and after it through system RAM.

Replace every videoconvert with nvvidconv:

# Slow (CPU path)
... ! videoconvert ! video/x-raw,format=BGR ! appsink

# Fast (hardware path)
... ! nvvidconv ! video/x-raw(memory:NVMM),format=NV12 ! nvvidconv ! \
    video/x-raw,format=BGRx ! appsink

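The second nvvidconv handles the final conversion out of NVMM before the CPU-side sink. Note that this example delivers BGRx (4 bytes per pixel) rather than BGR, so the application has to skip the padding byte.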

Pattern 2: Software decode

avdec_h264 is a CPU-based H.264 decoder. On Jetson, use nvv4l2decoder instead:

# Slow
... ! h264parse ! avdec_h264 ! videoconvert ! autovideosink

# Fast
... ! h264parse ! nvv4l2decoder ! nvvidconv ! nvdrmvideosink

nvv4l2decoder uses the NVDEC hardware engine. On Jetson Orin, this handles 4K H.264 at roughly 1% CPU vs 40%+ for the software decoder.

Pattern 3: Slow appsink callback

appsink is how you pull frames into Python or C++ code. If your callback takes longer to process a frame than the pipeline takes to produce one, frames pile up in front of the sink: the new-sample callback runs on the pipeline's streaming thread, so a slow callback stalls everything upstream of it.

By default appsink never drops buffers. Your pipeline backs up, your camera drops frames, and latency climbs. Fix it by setting a maximum buffer count and drop policy:

appsink = pipeline.get_by_name("appsink0")
appsink.set_property("max-buffers", 1)
appsink.set_property("drop", True)
appsink.set_property("emit-signals", True)

drop=True means the sink drops old frames rather than blocking the pipeline. You lose frames but maintain real-time processing. If you can’t afford to drop frames, you need to speed up the callback — move heavy work off the GStreamer thread with a separate processing queue.
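
One way to move the heavy work off the GStreamer thread is a one-slot hand-off queue and a worker thread. The sketch below uses only the Python standard library, and every name in it (offer_latest, start_worker, stop_worker) is made up for illustration. The idea is that the appsink callback does nothing except pull the sample and call offer_latest(), then returns immediately.

```python
import queue
import threading

_STOP = object()  # sentinel that tells the worker to exit


def offer_latest(q, item):
    """Put item in q without blocking; if q is full, discard the oldest entry."""
    while True:
        try:
            q.put_nowait(item)
            return
        except queue.Full:
            try:
                q.get_nowait()   # drop the stale frame
            except queue.Empty:
                pass             # the worker took it first; retry the put


def start_worker(process, maxsize=1):
    """Start a daemon thread that calls process(frame) for each queued frame."""
    q = queue.Queue(maxsize=maxsize)

    def run():
        while True:
            frame = q.get()
            if frame is _STOP:
                return
            process(frame)

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return q, thread


def stop_worker(q, thread):
    """Queue the stop sentinel (waits for a free slot) and join the worker."""
    q.put(_STOP)
    thread.join()
```

With maxsize=1 the worker always sees the most recent frame, which mirrors what max-buffers=1 and drop=True do inside appsink. If you cannot afford to lose frames, raise maxsize and use a blocking put instead, at the cost of added latency.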

Pattern 4: Missing queue elements

By default, GStreamer runs every element downstream of a source on that source's streaming thread; a queue element starts a new thread for everything after it. Without a queue between them, a slow sink element (like a network sink) blocks the camera source thread:

nvarguscamerasrc ! nvvidconv ! queue max-size-buffers=4 ! \
    nvv4l2h264enc ! rtph264pay ! udpsink host=192.168.1.100 port=5000

The queue element decouples the camera thread from the encoder/network thread. Without it, network jitter stalls the camera.
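
The trade-off is latency: a queue that is allowed to fill adds delay. A back-of-the-envelope helper, assuming the buffer-count limit is the one that applies and the stream runs at a constant frame rate (the function name is illustrative):

```python
def max_queue_delay_ms(max_size_buffers, fps):
    """Worst-case delay added by a full queue, in milliseconds."""
    return max_size_buffers * 1000.0 / fps


if __name__ == "__main__":
    for buffers in (1, 4):
        delay = max_queue_delay_ms(buffers, 30)
        print(f"max-size-buffers={buffers} at 30 fps: up to {delay:.0f} ms")
```

At 30 fps, the four-buffer queue in the pipeline above can hold up to about 133 ms of video when the encoder or network falls behind, compared with about 33 ms for a single-buffer queue.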

A profiling pipeline for benchmarking

This pipeline measures pure throughput from camera to nowhere — useful for finding your theoretical maximum before you add processing:

gst-launch-1.0 -v \
  nvarguscamerasrc num-buffers=300 ! \
  'video/x-raw(memory:NVMM),width=1920,height=1080,framerate=30/1' ! \
  nvvidconv ! \
  'video/x-raw(memory:NVMM),format=NV12' ! \
  fakesink sync=false

fakesink drops all frames immediately with no rendering overhead. If this pipeline runs at full framerate, your hardware can handle the resolution. If it drops frames here, the bottleneck is upstream of processing — check camera driver configuration and MIPI bandwidth.
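
To turn that run into a number, you can time it from Python. This is a minimal sketch under a loud assumption: it divides the buffer count by wall-clock time for the whole process, so camera and pipeline startup are included and the result underestimates the steady-state rate. The helper name and BENCH_CMD are invented for this example; the command mirrors the benchmark pipeline above without the -v flag.

```python
import subprocess
import time


def measure_fps(cmd, num_buffers):
    """Run cmd to completion and return num_buffers divided by wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
    elapsed = time.perf_counter() - start
    return num_buffers / elapsed


# No shell is involved, so the caps strings need no quoting here.
BENCH_CMD = [
    "gst-launch-1.0",
    "nvarguscamerasrc", "num-buffers=300", "!",
    "video/x-raw(memory:NVMM),width=1920,height=1080,framerate=30/1", "!",
    "nvvidconv", "!",
    "video/x-raw(memory:NVMM),format=NV12", "!",
    "fakesink", "sync=false",
]

# On a Jetson with a camera attached:
#   print(f"{measure_fps(BENCH_CMD, 300):.1f} fps")
```

A result well below the requested 30 fps on a longer run (raise num-buffers to dilute the startup cost) points at the camera side rather than at your processing.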

For network streaming pipelines, see the GStreamer development service page. If you’re comparing building this yourself against using an external specialist like RidgeRun, the RidgeRun vs ProventusNova comparison has a direct breakdown.


Frequently Asked Questions

Why is my GStreamer pipeline slow on Jetson even though it runs fine on a desktop?

Desktop CPUs have more cores and higher memory bandwidth. On Jetson, the CPU and GPU share memory (unified memory architecture), but GStreamer pipelines that mix CPU-based elements with GPU-accelerated elements cause data copies between system memory and NVMM. Each memory copy is expensive. The fix is to use hardware-accelerated elements (nvvidconv, nvv4l2decoder, nvinfer) throughout the pipeline and keep data in NVMM memory as long as possible.

What is NVMM memory in GStreamer on Jetson?

NVMM (NVIDIA Memory Manager) is a zero-copy memory buffer that lives in the GPU's address space on Jetson. Elements like nvvidconv, nvinfer, and nvarguscamerasrc produce and consume NVMM buffers without copying to system RAM. When a CPU-based element like videoconvert receives an NVMM buffer, it has to copy it to system memory first — that copy is usually where latency comes from on Jetson pipelines.

What is the difference between nvvidconv and videoconvert in GStreamer?

videoconvert is a CPU-based color space conversion element. nvvidconv is NVIDIA's hardware-accelerated equivalent that runs on the Jetson VIC (Video Image Compositor) hardware engine and operates on NVMM buffers. On a Jetson pipeline doing 4K or multi-stream work, replacing videoconvert with nvvidconv can cut CPU usage by 60-80% and reduce latency by half.

How do I enable GStreamer debug logs to diagnose pipeline issues?

Set GST_DEBUG=3 for general debug info: GST_DEBUG=3 gst-launch-1.0 .... For element-specific logs: GST_DEBUG=nvvidconv:5,nvinfer:4. To generate a pipeline graph: export GST_DEBUG_DUMP_DOT_DIR=/tmp, run the pipeline, then convert one of the .dot files (gst-launch-1.0 writes one per state change) with 'dot -Tpng /tmp/*PAUSED_PLAYING.dot -o pipeline.png'. The graph shows all elements, the links between them, and the negotiated caps.

What is the maximum number of camera streams Jetson Orin can handle in GStreamer?

It depends on resolution, frame rate, and codec. Jetson AGX Orin can typically handle 8-16 1080p30 streams simultaneously using hardware decode (nvv4l2decoder) and nvstreammux for batching. Using software decode (avdec_h264) instead drops that to 2-4 streams before the CPU saturates. The VIC and NVENC/NVDEC engines are the real limit, not the CPU.

Written by

Andrés Campos

Co-Founder & CTO · ProventusNova

8 years deep in embedded systems — from underwater ROVs to edge AI. Andrés leads every technical delivery personally.
