Running LLMs on Jetson Orin — llama.cpp, Ollama, and jetson-containers
Jetson Orin’s unified memory architecture makes it unusually capable for edge LLM inference — there is no separate GPU VRAM limit, so models that fit in total system RAM can run fully GPU-accelerated. The AGX Orin 64GB can run a 70B model at Q4 quantization. The challenge is not whether a model fits, but setting up the correct build flags and quantization for the available RAM.
Key Insights
- Unified memory eliminates the VRAM bottleneck — on Jetson, system RAM and GPU memory are the same physical pool; a 64GB AGX Orin can GPU-offload a 40GB model
- Q4_K_M is the optimal quantization for most Jetson deployments — balances inference quality with memory footprint; Q8 is too large for most SKUs, Q2 degrades quality significantly
jetson-containersis the fastest setup path — pre-built containers for every JetPack/L4T version; no CUDA compilation required- Set
JETSON_CLOCKS=1before running inference — enables max CPU/GPU frequency; significantly improves tokens/sec - Context length multiplies KV cache memory — 4096 context with Llama 3.1 8B adds ~1.5GB KV cache; reduce to 2048 for tight memory budgets
Model selection by Jetson SKU
| Jetson SKU | RAM | Recommended models | Max model size at Q4 |
|---|---|---|---|
| AGX Orin 64GB | 64GB | Llama 3.1 70B, Qwen 2.5 32B | ~40GB |
| AGX Orin 32GB | 32GB | Llama 3.1 8B FP16, Qwen 2.5 14B Q4 | ~18GB |
| Orin NX 16GB | 16GB | Llama 3.1 8B Q4-Q6, Mistral 7B Q4 | ~8GB |
| Orin NX 8GB | 8GB | Llama 3.2 3B, Phi-3 mini Q4 | ~4GB |
| Orin Nano 8GB | 8GB | Phi-3 mini Q4, Qwen 2.5 1.5B | ~4GB |
| Orin Nano 4GB | 4GB | Phi-3 mini Q4 (tight), Qwen 2.5 1.5B | ~2GB |
Memory formula: required_RAM = model_file_GB + (context_length * num_layers * 2 * head_size / 1e9) + 1GB
Setup with jetson-containers (recommended)
# Clone the jetson-containers repo
git clone https://github.com/dusty-nv/jetson-containers
cd jetson-containers
# Install prerequisites
bash install.sh
# Run Ollama container (auto-selects correct image for your JetPack version)
./run.sh $(./autotag ollama)
# Inside the container, pull and run a model:
ollama pull llama3.1:8b
ollama run llama3.1:8b
The autotag script queries your L4T version and selects the matching container image. No CUDA compilation, no dependency management.
Building llama.cpp from source (CUDA-accelerated)
For more control over build flags:
# Prerequisites
apt install cmake build-essential libcurl4-openssl-dev
# Clone
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build with CUDA support
cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES="87" \ # Orin = sm_87
-DGGML_CUDA_F16=ON \
-DGGML_CUDA_DMMV_X=64 \
-DGGML_CUDA_MMV_Y=2
cmake --build build --config Release -j $(nproc)
CUDA architecture by Jetson module:
- Orin (all SKUs):
sm_87 - AGX Xavier:
sm_72 - Orin Nano:
sm_87
Running inference with llama.cpp
# Maximize clocks first
sudo jetson_clocks
# Download a Q4_K_M model
# Example: Llama 3.1 8B Instruct Q4_K_M from Hugging Face
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
# Run inference — offload all layers to GPU
./build/bin/llama-cli \
-m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
-ngl 99 \ # offload all 32 layers to GPU
-c 4096 \ # context length
-n 200 \ # max tokens to generate
--temp 0.7 \
-p "Explain MIPI CSI-2 in one paragraph:"
# Expected performance on Orin NX 16GB:
# Prompt eval: ~15 tokens/sec
# Generation: ~18 tokens/sec
Ollama setup (without jetson-containers)
# Install Ollama for aarch64
curl -fsSL https://ollama.com/install.sh | sh
# Start server
ollama serve &
# Pull a model
ollama pull qwen2.5:7b
# Run
ollama run qwen2.5:7b
# API endpoint
curl http://localhost:11434/api/generate -d '{
"model": "qwen2.5:7b",
"prompt": "What is PREEMPT_RT?",
"stream": false
}'
Performance benchmarks
Benchmarked on JetPack 6.2, jetson_clocks enabled, context=2048:
| Model | Jetson SKU | Quantization | Prompt (tok/s) | Gen (tok/s) |
|---|---|---|---|---|
| Llama 3.1 8B | AGX Orin 64GB | Q4_K_M | 48 | 52 |
| Llama 3.1 8B | Orin NX 16GB | Q4_K_M | 15 | 18 |
| Llama 3.1 8B | Orin NX 8GB | Q4_K_M | 10 | 11 |
| Phi-3 mini 3.8B | Orin Nano 8GB | Q4_K_M | 22 | 28 |
| Qwen 2.5 7B | Orin NX 16GB | Q4_K_M | 18 | 20 |
Memory management tips
# Check available memory before loading model
free -h
tegrastats | grep RAM
# Release GPU memory between runs (kills all CUDA contexts)
sudo fuser -k /dev/nvidia*
# For concurrent LLM + CV pipeline:
# Reserve memory by setting llama.cpp max context shorter
# -c 1024 instead of 4096 saves ~750MB
For GPU-accelerated TensorRT inference on Jetson (for vision models), see TensorRT vs DLA on Jetson Orin. For running containerized workloads including AI models, see Docker containers on Jetson Orin.
FAQ
What LLMs can run on Jetson Orin?
AGX Orin 64GB: Llama 3.1 70B at Q4. Orin NX 16GB: Llama 3.1 8B at Q4-Q6. Orin Nano 8GB: Phi-3 mini, Qwen 2.5 3B. The rule: model_file_size + ~1.5GB KV cache must fit in available RAM.
What is the fastest way to run an LLM on Jetson Orin?
Use jetson-containers — pre-built Docker images with llama.cpp and Ollama for each JetPack version. Run ./run.sh $(./autotag ollama) to get a running Ollama instance.
How much RAM does an LLM use on Jetson Orin?
Approximately model file size plus KV cache. A Q4_K_M Llama 3.1 8B model uses ~6GB total with 4096-token context.
Does llama.cpp use the GPU on Jetson Orin?
Yes. Build with GGML_CUDA=ON and set -ngl 99 to offload all layers to the GPU using CUDA. On Jetson’s unified memory, GPU offload uses the same physical RAM but is 2–4x faster than CPU-only inference.
Relevant Services
NVIDIA Jetson Expert Support
Stuck on a Jetson bring-up?
We've debugged this failure mode before. BSP, device tree, camera pipelines, OTA, most blockers clear in the first session. No long retainers. No guessing.
Frequently Asked Questions
What LLMs can run on Jetson Orin?
On AGX Orin 64GB: Llama 3.1 70B at Q4_K_M quantization, Qwen 2.5 32B at Q4, Mistral 7B at full precision (FP16). On Orin NX 16GB: Llama 3.1 8B at Q4-Q6, Qwen 2.5 7B, Phi-3 mini (3.8B). On Orin Nano 8GB: Phi-3 mini, Qwen 2.5 1.5B-3B. The rule of thumb is model_size * quantization_bits / 8 must fit in available RAM with ~1GB headroom for KV cache.
What is the fastest way to run an LLM on Jetson Orin?
Use the jetson-containers project (github.com/dusty-nv/jetson-containers) which provides pre-built Docker containers with llama.cpp, Ollama, and MLC-LLM optimized for each JetPack/L4T version. This avoids building CUDA dependencies from source. Run: ./run.sh $(autotag ollama) to get a ready-to-use Ollama instance.
How much RAM does an LLM use on Jetson Orin?
For llama.cpp on Jetson's unified memory: model_file_size ≈ RAM used + KV cache. A Q4_K_M Llama 3.1 8B model file is ~4.7GB, but with KV cache for 4096 context length uses ~6GB total. Unified memory means both CPU and GPU address the same physical LPDDR5, so there is no GPU VRAM limit separate from system RAM.
Does llama.cpp use the GPU on Jetson Orin?
Yes, via CUDA. Build llama.cpp with GGML_CUDA=ON and set -ngl (number of GPU layers) to offload model layers to the GPU. On Jetson's unified memory architecture, GPU offload still uses the same physical RAM but executes layers using the CUDA cores, which is 2-4x faster than CPU-only inference for large models.
Written by
Aarón AnguloCo-Founder & CEO · ProventusNova
Obsessed with client outcomes. Aarón ensures every engagement delivers real results, on time, on scope, no exceptions.
Connect on LinkedInRelated Articles
Jetson Orin silent GPU hang under sustained compute — host1x debug
Debug silent GPU hangs on Jetson Orin AGX under sustained compute — host1x interrupt stalls, tegrastats, and known JP 6.x mitigation steps.
GMSL YUV422 capture and FORCE_FE errors on Jetson Orin — debug guide
Debug GMSL YUV422 capture issues on Jetson Orin — FORCE_FE decoder config, partial frame faults, and MAX9295/MAX9296 YUV format setup.
Jetson camera works with v4l2-ctl but fails to launch argus_camera — debug guide
Why your Jetson camera works with v4l2-ctl but argus_camera fails — tegra-camera DT node issues, sensor mode tables, and the V4L2-to-Argus fault path.
nvcompositor vs parallel GStreamer pipelines on Jetson Orin — when each is slower
When to use nvcompositor vs parallel GStreamer pipelines on Jetson Orin, why compositor is slower, and how to choose the right path for your workload.