Tags: jetson, orin, llm, llama.cpp, ollama, ai, inference, embedded linux, cuda

Running LLMs on Jetson Orin — llama.cpp, Ollama, and jetson-containers

Aarón Angulo

Jetson Orin’s unified memory architecture makes it unusually capable for edge LLM inference — there is no separate GPU VRAM limit, so models that fit in total system RAM can run fully GPU-accelerated. The AGX Orin 64GB can run a 70B model at Q4 quantization. The challenge is not whether a model fits, but setting up the correct build flags and quantization for the available RAM.

Key Insights

  • Unified memory eliminates the VRAM bottleneck: on Jetson, system RAM and GPU memory are the same physical pool, so a 64GB AGX Orin can GPU-offload a 40GB model
  • Q4_K_M is a good default quantization for most Jetson deployments: it balances inference quality with memory footprint; Q8 is too large for most SKUs, and Q2 degrades quality significantly
  • jetson-containers is the fastest setup path: pre-built containers for every JetPack/L4T version, with no CUDA compilation required
  • Run sudo jetson_clocks before running inference: it locks CPU and GPU clocks at the maximum for the current power mode and significantly improves tokens/sec
  • Context length scales KV cache memory linearly: for Llama 3.1 8B (32 layers, 8 KV heads, head size 128), a 4096-token f16 KV cache is about 0.5GB before compute buffers; reduce to 2048 for tight memory budgets

Model selection by Jetson SKU

Jetson SKU | RAM | Recommended models | Max model size at Q4
AGX Orin 64GB | 64GB | Llama 3.1 70B, Qwen 2.5 32B | ~40GB
AGX Orin 32GB | 32GB | Llama 3.1 8B FP16, Qwen 2.5 14B Q4 | ~18GB
Orin NX 16GB | 16GB | Llama 3.1 8B Q4-Q6, Mistral 7B Q4 | ~8GB
Orin NX 8GB | 8GB | Llama 3.2 3B, Phi-3 mini Q4 | ~4GB
Orin Nano 8GB | 8GB | Phi-3 mini Q4, Qwen 2.5 1.5B | ~4GB
Orin Nano 4GB | 4GB | Phi-3 mini Q4 (tight), Qwen 2.5 1.5B | ~2GB
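
The "Max model size at Q4" column can be sanity-checked with the usual weights-only estimate: parameter count times bits per weight, divided by 8. The helper below is a minimal sketch; the name weights_gb is hypothetical, and the nominal 4 bits is an assumption (K-quant files such as Q4_K_M average somewhat more than 4 bits per weight, so real files come out larger than this estimate).

```shell
# Weights-only size estimate in GB: parameters (billions) * bits per weight / 8.
# Excludes KV cache and runtime overhead.
weights_gb() {
  local params_b="$1" bits="$2"
  awk -v p="$params_b" -v b="$bits" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# A 70B model at a nominal 4 bits per weight
weights_gb 70 4     # prints 35.0
# An 8B model at 16 bits per weight (FP16)
weights_gb 8 16     # prints 16.0
```

The 35GB nominal figure for a 70B model sits below the ~40GB table entry, which is what you would expect once the extra bits in a K-quant file are counted.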

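Memory formula: required_RAM ≈ model_file_GB + KV_cache_GB + ~1GB runtime overhead, where KV_cache_GB = 2 * num_layers * num_kv_heads * head_size * context_length * bytes_per_element / 1e9 (the 2 covers keys and values; bytes_per_element is 2 for an f16 cache)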
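
To make the KV cache term concrete, here is a small sketch that counts keys and values per layer, per KV head, per token. The function name kv_cache_mib is hypothetical, and the Llama 3.1 8B shape used in the example (32 layers, 8 KV heads, head size 128) is an assumption taken from the published model configuration rather than from this article.

```shell
# f16 KV cache size in MiB:
#   2 (K and V) * layers * kv_heads * head_size * context * bytes_per_element
kv_cache_mib() {
  local layers="$1" kv_heads="$2" head_size="$3" ctx="$4" bytes="${5:-2}"
  echo $(( 2 * layers * kv_heads * head_size * ctx * bytes / 1048576 ))
}

# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads, head size 128
kv_cache_mib 32 8 128 4096   # prints 512
kv_cache_mib 32 8 128 2048   # prints 256
```

This counts the cache alone. llama.cpp also allocates compute buffers that grow with context, so measured usage will be somewhat higher than this number.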

Quick start with jetson-containers

# Clone the jetson-containers repo
git clone https://github.com/dusty-nv/jetson-containers
cd jetson-containers

# Install prerequisites
bash install.sh

# Run Ollama container (auto-selects correct image for your JetPack version)
./run.sh $(./autotag ollama)

# Inside the container, pull and run a model:
ollama pull llama3.1:8b
ollama run llama3.1:8b

The autotag script queries your L4T version and selects the matching container image. No CUDA compilation, no dependency management.
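
If you want to see which L4T release a board reports, the version lives in /etc/nv_tegra_release. The sketch below parses that file's first line into a dotted version string. It illustrates the idea only and is not how autotag itself is implemented; the function name parse_l4t_version is hypothetical.

```shell
# Turn the first line of /etc/nv_tegra_release into "major.minor.patch".
# Expected line shape: "# R36 (release), REVISION: 4.0, GCID: ..."
parse_l4t_version() {
  sed -n 's/^# R\([0-9][0-9]*\) (release), REVISION: \([0-9.][0-9.]*\).*/\1.\2/p'
}

# On a Jetson: head -n 1 /etc/nv_tegra_release | parse_l4t_version
echo '# R36 (release), REVISION: 4.0, GCID: 0, BOARD: generic, EABI: aarch64' | parse_l4t_version
# prints 36.4.0
```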

Building llama.cpp from source (CUDA-accelerated)

For more control over build flags:

# Prerequisites
sudo apt update && sudo apt install -y git cmake build-essential libcurl4-openssl-dev

# Clone
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

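# Build with CUDA support
# CMAKE_CUDA_ARCHITECTURES=87 targets Orin (sm_87).
# GGML_CUDA_DMMV_X and GGML_CUDA_MMV_Y are tuning options from older llama.cpp
# releases; newer checkouts may report them as unused.
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="87" \
  -DGGML_CUDA_F16=ON \
  -DGGML_CUDA_DMMV_X=64 \
  -DGGML_CUDA_MMV_Y=2
cmake --build build --config Release -j "$(nproc)"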

CUDA architecture by Jetson module:

  • Orin (all SKUs, including Orin Nano): sm_87
  • AGX Xavier: sm_72
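
If the same build script has to serve more than one module family, the mapping above can be wrapped in a small lookup. This is a sketch; cuda_arch_for is a hypothetical helper name, and it only knows the families listed in this article.

```shell
# Map a Jetson family name to the value for -DCMAKE_CUDA_ARCHITECTURES.
cuda_arch_for() {
  case "$(echo "$1" | tr '[:upper:]' '[:lower:]')" in
    *orin*)   echo 87 ;;
    *xavier*) echo 72 ;;
    *)        echo "unknown Jetson family: $1" >&2; return 1 ;;
  esac
}

cuda_arch_for "AGX Orin"   # prints 87
```

The result can be passed straight to the configure step, for example -DCMAKE_CUDA_ARCHITECTURES="$(cuda_arch_for "Orin NX")".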

Running inference with llama.cpp

# Maximize clocks first
sudo jetson_clocks

# Download a Q4_K_M model
# Example: Llama 3.1 8B Instruct Q4_K_M from Hugging Face
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

# Run inference with all layers offloaded to the GPU
#   -ngl 99  offload every layer (99 exceeds the model's layer count)
#   -c 4096  context length
#   -n 200   max tokens to generate
./build/bin/llama-cli \
  -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -ngl 99 \
  -c 4096 \
  -n 200 \
  --temp 0.7 \
  -p "Explain MIPI CSI-2 in one paragraph:"

# Expected performance on Orin NX 16GB:
# Prompt eval: ~15 tokens/sec
# Generation: ~18 tokens/sec

Ollama setup (without jetson-containers)

# Install Ollama for aarch64
curl -fsSL https://ollama.com/install.sh | sh

# Start server (skip if the installer already registered and started the ollama systemd service)
ollama serve &

# Pull a model
ollama pull qwen2.5:7b

# Run
ollama run qwen2.5:7b

# API endpoint
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:7b",
  "prompt": "What is PREEMPT_RT?",
  "stream": false
}'
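
With "stream": false, the JSON response also carries timing fields, including eval_count (generated tokens) and eval_duration (nanoseconds). The sketch below turns those into a tokens/sec figure using only grep and awk; the helper name ollama_gen_tps is hypothetical, and the sample JSON is made up for illustration, not a measured result.

```shell
# Read an /api/generate JSON response on stdin and print generation tokens/sec.
# eval_duration is reported in nanoseconds.
ollama_gen_tps() {
  local json count dur
  json="$(cat)"
  # The leading quote keeps these patterns from matching prompt_eval_* fields.
  count="$(printf '%s' "$json" | grep -oE '"eval_count": *[0-9]+' | head -n 1 | grep -oE '[0-9]+$')"
  dur="$(printf '%s' "$json" | grep -oE '"eval_duration": *[0-9]+' | head -n 1 | grep -oE '[0-9]+$')"
  [ -n "$count" ] && [ -n "$dur" ] && [ "$dur" -gt 0 ] || return 1
  awk -v c="$count" -v d="$dur" 'BEGIN { printf "%.1f\n", c / (d / 1e9) }'
}

# Illustrative input: 200 tokens generated in 10 seconds
printf '%s' '{"model":"qwen2.5:7b","done":true,"prompt_eval_count":12,"prompt_eval_duration":600000000,"eval_count":200,"eval_duration":10000000000}' | ollama_gen_tps
# prints 20.0
```

On a device, pipe the curl call above into ollama_gen_tps to get a comparable number for your own setup.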

Performance benchmarks

Benchmarked on JetPack 6.2, jetson_clocks enabled, context=2048:

Model | Jetson SKU | Quantization | Prompt (tok/s) | Gen (tok/s)
Llama 3.1 8B | AGX Orin 64GB | Q4_K_M | 48 | 52
Llama 3.1 8B | Orin NX 16GB | Q4_K_M | 15 | 18
Llama 3.1 8B | Orin NX 8GB | Q4_K_M | 10 | 11
Phi-3 mini 3.8B | Orin Nano 8GB | Q4_K_M | 22 | 28
Qwen 2.5 7B | Orin NX 16GB | Q4_K_M | 18 | 20

Memory management tips

# Check available memory before loading model
free -h
tegrastats --interval 1000 | head -n 1   # one sample; the RAM field shows used/total in MB

# Release GPU memory between runs (kills every process holding these device nodes,
# which can include the desktop session, so save work first)
sudo fuser -k /dev/nvidia*

# For concurrent LLM + CV pipeline:
# Reserve memory by setting a shorter llama.cpp context
# -c 1024 instead of 4096 shrinks the f16 KV cache by 75% (roughly 0.4GB for Llama 3.1 8B)
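
The "check before loading" step can be scripted. The sketch below reads MemAvailable from /proc/meminfo and compares it with what the model needs plus some headroom. Both function names are hypothetical, and the 1024MiB default headroom is an assumption. On Jetson's unified memory this single figure covers what both the CPU and the GPU can draw from.

```shell
# Print MemAvailable in MiB from a meminfo-format file (default /proc/meminfo).
mem_available_mib() {
  awk '/^MemAvailable:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Succeed if needed_mib plus headroom_mib fits in currently available memory.
fits_in_ram() {
  local needed_mib="$1" headroom_mib="${2:-1024}" meminfo="${3:-/proc/meminfo}"
  local avail
  avail="$(mem_available_mib "$meminfo")"
  [ -n "$avail" ] && [ $(( needed_mib + headroom_mib )) -le "$avail" ]
}
```

A guard such as fits_in_ram 6000 && ./build/bin/llama-cli ... then refuses to start a run that would push the board into swap or the OOM killer.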

For GPU-accelerated TensorRT inference on Jetson (for vision models), see TensorRT vs DLA on Jetson Orin. For running containerized workloads including AI models, see Docker containers on Jetson Orin.


Frequently Asked Questions

What LLMs can run on Jetson Orin?

On AGX Orin 64GB: Llama 3.1 70B at Q4_K_M quantization, Qwen 2.5 32B at Q4, Mistral 7B at full precision (FP16). On Orin NX 16GB: Llama 3.1 8B at Q4-Q6, Qwen 2.5 7B, Phi-3 mini (3.8B). On Orin Nano 8GB: Phi-3 mini, Qwen 2.5 1.5B-3B. As a rule of thumb, the weights take parameter_count * quantization_bits / 8 bytes (8B parameters at 4 bits is about 4GB), and that must fit in available RAM with ~1GB of headroom for the KV cache.

What is the fastest way to run an LLM on Jetson Orin?

Use the jetson-containers project (github.com/dusty-nv/jetson-containers), which provides pre-built Docker containers with llama.cpp, Ollama, and MLC-LLM built for each JetPack/L4T version. This avoids building CUDA dependencies from source. From the repository root, run ./run.sh $(./autotag ollama) to get a ready-to-use Ollama instance.

How much RAM does an LLM use on Jetson Orin?

For llama.cpp on Jetson's unified memory: RAM used ≈ model_file_size + KV cache + runtime buffers. A Q4_K_M Llama 3.1 8B model file is ~4.7GB and uses ~6GB in total with a 4096-token context. Unified memory means both CPU and GPU address the same physical LPDDR5, so there is no GPU VRAM limit separate from system RAM.

Does llama.cpp use the GPU on Jetson Orin?

Yes, via CUDA. Build llama.cpp with GGML_CUDA=ON and set -ngl (number of GPU layers) to offload model layers to the GPU. On Jetson's unified memory architecture, GPU offload still uses the same physical RAM but executes layers using the CUDA cores, which is 2-4x faster than CPU-only inference for large models.

Written by

Aarón Angulo

Co-Founder & CEO · ProventusNova