Running LLMs on Jetson Orin — llama.cpp, Ollama, and jetson-containers

Q: What LLMs can run on Jetson Orin?

On AGX Orin 64GB: Llama 3.1 70B at Q4_K_M quantization, Qwen 2.5 32B at Q4, Mistral 7B unquantized at FP16. On Orin NX 16GB: Llama 3.1 8B at Q4-Q6, Qwen 2.5 7B, Phi-3 mini (3.8B). On Orin Nano 8GB: Phi-3 mini, Qwen 2.5 1.5B-3B. Rule of thumb: the weights take roughly parameter_count * bits_per_weight / 8 bytes, and that figure plus the KV cache and ~1GB of headroom must fit in available RAM.
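The rule of thumb is easy to check in the shell. This is a rough sketch: weights_gb is a hypothetical helper name, and the bits-per-weight values passed to it are approximations (K-quants mix several bit widths), not figures read from a model file.

```shell
#!/usr/bin/env bash
# Rough weight-size estimate: parameters (billions) * bits per weight / 8 = GB.
# weights_gb is an illustrative helper, not part of llama.cpp or Ollama.
weights_gb() {
  awk -v params_b="$1" -v bits="$2" 'BEGIN { printf "%.1f\n", params_b * bits / 8 }'
}

weights_gb 70 4.5   # 70B at an assumed ~4.5 bits/weight -> 39.4
weights_gb 7 16     # 7B at FP16 -> 14.0
```

The result covers the weights only; KV cache and headroom still have to be added before comparing against the RAM of a given SKU.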

Q: What is the fastest way to run an LLM on Jetson Orin?

Use the jetson-containers project (github.com/dusty-nv/jetson-containers), which provides pre-built Docker containers with llama.cpp, Ollama, and MLC-LLM built for specific JetPack/L4T versions. This avoids building CUDA dependencies from source. From the cloned repo, run ./run.sh $(./autotag ollama) to get a ready-to-use Ollama instance.

Q: How much RAM does an LLM use on Jetson Orin?

For llama.cpp on Jetson's unified memory: RAM used ≈ model file size + KV cache + runtime buffers. A Q4_K_M Llama 3.1 8B model file is ~4.9GB, and with the KV cache and buffers for a 4096-token context it uses ~6GB total. Unified memory means both CPU and GPU address the same physical LPDDR5, so there is no GPU VRAM limit separate from system RAM.

Q: Does llama.cpp use the GPU on Jetson Orin?

Yes, via CUDA. Build llama.cpp with GGML_CUDA=ON and set -ngl (number of GPU layers) to offload model layers to the GPU. On Jetson's unified memory architecture, GPU offload still uses the same physical RAM but executes layers using the CUDA cores, which is 2-4x faster than CPU-only inference for large models.

Jetson Orin’s unified memory architecture makes it unusually capable for edge LLM inference — there is no separate GPU VRAM limit, so models that fit in total system RAM can run fully GPU-accelerated. The AGX Orin 64GB can run a 70B model at Q4 quantization. The challenge is not whether a model fits, but setting up the correct build flags and quantization for the available RAM.

Key Insights

Unified memory eliminates the VRAM bottleneck: on Jetson, system RAM and GPU memory are the same physical pool, so a 64GB AGX Orin can GPU-offload a 40GB model
Q4_K_M is a sensible default quantization for most Jetson deployments: it balances output quality against memory footprint; Q8_0 nearly doubles the footprint, and Q2_K degrades quality significantly
jetson-containers is the fastest setup path: pre-built containers matched to your JetPack/L4T version, with no CUDA compilation required
Run sudo jetson_clocks before inference: it pins CPU, GPU, and memory clocks at the maximum for the current power mode, which noticeably improves tokens/sec
Context length scales KV cache memory linearly: with an FP16 cache, Llama 3.1 8B needs about 0.5GB at 4096 context, plus compute buffers; halve the context to 2048 for tight memory budgets

Model selection by Jetson SKU

Jetson SKU	RAM	Recommended models	Max model size at Q4
AGX Orin 64GB	64GB	Llama 3.1 70B, Qwen 2.5 32B	~40GB
AGX Orin 32GB	32GB	Llama 3.1 8B FP16, Qwen 2.5 14B Q4	~18GB
Orin NX 16GB	16GB	Llama 3.1 8B Q4-Q6, Mistral 7B Q4	~8GB
Orin NX 8GB	8GB	Llama 3.2 3B, Phi-3 mini Q4	~4GB
Orin Nano 8GB	8GB	Phi-3 mini Q4, Qwen 2.5 1.5B	~4GB
Orin Nano 4GB	4GB	Phi-3 mini Q4 (tight), Qwen 2.5 1.5B	~2GB

Memory formula: required_RAM ≈ model_file_GB + KV_cache_GB + 1GB headroom, where KV_cache_bytes = 2 * num_layers * num_kv_heads * head_dim * context_length * bytes_per_element (2 bytes for an FP16 cache)
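The KV-cache term can be made concrete once grouped-query attention is taken into account. The sketch below assumes an FP16 cache and Llama 3.1 8B's published architecture (32 layers, 8 KV heads, head dimension 128); kv_cache_bytes is a hypothetical helper name, and other models need their own values from the model config.

```shell
#!/usr/bin/env bash
# KV cache bytes = 2 (K and V) * layers * KV heads * head dim * context * bytes per element.
# kv_cache_bytes is an illustrative helper; it ignores compute buffers and the CUDA context.
kv_cache_bytes() {
  local layers=$1 kv_heads=$2 head_dim=$3 ctx=$4 elem_bytes=$5
  echo $(( 2 * layers * kv_heads * head_dim * ctx * elem_bytes ))
}

# Llama 3.1 8B (32 layers, 8 KV heads, head dim 128), FP16 cache:
kv_cache_bytes 32 8 128 4096 2   # 536870912 bytes (512 MiB)
kv_cache_bytes 32 8 128 2048 2   # 268435456 bytes (256 MiB)
```

Because the result is linear in context length, halving -c halves the cache, which is why a shorter context is the first knob to turn on small-memory SKUs.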

Setup with jetson-containers (recommended)

# Clone the jetson-containers repo
git clone https://github.com/dusty-nv/jetson-containers
cd jetson-containers

# Install prerequisites
bash install.sh

# Run Ollama container (auto-selects correct image for your JetPack version)
./run.sh $(./autotag ollama)

# Inside the container, pull and run a model:
ollama pull llama3.1:8b
ollama run llama3.1:8b

The autotag script queries your L4T version and selects the matching container image. No CUDA compilation, no dependency management.
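To see which L4T release a board is running, the version string can be read from /etc/nv_tegra_release. The sketch below only illustrates where that information lives; l4t_version is a hypothetical helper, the sample line uses placeholder values, and autotag's own detection logic may differ.

```shell
#!/usr/bin/env bash
# Extract "major.revision" from the first line of /etc/nv_tegra_release.
# l4t_version is an illustrative helper; it reads the line from stdin.
l4t_version() {
  sed -n 's/^# R\([0-9][0-9]*\) (release), REVISION: \([0-9.][0-9.]*\).*/\1.\2/p'
}

# Sample line in the format the file uses (GCID and BOARD are placeholders):
echo '# R36 (release), REVISION: 4.3, GCID: 0, BOARD: generic, EABI: aarch64' | l4t_version
# On a device: head -n 1 /etc/nv_tegra_release | l4t_version
```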

Building llama.cpp from source (CUDA-accelerated)

For more control over build flags:

# Prerequisites
sudo apt update && sudo apt install -y git cmake build-essential libcurl4-openssl-dev

# Clone
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

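# Build with CUDA support (CMAKE_CUDA_ARCHITECTURES=87 targets Orin, sm_87).
# Newer llama.cpp releases may ignore the DMMV_X / MMV_Y tuning options;
# CMake only warns about variables the project does not use.
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="87" \
  -DGGML_CUDA_F16=ON \
  -DGGML_CUDA_DMMV_X=64 \
  -DGGML_CUDA_MMV_Y=2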
cmake --build build --config Release -j $(nproc)

CUDA architecture by Jetson module:

Orin (AGX Orin, Orin NX, Orin Nano): sm_87
Xavier (AGX Xavier, Xavier NX): sm_72
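If a build script has to run on more than one module, the architecture can be picked from the board's model string. This is a minimal sketch covering only the families listed above; cuda_arch_for is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Map a Jetson model string to the value for CMAKE_CUDA_ARCHITECTURES.
# cuda_arch_for is an illustrative helper covering only Orin and Xavier.
cuda_arch_for() {
  case "$1" in
    *Orin*)   echo 87 ;;
    *Xavier*) echo 72 ;;
    *)        echo "unknown module: $1" >&2; return 1 ;;
  esac
}

cuda_arch_for "NVIDIA Jetson AGX Orin Developer Kit"
# On a device, the model string is available from the device tree:
#   cuda_arch_for "$(tr -d '\0' < /proc/device-tree/model)"
```

The output can be passed straight to CMake, for example -DCMAKE_CUDA_ARCHITECTURES="$(cuda_arch_for "$model")".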

Running inference with llama.cpp

# Maximize clocks first
sudo jetson_clocks

# Download a Q4_K_M model
# Example: Llama 3.1 8B Instruct Q4_K_M from Hugging Face
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf

# Run inference, offloading all layers to the GPU:
#   -ngl 99  offload every layer (Llama 3.1 8B has 32; 99 simply means "all")
#   -c 4096  context length
#   -n 200   max tokens to generate
./build/bin/llama-cli \
  -m Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  -ngl 99 \
  -c 4096 \
  -n 200 \
  --temp 0.7 \
  -p "Explain MIPI CSI-2 in one paragraph:"

# Expected performance on Orin NX 16GB:
# Prompt eval: ~15 tokens/sec
# Generation: ~18 tokens/sec

Ollama setup (without jetson-containers)

# Install Ollama for aarch64
curl -fsSL https://ollama.com/install.sh | sh

# Start server (skip this if the installer already started Ollama as a systemd service)
ollama serve &

# Pull a model
ollama pull qwen2.5:7b

# Run
ollama run qwen2.5:7b

# API endpoint
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:7b",
  "prompt": "What is PREEMPT_RT?",
  "stream": false
}'
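With "stream": false, the response JSON carries eval_count (generated tokens) and eval_duration (nanoseconds), which is enough to compute generation speed. The sketch below extracts them with sed to avoid extra dependencies; gen_tps is a hypothetical helper, the sample response is made up, and a real JSON parser such as jq would be more robust.

```shell
#!/usr/bin/env bash
# Compute generation tokens/sec from an Ollama /api/generate response on stdin.
# gen_tps is an illustrative helper; it expects each key and value on one line.
gen_tps() {
  local json count dur
  json=$(cat)
  count=$(printf '%s\n' "$json" | sed -n 's/.*"eval_count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p')
  dur=$(printf '%s\n' "$json" | sed -n 's/.*"eval_duration"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p')
  if [ -z "$count" ] || [ -z "$dur" ]; then
    echo "eval_count/eval_duration not found" >&2
    return 1
  fi
  awk -v c="$count" -v d="$dur" 'BEGIN { printf "%.1f\n", c / (d / 1e9) }'
}

# Made-up sample: 200 tokens generated in 10 seconds.
echo '{"done":true,"prompt_eval_count":12,"prompt_eval_duration":400000000,"eval_count":200,"eval_duration":10000000000}' | gen_tps
```

On a device, pipe the curl call into the helper instead of the sample: curl -s http://localhost:11434/api/generate -d '{...}' | gen_tps.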

Performance benchmarks

Benchmarked on JetPack 6.2, jetson_clocks enabled, context=2048:

Model	Jetson SKU	Quantization	Prompt (tok/s)	Gen (tok/s)
Llama 3.1 8B	AGX Orin 64GB	Q4_K_M	48	52
Llama 3.1 8B	Orin NX 16GB	Q4_K_M	15	18
Llama 3.1 8B	Orin NX 8GB	Q4_K_M	10	11
Phi-3 mini 3.8B	Orin Nano 8GB	Q4_K_M	22	28
Qwen 2.5 7B	Orin NX 16GB	Q4_K_M	18	20
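Token generation is largely memory-bandwidth bound, which gives a quick sanity check for generation figures: every new token has to read the full set of weights once. The sketch below computes that ceiling; max_gen_tps is a hypothetical helper, the 102.4 GB/s figure is NVIDIA's published bandwidth for Orin NX 16GB, and the ~4.9GB model size is an assumption for a Q4_K_M 8B file.

```shell
#!/usr/bin/env bash
# Upper bound on generation speed: tokens/sec <= memory bandwidth (GB/s) / model size (GB).
# max_gen_tps is an illustrative helper; real throughput lands below this bound.
max_gen_tps() {
  awk -v bw="$1" -v size="$2" 'BEGIN { printf "%.1f\n", bw / size }'
}

# Orin NX 16GB (102.4 GB/s) with a ~4.9GB Q4_K_M 8B model:
max_gen_tps 102.4 4.9
```

A measured generation rate well above this ceiling usually points to a smaller model or quantization than assumed, or to a timing error, and is worth re-measuring.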

Memory management tips

# Check available memory before loading model
free -h
tegrastats | grep RAM

# Release GPU memory between runs. This kills every process holding an
# NVIDIA device node, which can include the desktop session.
sudo fuser -k /dev/nvidia*

# For concurrent LLM + CV pipeline:
# Reserve memory by setting llama.cpp max context shorter
# -c 1024 instead of 4096 saves ~0.4GB of FP16 KV cache on Llama 3.1 8B
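Before loading a model next to another workload, it can help to compare the expected footprint with what the kernel reports as available. This is a minimal sketch: fits_in_ram is a hypothetical helper, and the 1.5GB overhead in the example is an assumed allowance for KV cache and headroom.

```shell
#!/usr/bin/env bash
# Check whether a model plus overhead fits in currently available memory.
# fits_in_ram is an illustrative helper. Args: model GB, overhead GB, and
# available memory in kB as reported by /proc/meminfo (units of 1024 bytes).
fits_in_ram() {
  awk -v model="$1" -v overhead="$2" -v avail_kb="$3" \
    'BEGIN { exit !((model + overhead) * 1e9 / 1024 <= avail_kb) }'
}

# Example with a fixed value; on a device use:
#   avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
avail_kb=12000000
if fits_in_ram 4.9 1.5 "$avail_kb"; then
  echo "fits"
else
  echo "does not fit"
fi
```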

For GPU-accelerated TensorRT inference on Jetson (for vision models), see TensorRT vs DLA on Jetson Orin. For running containerized workloads including AI models, see Docker containers on Jetson Orin.

FAQ

What LLMs can run on Jetson Orin?

AGX Orin 64GB: Llama 3.1 70B at Q4. Orin NX 16GB: Llama 3.1 8B at Q4-Q6. Orin Nano 8GB: Phi-3 mini, Qwen 2.5 3B. The rule: model file size plus KV cache plus ~1GB of headroom must fit in available RAM.

What is the fastest way to run an LLM on Jetson Orin?

Use jetson-containers — pre-built Docker images with llama.cpp and Ollama for each JetPack version. Run ./run.sh $(./autotag ollama) to get a running Ollama instance.

How much RAM does an LLM use on Jetson Orin?

Approximately model file size plus KV cache. A Q4_K_M Llama 3.1 8B model uses ~6GB total with 4096-token context.

Does llama.cpp use the GPU on Jetson Orin?

Yes. Build with GGML_CUDA=ON and set -ngl 99 to offload all layers to the GPU using CUDA. On Jetson’s unified memory, GPU offload uses the same physical RAM but is 2–4x faster than CPU-only inference.