ONNX Runtime on the MediaTek Genio NPU (520 and 720)
If you are deploying an ONNX Runtime model to a MediaTek Genio and you need it accelerated on the NPU rather than the CPU, the platform choice is narrower than the spec sheets suggest: only the Genio 520 and Genio 720 support the ONNX Runtime NeuronExecutionProvider out of the box. This post covers which chips can do it, how to enable it, the one flag that trips up every first deployment, and the offline compilation path for production.
Which Genio chips support ONNX Runtime on the NPU?
MediaTek’s AI stack on Genio Yocto runs two open runtimes, TFLite (LiteRT) and ONNX Runtime, both accelerated on the APUSys MDLA (Multi-Dimension Learning Accelerator) through the NeuronSDK. The catch is that NPU coverage is not uniform across the family. Here is the actual support matrix:
| Platform | TFLite CPU/GPU | TFLite Neuron Delegate (NPU) | ONNX CPU | ONNX NeuronEP (NPU) |
|---|---|---|---|---|
| Genio 350 | Yes | No | Yes | No |
| Genio 510 / 700 | Yes | Yes | Yes | Opt-in¹ |
| Genio 1200 | Yes | Yes | Yes | Opt-in² |
| Genio 520 / 720 | Yes | Yes | Yes | Yes (default) |
¹ Genio 510/700: requires ENABLE_NEURON_EP = "1" in local.conf, then rebuild the image.
² Genio 1200: same opt-in, but not all ONNX operators are supported on the NPU, which can limit performance.
The bottom line: if ONNX Runtime NPU acceleration is a hard requirement, the Genio 520 and 720 are the only platforms that give it to you by default. Everything else is either an opt-in rebuild or, on the Genio 350, not available at all. This is the kind of constraint that does not appear when you compare TOPS numbers, but it determines whether your existing pipeline ports without a re-architecture. For the broader platform tradeoffs, see MediaTek Genio vs NVIDIA Jetson Orin.
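If one codebase targets several Genio boards, the matrix above can be kept as data so a deployment script fails early on the wrong board. This is a minimal sketch: `ONNX_NPU_SUPPORT`, `onnx_npu_status`, and `needs_image_rebuild` are hypothetical names, and the values only restate the table.

```python
# Hypothetical lookup that restates the support matrix above as data.
ONNX_NPU_SUPPORT = {
    "genio-350": "none",      # no ONNX NeuronEP path
    "genio-510": "opt-in",    # ENABLE_NEURON_EP = "1", then rebuild
    "genio-700": "opt-in",
    "genio-1200": "opt-in",   # opt-in, with partial operator coverage
    "genio-520": "default",
    "genio-720": "default",
}


def onnx_npu_status(platform: str) -> str:
    """Return 'none', 'opt-in', or 'default' for a platform name."""
    key = platform.strip().lower().replace(" ", "-")
    try:
        return ONNX_NPU_SUPPORT[key]
    except KeyError:
        raise ValueError(f"unknown Genio platform: {platform!r}") from None


def needs_image_rebuild(platform: str) -> bool:
    """True when the NeuronEP has to be enabled at image build time."""
    return onnx_npu_status(platform) == "opt-in"


print(onnx_npu_status("Genio 720"), needs_image_rebuild("Genio 510"))
```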
How do you set up the NeuronExecutionProvider?
On the Genio 520 and 720 the provider is already in the image. You select it in the provider list when you create the inference session, highest priority first, with CPU fallbacks behind it:
import onnxruntime as ort
neuron_opts = {
    "NEURON_FLAG_USE_FP16": "1",        # MANDATORY: the NPU does not execute FP32
    "NEURON_FLAG_MIN_GROUP_SIZE": "1",  # minimum nodes per NPU subgraph
}
providers = [
    ("NeuronExecutionProvider", neuron_opts),
    "XnnpackExecutionProvider",
    "CPUExecutionProvider",
]
session = ort.InferenceSession("model.onnx", providers=providers)
The C and C++ APIs mirror this: SessionOptionsAppendExecutionProvider in the C API (exposed as Ort::SessionOptions::AppendExecutionProvider in the C++ wrapper) takes the "NeuronExecutionProvider" name and the same key/value option pairs.
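The same application may also run on an image where the provider is absent, so it can help to filter the preferred list against what the runtime actually offers. The sketch below assumes nothing beyond the provider names and options used above; `select_providers` and `provider_name` are hypothetical helpers. The ONNX Runtime calls appear only in comments, so the block runs without the library installed.

```python
from typing import Dict, List, Sequence, Tuple, Union

Provider = Union[str, Tuple[str, Dict[str, str]]]

NEURON_OPTS = {
    "NEURON_FLAG_USE_FP16": "1",        # required for FP32 models on the NPU
    "NEURON_FLAG_MIN_GROUP_SIZE": "1",
}

# Highest priority first, CPU fallbacks behind it.
PREFERRED: List[Provider] = [
    ("NeuronExecutionProvider", NEURON_OPTS),
    "XnnpackExecutionProvider",
    "CPUExecutionProvider",
]


def provider_name(provider: Provider) -> str:
    """Return the provider name from either a bare name or a (name, options) pair."""
    return provider if isinstance(provider, str) else provider[0]


def select_providers(available: Sequence[str],
                     preferred: Sequence[Provider] = PREFERRED) -> List[Provider]:
    """Keep the preferred order, dropping providers this build does not offer."""
    return [p for p in preferred if provider_name(p) in available]


# On the device (requires onnxruntime):
#   import onnxruntime as ort
#   providers = select_providers(ort.get_available_providers())
#   session = ort.InferenceSession("model.onnx", providers=providers)
#   print(session.get_providers())  # providers registered for this session

print([provider_name(p) for p in select_providers(["CPUExecutionProvider"])])
```

Note that `session.get_providers()` lists the providers registered for the session. It does not report which nodes each provider actually received.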
The one flag that breaks the first deployment
NEURON_FLAG_USE_FP16 = "1" is mandatory. The MDLA NPU does not execute FP32, so without the flag an FP32 model will not run on the NPU. Set it, and FP32 layers execute as FP16. The NPU accepts two model formats: FP32 executed as FP16 (with the flag) and QDQ INT8. Omitting this one option is the most common reason a model “works” on first bring-up but quietly runs on the CPU.
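Running FP32 layers as FP16 changes the numbers slightly, so it is worth seeing how much resolution half precision has before comparing NPU output against a CPU reference. The sketch below rounds values to IEEE 754 half precision using only the standard library. `to_fp16` is a hypothetical helper. How the MDLA itself rounds or saturates is not stated here, so returning infinity on overflow is this sketch's own choice.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # Larger than the biggest finite half (65504): saturate to infinity.
        return float("inf") if x > 0 else float("-inf")


for value in (0.1, 1000.3, 65504.0, 70000.0):
    half = to_fp16(value)
    print(f"{value!r:>10} -> {half!r:<20} abs error {abs(half - value)!r}")
```

Activations or weights with large magnitudes are the usual place to look first if FP16 results drift from the FP32 baseline.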
Enabling the NPU path on Genio 510, 700, and 1200
These platforms have the hardware but ship with the provider off. Turn it on at build time:
# local.conf
ENABLE_NEURON_EP = "1"
Then rebuild the image (rity-demo-image). On the Genio 1200, expect some operators to fall back, since not all ONNX ops are mapped to its NPU path.
Online or offline: which inference path should you use?
There are two ways to get a model onto the NPU, and the right one depends on whether the model is fixed:
- Online (runtime delegation). ONNX Runtime via the NeuronExecutionProvider, or TFLite via the Neuron Stable Delegate, compiles NPU subgraphs as the application loads the model. Simplest to integrate; pays a compile cost at load time.
- Offline (ahead-of-time compilation). Compile the model on a host PC into a hardware-specific .dla binary, then load it on the device. Removes the runtime compile step and is the production path for a model that does not change.
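To decide whether the online compile cost matters for your product, measure it. The sketch below is a small wall-clock timer; `timed_ms` is a hypothetical helper, and the ONNX Runtime call is shown only in comments so the block runs anywhere.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed_ms(fn: Callable[[], T]) -> Tuple[T, float]:
    """Call fn() once and return (result, elapsed wall-clock milliseconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


# On the device (requires onnxruntime and a providers list):
#   import onnxruntime as ort
#   session, load_ms = timed_ms(
#       lambda: ort.InferenceSession("model.onnx", providers=providers))
#   print(f"session load, including online NPU compile: {load_ms:.1f} ms")

# Stand-in workload so the sketch runs without the runtime installed.
value, ms = timed_ms(lambda: sum(range(100_000)))
print(value, f"{ms:.3f} ms")
```

If the measured load time is acceptable at application start, the online path may be enough. If not, that is an argument for compiling offline.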
For the offline path, the host-side compiler is ncc-tflite:
# SDK from neuropilot.mediatek.com (All-In-One Bundle)
export LD_LIBRARY_PATH=/path/to/neuropilot-sdk-basic-<version>/neuron_sdk/host/lib
# INT8 quantized model
./ncc-tflite --arch=mdla3.0 model_int8.tflite -o model_int8.dla
# FP32 model, executed as FP16 on the NPU
./ncc-tflite --arch=mdla3.0 --relax-fp32 model.tflite -o model_fp32.dla
All of the NPU-capable Genio parts (510, 700, 520, 720, 1200) target --arch=mdla3.0. The --relax-fp32 flag is the offline equivalent of the runtime FP16 flag.
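In a build script it is convenient to assemble the compiler invocation in one place. The sketch below uses only the flags shown above (`--arch`, `--relax-fp32`, `-o`). `ncc_tflite_command` is a hypothetical helper, and naming the output after the input with a `.dla` suffix is this sketch's convention, not something the SDK requires.

```python
from pathlib import Path
from typing import List


def ncc_tflite_command(model: str,
                       arch: str = "mdla3.0",
                       relax_fp32: bool = False,
                       compiler: str = "./ncc-tflite") -> List[str]:
    """Build the ncc-tflite argument list for one model."""
    output = str(Path(model).with_suffix(".dla"))
    cmd = [compiler, f"--arch={arch}"]
    if relax_fp32:
        # FP32 model, executed as FP16 on the NPU.
        cmd.append("--relax-fp32")
    cmd += [model, "-o", output]
    return cmd


# To run on the host, with LD_LIBRARY_PATH set as shown above:
#   import subprocess
#   subprocess.run(ncc_tflite_command("model.tflite", relax_fp32=True), check=True)

print(" ".join(ncc_tflite_command("model_int8.tflite")))
print(" ".join(ncc_tflite_command("model.tflite", relax_fp32=True)))
```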
How fast is the Genio NPU in practice?
On the device, neuronrt runs a compiled .dla and reports timing. A YOLOv5s INT8 detector gives a useful reference point:
# YOLOv5s input: 640x640x3 = 1,228,800 bytes
neuronrt -m hw -a model_int8.dla -i input.bin -c 10
# Total inference time = 52.648 ms (5.2648 ms/inf), Avg. FPS: 186.1
About 5.26 ms per inference, roughly 186 FPS for the model compute alone. Treat that as a ceiling: once you add camera capture, pre-processing, and non-maximum suppression in a real pipeline, end-to-end throughput drops. But it shows the MDLA has real headroom for a single detector at video frame rates. Note that neuronrt is for verification and benchmarking; production applications should drive inference through the Neuron Runtime API for control over tensors and scheduling.
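To turn that benchmark into a budget, the sketch below parses the summary line in the format shown above, checks the input buffer size, and computes how much of each frame period is left for everything else. The helper names are hypothetical, and the 30 FPS camera rate is an assumption for illustration only.

```python
import re

# Matches the neuronrt summary line in the format quoted above.
SUMMARY_RE = re.compile(
    r"Total inference time = (?P<total>[\d.]+) ms "
    r"\((?P<per_inf>[\d.]+) ms/inf\), Avg\. FPS: (?P<fps>[\d.]+)"
)


def parse_neuronrt_summary(line: str) -> dict:
    """Extract total ms, ms per inference, and average FPS as floats."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError("no neuronrt timing summary found")
    return {key: float(val) for key, val in match.groupdict().items()}


def input_nbytes(width: int, height: int, channels: int,
                 bytes_per_element: int = 1) -> int:
    """Size of a raw input tensor, e.g. for generating input.bin."""
    return width * height * channels * bytes_per_element


def frame_headroom_ms(per_inf_ms: float, camera_fps: float) -> float:
    """Time left in each frame period after model compute."""
    return 1000.0 / camera_fps - per_inf_ms


line = "Total inference time = 52.648 ms (5.2648 ms/inf), Avg. FPS: 186.1"
stats = parse_neuronrt_summary(line)
print("input bytes:", input_nbytes(640, 640, 3))
print("headroom at an assumed 30 FPS camera:",
      round(frame_headroom_ms(stats["per_inf"], 30), 2), "ms")
```

The remaining time per frame is what capture, pre-processing, and non-maximum suppression have to fit into.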
Putting inference in a GStreamer pipeline
For camera-to-inference products, NNStreamer integrates the NPU into a GStreamer pipeline through the tensor_filter element, so capture, inference, and downstream video all run in one pipeline:
gst-launch-1.0 v4l2src ! videoconvert ! \
  video/x-raw,format=RGB,width=640,height=640 ! \
  tensor_converter ! \
  tensor_filter framework=tensorflow-lite model=yolov5s_int8.tflite \
    custom=Delegate:NNAPI,NNAPIOptions:NEURON_FLAG_USE_FP16=1 ! \
  tensor_sink
The same NEURON_FLAG_USE_FP16 requirement carries through to the delegate options here.
The short version
If your edge AI product is built on ONNX Runtime and needs NPU acceleration on MediaTek Genio, design around the Genio 520 or 720, set NEURON_FLAG_USE_FP16, and decide early between the online and offline paths based on whether your model is fixed. Picking a higher-tier Genio for the headline compute without checking the runtime matrix is how teams end up running ONNX on the CPU and wondering where the NPU went.
We bring up the MediaTek Genio AI stack, NeuroPilot, and camera-to-inference pipelines on a fixed-bid, fixed-timeline basis. If you are evaluating Genio for an AI product, see our MediaTek Genio support services.
Frequently Asked Questions
Which MediaTek Genio chips support ONNX Runtime on the NPU?
Out of the box, only the Genio 520 and Genio 720. They ship with the ONNX Runtime NeuronExecutionProvider enabled by default, targeting the APUSys MDLA accelerator. The Genio 510, 700, and 1200 can enable it as a build-time opt-in (ENABLE_NEURON_EP = "1" in local.conf, then rebuild the image), though the Genio 1200 does not support all ONNX operators on the NPU. The Genio 350 has no Neuron NPU path at all. If your stack depends on ONNX Runtime NPU acceleration, the 520 and 720 are the safe choices.
Why does my FP32 ONNX model fail to run on the Genio NPU?
The MediaTek MDLA NPU does not execute FP32. You must pass NEURON_FLAG_USE_FP16 = "1" in the NeuronExecutionProvider options, which causes FP32 layers to execute as FP16 on the NPU. Without that flag, an FP32 model will not run on the NPU and will silently fall back to the CPU or fail. The other supported format is QDQ INT8, which runs natively without the flag.
What is the difference between the online and offline NPU paths on Genio?
The online path delegates at runtime: ONNX Runtime via the NeuronExecutionProvider, or TFLite via the Neuron Stable Delegate, compiles subgraphs to the NPU as the application loads the model. The offline path compiles the model ahead of time on a host PC with ncc-tflite into a hardware-specific .dla binary, which is then loaded on the device through neuronrt or the Neuron Runtime API. Offline compilation removes runtime compile latency and is the production path for fixed models.
How fast is the Genio 720 NPU for object detection?
On a YOLOv5s INT8 model compiled to MDLA 3.0, the on-device neuronrt benchmark reports roughly 5.26 ms per inference, about 186 FPS, for the model compute alone. Real pipeline throughput is lower once camera capture, pre-processing, and NMS are included, but it shows the NPU has substantial headroom for a single detector at real-time frame rates.
Can I run ONNX Runtime on the NPU on the Genio 1200?
Yes, but as an opt-in. You set ENABLE_NEURON_EP = "1" in local.conf and rebuild the image, the same as the Genio 510 and 700. The caveat on the 1200 is that not all ONNX operators are supported on its NPU path, so some models will partially fall back and lose performance. The Genio 520 and 720 are the only platforms where the NeuronExecutionProvider is enabled and fully supported by default.
Written by
Aarón Angulo, Co-Founder & CEO · ProventusNova
Obsessed with client outcomes. Aarón ensures every engagement delivers real results, on time, on scope, no exceptions.