On-device AI without the cloud on MediaTek Genio

Andrés Campos

Running AI inference on Genio without cloud connectivity is the primary use case for the platform. Every Genio SoC above the 350 includes the MDLA (Multi-Dimension Learning Accelerator) NPU. MediaTek’s NeuroPilot stack exposes the NPU to TFLite through a delegate and to ONNX Runtime through an execution provider, so standard model files work without rewriting inference code. This post covers the full path from a trained model to running inference on the NPU.

Key Insights

  • NEURON_FLAG_USE_FP16 = "1" is mandatory when running FP32 models on the NPU: the MDLA has no FP32 path, and without this flag such models silently fall back to CPU
  • Two NPU paths: TFLite with Neuron Stable Delegate, and ONNX Runtime with NeuronExecutionProvider — both work; TFLite is more mature
  • Genio 350 has no NPU — it runs TFLite and ONNX on CPU/GPU only
  • ONNX NeuronExecutionProvider is default only on Genio 520/720 — other platforms need ENABLE_NEURON_EP = "1" in the Yocto build
  • CPU fallback is automatic — ops unsupported by the NPU fall back to CPU within the same inference session; you don’t need to handle it manually

NPU support matrix

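| Platform | TFLite (CPU/GPU) | TFLite Neuron Delegate | ONNX CPU | ONNX NeuronEP (NPU) |
|---|---|---|---|---|
| Genio 350 | ✅ | ❌ | ✅ | ❌ |
| Genio 510 / 700 | ✅ | ✅ | ✅ | Opt-in¹ |
| Genio 520 / 720 | ✅ | ✅ | ✅ | Default |
| Genio 1200 | ✅ | ✅ | ✅ | Opt-in¹ |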

¹ Requires ENABLE_NEURON_EP = "1" in local.conf, then rebuild image.
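
The matrix above can be encoded as a small lookup so application code builds its ONNX Runtime provider list per board. This is a minimal sketch, not a NeuroPilot API: the `SUPPORT` table, the `providers_for` helper, and the platform key strings are hypothetical names, and `neuron_ep_built` stands for whether the image was built with ENABLE_NEURON_EP = "1".

```python
# Hypothetical lookup mirroring the support matrix (illustrative names only).
SUPPORT = {
    "genio-350":  {"npu": False, "neuron_ep": "none"},
    "genio-510":  {"npu": True,  "neuron_ep": "opt-in"},
    "genio-700":  {"npu": True,  "neuron_ep": "opt-in"},
    "genio-520":  {"npu": True,  "neuron_ep": "default"},
    "genio-720":  {"npu": True,  "neuron_ep": "default"},
    "genio-1200": {"npu": True,  "neuron_ep": "opt-in"},
}


def providers_for(platform, neuron_ep_built=False):
    """Return an ONNX Runtime provider list, most preferred first."""
    mode = SUPPORT[platform]["neuron_ep"]
    has_neuron_ep = mode == "default" or (mode == "opt-in" and neuron_ep_built)

    providers = ["CPUExecutionProvider"]  # always present as the final fallback
    if has_neuron_ep:
        providers.insert(
            0, ("NeuronExecutionProvider", {"NEURON_FLAG_USE_FP16": "1"})
        )
    return providers


print(providers_for("genio-350"))
print(providers_for("genio-1200", neuron_ep_built=True))
```

The returned list can be passed directly as the `providers` argument of `ort.InferenceSession`.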

TFLite inference with Neuron Stable Delegate

The Neuron Stable Delegate is the TFLite execution delegate that routes supported ops to the NPU.

import tflite_runtime.interpreter as tflite
import numpy as np

# Load model with Neuron Stable Delegate for NPU acceleration
interpreter = tflite.Interpreter(
    model_path="mobilenet_v2.tflite",
    experimental_delegates=[
        tflite.load_delegate("libNeuronStableDelegate.so")
    ]
)
interpreter.allocate_tensors()

# Get input/output details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Prepare input: `image` is assumed to be a 224x224x3 RGB array loaded elsewhere
input_data = np.expand_dims(image, axis=0).astype(np.float32)
input_data = (input_data / 255.0 - 0.5) / 0.5  # Scale [0, 255] to [-1, 1]

# Run inference
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()

# Get output
output = interpreter.get_tensor(output_details[0]['index'])
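
Op-level fallback is automatic, but a missing delegate library is not: `load_delegate` raises if the shared object cannot be loaded, for example on an image without the NPU stack. The sketch below guards the load and falls back to a CPU-only interpreter. `make_interpreter` is a hypothetical helper; the interpreter class and loader are passed in as arguments so the logic can be exercised on a machine without `tflite_runtime`.

```python
def make_interpreter(interpreter_cls, load_delegate, model_path,
                     delegate_path="libNeuronStableDelegate.so"):
    """Try the NPU delegate first; fall back to a plain CPU interpreter.

    Returns a tuple (interpreter, used_npu_delegate).
    """
    try:
        delegate = load_delegate(delegate_path)
    except (ValueError, OSError, RuntimeError) as exc:
        # Delegate library missing or unusable: run on CPU instead.
        print(f"Neuron delegate unavailable ({exc}); using CPU")
        return interpreter_cls(model_path=model_path), False

    interpreter = interpreter_cls(
        model_path=model_path,
        experimental_delegates=[delegate],
    )
    return interpreter, True
```

On a device this would be called as `make_interpreter(tflite.Interpreter, tflite.load_delegate, "mobilenet_v2.tflite")`, followed by `allocate_tensors()` as in the example above.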

C++ (production)

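#include <memory>
#include "tensorflow/lite/delegates/external/external_delegate.h"
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

// Load the flatbuffer model (the path is an example)
auto model = tflite::FlatBufferModel::BuildFromFile("mobilenet_v2.tflite");

// Build external delegate options for Neuron
TfLiteExternalDelegateOptions opts =
    TfLiteExternalDelegateOptionsDefault("libNeuronStableDelegate.so");
TfLiteDelegate* delegate = TfLiteExternalDelegateCreate(&opts);

// Build interpreter and add delegate
std::unique_ptr<tflite::Interpreter> interpreter;
tflite::ops::builtin::BuiltinOpResolver resolver;
tflite::InterpreterBuilder builder(*model, resolver);
builder.AddDelegate(delegate);
if (builder(&interpreter) != kTfLiteOk ||
    interpreter->AllocateTensors() != kTfLiteOk) {
  // Handle build or allocation failure here
}

// Once the interpreter has been destroyed, release the delegate:
// TfLiteExternalDelegateDelete(delegate);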

ONNX Runtime with NeuronExecutionProvider

Python setup

import onnxruntime as ort
import numpy as np

# Configure NeuronExecutionProvider options
neuron_opts = {
    "NEURON_FLAG_USE_FP16": "1",        # MANDATORY for FP32 models: the MDLA has no FP32 path
    "NEURON_FLAG_MIN_GROUP_SIZE": "1",  # Minimum ops per NPU subgraph
}

providers = [
    ("NeuronExecutionProvider", neuron_opts),
    "XnnpackExecutionProvider",   # Fallback: XNNPACK on CPU
    "CPUExecutionProvider",       # Final fallback
]

session = ort.InferenceSession("model.onnx", providers=providers)

# Run inference (input_data: an array matching the model's input shape and dtype)
input_name = session.get_inputs()[0].name
output = session.run(None, {input_name: input_data})
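
To make the FP16 flag concrete: executing FP32 weights as FP16 keeps roughly three significant decimal digits and caps magnitudes at 65504. The sketch below round-trips values through IEEE 754 half precision using only Python's `struct` module. `to_fp16` is a hypothetical helper for illustration; it says nothing about how the Neuron runtime performs the conversion internally.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


print(to_fp16(1.0))   # 1.0 (exactly representable)
print(to_fp16(0.1))   # 0.0999755859375 (nearest half-precision value)

# Values beyond the FP16 range cannot be packed at all.
try:
    to_fp16(70000.0)
except OverflowError as exc:
    print("out of FP16 range:", exc)
```

Models with very large activations or weights are the ones most likely to behave differently once this flag is set, so it is worth comparing NPU outputs against a CPU reference on a few samples.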

Verify NPU is being used

# Check which providers are active
print(session.get_providers())
# ['NeuronExecutionProvider', 'XnnpackExecutionProvider', 'CPUExecutionProvider']

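# Inspect the options each registered provider was configured with
active = session.get_provider_options()
print(active)

If NeuronExecutionProvider is in the list returned by get_providers(), the provider was registered for this session and can take the ops it supports. That alone does not show how many ops it took; enabling verbose logging (SessionOptions.log_severity_level = 0) should print the node placement. Ops that NeuronEP doesn’t support automatically fall through to the next provider.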
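
Depending on the ONNX Runtime version, requesting a provider that the installed build does not include is either rejected or skipped with a warning. A minimal sketch of filtering the preferred list up front: `pick_providers` and `neuron_registered` are hypothetical helpers, and on a device the `available` argument would come from `ort.get_available_providers()`.

```python
def pick_providers(preferred, available):
    """Keep the preferred providers that are installed, preserving order.

    Entries may be plain names or (name, options) tuples, as accepted by
    ort.InferenceSession(providers=...).
    """
    chosen = []
    for entry in preferred:
        name = entry[0] if isinstance(entry, tuple) else entry
        if name in available:
            chosen.append(entry)
    return chosen


def neuron_registered(session_providers):
    """True if the Neuron provider is registered in a session's provider list."""
    return "NeuronExecutionProvider" in session_providers


preferred = [
    ("NeuronExecutionProvider", {"NEURON_FLAG_USE_FP16": "1"}),
    "XnnpackExecutionProvider",
    "CPUExecutionProvider",
]
# Example: a build without the Neuron provider (e.g. ENABLE_NEURON_EP not set)
print(pick_providers(preferred, ["XnnpackExecutionProvider", "CPUExecutionProvider"]))
```

Logging the result of `neuron_registered(session.get_providers())` at startup makes an unintended CPU-only deployment visible early.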

Enabling NeuronEP on Genio 510/700/1200 (Yocto)

# conf/local.conf
ENABLE_NEURON_EP = "1"

Rebuild the image after setting this flag. The flag adds the NeuronExecutionProvider shared library to the ONNX Runtime installation.

Model requirements for NPU acceleration

The NPU handles most standard CNN ops natively. Limitations to be aware of:

| Op category | NPU support |
|---|---|
| Conv2D, DepthwiseConv2D | ✅ Full |
| BatchNorm, ReLU, ReLU6 | ✅ Full |
| Add, Mul, Concat | ✅ Full |
| LSTM, GRU | ⚠️ Partial (some variants) |
| Transformer attention | ⚠️ Partial |
| Custom ops | ❌ CPU fallback |
| FP32 weights | ⚠️ Requires NEURON_FLAG_USE_FP16=1 to run as FP16 |
| INT8 quantized | ✅ Best performance path |

INT8 quantized models deliver the best NPU performance and the smallest memory footprint. Use post-training quantization in TFLite or ONNX Runtime’s quantization tools before deployment.

Post-training quantization (TFLite)

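import numpy as np
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Representative dataset for calibration.
# `calibration_samples` is assumed to be an iterable of preprocessed input
# arrays with the model's input shape (including the batch dimension).
def representative_data_gen():
    for sample in calibration_samples:
        yield [sample.astype(np.float32)]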

converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
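
Because the converter above sets `inference_input_type` to `tf.int8`, the caller has to quantize inputs before `set_tensor`. TFLite's scheme is real = (q - zero_point) * scale, and the interpreter reports the pair as `input_details[0]['quantization']`. The plain-Python sketch below shows the arithmetic; `quantize` and `dequantize` are hypothetical helper names, and on a device a vectorized NumPy version would be the usual choice.

```python
def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map real values to int8 using q = round(x / scale) + zero_point."""
    out = []
    for v in values:
        q = round(v / scale) + zero_point
        out.append(max(qmin, min(qmax, q)))  # clamp to the int8 range
    return out


def dequantize(qvalues, scale, zero_point):
    """Map int8 values back to reals using x = (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in qvalues]


# Example parameters; real ones come from the model's tensor details.
scale, zero_point = 1 / 128, 0
q = quantize([0.0, 0.5, -1.0, 1.0], scale, zero_point)
print(q)                                 # 1.0 saturates at the int8 maximum
print(dequantize(q, scale, zero_point))
```

The same relation applies in reverse to int8 outputs, using `output_details[0]['quantization']`.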

Practical deployment patterns

Pattern 1: Continuous camera inference

import cv2
import numpy as np
import tflite_runtime.interpreter as tflite

# Load model once at startup
interpreter = tflite.Interpreter(
    model_path="detect.tflite",
    experimental_delegates=[tflite.load_delegate("libNeuronStableDelegate.so")]
)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Open camera
cap = cv2.VideoCapture(0)

while True:
    ret, frame = cap.read()
    if not ret:
        break

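    # Preprocess: OpenCV delivers BGR frames; this assumes the model expects RGB
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    resized = cv2.resize(rgb, (300, 300))
    input_data = np.expand_dims(resized, axis=0).astype(np.uint8)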

    # Infer
    interpreter.set_tensor(input_details[0]['index'], input_data)
    interpreter.invoke()

    boxes = interpreter.get_tensor(output_details[0]['index'])
    scores = interpreter.get_tensor(output_details[2]['index'])
    # ... process results
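
The `# ... process results` step typically thresholds the scores and scales the boxes to pixels. This sketch assumes the common TFLite SSD post-processing layout, where each box is `[ymin, xmin, ymax, xmax]` normalized to 0..1; check your model's output signature before relying on it. `filter_detections` and the 0.5 threshold are illustrative choices, not part of any MediaTek API.

```python
def filter_detections(boxes, scores, frame_w, frame_h, threshold=0.5):
    """Return (x1, y1, x2, y2, score) tuples in pixel coordinates.

    boxes:  sequence of [ymin, xmin, ymax, xmax], normalized to 0..1
    scores: sequence of confidences, same length as boxes
    """
    results = []
    for (ymin, xmin, ymax, xmax), score in zip(boxes, scores):
        if score < threshold:
            continue  # drop low-confidence detections
        results.append((
            int(xmin * frame_w), int(ymin * frame_h),
            int(xmax * frame_w), int(ymax * frame_h),
            float(score),
        ))
    return results


# Example with one confident and one weak detection on a 640x480 frame
demo_boxes = [[0.25, 0.5, 0.75, 1.0], [0.0, 0.0, 1.0, 1.0]]
demo_scores = [0.9, 0.3]
print(filter_detections(demo_boxes, demo_scores, 640, 480))
```

Inside the loop this would be called with the batch dimension stripped, for example `filter_detections(boxes[0], scores[0], frame.shape[1], frame.shape[0])`.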

Pattern 2: GStreamer + NNStreamer pipeline

NNStreamer integrates TFLite inference directly into GStreamer pipelines for zero-copy frame processing:

gst-launch-1.0 \
  v4l2src device=/dev/video0 ! \
  video/x-raw,format=RGB,width=640,height=480 ! \
  videoconvert ! \
  tensor_converter ! \
  tensor_filter \
    framework=tflite \
    model=mobilenet_v2_int8.tflite \
    accelerator=true:npu ! \
  tensor_decoder mode=image_labeling \
    option1=labels.txt ! \
  overlay ! waylandsink

NNStreamer is included in packagegroup-rity-ai-ml in the RITY Yocto image.

Performance benchmarks on Genio 720

Approximate inference times on Genio 720 EVK with INT8 quantized models:

| Model | CPU (ms) | NPU (ms) | Speedup |
|---|---|---|---|
| MobileNetV2 | 45 | 8 | 5.6× |
| EfficientDet-Lite0 | 120 | 22 | 5.5× |
| YOLOv5s INT8 | 210 | 38 | 5.5× |
| BERT-base (text) | 850 | 180 | 4.7× |

NPU inference is 4–6× faster than CPU for typical vision models. Combined with the reduction in CPU load (CPU is free for other work during NPU inference), the real-world benefit is larger than the raw latency numbers suggest.
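
Latency converts to single-stream throughput as 1000 / latency_ms, which is often the more useful number for camera pipelines. The snippet below only redoes that arithmetic on the approximate figures from the table; it is not a benchmark, and `summarize` is a hypothetical helper.

```python
# (cpu_ms, npu_ms) copied from the benchmark table above
BENCH_MS = {
    "MobileNetV2": (45, 8),
    "EfficientDet-Lite0": (120, 22),
    "YOLOv5s INT8": (210, 38),
    "BERT-base (text)": (850, 180),
}


def summarize(cpu_ms, npu_ms):
    """Derive speedup and inferences per second from two latencies."""
    return {
        "speedup": round(cpu_ms / npu_ms, 1),
        "cpu_fps": round(1000 / cpu_ms, 1),
        "npu_fps": round(1000 / npu_ms, 1),
    }


for model, (cpu_ms, npu_ms) in BENCH_MS.items():
    print(model, summarize(cpu_ms, npu_ms))
```

By this arithmetic an 8 ms model sustains about 125 inferences per second on the NPU, well above a 30 fps camera feed, whereas 45 ms on CPU is about 22 per second and would drop frames.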

For the full NeuroPilot stack including Yocto packagegroups and NDA tier features, see What is RITY? MediaTek’s Genio reference distribution explained. For a complete computer vision pipeline from camera to inference to display, see MediaTek Genio for computer vision.

Frequently Asked Questions

What AI frameworks run on MediaTek Genio NPU?

TensorFlow Lite (LiteRT) via the Neuron Stable Delegate and ONNX Runtime via NeuronExecutionProvider both run on the Genio NPU (MDLA). TFLite is the more mature path. ONNX Runtime with NeuronExecutionProvider is available out of the box on Genio 520 and 720; other platforms require opt-in via ENABLE_NEURON_EP=1.

Do I need cloud connectivity for AI inference on Genio?

No. The NeuroPilot NPU, TFLite, and ONNX Runtime all run entirely on-device. Models run locally with no network required after the model is loaded onto the device. This makes Genio suitable for air-gapped deployments, privacy-sensitive applications, and use cases where network latency is unacceptable.

What is the NEURON_FLAG_USE_FP16 flag and why is it mandatory?

NEURON_FLAG_USE_FP16=1 tells the NeuroPilot runtime to execute FP32 model weights as FP16 on the NPU hardware. The Genio MDLA does not natively support FP32 inference — without this flag, FP32 models fail to run on the NPU and fall back to CPU. Always set this flag when using NeuronExecutionProvider or the Neuron Stable Delegate.

Which Genio platform has the strongest NPU for on-device AI?

Genio 1200 (MT8395) has the highest NPU TOPS. For the single-channel DDR platforms, Genio 720 (MT8391) has the best NPU-to-cost ratio with ONNX NeuronExecutionProvider enabled by default. Genio 350 has no NPU — it is CPU/GPU inference only.

Written by

Andrés Campos

Co-Founder & CTO · ProventusNova

8 years deep in embedded systems, from underwater ROVs to edge AI. Andrés leads every technical delivery personally.