On-device AI without the cloud on MediaTek Genio

Running AI inference on Genio without cloud connectivity is the primary use case for the platform. Every Genio SoC above the 350 includes the MDLA (Multi-Dimension Learning Accelerator) NPU. MediaTek’s NeuroPilot stack integrates the NPU into TFLite and ONNX Runtime through execution providers, so standard model files work without rewriting inference code. This post covers the full path from a trained model to running inference on the NPU.

Key Insights

NEURON_FLAG_USE_FP16 = "1" is mandatory for NPU inference — the MDLA does not support FP32; without this flag models fall back to CPU silently
Two NPU paths: TFLite with Neuron Stable Delegate, and ONNX Runtime with NeuronExecutionProvider — both work; TFLite is more mature
Genio 350 has no NPU — it runs TFLite and ONNX on CPU/GPU only
ONNX NeuronExecutionProvider is default only on Genio 520/720 — other platforms need ENABLE_NEURON_EP = "1" in the Yocto build
CPU fallback is automatic — ops unsupported by the NPU fall back to CPU within the same inference session; you don’t need to handle it manually

NPU support matrix

Platform	TFLite (CPU/GPU)	TFLite Neuron Delegate	ONNX CPU	ONNX NeuronEP (NPU)
Genio 350	✅	❌	✅	❌
Genio 510 / 700	✅	✅	✅	Opt-in¹
Genio 520 / 720	✅	✅	✅	Default
Genio 1200	✅	✅	✅	Opt-in¹

¹ Requires ENABLE_NEURON_EP = "1" in local.conf, then rebuild image.
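As a quick reference, the matrix above can be transcribed into a small lookup table that a deployment script consults before choosing an inference path. This is only a sketch: the dictionary layout and the helper names `has_npu` and `onnx_npu_needs_rebuild` are hypothetical, and the values simply restate the table.

```python
# Transcription of the support matrix above; keys are Genio model numbers.
# "onnx_neuron_ep" is one of: "none", "opt-in", "default".
NPU_SUPPORT = {
    350:  {"tflite_neuron": False, "onnx_neuron_ep": "none"},
    510:  {"tflite_neuron": True,  "onnx_neuron_ep": "opt-in"},
    700:  {"tflite_neuron": True,  "onnx_neuron_ep": "opt-in"},
    520:  {"tflite_neuron": True,  "onnx_neuron_ep": "default"},
    720:  {"tflite_neuron": True,  "onnx_neuron_ep": "default"},
    1200: {"tflite_neuron": True,  "onnx_neuron_ep": "opt-in"},
}


def has_npu(model: int) -> bool:
    """True if the platform can run TFLite through the Neuron delegate."""
    return NPU_SUPPORT[model]["tflite_neuron"]


def onnx_npu_needs_rebuild(model: int) -> bool:
    """True if ENABLE_NEURON_EP = "1" must be set in local.conf first."""
    return NPU_SUPPORT[model]["onnx_neuron_ep"] == "opt-in"


if __name__ == "__main__":
    for model in sorted(NPU_SUPPORT):
        print(model, "NPU:", has_npu(model),
              "rebuild for ONNX NPU:", onnx_npu_needs_rebuild(model))
```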

TFLite inference with Neuron Stable Delegate

The Neuron Stable Delegate is the TFLite execution delegate that routes supported ops to the NPU.

Python (recommended for prototyping)

import tflite_runtime.interpreter as tflite
import numpy as np

# Load model with Neuron Stable Delegate for NPU acceleration
interpreter = tflite.Interpreter(
    model_path="mobilenet_v2.tflite",
    experimental_delegates=[
        tflite.load_delegate("libNeuronStableDelegate.so")
    ]
)
interpreter.allocate_tensors()

# Get input/output details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Prepare input: `image` is assumed to be a 224x224x3 RGB uint8 array loaded elsewhere
input_data = np.expand_dims(image, axis=0).astype(np.float32)
input_data = (input_data / 255.0 - 0.5) / 0.5  # Scale [0, 255] to [-1, 1]

# Run inference
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()

# Get output
output = interpreter.get_tensor(output_details[0]['index'])
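The snippet above assumes the delegate library is present. On an image without it (a Genio 350, for example), `load_delegate` raises instead of returning a delegate. A minimal sketch of a guarded load follows; the helper name `delegates_or_cpu` is hypothetical, and the loader is passed in as an argument so the logic can be tried without `tflite_runtime` installed.

```python
def delegates_or_cpu(load_delegate, library="libNeuronStableDelegate.so"):
    """Return a delegate list for tflite.Interpreter, or [] for CPU-only."""
    try:
        return [load_delegate(library)]
    except (ValueError, OSError) as exc:
        # load_delegate raises when the shared library is missing
        # or the delegate fails to initialize.
        print(f"Neuron delegate unavailable ({exc}); running on CPU")
        return []


# On a device (assumes tflite_runtime is installed):
#   import tflite_runtime.interpreter as tflite
#   interpreter = tflite.Interpreter(
#       model_path="mobilenet_v2.tflite",
#       experimental_delegates=delegates_or_cpu(tflite.load_delegate),
#   )
```

An empty delegate list leaves the interpreter on its default CPU kernels, so the rest of the inference code stays unchanged.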

C++ (production)

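#include <memory>

#include "tensorflow/lite/delegates/external/external_delegate.h"
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

// Inside your setup function: load the model (the path is a placeholder)
auto model = tflite::FlatBufferModel::BuildFromFile("mobilenet_v2.tflite");

// Build external delegate options for Neuron
TfLiteExternalDelegateOptions opts =
    TfLiteExternalDelegateOptionsDefault("libNeuronStableDelegate.so");
TfLiteDelegate* delegate = TfLiteExternalDelegateCreate(&opts);

// Build interpreter and add delegate
tflite::ops::builtin::BuiltinOpResolver resolver;
tflite::InterpreterBuilder builder(*model, resolver);
builder.AddDelegate(delegate);
std::unique_ptr<tflite::Interpreter> interpreter;
if (builder(&interpreter) != kTfLiteOk) {
  // Handle the build failure (for example, the delegate could not be applied)
}
interpreter->AllocateTensors();

// When done: destroy the interpreter first, then TfLiteExternalDelegateDelete(delegate)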

ONNX Runtime with NeuronExecutionProvider

ONNX Runtime runs directly on the NPU through the NeuronExecutionProvider — enabled by default on Genio 520 and 720, opt-in on 510/700/1200 via ENABLE_NEURON_EP = "1". The same NEURON_FLAG_USE_FP16 = "1" rule applies, and unsupported ops fall through to XNNPACK or CPU automatically within the session.

We cover the full ONNX path — provider setup, verifying NPU execution, online vs offline (ncc-tflite) compilation, the 520/720 distinction, and neuronrt benchmarks — in a dedicated guide: ONNX Runtime on the MediaTek Genio NPU.
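As a short preview of the provider setup, the sketch below builds an ONNX Runtime provider list with the NPU first and the CPU last. It assumes that NEURON_FLAG_USE_FP16 is passed as a provider option under the name used in this post, and the helper name `build_providers` is hypothetical. The function only builds the list, so it can be tried on a development machine without a Neuron-enabled build.

```python
NEURON_EP = "NeuronExecutionProvider"
XNNPACK_EP = "XnnpackExecutionProvider"
CPU_EP = "CPUExecutionProvider"


def build_providers(available):
    """Return a providers list for ort.InferenceSession, NPU first."""
    providers = []
    if NEURON_EP in available:
        # Assumption: the FP16 flag is supplied as a provider option.
        providers.append((NEURON_EP, {"NEURON_FLAG_USE_FP16": "1"}))
    if XNNPACK_EP in available:
        providers.append(XNNPACK_EP)
    # The CPU provider is always the last resort for unsupported ops.
    providers.append(CPU_EP)
    return providers


# On a device (assumes an onnxruntime build with the Neuron provider):
#   import onnxruntime as ort
#   session = ort.InferenceSession(
#       "model.onnx",
#       providers=build_providers(ort.get_available_providers()),
#   )
#   print(session.get_providers())  # shows which providers were accepted
```

Printing `session.get_providers()` after creating the session is a simple way to confirm the NPU provider was accepted rather than skipped.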

Model requirements for NPU acceleration

The NPU handles most standard CNN ops natively. Limitations to be aware of:

Op category	NPU support
Conv2D, DepthwiseConv2D	✅ Full
BatchNorm, ReLU, ReLU6	✅ Full
Add, Mul, Concat	✅ Full
LSTM, GRU	⚠️ Partial (some variants)
Transformer attention	⚠️ Partial
Custom ops	❌ CPU fallback
FP32 weights	⚠️ Requires NEURON_FLAG_USE_FP16 = "1" to run as FP16
INT8 quantized	✅ Best performance path

INT8 quantized models deliver the best NPU performance and the smallest memory footprint. Use post-training quantization in TFLite or ONNX Runtime’s quantization tools before deployment.

Post-training quantization (TFLite)

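import numpy as np
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Representative dataset for calibration.
# `calibration_samples` is assumed to be an iterable of preprocessed inputs,
# each already shaped like the model input (including the batch dimension).
def representative_data_gen():
    for sample in calibration_samples:
        yield [sample.astype(np.float32)]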

converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
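Because the converter above sets the input and output types to int8, the application has to quantize inputs and dequantize outputs itself. TFLite reports the parameters as a `(scale, zero_point)` pair under the `quantization` key of each entry in `get_input_details()` and `get_output_details()`. The pure-Python sketch below shows the affine mapping; the function names and the example parameters are illustrative only.

```python
def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map real values to int8 codes: q = round(x / scale) + zero_point."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point))
            for v in values]


def dequantize(qvalues, scale, zero_point):
    """Map int8 codes back to real values: x = (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in qvalues]


if __name__ == "__main__":
    # Illustrative parameters, not taken from a real model.
    scale, zero_point = 0.5, 10
    codes = quantize([0.0, 1.0, -1.0, 100.0], scale, zero_point)
    print(codes)                                 # 100.0 saturates at qmax
    print(dequantize(codes, scale, zero_point))
```

With NumPy arrays the same arithmetic applies element-wise; values outside the representable range saturate, which is one reason the calibration data should resemble real inputs.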

Practical deployment patterns

Pattern 1: Continuous camera inference

import tflite_runtime.interpreter as tflite
import cv2
import numpy as np

# Load model once at startup
interpreter = tflite.Interpreter(
    model_path="detect.tflite",
    experimental_delegates=[tflite.load_delegate("libNeuronStableDelegate.so")]
)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Open camera
cap = cv2.VideoCapture(0)

while True:
    ret, frame = cap.read()
    if not ret:
        break

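    # Preprocess: OpenCV delivers BGR frames; most TFLite models expect RGB
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    resized = cv2.resize(rgb, (300, 300))
    input_data = np.expand_dims(resized, axis=0).astype(np.uint8)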

    # Infer
    interpreter.set_tensor(input_details[0]['index'], input_data)
    interpreter.invoke()

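    # Output tensor order is model-specific; check output_details for your model
    boxes = interpreter.get_tensor(output_details[0]['index'])
    scores = interpreter.get_tensor(output_details[2]['index'])
    # ... process results

# Release the camera when the loop ends
cap.release()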
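The "process results" step usually means dropping low-confidence detections and converting box coordinates to pixels. The sketch below assumes SSD-style outputs, where each box is `[ymin, xmin, ymax, xmax]` normalized to the range 0 to 1; that layout is an assumption about the model, and the function name `filter_detections` is hypothetical.

```python
def filter_detections(boxes, scores, frame_width, frame_height, threshold=0.5):
    """Keep detections at or above `threshold` and convert boxes to pixels.

    Assumes each box is [ymin, xmin, ymax, xmax], normalized to [0, 1].
    Returns a list of (xmin, ymin, xmax, ymax, score) tuples.
    """
    results = []
    for box, score in zip(boxes, scores):
        if score < threshold:
            continue
        ymin, xmin, ymax, xmax = box
        results.append((
            round(xmin * frame_width),
            round(ymin * frame_height),
            round(xmax * frame_width),
            round(ymax * frame_height),
            score,
        ))
    return results


if __name__ == "__main__":
    boxes = [[0.1, 0.2, 0.5, 0.6], [0.0, 0.0, 1.0, 1.0]]
    scores = [0.9, 0.3]
    print(filter_detections(boxes, scores, 640, 480))
```

In the camera loop, the arrays returned by `get_tensor` carry a leading batch dimension, so `boxes[0]` and `scores[0]` would be passed in.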

Pattern 2: GStreamer + NNStreamer pipeline

NNStreamer integrates TFLite inference directly into GStreamer pipelines for zero-copy frame processing:

gst-launch-1.0 \
  v4l2src device=/dev/video0 ! \
  video/x-raw,format=RGB,width=640,height=480 ! \
  videoconvert ! \
  tensor_converter ! \
  tensor_filter \
    framework=tflite \
    model=mobilenet_v2_int8.tflite \
    accelerator=true:npu ! \
  tensor_decoder mode=image_labeling \
    option1=labels.txt ! \
  overlay ! waylandsink

NNStreamer is included in packagegroup-rity-ai-ml in the RITY Yocto image.

Performance benchmarks on Genio 720

Approximate inference times on Genio 720 EVK with INT8 quantized models:

Model	CPU (ms)	NPU (ms)	Speedup
MobileNetV2	45	8	5.6×
EfficientDet-Lite0	120	22	5.5×
YOLOv5s INT8	210	38	5.5×
BERT-base (text)	850	180	4.7×

In these measurements, NPU inference is about 5.5× faster than CPU for the vision models and 4.7× faster for BERT-base. Because the CPU stays free for other work while the NPU runs the model, the real-world benefit is larger than the raw latency numbers suggest.
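The speedup column is the CPU latency divided by the NPU latency, and the same latencies convert directly into throughput. The short sketch below recomputes both from the table values; the dictionary and function names are illustrative.

```python
# Latencies in milliseconds, copied from the table above: (CPU, NPU).
BENCHMARKS = {
    "MobileNetV2": (45, 8),
    "EfficientDet-Lite0": (120, 22),
    "YOLOv5s INT8": (210, 38),
    "BERT-base (text)": (850, 180),
}


def speedup(cpu_ms, npu_ms):
    """How many times faster the NPU is than the CPU."""
    return cpu_ms / npu_ms


def inferences_per_second(latency_ms):
    """Throughput for back-to-back single inferences."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    for name, (cpu_ms, npu_ms) in BENCHMARKS.items():
        print(f"{name}: {speedup(cpu_ms, npu_ms):.1f}x, "
              f"{inferences_per_second(npu_ms):.0f} inferences/s on NPU")
```

These throughput figures cover inference only; camera capture, preprocessing, and display add their own time per frame.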

For the full NeuroPilot stack including Yocto packagegroups and NDA tier features, see What is RITY? MediaTek’s Genio reference distribution explained. For a complete computer vision pipeline from camera to inference to display, see MediaTek Genio for computer vision.

FAQ

What AI frameworks run on MediaTek Genio NPU?

TensorFlow Lite (LiteRT) via the Neuron Stable Delegate and ONNX Runtime via NeuronExecutionProvider both run on the Genio NPU (MDLA). TFLite is the more mature path. ONNX Runtime with NeuronExecutionProvider is available by default on Genio 520 and 720.

Do I need cloud connectivity for AI inference on Genio?

No. The NeuroPilot NPU, TFLite, and ONNX Runtime all run entirely on-device. Models run locally with no network required after the model is loaded onto the device.

What is the NEURON_FLAG_USE_FP16 flag and why is it mandatory?

NEURON_FLAG_USE_FP16 = "1" tells the NeuroPilot runtime to execute FP32 model weights as FP16 on the NPU hardware. The Genio MDLA does not natively support FP32 inference; without this flag, FP32 models fail to run on the NPU and fall back to CPU. Set it whenever you use NeuronExecutionProvider or the Neuron Stable Delegate.

Which Genio platform has the strongest NPU for on-device AI?

Genio 1200 (MT8395) has the highest NPU TOPS. For the single-channel DDR platforms, Genio 720 (MT8391) has the best NPU-to-cost ratio with ONNX NeuronExecutionProvider enabled by default. Genio 350 has no NPU.