
Tags: mediatek genio, neuropilot, tflite, onnx, inference, edge ai

Running inference on MediaTek Genio: NeuroPilot, TFLite, and ONNX

Aarón Angulo · May 12, 2026 · Updated July 12, 2026

Running inference on MediaTek Genio’s APU requires INT8 quantized models and the NeuroPilot TFLite delegate, a different stack than TensorRT on Jetson or standard TFLite on CPU. The performance you get depends on how well your model architecture maps to the APU’s supported operator set. This post covers the full deployment workflow from model conversion to running on-device.

Key Insights

NeuroPilot SDK provides a TFLite delegate that routes supported ops to the Genio APU; basic deployment needs no separate compilation step
INT8 quantization is required for APU execution; float32 models run on CPU by default
ONNX → TFLite is one path, but on Genio 520/720 ONNX Runtime also runs directly on the NPU via the NeuronExecutionProvider (see ONNX Runtime on the Genio NPU); the TFLite-delegate path below applies to every NPU-capable Genio part
Unsupported ops fall back to CPU automatically, which can negate APU performance gains on models with many unsupported layers
Latency scales non-linearly with model complexity; benchmark on hardware rather than estimating from TOPS

How Genio’s APU inference stack works

The MediaTek Genio APU is a dedicated neural network accelerator. It is not a GPU and has no general compute capability. It runs a fixed set of operations at INT8 precision, very efficiently, and routes anything it cannot handle to the CPU automatically.

The software stack is NeuroPilot SDK, which implements a TFLite delegate, a plug-in mechanism that lets TFLite route specific ops to hardware accelerators while keeping the rest on CPU. From the application’s perspective, inference still happens through the TFLite runtime. The delegate handles dispatching operations to the APU transparently.

This design has two implications. First, deployment is simpler than on platforms that require a full offline compilation step (like TensorRT on Jetson). You do not need to compile the model to a hardware-specific binary before deployment; the delegate does this at load time. Second, operator compatibility becomes important. If your model has a high percentage of unsupported operations, those run on CPU, and the APU only handles the supported subset.

Model requirements for APU execution

Before the APU runs your model, two requirements must be met:

INT8 quantization. The APU does not execute float32. A float32 TFLite model loaded with the NeuroPilot delegate runs on CPU, so you pay the NeuroPilot overhead with no APU benefit. Quantize your model before deployment.

Two quantization paths:

Post-training quantization (PTQ): convert an existing float32 model. Requires a representative dataset for calibration. Fast, but accuracy may drop for some model types.
Quantization-aware training (QAT): train the model with simulated quantization. Preserves accuracy better, but requires retraining.

For detection models (YOLO, EfficientDet), QAT typically preserves accuracy better than PTQ. For classification (MobileNet, EfficientNet), PTQ usually works well.

Supported operator set. Common CNN operators are supported. Transformer and LSTM-heavy architectures are only partially supported at the APU level, or not at all. Check the NeuroPilot SDK documentation for the current operator list, which changes between SDK versions.

Deployment workflow

From TensorFlow / Keras:

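# Post-training INT8 quantization
import numpy as np
import tensorflow as tf

# calibration_images: an iterable of preprocessed float images (H, W, C),
# sampled from data that resembles what the model will see in production
def representative_dataset():
    for image in calibration_images:
        # Add the batch dimension the converter expects
        yield [image[np.newaxis, :].astype(np.float32)]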

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)

From ONNX:

# Convert ONNX → TFLite (using onnx2tf)
pip install onnx2tf
onnx2tf -i model.onnx -o ./tflite_output

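# onnx2tf also writes a SavedModel into ./tflite_output. Point the PTQ
# converter above at that directory to quantize: TFLiteConverter takes a
# SavedModel or Keras model as input, not an already-exported .tflite file.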

Running inference with NeuroPilot delegate:

import tflite_runtime.interpreter as tflite

# Load the NeuroPilot delegate
delegate = tflite.load_delegate("libneuropilot.so")

interpreter = tflite.Interpreter(
    model_path="model_int8.tflite",
    experimental_delegates=[delegate]
)
interpreter.allocate_tensors()

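input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# input_data must match input_details[0]['shape'] and ['dtype']
# (int8 for the model converted above, not raw float pixels)
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
result = interpreter.get_tensor(output_details[0]['index'])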

The libneuropilot.so path depends on your Yocto build configuration; check the SDK sysroot for the exact location.
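Because the conversion step sets both inference_input_type and inference_output_type to int8, the tensors passed to set_tensor and read from get_tensor are quantized integers, not floats. A minimal sketch of the conversion in both directions, assuming the detail dictionaries come from get_input_details() / get_output_details() and carry the usual 'quantization' (scale, zero_point) pair and 'dtype'. The helper names quantize_input and dequantize_output are hypothetical, not part of TFLite or NeuroPilot.

```python
import numpy as np

def quantize_input(x, detail):
    """Map a float array onto the integer dtype the model input expects."""
    scale, zero_point = detail["quantization"]
    if scale == 0:
        # Scale 0 means the tensor is not quantized; only cast the dtype.
        return x.astype(detail["dtype"])
    limits = np.iinfo(detail["dtype"])
    q = np.round(x / scale) + zero_point
    # Saturate instead of wrapping around on out-of-range values.
    return np.clip(q, limits.min, limits.max).astype(detail["dtype"])

def dequantize_output(q, detail):
    """Map a quantized output tensor back to float values."""
    scale, zero_point = detail["quantization"]
    if scale == 0:
        return q.astype(np.float32)
    return (q.astype(np.float32) - zero_point) * scale

# Intended use with the interpreter above (float_image is your own
# preprocessed float32 input, shaped like input_details[0]['shape']):
#   input_data = quantize_input(float_image, input_details[0])
#   scores = dequantize_output(result, output_details[0])
```

Feeding unquantized float pixels to an int8 input is an easy mistake to make: the model still runs, but the predictions are meaningless.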

Benchmarking and verifying APU utilization

The NeuroPilot SDK includes a benchmark tool. Use it to verify that your model is actually running on the APU (not falling back to CPU):

# Run the benchmark tool
neuropilot_benchmark \
  --model model_int8.tflite \
  --num_runs 50 \
  --report_delegate_profiling true

The output shows per-layer execution location (APU vs CPU) and latency. If most layers show as CPU in the profiling output, check your quantization first: the model may be running as float32.
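The benchmark tool tells you where layers run. To see what latency your own application observes, a small timing harness around the inference call is often enough. The sketch below is generic Python, not a NeuroPilot API: time_calls is a hypothetical helper, and the warmup and run counts are arbitrary defaults. The warmup matters because the first calls can include one-time initialization.

```python
import statistics
import time

def time_calls(fn, warmup=5, runs=50):
    """Call fn repeatedly and return latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # results discarded: first calls may include lazy initialization
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank style index for the 95th percentile.
    p95_index = min(len(samples) - 1, int(round(0.95 * (len(samples) - 1))))
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }

# Intended use with a loaded interpreter:
#   stats = time_calls(interpreter.invoke)
#   print(stats)
```

This measures wall-clock time only. It cannot tell APU from CPU execution, so use it alongside the profiling output rather than instead of it.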

Common causes of unexpected CPU fallback:

Model not fully quantized (mixed float32/int8 layers)
Operator not in the supported list (the benchmark output names which ops are unsupported)
Tensor shape not supported by APU (very small or unusual shapes may fall back)
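The first cause can be checked before the model ever reaches the board. A minimal sketch, assuming the list of dictionaries returned by the TFLite interpreter's get_tensor_details(), where each entry has a 'name' and a NumPy 'dtype'. The helper name float_tensors is hypothetical.

```python
import numpy as np

def float_tensors(tensor_details):
    """Return the names of tensors that still use a floating-point dtype."""
    return [
        detail["name"]
        for detail in tensor_details
        if np.issubdtype(detail["dtype"], np.floating)
    ]

# Intended use on the host after conversion:
#   interpreter = tf.lite.Interpreter(model_path="model_int8.tflite")
#   interpreter.allocate_tensors()
#   print(float_tensors(interpreter.get_tensor_details()))
```

An empty list suggests the graph is integer end to end. Any names it returns point at the layers the converter left in float, which are the likeliest candidates for CPU fallback.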

Handling unsupported ops: what to do when layers fall back

Fallback is not just “that one layer runs slower.” Every unsupported op splits the graph: the runtime builds separate APU and CPU subgraphs and hands tensors across the boundary each inference. One unsupported op in the middle of a network costs two transitions per frame, and a model with scattered unsupported ops can fragment into so many subgraphs that APU acceleration loses to plain CPU execution. The subgraph count in the profiling output matters as much as the op list.
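The arithmetic behind that claim can be sketched in a few lines. This is a simplified model that assumes the network is a linear chain of ops, each labeled with where it runs. Real graphs branch and the runtime's partitioner may group ops differently, so treat the counts as an illustration. The function names are hypothetical.

```python
from itertools import groupby

def partition(placements):
    """Group a linear op sequence into contiguous same-device subgraphs."""
    return [(device, len(list(run))) for device, run in groupby(placements)]

def transitions(placements):
    """Number of device boundaries a tensor crosses per inference."""
    return max(len(partition(placements)) - 1, 0)

# One unsupported op in the middle: three subgraphs, two transitions.
middle = ["APU", "APU", "APU", "CPU", "APU", "APU"]
print(partition(middle), transitions(middle))

# The same op at the tail: two subgraphs, one transition.
tail = ["APU", "APU", "APU", "APU", "APU", "CPU"]
print(partition(tail), transitions(tail))
```

Scattering four unsupported ops through the middle of the chain gives nine subgraphs and eight transitions, which is how a handful of ops ends up dominating per-frame cost.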

The workaround ladder, cheapest first:

Swap the op at the model level. Most fallbacks come from a handful of choices that have APU-friendly equivalents: exotic activations (replace with ReLU6), unusual resize modes, or dynamic shapes that can be fixed at export time. Retrain or re-export with the substitution and the whole graph stays on the APU.
Re-export through a different path. The same network exported with a different opset or converter (onnx2tf vs tf2onnx round trips) often lands on supported ops. Converter choice changes the op mix more than people expect.
Move the op out of the model. If the offender sits at the head or tail (normalization, NMS, decoding), delete it from the graph and do it in application code. A contiguous APU subgraph plus explicit CPU pre/post-processing beats a fragmented graph, and detection post-processing on CPU is normal practice.
Accept it, but only at the edges. A CPU head or tail costs one transition. CPU islands in the middle cost two each; those are the ones worth engineering away.

On the ONNX Runtime path (Genio 520/720, or opt-in on 510/700/1200), the same physics applies with two extra knobs: NEURON_FLAG_USE_FP16 is mandatory for float models to run on the NPU at all, and NEURON_FLAG_MIN_GROUP_SIZE sets how many nodes justify an NPU subgraph, which is your fragmentation control. Details in ONNX Runtime on the Genio NPU. One platform note: on Genio 1200 the NeuronEP op coverage is narrower than on 520/720, so expect more fallback there and budget accordingly.

Offline compilation: ncc-tflite and neuronrt

The delegate path above compiles the model for the APU at load time. For production (fixed models, faster startup, no runtime compile cost), compile ahead of time on a host PC into a hardware-specific .dla binary with the NeuronSDK compiler, then run it on the device.

Host-side compile (ncc-tflite):

export LD_LIBRARY_PATH=/path/to/neuropilot-sdk-basic-<version>/neuron_sdk/host/lib

# INT8 quantized model
./ncc-tflite --arch=mdla3.0 model_int8.tflite -o model_int8.dla

# FP32 model, executed as FP16 on the NPU
./ncc-tflite --arch=mdla3.0 --relax-fp32 model.tflite -o model_fp32.dla

All NPU-capable Genio parts (510/700/520/720/1200) target --arch=mdla3.0. The --relax-fp32 flag is the offline equivalent of the runtime NEURON_FLAG_USE_FP16.
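If you compile several models as part of a build, it can help to generate the command line rather than type it. The sketch below only assembles the arguments shown above (--arch, --relax-fp32, -o). The helper name ncc_command, its parameters, and the choice to name the output after the input file are assumptions for illustration, not part of the SDK.

```python
from pathlib import Path

def ncc_command(model_path, quantized, arch="mdla3.0", ncc="./ncc-tflite"):
    """Build the ncc-tflite argument list for one model."""
    model = Path(model_path)
    cmd = [ncc, f"--arch={arch}"]
    if not quantized:
        # Float models are compiled to run as FP16 on the NPU.
        cmd.append("--relax-fp32")
    cmd += [str(model), "-o", str(model.with_suffix(".dla"))]
    return cmd

# Intended use on the host, with LD_LIBRARY_PATH set as shown above:
#   import subprocess
#   subprocess.run(ncc_command("model_int8.tflite", quantized=True), check=True)
print(" ".join(ncc_command("model_int8.tflite", quantized=True)))
```

Passing the arguments as a list avoids shell quoting problems with model paths that contain spaces.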

On-device run and benchmark (neuronrt):

neuronrt -m hw -a model_int8.dla -i input.bin -c 10
# Total inference time = 52.648 ms (5.2648 ms/inf), Avg. FPS: 186.1   (YOLOv5s INT8)

neuronrt is for verification and benchmarking; production applications drive inference through the Neuron Runtime API (C/C++) for control over tensors and scheduling.

Latency expectations by model class

These are representative figures for the Genio 700 APU. Treat them as planning estimates and benchmark your actual model.

Model	Quantization	Estimated APU latency
MobileNetV3-Small	INT8	5–8 ms
MobileNetV2 (224×224)	INT8	8–12 ms
EfficientNet-Lite0	INT8	10–15 ms
EfficientDet-Lite0	INT8	12–18 ms
YOLOv5n	INT8	20–30 ms
EfficientDet-Lite2	INT8	30–50 ms

For models with significant unsupported op content (transformers, complex attention mechanisms), the effective latency is the APU latency for supported ops plus CPU latency for unsupported ops. Models designed specifically for efficient deployment on mobile/edge APUs (MobileNet family, EfficientNet-Lite family) tend to have the best APU utilization.
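That sum is easy to turn into a frame-rate budget. A rough sketch, assuming the APU and CPU portions run one after the other with no pipelining, and adding a per-transition cost for each handoff between them. The function names are hypothetical and the numbers in the example are placeholders, not measurements.

```python
def effective_latency_ms(apu_ms, cpu_ms, transitions=0, transition_ms=0.0):
    """Serial estimate: APU time + CPU time + cost of crossing between them."""
    return apu_ms + cpu_ms + transitions * transition_ms

def max_fps(latency_ms):
    """Upper bound on frame rate when inferences run back to back."""
    return 1000.0 / latency_ms

# Placeholder numbers: 12 ms on the APU, 6 ms of CPU fallback,
# and two APU/CPU handoffs at 1 ms each.
total = effective_latency_ms(12.0, 6.0, transitions=2, transition_ms=1.0)
print(total, max_fps(total))
```

The same conversion applies to the table: a model at the 30 ms end of a range cannot exceed roughly 33 inferences per second before any camera, pre-processing, or post-processing time is counted.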

For module comparison on how APU TOPS affects model selection, see Genio 510 vs 700 vs 1200: which MediaTek module for your product.

MediaTek’s NeuroPilot developer portal is at iot.mediatek.com. The TFLite documentation for delegates is at tensorflow.org.


Frequently Asked Questions

What inference frameworks does MediaTek Genio support for running models on the APU?

MediaTek NeuroPilot SDK is the primary path to APU acceleration on Genio. It provides a TFLite delegate that routes supported operations to the APU and falls back to CPU for unsupported ops. To use the NeuroPilot delegate, ONNX models must first be converted to TFLite; on Genio 520 and 720, ONNX Runtime can also run them directly on the NPU via the NeuronExecutionProvider. On the delegate path the APU requires INT8 quantized models; float32 models run on CPU at significantly lower performance.

How do I convert an ONNX model to run on MediaTek Genio's APU?

There are two paths. On Genio 520 and 720, ONNX Runtime runs directly on the NPU via the NeuronExecutionProvider with no TFLite conversion needed. On the other NPU-capable parts (or whenever you want the TFLite delegate path), convert ONNX → TFLite (via onnx2tf or onnx-tf) → INT8 quantization → deploy with the NeuroPilot delegate. Either way the APU needs INT8 (or FP16 via the FP16 flag), not float32, and you should verify operator compatibility before converting, as unsupported ops fall back to CPU.

Which operators does MediaTek NeuroPilot support on the APU?

NeuroPilot supports common CNN and detection operators: Conv2D, DepthwiseConv2D, MaxPool2D, AveragePool2D, FullyConnected, ReLU, ReLU6, Add, Mul, Softmax, Reshape, and Concatenation, among others. Transformer-based operators (MultiHeadAttention, complex LSTM variants) have limited or no APU support and fall back to CPU. Check the NeuroPilot SDK documentation for the complete operator compatibility table for your specific SDK version.

What inference latency should I expect from the MediaTek Genio APU?

For INT8 models on Genio 700 APU (~4 TOPS): MobileNetV3-Small runs in 5–8 ms, EfficientDet-Lite0 in around 12–18 ms, and small YOLO variants (YOLOv5n-level) in around 20–30 ms. Latency is sensitive to model architecture: operator type, tensor shapes, and memory access patterns matter more than raw TOPS. Always benchmark with the NeuroPilot benchmark tool on your actual model rather than estimating from TOPS alone.

How do I get the NeuroPilot SDK for MediaTek Genio?

NeuroPilot SDK is available from MediaTek's IoT developer portal (iot.mediatek.com). Some SDK components require an NDA or partner agreement. The TFLite delegate library is included in the Genio IoT Yocto SDK and available through the standard SDK download. For the full NeuroPilot Compiler (offline DLA compilation), you may need to contact MediaTek directly or access through a module vendor partnership.

Written by

Aarón Angulo

Co-Founder & CEO · ProventusNova
