Running inference on MediaTek Genio: NeuroPilot, TFLite, and ONNX

Aaron Angulo ·

Running inference on MediaTek Genio’s APU requires INT8 quantized models and the NeuroPilot TFLite delegate — a different stack than TensorRT on Jetson or standard TFLite on CPU. The performance you get depends on how well your model architecture maps to the APU’s supported operator set. This post covers the full deployment workflow from model conversion to running on-device.

Key Insights

  • NeuroPilot SDK provides a TFLite delegate that routes supported ops to the Genio APU — no separate compilation step required for basic deployment
  • INT8 quantization is required for APU execution — float32 models run on CPU by default
  • ONNX → TFLite conversion is a required step before using NeuroPilot; there is no direct ONNX-to-APU path
  • Unsupported ops fall back to CPU automatically, which can negate APU performance gains on models with many unsupported layers
  • Latency scales with model complexity non-linearly — benchmark on hardware, not TOPS estimates

How Genio’s APU inference stack works

The MediaTek Genio APU is a dedicated neural network accelerator. It is not a GPU and has no general compute capability. It runs a fixed set of operations at INT8 precision very efficiently; anything it cannot handle is routed to the CPU by the runtime.

The software stack is NeuroPilot SDK, which implements a TFLite delegate — a plug-in mechanism that lets TFLite route specific ops to hardware accelerators while keeping the rest on CPU. From the application’s perspective, inference still happens through the TFLite runtime. The delegate handles dispatching operations to the APU transparently.

This design has two implications. First, deployment is simpler than platforms that require a full offline compilation step (like TensorRT on Jetson). You do not need to compile the model to a hardware-specific binary at deployment time — the delegate does this at load time. Second, operator compatibility becomes important. If your model has a high percentage of unsupported operations, those run on CPU, and the APU only handles the supported subset.
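To make the second implication concrete, here is a conceptual sketch of how a delegate splits a model into APU and CPU segments. It is a simplification, not NeuroPilot's actual partitioning logic: real TFLite graphs are not purely linear, and the supported set and the `CUSTOM_NMS` op below are illustrative assumptions rather than the SDK's real list.

```python
from itertools import groupby

# Illustrative subset only (assumption). The real list comes from the
# NeuroPilot SDK documentation for your SDK version.
APU_SUPPORTED = {"CONV_2D", "DEPTHWISE_CONV_2D", "ADD", "RELU6", "SOFTMAX"}


def partition(ops, supported=APU_SUPPORTED):
    """Group a linear op sequence into contiguous (target, ops) segments."""
    def target(op):
        return "APU" if op in supported else "CPU"

    return [(tgt, list(group)) for tgt, group in groupby(ops, key=target)]


# Hypothetical model: one unsupported op in the middle splits the APU work.
model_ops = [
    "CONV_2D", "RELU6", "DEPTHWISE_CONV_2D",
    "CUSTOM_NMS",  # hypothetical op that is not in the supported set
    "CONV_2D", "SOFTMAX",
]

segments = partition(model_ops)
for tgt, group in segments:
    print(f"{tgt}: {group}")
```

A single unsupported op here produces three segments instead of one, so the model crosses the APU/CPU boundary twice per inference. Models with many scattered unsupported ops fragment further, which is one way fallback can erode the gains from acceleration.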

Model requirements for APU execution

Before the APU runs your model, two requirements must be met:

INT8 quantization. The APU does not execute float32. A float32 TFLite model loaded with the NeuroPilot delegate runs on CPU — you get NeuroPilot overhead with no APU benefit. Quantize your model before deployment.

Two quantization paths:

  1. Post-training quantization (PTQ) — convert an existing float32 model. Requires a representative dataset for calibration. Fast, but accuracy may drop for some model types.
  2. Quantization-aware training (QAT) — train the model with simulated quantization. Better accuracy preservation, requires retraining.

For detection models (YOLO, EfficientDet), QAT typically preserves accuracy better than PTQ. For classification (MobileNet, EfficientNet), PTQ usually works well.
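The arithmetic behind INT8 quantization is small enough to sketch. TFLite's 8-bit scheme maps a real value to an integer through a scale and a zero point, with real = scale × (q − zero_point). The functions below are a minimal illustration of that mapping for a single tensor range, assuming the range was observed during calibration; the converter's actual calibration (per-channel weights, range selection) is more involved, and the function names are mine.

```python
def quant_params(rmin, rmax, qmin=-128, qmax=127):
    """Derive (scale, zero_point) mapping [rmin, rmax] onto the int8 range."""
    # The representable range must contain 0 so that 0.0 maps exactly.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)
    scale = (rmax - rmin) / (qmax - qmin)
    if scale == 0.0:  # degenerate all-zero tensor
        scale = 1.0
    zero_point = round(qmin - rmin / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a real value to int8, saturating at the range limits."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Recover the approximate real value from its int8 code."""
    return (q - zero_point) * scale


# Example: an activation observed in [0, 6] during calibration (ReLU6-like).
scale, zero_point = quant_params(0.0, 6.0)
for x in (0.0, 1.5, 6.0, 10.0):
    q = quantize(x, scale, zero_point)
    print(f"{x:5.2f} -> q={q:4d} -> {dequantize(q, scale, zero_point):.4f}")
```

Values inside the calibrated range come back with an error of at most half a quantization step, while values outside it (10.0 here) saturate. That saturation is why the representative dataset should cover the inputs the model will see in production.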

Supported operator set. Common CNN operators are supported. Transformer and LSTM-heavy architectures have partial or no support at the APU level. Check the NeuroPilot SDK documentation for the current operator list, because it changes between SDK versions.

Deployment workflow

From TensorFlow / Keras:

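# Post-training INT8 quantization
import numpy as np
import tensorflow as tf

# calibration_images is assumed to be defined elsewhere: an iterable of
# preprocessed images shaped like the model input without the batch
# dimension, sampled from your real input data.
def representative_dataset():
    for image in calibration_images:
        # Add the batch dimension; the converter expects float32 samples
        yield [image[np.newaxis, ...].astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restrict to INT8 builtin ops so conversion fails instead of leaving float ops
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
# Make the model's input and output tensors int8 as well
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)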

From ONNX:

# Convert ONNX → TFLite (using onnx2tf)
pip install onnx2tf
onnx2tf -i model.onnx -o ./tflite_output

# onnx2tf writes a TensorFlow SavedModel alongside float .tflite files in ./tflite_output.
# To quantize, point the PTQ path above at that SavedModel directory.

Running inference with NeuroPilot delegate:

import tflite_runtime.interpreter as tflite

# Load the NeuroPilot delegate
delegate = tflite.load_delegate("libneuropilot.so")

interpreter = tflite.Interpreter(
    model_path="model_int8.tflite",
    experimental_delegates=[delegate]
)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# input_data must match the model's input shape and dtype. With the INT8
# conversion above, that means int8 values quantized with the scale and
# zero point reported in input_details[0]['quantization'].
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()
result = interpreter.get_tensor(output_details[0]['index'])

The libneuropilot.so path depends on your Yocto build configuration — check the SDK sysroot for the exact location.
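Because the conversion step sets the input and output types to int8, the application has to quantize its float input and dequantize the raw output itself. The sketch below shows one way to do that using the `quantization` and `dtype` fields that TFLite reports in `get_input_details()` and `get_output_details()`. The helper names and the stand-in detail dictionary are assumptions for illustration; on device you would pass `input_details[0]` and `output_details[0]` instead.

```python
import numpy as np


def quantize_input(frame, input_detail):
    """Convert a float array to the dtype and scale the model expects."""
    scale, zero_point = input_detail["quantization"]
    dtype = input_detail["dtype"]
    if scale == 0:  # float model: no quantization parameters set
        return frame.astype(dtype)
    limits = np.iinfo(dtype)
    q = np.round(frame / scale) + zero_point
    return np.clip(q, limits.min, limits.max).astype(dtype)


def dequantize_output(raw, output_detail):
    """Convert raw integer output back to real-valued float32."""
    scale, zero_point = output_detail["quantization"]
    if scale == 0:
        return raw.astype(np.float32)
    return (raw.astype(np.float32) - zero_point) * scale


# Stand-in for input_details[0]; the values are hypothetical.
detail = {"dtype": np.int8, "quantization": (1.0 / 255.0, -128)}
frame = np.array([0.0, 0.25, 1.0], dtype=np.float32)

q = quantize_input(frame, detail)
restored = dequantize_output(q, detail)
print(q, restored)
```

Skipping this step is a common source of silently wrong results: passing normalized float pixels cast straight to int8 produces a tensor of nearly all zeros.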

Benchmarking and verifying APU utilization

The NeuroPilot SDK includes a benchmark tool. Use it to verify that your model is actually running on the APU (not falling back to CPU):

# Run the benchmark tool
neuropilot_benchmark \
  --model model_int8.tflite \
  --num_runs 50 \
  --report_delegate_profiling true

The output shows per-layer execution location (APU vs CPU) and latency. If most layers show as CPU in the profiling output, check your quantization first: the model may still be running as float32.

Common causes of unexpected CPU fallback:

  • Model not fully quantized (mixed float32/int8 layers)
  • Operator not in the supported list (the benchmark output names which ops are unsupported)
  • Tensor shape not supported by APU (very small or unusual shapes may fall back)
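As a cross-check on the SDK benchmark tool, you can also time inference from application code. The harness below is a generic sketch that times any zero-argument callable; the function name, warmup count, and reported statistics are my choices, not part of the NeuroPilot SDK. On device you would pass `interpreter.invoke` after setting the input tensor.

```python
import statistics
import time


def time_inference(invoke, warmup=5, runs=50):
    """Time a zero-argument callable and return latency stats in ms."""
    # Warmup runs absorb one-time costs such as delegate initialization.
    for _ in range(warmup):
        invoke()

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        invoke()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    p95_index = min(len(samples_ms) - 1, int(0.95 * len(samples_ms)))
    return {
        "median_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
        "max_ms": samples_ms[-1],
    }


# Stand-in workload so the sketch runs anywhere; replace with
# interpreter.invoke on the target.
stats = time_inference(lambda: sum(range(10_000)))
print(stats)
```

Running the same harness once with the delegate and once without it gives a direct measure of what the APU is contributing. If the two medians are close, the model is probably falling back to CPU for most of its work.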

Latency expectations by model class

These are representative figures for the Genio 700 APU. Treat them as planning estimates and benchmark your actual model.

| Model | Quantization | Estimated APU latency |
|---|---|---|
| MobileNetV3-Small | INT8 | 5–8 ms |
| MobileNetV2 (224×224) | INT8 | 8–12 ms |
| EfficientNet-Lite0 | INT8 | 10–15 ms |
| EfficientDet-Lite0 | INT8 | 12–18 ms |
| YOLOv5n | INT8 | 20–30 ms |
| EfficientDet-Lite2 | INT8 | 30–50 ms |

For models with significant unsupported op content (transformers, complex attention mechanisms), the effective latency is the APU latency for supported ops plus CPU latency for unsupported ops. Models designed specifically for efficient deployment on mobile/edge APUs (MobileNet family, EfficientNet-Lite family) tend to have the best APU utilization.
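A quick worked example shows how fallback changes a frame-rate budget. The numbers below are hypothetical, not measurements, and the sketch ignores any cost of moving tensors between APU and CPU, so treat the result as an optimistic bound.

```python
def effective_latency_ms(apu_ms, cpu_fallback_ms):
    """Per-inference latency when part of the model runs on CPU."""
    return apu_ms + cpu_fallback_ms


def max_fps(latency_ms):
    """Upper bound on throughput for back-to-back inferences."""
    return 1000.0 / latency_ms


# Hypothetical model: 15 ms of APU work plus 25 ms of CPU fallback.
apu_only = effective_latency_ms(15.0, 0.0)
with_fallback = effective_latency_ms(15.0, 25.0)

print(f"APU only:      {apu_only:.0f} ms, {max_fps(apu_only):.1f} fps")
print(f"With fallback: {with_fallback:.0f} ms, {max_fps(with_fallback):.1f} fps")
```

In this example the fallback ops account for most of the total latency, so the model would miss a 30 fps target even though its APU portion alone would meet it comfortably.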

For module comparison on how APU TOPS affects model selection, see Genio 510 vs 700 vs 1200: which MediaTek module for your product.

MediaTek’s NeuroPilot developer portal is at iot.mediatek.com. The TFLite documentation for delegates is at tensorflow.org.

Frequently Asked Questions

What inference frameworks does MediaTek Genio support for running models on the APU?

The MediaTek NeuroPilot SDK is the primary path to APU acceleration on Genio. It provides a TFLite delegate that routes supported operations to the APU and lets unsupported ops fall back to CPU. ONNX models must be converted to TFLite before they can use the NeuroPilot delegate. The APU requires INT8 quantized models; float32 models run on CPU at significantly lower performance.

How do I convert an ONNX model to run on MediaTek Genio's APU?

The conversion path is: ONNX → TFLite (via onnx2tf or onnx-tf) → INT8 quantization (post-training or quantization-aware training) → deployment with the NeuroPilot TFLite delegate. There is no direct ONNX-to-APU path. The quantization step is required because the APU does not support float32 inference. Verify operator compatibility against the NeuroPilot supported ops list before converting, as unsupported ops will fall back to CPU.

Which operators does MediaTek NeuroPilot support on the APU?

NeuroPilot supports common CNN and detection operators: Conv2D, DepthwiseConv2D, MaxPool2D, AveragePool2D, FullyConnected, ReLU, ReLU6, Add, Mul, Softmax, Reshape, and Concatenation, among others. Transformer-based operators (MultiHeadAttention, complex LSTM variants) have limited or no APU support and fall back to CPU. Check the NeuroPilot SDK documentation for the complete operator compatibility table for your specific SDK version.

What inference latency should I expect from the MediaTek Genio APU?

For INT8 models on the Genio 700 APU (~4 TOPS): MobileNetV3-Small runs in around 5–8 ms, EfficientDet-Lite0 in around 12–18 ms, and small YOLO variants (YOLOv5n-level) in around 20–30 ms. Latency is sensitive to model architecture: operator type, tensor shapes, and memory access patterns matter more than raw TOPS. Always benchmark with the NeuroPilot benchmark tool on your actual model rather than estimating from TOPS alone.

How do I get the NeuroPilot SDK for MediaTek Genio?

NeuroPilot SDK is available from MediaTek's IoT developer portal (iot.mediatek.com). Some SDK components require an NDA or partner agreement. The TFLite delegate library is included in the Genio IoT Yocto SDK and available through the standard SDK download. For the full NeuroPilot Compiler (offline DLA compilation), you may need to contact MediaTek directly or access through a module vendor partnership.

Written by

Aarón Angulo

Co-Founder & CEO · ProventusNova

Obsessed with client outcomes. Aarón ensures every engagement delivers real results — on time, on scope, no exceptions.
