Defining an AI Function
Let’s begin with a function that classifies an image, returning the label along with a confidence score. To do so, we will use the MobileNet v2 model from torchvision:
ai.py
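A minimal sketch of such a classifier is shown below, using torchvision's bundled weights and preprocessing transforms. The function name and exact preprocessing details are illustrative assumptions:

```py
from PIL import Image
import torch
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights

# Load the pretrained MobileNet v2 model along with its preprocessing transforms
weights = MobileNet_V2_Weights.DEFAULT
model = mobilenet_v2(weights=weights).eval()
preprocess = weights.transforms()

def classify_image(image: Image.Image) -> tuple[str, float]:
    """Classify an image, returning the top label and its confidence score."""
    with torch.inference_mode():
        logits = model(preprocess(image).unsqueeze(0))
        probabilities = logits.softmax(dim=1)
        confidence, index = probabilities[0].max(dim=0)
    label = weights.meta["categories"][index.item()]
    return label, confidence.item()
```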
The code above has nothing to do with Muna. It is plain PyTorch code.
Compiling the AI Function
There are a few steps needed to prepare an AI function for compilation:
Decorating the Function
First, apply the @compile decorator to the function:
ai.py
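A sketch of the decorated function is shown below. The decorator arguments (tag, description) are assumptions about the @compile API and may differ:

```py
from PIL import Image
from muna import compile

@compile(
    tag="@username/image-classifier",                  # assumed parameter names and tag format
    description="Classify an image with MobileNet v2."
)
def classify_image(image: Image.Image) -> tuple[str, float]:
    ...  # body as defined above
```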
Defining the Compiler Sandbox
Depending on how you run AI inference, you will likely have to install libraries (e.g. PyTorch) and/or upload model weights. To do so, create a Sandbox:
ai.py
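A sketch of a sandbox that installs the classifier's dependencies is shown below. The fluent pip_install method name is an assumption about the Sandbox API:

```py
from muna import Sandbox

# Install the libraries the classifier needs inside the compiler sandbox.
# `pip_install` is an assumed Sandbox method name.
sandbox = Sandbox().pip_install("torch", "torchvision", "Pillow")
```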
Specifying an Inference Backend
Let’s use the ONNX Runtime inference backend to run the AI model:
ai.py
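A sketch of the fully annotated function is shown below. The metadata argument, the muna.beta import path, and the model and model_args fields are assumptions about the API:

```py
import torch
from PIL import Image
from muna import compile, Sandbox
from muna.beta import OnnxRuntimeInferenceMetadata   # import path is an assumption

@compile(
    tag="@username/image-classifier",                 # assumed parameter names
    description="Classify an image with MobileNet v2.",
    sandbox=Sandbox().pip_install("torch", "torchvision", "Pillow"),
    metadata=[
        OnnxRuntimeInferenceMetadata(
            model=model,                               # the MobileNet v2 nn.Module defined above
            model_args=[torch.randn(1, 3, 224, 224)]   # example inputs used to export the model to ONNX
        )
    ]
)
def classify_image(image: Image.Image) -> tuple[str, float]:
    ...  # body as defined above
```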
Compiling the Function
Now, compile the function using the Muna CLI:
Terminal
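A hypothetical invocation is sketched below; the command name and any flags are assumptions, so consult the Muna CLI help for the exact usage:

```bash
# Hypothetical invocation; the actual command and flags may differ.
muna compile ai.py
```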
Inference Backends
Muna supports a fixed set of backends for running AI inference. You must opt in to using an inference backend for a specific model by providing inference metadata. The provided metadata will allow the Muna compiler to lower the inference operation to native code.
Supported Inference Backends
Below are supported inference backends:
Llama.cpp
We support compiling Llama instances for inference with llama.cpp:
llm.py
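A sketch of such a prediction function with llama-cpp-python is shown below; the model path and function name are placeholders:

```py
from llama_cpp import Llama

# Load a GGUF model with llama-cpp-python (the path is a placeholder)
llm = Llama(model_path="model.gguf")

def chat(prompt: str) -> str:
    """Generate a chat completion for the given prompt."""
    response = llm.create_chat_completion(
        messages=[{ "role": "user", "content": prompt }]
    )
    return response["choices"][0]["message"]["content"]
```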
In order to compile a Llama.cpp prediction function like the above, special care must be taken to create the compilation sandbox:
Installing a C++ Compiler
The llama-cpp-python Python package builds llama.cpp from source, so the compiler sandbox must contain a C++ compiler toolchain:
llm.py
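A sketch is shown below; the apt_install method name is an assumption about the Sandbox API:

```py
from muna import Sandbox

# Install a C++ toolchain so that llama.cpp can be built from source.
# `apt_install` is an assumed Sandbox method name.
sandbox = Sandbox().apt_install("build-essential", "cmake")
```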
Installing Llama Cpp Python
Next, install the llama-cpp-python Python package:
llm.py
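Continuing the sketch, with pip_install assumed to be the Sandbox method for installing Python packages:

```py
from muna import Sandbox

# Install the llama-cpp-python package on top of the C++ toolchain.
sandbox = (
    Sandbox()
    .apt_install("build-essential", "cmake")
    .pip_install("llama-cpp-python")
)
```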
Uploading the GGUF Model
Finally, upload the GGUF model so that it is available in the compiler sandbox:
llm.py
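Completing the sketch; the upload_file method name and its arguments are assumptions about the Sandbox API:

```py
from muna import Sandbox

# Upload the GGUF weights so they are available inside the compiler sandbox.
sandbox = (
    Sandbox()
    .apt_install("build-essential", "cmake")
    .pip_install("llama-cpp-python")
    .upload_file("model.gguf")
)
```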
OnnxRuntime
Use the OnnxRuntimeInferenceMetadata metadata type to compile a PyTorch nn.Module for inference with ONNX Runtime:
ai.py
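A sketch of the metadata is shown below; the muna.beta import path and the model and model_args field names are assumptions:

```py
import torch
from muna.beta import OnnxRuntimeInferenceMetadata   # import path is an assumption

model: torch.nn.Module = ...                          # the PyTorch model you want to compile

metadata = OnnxRuntimeInferenceMetadata(
    model=model,
    model_args=[torch.randn(1, 3, 224, 224)]          # example inputs used to export the model to ONNX
)
```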
OnnxRuntime (Inference Session)
Use the OnnxRuntimeInferenceSessionMetadata metadata type to compile an ONNX Runtime InferenceSession:
ai.py
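A sketch is shown below; the muna.beta import path and the field names are assumptions:

```py
from onnxruntime import InferenceSession
from muna.beta import OnnxRuntimeInferenceSessionMetadata   # import path is an assumption

# Load an existing ONNX model into an ONNX Runtime inference session
session = InferenceSession("model.onnx")

metadata = OnnxRuntimeInferenceSessionMetadata(
    session=session,                                  # field names are assumptions
    model_path="model.onnx"
)
```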
TensorRT
Use the TensorRTInferenceMetadata metadata type to compile a PyTorch nn.Module to TensorRT:
ai.py
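A sketch is shown below; the muna.beta import path and the field names for the target architectures and precision are assumptions. See the tables below for the supported values:

```py
import torch
from muna.beta import TensorRTInferenceMetadata      # import path is an assumption

model: torch.nn.Module = ...                          # the PyTorch model you want to compile

metadata = TensorRTInferenceMetadata(
    model=model,
    model_args=[torch.randn(1, 3, 224, 224)],         # example inputs
    cuda_archs=["sm_80", "sm_90"],                    # target CUDA architectures (field name assumed)
    precision="fp16"                                  # inference precision (field name assumed)
)
```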
The TensorRT inference backend is only available on Linux and Windows devices with compatible Nvidia GPUs.
Target CUDA Architectures
TensorRT engines must be compiled for specific target CUDA architectures. Below are CUDA architectures that our compiler supports:

| CUDA Architecture | GPU Family |
|---|---|
| sm_80 | Ampere (e.g. A100) |
| sm_86 | Ampere |
| sm_87 | Ampere |
| sm_89 | Ada Lovelace (e.g. L40S) |
| sm_90 | Hopper (e.g. H100) |
| sm_100 | Blackwell (e.g. B200) |
TensorRT Inference Precision
TensorRT allows for specifying the inference engine’s precision. Below are supported precision modes:

| Precision | Notes |
|---|---|
| fp32 | 32-bit single precision inference. |
| fp16 | 16-bit half precision inference. |
| int8 | 8-bit quantized integer inference. |
CoreML
ExecuTorch
Use the ExecuTorchInferenceMetadata metadata type to compile a PyTorch nn.Module for inference with ExecuTorch:
ai.py
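A sketch is shown below; the muna.beta import path and the field names are assumptions. See the table below for the supported hardware backends:

```py
import torch
from muna.beta import ExecuTorchInferenceMetadata    # import path is an assumption

model: torch.nn.Module = ...                          # the PyTorch model you want to compile

metadata = ExecuTorchInferenceMetadata(
    model=model,
    model_args=[torch.randn(1, 3, 224, 224)],         # example inputs
    backend="xnnpack"                                 # hardware backend (field name assumed)
)
```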
The ExecuTorch inference backend is only available on Android.
ExecuTorch Hardware Backends
ExecuTorch supports several hardware backends to accelerate model inference. Below are targets that are currently supported by Muna:

| Backend | Notes |
|---|---|
| xnnpack | XNNPACK CPU backend. Always enabled. |
| vulkan | Vulkan GPU backend. Only supported on Android. |
LiteRT (TensorFlow Lite)
QNN
Use the QnnInferenceMetadata metadata type to compile a PyTorch nn.Module to a Qualcomm QNN context binary:
ai.py
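A sketch is shown below; the muna.beta import path and the field names for the hardware backend and quantization mode are assumptions. See the tables below for the supported values:

```py
import torch
from muna.beta import QnnInferenceMetadata            # import path is an assumption

model: torch.nn.Module = ...                           # the PyTorch model you want to compile

metadata = QnnInferenceMetadata(
    model=model,
    model_args=[torch.randn(1, 3, 224, 224)],          # example inputs
    backend="htp",                                     # Hexagon NPU backend (field name assumed)
    quantization="w8a16"                               # required when targeting `htp` (field name assumed)
)
```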
The QNN inference backend is only available on Android and Windows devices with Qualcomm processors.
QNN Hardware Backends
QNN requires that a hardware backend is specified ahead of time. Below are supported backends:

| Backend | Notes |
|---|---|
| cpu | Reference aarch64 CPU backend. |
| gpu | Adreno GPU backend, accelerated by OpenCL. |
| htp | Hexagon NPU backend. |
Learn more about QNN hardware backends.
QNN Model Quantization
When using the htp backend, you must specify a model quantization mode, as the Hexagon NPU only supports running integer-quantized models. Below are supported quantization modes:

| Quantization | Notes |
|---|---|
| w8a8 | Weights and activations are quantized to uint8. |
| w8a16 | Weights are quantized to uint8 while activations are quantized to uint16. |
| w4a8 | Weights are quantized to uint4 while activations are quantized to uint8. |
| w4a16 | Weights are quantized to uint4 while activations are quantized to uint16. |
OpenVINO
Use the OpenVINOInferenceMetadata metadata type to compile a PyTorch nn.Module to OpenVINO IR. At runtime, the OpenVINO IR will be used for inference with the OpenVINO toolkit:
ai.py
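A sketch is shown below; the muna.beta import path and the field names are assumptions:

```py
import torch
from muna.beta import OpenVINOInferenceMetadata        # import path is an assumption

model: torch.nn.Module = ...                            # the PyTorch model you want to compile

metadata = OpenVINOInferenceMetadata(
    model=model,
    model_args=[torch.randn(1, 3, 224, 224)]            # example inputs used to export the OpenVINO IR
)
```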
The OpenVINO inference backend is only available on Linux and Windows x86_64 devices with Intel processors.
IREE
Use the muna.beta.IREEInferenceMetadata metadata type to compile a PyTorch nn.Module for inference with IREE:
ai.py
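A sketch is shown below; the field names, including the HAL target backend, are assumptions. See the table below for the supported targets:

```py
import torch
from muna.beta import IREEInferenceMetadata

model: torch.nn.Module = ...                            # the PyTorch model you want to compile

metadata = IREEInferenceMetadata(
    model=model,
    model_args=[torch.randn(1, 3, 224, 224)],           # example inputs
    target="vulkan"                                     # HAL target backend (field name assumed)
)
```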
The IREE inference backend is only available on Android devices.
IREE HAL Target Backends
IREE supports several HAL target backends that the model can be compiled against. Below are targets that are currently supported by Muna:

| Target | Notes |
|---|---|
| vulkan | Vulkan GPU backend. Only supported on Android. |
MIGraphX
Coming soon 🤫.