Defining an AI Function
Let’s begin with a function that classifies an image, returning the label along with a confidence score. To do so, we will use the MobileNet v2 model from torchvision:
ai.py
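The listing below is a minimal sketch of such a function; the classify_image name and signature are illustrative rather than required:

```python
from PIL import Image
import torch
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights

# Load a pretrained MobileNet v2 along with its preprocessing transforms
weights = MobileNet_V2_Weights.DEFAULT
model = mobilenet_v2(weights=weights).eval()

def classify_image(image: Image.Image) -> tuple[str, float]:
    """Classify an image, returning the predicted label and a confidence score."""
    input_tensor = weights.transforms()(image).unsqueeze(0)
    with torch.inference_mode():
        probabilities = model(input_tensor).softmax(dim=-1)
    confidence, index = probabilities[0].max(dim=0)
    label = weights.meta["categories"][index.item()]
    return label, confidence.item()
```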
The code above has nothing to do with Muna. It is plain PyTorch code.
Compiling the AI Function
There are a few steps needed to prepare an AI function for compilation. In this section, the required changes to the above code are highlighted.
Decorating the Function
First, apply the @compile decorator to the function to prepare it for compilation:
ai.py
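A sketch of the decorated function, assuming @compile accepts a tag and description identifying the compiled function (the exact parameters may differ):

```python
from muna import compile
from PIL import Image

@compile(
    tag="@username/image-classifier",                   # assumed: an identifying tag
    description="Classify an image with MobileNet v2."  # assumed: a short description
)
def classify_image(image: Image.Image) -> tuple[str, float]:
    ...
```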
Defining the Compiler Sandbox
Depending on how you run AI inference, you will likely have to install libraries (e.g. PyTorch) and/or upload model weights. To do so, create a Sandbox:
ai.py
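A sketch of the sandbox definition, assuming Sandbox exposes a fluent pip_install method and that the decorator accepts a sandbox parameter:

```python
from muna import Sandbox, compile
from PIL import Image

# Install PyTorch and torchvision into the compiler sandbox (method name assumed)
sandbox = Sandbox().pip_install("torch", "torchvision")

@compile(
    tag="@username/image-classifier",
    description="Classify an image with MobileNet v2.",
    sandbox=sandbox   # assumed parameter name
)
def classify_image(image: Image.Image) -> tuple[str, float]:
    ...
```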
Specifying an Inference Backend
Let’s use the ONNXRuntime inference backend to run the AI model:
ai.py
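A sketch building on the previous snippets, assuming the metadata type is importable from muna.beta and that the decorator accepts a list of metadata instances (the metadata parameter and field names are assumptions):

```python
import torch
from muna.beta import OnnxRuntimeInferenceMetadata   # import path assumed

@compile(
    tag="@username/image-classifier",
    description="Classify an image with MobileNet v2.",
    sandbox=sandbox,
    metadata=[
        # Lower `model` to native code with the ONNXRuntime inference backend
        OnnxRuntimeInferenceMetadata(
            model=model,
            model_args=[torch.randn(1, 3, 224, 224)]   # assumed: example inputs for export
        )
    ]
)
def classify_image(image: Image.Image) -> tuple[str, float]:
    ...
```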
Compiling the Function
Now, compile the function using the Muna CLI:
Terminal
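The exact invocation may differ between releases; a hypothetical command, assuming the CLI exposes a compile subcommand that takes the path to the Python module:

```bash
# Hypothetical; run `muna --help` to confirm the exact command and flags
$ muna compile ai.py
```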
Inference Backends
Muna supports a fixed set of backends for running AI inference. You must opt in to using an inference backend for a specific model by providing inference metadata. The provided metadata will allow the Muna compiler to lower the inference operation to native code.
Supported Inference Backends
Below are supported inference backends:
Llama.cpp
We support compiling Llama instances for inference with llama.cpp. In order to compile a Llama.cpp prediction function, special care must be taken to create the compilation sandbox; the required steps are covered below.
llm.py
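A sketch of such a prediction function, using llama-cpp-python’s Llama class; the tag, description, and sandbox wiring are assumptions, and the sandbox itself is defined in the steps that follow:

```python
from llama_cpp import Llama
from muna import compile

# Load a GGUF model with llama-cpp-python
llm = Llama(model_path="model.gguf")

@compile(
    tag="@username/llm-chat",                      # assumed tag
    description="Generate a chat response with llama.cpp.",
    sandbox=sandbox                                # built up in the steps below
)
def chat(prompt: str) -> str:
    """Generate a chat response for the given prompt."""
    response = llm.create_chat_completion(messages=[{ "role": "user", "content": prompt }])
    return response["choices"][0]["message"]["content"]
```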
Installing a C++ Compiler
The llama-cpp-python Python package builds llama.cpp from source, so the compiler sandbox must contain a C++ compiler toolchain:
llm.py
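A sketch, assuming Sandbox exposes an apt_install method for adding system packages:

```python
from muna import Sandbox

# Add a C++ toolchain so llama.cpp can be built from source (method name assumed)
sandbox = Sandbox().apt_install("build-essential", "cmake")
```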
Installing Llama Cpp Python
Next, install the llama-cpp-python Python package:
llm.py
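Continuing the sketch, assuming a fluent pip_install method on the sandbox:

```python
# Build and install llama-cpp-python inside the compiler sandbox (method name assumed)
sandbox = sandbox.pip_install("llama-cpp-python")
```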
Uploading the GGUF Model
Finally, upload the GGUF model so that it is available in the compiler sandbox:
llm.py
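Finishing the sketch, assuming a method for uploading local files into the sandbox:

```python
# Make the GGUF weights available in the compiler sandbox (method name assumed)
sandbox = sandbox.upload_file("model.gguf")
```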
OnnxRuntime
Use the OnnxRuntimeInferenceMetadata metadata type to compile a PyTorch nn.Module for inference with ONNXRuntime:
ai.py
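A sketch of the metadata, assuming it accepts the nn.Module along with example inputs used to export the model (import path and field names are assumptions):

```python
import torch
from muna.beta import OnnxRuntimeInferenceMetadata   # import path assumed

metadata = OnnxRuntimeInferenceMetadata(
    model=model,                                # the PyTorch nn.Module
    model_args=[torch.randn(1, 3, 224, 224)]    # assumed: example inputs for export
)
```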
OnnxRuntime (Inference Session)
Use the OnnxRuntimeInferenceSessionMetadata metadata type to compile an OnnxRuntime InferenceSession:
ai.py
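A sketch, assuming the metadata wraps an existing InferenceSession and the model_path it was loaded from (import path and field names are assumptions):

```python
from onnxruntime import InferenceSession
from muna.beta import OnnxRuntimeInferenceSessionMetadata   # import path assumed

session = InferenceSession("model.onnx")
metadata = OnnxRuntimeInferenceSessionMetadata(
    session=session,           # assumed field name
    model_path="model.onnx"    # must exist within the compiler sandbox
)
```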
The model must exist at the provided model_path within the compiler sandbox.
TensorRT
Use the TensorRTInferenceMetadata metadata type to compile a PyTorch nn.Module to TensorRT:
ai.py
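A sketch, assuming fields for example inputs, target CUDA architectures, and precision (names are assumptions; see the tables below for supported values):

```python
import torch
from muna.beta import TensorRTInferenceMetadata   # import path assumed

metadata = TensorRTInferenceMetadata(
    model=model,                                # the PyTorch nn.Module
    model_args=[torch.randn(1, 3, 224, 224)],   # assumed: example inputs
    cuda_archs=["sm_90"],                       # assumed field: target CUDA architectures
    precision="fp16"                            # assumed field: inference precision
)
```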
The TensorRT inference backend is only available on Linux and Windows devices with compatible Nvidia GPUs.
We are working on adding support for consumer RTX GPUs with TensorRT for RTX.
Target CUDA Architectures
TensorRT engines must be compiled for specific target CUDA architectures. Below are CUDA architectures that our compiler supports:

| CUDA Architecture | GPU Family |
|---|---|
| sm_80 | Ampere (e.g. A100) |
| sm_86 | Ampere |
| sm_87 | Ampere |
| sm_89 | Ada Lovelace (e.g. L40S) |
| sm_90 | Hopper (e.g. H100) |
| sm_100 | Blackwell (e.g. B200) |
TensorRT Inference Precision
TensorRT allows for specifying the inference engine’s precision. Below are supported precision modes:

| Precision | Notes |
|---|---|
| fp32 | 32-bit single precision inference. |
| fp16 | 16-bit half precision inference. |
| int8 | 8-bit quantized integer inference. |
CoreML
ExecuTorch
Use the ExecuTorchInferenceMetadata metadata type to compile a PyTorch nn.Module for inference with ExecuTorch:
ai.py
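A sketch, assuming fields for example inputs and a hardware backend (names are assumptions; see the backends table below):

```python
import torch
from muna.beta import ExecuTorchInferenceMetadata   # import path assumed

metadata = ExecuTorchInferenceMetadata(
    model=model,
    model_args=[torch.randn(1, 3, 224, 224)],   # assumed: example inputs
    backend="xnnpack"                           # assumed field: hardware backend
)
```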
The ExecuTorch inference backend is only available on Android.
ExecuTorch Hardware Backends
ExecuTorch supports several hardware backends to accelerate model inference. Below are targets that are currently supported by Muna:

| Backend | Notes |
|---|---|
| xnnpack | XNNPACK CPU backend. Always enabled. |
| vulkan | Vulkan GPU backend. Only supported on Android. |
LiteRT (TensorFlow Lite)
QNN
Use the QnnInferenceMetadata metadata type to compile a PyTorch nn.Module to a Qualcomm QNN context binary:
ai.py
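A sketch, assuming fields for example inputs, the hardware backend, and the quantization mode (names are assumptions; see the tables below):

```python
import torch
from muna.beta import QnnInferenceMetadata   # import path assumed

metadata = QnnInferenceMetadata(
    model=model,
    model_args=[torch.randn(1, 3, 224, 224)],   # assumed: example inputs
    backend="htp",                              # Hexagon NPU backend (see below)
    quantization="w8a16"                        # required when targeting the htp backend
)
```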
The QNN inference backend is only available on Android and Windows devices with Qualcomm processors.
QNN Hardware Backends
QNN requires that a hardware device backend is specified ahead of time. Below are supported backends:

| Backend | Notes |
|---|---|
| cpu | Reference aarch64 CPU backend. |
| gpu | Adreno GPU backend, accelerated by OpenCL. |
| htp | Hexagon NPU backend. |
Learn more about QNN hardware backends.
QNN Model Quantization
When using the htp backend, you must specify a model quantization mode, as the Hexagon NPU only supports running integer-quantized models. Below are supported quantization modes:

| Quantization | Notes |
|---|---|
| w8a8 | Weights and activations are quantized to uint8. |
| w8a16 | Weights are quantized to uint8 while activations are quantized to uint16. |
| w4a8 | Weights are quantized to uint4 while activations are quantized to uint8. |
| w4a16 | Weights are quantized to uint4 while activations are quantized to uint16. |
OpenVINO
Use the OpenVINOInferenceMetadata metadata type to compile a PyTorch nn.Module to OpenVINO IR. At runtime, the OpenVINO IR will be used for inference with the OpenVINO toolkit:
ai.py
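A sketch, assuming fields for the module and example inputs (import path and field names are assumptions):

```python
import torch
from muna.beta import OpenVINOInferenceMetadata   # import path assumed

metadata = OpenVINOInferenceMetadata(
    model=model,
    model_args=[torch.randn(1, 3, 224, 224)]   # assumed: example inputs for export
)
```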
The OpenVINO inference backend is only available on Linux and Windows x86_64 devices with Intel processors.
IREE
Use the muna.beta.IREEInferenceMetadata metadata type to compile a PyTorch nn.Module for inference with IREE:
ai.py
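A sketch, assuming fields for example inputs and the HAL target backend (field names are assumptions; see the targets table below):

```python
import torch
from muna.beta import IREEInferenceMetadata

metadata = IREEInferenceMetadata(
    model=model,
    model_args=[torch.randn(1, 3, 224, 224)],   # assumed: example inputs
    target="vulkan"                             # assumed field: HAL target backend
)
```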
The IREE inference backend is only available on Android devices.
IREE HAL Target Backends
IREE supports several HAL target backends that the model can be compiled against. Below are targets that are currently supported by Muna:

| Target | Notes |
|---|---|
| vulkan | Vulkan GPU backend. Only supported on Android. |
MIGraphX
Coming soon 🤫.
A single model can be lowered to use multiple inference backends. Simply provide multiple metadata instances that refer to the model.
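For example, a single function could declare both ONNXRuntime and ExecuTorch metadata for the same model (a sketch; parameter and field names are assumptions):

```python
@compile(
    tag="@username/image-classifier",
    description="Classify an image with MobileNet v2.",
    sandbox=sandbox,
    metadata=[
        OnnxRuntimeInferenceMetadata(model=model, model_args=[torch.randn(1, 3, 224, 224)]),
        ExecuTorchInferenceMetadata(model=model, model_args=[torch.randn(1, 3, 224, 224)], backend="xnnpack")
    ]
)
def classify_image(image: Image.Image) -> tuple[str, float]:
    ...
```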