Defining an AI Function
Let’s begin with a function that classifies an image, returning the label along with a confidence score. To do so, we will use the MobileNet v2 model from torchvision:
ai.py
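The listing below is a minimal sketch of such a function; the classify_image name and signature are illustrative rather than required:

```python
from PIL import Image
import torch
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights

# Load a pretrained MobileNet v2 along with its preprocessing transforms
weights = MobileNet_V2_Weights.DEFAULT
model = mobilenet_v2(weights=weights).eval()

def classify_image(image: Image.Image) -> tuple[str, float]:
    """Classify an image, returning the predicted label and a confidence score."""
    input_tensor = weights.transforms()(image).unsqueeze(0)
    with torch.inference_mode():
        probabilities = model(input_tensor).softmax(dim=-1)
    confidence, index = probabilities[0].max(dim=0)
    label = weights.meta["categories"][index.item()]
    return label, confidence.item()
```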
The code above has nothing to do with Muna. It is plain PyTorch code.
Compiling the AI Function
There are a few steps needed to prepare an AI function for compilation. In this section, the required changes to the above code are highlighted.
Decorating the Function
First, apply the @compile decorator to the function to prepare it for compilation:
ai.py
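A sketch of the decorated function, assuming @compile accepts a tag and description identifying the compiled function (the exact parameters may differ):

```python
from muna import compile
from PIL import Image

@compile(
    tag="@username/image-classifier",                   # assumed: an identifying tag
    description="Classify an image with MobileNet v2."  # assumed: a short description
)
def classify_image(image: Image.Image) -> tuple[str, float]:
    ...
```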
Defining the Compiler Sandbox
Depending on how you run AI inference, you will likely have to install libraries (e.g. PyTorch) and/or upload model weights. To do so, create a Sandbox:
ai.py
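A sketch of the sandbox definition, assuming Sandbox exposes a fluent pip_install method and that the decorator accepts a sandbox parameter:

```python
from muna import Sandbox, compile
from PIL import Image

# Install PyTorch and torchvision into the compiler sandbox (method name assumed)
sandbox = Sandbox().pip_install("torch", "torchvision")

@compile(
    tag="@username/image-classifier",
    description="Classify an image with MobileNet v2.",
    sandbox=sandbox   # assumed parameter name
)
def classify_image(image: Image.Image) -> tuple[str, float]:
    ...
```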
Specifying an Inference Backend
Let’s use the ONNXRuntime inference backend to run the AI model:
ai.py
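A sketch building on the previous snippets, assuming the metadata type is importable from muna.beta and that the decorator accepts a list of metadata instances (the metadata parameter and field names are assumptions):

```python
import torch
from muna.beta import OnnxRuntimeInferenceMetadata   # import path assumed

@compile(
    tag="@username/image-classifier",
    description="Classify an image with MobileNet v2.",
    sandbox=sandbox,
    metadata=[
        # Lower `model` to native code with the ONNXRuntime inference backend
        OnnxRuntimeInferenceMetadata(
            model=model,
            model_args=[torch.randn(1, 3, 224, 224)]   # assumed: example inputs for export
        )
    ]
)
def classify_image(image: Image.Image) -> tuple[str, float]:
    ...
```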
Compiling the Function
Now, compile the function using the Muna CLI:
Terminal
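The exact invocation may differ between releases; a hypothetical command, assuming the CLI exposes a compile subcommand that takes the path to the Python module:

```bash
# Hypothetical; run `muna --help` to confirm the exact command and flags
$ muna compile ai.py
```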
Inference Backends
Muna supports a fixed set of backends for running AI inference. You must opt in to using an inference backend for a specific model by providing inference metadata. The provided metadata will allow the Muna compiler to lower the inference operation to native code.
Supported Inference Backends
Below are supported inference backends:
Llama.cpp
We support compiling Llama instances for inference with llama.cpp. In order to compile a Llama.cpp prediction function, special care must be taken to create the compilation sandbox; the required steps are covered below.
llm.py
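A sketch of such a prediction function, using llama-cpp-python’s Llama class; the tag, description, and sandbox wiring are assumptions, and the sandbox itself is defined in the steps that follow:

```python
from llama_cpp import Llama
from muna import compile

# Load a GGUF model with llama-cpp-python
llm = Llama(model_path="model.gguf")

@compile(
    tag="@username/llm-chat",                      # assumed tag
    description="Generate a chat response with llama.cpp.",
    sandbox=sandbox                                # built up in the steps below
)
def chat(prompt: str) -> str:
    """Generate a chat response for the given prompt."""
    response = llm.create_chat_completion(messages=[{ "role": "user", "content": prompt }])
    return response["choices"][0]["message"]["content"]
```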
Installing a C++ Compiler
The llama-cpp-python Python package builds llama.cpp from source, so the compiler sandbox must contain a C++ compiler toolchain:
llm.py
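A sketch, assuming Sandbox exposes an apt_install method for adding system packages:

```python
from muna import Sandbox

# Add a C++ toolchain so llama.cpp can be built from source (method name assumed)
sandbox = Sandbox().apt_install("build-essential", "cmake")
```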
Installing Llama Cpp Python
Next, install the llama-cpp-python Python package:
llm.py
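Continuing the sketch, assuming a fluent pip_install method on the sandbox:

```python
# Build and install llama-cpp-python inside the compiler sandbox (method name assumed)
sandbox = sandbox.pip_install("llama-cpp-python")
```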
Uploading the GGUF Model
Finally, upload the GGUF model so that it is available in the compiler sandbox:
llm.py
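Finishing the sketch, assuming a method for uploading local files into the sandbox:

```python
# Make the GGUF weights available in the compiler sandbox (method name assumed)
sandbox = sandbox.upload_file("model.gguf")
```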
OnnxRuntime
Use the OnnxRuntimeInferenceMetadata metadata type to compile a PyTorch nn.Module for inference with ONNXRuntime:
ai.py
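A sketch of the metadata, assuming it accepts the nn.Module along with example inputs used to export the model (import path and field names are assumptions):

```python
import torch
from muna.beta import OnnxRuntimeInferenceMetadata   # import path assumed

metadata = OnnxRuntimeInferenceMetadata(
    model=model,                                # the PyTorch nn.Module
    model_args=[torch.randn(1, 3, 224, 224)]    # assumed: example inputs for export
)
```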
OnnxRuntime (Inference Session)
Use the OnnxRuntimeInferenceSessionMetadata metadata type to compile an OnnxRuntime InferenceSession:
ai.py
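A sketch, assuming the metadata wraps an existing InferenceSession and the model_path it was loaded from (import path and field names are assumptions):

```python
from onnxruntime import InferenceSession
from muna.beta import OnnxRuntimeInferenceSessionMetadata   # import path assumed

session = InferenceSession("model.onnx")
metadata = OnnxRuntimeInferenceSessionMetadata(
    session=session,           # assumed field name
    model_path="model.onnx"    # must exist within the compiler sandbox
)
```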
The model must exist at the provided model_path within the compiler sandbox.
TensorRT
Use the TensorRTInferenceMetadata metadata type to compile a PyTorch nn.Module to TensorRT:
ai.py
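A sketch, assuming fields for example inputs, target CUDA architectures, and precision (names are assumptions; see the tables below for supported values):

```python
import torch
from muna.beta import TensorRTInferenceMetadata   # import path assumed

metadata = TensorRTInferenceMetadata(
    model=model,                                # the PyTorch nn.Module
    model_args=[torch.randn(1, 3, 224, 224)],   # assumed: example inputs
    cuda_archs=["sm_90"],                       # assumed field: target CUDA architectures
    precision="fp16"                            # assumed field: inference precision
)
```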
The TensorRT inference backend is only available on Linux and Windows devices with compatible Nvidia GPUs.
We are working on adding support for consumer RTX GPUs with TensorRT for RTX.
Target CUDA Architectures
TensorRT engines must be compiled for specific target CUDA architectures. Below are CUDA architectures that our compiler supports:

| CUDA Architecture | GPU Family |
|---|---|
| sm_80 | Ampere (e.g. A100) |
| sm_86 | Ampere |
| sm_87 | Ampere |
| sm_89 | Ada Lovelace (e.g. L40S) |
| sm_90 | Hopper (e.g. H100) |
| sm_100 | Blackwell (e.g. B200) |
TensorRT Inference Precision
TensorRT allows for specifying the inference engine’s precision. Below are supported precision modes:

| Precision | Notes |
|---|---|
| fp32 | 32-bit single precision inference. |
| fp16 | 16-bit half precision inference. |
| int8 | 8-bit quantized integer inference. |
CoreML
ExecuTorch
Use the ExecuTorchInferenceMetadata metadata type to compile a PyTorch nn.Module for inference with ExecuTorch:
ai.py
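A sketch, assuming fields for example inputs and a hardware backend (names are assumptions; see the backends table below):

```python
import torch
from muna.beta import ExecuTorchInferenceMetadata   # import path assumed

metadata = ExecuTorchInferenceMetadata(
    model=model,
    model_args=[torch.randn(1, 3, 224, 224)],   # assumed: example inputs
    backend="xnnpack"                           # assumed field: hardware backend
)
```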
The ExecuTorch inference backend is only available on Android.
ExecuTorch Hardware Backends
ExecuTorch supports several hardware backends to accelerate model inference. Below are targets that are currently supported by Muna:

| Backend | Notes |
|---|---|
| xnnpack | XNNPACK CPU backend. Always enabled. |
| vulkan | Vulkan GPU backend. Only supported on Android. |
LiteRT (TensorFlow Lite)
QNN
Use the QnnInferenceMetadata metadata type to compile a PyTorch nn.Module to a Qualcomm QNN context binary:
ai.py
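A sketch, assuming fields for example inputs, the hardware backend, and the quantization mode (names are assumptions; see the tables below):

```python
import torch
from muna.beta import QnnInferenceMetadata   # import path assumed

metadata = QnnInferenceMetadata(
    model=model,
    model_args=[torch.randn(1, 3, 224, 224)],   # assumed: example inputs
    backend="htp",                              # Hexagon NPU backend (see below)
    quantization="w8a16"                        # required when targeting the htp backend
)
```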
The QNN inference backend is only available on Android and Windows devices with Qualcomm processors.
QNN Hardware Backends
QNN requires that a hardware device backend is specified ahead of time. Below are supported backends:

| Backend | Notes |
|---|---|
| cpu | Reference aarch64 CPU backend. |
| gpu | Adreno GPU backend, accelerated by OpenCL. |
| htp | Hexagon NPU backend. |
Learn more about QNN hardware backends.
QNN Model Quantization
When using the htp backend, you must specify a model quantization mode, as the Hexagon NPU only supports running integer-quantized models. Below are supported quantization modes:

| Quantization | Notes |
|---|---|
| w8a8 | Weights and activations are quantized to uint8. |
| w8a16 | Weights are quantized to uint8 while activations are quantized to uint16. |
| w4a8 | Weights are quantized to uint4 while activations are quantized to uint8. |
| w4a16 | Weights are quantized to uint4 while activations are quantized to uint16. |
OpenVINO
Use the OpenVINOInferenceMetadata metadata type to compile a PyTorch nn.Module to OpenVINO IR. At runtime, the OpenVINO IR will be used for inference with the OpenVINO toolkit:
ai.py
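A sketch, assuming fields for the module and example inputs (import path and field names are assumptions):

```python
import torch
from muna.beta import OpenVINOInferenceMetadata   # import path assumed

metadata = OpenVINOInferenceMetadata(
    model=model,
    model_args=[torch.randn(1, 3, 224, 224)]   # assumed: example inputs for export
)
```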
The OpenVINO inference backend is only available on Linux and Windows x86_64 devices with Intel processors.
IREE
Use the muna.beta.IREEInferenceMetadata metadata type to compile a PyTorch nn.Module for inference with IREE:
ai.py
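A sketch, assuming fields for example inputs and the HAL target backend (field names are assumptions; see the targets table below):

```python
import torch
from muna.beta import IREEInferenceMetadata

metadata = IREEInferenceMetadata(
    model=model,
    model_args=[torch.randn(1, 3, 224, 224)],   # assumed: example inputs
    target="vulkan"                             # assumed field: HAL target backend
)
```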
The IREE inference backend is only available on Android devices.
IREE HAL Target Backends
IREE supports several HAL target backends that the model can be compiled against. Below are targets that are currently supported by Muna:

| Target | Notes |
|---|---|
| vulkan | Vulkan GPU backend. Only supported on Android. |
MIGraphX
Coming soon 🤫.
A single model can be lowered to use multiple inference backends. Simply provide multiple metadata instances that refer to the model.
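For example, a single function could declare both ONNXRuntime and ExecuTorch metadata for the same model (a sketch; parameter and field names are assumptions):

```python
@compile(
    tag="@username/image-classifier",
    description="Classify an image with MobileNet v2.",
    sandbox=sandbox,
    metadata=[
        OnnxRuntimeInferenceMetadata(model=model, model_args=[torch.randn(1, 3, 224, 224)]),
        ExecuTorchInferenceMetadata(model=model, model_args=[torch.randn(1, 3, 224, 224)], backend="xnnpack")
    ]
)
def classify_image(image: Image.Image) -> tuple[str, float]:
    ...
```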