Muna is primarily designed to compile AI inference functions to run on-device. We will walk through the general workflow required to compile these functions.

Defining an AI Function

Let’s begin with a function that classifies an image, returning the label along with a confidence score. To do so, we will use the MobileNet v2 model from torchvision:
ai.py
from PIL import Image
from torch import argmax, inference_mode, softmax
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights
from torchvision.transforms import functional as F

weights = MobileNet_V2_Weights.DEFAULT
model = mobilenet_v2(weights=weights)
model.eval()

@inference_mode()
def classify_image(image: Image.Image) -> tuple[str, float]:
    """Classify an image."""
    # Preprocess
    image = image.convert("RGB")
    image = F.resize(image, 224)
    image = F.center_crop(image, 224)
    image_tensor = F.to_tensor(image)
    normalized_tensor = F.normalize(
        image_tensor,
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
    # Run model
    logits = model(normalized_tensor[None])
    scores = softmax(logits, dim=1)
    idx = argmax(scores, dim=1)
    score = scores[0,idx].item()
    label = weights.meta["categories"][idx]
    # Return
    return label, score
The code above has nothing to do with Muna. It is plain PyTorch code.
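You can sanity-check the function locally before bringing Muna into the picture. A minimal sketch, assuming an image file exists on disk (the "cat.jpg" path is purely illustrative):
ai.py
if __name__ == "__main__":
    # Quick local test of the plain PyTorch function (the image path is illustrative)
    label, score = classify_image(Image.open("cat.jpg"))
    print(f"{label}: {score:.3f}")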

Compiling the AI Function

There are a few steps needed to prepare an AI function for compilation. The sections below walk through the required changes to the code above.

Decorating the Function

First, apply the @compile decorator to the function to prepare it for compilation:
ai.py
from muna import compile
from PIL import Image
from torch import argmax, inference_mode, softmax
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights
from torchvision.transforms import functional as F

weights = MobileNet_V2_Weights.DEFAULT
model = mobilenet_v2(weights=weights)
model.eval()

@compile(
    tag="@yusuf/classify-image",
    description="Classify an image with AI."
)
@inference_mode()
def classify_image(image: Image.Image) -> tuple[str, float]:
    """Classify an image."""
    # Preprocess
    image = image.convert("RGB")
    image = F.resize(image, 224)
    image = F.center_crop(image, 224)
    image_tensor = F.to_tensor(image)
    normalized_tensor = F.normalize(
        image_tensor,
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
    # Run model
    logits = model(normalized_tensor[None])
    scores = softmax(logits, dim=1)
    idx = argmax(scores, dim=1)
    score = scores[0,idx].item()
    label = weights.meta["categories"][idx]
    # Return
    return label, score

Defining the Compiler Sandbox

Depending on how you run AI inference, you will likely have to install libraries (e.g. PyTorch) and/or upload model weights. To do so, create a Sandbox:
ai.py
from muna import compile, Sandbox
from PIL import Image
from torch import argmax, inference_mode, softmax
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights
from torchvision.transforms import functional as F

weights = MobileNet_V2_Weights.DEFAULT
model = mobilenet_v2(weights=weights)
model.eval()

@compile(
    tag="@yusuf/classify-image",
    description="Classify an image with AI.",
    sandbox=Sandbox().pip_install("torch", "torchvision")
)
@inference_mode()
def classify_image(image: Image.Image) -> tuple[str, float]:
    """Classify an image."""
    # Preprocess
    image = image.convert("RGB")
    image = F.resize(image, 224)
    image = F.center_crop(image, 224)
    image_tensor = F.to_tensor(image)
    normalized_tensor = F.normalize(
        image_tensor,
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
    # Run model
    logits = model(normalized_tensor[None])
    scores = softmax(logits, dim=1)
    idx = argmax(scores, dim=1)
    score = scores[0,idx].item()
    label = weights.meta["categories"][idx]
    # Return
    return label, score

Specifying an Inference Backend

Let’s use the ONNXRuntime inference backend to run the AI model:
ai.py
from muna import compile, Sandbox
from muna.beta import OnnxRuntimeInferenceMetadata
from PIL import Image
from torch import argmax, inference_mode, softmax, randn
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights
from torchvision.transforms import functional as F

weights = MobileNet_V2_Weights.DEFAULT
model = mobilenet_v2(weights=weights).eval()

@compile(
    tag="@yusuf/classify-image",
    description="Classify an image with AI.",
    sandbox=Sandbox().pip_install("torch", "torchvision"),
    metadata=[
        OnnxRuntimeInferenceMetadata(
            model=model,
            model_args=[randn(1,3,224,224)]
        )
    ]
)
@inference_mode()
def classify_image(image: Image.Image) -> tuple[str, float]:
    """Classify an image."""
    # Preprocess
    image = image.convert("RGB")
    image = F.resize(image, 224)
    image = F.center_crop(image, 224)
    image_tensor = F.to_tensor(image)
    normalized_tensor = F.normalize(
        image_tensor,
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]
    )
    # Run model
    logits = model(normalized_tensor[None])
    scores = softmax(logits, dim=1)
    idx = argmax(scores, dim=1)
    score = scores[0,idx].item()
    label = weights.meta["categories"][idx]
    # Return
    return label, score

Compiling the Function

Now, compile the function using the Muna CLI:
Terminal
# Compile the AI function
$ muna compile --overwrite ai.py

Inference Backends

Muna supports a fixed set of backends for running AI inference. You must opt in to using an inference backend for a specific model by providing inference metadata. The provided metadata will allow the Muna compiler to lower the inference operation to native code.

Supported Inference Backends

Below are the supported inference backends.

Llama.cpp

We support compiling Llama instances for inference with llama.cpp:
llm.py
from collections.abc import Iterator
from muna import compile
from llama_cpp import Llama

llm = Llama("smollm2_135m.gguf")

@compile(...)
def llm_chat(messages: list[dict]) -> Iterator[str]:
    for chunk in llm.create_chat_completion(messages=messages, stream=True):
        # Each streamed chunk carries a delta that may contain a piece of the response text
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]
To compile a llama.cpp prediction function like the one above, special care must be taken when creating the compilation sandbox:

Installing a C++ Compiler

The llama-cpp-python Python package builds llama.cpp from source, so the compiler sandbox must contain a C++ compiler toolchain:
llm.py
llm = Llama("smollm2_135m.gguf")

@compile(
    ...,
    # `llama-cpp-python` builds Llama.cpp from source
    sandbox=Sandbox()
        .apt_install("clang")
)
def llm_chat(messages: list[dict]) -> Iterator[str]:
    for chunk in llm.create_chat_completion(messages=messages, stream=True):
        # Each streamed chunk carries a delta that may contain a piece of the response text
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

Installing llama-cpp-python

Next, install the llama-cpp-python Python package:
llm.py
llm = Llama("smollm2_135m.gguf")

@compile(
    ...,
    # Install `llama-cpp-python`
    sandbox=Sandbox()
        .apt_install("clang")
        .pip_install("llama-cpp-python")
)
def llm_chat(messages: list[dict]) -> Iterator[str]:
    for chunk in llm.create_chat_completion(messages=messages, stream=True):
        # Each streamed chunk carries a delta that may contain a piece of the response text
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

Uploading the GGUF Model

Finally, upload the GGUF model so that it is available in the compiler sandbox:
llm.py
llm = Llama("smollm2_135m.gguf")

@compile(
    ...,
    # Upload the GGUF model
    sandbox=Sandbox()
        .apt_install("clang")
        .pip_install("llama-cpp-python")
        .upload_file(llm.model_path)
)
def llm_chat(messages: list[dict]) -> Iterator[str]:
    for chunk in llm.create_chat_completion(messages=messages, stream=True):
        # Each streamed chunk carries a delta that may contain a piece of the response text
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]
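Since llm_chat is an ordinary Python generator, you can exercise it locally before compiling it. A minimal sketch (the prompt is illustrative):
llm.py
if __name__ == "__main__":
    # Stream the chat response piece-by-piece (the prompt is illustrative)
    for part in llm_chat([{"role": "user", "content": "Tell me a fun fact."}]):
        print(part, end="", flush=True)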

ONNXRuntime

Use the OnnxRuntimeInferenceMetadata metadata type to compile a PyTorch nn.Module for inference with ONNXRuntime:
ai.py
from muna import compile
from muna.beta import OnnxRuntimeInferenceMetadata
from torch import randn, Tensor
from torch.nn import Module

# Given a PyTorch model...
model: Module = ...
# With some example arguments...
example_args: list[Tensor] = [randn(1, 3, 224, 224)]

@compile(
    ...,
    metadata=[
        # Use ONNXRuntime for model inference
        OnnxRuntimeInferenceMetadata(
            model=model,
            model_args=example_args
        )
    ]
)
def predict() -> None:
    pass

ONNXRuntime Inference Session

Use the OnnxRuntimeInferenceSessionMetadata metadata type to compile an ONNXRuntime InferenceSession:
ai.py
from muna import compile
from muna.beta import OnnxRuntimeInferenceSessionMetadata
from onnxruntime import InferenceSession

# Given an ONNXRuntime inference session...
onnx_path = "/path/to/model.onnx"
session = InferenceSession(onnx_path)

@compile(
    ...,
    metadata=[
        # Use ONNXRuntime for model inference
        OnnxRuntimeInferenceSessionMetadata(
            session=session,
            model_path=onnx_path
        )
    ]
)
def predict() -> None:
    pass
The model must exist at the provided model_path within the compiler sandbox.
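One way to satisfy this requirement is to upload the ONNX file with the same Sandbox.upload_file method used for the GGUF model above. A minimal sketch, assuming the uploaded file ends up at a matching relative path inside the sandbox; the model path below is illustrative:
ai.py
from muna import compile, Sandbox
from muna.beta import OnnxRuntimeInferenceSessionMetadata
from onnxruntime import InferenceSession

# The relative model path is illustrative
onnx_path = "model.onnx"
session = InferenceSession(onnx_path)

@compile(
    ...,
    # Install ONNXRuntime and upload the ONNX model into the compiler sandbox
    sandbox=Sandbox()
        .pip_install("onnxruntime")
        .upload_file(onnx_path),
    metadata=[
        OnnxRuntimeInferenceSessionMetadata(
            session=session,
            model_path=onnx_path
        )
    ]
)
def predict() -> None:
    pass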

TensorRT

Use the TensorRTInferenceMetadata metadata type to compile a PyTorch nn.Module to TensorRT:
ai.py
from muna import compile
from muna.beta import TensorRTInferenceMetadata
from torch import randn, Tensor
from torch.nn import Module

# Given a PyTorch model...
model: Module = ...
# With some example arguments...
example_args: list[Tensor] = [randn(1, 3, 224, 224)]

@compile(
    ...,
    metadata=[
        # Use TensorRT for model inference
        TensorRTInferenceMetadata(
            model=model,
            model_args=example_args,
            cuda_arch="sm_100",
            precision="int4"
        )
    ]
)
def predict() -> None:
    pass
The TensorRT inference backend is only available on Linux and Windows devices with compatible Nvidia GPUs.
We are working on adding support for consumer RTX GPUs with TensorRT for RTX.

Target CUDA Architectures

TensorRT engines must be compiled for specific target CUDA architectures. Below are CUDA architectures that our compiler supports:
CUDA Architecture | GPU Family
sm_80             | Ampere (e.g. A100)
sm_86             | Ampere
sm_87             | Ampere
sm_89             | Ada Lovelace (e.g. L40S)
sm_90             | Hopper (e.g. H100)
sm_100            | Blackwell (e.g. B200)
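If you are unsure which value to pass for cuda_arch, you can read the compute capability of a local NVIDIA GPU with PyTorch and map it to the table above. A minimal sketch (not part of the Muna API; assumes a CUDA-capable machine):
ai.py
from torch import cuda

# Map the local GPU's compute capability (e.g. (9, 0)) to a CUDA architecture (e.g. "sm_90")
if cuda.is_available():
    major, minor = cuda.get_device_capability()
    print(f"sm_{major}{minor}")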

TensorRT Inference Precision

TensorRT allows for specifying the inference engine’s precision. Below are supported precision modes:
Precision | Notes
fp32      | 32-bit single precision inference.
fp16      | 16-bit half precision inference.
int8      | 8-bit quantized integer inference.
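Putting the two tables together, here is a sketch of metadata for a half-precision engine targeting Hopper GPUs, reusing model and example_args from the listing above (the values are just one valid combination):
ai.py
@compile(
    ...,
    metadata=[
        # Half-precision TensorRT engine for Hopper GPUs (e.g. H100)
        TensorRTInferenceMetadata(
            model=model,
            model_args=example_args,
            cuda_arch="sm_90",
            precision="fp16"
        )
    ]
)
def predict() -> None:
    pass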

CoreML

Use the CoreMLInferenceMetadata metadata type to compile a PyTorch nn.Module to CoreML:
ai.py
from muna import compile
from muna.beta import CoreMLInferenceMetadata
from torch import randn, Tensor
from torch.nn import Module

# Given a PyTorch model...
model: Module = ...
# With some example arguments...
example_args: list[Tensor] = [randn(1, 3, 224, 224)]

@compile(
    ...,
    metadata=[
        # Use CoreML for model inference
        CoreMLInferenceMetadata(
            model=model,
            model_args=example_args
        )
    ]
)
def predict() -> None:
    pass
The CoreML inference backend is only available on iOS, macOS, and visionOS devices.

ExecuTorch

Use the ExecuTorchInferenceMetadata metadata type to compile a PyTorch nn.Module for inference with ExecuTorch:
ai.py
from muna import compile
from muna.beta import ExecuTorchInferenceMetadata
from torch import randn, Tensor
from torch.nn import Module

# Given a PyTorch model...
model: Module = ...
# With some example arguments...
example_args: list[Tensor] = [randn(1, 3, 224, 224)]

@compile(
    ...,
    metadata=[
        # Use ExecuTorch for model inference
        ExecuTorchInferenceMetadata(
            model=model,
            model_args=example_args,
            backend="xnnpack"
        )
    ]
)
def predict() -> None:
    pass
The ExecuTorch inference backend is only available on Android.

ExecuTorch Hardware Backends

ExecuTorch supports several hardware backends to accelerate model inference. Below are targets that are currently supported by Muna:
Backend | Notes
xnnpack | XNNPACK CPU backend. Always enabled.
vulkan  | Vulkan GPU backend. Only supported on Android.
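To opt in to the Vulkan GPU backend, pass backend="vulkan" in the metadata, reusing model and example_args from the listing above (a sketch):
ai.py
@compile(
    ...,
    metadata=[
        # Accelerate ExecuTorch inference with the Vulkan GPU backend
        ExecuTorchInferenceMetadata(
            model=model,
            model_args=example_args,
            backend="vulkan"
        )
    ]
)
def predict() -> None:
    pass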

LiteRT

Use the LiteRTInferenceMetadata metadata type to compile a PyTorch nn.Module for inference with LiteRT:
ai.py
from muna import compile
from muna.beta import LiteRTInferenceMetadata
from torch import randn, Tensor
from torch.nn import Module

# Given a PyTorch model...
model: Module = ...
# With some example arguments...
example_args: list[Tensor] = [randn(1, 3, 224, 224)]

@compile(
    ...,
    metadata=[
        # Use LiteRT for model inference
        LiteRTInferenceMetadata(
            model=model,
            model_args=example_args
        )
    ]
)
def predict() -> None:
    pass
The LiteRT inference backend is only available on Android devices.

QNN

Use the QnnInferenceMetadata metadata type to compile a PyTorch nn.Module to a Qualcomm QNN context binary:
ai.py
from muna import compile
from muna.beta import QnnInferenceMetadata
from torch import randn, Tensor
from torch.nn import Module

# Given a PyTorch model...
model: Module = ...
# With some example arguments...
example_args: list[Tensor] = [randn(1, 3, 224, 224)]

@compile(
    ...,
    metadata=[
        # Use QNN for model inference
        QnnInferenceMetadata(
            model=model,
            model_args=example_args,
            backend="gpu",
            quantization=None
        )
    ]
)
def predict() -> None:
    pass
The QNN inference backend is only available on Android and Windows devices with Qualcomm processors.

QNN Hardware Backends

QNN requires that a hardware device backend is specified ahead of time. Below are supported backends:
Backend | Notes
cpu     | Reference aarch64 CPU backend.
gpu     | Adreno GPU backend, accelerated by OpenCL.
htp     | Hexagon NPU backend.
Learn more about QNN hardware backends.

QNN Model Quantization

When using the htp backend, you must specify a model quantization mode as the Hexagon NPU only supports running integer-quantized models. Below are supported quantization modes:
Quantization | Notes
w8a8         | Weights and activations are quantized to uint8.
w8a16        | Weights are quantized to uint8 while activations are quantized to uint16.
w4a8         | Weights are quantized to uint4 while activations are quantized to uint8.
w4a16        | Weights are quantized to uint4 while activations are quantized to uint16.
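For example, targeting the Hexagon NPU requires pairing the htp backend with one of the quantization modes above. A sketch, reusing model and example_args from the listing above:
ai.py
@compile(
    ...,
    metadata=[
        # Run on the Hexagon NPU with uint8 weights and activations
        QnnInferenceMetadata(
            model=model,
            model_args=example_args,
            backend="htp",
            quantization="w8a8"
        )
    ]
)
def predict() -> None:
    pass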

OpenVINO

Use the OpenVINOInferenceMetadata metadata type to compile a PyTorch nn.Module to OpenVINO IR:
ai.py
from muna import compile
from muna.beta import OpenVINOInferenceMetadata
from torch import randn, Tensor
from torch.nn import Module

# Given a PyTorch model...
model: Module = ...
# With some example arguments...
example_args: list[Tensor] = [randn(1, 3, 224, 224)]

@compile(
    ...,
    metadata=[
        # Use OpenVINO for model inference
        OpenVINOInferenceMetadata(
            model=model,
            model_args=example_args
        )
    ]
)
def predict() -> None:
    pass
At runtime, the OpenVINO IR will be used for inference with the OpenVINO toolkit.
The OpenVINO inference backend is only available on Linux and Windows x86_64 devices with Intel processors.

IREE

Use the IREEInferenceMetadata metadata type to compile a PyTorch nn.Module for inference with IREE:
ai.py
from muna import compile
from muna.beta import IREEInferenceMetadata
from torch import randn, Tensor
from torch.nn import Module

# Given a PyTorch model...
model: Module = ...
# With some example arguments...
example_args: list[Tensor] = [randn(1, 3, 224, 224)]

@compile(
    ...,
    metadata=[
        # Use IREE for model inference
        IREEInferenceMetadata(
            model=model,
            model_args=example_args,
            backend="vulkan"
        )
    ]
)
def predict() -> None:
    pass
The IREE inference backend is only available on Android devices.

IREE HAL Target Backends

IREE supports several HAL target backends that the model can be compiled against. Below are targets that are currently supported by Muna:
Target | Notes
vulkan | Vulkan GPU backend. Only supported on Android.
Additional inference backends are coming soon 🤫.
A single model can be lowered to use multiple inference backends. Simply provide multiple metadata instances that refer to the same model.
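For example, the same PyTorch model could be lowered for both ONNXRuntime and CoreML by listing two metadata instances (a sketch, assuming model and example_args are defined as in the examples above):
ai.py
from muna import compile
from muna.beta import CoreMLInferenceMetadata, OnnxRuntimeInferenceMetadata

@compile(
    ...,
    metadata=[
        # Lower the same model for two inference backends
        OnnxRuntimeInferenceMetadata(model=model, model_args=example_args),
        CoreMLInferenceMetadata(model=model, model_args=example_args)
    ]
)
def predict() -> None:
    pass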

Request an Inference Backend

We are always looking to add support for new inference backends. So if there is an inference backend you would like to see supported in Muna, please reach out to us.