How it Works

Imagine if you could run Python code everywhere. Python is an incredibly simple and mature language. But because it requires an interpreter, it either cannot run natively cross-platform; or it incurs a significant performance cost compared to languages that are “closer to hardware”.

We are on a mission to build a world where to get the performance benefits thereof.

Muna works by lowering your Python code to native code. The benefit is that developers can think and write code in a high-level language, but still get the raw performance of a low-level language.

Symbolic Tracing

Our compiler begins by building an intermediate representation (IR) of your Python function using a combination of static analysis and symbolic tracing. We use PEP 523 to hook into a sandboxed Python interpreter before executing your function. We can then build a trace of every operation that happened within your function.

PEP 523 also forms the foundation of torch.compile in PyTorch 2.0. In fact, this is what spurred initial development of Muna. Read the paper.

For example, consider the following function which classifies an image:

classifier.py

from muna import compile, Sandbox
from PIL import Image
from torchvision.models import mobilenet_v2, MobileNet_V2_Weights
from typing import Tuple

# Create MobileNet model
weights = MobileNet_V2_Weights.DEFAULT
model = mobilenet_v2(weights=weights).eval()
preprocess = weights.transforms()
labels = weights.meta["categories"]

# Define predictor
@compile(
    tag="@vision-co/image-classifier",
    description="Classify an image using the MobileNet v2 model.",
    sandbox=Sandbox().pip_install("torch", "torchvision")
)
def predict(image: Image.Image) -> Tuple[str, float]:
    batch = preprocess(image)[None]
    logits = model(batch).squeeze(0).softmax(0)
    class_id = logits.argmax().item()
    score = logits[class_id].item()
    label = labels[class_id]
    return label, score

The symbolic tracer generates a graph that looks like the following:

type           name                 target                                         args
-------------  -------------------  ---------------------------------------------  -------------------------------------------------------------------
input          image                image                                          ()
call_function  resize               <function resize at 0x11d3ae950>               (image, [232])
call_function  center_crop          <function center_crop at 0x11d3aeb90>          (resize, [224])
call_function  pil_to_tensor        <function pil_to_tensor at 0x11d3ae5f0>        (center_crop,)
call_function  convert_image_dtype  <function convert_image_dtype at 0x11d3ae680>  (pil_to_tensor, torch.float32)
call_function  normalize            <function normalize at 0x11d3ae7a0>            (convert_image_dtype,)
call_function  getitem              <built-in function getitem>                    (normalize, None)
call_module    model                model                                          (getitem,)
call_method    squeeze              squeeze                                        (model, 0)
call_method    softmax              softmax                                        (squeeze, 0)
call_method    argmax               argmax                                         (softmax,)
call_method    item                 item                                           (argmax,)
call_function  getitem_1            <built-in function getitem>                    (softmax, item)
call_method    item_1               item                                           (getitem_1,)
call_function  list_index           <function list_index at 0x11d6d5c60>           (['tench', 'goldfish', 'plow', ..., 'toilet tissue'], item)  {}
output         output               output                                         ((list_index, item_1),)

We inject more metadata with static analysis to build a full IR for your Python code. After this, we then lower the IR to native code.

Lowering to Native Code

We lower an IR graph to several different platform-specific implementations in native code. We do so by walking through the graph’s nodes and finding one or more operators that implement the node. Take the resize node in the graph above:

Type	Name	Target	Args
`call_function`	`resize`	`<function resize at 0x11d3ae950>`	`(image, [232])`

We search through our library of native operators that perform a resize operation on an image. Here’s an example targeting Apple Silicon with Accelerate.framework:

resize.mm

@import Accelerate;

muna::image ResizeImageAppleSilicon(
    muna::image image,
    NSArray<NSNumber*> size
) {
    ...
    vImageScale_ARGB8888(...);
    ...
}

This approach gives us two main benefits:

Bare Metal Hardware Access

Our compiler infrastructure is hardware-aware and allows us to lower Python operations to code as low-level as Assembly and PTX.This also allows us to work with hardware vendors to hyper-optimize individual operations for their hardware.

Performance Optimization via Exhaustive Search

Because each IR node can map to several different low-level operations, we simply generate all possible implementations of a prediction function, ship them out to different users, and gather telemetry data to discover the implementation with the best performance for each unique device.As a result of this, we are able to run orders of magnitude more performance experiments than what is possible with manual performance tuning.

If you are a hardware vendor interested in enabling developers to leverage your custom accelerators, reach out to us! You’ll bring the hardware; we’ll bring the software.

Compiling the Native Binaries

Finally, we compile the lowered native code for each of our supported targets. When an application uses the muna.predictions.create method to create a prediction, our client SDK will download a compiled binary, load it into the process, and invoke it. You can inspect the native source code generated by Muna using the Muna CLI. Take an example Python function:

area.py

from muna import compile

@compile(
    tag="@yusuf/area",
    description="Compute the area of a circle given its radius."
)
def area(radius: float) -> float:
    return pi * radius ** 2

We can compile this function using the Muna CLI:

# Compile the function
$ muna compile --overwrite area.py

Inspecting the Generated Code

Once compiled, use the Muna CLI to inspect the generated native source code:

# Get the generated native code for the current device
$ muna source --predictor @yusuf/area

Muna can generate hundreds of implementations for a given compiled function. As such, prefer the --prediction <prediction id> option instead of --predictor <tag>.

The result is a reference source file including the relevant native methods:

float area__area(float radius) {
    int32_t const_2 = 2;
    float temp1 = _operator_pow(radius, const_2);
    float temp2 = _operator_mul(area__pi, temp1);
    return temp2;
}
...

Muna currently generates native code in C++, supported by Rust.

If you would like to compile Python functons to a custom platform, reach out to us.

Get Started

Making Predictions

Creating Predictors

Insiders

Symbolic Tracing

Lowering to Native Code

Compiling the Native Binaries

Inspecting the Generated Code

Get Started

Making Predictions

Creating Predictors

Insiders

​Symbolic Tracing

​Lowering to Native Code

​Compiling the Native Binaries

​Inspecting the Generated Code

Symbolic Tracing

Lowering to Native Code

Compiling the Native Binaries

Inspecting the Generated Code