Imagine if you could run Python code everywhere. Python is an incredibly simple and mature language. But
because it requires an interpreter, Python either cannot run natively on a given platform, or it incurs a significant
performance cost compared to languages that are “closer to hardware”.
We are on a mission to build a world where you can run Python everywhere and still get the performance benefits of native code.
Muna works by lowering your Python code to native code. The benefit is that developers
can think and write code in a high-level language, but still get the raw performance of a low-level language.
Our compiler begins by building an intermediate representation (IR) of your Python function using a combination
of static analysis and symbolic tracing. We use PEP 523 to hook into a sandboxed Python
interpreter before executing your function. We can then build a trace of every operation that happened within your
function.
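PEP 523's frame-evaluation hook is part of CPython's C API, so it cannot be shown in a few lines of Python; the core idea of building a trace, however, can be illustrated with a toy tracer that records each operation it sees. This is only a sketch of the concept, not Muna's implementation:

```python
# Illustrative sketch only: records the operations a function performs on its
# inputs by running it on proxy objects. Muna's actual tracer hooks CPython's
# frame evaluation (PEP 523) rather than relying on operator-overloading proxies.
from math import pi

class Node:
    """One recorded operation: an op kind, a target, and its arguments."""
    def __init__(self, op, target, args):
        self.op, self.target, self.args = op, target, args

    def __repr__(self):
        return self.target

class Proxy:
    """Stands in for a real value and records every operation applied to it."""
    def __init__(self, graph, node):
        self.graph, self.node = graph, node

    def __pow__(self, other):
        return _record(self.graph, "call_function", "pow", (self.node, other))

    def __rmul__(self, other):
        return _record(self.graph, "call_function", "mul", (other, self.node))

def _record(graph, op, target, args):
    node = Node(op, target, args)
    graph.append(node)
    return Proxy(graph, node)

def trace(fn, arg_name):
    """Run `fn` on a proxy input and return the recorded operation graph."""
    graph = [Node("placeholder", arg_name, ())]
    fn(Proxy(graph, graph[0]))
    return graph

# Tracing a simple function yields one node per operation it performed
for node in trace(lambda radius: pi * radius ** 2, "radius"):
    print(node.op, node.target, node.args)
# placeholder radius ()
# call_function pow (radius, 2)
# call_function mul (3.141592653589793, pow)
```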
PEP 523 also forms the foundation of torch.compile in PyTorch 2.0.
In fact, this is what spurred the initial development of Muna. Read the paper.
For example, consider the following function which classifies an image:
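The original snippet is not reproduced here; the following is a plausible sketch (the torchvision preprocessing, ResNet-50 model, and @compile tag are illustrative assumptions) of a function that would produce a resize node like the one discussed below:

```python
from muna import compile
from PIL import Image
import torch
from torchvision.models import resnet50, ResNet50_Weights
from torchvision.transforms.functional import center_crop, normalize, resize, to_tensor

weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights).eval()

@compile(
    tag="@yusuf/classify",  # hypothetical tag, for illustration only
    description="Classify an image into one of the ImageNet categories."
)
def classify(image: Image.Image) -> str:
    image = resize(image, [232])      # appears as the `resize` node in the IR graph
    image = center_crop(image, [224])
    x = normalize(to_tensor(image), mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    with torch.inference_mode():
        logits = model(x.unsqueeze(0))
    return weights.meta["categories"][logits.argmax().item()]
```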
We lower an IR graph to several different platform-specific implementations in native code. We do so by walking through the
graph’s nodes and finding one or more operators that implement the node. Take the resize node in the
graph above:
| Type | Name | Target | Args |
| --- | --- | --- | --- |
| call_function | resize | <function resize at 0x11d3ae950> | (image, [232]) |
We search through our library of native operators that perform a resize operation on an image.
Here’s an example targeting Apple Silicon with Accelerate.framework:
Our compiler infrastructure is hardware-aware and allows us to lower Python operations to code as low-level as assembly and PTX. This also allows us to work with hardware vendors to hyper-optimize individual operations for their hardware.
Performance Optimization via Exhaustive Search
Because each IR node can map to several different low-level operations, we simply generate all possible
implementations of a prediction function, ship them out to different users, and gather telemetry data to
discover the implementation with the best performance for each unique device. As a result, we can run orders of
magnitude more performance experiments than is possible with manual performance tuning.
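In effect, the final selection step reduces to choosing, per device, the implementation with the best observed latency. A simplified sketch of that idea (the telemetry schema, device names, and implementation names here are illustrative, not Muna's actual data model):

```python
from collections import defaultdict
from statistics import median

# Illustrative telemetry samples: (device, implementation, latency in milliseconds).
telemetry = [
    ("iPhone15,2", "resize_accelerate", 0.8),
    ("iPhone15,2", "resize_neon",       1.3),
    ("sm_86",      "resize_ptx",        0.2),
    ("sm_86",      "resize_cpu",        2.9),
]

def best_implementations(samples):
    """Pick the implementation with the lowest median latency per device."""
    latencies = defaultdict(list)
    for device, impl, ms in samples:
        latencies[(device, impl)].append(ms)
    best = {}
    for (device, impl), values in latencies.items():
        ms = median(values)
        if device not in best or ms < best[device][1]:
            best[device] = (impl, ms)
    return best

print(best_implementations(telemetry))
# {'iPhone15,2': ('resize_accelerate', 0.8), 'sm_86': ('resize_ptx', 0.2)}
```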
If you are a hardware vendor interested in enabling developers to leverage your custom accelerators, reach out to us!
You’ll bring the hardware; we’ll bring the software.
Finally, we compile the lowered native code for each of our supported targets.
When an application uses the muna.predictions.create method to create a prediction, our client SDK will download
a compiled binary, load it into the process, and invoke it. You can inspect the native source code generated by Muna using the Muna CLI.
Take an example Python function:
area.py
```python
from math import pi

from muna import compile

@compile(
    tag="@yusuf/area",
    description="Compute the area of a circle given its radius."
)
def area(radius: float) -> float:
    return pi * radius ** 2
```
We can compile this function using the Muna CLI:
```bash
# Compile the function
$ muna compile --overwrite area.py
```
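Applications can then run the compiled function through the muna.predictions.create method mentioned earlier. A minimal sketch, assuming the client is constructed as Muna(), inputs are passed as a dictionary, and results are returned on prediction.results (check the SDK reference for the exact signature):

```python
from muna import Muna  # assumed client class name

muna = Muna()
prediction = muna.predictions.create(
    tag="@yusuf/area",
    inputs={ "radius": 4.0 }  # assumed input-passing convention
)
print(prediction.results[0])  # ≈ 50.27
```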
Once compiled, use the Muna CLI to inspect the generated native source code:
```bash
# Get the generated native code for the current device
$ muna source --predictor @yusuf/area
```
Muna can generate hundreds of implementations for a given compiled function. As such, prefer
the --prediction <prediction id> option instead of --predictor <tag>.
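For example, keeping the placeholder from the note above:

```bash
# Inspect the native code used for a specific prediction
$ muna source --prediction <prediction id>
```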
The result is a reference source file including the relevant native methods: