Compiling AI models for on-device inference is Muna's raison d'être 🗽. Start by writing a Python function that runs a PyTorch model (here, one from `torchvision`), then apply the `@compile` decorator to the function to prepare it for compilation:
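Below is a minimal sketch of what this looks like. The `@compile` decorator is named on this page, but its import path and its `tag` and `description` arguments are assumptions for illustration:

```python
import torch
from muna import compile  # import path assumed
from torchvision.models import resnet18, ResNet18_Weights

# Load a pretrained torchvision classifier.
model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()

@compile(
    tag="@example/resnet-18",  # hypothetical tag
    description="Classify an image with ResNet-18."
)
def predict(image: torch.Tensor) -> torch.Tensor:
    # Run a single forward pass over a preprocessed image batch.
    with torch.inference_mode():
        return model(image)
```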
## Sandbox

Muna compiles your function inside a sandbox, which defines the Python packages and system tools that are available at compile time:
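A hedged sketch of a sandbox definition follows. The `Sandbox` builder and its `pip_install`/`apt_install` methods are assumptions based on how the sandbox is described on this page:

```python
from muna import Sandbox, compile  # import paths assumed

# Hypothetical builder-style sandbox: declare the Python and system
# packages the function needs at compile time.
sandbox = (
    Sandbox()
    .pip_install("torch", "torchvision")  # Python dependencies
    .apt_install("libgomp1")              # system dependency (example)
)

@compile(
    tag="@example/sandboxed-fn",  # hypothetical tag
    description="A function compiled in a custom sandbox.",
    sandbox=sandbox
)
def normalize(x: list[float]) -> list[float]:
    total = sum(x)
    return [v / total for v in x]
```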
## Llama.cpp

Muna can compile `Llama` instances for inference with llama.cpp. Note that the `llama-cpp-python` Python package builds llama.cpp from source, so the compiler sandbox must contain a C++ compiler toolchain along with the `llama-cpp-python` Python package:
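The sketch below illustrates this. The `Llama` constructor and `create_completion` call are real `llama-cpp-python` APIs; the `LlamaCppInferenceMetadata` type name, its constructor arguments, and the `Sandbox` builder methods are assumptions:

```python
from llama_cpp import Llama
from muna import Sandbox, compile
from muna.beta import LlamaCppInferenceMetadata  # hypothetical type name

# llama.cpp is built from source, so the sandbox needs a C++ toolchain.
sandbox = (
    Sandbox()
    .apt_install("build-essential", "cmake")  # builder methods assumed
    .pip_install("llama-cpp-python")
)

# Load a GGUF model; the path must exist in the compiler sandbox.
llm = Llama(model_path="model.gguf")

@compile(
    tag="@example/llama-chat",  # hypothetical tag
    description="Complete a prompt with a llama.cpp model.",
    sandbox=sandbox,
    metadata=[LlamaCppInferenceMetadata(model=llm)]  # arguments assumed
)
def complete(prompt: str) -> str:
    result = llm.create_completion(prompt, max_tokens=128)
    return result["choices"][0]["text"]
```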
## OnnxRuntime
Use the `OnnxRuntimeInferenceMetadata` metadata type to compile a PyTorch `nn.Module` for inference with ONNX Runtime:
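A minimal sketch, assuming the type lives in `muna.beta` (this page imports other metadata types from there) and accepts `model` and `model_args` arguments:

```python
import torch
from muna import compile
from muna.beta import OnnxRuntimeInferenceMetadata  # import path assumed

model = torch.nn.Linear(4, 2).eval()

@compile(
    tag="@example/linear-ort",  # hypothetical tag
    description="Run a linear layer with ONNX Runtime.",
    metadata=[
        OnnxRuntimeInferenceMetadata(
            model=model,
            model_args=[torch.randn(1, 4)]  # sample input for export; argument names assumed
        )
    ]
)
def predict(x: torch.Tensor) -> torch.Tensor:
    return model(x)
```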
## OnnxRuntime (Inference Session)
Use the `OnnxRuntimeInferenceSessionMetadata` metadata type to compile an OnnxRuntime `InferenceSession`:
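A hedged sketch follows; `InferenceSession` is the real `onnxruntime` API, while the metadata constructor arguments are assumptions:

```python
import numpy as np
from onnxruntime import InferenceSession
from muna import compile
from muna.beta import OnnxRuntimeInferenceSessionMetadata  # import path assumed

# The ONNX model must exist at this path in the compiler sandbox.
session = InferenceSession("model.onnx")

@compile(
    tag="@example/onnx-session",  # hypothetical tag
    description="Run a pre-exported ONNX model with ONNX Runtime.",
    metadata=[
        OnnxRuntimeInferenceSessionMetadata(
            session=session,
            model_path="model.onnx"  # argument names assumed
        )
    ]
)
def predict(x: np.ndarray) -> np.ndarray:
    return session.run(None, {"input": x})[0]
```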
Note that the `model_path` must refer to an ONNX model file that exists in the compiler sandbox.

## TensorRT
Use the `TensorRTInferenceMetadata` metadata type to compile a PyTorch `nn.Module` to TensorRT, as shown in the sketch after the tables below.
Below are supported CUDA architectures:

| CUDA Architecture | GPU Family |
|---|---|
| `sm_80` | Ampere (e.g. A100) |
| `sm_86` | Ampere |
| `sm_87` | Ampere |
| `sm_89` | Ada Lovelace (e.g. L40S) |
| `sm_90` | Hopper (e.g. H100) |
| `sm_100` | Blackwell (e.g. B200) |
Below are supported precision modes:

| Precision | Notes |
|---|---|
| `fp32` | 32-bit single precision inference. |
| `fp16` | 16-bit half precision inference. |
| `int8` | 8-bit quantized integer inference. |
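A hedged sketch, assuming the metadata accepts `model`, `model_args`, and architecture/precision arguments (names assumed):

```python
import torch
from muna import compile
from muna.beta import TensorRTInferenceMetadata  # import path assumed

model = torch.nn.Linear(4, 2).eval()

@compile(
    tag="@example/linear-trt",  # hypothetical tag
    description="Run a linear layer with TensorRT.",
    metadata=[
        TensorRTInferenceMetadata(
            model=model,
            model_args=[torch.randn(1, 4)],
            cuda_archs=["sm_90"],  # argument names assumed
            precision="fp16"
        )
    ]
)
def predict(x: torch.Tensor) -> torch.Tensor:
    return model(x)
```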
## CoreML

Muna can also compile a PyTorch `nn.Module` for inference with CoreML on Apple platforms.
## ExecuTorch

Use the `ExecuTorchInferenceMetadata` metadata type to compile a PyTorch `nn.Module` for inference with ExecuTorch (see the sketch after the table below). Below are supported backends:

| Backend | Notes |
|---|---|
| `xnnpack` | XNNPACK CPU backend. Always enabled. |
| `vulkan` | Vulkan GPU backend. Only supported on Android. |
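A minimal sketch, with the import path and constructor arguments assumed:

```python
import torch
from muna import compile
from muna.beta import ExecuTorchInferenceMetadata  # import path assumed

model = torch.nn.Linear(4, 2).eval()

@compile(
    tag="@example/linear-executorch",  # hypothetical tag
    description="Run a linear layer with ExecuTorch.",
    metadata=[
        ExecuTorchInferenceMetadata(
            model=model,
            model_args=[torch.randn(1, 4)],
            backends=["xnnpack", "vulkan"]  # argument names assumed
        )
    ]
)
def predict(x: torch.Tensor) -> torch.Tensor:
    return model(x)
```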
## LiteRT (TensorFlow Lite)

Muna can also compile a PyTorch `nn.Module` for inference with LiteRT, formerly known as TensorFlow Lite.
## QNN

Use the `QnnInferenceMetadata` metadata type to compile a PyTorch `nn.Module` to a Qualcomm QNN context binary (see the sketch at the end of this section). The `backend` is specified ahead of time. Below are supported backends:

| Backend | Notes |
|---|---|
| `cpu` | Reference aarch64 CPU backend. |
| `gpu` | Adreno GPU backend, accelerated by OpenCL. |
| `htp` | Hexagon NPU backend. |
When using the `htp` backend, you must specify a model quantization mode, as the Hexagon NPU only supports running integer-quantized models. Below are supported quantization modes:

| Quantization | Notes |
|---|---|
| `w8a8` | Weights and activations are quantized to `uint8`. |
| `w8a16` | Weights are quantized to `uint8` while activations are quantized to `uint16`. |
| `w4a8` | Weights are quantized to `uint4` while activations are quantized to `uint8`. |
| `w4a16` | Weights are quantized to `uint4` while activations are quantized to `uint16`. |
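A hedged sketch targeting the Hexagon NPU; the import path and constructor argument names are assumptions:

```python
import torch
from muna import compile
from muna.beta import QnnInferenceMetadata  # import path assumed

model = torch.nn.Linear(4, 2).eval()

@compile(
    tag="@example/linear-qnn",  # hypothetical tag
    description="Run a linear layer on the Hexagon NPU.",
    metadata=[
        QnnInferenceMetadata(
            model=model,
            model_args=[torch.randn(1, 4)],
            backend="htp",        # argument names assumed
            quantization="w8a16"  # required for the `htp` backend
        )
    ]
)
def predict(x: torch.Tensor) -> torch.Tensor:
    return model(x)
```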
## OpenVINO

Use the `OpenVINOInferenceMetadata` metadata type to compile a PyTorch `nn.Module` to OpenVINO IR:
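A minimal sketch, with the import path and constructor arguments assumed:

```python
import torch
from muna import compile
from muna.beta import OpenVINOInferenceMetadata  # import path assumed

model = torch.nn.Linear(4, 2).eval()

@compile(
    tag="@example/linear-openvino",  # hypothetical tag
    description="Run a linear layer with OpenVINO.",
    metadata=[
        OpenVINOInferenceMetadata(
            model=model,
            model_args=[torch.randn(1, 4)]  # argument names assumed
        )
    ]
)
def predict(x: torch.Tensor) -> torch.Tensor:
    return model(x)
```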
Note that OpenVINO inference is only supported on x86_64 devices with Intel processors.

## IREE
Use the `muna.beta.IREEInferenceMetadata` metadata type to compile a PyTorch `nn.Module` for inference with IREE (see the sketch after the table below). The metadata specifies a target that the model can be compiled against. Below are targets that are currently supported by Muna:

| Target | Notes |
|---|---|
| `vulkan` | Vulkan GPU backend. Only supported on Android. |
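A hedged sketch; the constructor argument names are assumptions:

```python
import torch
from muna import compile
from muna.beta import IREEInferenceMetadata  # module path named on this page

model = torch.nn.Linear(4, 2).eval()

@compile(
    tag="@example/linear-iree",  # hypothetical tag
    description="Run a linear layer with IREE.",
    metadata=[
        IREEInferenceMetadata(
            model=model,
            model_args=[torch.randn(1, 4)],
            target="vulkan"  # argument names assumed
        )
    ]
)
def predict(x: torch.Tensor) -> torch.Tensor:
    return model(x)
```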
## MIGraphX

Muna can also compile a PyTorch `nn.Module` for inference with AMD MIGraphX.

When compiling, you can provide multiple metadata instances that refer to the model.