Muna’s signature feature is allowing developers to choose where inference runs, per-request.

Running with Adaptive Placement

Muna can adaptively search for the best hardware to run models, depending on your cost, latency, and throughput requirements. Use the muna.predictions.create method, and specify your constraints in natural language:
// 🔥 Run inference with the lowest latency
const prediction = await muna.predictions.create({
  tag: "@openai/gpt-oss-120b",
  inputs: { messages },
  acceleration: "lowest latency"
});
This feature is in early alpha and is only available to select teams. Request access on our Slack.

Specifying Placement Constraints

We strongly recommend anchoring your placement constraints around these three canonical intents:
Intent       Examples
Cost         cheapest, lowest cost in the cloud, under $0.02
Latency      fastest, minimize latency, lowest latency that runs locally
Throughput   highest throughput, at least 100 requests per second
You can also combine cost, latency, and throughput intents in a single constraint, e.g. lowest cost under 200ms.
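A combined constraint is passed the same way as a single intent. A minimal sketch, reusing the request shape from the example above (the tag and inputs are placeholders, and the client call is commented out so the snippet stands alone):

```typescript
// 🔥 Combine cost and latency intents in a single constraint
// Sketch only: the tag and inputs are borrowed from the example above.
const request = {
  tag: "@openai/gpt-oss-120b",
  inputs: { messages: [{ role: "user", content: "Hello!" }] },
  acceleration: "lowest cost under 200ms", // cost + latency combined
};
// const prediction = await muna.predictions.create(request);
```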

Running on Datacenter GPUs

Use the muna.beta.predictions.remote.create method to run inference on a datacenter GPU:
// 🔥 Run inference with an Nvidia B200 GPU
const prediction = await muna.beta.predictions.remote.create({
  tag: "@bytedance/depth-anything-3",
  inputs: { image },
  acceleration: "remote_b200"
});

Supported Datacenter GPUs

Below are the currently supported cloud GPUs:
Acceleration     Notes
remote_auto      Run inference on the ideal datacenter hardware.
remote_cpu       Run inference on AMD CPU servers.
remote_a10       Run inference on an Nvidia A10 GPU.
remote_a100      Run inference on an Nvidia A100 GPU.
remote_h100      Run inference on an Nvidia H100 GPU.
remote_b200      Run inference on an Nvidia B200 GPU.
remote_mi350x    Run inference on an AMD MI350X GPU. Coming soon.
remote_mi355x    Run inference on an AMD MI355X GPU. Coming soon.
remote_qaic100   Run inference on a Qualcomm Cloud AI 100 accelerator. Coming soon.
If you want to self-host the GPU servers in your VPC or on-prem, reach out to us.
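To defer hardware selection to Muna entirely, pass remote_auto from the table above. A minimal sketch (the tag is borrowed from the example above; the client call is commented out so the snippet stands alone):

```typescript
// 🔥 Let Muna pick the ideal datacenter hardware
// Sketch only: the tag is a placeholder from the example above.
const request = {
  tag: "@bytedance/depth-anything-3",
  inputs: {}, // e.g. { image }
  acceleration: "remote_auto",
};
// const prediction = await muna.beta.predictions.remote.create(request);
```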

Running Locally

Use the muna.predictions.create method to run inference locally:
// 🔥 Run inference with the local NPU
const prediction = await muna.predictions.create({
  tag: "@bytedance/depth-anything-3",
  inputs: { image },
  acceleration: "local_npu"
});

Supported Local Processors

Below are the currently supported local processors:
Acceleration   Notes
local_cpu      Use the CPU to accelerate predictions. This is always enabled.
local_gpu      Use the GPU to accelerate predictions.
local_npu      Use the neural processor to accelerate predictions.
Muna currently does not support multi-GPU acceleration. This is planned for the future.
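Because local_cpu is always enabled, it makes a natural fallback when a faster local processor is unavailable. A hypothetical helper sketching that selection (the availability flags are assumptions for illustration; Muna clients may expose processor availability differently, or not at all):

```typescript
// Hypothetical helper: pick the fastest available local processor.
// The `hasNpu` and `hasGpu` flags are assumptions for illustration;
// per the table above, local_cpu is always enabled, so it is the final fallback.
function pickLocalAcceleration(hasNpu: boolean, hasGpu: boolean): string {
  if (hasNpu) return "local_npu";
  if (hasGpu) return "local_gpu";
  return "local_cpu";
}
```

For example, pickLocalAcceleration(false, true) resolves to "local_gpu", while pickLocalAcceleration(false, false) falls back to "local_cpu".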

Specifying the Local GPU

Some Muna clients allow you to specify the acceleration device used to make predictions. Our clients expose this field as an untyped integer or pointer. The underlying type depends on the current operating system:
OS         Device type     Notes
Android    -               Currently unsupported.
iOS        id<MTLDevice>   Metal device.
Linux      int*            Pointer to a CUDA device ID.
macOS      id<MTLDevice>   Metal device.
visionOS   id<MTLDevice>   Metal device.
Web        GPUDevice       WebGPU device.
Windows    ID3D12Device*   DirectX 12 device.
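As a sketch of what passing the hint can look like in TypeScript, the request below carries an optional device field. The field name and shape are assumptions for illustration, not the actual client API; consult your Muna client's documentation for the real field:

```typescript
// Hypothetical request shape for illustration only; the real field name for the
// device hint depends on the Muna client you are using.
interface PredictionRequest {
  tag: string;
  inputs: Record<string, unknown>;
  device?: unknown; // GPUDevice on the web, id<MTLDevice> on Apple platforms, etc.
}

const request: PredictionRequest = {
  tag: "@bytedance/depth-anything-3",
  inputs: {},
  // device: gpuDevice, // e.g. a GPUDevice obtained via navigator.gpu on the web
};
```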
The prediction device is merely a hint: setting a device does not guarantee that any (or all) operations in the prediction function will actually run on that device.
Do not set this field unless you know exactly what you are doing.