> ## Documentation Index
> Fetch the complete documentation index at: https://docs.muna.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Choosing Inference Placement

> Choosing where each and every inference runs.

Muna's signature feature is allowing developers to choose where inference runs, per-request.

## Running with Adaptive Placement

Muna can adaptively search for the best hardware to run models, depending on your cost, latency, and throughput requirements. Use the `muna.predictions.create` method, and specify your constraints in natural language:

<CodeGroup>
  ```ts JavaScript icon="js" focus={1,5} theme={null}
  // 🔥 Run inference with the lowest latency
  const prediction = await muna.predictions.create({
    tag: "@openai/gpt-oss-120b",
    inputs: { messages },
    acceleration: "lowest latency"
  });
  ```

  ```py Python icon="python" focus={1,5} theme={null}
  # 🔥 Run inference with the lowest latency
  prediction = muna.predictions.create(
    tag="@openai/gpt-oss-120b",
    inputs={ "messages": messages },
    acceleration="lowest latency"
  )
  ```

  ```ts React Native icon="react" focus={1,5} theme={null}
  // 🔥 Run inference with the lowest latency
  const prediction = await muna.predictions.create({
    tag: "@openai/gpt-oss-120b",
    inputs: { messages },
    acceleration: "lowest latency"
  });
  ```

  ```swift iOS icon="swift" focus={1,5} theme={null}
  // 🔥 Run inference with the lowest latency
  let prediction = try await muna.predictions.create(
    tag: "@openai/gpt-oss-120b",
    inputs: ["messages": messages],
    acceleration: "lowest latency"
  )
  ```

  ```kt Android icon="android" focus={1,5} theme={null}
  // 🔥 Run inference with the lowest latency
  val prediction = muna.predictions.create(
    "@openai/gpt-oss-120b",
    mapOf("messages" to messages),
    "lowest latency"
  )
  ```

  ```csharp Unity icon="unity" focus={1,5} theme={null}
  // 🔥 Run inference with the lowest latency
  var prediction = await muna.Beta.Predictions.Remote.Create(
    tag: "@openai/gpt-oss-120b",
    inputs: new() { ["messages"] = messages },
    acceleration: "lowest latency"
  );
  ```
</CodeGroup>

<Warning>
  This feature is in early alpha, and is only offered to specific teams.
  Request access [on our Slack](https://muna.ai/slack).
</Warning>

### Specifying Placement Constraints

We strongly recommend anchoring your placement constraints around these three canonical intents:

| Intent         | Examples                                                          |
| :------------- | :---------------------------------------------------------------- |
| **Cost**       | `cheapest`, `lowest cost in the cloud`, `under $0.02`             |
| **Latency**    | `fastest`, `minimize latency`, `lowest latency that runs locally` |
| **Throughput** | `highest throughput`, `at least 100 requests per second`          |

<Tip>
  You can also specify constraints with combinations of cost, latency, and
  throughput intents e.g. `lowest cost under 200ms`.
</Tip>

## Running on Datacenter GPUs

Use the `muna.predictions.create` method, and specify a `remote_*` acceleration to run inference on a datacenter GPU:

<CodeGroup>
  ```ts JavaScript icon="js" focus={1,5} theme={null}
  // 🔥 Run inference with an Nvidia B200 GPU
  const prediction = await muna.predictions.create({
    tag: "@bytedance/depth-anything-3",
    inputs: { image },
    acceleration: "remote_b200"
  });
  ```

  ```py Python icon="python" focus={1,5} theme={null}
  # 🔥 Run inference with an Nvidia B200 GPU
  prediction = muna.predictions.create(
    tag="@bytedance/depth-anything-3",
    inputs={ "image": image },
    acceleration="remote_b200"
  )
  ```

  ```ts React Native icon="react" focus={1,5} theme={null}
  // 🔥 Run inference with an Nvidia B200 GPU
  const prediction = await muna.beta.predictions.remote.create({
    tag: "@bytedance/depth-anything-3",
    inputs: { image },
    acceleration: "remote_b200"
  });
  ```

  ```swift iOS icon="swift" focus={1,5} theme={null}
  // 🔥 Run inference with an Nvidia B200 GPU
  let prediction = try await muna.beta.predictions.remote.create(
    tag: "@fxn/greeting",
    inputs: ["image": image],
    acceleration: .remote_b200
  )
  ```

  ```kt Android icon="android" focus={1,5} theme={null}
  // 🔥 Run inference with an Nvidia B200 GPU
  val prediction = muna.beta.predictions.remote.create(
    "@bytedance/depth-anything-3",
    mapOf("image" to image),
    "remote_b200"
  )
  ```

  ```csharp Unity icon="unity" focus={1,5} theme={null}
  // 🔥 Run inference with an Nvidia B200 GPU
  var prediction = await muna.Predictions.Create(
    tag: "@fxn/greeting",
    inputs: new() { ["name"] = "Sosa" },
    acceleration: "remote_b200"
  );
  ```
</CodeGroup>

### Supported Datacenter GPUs

Below are the currently supported cloud GPUs:

| Acceleration     | Notes                                                                  |
| :--------------- | :--------------------------------------------------------------------- |
| `remote_auto`    | Run inference on the ideal datacenter hardware.                        |
| `remote_cpu`     | Run inference on AMD CPU servers.                                      |
| `remote_a10`     | Run inference on an Nvidia A10 GPU.                                    |
| `remote_a100`    | Run inference on an Nvidia A100 GPU.                                   |
| `remote_h100`    | Run inference on an Nvidia H100 GPU.                                   |
| `remote_b200`    | Run inference on an Nvidia B200 GPU.                                   |
| `remote_mi350x`  | Run inference on an AMD MI350X GPU. **Coming soon**.                   |
| `remote_mi355x`  | Run inference on an AMD MI355X GPU. **Coming soon**.                   |
| `remote_qaic100` | Run inference on a Qualcomm Cloud AI 100 accelerator. **Coming soon**. |

<Tip>
  If you want to self-host the GPU servers in your VPC or on-prem, [reach out to us](mailto:hi@muna.ai).
</Tip>

## Running Locally

Use the `muna.predictions.create` method to run inference locally:

<CodeGroup>
  ```ts JavaScript icon="js" focus={1,5} theme={null}
  // 🔥 Run inference with the local NPU
  const prediction = await muna.predictions.create({
    tag: "@bytedance/depth-anything-3",
    inputs: { image },
    acceleration: "local_npu"
  });
  ```

  ```py Python icon="python" focus={1,5} theme={null}
  # 🔥 Run inference with the local NPU
  prediction = muna.predictions.create(
    tag="@bytedance/depth-anything-3",
    inputs={ "image": image },
    acceleration="local_npu"
  )
  ```

  ```ts React Native icon="react" focus={1,5} theme={null}
  // 🔥 Run inference with the local NPU
  const prediction = await muna.predictions.create({
    tag: "@bytedance/depth-anything-3",
    inputs: { image },
    acceleration: "local_npu"
  });
  ```

  ```swift iOS icon="swift" focus={1,5} theme={null}
  // 🔥 Run inference with the Apple Neural Engine
  let prediction = try await muna.predictions.create(
    tag: "@bytedance/depth-anything-3",
    inputs: ["image": image],
    acceleration: .npu
  )
  ```

  ```kt Android icon="android" focus={1,5} theme={null}
  // 🔥 Run inference with the local NPU
  val prediction = muna.predictions.create(
    "@bytedance/depth-anything-3",
    mapOf("image" to image),
    Acceleration.NPU
  )
  ```

  ```csharp Unity icon="unity" focus={1,5} theme={null}
  // 🔥 Run inference with the local NPU
  var prediction = await muna.Predictions.Create(
    tag: "@bytedance/depth-anything-3",
    inputs: new() { ["image"] = image },
    acceleration: @"local_npu"
  );
  ```
</CodeGroup>

### Supported Local Processors

Below are the currently supported local processors:

| Acceleration | Notes                                                          |
| :----------- | :------------------------------------------------------------- |
| `local_cpu`  | Use the CPU to accelerate predictions. This is always enabled. |
| `local_gpu`  | Use the GPU to accelerate predictions.                         |
| `local_npu`  | Use the neural processor to accelerate predictions.            |

<Info>
  Muna currently does not support multi-GPU local acceleration. This is planned for the future.
</Info>

### Specifying the Local GPU

Some Muna clients allow you to specify the acceleration device used to make predictions.
Our clients expose this field as an untyped integer or pointer.
The underlying type depends on the current operating system:

| OS       | Device type     | Notes                                                                                                  |
| -------- | --------------- | ------------------------------------------------------------------------------------------------------ |
| Android  | -               | Currently unsupported.                                                                                 |
| iOS      | `id<MTLDevice>` | [Metal device](https://developer.apple.com/documentation/metal/mtldevice?language=objc).               |
| Linux    | `int*`          | Pointer to [CUDA device ID](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html). |
| macOS    | `id<MTLDevice>` | [Metal device](https://developer.apple.com/documentation/metal/mtldevice?language=objc).               |
| visionOS | `id<MTLDevice>` | [Metal device](https://developer.apple.com/documentation/metal/mtldevice?language=objc).               |
| Web      | `GPUDevice`     | [WebGPU device](https://developer.mozilla.org/en-US/docs/Web/API/GPUDevice).                           |
| Windows  | `ID3D12Device*` | [DirectX 12 device](https://learn.microsoft.com/en-us/windows/win32/api/d3d12/nn-d3d12-id3d12device).  |

<Info>
  The prediction `device` is merely a hint. Setting a `device` does not guarantee that all
  or any operation in the prediction function will actually use that acceleration device.
</Info>

<Danger>
  **You should absolutely (absolutely) never ever do this unless you know what the hell you're doing**.
</Danger>
