The OpenAI client is widely used by developers to consume AI inference in their applications. Muna’s OpenAI-compatible client lets those developers run millions of open-source AI models without changing their existing code. This guide explains how to use parameter annotations to compile models that work with Muna’s OpenAI-compatible client.

Compiling Chat Completion Models

You can compile chat completion models compatible with Muna’s openai.chat.completions.create interface.
1. Accepting Chat Messages

Chat completion functions should accept a list of input messages with type list[muna.beta.openai.Message]:
llm.py
from muna import compile, Parameter
from muna.beta.openai import Message
from typing import Annotated

@compile(...)
def create_chat_completion(
    messages: Annotated[
        list[Message],
        Parameter.Generic(description="Messages comprising the conversation so far.")
    ]
):
    ...
2. Returning Chat Completion Chunks

Chat completion functions must return an iterator of completion chunks, with type Iterator[muna.beta.openai.ChatCompletionChunk]:
llm.py
from muna import compile, Parameter
from muna.beta.openai import ChatCompletionChunk, Message
from typing import Annotated, Iterator

@compile(...)
def create_chat_completion(
    messages: Annotated[
        list[Message],
        Parameter.Generic(description="Messages comprising the conversation so far.")
    ]
) -> Iterator[ChatCompletionChunk]:
    ...
3. Creating Chat Completions

We recommend using the llama-cpp-python package to create chat completions with llama.cpp:
llm.py
from muna import compile, Parameter
from muna.beta.openai import ChatCompletionChunk, Message
from llama_cpp import Llama
from typing import Annotated, Iterator

# Load the model weights (a GGUF file) once at module load
model = Llama(model_path=model_path)

@compile(...)
def create_chat_completion(
    messages: Annotated[
        list[Message],
        Parameter.Generic(description="Messages comprising the conversation so far.")
    ]
) -> Iterator[ChatCompletionChunk]:
    stream = model.create_chat_completion(
        messages=messages,
        max_tokens=1_000,
        stream=True
    )
    for chunk in stream:
        yield chunk
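When `stream=True`, llama-cpp-python yields plain dictionaries that follow the OpenAI chat completion chunk schema, so they can be yielded through unchanged. As a rough illustration of that shape, here is a hypothetical stand-in generator (`fake_chat_completion` is illustrative only, not part of Muna’s or llama-cpp-python’s API) and how a consumer reassembles the streamed text:

```python
from typing import Iterator

def fake_chat_completion(reply: str) -> Iterator[dict]:
    """Yield OpenAI-style chat completion chunk dicts for a canned reply."""
    for token in reply.split():
        yield {
            "object": "chat.completion.chunk",
            "choices": [{"index": 0, "delta": {"content": token + " "}}],
        }
    # A final chunk with an empty delta signals the end of the stream
    yield {
        "object": "chat.completion.chunk",
        "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}],
    }

# Reassemble the streamed text from the chunk deltas
text = "".join(
    chunk["choices"][0]["delta"].get("content", "")
    for chunk in fake_chat_completion("Hello from the stream")
)
```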

Compiling Embedding Models

You can compile text embedding models compatible with Muna’s openai.embeddings.create interface.
1. Accepting Input Texts

Embedding functions should accept a list of input texts to embed, as a list[str]:
embed_text.py
from muna import compile, Parameter
from numpy import ndarray
from typing import Annotated

@compile(...)
def embed_text(
    texts: Annotated[
        list[str],
        Parameter.Generic(description="Input texts to embed.")
    ]
) -> ndarray:
    ...
2. Returning the Embeddings

Embedding functions must return the embedding matrix as a NumPy ndarray. The array must have a Parameter.Embedding annotation:
embed_text.py
from muna import compile, Parameter
from numpy import ndarray
from typing import Annotated

@compile(...)
def embed_text(
    texts: Annotated[
        list[str],
        Parameter.Generic(description="Input texts to embed.")
    ]
) -> Annotated[
    ndarray,
    Parameter.Embedding(description="Embedding matrix.")
]:
    ...
The returned ndarray must have a float32 data type.
The returned ndarray must be a 2D array with shape (N,D), where N is the number of input texts and D is the embedding dimension.
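A common way to satisfy both constraints is to stack per-text vectors into a matrix and cast the result. A minimal sketch, where `embed_one` is a hypothetical stand-in for a real model’s per-text embedding:

```python
import numpy as np

def embed_one(text: str) -> np.ndarray:
    """Hypothetical stand-in for a real model's per-text embedding."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def embed_text(texts: list[str]) -> np.ndarray:
    # Stack N per-text vectors into an (N, D) matrix and cast to float32
    matrix = np.stack([embed_one(t) for t in texts])
    return matrix.astype(np.float32)

embeddings = embed_text(["hello", "world"])
# embeddings has shape (2, 384) and dtype float32
```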
3. (Optional) Supporting Matryoshka Embeddings

Some embedding models allow for specifying the number of embedding dimensions, based on Matryoshka representation learning. To expose this setting, add an int parameter with the Parameter.EmbeddingDims annotation:
embed_text.py
from muna import compile, Parameter
from numpy import ndarray
from typing import Annotated

@compile(...)
def embed_text(
    texts: Annotated[
        list[str],
        Parameter.Generic(description="Input texts to embed.")
    ],
    dimensions: Annotated[
        int,
        Parameter.EmbeddingDims(
            description="The number of dimensions the embeddings should have.",
            min=256,
            max=768
        )
    ] = 768
) -> Annotated[
    ndarray,
    Parameter.Embedding(description="Embedding matrix.")
]:
    ...
To remain compatible with the OpenAI embeddings interface, the function must have only one required parameter. As a result, make sure to specify a default value for all other parameters.
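With Matryoshka-trained models, reducing dimensionality typically amounts to keeping the first `dimensions` components of each full embedding and re-normalizing. A minimal NumPy sketch of that post-processing step (assuming your model emits a full-width (N, D) matrix):

```python
import numpy as np

def truncate_embeddings(full: np.ndarray, dimensions: int) -> np.ndarray:
    """Keep the first `dimensions` components and L2-renormalize each row."""
    truncated = full[:, :dimensions]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return (truncated / norms).astype(np.float32)

# Example: reduce 768-dimensional embeddings down to 256 dimensions
full = np.random.default_rng(0).standard_normal((2, 768)).astype(np.float32)
reduced = truncate_embeddings(full, 256)
# reduced has shape (2, 256) and each row has unit L2 norm
```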

Compiling Speech Models

You can compile text-to-speech models compatible with Muna’s openai.audio.speech.create interface.
1. Accepting Input Text

Text-to-speech functions should accept an input text str:
generate_speech.py
from muna import compile, Parameter
from numpy import ndarray
from typing import Annotated

@compile(...)
def generate_speech(
    text: Annotated[
        str,
        Parameter.Generic(description="Input text.")
    ]
) -> ndarray:
    ...
2. Accepting a Generation Voice

Text-to-speech functions must also accept a generation voice argument. We recommend using a Literal or StrEnum type. Regardless of the type you choose, the parameter must have a Parameter.AudioVoice annotation:
generate_speech.py
from muna import compile, Parameter
from numpy import ndarray
from typing import Annotated, Literal

@compile(...)
def generate_speech(
    text: Annotated[
        str,
        Parameter.Generic(description="Input text.")
    ],
    voice: Annotated[
        Literal["voice_a", "voice_b"],
        Parameter.AudioVoice(description="Voice to use in generating audio.")
    ]
) -> ndarray:
    ...
The generation voice must be a required parameter, because developers are required to specify the voice in the OpenAI interface.
3. Returning the Generated Audio

Speech generation functions must return the generated audio as a NumPy ndarray containing linear PCM samples. The array must have a Parameter.Audio annotation:
generate_speech.py
from muna import compile, Parameter
from numpy import ndarray
from typing import Annotated, Literal

@compile(...)
def generate_speech(
    text: Annotated[
        str,
        Parameter.Generic(description="Input text.")
    ],
    voice: Annotated[
        Literal["voice_a", "voice_b"],
        Parameter.AudioVoice(description="Voice to use in generating audio.")
    ]
) -> Annotated[
    ndarray,
    Parameter.Audio(description="Generated speech.", sample_rate=24_000)
]:
    ...
The returned ndarray must have a float32 data type.
The returned ndarray must be either a 1D array with shape (F,) for single-channel audio, or a 2D array with shape (F,C), where F is the frame count and C is the channel count (samples interleaved).
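Many speech models emit int16 PCM, which must be rescaled to float32 in [-1, 1] before returning. A minimal conversion sketch:

```python
import numpy as np

def to_float32_pcm(samples: np.ndarray) -> np.ndarray:
    """Convert int16 PCM samples to float32 in [-1, 1]."""
    # int16 spans [-32768, 32767]; dividing by 32768 maps it into [-1, 1]
    return samples.astype(np.float32) / 32768.0

mono = to_float32_pcm(np.array([0, 16384, -32768], dtype=np.int16))
# mono has dtype float32 and shape (F,), suitable for single-channel audio
```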
4. (Optional) Supporting Audio Speed

Some text-to-speech functions support configuring the speed of the generated audio. To expose this setting, add a float parameter with a Parameter.AudioSpeed annotation:
generate_speech.py
from muna import compile, Parameter
from numpy import ndarray
from typing import Annotated, Literal

@compile(...)
def generate_speech(
    text: Annotated[str, Parameter.Generic(description="Input text.")],
    voice: Annotated[
        Literal["voice_a", "voice_b"],
        Parameter.AudioVoice(description="Voice to use in generating audio.")
    ],
    speed: Annotated[
        float,
        Parameter.AudioSpeed(
            description="The speed of the generated audio.",
            min=0.25,
            max=4.0
        )
    ] = 1.0
) -> Annotated[
    ndarray,
    Parameter.Audio(description="Generated speech.", sample_rate=24_000)
]:
    ...
The audio speed parameter must have a default value, because it is an optional setting in the OpenAI interface.

Compiling Transcription Models

You can compile speech-to-text models compatible with Muna’s openai.audio.transcriptions.create interface.
1. Accepting Input Audio

Transcription functions should accept input audio as a NumPy ndarray with a Parameter.Audio annotation:
moonshine_base.py
from muna import compile, Parameter
from numpy import ndarray
from typing import Annotated

@compile(...)
def moonshine_base(
    audio: Annotated[
        ndarray,
        Parameter.Audio(
            description="Audio to transcribe with shape (F,C).",
            sample_rate=24_000
        )
    ]
) -> str:
    ...
When a user runs your compiled model with an audio file (MP3, WAV, etc.), the Muna client will decode it and resample it to your required sample_rate.
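Since the decoded audio arrives with shape (F,C), and many transcription models expect a mono signal, a common first step is to downmix by averaging the channels. A minimal sketch:

```python
import numpy as np

def to_mono(audio: np.ndarray) -> np.ndarray:
    """Downmix (F,C) interleaved audio to a mono (F,) signal."""
    if audio.ndim == 2:
        # Average across the channel axis and keep the float32 dtype
        return audio.mean(axis=1).astype(np.float32)
    return audio.astype(np.float32)

stereo = np.zeros((24_000, 2), dtype=np.float32)  # 1 second of silence at 24 kHz
mono = to_mono(stereo)
# mono has shape (24_000,)
```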
2. Returning the Transcribed Text

Transcription functions should return the transcribed text as a string:
moonshine_base.py
from muna import compile, Parameter
from numpy import ndarray
from typing import Annotated

@compile(...)
def moonshine_base(
    audio: Annotated[
        ndarray,
        Parameter.Audio(
            description="Audio to transcribe with shape (F,C).",
            sample_rate=24_000
        )
    ]
) -> Annotated[
    str,
    Parameter.Generic(description="Transcribed text.")
]:
    ...