OpenAI’s client is widely used by developers who consume AI inference in their applications. This guide explains how to use parameter annotations to create predictors that can be consumed through Muna’s mock OpenAI client.
Muna’s mock OpenAI client allows developers to use millions of open-source AI models without changing their existing code.

Creating Chat Completion Predictors

You can create chat completion predictors compatible with Muna’s openai.chat.completions.create interface.
1. Accepting Chat Messages

Chat completion predictors should accept a list of input messages with type list[muna.beta.openai.Message]:
llm.py
from muna import compile, Parameter
from muna.beta.openai import Message
from typing import Annotated

@compile(...)
def create_chat_completion(
    messages: Annotated[
        list[Message],
        Parameter.Generic(description="Messages comprising the conversation so far.")
    ]
):
    ...
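Each Message mirrors the role/content shape of OpenAI’s chat messages, so the conversation a client sends might look like the following (a sketch; the exact Message fields are assumed to match OpenAI’s schema):
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
]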
2. Returning Chat Completion Chunks

Chat completion predictors must return an iterator of completion chunks, with type Iterator[muna.beta.openai.ChatCompletionChunk]:
llm.py
from muna import compile, Parameter
from muna.beta.openai import ChatCompletionChunk, Message
from typing import Annotated, Iterator

@compile(...)
def create_chat_completion(
    messages: Annotated[
        list[Message],
        Parameter.Generic(description="Messages comprising the conversation so far.")
    ]
) -> Iterator[ChatCompletionChunk]:
    ...
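Each yielded chunk follows the shape of OpenAI’s chat.completion.chunk objects, which is also the format that llama-cpp-python streams in the next step. As an illustration (assuming ChatCompletionChunk matches OpenAI’s wire format), a single chunk carries a delta with the newly generated tokens:
{
    "object": "chat.completion.chunk",
    "choices": [{
        "index": 0,
        "delta": {"role": "assistant", "content": "Hello"},
        "finish_reason": None
    }]
}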
3. Creating Chat Completions

We recommend the llama-cpp-python package to create chat completions with Llama.cpp:
llm.py
from muna import compile, Parameter
from muna.beta.openai import ChatCompletionChunk, Message
from llama_cpp import Llama
from typing import Annotated, Iterator

# `model_path` is the path to your GGUF model weights
model = Llama(model_path=model_path)

@compile(...)
def create_chat_completion(
    messages: Annotated[
        list[Message],
        Parameter.Generic(description="Messages comprising the conversation so far.")
    ]
) -> Iterator[ChatCompletionChunk]:
    stream = model.create_chat_completion(
        messages=messages,
        max_tokens=1_000,
        stream=True
    )
    # llama-cpp-python streams chunks in OpenAI's chat.completion.chunk format
    for chunk in stream:
        yield chunk
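Once compiled, the predictor can be called through Muna’s mock OpenAI client exactly like a hosted model. A minimal sketch, assuming client is Muna’s mock OpenAI client and "@you/llm" is a hypothetical predictor tag:
# Assumption: `client` is Muna's mock OpenAI client, which mirrors
# the standard `openai` package; "@you/llm" is a hypothetical tag.
stream = client.chat.completions.create(
    model="@you/llm",
    messages=[{"role": "user", "content": "Tell me a joke."}],
    stream=True
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")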

Creating Embedding Predictors

You can create text embedding predictors compatible with Muna’s openai.embeddings.create interface.
1. Accepting Input Texts

Embedding predictors should accept a list of input texts to embed, as a list[str]:
embed_text.py
from muna import compile, Parameter
from numpy import ndarray
from typing import Annotated

@compile(...)
def embed_text(
    texts: Annotated[
        list[str],
        Parameter.Generic(description="Input texts to embed.")
    ]
) -> ndarray:
    ...
2. Returning the Embeddings

Embedding predictors must return an embedding matrix as a NumPy ndarray. The array must have a Parameter.Embedding annotation:
embed_text.py
from muna import compile, Parameter
from numpy import ndarray
from typing import Annotated

@compile(...)
def embed_text(
    texts: Annotated[
        list[str],
        Parameter.Generic(description="Input texts to embed.")
    ]
) -> Annotated[
    ndarray,
    Parameter.Embedding(description="Embedding matrix.")
]:
    ...
The returned array must have a float32 data type.
The returned array must be a 2D array with shape (N,D), where N is the number of input texts and D is the embedding dimension.
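As a concrete illustration, here is a minimal sketch that satisfies these requirements using the sentence-transformers package (one possible implementation; the model name is only an example):
embed_text.py
from muna import compile, Parameter
from numpy import ndarray
from sentence_transformers import SentenceTransformer
from typing import Annotated

model = SentenceTransformer("all-MiniLM-L6-v2")

@compile(...)
def embed_text(
    texts: Annotated[
        list[str],
        Parameter.Generic(description="Input texts to embed.")
    ]
) -> Annotated[
    ndarray,
    Parameter.Embedding(description="Embedding matrix.")
]:
    # `encode` returns a float32 matrix with shape (N,D)
    return model.encode(texts)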
3. (Optional) Supporting Matryoshka Embeddings

Some embedding models allow for specifying the number of embedding dimensions, based on Matryoshka representation learning. To expose this setting, add an int parameter with the Parameter.EmbeddingDims annotation:
embed_text.py
from muna import compile, Parameter
from numpy import ndarray
from typing import Annotated

@compile(...)
def embed_text(
    texts: Annotated[
        list[str],
        Parameter.Generic(description="Input texts to embed.")
    ],
    dimensions: Annotated[
        int,
        Parameter.EmbeddingDims(
            description="The number of dimensions the embeddings should have.",
            min=256,
            max=768
        )
    ] = 768
) -> Annotated[
    ndarray,
    Parameter.Embedding(description="Embedding matrix.")
]:
    ...
To remain compatible with the OpenAI embeddings interface, the predictor must have only one required parameter. As a result, make sure to specify a default value for the dimensions parameter.
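Inside the predictor body, Matryoshka truncation usually amounts to slicing the full embedding matrix to the requested width and re-normalizing. A minimal sketch of that step (assuming full_embeddings is the (N,768) matrix computed by your model):
import numpy as np

# Slice each embedding to the requested width, then L2-normalize
# so the truncated vectors remain unit length.
embeddings = full_embeddings[:, :dimensions]
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
embeddings = (embeddings / norms).astype(np.float32)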

Creating Speech Predictors

You can create speech generation predictors compatible with Muna’s openai.audio.speech.create interface.
1. Accepting Input Text

Speech generation predictors should accept an input text str:
generate_speech.py
from muna import compile, Parameter
from numpy import ndarray
from typing import Annotated

@compile(...)
def generate_speech(
    text: Annotated[str, Parameter.Generic(description="Input text.")]
) -> ndarray:
    ...
2. Accepting a Generation Voice

Speech generation predictors must also accept a generation voice argument. We recommend using a Literal or StrEnum type. Regardless of the type you choose, the parameter must have a Parameter.AudioVoice annotation:
generate_speech.py
from muna import compile, Parameter
from numpy import ndarray
from typing import Annotated, Literal

@compile(...)
def generate_speech(
    text: Annotated[str, Parameter.Generic(description="Input text.")],
    voice: Annotated[
        Literal["voice_a", "voice_b"],
        Parameter.AudioVoice(description="Voice to use in generating audio.")
    ]
) -> ndarray:
    ...
The generation voice must be a required parameter, because developers are required to specify the voice in the OpenAI interface.
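If you prefer an enum over a Literal, the equivalent declaration with a StrEnum looks like this:
from enum import StrEnum

class Voice(StrEnum):
    VOICE_A = "voice_a"
    VOICE_B = "voice_b"

# The `voice` parameter is then declared as:
voice: Annotated[
    Voice,
    Parameter.AudioVoice(description="Voice to use in generating audio.")
]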
3. Returning the Generated Audio

Speech generation predictors must return the generated audio as a NumPy ndarray containing linear PCM samples. The array must have a Parameter.Audio annotation:
generate_speech.py
from muna import compile, Parameter
from numpy import ndarray
from typing import Annotated, Literal

@compile(...)
def generate_speech(
    text: Annotated[str, Parameter.Generic(description="Input text.")],
    voice: Annotated[
        Literal["voice_a", "voice_b"],
        Parameter.AudioVoice(description="Voice to use in generating audio.")
    ]
) -> Annotated[
    ndarray,
    Parameter.Audio(description="Generated speech.", sample_rate=24_000)
]:
    ...
The returned array must have a float32 data type.
The returned array must be either a 1D array with shape (F,) for single-channel audio, or a 2D array with shape (C,F), where C is the channel count and F is the frame count.
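Many TTS models emit 16-bit integer PCM, so a common final step is converting samples to float32 in the [-1.0, 1.0] range. A minimal sketch:
import numpy as np

def to_float32_pcm(samples: np.ndarray) -> np.ndarray:
    # Scale int16 samples into float32 values in [-1.0, 1.0]
    return samples.astype(np.float32) / 32768.0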
4. (Optional) Supporting Audio Speed

Some speech generation predictors support configuring the speed of the generated audio. To expose this setting, add a float parameter with a Parameter.AudioSpeed annotation:
generate_speech.py
from muna import compile, Parameter
from numpy import ndarray
from typing import Annotated, Literal

@compile(...)
def generate_speech(
    text: Annotated[str, Parameter.Generic(description="Input text.")],
    voice: Annotated[
        Literal["voice_a", "voice_b"],
        Parameter.AudioVoice(description="Voice to use in generating audio.")
    ],
    speed: Annotated[
        float,
        Parameter.AudioSpeed(
            description="The speed of the generated audio.",
            min=0.25,
            max=4.0
        )
    ] = 1.0
) -> Annotated[
    ndarray,
    Parameter.Audio(description="Generated speech.", sample_rate=24_000)
]:
    ...
The audio speed parameter must have a default value, because it is an optional setting in the OpenAI interface.
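As with chat completions, a compiled speech predictor can then be invoked through the mock client. A minimal sketch, assuming client is Muna’s mock OpenAI client and "@you/tts" is a hypothetical predictor tag:
# Assumption: `client` is Muna's mock OpenAI client and
# "@you/tts" is a hypothetical predictor tag.
response = client.audio.speech.create(
    model="@you/tts",
    voice="voice_a",
    input="Hello from Muna!",
    speed=1.5
)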