All messages are wrapped in either a ServiceBoundMessage (client to server) or a ClientBoundMessage (server to client).
Messages are binary-encoded protobuf. JSON examples below are shown for readability. Download the proto file.
Connection
Client Messages
Messages sent from client to server, wrapped in ServiceBoundMessage.
InitializeSessionRequest
Must be the first message sent. Configures the session parameters.
| Field | Type | Description |
|---|---|---|
input_audio_line | AudioLineConfiguration | Input audio format configuration |
output_audio_line | AudioLineConfiguration | Output audio format configuration |
vad_configuration | VadConfiguration | Voice activity detection settings |
inference_configuration | InferenceConfiguration | System prompt and model behavior |
tts_configuration | TtsConfiguration | Optional TTS provider config (e.g., ElevenLabs) |
supports_playback_reporting | bool | Whether the client will send PlaybackPositionReport messages. When true, enables accurate context truncation on user interrupt. Defaults to false. |
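An illustrative InitializeSessionRequest, shown JSON-style per the note above. All values are examples, not defaults; field names follow the tables in this document.

```python
# Example InitializeSessionRequest, JSON-style for readability.
# Actual messages are binary-encoded protobuf.
initialize_session_request = {
    "input_audio_line": {
        "sample_rate": 16000,
        "channel_count": 1,
        "sample_format": "SIGNED_16_BIT",
    },
    "output_audio_line": {
        "sample_rate": 24000,
        "channel_count": 1,
        "sample_format": "SIGNED_16_BIT",
    },
    "vad_configuration": {
        "confidence_threshold": 0.6,
        "min_volume": 0.1,
        "start_duration": {"seconds": 0, "nanos": 200_000_000},
        "stop_duration": {"seconds": 1, "nanos": 0},
        "backbuffer_duration": {"seconds": 1, "nanos": 0},
    },
    "inference_configuration": {
        "system_prompt": "You are a helpful voice assistant.",
        "temperature": 0.7,
    },
    "supports_playback_reporting": True,  # client will send PlaybackPositionReport
}
```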
ReconfigureSessionRequest
Reconfigure an ongoing session. Useful for changing audio input settings or the system prompt on the fly. You can update either field or both.
| Field | Type | Description |
|---|---|---|
input_audio_line | AudioLineConfiguration | Updated input audio format configuration |
inference_configuration | InferenceConfiguration | Updated system prompt and model behavior |
UserInput
User input data (audio or text).
| Field | Type | Description |
|---|---|---|
packet_id | uint64 | Client-defined packet identifier for tracking |
mode | InferenceTriggerMode | How to trigger inference for this input |
audio_data | AudioData | Raw PCM audio bytes (one of audio_data or text_data) |
text_data | TextData | Text input (one of audio_data or text_data) |
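Two illustrative UserInput variants, JSON-style. Field names follow the table above; the inner shape of AudioData is not specified here, so a raw-bytes payload is assumed.

```python
# Audio variant: interrupt ongoing inference; recommended for streaming audio.
audio_input = {
    "packet_id": 1,
    "mode": "IMMEDIATE",
    "audio_data": b"\x00\x01" * 160,  # raw PCM bytes matching input_audio_line (shape assumed)
}

# Text variant: run after any current inference completes.
text_input = {
    "packet_id": 2,
    "mode": "QUEUE",
    "text_data": {"data": "What's the weather like in Berlin?"},
}
```

Note that audio_data and text_data are mutually exclusive; each UserInput carries exactly one of them.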
UpdateToolDefinitionsRequest
Define or update available tools. Replaces all existing definitions.
| Field | Type | Description |
|---|---|---|
tool_definitions | ToolDefinition[] | List of tool definitions
Each ToolDefinition contains:
| Field | Type | Description |
|---|---|---|
name | string | Tool identifier used by the model |
description | string | Purpose and functionality description |
parameters | object | JSON Schema for tool parameters |
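An illustrative UpdateToolDefinitionsRequest defining a single hypothetical tool, JSON-style. The tool name, description, and schema are examples.

```python
update_tool_definitions_request = {
    "tool_definitions": [
        {
            "name": "get_weather",  # identifier the model will use
            "description": "Look up the current weather for a city.",
            "parameters": {  # JSON Schema describing the tool's arguments
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    ]
}
```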
ToolCallResponse
Response to a tool call request from the server.
| Field | Type | Description |
|---|---|---|
id | string | Must match the id from ToolCallRequest |
result | string | Tool execution result (any format the model understands) |
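The id in the response must echo the id from the server's ToolCallRequest. A minimal JSON-style round trip, with illustrative values:

```python
# Received from the server.
tool_call_request = {
    "id": "call_123",
    "name": "get_weather",
    "parameters": {"city": "Berlin"},
}

# Sent back by the client after executing the tool.
tool_call_response = {
    "id": tool_call_request["id"],  # must match the request id
    "result": '{"temp_c": 18, "conditions": "cloudy"}',  # any format the model understands
}
```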
TriggerInference
Manually trigger inference processing immediately, instead of waiting for natural pauses or end-of-input signals.
The primary use case is generating an initial greeting. Model behavior may be unpredictable if this is sent directly after a model response.
| Field | Type | Description |
|---|---|---|
extra_instructions | string | Optional extra instructions to guide the inference |
ExportChatHistoryRequest
Request the full conversation history. The server responds with a ChatHistory message.
| Field | Type | Description |
|---|---|---|
await_pending | bool | When true, waits for all in-flight async operations (e.g. transcriptions) to complete before responding |
PlaybackPositionReport
Reports how many audio bytes the client has played. Only sent when the client declares supports_playback_reporting: true in InitializeSessionRequest. The server uses this data to truncate the LLM context to exactly what the user heard when they interrupt. Without it, the server falls back to elapsed-time estimation.
| Field | Type | Description |
|---|---|---|
bytes_played | uint64 | Cumulative number of audio bytes played by the client |
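Accurate reporting only requires counting bytes as chunks finish playing on the audio device. A minimal client-side sketch; the class and method names are hypothetical:

```python
class PlaybackTracker:
    """Tracks cumulative audio bytes played, for PlaybackPositionReport."""

    def __init__(self) -> None:
        self.bytes_played = 0

    def on_chunk_played(self, chunk: bytes) -> None:
        # Call as each chunk finishes playing on the audio device.
        self.bytes_played += len(chunk)

    def report(self) -> dict:
        # Body of a PlaybackPositionReport message, JSON-style.
        return {"bytes_played": self.bytes_played}


tracker = PlaybackTracker()
tracker.on_chunk_played(b"\x00" * 3200)  # e.g. 100 ms of 16 kHz mono s16 audio
tracker.on_chunk_played(b"\x00" * 3200)
```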
DirectSpeech
Instructs the service to speak the given text via TTS immediately, bypassing the LLM. Any active inference is cancelled and the audio buffer is cleared before the text is spoken.
| Field | Type | Description |
|---|---|---|
text | string | Text to speak |
include_in_history | bool | When false, the text is spoken but marked as ephemeral — the LLM won’t know it was spoken |
ConversationQuery
Runs a one-shot LLM inference over the current conversation history without modifying it. Useful for side tasks like summarization or classification. The server responds with a ConversationQueryResult. At least one of prompt or instructions must be provided.
| Field | Type | Description |
|---|---|---|
prompt | string | Replaces the system prompt for this one-shot call. If absent, uses the session’s current system prompt |
instructions | string | Appended as instructions after the conversation turns. If absent, no extra instructions are appended |
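Two illustrative ConversationQuery payloads, JSON-style. The prompt and instruction strings are examples, not required values.

```python
# Uses the session's current system prompt; only appends instructions.
summarize_query = {
    "instructions": "Summarize the conversation so far in two sentences.",
}

# Replaces the system prompt for this one-shot call only.
classify_query = {
    "prompt": "You are a strict classifier. Answer only YES or NO.",
    "instructions": "Did the user ask about billing?",
}
```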
Server Messages
Messages sent from server to client, wrapped in ClientBoundMessage.
ModelTextFragment
Streamed text output as tokens arrive. Used when TTS is not configured.
| Field | Type | Description |
|---|---|---|
text | string | Text content of this fragment |
ModelAudioChunk
TTS audio output when a TTS provider is configured.
| Field | Type | Description |
|---|---|---|
audio | AudioData | Audio bytes matching output_audio_line config |
transcript | string | Optional text that was spoken (alignment data from TTS provider) |
ToolCallRequest
Model requests to execute a tool.
| Field | Type | Description |
|---|---|---|
id | string | Unique identifier for this request |
name | string | Name of the tool to call |
parameters | object | Parameters matching the tool’s schema |
PlaybackClearBuffer
Notification to clear the audio playback buffer. Sent proactively when the user starts speaking, regardless of whether there is ongoing TTS playback. When received, immediately discard any buffered audio that hasn't been played yet. This message may be sent multiple times if the user interrupts repeatedly.
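On receipt, a client can simply drop its queued, not-yet-played chunks. A sketch assuming the client keeps pending audio in a deque; the buffer and handler names are hypothetical:

```python
from collections import deque

# Chunks of ModelAudioChunk audio awaiting playback.
playback_buffer = deque([b"chunk1", b"chunk2"])

def on_playback_clear_buffer() -> None:
    # Discard everything not yet played; audio already sent to the
    # device keeps playing (or is stopped by the client's own policy).
    playback_buffer.clear()

on_playback_clear_buffer()
```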
ResponseBegin
Notification that the model has begun its response.
ResponseEnd
Notification that the model has finished its response.
ChatHistory
The full conversation history, returned in response to ExportChatHistoryRequest.
| Field | Type | Description |
|---|---|---|
messages | ChatMessage[] | Ordered list of conversation messages |
SessionErrorNotification
Structured error notification sent before the server closes the connection.
| Field | Type | Description |
|---|---|---|
category | SessionErrorCategory | Error category for programmatic handling |
message | string | Human-readable message for logging or display |
trace_id | string | Optional trace ID for correlating with server logs |
UserTranscriptionResult
Async transcription result for a completed user audio turn. Sent after the transcription worker finishes processing.
| Field | Type | Description |
|---|---|---|
turn_id | uint32 | Identifies which conversation turn this transcription belongs to |
text | string | Transcribed text |
language | string | Detected language (ISO 639-1 code, e.g. "en") |
ConversationQueryResult
Result of a ConversationQuery request.
| Field | Type | Description |
|---|---|---|
text | string | The LLM’s complete response text |
Type Definitions
AudioLineConfiguration
| Field | Type | Description |
|---|---|---|
sample_rate | uint32 | Sample rate in Hz (e.g., 16000) |
channel_count | uint32 | Number of channels (typically 1 for mono) |
sample_format | SampleFormat | Audio sample format |
SampleFormat
| Value | Description |
|---|---|
UNSIGNED_8_BIT | 8-bit unsigned integer samples |
SIGNED_16_BIT | 16-bit signed integer samples (recommended) |
SIGNED_32_BIT | 32-bit signed integer samples |
FLOAT_32_BIT | 32-bit floating point samples (-1.0 to 1.0)
FLOAT_64_BIT | 64-bit floating point samples (-1.0 to 1.0)
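Sample rate, channel count, and sample format together determine the byte rate, which is useful for converting between audio bytes and wall-clock time (for example when interpreting bytes_played in a PlaybackPositionReport). A sketch; the helper name is hypothetical:

```python
# Bytes per sample for each SampleFormat value above.
BYTES_PER_SAMPLE = {
    "UNSIGNED_8_BIT": 1,
    "SIGNED_16_BIT": 2,
    "SIGNED_32_BIT": 4,
    "FLOAT_32_BIT": 4,
    "FLOAT_64_BIT": 8,
}

def bytes_per_second(cfg: dict) -> int:
    """Byte rate implied by an AudioLineConfiguration."""
    return cfg["sample_rate"] * cfg["channel_count"] * BYTES_PER_SAMPLE[cfg["sample_format"]]

# 16 kHz mono signed 16-bit -> 32000 bytes per second of audio
rate = bytes_per_second({"sample_rate": 16000, "channel_count": 1, "sample_format": "SIGNED_16_BIT"})
```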
VadConfiguration
Voice Activity Detection settings.
| Field | Type | Description |
|---|---|---|
confidence_threshold | float | Min confidence for speech detection (0.0-1.0) |
min_volume | float | Min volume level for speech (0.0-1.0) |
start_duration | Duration | Speech duration to trigger start |
stop_duration | Duration | Silence duration to trigger end |
backbuffer_duration | Duration | Audio buffer before speech start (recommended: 1s) |
InferenceConfiguration
| Field | Type | Description |
|---|---|---|
system_prompt | string | System prompt to guide model behavior |
temperature | double | Controls output randomness. Higher values produce more random output, lower values more deterministic output. |
TtsConfiguration
Optional text-to-speech configuration. If omitted, raw text fragments are sent instead of audio. Two provider variants are available: ElevenLabs and Hosted. The ElevenLabs configuration:
| Field | Type | Description |
|---|---|---|
api_key | string | Your ElevenLabs API key |
voice_id | string | Voice ID (e.g., “21m00Tcm4TlvDq8ikWAM”) |
model_id | string | Optional model ID (e.g., “eleven_turbo_v2”) |
voice_settings | ElevenLabsVoiceSettings | Optional voice fine-tuning settings |
location | ElevenLabsLocation | Service location for data residency (default: US) |
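An illustrative ElevenLabs TtsConfiguration, JSON-style. The API key is a placeholder, and the voice and model IDs are the examples from the table above.

```python
tts_configuration = {
    "api_key": "YOUR_ELEVENLABS_API_KEY",  # placeholder, not a real key
    "voice_id": "21m00Tcm4TlvDq8ikWAM",
    "model_id": "eleven_turbo_v2",  # optional
    "voice_settings": {  # optional fine-tuning
        "stability": 0.5,
        "similarity_boost": 0.75,
        "use_speaker_boost": True,
    },
    "location": "US",  # default region
}
```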
ElevenLabsVoiceSettings
Fine-tuning settings for ElevenLabs voices.
| Field | Type | Description |
|---|---|---|
stability | double | Stability for the voice (0.0-1.0) |
similarity_boost | double | Similarity boost for the voice (0.0-1.0) |
style | double | Style setting for v2 models (0.0-1.0) |
use_speaker_boost | bool | Whether to apply speaker boost |
speed | double | Speed setting for the voice |
ElevenLabsLocation
Controls which ElevenLabs regional endpoint is used. See the ElevenLabs data residency docs for details.
| Value | Description |
|---|---|
US | United States (default) — accessed via https://elevenlabs.io/ |
EU | European Union — requires ElevenLabs enterprise access |
INDIA | India — requires ElevenLabs enterprise access |
Duration
| Field | Type | Description |
|---|---|---|
seconds | uint64 | Whole seconds |
nanos | uint32 | Nanoseconds component |
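This split mirrors the protobuf Duration convention. Converting from milliseconds, as commonly used for VAD thresholds, is a small calculation; the helper name is hypothetical:

```python
def to_duration(milliseconds: int) -> dict:
    """Convert milliseconds to a Duration message, JSON-style."""
    seconds, remainder_ms = divmod(milliseconds, 1000)
    return {"seconds": seconds, "nanos": remainder_ms * 1_000_000}

# 1500 ms -> 1 whole second plus 500,000,000 nanoseconds
stop_duration = to_duration(1500)
```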
InferenceTriggerMode
Controls how this input interacts with ongoing inference.
| Value | Description |
|---|---|
NO_TRIGGER | Don’t trigger inference from this input. Audio is buffered for VAD processing but won’t start inference on its own. |
QUEUE | Queue inference to start after current inference completes (or immediately if idle). |
IMMEDIATE | Interrupt any ongoing inference and start processing new input immediately. Recommended for streaming audio. |
TextData
Text input wrapper.
| Field | Type | Description |
|---|---|---|
data | string | Raw text data |
ChatMessage
A single message in the conversation history.
| Field | Type | Description |
|---|---|---|
role | ChatMessageRole | Role of the entity this message is attributed to |
content | ChatMessageContent[] | Ordered content blocks of this message |
delivery_status | ChatDeliveryStatus | Delivery status of this message |
ephemeral | bool | true when the message was spoken via DirectSpeech with include_in_history: false — audible to the user but not in the LLM’s context |
ChatMessageRole
| Value | Description |
|---|---|
SYSTEM | System message (usually the system prompt) |
USER | User message |
ASSISTANT | Assistant message |
ChatDeliveryStatus
| Value | Description |
|---|---|
DELIVERY_IN_PROGRESS | Turn is still being generated |
DELIVERY_COMPLETE | All content was delivered to the client |
DELIVERY_INTERRUPTED | User interrupted — content reflects what was actually delivered |
ChatMessageContent
A single content block within a chat message. Contains exactly one of the following:
| Field | Type | Description |
|---|---|---|
text_content | ChatTextContent | Text content, optionally with TTS-synthesized audio |
input_audio | ChatAudioData | User input or model-output audio (not TTS-synthesized) |
thoughts | string | Internal model reasoning / chain-of-thought |
tool_call | ToolCallRequest | Tool call requested by the model |
tool_result | ToolCallResponse | Tool execution result |
instructions | string | Model instructions (e.g. directives injected via TriggerInference) |
ChatTextContent
Text content from a conversation turn, with optional TTS audio. When TTS is active, each synthesized sentence becomes a ChatTextContent with both fields populated.
| Field | Type | Description |
|---|---|---|
text | string | The text content |
tts_audio | ChatAudioData | TTS-synthesized audio for this text, if available |
ChatAudioData
Self-describing audio data, including format metadata so consumers can decode it without out-of-band knowledge. If you reconfigure the audio pipeline mid-conversation, the format may change; always inspect the format field rather than assuming it matches the initial configuration.
| Field | Type | Description |
|---|---|---|
audio | AudioData | Raw audio bytes |
format | AudioLineConfiguration | Audio format (sample rate, channels, sample format) |
transcription | string | Transcription of the audio content. Populated asynchronously for user audio turns. |
SessionErrorCategory
Broad error categories for programmatic handling of SessionErrorNotification.
| Value | Description |
|---|---|
ERROR_UNKNOWN | Unknown or unclassified error |
ERROR_SESSION | Session lifecycle errors (not initialized, already initialized) |
ERROR_CONFIGURATION | Configuration errors (invalid audio format, missing required fields) |
ERROR_PROTOCOL | Protocol errors (malformed packets, unexpected message types) |
ERROR_INFERENCE | Inference/AI processing errors (model unavailable, processing failed, timeout) |
ERROR_AUDIO | Audio pipeline errors (codec failure, VAD errors) |
ERROR_TTS | TTS synthesis errors |
ERROR_INTERNAL | Internal service errors (catch-all for server-side issues) |