Every message is wrapped in either ServiceBoundMessage (client to server) or ClientBoundMessage (server to client).
Messages are binary-encoded protobuf. JSON examples below are shown for readability. Download the proto file.
Connection
Client Messages
Messages sent from client to server, wrapped in ServiceBoundMessage.
InitializeSessionRequest
Must be the first message sent. Configures the session parameters.
| Field | Type | Description |
|---|---|---|
| input_audio_line | AudioLineConfiguration | Input audio format configuration |
| output_audio_line | AudioLineConfiguration | Output audio format configuration |
| vad_configuration | VadConfiguration | Voice activity detection settings |
| inference_configuration | InferenceConfiguration | System prompt and model behavior |
| tts_configuration | TtsConfiguration | Optional TTS provider config (e.g., ElevenLabs) |
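
As a rough illustration, the fields above could be assembled as follows, using the JSON form this page uses for readability. This is a sketch, not the wire format: real messages are binary-encoded protobuf. The wrapper key `initialize_session_request` and the `system_prompt` field name are assumptions that do not come from the proto file, and the VAD values are illustrative rather than documented defaults.

```python
import json


def audio_line(sample_rate: int, channel_count: int, sample_format: str) -> dict:
    """Build an AudioLineConfiguration in its JSON form."""
    return {
        "sample_rate": sample_rate,
        "channel_count": channel_count,
        "sample_format": sample_format,
    }


def build_initialize_session(system_prompt: str) -> dict:
    """Build the first message of a session (JSON form, for readability)."""
    return {
        # Assumption: this wrapper key name is hypothetical; check the proto file.
        "initialize_session_request": {
            "input_audio_line": audio_line(16000, 1, "SIGNED_16_BIT"),
            "output_audio_line": audio_line(16000, 1, "SIGNED_16_BIT"),
            "vad_configuration": {
                # Illustrative values only, not documented defaults.
                "confidence_threshold": 0.5,
                "min_volume": 0.01,
                "start_duration": {"seconds": 0, "nanos": 200_000_000},
                "stop_duration": {"seconds": 0, "nanos": 500_000_000},
                "backbuffer_duration": {"seconds": 1, "nanos": 0},
            },
            # Assumption: `system_prompt` is a hypothetical field name.
            "inference_configuration": {"system_prompt": system_prompt},
            # tts_configuration is optional and omitted here.
        }
    }


message = build_initialize_session("You are a helpful voice assistant.")
print(json.dumps(message, indent=2))
```

Leaving out `tts_configuration` means the server would reply with text fragments rather than audio, as described under TtsConfiguration.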
ReconfigureSessionRequest
Reconfigure an ongoing session. Useful for changing audio input settings on the fly.
| Field | Type | Description |
|---|---|---|
| input_audio_line | AudioLineConfiguration | Updated input audio format configuration |
UserInput
User input data (audio or text).
| Field | Type | Description |
|---|---|---|
| packet_id | uint64 | Client-defined packet identifier for tracking |
| mode | InferenceTriggerMode | How to trigger inference for this input |
| audio_data | AudioData | Raw PCM audio bytes (one of audio_data or text_data) |
| text_data | TextData | Text input (one of audio_data or text_data) |
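
One way to picture the two input variants is the sketch below, which splits a PCM buffer into UserInput packets and builds a single text packet. The `data` field inside `audio_data` is an assumption modeled on TextData, since AudioData is not defined on this page. The helper names are hypothetical.

```python
from typing import Iterator


def audio_packets(pcm: bytes, chunk_bytes: int, mode: str = "IMMEDIATE") -> Iterator[dict]:
    """Yield one UserInput per chunk of raw PCM, with increasing packet ids."""
    for packet_id, offset in enumerate(range(0, len(pcm), chunk_bytes)):
        yield {
            "packet_id": packet_id,
            "mode": mode,
            # Assumption: AudioData carries its bytes in a `data` field.
            "audio_data": {"data": pcm[offset:offset + chunk_bytes]},
        }


def text_packet(packet_id: int, text: str, mode: str = "QUEUE") -> dict:
    """Build a UserInput that carries text instead of audio."""
    return {"packet_id": packet_id, "mode": mode, "text_data": {"data": text}}


# 10 bytes of silence split into 4-byte chunks gives three packets.
packets = list(audio_packets(bytes(10), chunk_bytes=4))
hello = text_packet(len(packets), "Hello there")
```

Each packet sets exactly one of `audio_data` or `text_data`, matching the one-of constraint in the table.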
UpdateToolDefinitionsRequest
Define or update available tools. Replaces all existing definitions.
| Field | Type | Description |
|---|---|---|
| tool_definitions | ToolDefinition[] | List of tool definitions |

Each ToolDefinition contains:

| Field | Type | Description |
|---|---|---|
| name | string | Tool identifier used by the model |
| description | string | Purpose and functionality description |
| parameters | object | JSON Schema for tool parameters |
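
For illustration, a single tool definition with a JSON Schema for its parameters might look like the sketch below. The `get_weather` tool and its `city` parameter are hypothetical and exist only for this example.

```python
# Hypothetical tool used only for illustration.
get_weather_tool = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        # Standard JSON Schema describing the arguments the model may pass.
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. Oslo"},
        },
        "required": ["city"],
    },
}

# The request replaces all existing definitions, so always send the
# complete list of tools that should remain available.
update_tool_definitions_request = {"tool_definitions": [get_weather_tool]}
```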
ToolCallResponse
Response to a tool call request from the server.
| Field | Type | Description |
|---|---|---|
| id | string | Must match the id from ToolCallRequest |
| result | string | Tool execution result (any format the model understands) |
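
A client-side handler could pair each ToolCallRequest with its response roughly as follows. This is a minimal sketch: the `get_weather` stub and the `TOOLS` registry are hypothetical, and returning JSON text as the result is one choice among many, since the result may be any format the model understands.

```python
import json
from typing import Callable, Dict


def get_weather(city: str) -> str:
    """Hypothetical tool stub; a real client would call an actual service."""
    return json.dumps({"city": city, "forecast": "sunny"})


# Hypothetical registry mapping tool names to local functions.
TOOLS: Dict[str, Callable[..., str]] = {"get_weather": get_weather}


def handle_tool_call(request: dict) -> dict:
    """Run the requested tool and build a ToolCallResponse with the same id."""
    handler = TOOLS.get(request["name"])
    if handler is None:
        result = json.dumps({"error": f"unknown tool: {request['name']}"})
    else:
        result = handler(**request["parameters"])
    # The id must be echoed back so the server can match the response.
    return {"id": request["id"], "result": result}


response = handle_tool_call(
    {"id": "call-1", "name": "get_weather", "parameters": {"city": "Oslo"}}
)
```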
TriggerInference
Manually trigger inference processing immediately, instead of waiting for natural pauses or end-of-input signals.
The primary use case is generating an initial greeting. Model behavior may be unpredictable if this message is sent directly after a model response.
| Field | Type | Description |
|---|---|---|
| extra_instructions | string | Optional extra instructions to guide the inference |
Server Messages
Messages sent from server to client, wrapped in ClientBoundMessage.
ModelTextFragment
Streamed text output as tokens arrive. Used when TTS is not configured.
| Field | Type | Description |
|---|---|---|
| text | string | Text content of this fragment |
ModelAudioChunk
TTS audio output when a TTS provider is configured.
| Field | Type | Description |
|---|---|---|
| audio | AudioData | Audio bytes matching output_audio_line config |
| transcript | string | Optional text that was spoken |
ToolCallRequest
Model requests to execute a tool.
| Field | Type | Description |
|---|---|---|
| id | string | Unique identifier for this request |
| name | string | Name of the tool to call |
| parameters | object | Parameters matching the tool’s schema |
PlaybackClearBuffer
Notification to clear the audio playback buffer. Sent proactively when the user starts speaking, regardless of whether there is ongoing TTS playback.
When received, immediately discard any buffered audio that hasn’t been played yet. This message may be sent multiple times if the user interrupts multiple times.
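
On the client, this behavior could be sketched with a simple queue of pending audio chunks. The `PlaybackBuffer` class and the string message kinds are hypothetical. A real client would dispatch on the decoded ClientBoundMessage instead.

```python
from collections import deque
from typing import Deque, Optional


class PlaybackBuffer:
    """Hypothetical client-side queue of audio chunks awaiting playback."""

    def __init__(self) -> None:
        self._chunks: Deque[bytes] = deque()

    def enqueue(self, audio: bytes) -> None:
        self._chunks.append(audio)

    def next_chunk(self) -> Optional[bytes]:
        """Return the next chunk to play, or None when nothing is buffered."""
        return self._chunks.popleft() if self._chunks else None

    def clear(self) -> None:
        """Discard everything not yet played; safe on an empty buffer."""
        self._chunks.clear()


def on_server_message(buffer: PlaybackBuffer, kind: str, payload: bytes = b"") -> None:
    # Assumption: `kind` is the decoded message name; the dispatch is illustrative.
    if kind == "ModelAudioChunk":
        buffer.enqueue(payload)
    elif kind == "PlaybackClearBuffer":
        buffer.clear()


buffer = PlaybackBuffer()
on_server_message(buffer, "ModelAudioChunk", b"\x00\x01")
on_server_message(buffer, "ModelAudioChunk", b"\x02\x03")
on_server_message(buffer, "PlaybackClearBuffer")
on_server_message(buffer, "PlaybackClearBuffer")  # repeated clears are harmless
```

Because the message can arrive when nothing is playing, `clear` is written to be idempotent.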
ResponseBegin
Notification that the model has begun its response.
ResponseEnd
Notification that the model has finished its response.
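
The begin and end notifications can be used to group streamed text into complete responses. The sketch below assumes that ModelTextFragment messages for a response arrive between its ResponseBegin and ResponseEnd, which this page does not state explicitly. The `(kind, payload)` event tuples are a hypothetical decoded form.

```python
from typing import Iterable, List, Optional, Tuple

Event = Tuple[str, dict]


def collect_responses(events: Iterable[Event]) -> List[str]:
    """Join text fragments seen between ResponseBegin and ResponseEnd."""
    responses: List[str] = []
    current: Optional[List[str]] = None
    for kind, payload in events:
        if kind == "ResponseBegin":
            current = []
        elif kind == "ModelTextFragment" and current is not None:
            current.append(payload["text"])
        elif kind == "ResponseEnd" and current is not None:
            responses.append("".join(current))
            current = None
    return responses


events = [
    ("ResponseBegin", {}),
    ("ModelTextFragment", {"text": "Hello"}),
    ("ModelTextFragment", {"text": ", world"}),
    ("ResponseEnd", {}),
]
completed = collect_responses(events)
```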
Type Definitions
AudioLineConfiguration
| Field | Type | Description |
|---|---|---|
| sample_rate | uint32 | Sample rate in Hz (e.g., 16000) |
| channel_count | uint32 | Number of channels (typically 1 for mono) |
| sample_format | SampleFormat | Audio sample format |
SampleFormat
| Value | Description |
|---|---|
| UNSIGNED_8_BIT | 8-bit unsigned integer samples |
| SIGNED_16_BIT | 16-bit signed integer samples (recommended) |
| SIGNED_32_BIT | 32-bit signed integer samples |
| FLOAT_32_BIT | 32-bit floating point (0.0 to 1.0) |
| FLOAT_64_BIT | 64-bit floating point (0.0 to 1.0) |
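
The sample format, sample rate, and channel count together determine how many bytes a chunk of audio occupies. A small sketch of that arithmetic follows. The byte widths come from the bit depths above, while the helper name and the 20 ms chunk length are illustrative choices.

```python
# Bytes per sample, derived from the bit depth of each SampleFormat value.
BYTES_PER_SAMPLE = {
    "UNSIGNED_8_BIT": 1,
    "SIGNED_16_BIT": 2,
    "SIGNED_32_BIT": 4,
    "FLOAT_32_BIT": 4,
    "FLOAT_64_BIT": 8,
}


def chunk_size_bytes(sample_rate: int, channel_count: int,
                     sample_format: str, chunk_ms: int) -> int:
    """Size in bytes of `chunk_ms` milliseconds of interleaved PCM audio."""
    frames = sample_rate * chunk_ms // 1000
    return frames * channel_count * BYTES_PER_SAMPLE[sample_format]


# 20 ms of 16 kHz mono 16-bit audio: 320 frames * 1 channel * 2 bytes.
size = chunk_size_bytes(16000, 1, "SIGNED_16_BIT", 20)
print(size)  # 640
```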
VadConfiguration
Voice Activity Detection settings.

| Field | Type | Description |
|---|---|---|
| confidence_threshold | float | Min confidence for speech detection (0.0-1.0) |
| min_volume | float | Min volume level for speech (0.0-1.0) |
| start_duration | Duration | Speech duration to trigger start |
| stop_duration | Duration | Silence duration to trigger end |
| backbuffer_duration | Duration | Audio buffer before speech start (recommended: 1s) |
TtsConfiguration
Optional text-to-speech configuration. If omitted, raw text fragments are sent.

ElevenLabs:

| Field | Type | Description |
|---|---|---|
| api_key | string | Your ElevenLabs API key |
| voice_id | string | Voice ID (e.g., “21m00Tcm4TlvDq8ikWAM”) |
| model_id | string | Optional model ID (e.g., “eleven_turbo_v2”) |
Duration
| Field | Type | Description |
|---|---|---|
| seconds | uint64 | Whole seconds |
| nanos | uint32 | Nanoseconds component |
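
Since the duration fields in VadConfiguration use this two-part form, a helper that converts fractional seconds into it can be handy. The function name `to_duration` is hypothetical.

```python
NANOS_PER_SECOND = 1_000_000_000


def to_duration(seconds: float) -> dict:
    """Split non-negative fractional seconds into whole seconds and nanoseconds."""
    if seconds < 0:
        raise ValueError("Duration fields are unsigned; seconds must be >= 0")
    total_nanos = round(seconds * NANOS_PER_SECOND)
    whole, nanos = divmod(total_nanos, NANOS_PER_SECOND)
    return {"seconds": whole, "nanos": nanos}


backbuffer = to_duration(1.0)      # the recommended 1s backbuffer
stop_duration = to_duration(0.25)  # illustrative value, not a documented default
```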
InferenceTriggerMode
Controls how this input interacts with ongoing inference.

| Value | Description |
|---|---|
| NO_TRIGGER | Don’t trigger inference from this input. Audio is buffered for VAD processing but won’t start inference on its own. |
| QUEUE | Queue inference to start after current inference completes (or immediately if idle). |
| IMMEDIATE | Interrupt any ongoing inference and start processing new input immediately. Recommended for streaming audio. |
TextData
Text input wrapper.

| Field | Type | Description |
|---|---|---|
| data | string | Raw text data |