ServiceBoundMessage (client to server) or ClientBoundMessage (server to client).
Messages are binary-encoded protobuf. JSON examples below are shown for readability. Download the proto file.
Connection
Client Messages
Messages sent from client to server, wrapped in ServiceBoundMessage.
InitializeSessionRequest
Must be the first message sent. Configures the session parameters.
| Field | Type | Description |
|---|---|---|
| input_audio_line | AudioLineConfiguration | Input audio format configuration |
| output_audio_line | AudioLineConfiguration | Output audio format configuration |
| vad_configuration | VadConfiguration | Voice activity detection settings |
| inference_configuration | InferenceConfiguration | System prompt and model behavior |
| tts_configuration | TtsConfiguration | Optional TTS provider config (e.g., ElevenLabs) |
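
As a rough illustration of how these fields fit together, the sketch below assembles an InitializeSessionRequest as a plain Python dict in the same JSON-style view this page uses for readability. It is not the wire format: real messages are protobuf-encoded and wrapped in ServiceBoundMessage. The helper names `audio_line` and `build_initialize_session_request` are hypothetical, and because the fields of InferenceConfiguration are not listed on this page, that value is passed through unchanged from the caller.

```python
from typing import Optional


def audio_line(sample_rate: int, channel_count: int = 1,
               sample_format: str = "SIGNED_16_BIT") -> dict:
    """JSON-style view of an AudioLineConfiguration (hypothetical helper)."""
    return {
        "sample_rate": sample_rate,
        "channel_count": channel_count,
        "sample_format": sample_format,
    }


def build_initialize_session_request(
    inference_configuration: dict,
    vad_configuration: dict,
    tts_configuration: Optional[dict] = None,
) -> dict:
    """Assemble the documented fields; the nested dicts come from the caller."""
    request = {
        "input_audio_line": audio_line(16000),
        "output_audio_line": audio_line(16000),
        "vad_configuration": vad_configuration,
        "inference_configuration": inference_configuration,
    }
    # tts_configuration is optional, so it is only set when provided.
    if tts_configuration is not None:
        request["tts_configuration"] = tts_configuration
    return request
```

The 16000 Hz mono SIGNED_16_BIT defaults follow the example values and recommendation given in the type definitions; adjust them to match your audio pipeline.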
UserInput
Audio data from the user’s microphone.
| Field | Type | Description |
|---|---|---|
| packet_id | uint64 | Client-defined packet identifier for tracking |
| audio_data | AudioData | Raw PCM audio bytes |
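
One possible way to produce a stream of UserInput messages is to slice captured PCM into fixed-size chunks and number them sequentially. The sketch below does that with plain dicts. The function name `packetize` and the sequential numbering scheme are assumptions (packet_id is client-defined, so any scheme works), and `audio_data` is shown as raw bytes because the internal layout of AudioData is not described here.

```python
from typing import Iterator


def packetize(pcm: bytes, chunk_bytes: int,
              first_packet_id: int = 0) -> Iterator[dict]:
    """Yield UserInput-shaped dicts, one per chunk of raw PCM bytes."""
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    for offset in range(0, len(pcm), chunk_bytes):
        yield {
            # Sequential ids are an assumption; the id is client-defined.
            "packet_id": first_packet_id + offset // chunk_bytes,
            # The final chunk may be shorter than chunk_bytes.
            "audio_data": pcm[offset:offset + chunk_bytes],
        }
```

Each yielded dict would then be converted to the generated protobuf type and wrapped in ServiceBoundMessage before sending.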
UpdateToolDefinitionsRequest
Define or update available tools. Replaces all existing definitions.
| Field | Type | Description |
|---|---|---|
| tool_definitions | ToolDefinition[] | List of tool definitions |
Each ToolDefinition contains:
| Field | Type | Description |
|---|---|---|
| name | string | Tool identifier used by the model |
| description | string | Purpose and functionality description |
| parameters | object | JSON Schema for tool parameters |
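
For illustration, a request carrying a single tool might look like the following in the JSON-style view. The `get_weather` tool, its description, and its `city` parameter are invented for this example; only the field names `tool_definitions`, `name`, `description`, and `parameters` come from the tables above.

```python
import json

# Hypothetical tool; "parameters" holds an ordinary JSON Schema object.
get_weather_tool = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
        },
        "required": ["city"],
    },
}

# The request replaces all existing definitions, so send the full list.
update_tool_definitions_request = {"tool_definitions": [get_weather_tool]}

print(json.dumps(update_tool_definitions_request, indent=2))
```

Because each request replaces the previous set, removing a tool means sending the list again without it.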
ToolCallResponse
Response to a tool call request from the server.
| Field | Type | Description |
|---|---|---|
| id | string | Must match the id from ToolCallRequest |
| result | string | Tool execution result (any format the model understands) |
Server Messages
Messages sent from server to client, wrapped in ClientBoundMessage.
ModelTextFragment
Streamed text output as tokens arrive. Used when TTS is not configured.
| Field | Type | Description |
|---|---|---|
| text | string | Text content of this fragment |
ModelAudioChunk
TTS audio output when a TTS provider is configured.
| Field | Type | Description |
|---|---|---|
| audio | AudioData | Audio bytes matching output_audio_line config |
| transcript | string | Optional text that was spoken |
ToolCallRequest
Model requests to execute a tool.
| Field | Type | Description |
|---|---|---|
| id | string | Unique identifier for this request |
| name | string | Name of the tool to call |
| parameters | object | Parameters matching the tool’s schema |
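
A client typically answers each ToolCallRequest by running a local function and returning a ToolCallResponse with the same id. The sketch below shows one way to do that with plain dicts. The `TOOL_HANDLERS` registry, the `get_weather` handler, and the choice to JSON-encode the result (including an error object for unknown tools) are assumptions made for this example; the protocol only requires that `result` be a string the model can interpret.

```python
import json
from typing import Callable, Dict


def get_weather(city: str) -> dict:
    """Hypothetical tool implementation returning canned data."""
    return {"city": city, "conditions": "unknown"}


# Maps tool names to local callables (hypothetical registry).
TOOL_HANDLERS: Dict[str, Callable[..., dict]] = {"get_weather": get_weather}


def handle_tool_call(request: dict) -> dict:
    """Build a ToolCallResponse-shaped dict for a ToolCallRequest-shaped dict."""
    handler = TOOL_HANDLERS.get(request["name"])
    if handler is None:
        payload = {"error": "unknown tool: " + request["name"]}
    else:
        payload = handler(**request["parameters"])
    # The response id must match the id of the request being answered.
    return {"id": request["id"], "result": json.dumps(payload)}
```

For example, `handle_tool_call({"id": "call-1", "name": "get_weather", "parameters": {"city": "Oslo"}})` yields a response whose `id` is `"call-1"`.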
PlaybackClearBuffer
Notification to clear the audio playback buffer. Sent when the user interrupts the model (e.g., starts speaking while audio is still playing). When received, immediately discard any buffered audio that hasn’t been played yet.
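
The interruption behavior described above can be sketched as a small queue that is emptied when the notification arrives. The `PlaybackBuffer` class and the `on_server_message` dispatcher, which tags messages with a string `kind`, are hypothetical; a real client would inspect the decoded ClientBoundMessage instead.

```python
from collections import deque
from typing import Optional


class PlaybackBuffer:
    """FIFO of audio chunks waiting to be played (hypothetical helper)."""

    def __init__(self) -> None:
        self._chunks: deque = deque()

    def enqueue(self, audio: bytes) -> None:
        self._chunks.append(audio)

    def next_chunk(self) -> Optional[bytes]:
        """Return the next chunk to play, or None if nothing is buffered."""
        return self._chunks.popleft() if self._chunks else None

    def clear(self) -> None:
        """Discard everything that has not been played yet."""
        self._chunks.clear()


def on_server_message(buffer: PlaybackBuffer, kind: str,
                      audio: bytes = b"") -> None:
    # "kind" is an assumed tag standing in for the decoded message type.
    if kind == "ModelAudioChunk":
        buffer.enqueue(audio)
    elif kind == "PlaybackClearBuffer":
        buffer.clear()
```

If the audio device keeps its own internal buffer, that would need to be flushed as well so that already-submitted audio stops promptly.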
Type Definitions
AudioLineConfiguration
| Field | Type | Description |
|---|---|---|
| sample_rate | uint32 | Sample rate in Hz (e.g., 16000) |
| channel_count | uint32 | Number of channels (typically 1 for mono) |
| sample_format | SampleFormat | Audio sample format |
SampleFormat
| Value | Description |
|---|---|
| UNSIGNED_8_BIT | 8-bit unsigned integer samples |
| SIGNED_16_BIT | 16-bit signed integer samples (recommended) |
| SIGNED_32_BIT | 32-bit signed integer samples |
| FLOAT_32_BIT | 32-bit floating point (0.0 to 1.0) |
| FLOAT_64_BIT | 64-bit floating point (0.0 to 1.0) |
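
The sample format, sample rate, and channel count together determine how many bytes a given stretch of audio occupies, which is useful when choosing a chunk size for UserInput. The sketch below derives the per-sample widths from the bit depths in the table; the function name `chunk_size_bytes` and the idea of chunking by milliseconds are assumptions for illustration.

```python
# Bytes per sample, derived from the bit widths in the SampleFormat table.
BYTES_PER_SAMPLE = {
    "UNSIGNED_8_BIT": 1,
    "SIGNED_16_BIT": 2,
    "SIGNED_32_BIT": 4,
    "FLOAT_32_BIT": 4,
    "FLOAT_64_BIT": 8,
}


def chunk_size_bytes(sample_rate: int, channel_count: int,
                     sample_format: str, chunk_ms: int) -> int:
    """Size in bytes of chunk_ms milliseconds of interleaved PCM audio."""
    frames = sample_rate * chunk_ms // 1000  # one frame = one sample per channel
    return frames * channel_count * BYTES_PER_SAMPLE[sample_format]


# 20 ms of 16 kHz mono SIGNED_16_BIT audio: 320 frames * 2 bytes
print(chunk_size_bytes(16000, 1, "SIGNED_16_BIT", 20))  # prints 640
```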
VadConfiguration
Voice Activity Detection settings.
| Field | Type | Description |
|---|---|---|
| confidence_threshold | float | Min confidence for speech detection (0.0-1.0) |
| min_volume | float | Min volume level for speech (0.0-1.0) |
| start_duration | Duration | Speech duration to trigger start |
| stop_duration | Duration | Silence duration to trigger end |
| backbuffer_duration | Duration | Audio buffer before speech start (recommended: 1s) |
TtsConfiguration
Optional text-to-speech configuration. If omitted, raw text fragments are sent.
- ElevenLabs
| Field | Type | Description |
|---|---|---|
| api_key | string | Your ElevenLabs API key |
| voice_id | string | Voice ID (e.g., "21m00Tcm4TlvDq8ikWAM") |
| model_id | string | Optional model ID (e.g., "eleven_turbo_v2") |
Duration
| Field | Type | Description |
|---|---|---|
| seconds | uint64 | Whole seconds |
| nanos | uint32 | Nanoseconds component |
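
Since Duration splits a time span into whole seconds plus a nanosecond remainder, a small conversion helper can make the VAD settings easier to write. The sketch below is one way to do it; `to_duration` is a hypothetical helper, and the threshold and duration values in the example configuration are placeholders rather than recommendations, apart from the 1 s back buffer noted under VadConfiguration.

```python
def to_duration(seconds: float) -> dict:
    """Convert fractional seconds to a Duration-shaped dict."""
    if seconds < 0:
        # Both fields are unsigned, so negative spans cannot be represented.
        raise ValueError("duration must be non-negative")
    total_nanos = round(seconds * 1_000_000_000)
    whole, nanos = divmod(total_nanos, 1_000_000_000)
    return {"seconds": whole, "nanos": nanos}


# Placeholder values for illustration only.
vad_configuration = {
    "confidence_threshold": 0.5,
    "min_volume": 0.1,
    "start_duration": to_duration(0.25),
    "stop_duration": to_duration(0.75),
    "backbuffer_duration": to_duration(1.0),
}
```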