The Speeq WebSocket Protocol uses Protocol Buffers for message encoding. All messages are wrapped in either ServiceBoundMessage (client to server) or ClientBoundMessage (server to client).
Messages are binary-encoded protobuf. JSON examples below are shown for readability. Download the proto file.

Connection

wss://app.phonebot.io/v1/realtime
Authentication via Bearer token in the connection headers.
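As a sketch, a client might open the connection like this. The third-party `websockets` package, the `auth_headers` helper, and the exact header name are assumptions based on standard Bearer authentication; check your credentials and library version.

```python
# Connection sketch: builds the Bearer auth header and opens the socket.
# Assumes the third-party `websockets` package (pip install websockets).
import asyncio

SPEEQ_URL = "wss://app.phonebot.io/v1/realtime"

def auth_headers(token: str) -> dict:
    """Build the Bearer-token header for the WebSocket handshake."""
    return {"Authorization": f"Bearer {token}"}

async def connect(token: str):
    import websockets
    # Recent `websockets` releases use `additional_headers`;
    # older ones call the same keyword `extra_headers`.
    return await websockets.connect(SPEEQ_URL, additional_headers=auth_headers(token))
```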

Client Messages

Messages sent from client to server, wrapped in ServiceBoundMessage.

InitializeSessionRequest

Must be the first message sent. Configures the session parameters.
Field | Type | Description
input_audio_line | AudioLineConfiguration | Input audio format configuration
output_audio_line | AudioLineConfiguration | Output audio format configuration
vad_configuration | VadConfiguration | Voice activity detection settings
inference_configuration | InferenceConfiguration | System prompt and model behavior
tts_configuration | TtsConfiguration | Optional TTS provider config (e.g., ElevenLabs)
{
  "initializeSessionRequest": {
    "inputAudioLine": {
      "sampleRate": 16000,
      "channelCount": 1,
      "sampleFormat": "SIGNED_16_BIT"
    },
    "outputAudioLine": {
      "sampleRate": 16000,
      "channelCount": 1,
      "sampleFormat": "SIGNED_16_BIT"
    },
    "vadConfiguration": {
      "confidenceThreshold": 0.5,
      "minVolume": 0,
      "startDuration": { "seconds": 0, "nanos": 200000000 },
      "stopDuration": { "seconds": 0, "nanos": 500000000 },
      "backbufferDuration": { "seconds": 1, "nanos": 0 }
    },
    "inferenceConfiguration": {
      "systemPrompt": "You are a helpful assistant."
    }
  }
}
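A client can assemble this message programmatically before serializing it. The helper below is illustrative (the function name and defaults are not part of the protocol); it mirrors the JSON example above with symmetric 16 kHz mono SIGNED_16_BIT audio lines.

```python
def build_initialize_session_request(system_prompt: str, sample_rate: int = 16000) -> dict:
    """Assemble an InitializeSessionRequest matching the example above.
    Helper name and defaults are illustrative, not part of the protocol."""
    line = {"sampleRate": sample_rate, "channelCount": 1, "sampleFormat": "SIGNED_16_BIT"}
    return {
        "initializeSessionRequest": {
            "inputAudioLine": dict(line),
            "outputAudioLine": dict(line),
            "vadConfiguration": {
                "confidenceThreshold": 0.5,
                "minVolume": 0,
                "startDuration": {"seconds": 0, "nanos": 200_000_000},
                "stopDuration": {"seconds": 0, "nanos": 500_000_000},
                "backbufferDuration": {"seconds": 1, "nanos": 0},
            },
            "inferenceConfiguration": {"systemPrompt": system_prompt},
        }
    }
```

Remember that this must be the first message on the socket.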

UserInput

Audio data from the user’s microphone.
Field | Type | Description
packet_id | uint64 | Client-defined packet identifier for tracking
audio_data | AudioData | Raw PCM audio bytes
{
  "userInput": {
    "packetId": 1,
    "audioData": {
      "data": "<base64-encoded-pcm>"
    }
  }
}
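A typical client splits the microphone stream into fixed-size chunks and tags each with an incrementing packet_id. The chunk size below is an arbitrary illustration (100 ms of 16 kHz mono SIGNED_16_BIT audio); the protocol does not mandate one.

```python
import base64

def pcm_to_user_input_messages(pcm: bytes, chunk_size: int = 3200) -> list:
    """Split raw PCM into UserInput messages with incrementing packet ids.
    3200 bytes = 100 ms of 16 kHz mono 16-bit audio (illustrative only)."""
    messages = []
    for packet_id, start in enumerate(range(0, len(pcm), chunk_size), start=1):
        chunk = pcm[start:start + chunk_size]
        messages.append({
            "userInput": {
                "packetId": packet_id,
                "audioData": {"data": base64.b64encode(chunk).decode("ascii")},
            }
        })
    return messages
```

Note that base64 appears here only because the examples are JSON; on the wire the bytes travel inside the binary protobuf message.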

UpdateToolDefinitionsRequest

Define or update available tools. Replaces all existing definitions.
Field | Type | Description
tool_definitions | ToolDefinition[] | List of tool definitions
Each ToolDefinition contains:
Field | Type | Description
name | string | Tool identifier used by the model
description | string | Purpose and functionality description
parameters | object | JSON Schema for tool parameters
{
  "updateToolDefinitionsRequest": {
    "toolDefinitions": [
      {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": { "type": "string" }
          },
          "required": ["location"]
        }
      }
    ]
  }
}
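Because this message replaces all existing definitions, it is easiest to keep tools in one place and always send the full list. A minimal builder sketch (the helper name and tuple shape are illustrative):

```python
def build_update_tools_request(tools: list) -> dict:
    """Wrap (name, description, json_schema) tuples into an
    UpdateToolDefinitionsRequest. Since the server replaces ALL existing
    definitions, always pass the complete tool list."""
    return {
        "updateToolDefinitionsRequest": {
            "toolDefinitions": [
                {"name": name, "description": desc, "parameters": schema}
                for name, desc, schema in tools
            ]
        }
    }
```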

ToolCallResponse

Response to a tool call request from the server.
Field | Type | Description
id | string | Must match the id from ToolCallRequest
result | string | Tool execution result (any format the model understands)
Every ToolCallRequest must receive a corresponding ToolCallResponse, even if execution fails.
{
  "toolCallResponse": {
    "id": "call_abc123",
    "result": "{\"temperature\": 22, \"condition\": \"sunny\"}"
  }
}
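The "always respond" rule is easiest to honor with a dispatcher that echoes the request id even when the handler throws. A sketch, assuming tool handlers are plain Python callables; the error-string format is an illustration, not part of the protocol:

```python
import json

def handle_tool_call(request: dict, handlers: dict) -> dict:
    """Execute a ToolCallRequest and always produce a ToolCallResponse,
    even on failure, as the protocol requires. `handlers` maps tool
    names to callables taking the tool's parameters as kwargs."""
    call = request["toolCallRequest"]
    try:
        result = handlers[call["name"]](**call.get("parameters", {}))
    except Exception as exc:
        # Failure still yields a response; the error format is up to you.
        result = json.dumps({"error": str(exc)})
    return {"toolCallResponse": {"id": call["id"], "result": str(result)}}
```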

Server Messages

Messages sent from server to client, wrapped in ClientBoundMessage.

ModelTextFragment

Streamed text output as tokens arrive. Used when TTS is not configured.
Field | Type | Description
text | string | Text content of this fragment
{
  "modelTextFragment": {
    "text": "Hello, how can I help you?"
  }
}
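Since fragments arrive token by token, clients usually just concatenate them in arrival order. A minimal sketch:

```python
def accumulate_fragments(messages: list) -> str:
    """Concatenate ModelTextFragment payloads in arrival order,
    ignoring other ClientBoundMessage types, to rebuild the reply."""
    return "".join(
        m["modelTextFragment"]["text"]
        for m in messages
        if "modelTextFragment" in m
    )
```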

ModelAudioChunk

TTS audio output when a TTS provider is configured.
Field | Type | Description
audio | AudioData | Audio bytes matching output_audio_line config
transcript | string | Optional text that was spoken
{
  "modelAudioChunk": {
    "audio": {
      "data": "<base64-encoded-pcm>"
    },
    "transcript": "Hello, how can I help you?"
  }
}

ToolCallRequest

Model requests to execute a tool.
Field | Type | Description
id | string | Unique identifier for this request
name | string | Name of the tool to call
parameters | object | Parameters matching the tool’s schema
{
  "toolCallRequest": {
    "id": "call_abc123",
    "name": "get_weather",
    "parameters": {
      "location": "Amsterdam"
    }
  }
}

PlaybackClearBuffer

Notification to clear the audio playback buffer. Sent when the user interrupts the model (e.g., starts speaking while audio is still playing). When received, immediately discard any buffered audio that hasn’t been played yet.
{
  "playbackClearBuffer": {}
}
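One way to satisfy this on the client is a small queue between the socket and the audio device: ModelAudioChunk payloads are appended, the playback thread drains them, and PlaybackClearBuffer empties whatever has not yet reached the device. A sketch (class and method names are illustrative):

```python
from collections import deque

class PlaybackBuffer:
    """Client-side audio queue between the socket and the audio device."""

    def __init__(self):
        self._chunks = deque()

    def on_audio_chunk(self, audio_bytes: bytes) -> None:
        """Queue decoded ModelAudioChunk audio for playback."""
        self._chunks.append(audio_bytes)

    def on_clear_buffer(self) -> None:
        """Handle PlaybackClearBuffer: drop everything not yet played."""
        self._chunks.clear()

    def next_chunk(self):
        """Pop the next chunk for the audio device, or None if empty."""
        return self._chunks.popleft() if self._chunks else None
```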

Type Definitions

AudioLineConfiguration

Field | Type | Description
sample_rate | uint32 | Sample rate in Hz (e.g., 16000)
channel_count | uint32 | Number of channels (typically 1 for mono)
sample_format | SampleFormat | Audio sample format

SampleFormat

Value | Description
UNSIGNED_8_BIT | 8-bit unsigned integer samples
SIGNED_16_BIT | 16-bit signed integer samples (recommended)
SIGNED_32_BIT | 32-bit signed integer samples
FLOAT_32_BIT | 32-bit floating point (0.0 to 1.0)
FLOAT_64_BIT | 64-bit floating point (0.0 to 1.0)
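Packing samples into the recommended SIGNED_16_BIT layout is a one-liner with the standard library. Note the byte order here is an assumption (the table does not state endianness; little-endian is the common PCM convention, but confirm against the proto file):

```python
import struct

def int16_samples_to_bytes(samples: list) -> bytes:
    """Pack SIGNED_16_BIT samples into raw PCM bytes.
    Assumes little-endian byte order (unspecified in the table above)."""
    return struct.pack(f"<{len(samples)}h", *samples)
```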

VadConfiguration

Voice Activity Detection settings.
Field | Type | Description
confidence_threshold | float | Min confidence for speech detection (0.0-1.0)
min_volume | float | Min volume level for speech (0.0-1.0)
start_duration | Duration | Speech duration to trigger start
stop_duration | Duration | Silence duration to trigger end
backbuffer_duration | Duration | Audio buffer before speech start (recommended: 1s)

TtsConfiguration

Optional text-to-speech configuration. If omitted, the server streams raw text via ModelTextFragment instead of audio.
Field | Type | Description
api_key | string | Your ElevenLabs API key
voice_id | string | Voice ID (e.g., “21m00Tcm4TlvDq8ikWAM”)
model_id | string | Optional model ID (e.g., “eleven_turbo_v2”)
{
  "ttsConfiguration": {
    "elevenLabs": {
      "apiKey": "sk-...",
      "voiceId": "21m00Tcm4TlvDq8ikWAM",
      "modelId": "eleven_turbo_v2"
    }
  }
}

Duration

Field | Type | Description
seconds | uint64 | Whole seconds
nanos | uint32 | Nanoseconds component
// 500 milliseconds
{ "seconds": 0, "nanos": 500000000 }

// 1.5 seconds
{ "seconds": 1, "nanos": 500000000 }
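A small helper keeps the seconds/nanos split out of application code (the function name is illustrative):

```python
def to_duration(seconds: float) -> dict:
    """Convert a float number of seconds into the Duration message shape:
    whole seconds plus a nanoseconds remainder."""
    whole = int(seconds)
    nanos = round((seconds - whole) * 1_000_000_000)
    return {"seconds": whole, "nanos": nanos}
```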