Skip to main content

Documentation Index

Fetch the complete documentation index at: https://docs.deepslate.eu/llms.txt

Use this file to discover all available pages before exploring further.

The VAD WebSocket Protocol is a lightweight streaming endpoint for Voice Activity Detection without LLM inference. It uses Protocol Buffers for message encoding. All messages are wrapped in either ServiceBoundMessage (client to server) or ClientBoundMessage (server to client).
Messages are binary-encoded protobuf. JSON examples below are shown for readability. Download the proto file.
This endpoint runs pure VAD — it does not perform LLM inference, TTS synthesis, or transcription. Use it when you only need speech activity events and want to drive your own processing pipeline downstream. For the full conversational AI pipeline, see the Opal WebSocket Protocol.

Connection

wss://app.deepslate.eu/api/v1/vendors/{vendorId}/organizations/{organizationId}/realtime/vad
Authentication via Bearer token in the connection headers.

Client Messages

Messages sent from client to server, wrapped in ServiceBoundMessage.

InitializeSessionRequest

Must be the first message sent. Configures the audio input format and VAD parameters.
Inference, TTS, and playback reporting fields from the full realtime protocol are not used here and will be ignored if present.
FieldTypeDescription
input_audio_lineAudioLineConfigurationInput audio format configuration
output_audio_lineAudioLineConfigurationOutput audio format (reserved for symmetry; not actively used)
vad_configurationVadConfigurationVoice activity detection settings
enable_vad_frame_telemetryboolWhen true, the server emits per-frame VadAnalysisFrame messages (~50 Hz). Off by default. Experimental.
{
  "initializeSessionRequest": {
    "inputAudioLine": {
      "sampleRate": 16000,
      "channelCount": 1,
      "sampleFormat": "SIGNED_16_BIT"
    },
    "vadConfiguration": {
      "confidenceThreshold": 0.5,
      "minVolume": 0,
      "startDuration": { "seconds": 0, "nanos": 200000000 },
      "stopDuration": { "seconds": 0, "nanos": 500000000 },
      "backbufferDuration": { "seconds": 1, "nanos": 0 }
    },
    "enableVadFrameTelemetry": false
  }
}
Reconfigure the input audio format of an ongoing session.
Reconfiguration may not be seamless. There may be glitches or dropped audio during the transition.
FieldTypeDescription
input_audio_lineAudioLineConfigurationUpdated input audio format configuration
{
  "reconfigureSessionRequest": {
    "inputAudioLine": {
      "sampleRate": 48000,
      "channelCount": 1,
      "sampleFormat": "SIGNED_16_BIT"
    }
  }
}
Raw PCM audio input for VAD processing. Only audio data is accepted — text input and inference trigger modes are not supported on this endpoint.
FieldTypeDescription
packet_iduint64Client-defined packet identifier. Used to correlate packets with VAD events (e.g. VadStateEvent.packet_id). Any numbering scheme is valid.
audio_dataAudioDataRaw PCM audio bytes matching the configured input_audio_line
{
  "userInput": {
    "packetId": 42,
    "audioData": {
      "data": "<binary-pcm-data>"
    }
  }
}

Server Messages

Messages sent from server to client, wrapped in ClientBoundMessage.

SessionReady

Sent once the session is fully initialized and ready to accept audio input. Wait for this message before sending UserInput.
{
  "sessionReady": {}
}
Emitted on every VAD state-machine transition. Always sent, regardless of whether enable_vad_frame_telemetry is set.
FieldTypeDescription
session_timeDurationTime elapsed since the input pipeline started when the transition occurred
from_stateVadStateThe state before the transition
to_stateVadStateThe state after the transition
packet_iduint64The packet_id of the UserInput packet that triggered the transition
{
  "vadStateEvent": {
    "sessionTime": { "seconds": 2, "nanos": 340000000 },
    "fromState": "SILENCE",
    "toState": "SPEECH_STARTING",
    "packetId": 42
  }
}
Per-frame VAD telemetry, emitted at ~50 Hz (20 ms frames on 16 kHz audio). Only sent when enable_vad_frame_telemetry: true was set in InitializeSessionRequest.
This message type is experimental and may be changed or removed without a major version bump.
Frame indexing is monotonic per session and reflects the post-resampling frame stream that the VAD engine actually processes.
FieldTypeDescription
frame_indexuint64Monotonic frame index for this session
session_timeDurationWall-clock time elapsed since the input pipeline started, at the end of this frame
confidencefloatRaw VAD confidence score from the underlying engine (0.0–1.0)
volumefloatRaw RMS volume of the audio frame (0.0–1.0)
stateVadStateThe state-machine state at the end of this frame
source_packet_idsuint64[]packet_ids whose audio contributed to this frame. Usually one; multiple at packet boundaries or with very small packets.
{
  "vadAnalysisFrame": {
    "frameIndex": 117,
    "sessionTime": { "seconds": 2, "nanos": 340000000 },
    "confidence": 0.82,
    "volume": 0.31,
    "state": "SPEECH",
    "sourcePacketIds": [42]
  }
}
Structured error notification sent before the server closes the connection.
FieldTypeDescription
categorySessionErrorCategoryError category for programmatic handling
messagestringHuman-readable message for logging or display
trace_idstringOptional trace ID for correlating with server logs
{
  "error": {
    "category": "ERROR_CONFIGURATION",
    "message": "Invalid sample rate: must be between 8000 and 48000",
    "traceId": "abc-123-xyz"
  }
}

Type Definitions

AudioLineConfiguration

FieldTypeDescription
sample_rateuint32Sample rate in Hz (e.g., 16000)
channel_countuint32Number of channels (typically 1 for mono)
sample_formatSampleFormatAudio sample format

SampleFormat

ValueDescription
UNSIGNED_8_BIT8-bit unsigned integer samples
SIGNED_16_BIT16-bit signed integer samples (recommended)
SIGNED_32_BIT32-bit signed integer samples
FLOAT_32_BIT32-bit floating point (0.0 to 1.0)
FLOAT_64_BIT64-bit floating point (0.0 to 1.0)

VadConfiguration

Voice Activity Detection settings.
FieldTypeDescription
confidence_thresholdfloatMin confidence for speech detection (0.0–1.0)
min_volumefloatMin volume level for speech (0.0–1.0)
start_durationDurationHow long speech must be detected continuously before triggering SPEECH_STARTING
stop_durationDurationHow long silence must persist before transitioning out of SPEECH
backbuffer_durationDurationAudio buffered before the speech start point (recommended: 1 s)

VadState

The VAD pipeline is a debounced state machine. Rather than emitting a transition on every raw frame, the engine applies start_duration and stop_duration windows to smooth out transient noise and brief pauses before committing to a new state. A frame is considered above threshold when both confidence ≥ confidence_threshold AND volume ≥ min_volume; both conditions must hold simultaneously. SILENCE — The initial state. The engine is processing audio but no speech onset has been detected. Frames are evaluated every ~20 ms; the machine stays here until it sees a frame that clears both confidence_threshold and min_volume. SPEECH_STARTING — A potential speech onset has been detected: at least one frame exceeded both thresholds. The machine enters this state and starts the start_duration debounce timer. This window guards against brief noise bursts or transient spikes being misclassified as speech. Two outcomes are possible:
  • If frames remain above threshold continuously for the full start_duration, the machine advances to SPEECH.
  • If any frame drops below threshold before start_duration elapses, the machine returns to SILENCE immediately — the onset is treated as a false positive.
SPEECH — Active speech is confirmed. The machine entered here after sustained above-threshold audio lasting at least start_duration. Audio is considered live speech until the engine sees a frame that drops below threshold, at which point the machine moves to SPEECH_ENDING. SPEECH_ENDING — A potential speech offset has been detected: a frame dropped below threshold while in SPEECH. The stop_duration debounce timer starts. This window prevents brief pauses — breaths, hesitations, word gaps — from prematurely ending a speech segment. Two outcomes are possible:
  • If any frame returns above threshold before stop_duration elapses, the machine snaps back to SPEECH, continuing the same segment.
  • If frames remain below threshold for the full stop_duration, the machine transitions to SILENCE and the speech segment is considered complete.
ValueDescription
SILENCENo speech detected
SPEECH_STARTINGAbove-threshold frame seen; waiting for start_duration to confirm onset
SPEECHActive speech confirmed
SPEECH_ENDINGBelow-threshold frame seen; waiting for stop_duration to confirm offset

Duration

FieldTypeDescription
secondsuint64Whole seconds
nanosuint32Nanoseconds component
// 200 milliseconds
{ "seconds": 0, "nanos": 200000000 }

// 1 second
{ "seconds": 1, "nanos": 0 }

SessionErrorCategory

ValueDescription
ERROR_UNKNOWNUnknown or unclassified error
ERROR_SESSIONSession lifecycle errors (not initialized, already initialized)
ERROR_CONFIGURATIONConfiguration errors (invalid audio format, missing required fields)
ERROR_PROTOCOLProtocol errors (malformed packets, unexpected message types)
ERROR_INFERENCEReserved — not applicable to this endpoint
ERROR_AUDIOAudio pipeline errors (codec failure, VAD errors)
ERROR_TTSReserved — not applicable to this endpoint
ERROR_INTERNALInternal service errors (catch-all for server-side issues)

Session Lifecycle

A typical VAD session follows this sequence:
If enable_vad_frame_telemetry is true, VadAnalysisFrame messages are interleaved continuously between state events at ~50 Hz.