```python
import senko

diarizer = senko.Diarizer(device='auto', vad='auto', clustering='auto', warmup=True, quiet=True, mer_cos=None)
```

- `device`: Device to use for VAD & embeddings stage (`auto`, `cuda`, `coreml`, `cpu`)
  - `auto` automatically selects `coreml` if on macOS, if not, then `cuda`, if not, then `cpu`
- `vad`: Voice Activity Detection model to use (`auto`, `pyannote`, `silero`)
  - `auto` automatically selects `pyannote` for `cuda` & `coreml`, `silero` for `cpu`
  - `pyannote` uses Pyannote VAD (requires `cuda` for optimal performance)
  - `silero` uses Silero VAD (runs on CPU; not available on macOS)
- `clustering`: Clustering location when `device == 'cuda'` (`auto`, `gpu`, `cpu`)
  - Only applies to CUDA devices; non-CUDA devices always use CPU clustering
  - `auto` uses GPU clustering for CUDA devices with compute capability >= 7.0, CPU clustering otherwise
  - `gpu` uses GPU clustering on CUDA devices with compute capability >= 7.0, falls back to CPU clustering with a warning otherwise
  - `cpu` forces CPU clustering (most accurate; see evals)
- `warmup`: Warm up the CAM++ embedding model and clustering objects during initialization
  - If warmup is not done, the first few runs of the pipeline will be a bit slower
- `quiet`: Suppress progress updates and all other output to stdout
- `mer_cos`: Override the cosine-similarity merge threshold for both spectral and UMAP+HDBSCAN clustering
  - Must be > 0 and <= 1
  - `None` keeps the default value from `senko/cluster/conf/*.yaml` (0.875)
  - After initial clustering, clusters whose centroid cosine similarity is >= `mer_cos` are merged
  - If you see too many speakers (over-splitting), try lowering `mer_cos`
  - If you see too few speakers (over-merging), try raising `mer_cos`
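For example, a minimal sketch of tightening the merge behavior when one speaker keeps getting split in two (the 0.85 value is illustrative, not a recommendation):

```python
import senko

# A lower mer_cos merges clusters more aggressively, which helps
# when a single speaker is being over-split into multiple labels.
diarizer = senko.Diarizer(device='auto', vad='auto', mer_cos=0.85, quiet=True)
```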
```python
result_data = diarizer.diarize(wav_path='audio.wav', accurate=None, generate_colors=False)
```

- `wav_path`: Path to the audio file (16kHz mono 16-bit WAV format)
- `accurate`: Use shorter subsegments & a smaller shift for (very slightly) better accuracy (`None`, `True`, `False`)
  - `None` (default): auto-enables if `device == 'cuda'` and `vad == 'pyannote'`
  - The accuracy difference was not stark enough in my testing to warrant turning this on outside of `device == 'cuda'` with `vad == 'pyannote'`
  - The only reason to turn it on would be to get better output parity with `cuda` if on `coreml` or `cpu`
- `generate_colors`: Whether to generate speaker color sets for visualization
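Putting the two calls together, a minimal end-to-end sketch (assuming `audio.wav` is already a 16kHz mono 16-bit WAV):

```python
import senko

diarizer = senko.Diarizer(quiet=True)
result = diarizer.diarize(wav_path='audio.wav')

for seg in result['merged_segments']:
    print(f"{seg['speaker']}: {seg['start']:.2f}s - {seg['end']:.2f}s")
```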
Returns a dictionary (`result_data`) containing the following keys:
- `raw_segments`: Raw diarization output
  - A list of speaking segments (dictionaries) with keys `start`, `end`, `speaker`
- `raw_speakers_detected`: Number of unique speakers found in `raw_segments`
- `merged_segments`: Cleaned diarization output
  - Same format as `raw_segments`
  - Segments <= 0.78 seconds in length are removed
  - Adjacent segments of the same speaker with <= 4 seconds of silence between them are merged into one segment
- `merged_speakers_detected`: Number of unique speakers found in `merged_segments`
- `speaker_centroids`: Voice fingerprints for each detected speaker
  - Dictionary mapping speaker IDs to 192-dimensional numpy arrays
  - Each centroid is the mean of all audio embeddings for that speaker
  - Can be used for speaker comparison/identification across different audio files
- `timing_stats`: Dictionary of how long each stage of the pipeline took in seconds, as well as the total time
  - Keys: `total_time`, `vad_time`, `fbank_time`, `embeddings_time`, `clustering_time`
- `speaker_color_sets`: 10 sets of speaker colors (if requested)
- `vad`: Voice activity detection segments
  - List of `(start, end)` tuples in seconds produced by the VAD stage, marking every region of the audio that contains speech
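A short sketch of inspecting the returned dictionary (key names as documented above):

```python
result = diarizer.diarize(wav_path='audio.wav')

print(f"Raw speakers:    {result['raw_speakers_detected']}")
print(f"Merged speakers: {result['merged_speakers_detected']}")
print(f"Total time:      {result['timing_stats']['total_time']:.2f}s")
```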
Raises `senko.AudioFormatError` if the audio file is not in the required 16kHz mono 16-bit WAV format.
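A defensive sketch around the format requirement (the ffmpeg conversion in the comment is one common way to produce a compliant file):

```python
try:
    result = diarizer.diarize(wav_path='audio.wav')
except senko.AudioFormatError:
    # Convert the source first, e.g.:
    # ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le audio.wav
    print('audio.wav must be 16kHz mono 16-bit WAV')
```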
```python
if senko.speaker_similarity(centroid1, centroid2) >= 0.875:
    print('Speakers are the same')
```

Calculate cosine similarity between two speaker centroids (voice fingerprints).
- `centroid1`: First speaker centroid (192-dimensional numpy array)
- `centroid2`: Second speaker centroid (192-dimensional numpy array)

Returns a `float`: cosine similarity score between -1 and 1 (scores below 0 rarely if ever happen with speaker embeddings)
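For example, a sketch comparing a speaker across two recordings (file names are illustrative; the 0.875 cutoff mirrors the default merge threshold):

```python
result_a = diarizer.diarize(wav_path='call_monday.wav')
result_b = diarizer.diarize(wav_path='call_friday.wav')

centroid_a = result_a['speaker_centroids']['SPEAKER_01']
centroid_b = result_b['speaker_centroids']['SPEAKER_01']

if senko.speaker_similarity(centroid_a, centroid_b) >= 0.875:
    print('Likely the same speaker in both recordings')
```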
```python
senko.save_json(segments, output_path)
```

Save diarization segments to a JSON file.
- `segments`: List of segment dictionaries with keys `start`, `end`, `speaker`
  - Typically `result["raw_segments"]` or `result["merged_segments"]` from `diarize()`
- `output_path`: Path where the JSON file will be saved
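Usage sketch (the output file name is arbitrary):

```python
senko.save_json(result['merged_segments'], 'audio_diarization.json')
```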
```python
senko.save_rttm(segments, wav_path, output_path)
```

Save diarization segments in RTTM (Rich Transcription Time Marked) format, compatible with standard diarization evaluation tools.
- `segments`: List of segment dictionaries with keys `start`, `end`, `speaker`
  - Typically `result["raw_segments"]` or `result["merged_segments"]` from `diarize()`
- `wav_path`: Path to the original audio file (used to extract the file ID for the RTTM format)
- `output_path`: Path where the RTTM file will be saved
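Usage sketch (paths are illustrative):

```python
senko.save_rttm(result['merged_segments'], 'audio.wav', 'audio.rttm')
```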
Speaker segments (`raw_segments`/`merged_segments`):

```python
[
{
"start": 0.0,
"end": 5.2,
"speaker": "SPEAKER_01"
},
{
"start": 5.2,
"end": 10.8,
"speaker": "SPEAKER_02"
},
...
]
```
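Assuming `result` is the dictionary returned by `diarize()`, totaling talk time per speaker from this shape is straightforward (`talk_time` is a hypothetical helper, not part of senko):

```python
from collections import defaultdict

def talk_time(segments):
    # Sum each speaker's total speaking time in seconds.
    totals = defaultdict(float)
    for seg in segments:
        totals[seg['speaker']] += seg['end'] - seg['start']
    return dict(totals)

print(talk_time(result['merged_segments']))
```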
Speaker centroids (`speaker_centroids`):

```python
{
"SPEAKER_01": array([0.123, -0.456, 0.789, ...]), # 192-dimensional numpy array
"SPEAKER_02": array([-0.234, 0.567, -0.890, ...]), # 192-dimensional numpy array
...
}
```
Color sets (`speaker_color_sets`):

```python
{
"0": {
"SPEAKER_01": "#ea759c",
"SPEAKER_02": "#579c3a",
"SPEAKER_03": "#100058",
},
"1": {
"SPEAKER_01": "#97de7b",
"SPEAKER_02": "#4c56b6",
"SPEAKER_03": "#480000",
},
"2": {
"SPEAKER_01": "#8393f9",
"SPEAKER_02": "#bf5d01",
"SPEAKER_03": "#003a38",
},
...
}
```
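Each set maps every detected speaker to a hex color, so picking any one set yields a consistent palette; a sketch (set `"0"` chosen arbitrarily):

```python
colors = result['speaker_color_sets']['0']
for seg in result['merged_segments']:
    print(f"{seg['speaker']} ({colors[seg['speaker']]}): {seg['start']:.1f}-{seg['end']:.1f}s")
```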
VAD segments (`vad`):

```python
[
(0.0, 2.1),
(2.4, 6.7),
(7.1, 10.2),
...
]
```
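Since each tuple is a `(start, end)` pair in seconds, total detected speech time is a one-liner sketch:

```python
speech_seconds = sum(end - start for start, end in result['vad'])
print(f"Detected speech: {speech_seconds:.1f}s")
```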