Voice & Speech

Speech-to-text, text-to-speech, voice cloning, voice agents, and audio AI. From open-source toolkits like Whisper and SpeechBrain to production voice agent platforms.

Librosa

Python library for audio and music analysis, useful for AI voice processing.

Python, Audio Analysis, Music

OpenSMILE

Open-source audio feature extraction toolkit, commonly used for AI emotion recognition.

Audio Analysis, Feature Extraction, Emotion Recognition

Pyannote

Python library for speaker diarization in AI audio processing pipelines.

Python, Audio Processing, Speaker Diarization

Pyannote Audio

Python library for speaker diarization and audio analysis.

Python, Audio Processing, Speaker Diarization

Whisper

Speech recognition model by OpenAI

Speech Recognition, Transcription, OpenAI, Multilingual, Audio Transcription

SpeechBrain

PyTorch-based speech toolkit

Python, Open Source, NLP, Speech Recognition, Audio Processing

DeepSpeech

Speech-to-text engine using a model trained by machine learning techniques

Open Source, Deep Learning, Speech Recognition, Audio Processing, Mozilla

Kaldi

Speech recognition toolkit

Open Source, Research, NLP, Speech Recognition, Audio Processing

ElevenLabs

AI voice generation and cloning platform

Text-to-Speech, Voice Synthesis, Voice Cloning, AI Voice

Vapi

Provider-agnostic voice AI orchestration platform. Build, test, and deploy advanced voice agents across 14+ STT/TTS/LLM providers.

AI Agents, API, Speech Recognition, Text-to-Speech

Retell AI

Enterprise voice agent platform with sub-600ms response times. Build, deploy, and manage human-sounding voice agents at scale.

AI Agents, Speech Recognition, Enterprise, Text-to-Speech, Enterprise AI

Bland AI

High-volume outbound voice agents with built-in telephony infrastructure. Automate calls without sacrificing quality.

AI Agents, Speech Recognition, Enterprise, Text-to-Speech

Inworld TTS

#1-ranked text-to-speech with human-like expression and sub-200ms realtime latency. Voice cloning and multilingual.

Audio Processing, Text-to-Speech

Mistral Voxtral

Mistral's open-weight frontier speech-understanding models (24B and 3B). Apache 2.0 licensed; long-context, multilingual, function-calling.

LLM, Open Source, Speech Recognition, Audio Processing

Deepgram Voice Agent

Unified conversational voice AI API combining STT, LLM orchestration, and TTS at $4.50/hr. Claims a 54.2% lower word error rate (WER) on noisy audio.

AI Agents, API, Speech Recognition, Text-to-Speech

15 tools in this category