Voice & Speech

Speech-to-text, text-to-speech, voice cloning, voice agents, and audio AI. From open-source toolkits like Whisper and SpeechBrain to production voice agent platforms.

Librosa

Python library for audio and music analysis, useful for AI voice processing.

Python, Audio Analysis, Music

OpenSMILE

Open-source toolkit for audio feature extraction, often used in AI emotion recognition.

Audio Analysis, Feature Extraction, Emotion Recognition

Pyannote

Python library for speaker diarization in AI audio processing pipelines.

Python, Audio Processing, Speaker Diarization

Pyannote Audio

Python library for speaker diarization and audio analysis.

Python, Audio Processing, Speaker Diarization

Whisper

Speech recognition model by OpenAI

Speech Recognition, Transcription, OpenAI, Multilingual, Audio Transcription

SpeechBrain

PyTorch-based speech toolkit

Python, Open Source, NLP, Speech Recognition, Audio Processing

DeepSpeech

Open-source speech-to-text engine from Mozilla, built on a deep learning model.

Open Source, Deep Learning, Speech Recognition, Audio Processing, Mozilla

Kaldi

Speech recognition toolkit

Open Source, Research, NLP, Speech Recognition, Audio Processing

ElevenLabs

AI voice generation and cloning platform

Text-to-Speech, Voice Synthesis, Voice Cloning, AI Voice

Vapi

Provider-agnostic voice AI orchestration platform. Build, test, and deploy advanced voice agents across 14+ STT/TTS/LLM providers.

AI Agents, API, Speech Recognition, Text-to-Speech

Retell AI

Enterprise voice agent platform with sub-600ms response times. Build, deploy, and manage human-sounding voice agents at scale.

AI Agents, Speech Recognition, Enterprise, Text-to-Speech, Enterprise AI

Bland AI

High-volume outbound voice agents with built-in telephony infrastructure. Automate calls without sacrificing quality.

AI Agents, Speech Recognition, Enterprise, Text-to-Speech

Inworld TTS

Text-to-speech billed as #1-ranked, with human-like expression and sub-200ms real-time latency. Supports voice cloning and multiple languages.

Audio Processing, Text-to-Speech

Mistral Voxtral

Mistral's open-weight frontier speech-understanding models (24B and 3B). Apache 2.0 licensed; long-context, multilingual, function-calling.

LLM, Open Source, Speech Recognition, Audio Processing

Deepgram Voice Agent

Unified conversational voice AI API combining STT, LLM orchestration, and TTS at $4.50/hr. Claims 54.2% lower word error rate (WER) on noisy audio.

AI Agents, API, Speech Recognition, Text-to-Speech
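
For readers comparing speech-to-text tools in this list, the WER figure quoted for Deepgram Voice Agent refers to word error rate. The sketch below shows one common way to compute it: word-level edit distance divided by the number of reference words. It is an illustration only, not any vendor's evaluation method. The function names `word_error_rate` and `relative_wer_reduction` are hypothetical, and reading "54.2% lower" as a relative reduction is an assumption, since the listing does not name a baseline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Hypothetical helper for illustration; real evaluations usually also
    normalize casing and punctuation before comparing.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction by which improved_wer is lower than baseline_wer (assumed reading)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - improved_wer) / baseline_wer


if __name__ == "__main__":
    # Made-up transcripts, not taken from any benchmark.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
    print(relative_wer_reduction(0.20, 0.10))
```

A lower WER is better. Because the score is normalized by the reference length, it can exceed 1.0 when the hypothesis contains many inserted words.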

15 tools in this category