Voicelab Models
Voicelab provides access to state-of-the-art open-source text-to-speech models.
Featured Models
Sesame CSM-1B
State-of-the-art conversational speech model

This model is optimized, and certain voices are professionally cloned. As a result, we offer fast, low-cost inference for this model, and select voices have more consistent quality.
- Parameters: 1 billion
- Specialization: Natural conversation and dialogue
- Languages: English
- Key Features:
  - Exceptional naturalness in conversational contexts
  - Low-latency inference
  - Consistent voice quality across long generations
  - Optimized for real-time applications
Dia
General-purpose TTS model
- Parameters: 1.6 billion
- Specialization: General-purpose text-to-speech
- Languages: English
- Key Features:
  - Accepts emotive tokens (e.g. laughing, sighing)
  - Efficient inference
  - Less stable than CSM-1B
Supported emotive tokens: (laughs), (clears throat), (sighs), (gasps), (coughs), (singing), (sings), (mumbles), (beep), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (burps), (humming), (sneezes), (chuckle), (whistles)
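Because Dia tokens are plain parenthesized words, a typo such as (laugh) is easy to miss. The following is a minimal sketch, not part of any Voicelab SDK, that flags parenthesized phrases in a script that are not in the list above. The names `DIA_TOKENS` and `unknown_dia_tokens` are hypothetical, and the check assumes tokens are written in lowercase exactly as listed.

```python
import re

# Token names copied from the Dia list above (without the parentheses).
DIA_TOKENS = {
    "laughs", "clears throat", "sighs", "gasps", "coughs", "singing", "sings",
    "mumbles", "beep", "groans", "sniffs", "claps", "screams", "inhales",
    "exhales", "applause", "burps", "humming", "sneezes", "chuckle", "whistles",
}

# Matches any parenthesized phrase made only of letters and spaces.
_PAREN = re.compile(r"\(([A-Za-z ]+)\)")


def unknown_dia_tokens(text: str) -> list[str]:
    """Return parenthesized phrases in `text` that are not listed Dia tokens."""
    found = _PAREN.findall(text)
    return [phrase for phrase in found if phrase.strip().lower() not in DIA_TOKENS]


if __name__ == "__main__":
    script = "That was close (laughs) but we made it (giggles)."
    print(unknown_dia_tokens(script))
```

Note that ordinary parenthetical asides in the script are also reported, so the result is best treated as a list of candidates to review rather than as definite errors.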
Orpheus
Expressive and emotional speech synthesis
- Parameters: 750 million
- Specialization: Emotional and expressive speech
- Languages: English (primary)
- Key Features:
  - Advanced emotional control
  - Dynamic prosody and intonation
  - Character voice capabilities
  - Rich vocal expressions
Supported emotive tags: <laugh>, <chuckle>, <sigh>, <cough>, <sniffle>, <groan>, <yawn>, <gasp>
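Orpheus tags use angle brackets rather than parentheses. This page does not say how the model handles a tag outside the list above, so one cautious option is to drop unlisted tags before sending text for synthesis. The sketch below illustrates that idea only. `ORPHEUS_TAGS` and `strip_unknown_orpheus_tags` are hypothetical names, not Voicelab API.

```python
import re

# Tag names copied from the Orpheus list above (without the angle brackets).
ORPHEUS_TAGS = {"laugh", "chuckle", "sigh", "cough", "sniffle", "groan", "yawn", "gasp"}

# Matches a lowercase word wrapped in angle brackets, e.g. <sigh>.
_TAG = re.compile(r"<([a-z]+)>")


def strip_unknown_orpheus_tags(text: str) -> str:
    """Keep listed Orpheus tags, remove any other <word> marker."""

    def keep_or_drop(match: re.Match) -> str:
        # Return the tag unchanged if it is listed, otherwise remove it.
        return match.group(0) if match.group(1) in ORPHEUS_TAGS else ""

    cleaned = _TAG.sub(keep_or_drop, text)
    # Collapse the double spaces left behind by removed tags.
    return re.sub(r"\s{2,}", " ", cleaned).strip()


if __name__ == "__main__":
    print(strip_unknown_orpheus_tags("Well <sigh> that was <giggle> fun"))
```

The pattern also removes unrelated lowercase markup such as stray HTML tags, which is usually acceptable for text headed to a speech model but worth knowing about.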
Coming Soon
Kokoro, Chatterbox, Kyutai TTS
Model Specifications
Supported Features
| Feature | Sesame CSM-1B | Dia | Orpheus |
|---|---|---|---|
| Voice Cloning | ✅ | ✅ | ✅ |
| Emotive Tokens | ❌ | ✅ | ✅ |
| Multi-speaker | ✅ | ✅ | ✅ |
| Real-time Streaming | ✅ | ✅ | ✅ |
| Custom Fine-tuning | ✅ | ✅ | ✅ |
Although CSM-1B doesn’t accept emotive tokens (laughs, sighs, etc.), it can produce these sounds in its output audio based on conversational context.
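When choosing a model in code, the table above can be mirrored as a small lookup. This is a sketch under the assumption that the feature names and model names are used exactly as they appear in the table. `FEATURES` and `models_supporting` are hypothetical helpers, not part of any Voicelab client.

```python
# Feature support copied from the table above (True = ✅, False = ❌).
FEATURES = {
    "Sesame CSM-1B": {
        "Voice Cloning": True,
        "Emotive Tokens": False,
        "Multi-speaker": True,
        "Real-time Streaming": True,
        "Custom Fine-tuning": True,
    },
    "Dia": {
        "Voice Cloning": True,
        "Emotive Tokens": True,
        "Multi-speaker": True,
        "Real-time Streaming": True,
        "Custom Fine-tuning": True,
    },
    "Orpheus": {
        "Voice Cloning": True,
        "Emotive Tokens": True,
        "Multi-speaker": True,
        "Real-time Streaming": True,
        "Custom Fine-tuning": True,
    },
}


def models_supporting(*required: str) -> list[str]:
    """Return the models, in table order, that support every required feature."""
    return [
        name
        for name, supported in FEATURES.items()
        if all(supported[feature] for feature in required)
    ]


if __name__ == "__main__":
    print(models_supporting("Emotive Tokens", "Real-time Streaming"))
```

A misspelled feature name raises `KeyError` rather than silently returning an empty list, which makes typos easier to catch.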