# Voicelab Models

Voicelab provides access to state-of-the-art open-source text-to-speech models.

## Featured Models

### Sesame CSM-1B

State-of-the-art conversational speech model. This model is optimized, and certain voices are professionally cloned: we offer fast, low-cost inference for this model, and select voices have more consistent quality.

- Parameters: 1 billion
- Specialization: Natural conversation and dialogue
- Languages: English
- Key Features:
  - Exceptional naturalness in conversational contexts
  - Low-latency inference
  - Consistent voice quality across long generations
  - Optimized for real-time applications
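As a rough sketch of what a synthesis request might look like, assuming a simple HTTP API (the endpoint URL, model identifier, voice name, and payload fields below are illustrative placeholders, not Voicelab's documented interface):

```python
# Hypothetical sketch only: the endpoint, model id, and field names are
# assumptions for illustration, not Voicelab's documented API.
import requests

resp = requests.post(
    "https://api.voicelab.example/v1/tts",  # placeholder URL
    json={
        "model": "sesame-csm-1b",  # hypothetical model identifier
        "voice": "default",        # hypothetical voice name
        "text": "Hey! It's great to finally talk to you.",
    },
)
resp.raise_for_status()
with open("csm_output.wav", "wb") as f:
    f.write(resp.content)  # save the returned audio bytes
```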
### Dia

General-purpose TTS model. This model is not yet optimized or post-trained, so generation quality may be inconsistent.
- Parameters: 1.6 billion
- Specialization: General-purpose text-to-speech
- Languages: English
- Key Features:
  - Accepts emotive tokens (e.g. laughing, sighing)
  - Efficient inference
  - Less stable than CSM-1B
Supported emotive tokens: `(laughs)`, `(clears throat)`, `(sighs)`, `(gasps)`, `(coughs)`, `(singing)`, `(sings)`, `(mumbles)`, `(beep)`, `(groans)`, `(sniffs)`, `(claps)`, `(screams)`, `(inhales)`, `(exhales)`, `(applause)`, `(burps)`, `(humming)`, `(sneezes)`, `(chuckle)`, `(whistles)`
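To illustrate how these tokens might be embedded inline in input text, here is a hedged sketch; the token placement is the point, while the request plumbing (endpoint, model id, field names) is an assumption, not documented behavior:

```python
# Hypothetical sketch: emotive tokens are written inline in the input text.
# The endpoint and payload fields are illustrative assumptions.
import requests

text = "That went better than expected! (laughs) Okay... (clears throat) next item."
resp = requests.post(
    "https://api.voicelab.example/v1/tts",  # placeholder URL
    json={"model": "dia", "text": text},    # hypothetical model id
)
resp.raise_for_status()
with open("dia_output.wav", "wb") as f:
    f.write(resp.content)
```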
### Orpheus

Expressive and emotional speech synthesis. This model is not post-trained; the hosted weights are served as-is.
- Parameters: 750 million
- Specialization: Emotional and expressive speech
- Languages: English (primary)
- Key Features:
  - Advanced emotional control
  - Dynamic prosody and intonation
  - Character voice capabilities
  - Rich vocal expressions
Supported emotive tags: `<laugh>`, `<chuckle>`, `<sigh>`, `<cough>`, `<sniffle>`, `<groan>`, `<yawn>`, `<gasp>`
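Orpheus tags are placed inline the same way, but use angle brackets rather than parentheses. A hedged sketch (endpoint and payload fields are again illustrative assumptions):

```python
# Hypothetical sketch: Orpheus-style tags are angle-bracketed and embedded
# inline. The endpoint and payload fields are assumptions for illustration.
import requests

text = "Oh no. <gasp> Wait... <sigh> fine, I'll handle it myself. <groan>"
resp = requests.post(
    "https://api.voicelab.example/v1/tts",   # placeholder URL
    json={"model": "orpheus", "text": text}, # hypothetical model id
)
resp.raise_for_status()
with open("orpheus_output.wav", "wb") as f:
    f.write(resp.content)
```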
## Coming Soon

Kokoro, Chatterbox, Kyutai TTS

## Model Specifications

### Supported Features
| Feature | Sesame CSM-1B | Dia | Orpheus |
|---|---|---|---|
| Voice Cloning | ✅ | ✅ | ✅ |
| Emotive Tokens | ❌ | ✅ | ✅ |
| Multi-speaker | ✅ | ✅ | ✅ |
| Real-time Streaming | ✅ | ✅ | ✅ |
| Custom Fine-tuning | ✅ | ✅ | ✅ |
While CSM-1B doesn't accept emotive tokens (laugh/sigh/etc.), it can still generate these vocalizations in output audio based on conversational context.
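Since all three models support real-time streaming per the table above, here is a hedged sketch of consuming chunked audio over HTTP; the endpoint, the `stream` payload field, and the raw-bytes response framing are all assumptions for illustration:

```python
# Hypothetical streaming sketch: chunked audio over HTTP. The endpoint,
# "stream" field, and raw-bytes framing are assumptions, not documented behavior.
import requests

with requests.post(
    "https://api.voicelab.example/v1/tts",  # placeholder URL
    json={"model": "sesame-csm-1b", "text": "Hello there!", "stream": True},
    stream=True,  # ask requests not to buffer the whole response
) as resp:
    resp.raise_for_status()
    with open("stream_output.wav", "wb") as f:
        for chunk in resp.iter_content(chunk_size=4096):
            f.write(chunk)  # a real-time app would feed chunks to an audio player instead
```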