Explore the available text-to-speech models and their capabilities
Voicelab provides access to state-of-the-art open-source text-to-speech models.
Sesame CSM-1B: State-of-the-art conversational speech model
This model is optimized, and certain voices are professionally cloned. As a result, we offer fast, low-cost inference for this model, and those select voices have more consistent quality.
Dia: General-purpose TTS model
This model is not yet optimized or post-trained. Generation quality may be inconsistent.
Emotive tags: (laughs), (clears throat), (sighs), (gasps), (coughs), (singing), (sings), (mumbles), (beep), (groans), (sniffs), (claps), (screams), (inhales), (exhales), (applause), (burps), (humming), (sneezes), (chuckle), (whistles)
Orpheus: Expressive and emotional speech synthesis
This model is not post-trained; the hosted weights are as-is.
Emotive tags: <laugh>, <chuckle>, <sigh>, <cough>, <sniffle>, <groan>, <yawn>, <gasp>
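Because the two tag syntaxes differ (parentheses in one list, angle brackets in the other), it can help to check a prompt before sending it. The following is a minimal sketch, not part of any Voicelab SDK: the model keys `"dia"` and `"orpheus"` and the function `unsupported_tags` are hypothetical names chosen for illustration, and the tag sets are simply copied from the lists above.

```python
import re

# Tag lists copied from the model descriptions above.
PAREN_TAGS = {
    "(laughs)", "(clears throat)", "(sighs)", "(gasps)", "(coughs)",
    "(singing)", "(sings)", "(mumbles)", "(beep)", "(groans)", "(sniffs)",
    "(claps)", "(screams)", "(inhales)", "(exhales)", "(applause)",
    "(burps)", "(humming)", "(sneezes)", "(chuckle)", "(whistles)",
}
ANGLE_TAGS = {
    "<laugh>", "<chuckle>", "<sigh>", "<cough>",
    "<sniffle>", "<groan>", "<yawn>", "<gasp>",
}

# Hypothetical model keys (an assumption, not official identifiers).
SUPPORTED_TAGS = {"dia": PAREN_TAGS, "orpheus": ANGLE_TAGS}

# Matches anything that looks like a tag in either syntax.
TAG_PATTERN = re.compile(r"\([a-z ]+\)|<[a-z]+>")


def unsupported_tags(model: str, text: str) -> list[str]:
    """Return tag-like tokens in `text` that are not listed for `model`."""
    supported = SUPPORTED_TAGS[model]
    return [tag for tag in TAG_PATTERN.findall(text) if tag not in supported]


print(unsupported_tags("dia", "Hello (laughs) there <sigh>"))
```

Note that the pattern also flags ordinary lowercase parentheticals such as `(see below)`, so treat the result as a warning rather than a hard error.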
Kokoro, Chatterbox, Kyutai TTS
Feature | Sesame CSM-1B | Dia | Orpheus |
---|---|---|---|
Voice Cloning | ✅ | ✅ | ✅ |
Emotive Tokens | ❌ | ✅ | ✅ |
Multi-speaker | ✅ | ✅ | ✅ |
Real-time Streaming | ✅ | ✅ | ✅ |
Custom Fine-tuning | ✅ | ✅ | ✅ |
Although CSM-1B doesn’t accept emotive tokens (laugh, sigh, etc.), it can still produce these sounds in its output audio based on conversational context.
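Since CSM-1B does not accept emotive tokens, a script written for one of the other models may need its tags removed before being reused with CSM-1B. The sketch below shows one way to do that; `strip_emotive_tags` is a hypothetical helper, not a Voicelab API, and it only removes the exact tags listed on this page.

```python
import re

# Every emotive tag listed on this page, in both syntaxes.
KNOWN_TAGS = [
    "(laughs)", "(clears throat)", "(sighs)", "(gasps)", "(coughs)",
    "(singing)", "(sings)", "(mumbles)", "(beep)", "(groans)", "(sniffs)",
    "(claps)", "(screams)", "(inhales)", "(exhales)", "(applause)",
    "(burps)", "(humming)", "(sneezes)", "(chuckle)", "(whistles)",
    "<laugh>", "<chuckle>", "<sigh>", "<cough>",
    "<sniffle>", "<groan>", "<yawn>", "<gasp>",
]

# One alternation of the escaped literal tags.
_TAG_RE = re.compile("|".join(re.escape(tag) for tag in KNOWN_TAGS))


def strip_emotive_tags(text: str) -> str:
    """Remove known emotive tags and collapse the leftover whitespace."""
    without_tags = _TAG_RE.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


print(strip_emotive_tags("Well (sighs) fine, let's go <laugh>"))
```

Matching only the literal tags leaves ordinary parentheticals in the script untouched, at the cost of needing the list updated whenever the supported tags change.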