We provide access to Sesame’s ultra-realistic CSM-1B voice model through a rebuilt version that produces audio in real time and at low latency. This voice engine is available at no additional cost.

In testing, the voice is generally more realistic than those of state-of-the-art voice vendors, while also beating them on cost and latency.

Key Features

  • Natural Prosody: Sesame voices deliver more natural intonation, rhythm, and stress patterns in speech
  • Improved Expressiveness: Better emotional range and contextual understanding
  • Enhanced Pronunciation and Spelling: More accurate handling of complex words and phrases
  • Seamless Transitions: Smoother flow between sentences and paragraphs

Using Sesame Voices

Sesame voices can be identified by the “Sesame” tag in the voice selection interface.

Only a small number of Sesame voices are available right now, but creating new ones through cloning is straightforward and takes only ~8-20 seconds of audio. For tips on creating new Sesame voices effectively, see the Voice Cloning section.
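
Before uploading a sample for cloning, it can help to confirm that the clip falls inside the ~8-20 second range mentioned above. The following is a minimal sketch using only Python's standard `wave` module. It assumes the sample is an uncompressed PCM WAV file; the function names and the exact 8.0 and 20.0 bounds are illustrative choices, not part of any platform API.

```python
import wave

# Assumed bounds, taken from the approximate "~8-20 seconds" guidance.
MIN_SECONDS = 8.0
MAX_SECONDS = 20.0


def clip_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        # Duration = number of audio frames / frames per second.
        return wav.getnframes() / wav.getframerate()


def is_suitable_length(path: str) -> bool:
    """Return True if the clip is within the assumed 8-20 second range."""
    return MIN_SECONDS <= clip_duration_seconds(path) <= MAX_SECONDS
```

For example, `is_suitable_length("sample.wav")` (a hypothetical file name) returns `False` for a 3-second clip. Other formats such as MP3 would need to be converted to WAV first or checked with a different library.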

Sesame voices are still in beta and may show instability in inference (e.g., long pauses or strange conversational artifacts). We regularly release updates that improve their capabilities and performance.