Sesame’s AI voice models are, without a doubt, among the most human-sounding yet developed. The company, which recently released an uncanny pair of voice model demos on its website, has open-sourced its advanced Conversational Speech Model (CSM). The voices featured on its website, Maya and Miles, are not included in the release, but the realism of the tech is miles ahead of, say, Siri, Alexa, or even OpenAI’s advanced voice mode, all of which sound flat to varying degrees.
In particular, this previously under-the-radar company has released its 1B CSM variant to the public under an Apache 2.0 license. Built on a Llama backbone with a specialized audio decoder generating Mimi audio codes, CSM is billed as a significant leap forward in natural speech generation. As the company noted in February, it aims to cross the so-called uncanny valley of conversational voice.
For a quick demo of the CSM, check out my overview.
The model is available on GitHub under the SesameAILabs organization, with checkpoints hosted on Hugging Face, so developers can now build on a model that has some capacity to emote and even seemingly breathe at times. Getting started requires minimal setup: a CUDA-compatible GPU (tested on CUDA 12.4 and 12.6), Python 3.10 (though newer versions may work), and, in some cases, ffmpeg for audio operations.
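With those prerequisites in place, generating a first clip takes only a few lines. The snippet below is a minimal sketch modeled on the repository’s README; the load_csm_1b helper, the generate() parameters, and the sample_rate attribute are drawn from that README and may differ between versions of the code, so treat this as a sketch rather than canonical usage.

```python
# Minimal sketch of basic speech generation with the open-sourced CSM 1B,
# modeled on the sesame/csm README. Helper names (load_csm_1b, generate,
# sample_rate) may differ between repository versions.
import torch
import torchaudio

from generator import load_csm_1b  # ships with the sesame/csm repository

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the 1B checkpoint.
generator = load_csm_1b(device=device)

# Generate a short utterance with no conversational context.
audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)

# The result is a mono waveform tensor at the generator's native sample rate.
torchaudio.save("hello.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```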
Unlike traditional text-to-speech systems that simply read provided text, CSM can maintain conversational context. The documentation explains that “CSM sounds best when provided with context,” allowing developers to create more natural-sounding dialogue by supplying previous utterances as context. This feature enables the speech generation to adapt its tone, pacing, and expressiveness based on the conversation flow. A YouTube video from Dr. Know-it-all Knows it all breaks down the nuts and bolts of the technology and provides a demo with one of the voice AIs, Maya.
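In practice, that context is passed as prior utterances paired with their transcripts and speaker IDs. The sketch below follows the same README pattern; the Segment type and its fields are assumptions drawn from the repository code and may differ across versions, and the turn_*.wav files are hypothetical placeholders for earlier audio in the conversation.

```python
# Sketch of context-conditioned generation, following the pattern in the
# sesame/csm README. Segment and its fields are assumed from that repo and
# may differ across versions; the .wav paths are hypothetical placeholders.
import torchaudio

from generator import Segment, load_csm_1b

generator = load_csm_1b(device="cuda")

def load_utterance(path: str):
    # Load a prior utterance and resample it to the generator's sample rate.
    waveform, sample_rate = torchaudio.load(path)
    return torchaudio.functional.resample(
        waveform.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )

# Earlier turns of the conversation: audio plus its transcript and speaker ID.
context = [
    Segment(text="How are you doing today?", speaker=0, audio=load_utterance("turn_0.wav")),
    Segment(text="Pretty well, thanks for asking.", speaker=1, audio=load_utterance("turn_1.wav")),
]

# With context supplied, the model adapts tone and pacing to the conversation.
audio = generator.generate(
    text="Glad to hear it. What are you working on?",
    speaker=0,
    context=context,
    max_audio_length_ms=10_000,
)
torchaudio.save("turn_2.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```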
The open-source release is accompanied by ethical guidelines, explicitly prohibiting uses like impersonation, misinformation, or illegal activities.
From an R&D perspective, Sesame AI, a San Francisco-based startup, has ties to Oculus VR, whose Rift device was among the first modern consumer-grade VR headsets. Specifically, Oculus co-founder Brendan Iribe is also behind Sesame, which is eyeing an augmented reality play of its own: lightweight AI glasses that pair with the voice assistant. Sesame AI’s overarching mission is to achieve what it calls “voice presence,” which the company describes as a “magical quality that makes spoken interactions feel real, understood, and valued.”
The company has quickly attracted major venture capital backing, including Andreessen Horowitz (with backing from Marc Andreessen and Anjney Midha), Spark Capital, Matrix Partners and others. Sesame has offices in San Francisco, Bellevue, and New York.
Backgrounder on the tech
Conversational Speech Model (CSM): At the core of Sesame AI’s innovation is its Conversational Speech Model (CSM), an AI speech system designed for real-time, human-like conversation. Unlike traditional text-to-speech pipelines that generate a text response first and then synthesize audio from it, Sesame’s CSM is end-to-end and multimodal. In other words, it processes text and audio context together in a single model, as the International Society for Presence Research has noted. This allows the AI to “think” as it speaks, producing not just words but also the subtle vocal behaviors that convey meaning and emotion. For example, CSM will insert natural pauses and “ums,” appear to draw breath, chuckle, or shift its tone on the fly. As a result, the AI’s speech comes across as fluid and often human-like.
Architecture and training: Sesame’s Conversational Speech Model uses a transformer architecture similar to that of large language models, adapted for speech generation. The model is essentially composed of two neural networks working in tandem, which Sesame refers to as a “backbone” (the master model) and a decoder. The backbone is a transformer with billions of parameters that handles understanding the text and conversational context, while the decoder is a smaller transformer that focuses on producing the fine-grained audio output. In the current prototype, the largest CSM configuration pairs an 8.3-billion-parameter backbone with a roughly 300-million-parameter decoder. The system represents audio using two types of learned tokens: semantic tokens, which capture the linguistic content and high-level speech traits, and acoustic tokens, which capture detailed voice characteristics like timbre, pitch, and timing. By operating directly on these discrete audio tokens, CSM can generate speech without an intermediate text-only step. In tests with Sesame’s voice models there is still some latency at times, but this one-stage approach avoids the bottleneck of traditional two-stage TTS systems. It also allows for richer expressivity, since the model can adjust intonation and pacing based on context in near real time.
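To make the backbone-and-decoder split concrete, here is a hypothetical, heavily simplified mock-up of that data flow. It is not Sesame’s code: the class, its dimensions, and the layer choices are illustrative assumptions, the vocabulary and codebook sizes are tiny stand-ins, and the real model generates audio tokens autoregressively rather than in a single pass.

```python
# Hypothetical, heavily simplified sketch of the backbone-plus-decoder layout
# described above, written to illustrate the data flow only. Dimensions are
# tiny stand-ins; the real CSM pairs an ~8.3B-parameter backbone with a
# ~300M-parameter decoder and generates audio tokens autoregressively.
import torch
import torch.nn as nn

class CSMSketch(nn.Module):
    def __init__(self, vocab_size=2048, d_backbone=512, d_decoder=256,
                 num_codebooks=8, codebook_size=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_backbone)
        # "Backbone": the larger transformer that models the interleaved text
        # and semantic audio tokens making up the conversational context.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_backbone, nhead=8, batch_first=True),
            num_layers=2,
        )
        # "Decoder": a much smaller transformer that refines each frame into
        # fine-grained acoustic detail.
        self.project = nn.Linear(d_backbone, d_decoder)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_decoder, nhead=8, batch_first=True),
            num_layers=2,
        )
        # One classification head per acoustic codebook of the audio tokenizer.
        self.codebook_heads = nn.ModuleList(
            [nn.Linear(d_decoder, codebook_size) for _ in range(num_codebooks)]
        )

    def forward(self, tokens):
        # tokens: (batch, seq) of interleaved text + semantic audio token ids.
        h = self.backbone(self.embed(tokens))   # contextual frame representations
        z = self.decoder(self.project(h))       # per-frame acoustic refinement
        return [head(z) for head in self.codebook_heads]

logits = CSMSketch()(torch.randint(0, 2048, (1, 16)))
print(len(logits), logits[0].shape)  # 8 codebooks, logits of shape (1, 16, 1024)
```

Even at toy scale, the sketch shows the division of labor the description above implies: the backbone digests the interleaved conversational context, while the lightweight decoder predicts the stack of acoustic codebook tokens that an audio codec such as Mimi can turn back into a waveform.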
Training data: To give the model its natural speaking ability, Sesame AI trained CSM on a huge corpus of voice data. According to the team, it curated and processed roughly 1 million hours of predominantly English audio recordings for training, as noted on the company’s website. Sesame drew this dataset from publicly available sources, then transcribed and segmented it, letting the model learn both from the words themselves and from how those words are spoken in different situations. Training ran over multiple epochs with long sequence lengths (up to 2,048 tokens, roughly equivalent to two minutes of speech) so that the model could learn long-range conversational dependencies. By exposing the AI to a diverse range of voices, speaking styles, and dialogues, Sesame enabled CSM to absorb nuances of human speech, from subtle emotional inflections to the rhythm of back-and-forth dialogue.
“Voice Presence” in practice: Achieving voice presence is about more than clear pronunciation; it requires the AI voice to exhibit multiple human-like qualities in tandem. Sesame’s research identifies several elements behind natural conversational speech. The first is emotional intelligence: the ability to detect cues of emotion or mood and respond with an appropriate tone (e.g., sounding empathetic when the user is upset). In tests, the model doesn’t always succeed in correctly reading emotion, but it is better at detecting, say, sarcasm than older voice AI models. In addition, Sesame’s AI can handle conversational dynamics, managing the timing and flow of dialogue, such as knowing when to pause, when to interject or yield, and when to use natural filler words or laughter. It is also contextually aware, capable of adjusting its speaking style based on the context and history of the conversation; it can, for instance, maintain a certain formality or reference earlier topics when appropriate. Finally, it can adopt a coherent persona or voice character that remains reliable and appropriate over time, so the user feels they are talking to the same character each time.