An example argument with Sesame's CSM, created by Gavin Purcell.
Gavin Purcell, co-host of the AI for Humans podcast, posted an example video on Reddit in which the human pretends to be an embezzler and argues with a boss. The exchange is so dynamic that it's difficult to tell which one is the human and which is the AI model. Judging by our own demo, it's entirely capable of what you see in the video.
“Near-human quality”
Under the hood, Sesame's CSM achieves its realism by using two AI models working together (a backbone and a decoder) based on Meta's Llama architecture, which processes interleaved text and audio. Sesame trained three model sizes; the largest, at 8.3 billion parameters (an 8 billion parameter backbone model plus a 300 million parameter decoder), was trained on roughly 1 million hours of primarily English audio.
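To make the backbone-plus-decoder split concrete, here's a minimal Python sketch of that kind of two-model arrangement: a large transformer turns interleaved tokens into hidden states, and a much smaller module maps those states to audio-codec logits. The class names, dimensions, and codebook framing are illustrative assumptions, not Sesame's published code.

```python
# Hypothetical sketch of a backbone + decoder split, loosely mirroring the
# parameter budget described above (large backbone, small decoder). Shapes
# and vocab/codebook sizes are toy values chosen for illustration.

import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Large transformer over interleaved text/audio tokens (stand-in for the ~8B backbone)."""
    def __init__(self, vocab_size=65536, d_model=512, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, tokens):                    # tokens: (batch, seq)
        return self.encoder(self.embed(tokens))   # hidden states: (batch, seq, d_model)

class Decoder(nn.Module):
    """Small head mapping backbone states to audio-codec tokens (stand-in for the ~0.3B decoder)."""
    def __init__(self, d_model=512, n_codebooks=8, codebook_size=1024):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(d_model, codebook_size) for _ in range(n_codebooks)
        )

    def forward(self, hidden):
        # One logit tensor per codec codebook level.
        return [head(hidden) for head in self.heads]

backbone, decoder = Backbone(), Decoder()
tokens = torch.randint(0, 65536, (1, 16))         # toy interleaved token sequence
logits = decoder(backbone(tokens))
print(len(logits), logits[0].shape)               # 8 codebooks, each (1, 16, 1024)
```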
Sesame's CSM does not follow the traditional two-stage approach used by many earlier text-to-speech systems. Instead of producing semantic tokens (high-level speech representations) and acoustic details (fine-grained audio features) in two separate stages, Sesame's CSM integrates both into a single-stage, multimodal transformer-based model that jointly processes interleaved text and audio tokens to produce speech. OpenAI's voice model uses a similar multimodal approach.
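The contrast between the two designs can be sketched in a few lines of toy Python. The token values and interleaving scheme below are invented purely for illustration; they stand in for the semantic/acoustic tokenizers and sequence layout, which the article does not detail.

```python
# Toy contrast between a two-stage TTS pipeline and a single-stage,
# interleaved design. Token arithmetic here is purely illustrative.

def two_stage(text_tokens):
    """Traditional pipeline: text -> semantic tokens -> acoustic tokens."""
    semantic = [t + 1000 for t in text_tokens]   # stage 1: high-level speech plan
    acoustic = [s + 2000 for s in semantic]      # stage 2: fine-grained audio detail
    return acoustic

def single_stage(text_tokens, audio_tokens):
    """CSM-style: one model sees text and audio tokens in one interleaved stream."""
    interleaved = []
    for txt, aud in zip(text_tokens, audio_tokens):
        interleaved += [txt, aud]                # one sequence, one model, one pass
    return interleaved

print(two_stage([1, 2, 3]))                # [3001, 3002, 3003]
print(single_stage([1, 2, 3], [7, 8, 9]))  # [1, 7, 2, 8, 3, 9]
```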
In blind tests without conversational context, human evaluators showed no clear preference between CSM-generated speech and real human recordings, suggesting the model achieves near-human quality for isolated speech samples. However, when conversational context was provided, evaluators still preferred real human speech, indicating a gap remains in fully contextual speech generation.
Sesame co-founder Brendan Iribe acknowledged the current limitations in a comment on Hacker News, noting that the system is “still too eager and often inappropriate in its tone” and has issues with interruptions, timing, and conversation pacing. “Today, we're firmly in the valley, but we're optimistic we can climb out,” he wrote.