FEED issue 31 Web

START-UP ALLEY Respeecher

RESPEECHER

COUNTRY: UKRAINE STARTED: 2018

Many content generation applications that leverage AI focus on the visual, the end result of which can be unsettlingly life-like deep fakes. However, Kyiv-based Respeecher is focused firmly on creating life-like mimics in the audio world. The two-year-old company – founded by CEO Alex Serdiuk and deep learning engineers Grant Reaber and Dmytro Bielievtsov – offers an AI-based toolkit that enables voice cloning, allowing one actor to perform in the voice of another. One of the firm’s early projects was the resurrection of the voice of a recently deceased Ukrainian singer at a hackathon event, which left some members of the audience convinced that he was still alive.

The start-up’s speech-to-speech synthesis produces audio that is much higher in quality than text-to-speech, Serdiuk claims. He adds that, following the first release of the product at the beginning of 2019, Respeecher was used on a “major Hollywood blockbuster”, albeit under NDA. The firm was also among a hand-picked selection of start-ups to showcase their wares at the Cannes (virtual) Film Festival this year, and Serdiuk is particularly enthused by film use cases.

“We’re meeting a lot of documentary filmmakers because the product is such a great tool to bring back voices from the past, and to make history more ‘visual’ – in an acoustic way,” he says. One such recent project involved providing the simulated voice of Richard Nixon in an alternative history of the US Moon landings, produced by MIT’s Center for Advanced Virtuality.

Other use cases, Serdiuk says, could include matching the subject’s voice in a biopic – he uses the recent film Bohemian Rhapsody as an example. “The actor who played Freddie Mercury really looked like him, but the voice wasn’t there – voice synthesis would have added an extra level of authenticity,” he says. In features such as Gemini Man, where CGI was used to make Will Smith look younger, his voice was not age-matched – something that is also claimed for Respeecher.

The system’s ability to reduce accents, coupled with the work the company is currently doing to get the product to work in real time, could also lead to call centre applications, Serdiuk adds.

But will the ability to synthesise voice talent put real actors out of work? Serdiuk is not convinced that unions will be threatened by the technology. He says use of a voice is only permitted with written consent from the voice owner, and the technology is generally used as a tool to scale the voices of in-demand actors, potentially leaving more work available to the less well-known. “Unions are much more threatened about text-to-speech companies in this regard because they don’t require any human input,” Serdiuk notes. “What’s special about our system is that we take all the emotional content from a real human and transfer it to our system, and change the voice, so we keep the human in the loop,” he adds.

At the moment this process takes between two weeks and a month, and costs are structured on a project-by-project basis, depending on the level of servicing a client requires (whispers and screams, for example, can take longer). Prices are comparable to CGI, Serdiuk adds, “because what we are doing is effectively CGI for sound”. However, he also wants to reach smaller companies with pricing that fits their budgets.

The start-up’s unique offering has already attracted around $1.5m in funding, with bigger VCs coming on board after the firm was accepted into seed accelerator Techstars’ 2019 cohort.

While Serdiuk isn’t keen on the term ‘deep fake,’ Respeecher has often been referred to as a ‘deep fake for good’ company, openly pledging to help fight fake content created for ill gains. Serdiuk reveals that the company is currently working on an acoustic watermarking technique for all the sound that it produces, so that it is distinguishable as a product of Respeecher – something that the firm may release to the wider industry further down the line. The firm is also offering high-quality data sets to companies that work on speech detectors, while developing a detector of its own. “We recognise it is actually our responsibility to help educate society that this technology is a great tool for content creators but it can bring misuse, and we should be ready for that.”

SOUNDS GOOD Respeecher’s technology is supposed to be able to imitate voices so accurately, even top Hollywood sound engineers are foxed...

