On average, men and women speak roughly 15,000 words per day. We call our friends and family, log into Zoom for meetings with our colleagues, discuss our days with our loved ones, or, if you're like me, argue with the ref about a bad call they made in the playoffs.
Hospitality, travel, IoT and the auto industry are all on the cusp of leveling up voice assistant adoption and the monetization of voice. The global voice and speech recognition market is expected to grow at a CAGR of 17.2% from 2019 to reach $26.8 billion by 2025, according to Meticulous Research. Companies like Amazon and Apple will accelerate this growth as they leverage ambient computing capabilities, which will continue to push voice forward as a primary interface.
As voice technologies become ubiquitous, companies are turning their focus to the value of the data latent in these new channels. Microsoft's recent acquisition of Nuance is not just about achieving better NLP or voice assistant technology; it's also about the trove of healthcare data that its conversational AI has collected.
Our voice technologies have not been engineered to confront the messiness of the real world or the cacophony of our actual lives.
Google has monetized every click of your mouse, and the same thing is now happening with voice. Advertisers have found that speak-through conversion rates are higher than click-through conversion rates. Brands need to begin developing voice strategies to reach customers — or risk being left behind.
Voice tech adoption was already on the rise, but with most of the world under lockdown protocols during the COVID-19 pandemic, adoption is set to skyrocket. Nearly 40% of internet users in the U.S. used smart speakers at least monthly in 2020, according to Insider Intelligence.
Yet several fundamental technology barriers keep us from realizing voice technology's full potential.
In 2020, worldwide shipments of wearable devices rose 27.2% from a year earlier to 153.5 million, but despite all the progress made in voice technologies and their integration into a plethora of end-user devices, they are still largely limited to simple tasks. That is finally starting to change as consumers demand more from these interactions and voice becomes a more essential interface.
In 2018, in-car shoppers spent $230 billion to order food, coffee, groceries or items to pick up at a store. The auto industry is one of the earliest adopters of voice AI, but to capture voice technology's true potential, it needs to become a more seamless, truly hands-free experience. Ambient car noise still muddies the signal enough that it keeps users tethered to their phones.
In the customer service industry, your accent dictates many aspects of your job. It shouldn’t be the case that there’s a “better” or “worse” accent, but in today’s global economy (though who knows about tomorrow’s) it’s valuable to sound American or British. While many undergo accent neutralization training, Sanas is a startup with another approach (and a $5.5 million seed round): using speech recognition and synthesis to change the speaker’s accent in near real time.
The company has trained a machine learning algorithm to quickly and locally (that is, without using the cloud) recognize a person’s speech on one end and, on the other, output the same words with an accent chosen from a list or automatically detected from the other person’s speech.
It slots right into the OS's sound stack so it works out of the box with pretty much any audio or video calling tool. Right now the company is operating a pilot program with thousands of people in locations from the U.S. and U.K. to the Philippines, India and Latin America. Accents supported will include American, Spanish, British, Indian, Filipino and Australian by the end of the year.
To tell the truth, the idea of Sanas kind of bothered me at first. It felt like a concession to bigoted people who consider their accent superior and think others below them. Tech will fix it … by accommodating the bigots. Great!
But while I still have a little bit of that feeling, I can see there's more to it than this. Fundamentally speaking, it is easier to understand someone when they speak in an accent similar to your own. But customer service and tech support is a huge industry, and one primarily performed by people outside the countries where the customers are. This basic disconnect can be remedied in a way that puts the onus on the entry-level worker, or one that puts it on technology. Either way, the difficulty of making oneself understood remains and must be addressed — an automated system just lets it be done more easily and allows more people to do their job.
It’s not magic — as you can tell in this clip, the character and cadence of the person’s voice is only partly retained and the result is considerably more artificial sounding:
But the technology is improving and like any speech engine, the more it’s used, the better it gets. And for someone not used to the original speaker’s accent, the American-accented version may very well be more easily understood. For the person in the support role, this likely means better outcomes for their calls — everyone wins. Sanas told me that the pilots are just starting so there are no numbers available from this deployment yet, but testing has suggested a considerable reduction of error rates and increase in call efficiency.
It’s good enough at any rate to attract a $5.5 million seed round, with participation from Human Capital, General Catalyst, Quiet Capital and DN Capital.
“Sanas is striving to make communication easy and free from friction, so people can speak confidently and understand each other, wherever they are and whoever they are trying to communicate with,” CEO Maxim Serebryakov said in the press release announcing the funding. It’s hard to disagree with that mission.
While the cultural and ethical questions of accents and power differentials are unlikely to ever go away, Sanas is trying something new that may be a powerful tool for the many people who must communicate professionally and find their speech patterns are an obstacle to that. It’s an approach worth exploring and discussing even if in a perfect world we would simply understand one another better.
"Voice skins" have become a very popular feature for AI-based voice assistants, helping to personalize the helpful yet often bland and robotic voices you get on services like Alexa. Now a startup that builds voice skins for companies to use across their services — and for third parties to create and apply as well — is raising funding to fuel its growth.
LOVO, the Berkeley, California-based artificial intelligence (AI) voice and synthetic speech tool developer, this week closed a $4.5 million pre-Series A round led by South Korea's Kakao Entertainment along with Kakao Investment and LG CNS, an IT solution affiliate of LG Group.
Previous investor SkyDeck Fund and a private investor, DoorDash vice president of finance Michael Kim, also joined the round.
The proceeds will be used to propel its research and development in artificial intelligence and synthetic speech and grow the team.
“We plan on hiring heavily across all functions, from machine learning, artificial intelligence and product development to marketing and business development. The fund will also be allocated to securing resources such as GPUs and CPUs,” co-founder and chief operating officer Tom Lee told TechCrunch.
LOVO, founded in November 2019, has 17 people including both co-founders, chief executive officer Charlie Choi and COO Lee.
The company plans to double down on improving LOVO's AI model, enhancing its AI voices and developing a product that surpasses any currently on the market, Lee said.
"Our goal is to be a global leader in providing AI voices that touch people's hearts and emotions. We want to democratize content production. We want to be the platform for all things voice-related," Lee said.
With that mission, LOVO allows enterprises and individual content creators to generate voiceover content for use in marketing, e-learning, customer support, movies, games, chatbots and augmented reality (AR) and virtual reality (VR).
"Since our launch a little over a year ago, users have created more than 5 million pieces of voice content on our platform," co-founder and CEO Choi said.
LOVO launched its first product, LOVO Studio, last year, an easy-to-use application for individuals and businesses to find the voice they want, then produce and publish their voiceover content. Developers can use LOVO's Voiceover API to turn text into speech in real time within their own applications. Users can also create their own AI voices simply by reading 15 minutes of script via LOVO's DIY Voice Cloning service.
LOVO offers more than 200 voice skins, categorized by language, style and situation to suit users' various needs.
The global text-to-speech (TTS) market is estimated at $3 billion, with the global voiceover market at around $10 billion, according to Lee. The global TTS market is projected to grow to $5.61 billion by 2028 from $1.94 billion in 2020, based on Research Interviewer's report published in August 2021.
LOVO has already secured 50,000 users and more than 50 enterprise customers, including U.S.-based J.B. Hunt, Bouncer, CPA Canada, LG CNS and South Korea's Shinhan Bank, Lee said.
LOVO's four core markets are marketing, education, movies and games in entertainment, and AR/VR, Lee said. Spiral, the latest film in the Saw series, features LOVO's voice, he noted.
LOVO is expected to create additional synergies in the entertainment industry in the wake of the latest funding from a South Korean entertainment company.
J.H. Ryu, VP of the CEO Vision Office at Kakao Entertainment, said, "I'm excited for LOVO's synergies with Kakao Entertainment's future endeavors in the entertainment vertical, especially with web novels and music." Ryu added, "AI technology is opening the doors to a new market for audio content, and we expect a future where an individual's voice will be utilized effectively as an intellectual property and as an asset."
Chon Tang, founding partner at SkyDeck Fund, said, "Audio is uniquely engaging as a form of information but also difficult to produce, especially at scale. LOVO's artificial intelligence-based synthesis platform has consistently outperformed other cloud-based solutions in quality and cost."
LOVO is also preparing to push further into international markets. "We have a strong presence in the U.S., U.K., Canada, Australia and New Zealand, and are getting signals from the rest of Europe, South America and Asia," Lee said. LOVO has an office in South Korea and is looking to expand into Europe soon, he added.