The Booming Voice AI Landscape: A VC Perspective

October 16, 2024

Voice AI applications will unlock $10B of new software TAM over the next five years
Bessemer Venture Partners

Remember when talking to machines felt like science fiction? Those of you old enough to remember the ‘Google Duplex’ demo (which turned out to be fake) might recall the astonishment that tech could sound that natural. Well, that future is now knocking on our door. ChatGPT’s Advanced Voice Mode and ElevenLabs are setting new benchmarks in conversational AI by enhancing voice quality and realism, NotebookLM’s natural-sounding voice podcasts took the Internet by storm, and new open-source technologies are making high-quality voice cloning easier than ever.

Like many tech breakthroughs, it’s bringing unprecedented opportunities for startups. As a VC watching this space, I’m seeing a perfect storm brewing: massive investment, breakthrough technologies, and untapped markets ripe for disruption. But it’s also not free of challenges – from powerful incumbents to questions about the dark side of these technologies.

In this post I have tried to collate the best thinking about Voice AI, standing on the shoulders of research published by Lightspeed, A16Z, Bessemer and others, and bringing in examples that I found compelling. If you get the chance, watch some of the videos to get a sense of how far the technology has come. Let’s dive in!

The State of Play: Voice AI in 2024

In 2024, about a third of all venture capital funding has been going into AI companies. Most of that investment (dollar-wise) has gone to companies building foundational AI models, which raised over $23 billion, with voice technology being a key beneficiary. This includes OpenAI’s latest round of $6.6 billion (the largest VC round in history). But substantial investments are also being deployed into emerging startups, particularly those building vertical applications. This trend is evident in the success of companies like DeepL (translation), Speak (language learning), and Retell AI (call centres). Sierra AI, founded by Bret Taylor (former co-CEO of Salesforce, former CTO of Facebook and current chairman of OpenAI), is currently raising hundreds of millions of dollars at a $4 billion valuation, just a year or so after launching AI voice agents for companies.

For the second quarter in a row, AI was the top sector by venture dollars invested. And funding to AI companies has grown this year not just in absolute dollars invested, but also as a proportion of total funding. (source: Crunchbase)

But what’s more interesting is how the technology is being deployed. First, it’s worth taking a look at the most up-to-date landscapes, and then diving into the trends.

The latest landscape in the Voice AI space was published by Lightspeed. It provides a comprehensive overview of the current state of voice technology and how it developed over time.

Another deep dive on Voice AI was recently published by A16Z, with a particular focus on voice AI agents and the desire to automate/reinvent the phone call. It’s particularly interesting to think about voice AI in terms of the tech stack needed to build the voice engines, but note that the application layer (for both B2B and B2C apps) sits on top of that tech stack, so application builders don’t need to build the full infrastructure themselves.

The landscape is still relatively small, but growing. On the B2B side, business voice applications have progressed significantly, from rudimentary interactive voice response (IVR) systems in the 1970s to sophisticated conversational AI systems powered by LLMs. Large players entering the AI agent space are starting to acquire companies in this space (or build their own solutions). In the landscape below, Israeli startup Tenyx was recently acquired by Salesforce for an undisclosed sum.

On the B2C side, with advancements in real-time conversational AI, businesses can now deliver seamless, interactive voice experiences that feel increasingly natural and personalised. For example, Speak and Praktika, which use voice AI for language learning, grew very quickly to over $20M in revenue in the last 12 months.

Bessemer makes a bold prediction that Voice AI applications will drive $10 billion in new software TAM over the next five years. While early Voice AI companies focused on Automatic Speech Recognition (ASR), a new generation is emerging with conversational voice solutions that handle repetitive tasks. These advancements enable professionals in sales, recruiting, customer support, and administrative roles to concentrate on more strategic, high-value activities.

Emerging trends in Voice AI

Real-time AI audio agents and live conversations: this trend coincided with the launch of OpenAI’s Advanced Voice Mode, which enables users to have a real-time voice conversation with the chatbot, and even get it to sing. I’ve yet to try it personally, but the demos I’ve seen online have been very impressive. Another example is Bland AI, a startup whose AI agents can handle sales and customer service calls.

Google is building a real-time voice assistant called Project Astra, which aims to deliver real-time multimodal user interaction by seeing the world and communicating with the user in natural language. Imagine if Siri and Alexa could do this!

Multi-Modal Innovation: The integration of voice with other AI capabilities is creating new possibilities. OpenAI’s voice mode isn’t just about speech – it’s about natural, contextual conversations. Google’s Illuminate and NotebookLM are great examples of taking content that is primarily text and turning it into a human-sounding podcast-style conversation between two people.

Democratisation of Voice Tech Tools: ElevenLabs, the leader in the space, is pushing boundaries in voice synthesis, making AI characters sound increasingly human and available to any developer via API. The company is 2 years old and is reportedly doing $80M ARR per TechCrunch.

Another example is Cartesia AI. It enables creating real-time, multi-modal AI systems that can function independently of cloud connectivity, thereby enhancing privacy and reducing latency.

What once required massive resources can now be accomplished with open-source tools and modest computing power. Case in point: Ethan Mollick recently shared a thread on how he cloned his voice using e2-f5-tts running locally (using Pinokio) with only 10 seconds of original voice recording. This democratisation is driving innovation at the edges. Think about the products and services people can come up with next.

You really, really should not trust audio clips anymore

Even a couple months ago, it used to take a commercial service to clone a voice. No more. Here is me creating a voice clone of myself using just a 10 second reference clip on my home computer

This is all real time, no cuts pic.twitter.com/0OlBcMcYbm
— Ethan Mollick (@emollick) October 14, 2024

The ElevenLabs Reader App. Listen to any article, PDF, ePub, or any text on the go with the highest quality AI voices.

Vertical Applications Taking Off. A large portion of the funding and innovation in voice AI is concentrated on applications for specific industry verticals.

Healthcare (remote patient monitoring, mental health support), such as Suki, which raised $70M earlier this month
Education (language learning, personalised tutoring), such as Speak, which raised a Series B-3 round in July at a $500 million valuation
Customer Service (intelligent voice agents), such as Ada
Entertainment (gaming, interactive content), such as Volley, which creates AI voice games and recently raised a $55M Series C, or Respeecher AI, which can change voices for AI filmmaking or help you license celebrity voices

Opportunities for Startups: Focusing on Niche Solutions

Despite the dominance of giants like OpenAI and Google, startups have ample room to innovate by focusing on niches. Here’s where startups can find room to grow:

Industry Specialisation: Vertical AI applications are transforming industries by leveraging domain-specific data and AI models to address specialised use cases. This includes a wide range of verticals such as in-car entertainment, hospitality, commerce, personal health and financial services.
Agentic Automation for Enterprise Functions: Generative AI agents are being deployed to automate complex business processes across various functions. As A16Z pointed out, there’s a huge opportunity in automating phone calls, especially those that have a predictable flow. This can include customer service (although this space is getting very crowded), sales and marketing, IT helpdesk and meeting management. Think virtual employees for hire.
Consumer Cloud Applications: Bessemer forecasts that AI-driven content, including voice, will dominate by 2030. AI is revitalising the consumer cloud market, creating opportunities for startups building applications that leverage voice and other modalities. From voice-enabled content creation to social media or education, users are willing to pay for high-quality interactions, whether to reduce loneliness or to be entertained. Google paid $2.6 billion to re-hire the founders of Character.ai, and I could see a voice-enabled version of that platform coming in the near future. Would you pay $1 to have a phone call with a virtual Elon Musk? Napoleon? Mahatma Gandhi?
Innovating On-Device: On-device processing requires balancing performance with power consumption and device resources. As mentioned in the example of Cartesia, enabling users to access voice AI applications via the phone is crucial, as it’s the most natural way for consumers to use voice and has the widest availability. That said, there are also opportunities in other connected devices like home assistants, TVs, watches and car entertainment systems.

Ethical Challenges and Market Considerations

The rapid growth of voice AI presents notable challenges:

Competition from AI Giants: Startups face competition from large, well-funded companies like OpenAI, Google, and Microsoft, which are developing sophisticated voice and translation models and have vast amounts of data as well as distribution advantages.
Technical Hurdles: Ensuring the accuracy of speech recognition and language understanding is essential for reliable performance. Another component of this technical challenge is voice quality: AI voices that sound ‘robotic’ can be disappointing for users.
Latency and Cost: Training and deploying sophisticated voice models can be computationally expensive. Current architectures often involve multiple steps (speech to text, text processing, text to speech) that can introduce delays and make voice interactions costly. Reducing latency to sub-250 milliseconds is crucial for natural-sounding conversations.
Ethical and IP Concerns: With the proliferation of voice cloning and tokenised speech, startups must address ethical concerns proactively to ensure responsible development and deployment. There’s a fairly good chance that bad actors are already using the latest voice technology for malicious purposes.
Data Privacy and Security: Voice data is highly sensitive and subject to regulations like GDPR. Startups need to prioritise data security and privacy to maintain user trust and comply with legal requirements.
Managing Human-AI Interaction: Voice AI applications need to be designed to hand off seamlessly to human agents when necessary, for example in healthcare or customer service. It’s important to keep a human in the loop and maintain a high level of quality control.
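
To make the latency point above more concrete, here is a rough back-of-the-envelope sketch of a latency budget for the multi-step architecture (speech to text, text processing, text to speech). The per-stage numbers are purely hypothetical placeholders, not measurements of any real product, and the `Stage` class and helper functions are names made up for this illustration. The only figure taken from the text is the sub-250 millisecond target.

```python
from dataclasses import dataclass

# The sub-250 ms target for natural-sounding conversations mentioned above.
TARGET_MS = 250.0


@dataclass(frozen=True)
class Stage:
    """One step in a cascaded voice pipeline (hypothetical helper type)."""
    name: str
    latency_ms: float  # illustrative figure only, not a benchmark


def total_latency_ms(stages):
    """Stages run one after another, so their delays simply add up."""
    return sum(stage.latency_ms for stage in stages)


def over_budget_ms(stages, target_ms=TARGET_MS):
    """How far the pipeline exceeds the target (0.0 if it fits)."""
    return max(0.0, total_latency_ms(stages) - target_ms)


def slowest_stage(stages):
    """The stage that contributes the most delay."""
    return max(stages, key=lambda stage: stage.latency_ms)


# Hypothetical per-stage latencies, chosen only to illustrate the arithmetic.
cascaded = [
    Stage("speech_to_text", 120.0),
    Stage("text_processing", 200.0),
    Stage("text_to_speech", 100.0),
]

print("total latency (ms):", total_latency_ms(cascaded))
print("over budget by (ms):", over_budget_ms(cascaded))
print("slowest stage:", slowest_stage(cascaded).name)
```

The takeaway is simply that delays in a sequential pipeline add up, so even if each step looks fast in isolation, the whole round trip can still miss the target. This is one reason the sub-250 millisecond bar is hard to clear with a multi-step design.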

A Call to Action: Innovating in Voice AI

The voice AI revolution is unfolding, and startups operating at the application layer can benefit from a more robust infrastructure they can build on. This is a pivotal moment for startups to innovate, collaborate, and shape the future of voice technology.

At Remagine Ventures, we invest in pre-seed startups in Israel and the UK. If you’re a founder building the future of AI Voice applications/agents, we’d love to hear from you.

Author

Eze Vidra

Co Founder and Managing Partner at Remagine Ventures

Eze is managing partner of Remagine Ventures, a seed fund investing in ambitious founders at the intersection of tech, entertainment, gaming and commerce with a spotlight on Israel.

I'm a former general partner at Google Ventures, head of Google for Entrepreneurs in Europe and founding head of Campus London, Google's first physical hub for startups.

I'm also the founder of Techbikers, a non-profit bringing together the startup ecosystem on cycling challenges in support of Room to Read. Since inception in 2012 we've built 11 schools and 50 libraries in the developing world.
