Everything About OpenAI’s Voice Engine, Which Can Mimic People’s Voices

Artificial intelligence (AI) is developing at such a fast pace that we hear regular updates on new advancements every month or so.

In fact, AI has become so advanced that companies might start hiring fewer people because of it.

With AI like OpenAI’s ChatGPT and Google’s Gemini able to write for you and even generate images, many human functions are starting to be replaced by technology.

OpenAI recently announced its newest tool: an AI that can mimic human voices with uncanny accuracy.

OpenAI’s Voice Engine

The company that brought us ChatGPT is bringing us yet another innovative AI after showing off their video generator, Sora.

OpenAI’s AI voice generator, Voice Engine, has a range of potential applications, including accessibility services.

On 29 March, the company shared samples from early tests of the tool, which uses a 15-second sample of someone speaking to generate a convincing replica of their voice.

Isn’t that amazing? 15 seconds is all it needs!

Visit this blog post to hear the generated audio clips, and be awed by how human-like they sound.

Voice Engine is currently accessible only to a “small group of trusted partners,” including education and health technology companies.

OpenAI will use their tests to determine whether and how to allow more widespread adoption of the service.

While OpenAI is not the first to launch an AI-generated voice service, the company has proven to be particularly good at garnering widespread adoption of their AI tools, as we saw with ChatGPT.

Other companies offering AI-generated voice services include ElevenLabs, PlayHT, Murf AI, and Resemble AI.

How Voice Engine Can Be Used

Reading assistance

OpenAI says Voice Engine will be able to provide “reading assistance to non-readers and children through natural-sounding, emotive voices representing a wider range of speakers than what’s possible with preset voices.”

An education technology company, Age of Learning, has been using Voice Engine to generate pre-scripted voice-over content. It also pairs Voice Engine with GPT-4 to create real-time, personalized responses when interacting with students.

Junior college students in Singapore are all-too-familiar with online lectures that have to be watched every week.

With AI voice generation services, it could become even easier for teachers to create online lectures, freeing up time and energy for other things.

Translating Content

Voice Engine will also be able to provide assistance in translating content, like videos and podcasts, so creators and businesses can reach more people around the world, fluently and in their own voices.

The aforementioned blog post includes an example of an audio clip of a human reading an English passage about friendship, alongside AI-generated audio that sounds like the same person reading the same passage in Spanish, Chinese, German, French, and Japanese.

In each of the AI-generated samples, the tone and accent of the original speaker are maintained.

If you listened to the five AI-generated samples alongside the original English recording, without knowing which was which beforehand, you probably wouldn’t be able to tell the reference audio apart from the generated ones.

When your ahma scolds you in dialect, it doesn’t hurt very much because you don’t understand what she’s saying. Wait till OpenAI opens this to the public and she starts translating dialect into English for you to understand every word when she scolds you.

Reaching Global Communities

Voice Engine will also help to improve essential service delivery in remote settings.

Dimagi is building tools for community health workers to provide a variety of essential services, such as counseling for breastfeeding mothers.

To help these workers develop their skills, Dimagi uses Voice Engine and GPT-4 to give interactive feedback in each worker’s primary language, including Swahili, or in more informal languages like Sheng, a code-mixed language (primarily Swahili- and English-based) popular in Kenya.

Singlish next?

Supporting Non-verbal People

Individuals with conditions that affect speech and those with learning needs will be able to benefit from Voice Engine.

People who are non-verbal will be able to access unique and non-robotic voices across many languages through Livox, an AI alternative communication app, allowing these individuals with disabilities to communicate.

Their users can choose speech that best represents them, and for multilingual users, maintain a consistent voice across each spoken language.

Helping Patients Recover Their Voice

Voice Engine will be able to help those suffering from sudden or degenerative speech conditions recover their voice.

The Norman Prince Neurosciences Institute at Lifespan, a not-for-profit health system that serves as the primary teaching affiliate of Brown University’s medical school, has been piloting a programme offering Voice Engine to some individuals with speech impairment.

Miraculously, since Voice Engine only requires a short 15-second audio sample, doctors Fatima Mirza, Rohaid Ali, and Konstantina Svokos were able to restore the voice of a young patient who had lost her fluent speech due to a vascular brain tumour. They used audio from a video she had recorded for a school project.

Technology Is A Double-edged Sword

We know all too well that technology can be both a boon and a bane.

AI-generated voice services could very much fuel the creation and spread of disinformation or make it easier to perpetrate scams.

A lot of us hang up on calls from unknown numbers when the person on the other end has a robotic voice, because we know it’s likely a scam.

But what if those scammers use Voice Engine to generate human-like voices? Will more people fall prey to such scams, then?

Moreover, with AI-generated voices sounding this human-like, how much easier will it be for anyone to make deepfake videos believable?

We’ve already seen deepfakes of Donald Trump and Taylor Swift speaking in Chinese. The deepfakes sounded incredibly real and believable, causing many netizens to believe the two actually spoke in Chinese.

It is all too easy for anyone with access to AI voice-generators to create deepfake videos that may sway public opinion on important matters, government affairs, and more.

As such services become more readily available, will more and more people start to use these for sinister purposes?

“We recognize that generating speech that resembles people’s voices has serious risks, which are especially top of mind in an election year,” OpenAI said.

The company does not plan to release Voice Engine to the public immediately.

They do, however, encourage steps such as:

  • Phasing out voice-based authentication as a security measure for accessing bank accounts and other sensitive information
  • Exploring policies to protect the use of individuals’ voices in AI
  • Educating the public in understanding the capabilities and limitations of AI technologies, including the possibility of deceptive AI content
  • Accelerating the development and adoption of techniques for tracking the origin of audiovisual content, so it’s always clear when you’re interacting with a real person or with an AI

How Voice Engine Compares To ChatGPT

Some may compare Voice Engine to ChatGPT, as OpenAI recently announced a new read-aloud feature for the latter.

The new ChatGPT feature allows the chatbot to read out its answers in five different realistic voices.

The main difference between the two: ChatGPT’s read-aloud feature is essentially text-to-speech, while Voice Engine analyses an individual’s voice to generate audio that sounds just like the reference speaker.

While ChatGPT can read you a bedtime story, it definitely does not have Voice Engine’s ability to read it to you in the voice of your mother.

Or in the voice of a soothing therapist, if your mother’s usual voice is just screaming at the top of her lungs.

Moreover, OpenAI has stated that “the model is proficient at transcribing English text but performs poorly with some other languages, especially those with non-roman script.”

Non-English users are advised against using ChatGPT for this purpose.

On the other hand, Voice Engine has shown it is capable of performing well with other languages, as seen in the sample recordings on the blog post.

ChatGPT’s read aloud feature has already been rolled out and can be used via the web version of ChatGPT, as well as the iOS and Android versions of the app.

While ChatGPT can’t speak in many different voices like Voice Engine, it performs the simple task of reading aloud rather well.

So, the next time you’re too lazy to read a long report, grab a cup of tea, sit down in a comfy chair, and let ChatGPT do the reading for you.

Unless you want it read in your own voice, that is. In that case, you’ll have to wait quite a while before Voice Engine is released to the public.

OpenAI’s Other Brainchild: Sora, An AI Video Generator

Just two months ago, OpenAI introduced a new AI model, Sora, which it claims can create “realistic” and “imaginative” 60-second videos from quick text prompts.

While such a tool could cause various problems, the company said it plans to work with experts to test the latest model and look closely at various areas including misinformation, hateful content, and bias.

OpenAI also said it is building tools to help detect misleading information.

Sora will first be made available to cybersecurity professionals who can assess the product for harms and risks.

A number of visual artists, designers, and filmmakers will also have access to Sora, allowing the company to collect feedback on how creative professionals could use it.

With so many developments in AI happening at the speed of light, there’s no telling what will happen next. Maybe we’ll be able to use holograms to FaceTime people, or maybe the education system will change so much we can barely recognise it in five years’ time.

Or maybe AI will destroy humanity #justsaying

Or maybe not.