What you hear may not be what you think it is.
I ran an experiment on some friends, and the results were not quite what I expected.
The question was simple: “Is this me?”
The answer, as the clickbait headlines might say, will surprise you.
Trusting your ears
The ability to synthesize audio speech using a sample of someone’s voice has progressed to the point where we really do need to be extra cautious in situations where we might be misled. It’s amazing and useful technology, but the ability to “deep fake” a voice comes with risks.
First, the test
If you watch the video accompanying this article, below, or in fact have watched any of my videos or listened to me on any podcast, you’ll have a sense for what I sound like.
Now, listen to this:
My question to you is simple: is that me? More realistically, if you didn’t realize that audio was part of an article entitled “Don’t Trust Your Ears”, would you think it was me or not?
It’s not. The real me never spoke that; it’s generated audio. All I did was copy/paste the text of the first sentence of this section into a speech synthesis service provided by ElevenLabs, and “I” started speaking.
It’s pretty darned spooky.
Running a similar test on some friends, the results were mixed. Some recognized the fake, some did not.
The next variation
Now listen to this:
Not only did I not speak that, I didn’t write it. I never said any of it.
That’s the result of asking ChatGPT to generate “100 words on the importance of password managers in the style of Leo Notenboom”, and then pasting the resulting text1 into the voice-generation app.
Granted, this one’s easier to detect as fake. Setting aside the fact that, for some reason, ChatGPT generated 197 words instead of the requested 100, the audio cadence and intonation are ever so slightly off in places.2 But this technology is only going to get better. Given the pace of improvement to date, I’d expect significantly better results very soon.
It’s cheap and easy
The service costs a little, but aside from that, the only thing required was for me to upload five minutes of audio of me speaking. (Anything over five minutes is apparently a waste!)
That was all. I could have uploaded one of my podcast episodes or scraped the audio from one of my hundreds of YouTube videos. In this case, I went for maximum audio quality and recorded myself narrating one of my personal blog posts.
Oh, and I promised that I had the rights to use this voice.
You can guess where this is going.
Anyone can do it… to anyone
If you can collect five minutes of decent audio from someone talking, you’re done. You have their voice. You can lie about having the right to use it,3 and you can now have that person say anything.
We’ve seen (or rather heard) this happen already to politicians and a few famous actors. Usually it’s in the realm of video deep fakes, which remain somewhat easier to identify.
However, audio could be enough to cause issues.
Some financial institutions use voice ID for authentication. The phrase “my voice is my password” is commonly used. A security researcher used the tool to synthesize his voice saying that phrase, and it worked: he was able to access his account via voice ID without ever actually speaking the phrase.
Hopefully banks will discontinue this technique quickly.
I don’t want to paint too bleak a picture here. There are some amazing uses for this technology. Consider Stephen Hawking speaking in his “real” voice rather than the old stilted computer-generated voice he used in his later years. Or consider Roger Ebert, the famous film critic, who lost his voice to cancer; at one point Apple went to great lengths to create a text-to-speech model of his voice so his wife could hear him speak as she remembered him. Now those “great lengths” are more like a few minutes that anyone can take.
It really comes down to skepticism. “Recognizing” a voice may no longer be enough to know for certain who you’re talking to on a phone. Imagine a scammer running the “I’ve lost my passport and need money” scam, only this time using a synthesized copy of the victim’s grandchild’s voice.
Be skeptical. Be amazed — but as always, and as with so many things, be alert.
Footnotes & References
1: The ChatGPT-generated text:
Password managers: essential tools for today’s digital landscape. Leo Notenboom here, and I’m here to emphasize the importance of these handy tools in our ever-connected world.
With the Internet continuously growing and the number of online accounts multiplying, it’s become nearly impossible to remember complex, unique passwords for each service. This is where password managers come to the rescue, ensuring we maintain strong online security.
These tools create, store, and auto-fill complex passwords, adding a layer of protection against cyber threats. They encourage the use of unique, hard-to-guess passwords, thus minimizing the risk of a single breach affecting multiple accounts.
Moreover, password managers reduce the likelihood of falling prey to phishing attacks. By automatically filling in credentials on legitimate websites, they train users to avoid manually entering passwords on potentially malicious sites.
In summary, password managers are vital tools in the digital era. They enhance our online security by promoting strong, unique passwords and shielding us from various cyber threats. In a world where personal information is more vulnerable than ever, the adoption of password managers is a crucial step towards safeguarding our digital identities. So, what are you waiting for? Get yourself a password manager today!
2: And I’d never use the word moreover.
3: I’m NOT suggesting you do this. I’m also not a lawyer, so the legalities are beyond me, but they seem scary and complex.
6 comments on “Don’t Trust Your Ears”
I could hear a lot of errors in inflection, but it still would have fooled me. It sounded like you doing an emotionless reading of the text which is believable. I had a few college professors whose monotone voices could have been flawlessly deep-faked.
This may spawn a new forensic science, if it hasn’t already, especially in crimes involving ransom demands. Audiobooks could easily be pirated, too.
In that recording of “you” talking about password managers, the voice was believable but the inflections and monotone speaking didn’t sound right. It sounded like you were reading it without your normal emphasis on certain words.
It is interesting that someone’s voice can be so easily faked and made to sound so realistic.
I noticed that the inflections were off, but it sounded to me as if Leo were reading the text. Leo reads with more inflection, but many people read with a lack of inflection similar to Leo’s fake. This technology will improve over time, probably a short time, and the fakes will be indistinguishable from the real thing.
In politics, recordings have had real impact. Now, or in the very near future, it will be possible to produce audio and video of anyone saying anything. Then, when a genuine recording of something embarrassing surfaces, the subject could simply claim “AI fake!” and no one would be able to prove otherwise.
Truth is dead.
Scary and disturbing.