Two months ago, I started work on a side project experimenting with artificial speech synthesis using recently published machine learning methods. The pace of advancement in speech quality produced by neural network models following the deep learning revolution in 2012 was impressive, and with the release of the WaveNet paper by Google’s DeepMind in 2016 it was clear that the uncanny valley had been crossed. It now seemed possible to generate artificial speech that native speakers judged nearly indistinguishable from human speech.
I was amazed by the quality demonstrated in the results of the WaveNet paper when it was published, and began casually following the literature, though I wasn’t actively contributing to it. I was interested in the application of speech synthesis to the creative industry — in particular, voice generation for games and films. I grew up a media junkie, and though my academic and professional career wound up taking an engineering path, I have remained passionate about artistic storytelling my whole life.
I decided to explore the space from an entrepreneurial perspective to see what markets might exist around speech synthesis models as content creation tools. I delved a bit deeper into the literature on the state-of-the-art models and decided that a good way to learn how these models worked would be to glue together a handful of existing open-source components to produce a realistic text-to-speech demonstration of a highly recognizable voice. I experimented with a few voices, eventually settling on psychologist, author, and podcaster Jordan Peterson for a few reasons, not the least of which is that I am a fan of his work and style of intellectual discourse, even though I don’t agree with all of his viewpoints.
I started working on the project part-time at the end of June. After about four weeks of learning how the components worked, experimenting with data processing techniques, and training a few different models, I had a pretty convincing speech generator. The next step was to build a tech demo that would let users generate audio from the model using a simple front end — type and listen.
Building the site was actually harder for me than developing the speech model. I needed to learn a few technologies, having never built a website myself before. Getting the site running smoothly took another month. I mention this to emphasize that creating this model was not difficult. All of the technology was readily available. If I can build it in a few weeks, there are surely many engineers more talented than I am who could do it better and faster. We must decide as a society how we want to deal with this fact. More on this later.
An Unexpected Reception
On August 14th, I posted a link to the site (notjordanpeterson.com) on Reddit. By the next day, there were over three thousand active users; by the second day, over ten thousand. I hadn’t expected such demand so quickly, and had to spin up a couple more servers to handle the traffic, though even this wasn’t enough. Wait times to synthesize audio continued to climb each day, reaching as high as 20 minutes to generate 20 seconds of audio, though people didn’t seem to mind the wait. I received almost no complaints — only positive feedback about the site’s entertainment value.
On August 16th, TNW published an article about the site, touting its “eerily realistic” quality, which drove more users to the site. On August 17th, The Thinkery posted a video on its YouTube channel describing the site and the uncanny resemblance of the generated audio to Dr. Peterson’s real voice. The video also raised potential concerns about technologies like this.
I found most of the praise for the quality of the model to be a bit overblown. The fidelity of the generated speech is certainly convincing (a testament to the caliber of the models’ inventors, not any skill on my part), but not indistinguishable from the real thing. In particular, the frequent mispronunciations and general lack of emotional affect make it obvious that the speech is synthetic. I liken the model’s output to Dr. Peterson speaking into a cheap microphone while on a low dose of propofol. It is undoubtedly possible, however, to generate short clips that would fool someone who wasn’t listening for a distinction. Models like this will only improve in the coming months, so indistinguishability from an arbitrary person’s voice is around the corner.
In addition to the articles and videos posted about the site during the week it was up, I received numerous messages from users of the site, almost all of which were positive, some in interesting and unexpected ways. Most people talked about using the site to send friendly jibes to their friends, something I did with my own friends while testing it out, to the glee of the recipients. Others — long-time fans of Dr. Peterson — talked about sending encouraging messages to family in the voice of someone they greatly admired, and how well-received those messages were. Still others discussed generating inspirational quotes from Dr. Peterson addressed directly to them. One person even described how much this had improved his mental health in just a few days and suggested I consider pursuing a partnership with Dr. Peterson to develop a personalized mindfulness app.
To my surprise, Dr. Peterson himself tweeted a link to the Thinkery video not long after its release, with only the pithy commentary “not good”. I interpreted this to mean “not good for society”, though it was difficult to tell whether the topic was only of passing interest to him or represented something of deeper concern. The answer became clear on August 22nd, when Dr. Peterson published a blog post about “deep fake” technologies, citing my website and a few of its YouTube forebears. He posed several relevant and important questions we collectively need to address in the coming years, all of which I and many others have been thinking about as these technologies evolve. What I did not expect was his conclusion, in which he called for “the Deep Fake artists … to be stopped, using whatever legal means are necessary, as soon as possible,” also suggesting that producing this type of content should be a felony offense.
This seemed to me an overreaction. However, out of respect for Dr. Peterson, and in the spirit of having a rational discussion about the implications of this technology, I suspended the functionality of the site for the time being. The remainder of this post is my attempt to articulate some of the complicated issues the existence of this technology raises, and some ways we might go about mitigating its misuse while maximizing its potential for good.
To Regulate or Not to Regulate
Despite the numerous positive examples, there were certainly negative uses of the site, with some people entering offensive text and posting the resulting audio. I need not repeat the examples here, as I’m sure you can imagine them. I have no doubt that Dr. Peterson encountered several of these negative examples from media coverage of the site, which surely colored his initial reaction, though I would be remiss if I didn’t point out that the mainstream media hasn’t needed any machine learning technology thus far to put words in Dr. Peterson’s mouth.
Assigning blame for inappropriate uses of technology to the technology itself is counterproductive, however. A natural reaction to novel technology is to immediately use it toward lewd or immature ends. Recall that two of the top-selling apps in Apple’s App Store immediately after its launch in 2008 were iFart and iBeer, and that was on a curated platform. This inclination comes from ease. Banal content like this is simple to generate, and it receives attention in the short term. The novelty wears off quickly, though, and more creative and productive uses begin to emerge.
Nevertheless, since the media invariably focuses on the negative aspects of new technologies, the subject of regulation looms large. While there is often a need for regulation of some kind when addressing global issues (free markets don’t solve all problems), we also need to act judiciously when considering how to regulate emerging technologies. We need to solve the core problems, and not throw the baby out with the bathwater.
It should be obvious that machine learning isn’t the first technology to have both benign and nefarious implications. It’s difficult to think of any technology that doesn’t exhibit this dichotomy. Fire, guns, the printing press, computers, the internet. Even something as seemingly innocuous as an online social network, where you can share photos and stories with friends, has landed us in a heap of trouble, spurring congressional oversight. How many people in 2004 predicted that?
Whatever the potential perils, the best answer to the regulation of new technology is rarely to strangle it in its crib. Doing so generally breeds black markets and an overabundance of misuse, followed by costly enforcement policies. In cases where the technology is trivially easy to distribute, as it is with software, these effects are amplified. Digital piracy comes immediately to mind. The answer to Napster and its file-sharing descendants wasn’t litigation from the RIAA. It was iTunes. And later Spotify. New business models. New technologies. Positive-sum solutions. Swift and untempered regulation is often a losing battle, if not an incredibly protracted and expensive one. We need to find smarter ways to adapt.
In the specific case of machine learning algorithms that can mimic the likeness of real people (especially public figures), it’s instructive to analyze existing modes of such mimicry to see what, if anything, is different about these new technologies. It is fairly straightforward to produce a convincing photo of Dr. Peterson riding bare-chested on a horse, à la Putin. This has been possible for years. Yet no one is currently flustered by this because people have become accustomed to the existence of Photoshop, and they have learned to be more skeptical of digital images. It will be the same with machine-learning generated audio, after an adjustment period, and eventually video as well. Regulating the technology itself would be akin to holding Adobe accountable for an internet troll’s Peterson/Putin mashup photo.
Consider also impersonations and parody, sans machine learning. If we collectively believe that impersonation in digital media should be illegal even when it is done without intent to commit fraud, for the sake of parody or artistic expression, why have we not shut down Saturday Night Live (SNL), or at least forbidden it to portray any public figures in its sketches? Such content is currently protected under freedom of speech and parody laws, and for good reason. Are we to shut down parody when we don’t like its content?
What is different about using machine learning technologies to create such content? There does indeed seem to be an intuitive distinction at first glance, but closer examination may reveal no meaningful difference. I believe the reasons for this intuition are twofold: accessibility and fidelity.
One characteristic of software is its accessibility: the ease with which it can be transmitted and used. Once a problem has been solved in software, the solution is almost immediately usable by anyone with an internet connection. Obviously this has both pros and cons. If the software is of net benefit to society, its benefits spread quickly. If it’s malicious, the damage is very difficult to contain.
So what about content creation software? If nearly everyone suddenly has the ability to make a convincing video parodying a public figure in a matter of minutes, is that good or bad? Not an easy question to answer. One thing is for sure though — the sheer volume of content that will emerge will result in a redistribution of viewers’ attention from a select few sources of content to a much larger pool, albeit with much higher variability in quality. We have already seen this with the rise of YouTube and streaming content services siphoning viewers’ attention from traditional media networks over the past decade.
As more content of higher caliber becomes available from a wider array of content creators, the public’s attention will necessarily be reapportioned. However, the more clever and thought-provoking creators will inevitably rise to the top, in one of those hierarchies Dr. Peterson regularly lectures about. While accessibility certainly makes any potential misuses more conspicuous, does it change the nature of what is being produced? Parody is parody. If the nature and purpose of the content isn’t changing, should accessibility change the nature of legislation surrounding it? Asking for a friend.
Fidelity is perhaps the more salient reason for people’s unease with this technology. When Alec Baldwin parodies Donald Trump on SNL, it is hilarious, and it is also immediately obvious that he is not actually Donald Trump. What if it weren’t obvious? Aside from the fact that it would probably make the sketch less funny, does it change the nature of the content? Is it no longer parody? We are reaching the point at which fidelity of generated content is high enough that this question needs to be answered, at least for audio. If the answer is that it is no longer parody, but fraud, this seems tantamount to claiming that parody constitutes fair use of someone’s likeness unless you do it really well. This should raise some eyebrows.
As for criminal misuse of this technology — that is, genuine fraud with intent to deceive — these cases should be handled as they always have been under our judicial system. The responsibility should lie with the generator of the content based on their use of it and their intent. It makes sense to police inappropriate use of the content generated by this technology, but not the technology itself. With the invention of the phone came the prank call, and later the fraudulent call. The solution wasn’t to eliminate telephony. Pushing this technology underground can’t be the solution to ensuring that the media we view is authentic.
So what is the solution? Given that the advancement of this technology is inevitable, what are some ways we can protect ourselves against bad actors leveraging its accessibility and fidelity? The root problem that must be solved is source verification. We trust information from sources that have built trust with us. Reputation is paramount. What has changed as a result of technological progress is the ability to counterfeit information — to make it seem as if it were issued by a trusted party, when in fact it was not.
Surely this problem sounds familiar. And surely the solutions we have already developed are equally familiar. Currency exchange has been particularly susceptible to shifts in technology, yet with each shift we quickly find ways to handle bad actors. For centuries, nations have spent tremendous effort to combat counterfeiting of physical currencies. With the arrival of credit cards came credit card fraud, and entire divisions and agencies for fighting it. With the rise of the internet, and the prospect of exchanging currency digitally, came the entire ecosystem of secure digital transactions, backed by cryptographic methods. Never did we consider regressing to a cash-only society.
The solutions to protecting our money have also been readily used to protect our personal information. When you visit a website, your browser tells you whether the site is secure: that the information you are sending will be encrypted, and that the recipient of the data is actually who it claims to be, as verified by a trusted third party. Future developments in blockchain technology may render the trusted third party unnecessary, but the principle is the same. Cryptographic methods will continue to be the solution to source verification when transmitting information electronically.
Over the next few years, I see no way around moving to a communication model in which we cryptographically sign digital media meant to be sources of truth. If a video claims to be from the White House, it will be cryptographically signed by the White House, and there will be software to verify that. If an audio clip claims to be from Jordan Peterson, it will be signed using Dr. Peterson’s private key. If a media clip claims to be from a trusted source but doesn’t have a valid signature for that source, your media player will tell you that, and you can choose to ignore it accordingly.
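To make the sign-then-verify idea concrete, here is a minimal sketch in Python. It is only an illustration and makes several assumptions: it uses a Lamport one-time signature built from SHA-256, chosen purely because it needs nothing beyond the standard library, and the function names (generate_keypair, sign, verify) are hypothetical. A real system would use an established scheme such as Ed25519 from a vetted library, along with some way of distributing public keys that I don’t attempt to show here.

```python
import hashlib
import secrets

BITS = 256  # one secret pair per bit of the SHA-256 digest


def generate_keypair():
    """Create a Lamport one-time keypair (illustrative, not production-grade)."""
    # Private key: 256 pairs of random 32-byte secrets.
    private_key = [
        (secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(BITS)
    ]
    # Public key: the SHA-256 hash of every secret.
    public_key = [
        (hashlib.sha256(a).digest(), hashlib.sha256(b).digest())
        for a, b in private_key
    ]
    return private_key, public_key


def _digest_bits(media: bytes):
    """Hash the media and return the digest as a list of 256 bits."""
    digest = hashlib.sha256(media).digest()
    return [(digest[i // 8] >> (7 - (i % 8))) & 1 for i in range(BITS)]


def sign(private_key, media: bytes):
    """Reveal one secret per digest bit; that list is the signature."""
    return [private_key[i][bit] for i, bit in enumerate(_digest_bits(media))]


def verify(public_key, media: bytes, signature) -> bool:
    """Check that every revealed secret hashes to the published value."""
    if len(signature) != BITS:
        return False
    bits = _digest_bits(media)
    return all(
        hashlib.sha256(secret).digest() == public_key[i][bit]
        for i, (secret, bit) in enumerate(zip(signature, bits))
    )


# The publisher signs a clip; anyone holding the public key can check it.
private_key, public_key = generate_keypair()
clip = b"raw bytes of an audio clip"
signature = sign(private_key, clip)

print(verify(public_key, clip, signature))               # genuine clip
print(verify(public_key, b"altered clip", signature))    # tampered clip
```

The first check succeeds and the second fails, because altering the clip changes its hash and the revealed secrets no longer line up with the public key. Note that a Lamport key can safely sign only one message, which is one of the reasons a production system would use a different scheme.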
Whatever the details of how we implement these verification systems, the fundamental principle is straightforward and essential: we should treat any media we intend to consider a source of truth the same way we treat our money and personal information. Everything else should be considered entertainment, or require further verification from multiple sources.
In short, the solution to source verification is, at least in part, more technology, not litigation and suppression. We must also learn to be more skeptical of digital audio and video from untrusted sources. An opinion piece in the New York Times (coincidentally published on the same day I released my site) expresses the same sentiment. As Dr. Peterson has emphasized in many of his 12 Rules for Life lectures, the general populace is not dumb. There simply exist adjustment periods during the emergence of new technologies. We are entering one of those now. We will adapt intelligently, doomsayers be damned.
An Emerging Landscape
Assuming we do manage to adapt and avoid a post-truth dystopia, what will the landscape of content creation look like over the next few years? Within the domain of speech synthesis, it will be possible and inexpensive in the next three to five years to generate a perfect clone of someone’s voice from a few minutes or less of their speech. It will also be possible to create new synthetic voices by interpolating between existing voice models, allowing content creators to produce the full gamut of variability in human speech, including accents and intonations, in any language.
The voice acting industry will change dramatically as a result of this. A CGI movie can be made today without the use of human actors, with the exception of dialogue. With the rise of synthetic voices, films and video games will increasingly use software tools to generate the dialogue they need in much the same way that they now use graphics software tools for modeling, texturing, animation, and lighting.
Applications outside the creative industry will make extensive use of this technology as well. Call centers, digital assistants, content readers, and advertisements will all deliver highly personalized content using flawless human speech in a listener’s native language. This will also open large markets in localization — narration and dialogue will be instantaneously translatable into any language using the speaker’s native voice with an appropriate accent. Imagine Scarlett Johansson speaking perfect French, Spanish, German, and Italian in the European releases of her future movies.
To the extent that existing celebrities continue to maintain their personae in digital media, the content they produce will be increasingly machine-generated. Personalized content addressed directly to individual fans will be a staple of those stars who wish to keep up with this changing landscape. Machine learning technology will make this possible at greater scale without requiring any of a celebrity’s time. Those who resist this technology rather than embrace it will likely miss out on opportunities for new revenue streams.
More intriguing applications of voice synthesis technology include preserving cherished voices in today’s media, as well as resurrecting the voices of beloved celebrities who have passed on. Consider the joy of having David Attenborough and Morgan Freeman narrate our documentaries for another 100 years, and listening to future newscasts delivered by Walter Cronkite. Whether these scenarios come to pass remains to be seen, but it is certain that the technology will be available to achieve them in the near future.
The potential applications of this technology to consumers’ personal lives are numerous and thought-provoking as well. As an example, one user of my site who messaged me to compliment me on my work added that he couldn’t imagine how much his mom would be willing to pay to hear personalized messages spoken to her by her deceased husband. The “living portraits” of remembered loved ones as described in the Harry Potter novels will not be relegated to the world of fantasy for much longer. Does this concept tug at your heartstrings, or sound like an episode of Black Mirror? Perhaps both? Progress in machine-generated media will raise more and more unusual questions like this in the coming years.
I am currently in the process of starting a company to build next-generation content creation tools for storytellers. Our mission is to empower everyone to tell the best possible versions of their stories by leveraging machine learning to lower the barrier to entry for creating professional-quality media. The experience I’ve gained building this prototype and witnessing people’s reactions to it has been invaluable. While the business models I am exploring don’t revolve around the use of well-known personalities, I still believe the issues discussed above are critical for us to address intelligently as a society. We must find a way to maintain the protections of free speech and parody while minimizing the potential harm from bad actors.
I wrote this post with the hope of stimulating further discussion about the implications of machine-generated content. I look forward to hearing from others who are thinking deeply about the issues I’ve addressed here, and learning from their perspectives.