WTF is an OSARO?

As you may or may not know, and may or may not be interested in, I work at an artificial intelligence startup called OSARO. We use machine learning to develop intelligent software for applications in industrial automation. Our software, when installed on a physical system like a factory robot, learns to perform tasks that have thus far been too difficult to program using traditional software engineering methods. Essentially, we build brains for factory robots that until recently have had very little of them.

What you are less likely to know, and even less likely to be interested in, is what our company name means. This post is for those of you that are intrigued—both of you. OSARO is actually an acronym, each letter of which corresponds to a component of a continuous process of learning and decision-making that nearly every living organism with a nervous system, including you, engages in throughout the entirety of its life. This process is called reinforcement learning, or, more colloquially, trial-and-error learning.

In a nutshell, reinforcement learning consists of trying something out and seeing what happens. If the outcome is good, do it again. If the outcome is bad, try something else. If you encounter a situation similar to one you’ve been in before, do the thing that felt good, not the thing that felt bad. This is a gross oversimplification, but at its heart this principle describes how most living things that move in the world make their decisions. OSARO’s mission is to design software that mimics this process of learning and decision-making to produce robotic systems that exhibit intelligent and robust behavior.

So what does this have to do with OSARO’s name? The learning and decision-making process for systems that employ the reinforcement learning framework is often represented abstractly by a sequence of steps that repeat ad infinitum. OSARO’s name is an abbreviation of the components that comprise each step of that sequence: Observation, State, Action, Reward, Observation. OSARO. What does each of these components signify, and how does each relate to producing intelligent behavior?
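If you prefer to see that loop written down, here is a bare-bones sketch in Python. The `env` and `agent` objects and their methods are hypothetical stand-ins, loosely modeled on the reset/step interface common in reinforcement learning libraries; this is an illustration of the abstract cycle, not a description of any particular system.

```python
# Illustrative only: a generic agent-environment loop.
# `env` and `agent` are hypothetical objects with the methods shown below.

def run_episode(env, agent):
    observation = env.reset()                       # Observation: raw sensory data
    state = agent.update_state(observation)         # State: a compact summary
    total_reward = 0.0
    done = False
    while not done:
        action = agent.choose_action(state)             # Action: try something
        observation, reward, done = env.step(action)    # Reward + next Observation
        next_state = agent.update_state(observation)
        agent.learn(state, action, reward, next_state)  # trial and error
        state = next_state
        total_reward += reward
    return total_reward
```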

Observation

You experience the world around you as a rich array of sensations that begin as signals generated by your sensory organs—eyes, ears, skin, etc. Everything you are aware of (and much you are not aware of) is a summary of the data entering your body from the neurons in these organs. These sensory neurons pass information to your brain about the stimuli acting on them—light on your retina, changes in air pressure on your eardrum, temperature and pressure on your skin, tension in your muscles, etc. We collectively call these raw data observations. Without these data, your brain would have no information to process; there would be nothing to summarize, no world to make sense of. You’d be a brain in a jar. It would be very boring.

Your brain is continuously processing signals from billions of these sensory neurons hundreds of times per second and trying to make sense of it all. This is big data—very high-dimensional and very noisy—far too much to make decisions based on the raw data themselves. The world is too complex and unpredictable at this level of understanding, akin to drinking from a firehose. To make decisions quickly, you need some way of throwing away the irrelevant details, keeping the relevant ones, and summarizing them concisely. Your brain must turn these fast-changing, highly detailed data (e.g., light intensities signaled by the cells in your retina that change every few milliseconds) into slow-changing, high-level data that are easier to reason about (e.g., seeing a fly landing on your sandwich). In short, you need state.

State

All intelligent entities need some way to concisely represent the complex world they live in. In the same way that we can represent numbers in many different ways (counting beads, hash marks, Roman numerals, decimals), there are innumerable ways to summarize and represent the torrent of data (observations) coming from your sensory organs. Some representations are more useful than others, depending on your objectives. Evolution has selected for representations that can summarize sensory data very quickly in ways that improve an organism’s chances of survival and reproduction. Brains are thus, in part, data compression machines—naturally evolved summarizers.

The complex representations that humans and animals form of the world around them are determined in part by genetics, which determines the physical structure of the brain and its early neuronal connections, and in part by experience, which modifies those connections in response to sensory data. We call the latter modification learning. An infant has a primitive and not especially useful representation of the world because it hasn’t yet experienced enough data to have learned how to summarize them usefully. It must rely only on innate mechanisms of compression that evolved to function right out of the womb, and these are limited in scope. During the first few months of life, more sophisticated representations slowly take form out of the “blooming, buzzing confusion” of data that initially seem to have no coherent structure. While it’s difficult to imagine as an adult what it feels like to be an infant that can’t make sense of what it’s looking at, an example may help illustrate the point.

Take a quick look at the picture below. If you’ve seen it before, bear with me. If you haven’t, don’t look too hard at it just yet. Just glance at it. Do you see anything other than a bunch of black splotches on a white background? If not, that’s great; this example will be even more compelling to you. Before you look more closely, try to hold in your mind what it feels like not to recognize what’s in the picture—how it feels to just see splotches.


Ok. Now look closer. Do you see any common objects you can identify? Some patterns that look roughly like something you’ve seen before? Blurring your vision a little may help. How about the lower part of a big tree in the upper left, casting a large shadow on the ground below it? And what about in the middle of the scene? Do you see a dog, facing away from you? A Dalmatian perhaps? With its head lowered and sniffing the ground? Keep looking and you’ll eventually see it. If not, search “hidden dalmatian” in Google images and you’ll find some versions with the dog highlighted.

Now think about the moment when you recognized the dog. Did it feel like it sort of “snapped” into place? An “aha” moment? What happened there? The image didn’t change. Nothing got uncovered in the image or came into focus on your retina that wasn’t there before you recognized the dog.  Your brain was processing the same pattern of information coming in through your retina, the same observations, trying desperately to make sense of those black splotches. Initially, it put forth the hypothesis that it really was just a picture of black splotches with no inherent meaning. But after some time it came up with another hypothesis that made more sense given your prior experiences with dogs and trees, and was still consistent with the data (your observations). You understood what you were looking at. Your state changed.

What’s even more compelling about the hypothesis your brain just came up with is just how strongly it will hold onto it once confirmed. Remember when I asked you to try to remember what it felt like not to see the dog? Try to go back to that mental state. Try to look at the image now and not see the dog. I’m willing to bet you can’t. You can remember that you couldn’t see it before, but you can’t not see it now.

This is a contrived example to make you consciously aware of perceptual changes of state, but the reality is that this is what your brain is doing all of the time. This process is just what it normally feels like to observe the world around you and understand it. You just don’t notice it because your brain is so good at it, and things in the real world are usually not so ambiguous—though indeed this is because your brain has learned to become very good at disambiguating things that you see often.

For the first few weeks of life, newborns likely interpret their surroundings in a manner similar to the way you first interpreted the image above—lots of splotchy patterns moving around their visual field, with no meaning or structure to any of it. Over time, as they are presented with more data from their environment, abstract representations of these data that are consistent with the world’s inherent structure slowly take form and become “obvious”.

The problem the brain is solving when forming these abstract representations is known as representation learning. There are many theories about exactly how the brain does this, but we are far from a comprehensive answer. Because of this lack of understanding, it’s also very difficult to reverse engineer an artificial system that solves this problem, despite massive research efforts in artificial intelligence over the last 70 years. We’re making slow but steady progress, however. Recent advances in a class of approaches based on neural network models, collectively termed “deep learning”, have produced several impressive results in just the past few years. There is still a long road ahead before we’ll be able to design systems that exhibit general intelligence in the way that humans do. In the meantime we can leverage these intermediate solutions to incrementally improve the capabilities and robustness of robotic systems—precisely the approach we take at OSARO.

As interesting as the problem of representation learning is, it’s even more interesting to ask why our brains should even bother to solve it. What’s the point of state? Why bother compressing the data we observe? Why is it important to summarize the complicated light patterns on your retina as “a fly landing on my sandwich”?  The answer is provided by our next component of interest.

Action

We wouldn’t bother to try to predict and understand the world around us if we couldn’t do anything with that understanding. We predict, we understand, so that we can act. We act in order to change our environment to improve our situation—that nasty fly on your sandwich must be swatted away. Action is the raison d’être of intelligence.

Consider living things that don’t act—such as plants. We generally don’t consider them to display much intelligence. Although plants have some physiological mechanisms for adapting to their environment minimally over long time scales, they generally can’t move in the sense that we commonly think of movement, and consequently don’t have nervous systems. There’s no reason for them to summarize their surroundings because there is nothing they can do with that information. Consider an even more curious organism, the sea squirt.

[Image: a sea squirt, shown in its free-swimming larval form (upper) and its sessile adult form (lower)]

Its early life is spent as a free-swimming, tadpole-like larva, as shown in the upper image above. In this form, it has a primitive nervous system connected to a single eye and a long tail. During this phase of its life, it needs to navigate its watery environment searching for food and a place to settle for the second phase of its life. It thus needs to take in sensory information, process it, and do something appropriate (e.g., move toward food). Once it reaches its second phase of life, it attaches itself to a hard surface and becomes a filter feeder, destined never to move again. In this form, as shown in the lower image above, it simply consumes whatever food particles happen to float into its mouth. At this point, it promptly digests its brain and most of its nervous system, since they are now nothing but a drain on precious resources. Waste not, want not.

These examples highlight the fact that all of the work our brain does to learn a representation of our environment is in service of predicting what we will likely observe next—including the consequences of our own actions—so that we can act accordingly to influence our surroundings. The better and further into the future we can predict, the greater the level of control we have over our environment, and the easier it is to improve our situation. What differs between species of animals is the sophistication of those prediction mechanisms and how they are tailored to the body of the organism—the ways that it can sense its environment and affect its surroundings.

Now that we know why our brains try to summarize the world in ways that let us do useful things, it’s natural to ask what we should do at any given moment.  Why choose any one action over another?  What does it mean to improve our situation? This is where our next component of interest comes in.

Reward

There are a functionally infinite number of things you could be doing at any given moment of your life. You have hundreds of muscles you can use to affect the world around you, each of which you can contract and release in an astronomically large number of combinations every fraction of a second. How does your brain decide which of them to choose? We don’t just take random actions at every moment and hope for the best. We take deliberate actions to achieve goals. This is the hallmark of intelligence.


Where do these goals come from?  The answer comes in part from specialized regions of the brain that produce reward signals. These signals tell the rest of the brain how “good” or “bad” your current state is. We choose actions that take us to states yielding positive rewards, and away from states yielding negative rewards. Our goals—acquiring food when we’re hungry, running from predators when threatened, etc.—derive from this principle.

But this raises a further question. Where do rewards come from? What determines what is good and what is bad? The process of evolution provides the bedrock here. Darwinian natural selection has spent hundreds of millions of years tuning the innate reward-generating circuits in brains to direct organisms’ behavior toward states that increase their likelihood of reproducing. Reproducing of course requires staying alive, and so being well-fed, not thirsty, and safe from predators are highly rewarding states. Being in the good graces of attractive mates also increases an organism’s likelihood of passing on its genes, hence all of the positive rewards associated with sex.

Given these innate “ground truth” signals provided by evolution about goodness and badness, biological reinforcement learning systems learn, through trial and error, which actions to take so as to maximize the sum of the rewards they expect to receive in the future. This is the primary objective of a reinforcement learning system—accumulate the most reward possible over time by taking appropriate actions.
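In the standard mathematical formulation of reinforcement learning (textbook notation, not anything OSARO-specific), "the sum of the rewards expected in the future" is usually written as a discounted return, where a factor between 0 and 1 weights near-term rewards more heavily than distant ones:

```latex
G_t \;=\; r_{t+1} + \gamma\, r_{t+2} + \gamma^2 r_{t+3} + \cdots \;=\; \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}, \qquad 0 \le \gamma < 1
```

The learner's objective is to choose actions so as to maximize the expected value of this quantity.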

This is actually quite a difficult problem because rewards don’t often come immediately or reliably after taking a single action in a given state. Often very long sequences of actions must be taken, passing through many states, before any appreciable reward is received. A reinforcement learning system must figure out which of the actions it took along a path through a set of states were responsible for the rewards it received along that path—a problem known as credit assignment.  There are well-established mathematical models of how this problem might be solved in biological systems, with evidence from experimental neuroscience to support them. The field of computational reinforcement learning, an area of active research for the past several decades, explores the implementation of these models in artificial systems such as robots.
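One of the best-known of those models is temporal-difference learning, in which a reward-prediction-error signal nudges value estimates after every step, gradually propagating credit backward along the paths that led to reward. A minimal tabular sketch in Python (textbook Q-learning, not OSARO's production code) looks like this:

```python
import random
from collections import defaultdict

# Q maps each state to a dictionary of estimated action values, all initially zero.
Q = defaultdict(lambda: defaultdict(float))

def q_learning_update(state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """One temporal-difference update: nudge the value of (state, action) toward
    the reward just received plus the discounted value of the best action
    available in the next state. Repeated over many trials, this spreads credit
    for delayed rewards back along the paths that produced them."""
    best_next = max(Q[next_state].values(), default=0.0)
    td_error = reward + gamma * best_next - Q[state][action]
    Q[state][action] += alpha * td_error

def choose_action(state, actions, epsilon=0.1):
    """Mostly exploit the best-known action; occasionally explore a random one."""
    if random.random() < epsilon or not Q[state]:
        return random.choice(actions)
    return max(Q[state], key=Q[state].get)
```

The `td_error` term plays the role of the reward prediction error that the neuroscience evidence mentioned above is concerned with.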

Since evolution is not the guiding process when designing artificial systems in this paradigm, a critical question arises: where should the rewards in such AI systems come from? That’s the 64-trillion dollar question we must answer in the coming decades. Extracting the maximum economic value from truly intelligent robotic systems will center around solving what is known as the value alignment problem. In short, we must design AI systems such that their reward functions encourage behavior that is beneficial to humans. What “beneficial” means precisely and how to account for complex tradeoffs when these systems must interact with humans are open questions in the field of AI. These issues are too complex to address here, but there are many ideas about how we can begin to tackle this problem. The only thing most researchers agree on right now is that we had better get this right.

At OSARO, the tasks our robots perform are constrained to industrial environments with strict safety protocols and little direct human interaction. As such, the reward functions we employ can be fairly simple. In a moment I’ll explain in more detail how OSARO’s software incorporates the framework I’ve been outlining. But first, let’s finish our journey through OSARO’s name, as we have now come to the final component of interest.

Observation

I know, I know. We covered this already. But this section is actually about something just as important as observations themselves. Something so important to our reality that Einstein spent much of his life contemplating it: time. Recall that the components I’ve been describing comprise each step in a sequence that spans your entire life; indeed, that sequence is your life. The steps in this sequence occur in your brain hundreds of times per second. Observation, state, action, reward, observation, state, action, reward, observation, state, action, reward… you get the picture. Listing observation again here emphasizes the central role that time plays in reinforcement learning.

To understand the world, we must learn representations that both summarize the history of our observations and allow us to make predictions about future observations, given some sequence of actions we intend to take. We must also correlate these sequences of action with the states and rewards they yield so that we will know how to behave in the future, and these correlations are learned through trial and error. All of these requirements for intelligent behavior are inextricably linked to the passage of time. Reinforcement learning is inherently a process.

Even state representations themselves often require a notion of time. Single observations are generally not sufficient to disambiguate what state you’re in at any given moment. In general we must experience several observations in sequence in order to understand what’s happening around us. Take the concept of object permanence, for example. This is simply the idea that an object in your visual field that moves out of view (e.g., behind another object) doesn’t cease to exist. Rather, your brain integrates the history of observations received when the object was in view, and uses that history to maintain the position of the object in your model of the world even after your observations no longer contain any data relating to it (i.e., when it’s no longer visible). This seems like quite an obvious concept, but it’s something that humans don’t learn until they are around 6 months old. When you think about it, that’s a pretty long time to think that your mom ceases to exist every time she plays peek-a-boo with you.
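One way to make "state summarizes history" concrete is a recurrent update: the new state is computed from the previous state plus the latest observation. The toy sketch below is purely illustrative, with made-up field names, but it shows how object permanence falls out of this pattern almost for free.

```python
def update_state(previous_state, observation):
    """Toy recurrent update: new state = f(previous state, latest observation).
    An object seen earlier stays in the state's memory even after the current
    observation no longer contains any trace of it."""
    remembered = dict(previous_state.get("remembered_objects", {}))
    for obj, position in observation.get("visible_objects", {}).items():
        remembered[obj] = position  # refresh the position of whatever is in view
    return {"remembered_objects": remembered}

state = {"remembered_objects": {}}
state = update_state(state, {"visible_objects": {"keys": (0.2, 1.3)}})
state = update_state(state, {"visible_objects": {}})       # keys now out of sight
assert state["remembered_objects"]["keys"] == (0.2, 1.3)   # ...but not forgotten
```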

Environments in which a single observation is sufficient to tell you what state you’re in are said to have what is called the Markov property. This means that you can define your state representation to be your current observation without fear of misclassifying your true state. Reinforcement learning has been successfully applied to environments with the Markov property for decades now, in particular to games like Chess and Go. A fairly recent success was DeepMind’s AlphaZero, which used reinforcement learning to achieve super-human performance in Chess, Go, and Shogi, all in a matter of hours. While this was indeed an impressive result in the reinforcement learning community, the fact that these environments are Markovian certainly helped a lot.
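Stated formally (in standard notation from the reinforcement learning literature, not anything introduced earlier in this post), the Markov property says that the next state depends only on the current state and action, not on the rest of the history:

```latex
P(s_{t+1} \mid s_t, a_t) \;=\; P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0)
```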

The real world is decidedly non-Markovian. Take a look in front of you. Are your house keys in sight? If not, consider that they might be in your pocket, or in your desk drawer, or under the couch cushions. All of these possibilities might be valid states of your world right now, but they are all consistent with your current observations because your keys are not currently in view. You may remember where your keys are, but that’s because of history—previous observations you experienced that showed your keys to be in your pocket, or desk, or elsewhere. Time is an integral part of how we perceive the world and act in it, and the letters bookending OSARO’s name pay homage to this fact.


So now you know what OSARO means, and that each of its letters stands for a critical component of a framework for learning and decision-making in biological and artificial intelligent systems. A big challenge for artificial reinforcement learning systems over the past couple of decades has been to apply them successfully to real-world problems—in particular robotic manipulation tasks. This is a challenge we deal with daily at OSARO. One of OSARO’s key solutions, our piece-picking system, learns to pick objects from a source bin full of consumer products and place them into destination bins that then get packed into boxes and shipped. Check out the clip below for an example of our piece picking system doing its thing.


How does OSARO apply the reinforcement learning framework to the industrial automation tasks we solve for our customers? We can use our piece-picking solution as an example and take things one component at a time again to illustrate.

Observation

Our software runs on various robotic platforms, but each provides our system with streams of information from which to learn. In the case of piece-picking, every robot cell has one or more cameras that provide the system with visual observations—a view of the source and destination bins from various angles. The robot arm also provides information about the position and orientation of the arm at each moment in time, as well as the amount of force being exerted on the end-effector (the tool attached to the end of the arm that actually does the picking). We often use suction-based end-effectors (as in the clip above), which additionally provide an observation signal indicating degree of air flow. This can be used to infer whether the end-effector is currently holding an object or not. All of these observation signals generate quite a lot of data for our algorithms to process. Deciding where and how to pick or place an object requires aggregating them into a compact, useful state representation.
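To give a feel for the shape of these data, here is a rough sketch of what a single snapshot of observations might look like for a cell like the one described above. The field names and types are hypothetical, chosen for illustration rather than taken from our codebase.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class CellObservation:
    """One snapshot of raw sensor data from a picking cell (hypothetical schema)."""
    timestamp: float
    camera_images: List[np.ndarray]  # frames of the source and destination bins
    joint_positions: np.ndarray      # robot arm joint angles, in radians
    end_effector_pose: np.ndarray    # position and orientation of the tool
    end_effector_force: float        # force measured at the tool, in newtons
    suction_air_flow: float          # high flow usually means nothing is held
```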

State

Our software leverages state-of-the-art machine learning techniques to go from streams of observation data to meaningful, stable descriptors of the system’s state. For instance, images of the pick and place bins are passed through neural networks that learn to identify where the source and destination bins are in 3-dimensional space so that the system knows where it can move the arm to pick and place objects without colliding with the bins.  Other neural networks are trained on images of the contents of the bins to learn things like how full the bins are, where the relevant objects that may need to be picked are located, and exactly where on those objects would be the most effective place for the robot to attempt a pick. These components of the state representation are aggregated together and used by the system to make decisions about where and how to pick and place objects safely and efficiently.
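Continuing the illustrative sketch from the previous section, the state-building step might be summarized like this; `bin_detector` and `grasp_model` are hypothetical placeholders standing in for the learned models described above.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class PickState:
    """Compact summary distilled from raw observations (hypothetical schema)."""
    source_bin_pose: np.ndarray         # 3D pose of the bin to pick from
    destination_bin_pose: np.ndarray    # 3D pose of the bin to place into
    source_bin_fill: float              # estimate of how full the source bin is
    grasp_candidates: List[np.ndarray]  # ranked 3D points worth attempting

def build_state(obs, bin_detector, grasp_model):
    """Run (placeholder) trained networks over the camera images and
    aggregate their outputs into a single state object."""
    source_pose, dest_pose = bin_detector.locate_bins(obs.camera_images)
    fill = grasp_model.estimate_fill(obs.camera_images)
    candidates = grasp_model.propose_grasps(obs.camera_images)
    return PickState(source_pose, dest_pose, fill, candidates)
```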

Action

The system has many actions to choose from for any given pick request. These include selecting which object in the bin to grasp, which end-effector tool will be the most effective, the angle at which to pick up the object, and how close to get to the object during approach before slowing down to avoid damaging it. In other tasks, where precise placement position and orientation are also required, the system has even more actions to choose from during the placement motion, such as where to place the object and at what orientation, what height to release it from, and how fast to move when carrying it. The number of possible actions is too large to program a simple set of rules for selecting among them—hence the need to learn appropriate behavior via trial and error.
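A pick action in this setting is really a bundle of decisions made together. A hypothetical parameterization (field names invented for illustration, not drawn from our software) might look like this:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PickAction:
    """One fully specified pick attempt (an illustrative parameterization only)."""
    target_object_id: int    # which detected object to go for
    end_effector: str        # which tool to use, e.g. a particular suction cup
    grasp_point: np.ndarray  # 3D point on the object at which to make contact
    approach_angle: float    # angle of approach, in radians
    approach_speed: float    # how fast to close the final few centimeters
```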

Reward

We guide the system to learn to behave as we would like it to by defining its reward function. The system receives positive rewards for successful picks, which occur when the target object is picked up and placed successfully in the destination bin without being damaged. Negative rewards are given for failures like unsuccessful pick attempts, dropping an object outside of a bin, picking up more than one object, or damaging an object. The system also receives more reward for doing its job faster, and is thus incentivized to pick as fast as possible without damaging or dropping objects.
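A reward function of the kind just described might be sketched as follows. The event names and numbers here are invented for this illustration and are not the values our system actually uses.

```python
def pick_reward(outcome, pick_duration_seconds):
    """Illustrative reward for a single pick attempt (hypothetical values)."""
    base = {
        "placed_successfully":  1.0,
        "missed_pick":         -0.5,
        "picked_multiple":     -0.5,
        "dropped_outside_bin": -1.0,
        "item_damaged":        -2.0,
    }.get(outcome, 0.0)
    # Faster successful picks earn a small bonus, so speed is rewarded only
    # when it doesn't come at the cost of drops or damage.
    speed_bonus = 0.1 / max(pick_duration_seconds, 1.0) if outcome == "placed_successfully" else 0.0
    return base + speed_bonus
```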

Observation

Recall that learning and decision-making in our system is an ongoing process, unfolding over time. Each attempt at a pick provides a new opportunity for the system to learn about the effects of its actions. When it succeeds, it adjusts its behavior to perform similarly in similar situations. When it fails, it tries something different the next time it’s in a similar situation. The system initially fails often, but improves its speed and accuracy as it continues to attempt picks and stumbles across good solutions that it remembers for the future.  This is the primary advantage of a learning-based system—as it performs its duties it continues to adapt and improve.

Hopefully this outline gives you a clear sense of how OSARO applies the reinforcement learning paradigm to real-world robotic systems. OSARO has deployed multiple instances of our piece-picking solution around the world, all of which are continuously collecting new data and improving the system as a whole. We are also actively working on adapting our software to solve more complex automation problems, including several challenging industrial assembly tasks, such as automotive assembly.

It’s an exciting time to be working on reinforcement learning systems that solve real problems facing many manufacturers and distributors around the world. These systems can be trained to perform repetitive yet difficult manipulation tasks efficiently and inexpensively. Increasingly many countries are encountering shortages of reliable human labor for such tasks, threatening critical supply chains and economic stability. The recent devastating disruption to the global supply chain caused by the COVID-19 pandemic highlights the immense potential for intelligent robotic systems to bolster the stability of logistics infrastructure around the world, and consequently the global economy. OSARO is proud to be at the forefront of advancing the capabilities of these systems to help realize this potential to improve the standard of living for billions of people.

The Strange Future of Digital Media

Two months ago, I started work on a side project experimenting with artificial speech synthesis using recently published machine learning methods. The pace of advancement in speech quality produced by neural network models following the deep learning revolution in 2012 was impressive, and with the release of the WaveNet paper by Google’s DeepMind in 2016 it was clear that the uncanny valley had been crossed. It was now possible to generate artificial speech indistinguishable from human speech, as judged by native speakers.

I was amazed by the quality demonstrated in the results of the WaveNet paper when it was published, and began casually following the literature, though I wasn’t actively contributing to it. I was interested in the application of speech synthesis to the creative industry — in particular, voice generation for games and films. I grew up a media junkie, and though my academic and professional career wound up taking an engineering path, I have remained passionate about artistic storytelling my whole life.

I decided to explore the space from an entrepreneurial perspective to see what markets might exist around speech synthesis models as content creation tools. I delved a bit deeper into the literature involving the state-of-the-art models and decided that a good way to learn about how these models worked would be to glue together a handful of existing open-source components to produce a realistic text-to-speech demonstration of a highly recognizable voice. I experimented with a few voices, eventually settling on psychologist, author, and podcaster Jordan Peterson for a few reasons, not the least of which being that I am a fan of his work and style of intellectual discourse, even though I don’t agree with all of his viewpoints.

I started working on the project part-time at the end of June. After about four weeks of learning how the components worked, experimenting with data processing techniques, and training a few different models, I had a pretty convincing speech generator. The next step was to build a tech demo that would let users generate audio from the model using a simple front end — type and listen.

Building the site was actually harder for me than developing the speech model. I needed to learn a few technologies, having never built a website myself before. Getting the site running smoothly took another month. I mention this to emphasize that creating this model was not difficult. All of the technology was readily available. If I can build it in a few weeks, there are surely many engineers more talented than I am who could do it better and faster. We must decide as a society how we want to deal with this fact. More on this later.

An Unexpected Reception

On August 14th, I posted a link to the site on Reddit. By the next day, there were over three thousand active users; by the second day, over ten thousand. I hadn’t expected such demand so quickly, and had to spin up a couple more servers to handle the traffic, though even this wasn’t enough. Wait times to synthesize audio continued to climb each day, reaching as high as 20 minutes to generate 20 seconds of audio, though people didn’t seem to mind the wait. I received almost no complaints — only positive feedback about the site’s entertainment value.

On August 16th, TNW wrote an article about the site, touting its “eerily realistic” quality, which drove more users to the site. On August 17th, a YouTube video on The Thinkery’s channel was posted, describing the site and the uncanny resemblance of the generated audio to Dr. Peterson’s real voice. The video also raised potential concerns over technologies like this.

I found most of the praise for the quality of the model to be a bit overblown. The fidelity of the generated speech is certainly convincing (a testament to the caliber of the models’ inventors, not any skill on my part), but not indistinguishable from the real thing. In particular, the frequent mispronunciations and general lack of emotional affect make it obvious that the speech is synthetic. I liken the model’s output to Dr. Peterson speaking into a cheap microphone while on a low dose of propofol. It is undoubtedly possible, however, to generate short clips that would fool someone who wasn’t listening for a distinction. Models like this will only improve in the coming months, so indistinguishability from an arbitrary person’s voice is around the corner.

In addition to the articles and videos posted about the site during the week it was up, I received numerous messages from users of the site, almost all of which were positive, some in interesting and unexpected ways. Most people talked about using the site to send friendly jibes to their friends, something I did with my own friends while testing it out, to the glee of the recipients. Others — long-time fans of Dr. Peterson — talked about sending encouraging messages to family in the voice of someone they greatly admired, and how well-received those messages were. Still others discussed generating inspirational quotes from Dr. Peterson addressed directly to them. One person even described how much this had improved his mental health in just a few days and suggested I consider pursuing a partnership with Dr. Peterson to develop a personalized mindfulness app.

To my surprise, Dr. Peterson himself tweeted a link to the Thinkery video not long after its release, with only the pithy commentary “not good”. I interpreted this to mean “not good for society”, though it was difficult to tell whether the topic was only of passing interest to him, or represented something of deeper concern. This became clear on August 22nd, when Dr. Peterson published a blog post about “deep fake” technologies, citing my website and a few of its YouTube forebears. He posed several relevant and important questions we collectively need to address in the coming years, all of which I and many others have been thinking about as these technologies evolve. What I did not expect was his conclusion, in which he called for “the Deep Fake artists … to be stopped, using whatever legal means are necessary, as soon as possible,” also suggesting that this type of content should be a felony offense.

This seemed to me an overreaction. However, out of respect for Dr. Peterson, and in the spirit of having a rational discussion about the implications of this technology, I suspended the functionality of the site for the time being. The remainder of this post is my attempt to articulate some of the complicated issues the existence of this technology raises, and some ways we might go about mitigating its misuse while maximizing its potential for good.

To Regulate or Not to Regulate

Despite the numerous positive examples, there were certainly negative uses of the site, with some people entering offensive text and posting the resulting audio. I need not repeat the examples here, as I’m sure you can imagine them. I have no doubt that Dr. Peterson encountered several of these negative examples from media coverage of the site, which surely colored his initial reaction, though I would be remiss if I didn’t point out that the mainstream media hasn’t needed any machine learning technology thus far to put words in Dr. Peterson’s mouth.

Assigning blame for inappropriate uses of technology to the technology itself is counterproductive, however. A natural reaction to novel technology is to immediately use it toward lewd or immature ends. Recall that two of the top-selling apps in Apple’s App Store immediately after its launch in 2008 were iFart and iBeer — and that was on a curated platform. This inclination comes from facility. It’s easy to generate banal content like this, and it receives attention in the short term. Despite this tendency, however, the novelty wears off quickly and more creative and productive uses begin to emerge.

Nevertheless, since the media invariably focuses on the negative aspects of new technologies, the subject of regulation looms large. While there is often need for regulation of some kind when addressing global issues (free markets don’t solve all problems), we also need to act judiciously when considering how to regulate emerging technologies. We need to solve the core problems, and not throw the baby out with the bathwater.

It should be obvious that machine learning isn’t the first technology to have both benign and nefarious implications. It’s difficult to think of any technology that doesn’t exhibit this dichotomy. Fire, guns, the printing press, computers, the internet. Even something as seemingly innocuous as an online social network, where you can share photos and stories with friends, has landed us in a heap of trouble, spurring congressional oversight. How many people in 2004 predicted that?

Whatever the potential perils, the best answer to the regulation of new technology is rarely to strangle it in its crib. This in general breeds black markets and an overabundance of misuse, followed by costly enforcement policies. In cases where the technology is trivially easy to distribute, as it is with software, these effects are amplified. Digital piracy comes immediately to mind. The answer to Napster and its file-sharing descendants wasn’t litigation from the RIAA. It was iTunes. And later Spotify. New business models. New technologies. Positive sum solutions. Swift and untempered regulation is often a losing battle, if not an incredibly protracted and expensive one. We need to find smarter ways to adapt.

In the specific case of machine learning algorithms that can mimic the likeness of real people (especially public figures), it’s instructive to analyze existing modes of such mimicry to see what, if anything, is different about these new technologies. It is fairly straightforward to produce a convincing photo of Dr. Peterson riding bare-chested on a horse, à la Putin. This has been possible for years. Yet no one is currently flustered by this because people have become accustomed to the existence of Photoshop, and they have learned to be more skeptical of digital images. It will be the same with machine-learning generated audio, after an adjustment period, and eventually video as well. Regulating the technology itself would be akin to holding Adobe accountable for an internet troll’s Peterson/Putin mashup photo.

Consider also impersonations and parody, sans machine learning. If we collectively believe that impersonation in digital media without intent to commit fraud, but rather for the sake of parody or artistic expression, should be illegal, why have we not shut down Saturday Night Live (SNL), or at least forbidden it to portray any public figures in its sketches? Such content is currently protected under freedom of speech and parody laws, and for good reason. Are we to shut down parody when we don’t like its content?

What is different about using machine learning technologies to create such content? There does indeed seem to be an intuitive distinction at first glance, but examining it more closely may reveal no real difference. I believe the reasons for this intuition are twofold — accessibility and fidelity.

One characteristic of software is the ease with which it can be transmitted and used — its accessibility. Once a solution is found for a problem through software, it is almost immediately usable by anyone with an internet connection. Obviously this has both pros and cons. If the software is of net benefit to society, it means we get better faster. If it’s malicious, it is very difficult to correct for.

So what about content creation software? If nearly everyone suddenly has the ability to make a convincing video parodying a public figure in a matter of minutes, is that good or bad? Not an easy question to answer. One thing is for sure though — the sheer volume of content that will emerge will result in a redistribution of viewers’ attention from a select few sources of content to a much larger pool, albeit with much higher variability in quality. We have already seen this with the rise of YouTube and streaming content services siphoning viewers’ attention from traditional media networks over the past decade.

As more content of higher caliber becomes available from a wider array of content creators, the public’s attention will necessarily be reapportioned. However, the more clever and thought-provoking creators will inevitably rise to the top, in one of those hierarchies Dr. Peterson regularly lectures about. While accessibility certainly makes any potential misuses more conspicuous, does it change the nature of what is being produced? Parody is parody. If the nature and purpose of the content isn’t changing, should accessibility change the nature of legislation surrounding it? Asking for a friend.

Fidelity is perhaps the more salient reason for people’s unease with this technology. When Alec Baldwin parodies Donald Trump on SNL, it is hilarious, and it is also immediately obvious that he is not actually Donald Trump. What if it weren’t obvious? Aside from the fact that it would probably make the sketch less funny, does it change the nature of the content? Is it no longer parody? We are reaching the point at which fidelity of generated content is high enough that this question needs to be answered, at least for audio. If the answer is that it is no longer parody, but fraud, this seems tantamount to claiming that parody constitutes fair use of someone’s likeness unless you do it really well. This should raise some eyebrows.

Bad Actors

As for criminal misuse of this technology — that is, genuine fraud with intent to deceive — these cases should be handled as they always have been under our judicial system. The responsibility should lie with the generator of the content based on their use of it and their intent. It makes sense to police inappropriate use of the content generated by this technology, but not the technology itself. With the invention of the phone came the prank call, and later the fraudulent call. The solution wasn’t to eliminate telephony. Pushing this technology underground can’t be the solution to ensuring that the media we view is authentic.

So what is the solution? Given that the advancement of this technology is inevitable, what are some ways we can protect ourselves against bad actors leveraging its accessibility and fidelity? The root problem that must be solved is source verification. We trust information from sources that have built trust with us. Reputation is paramount. What has changed as a result of technological progress is the ability to counterfeit information — to make it seem as if it were issued by a trusted party, when in fact it was not.

Surely this problem sounds familiar. And surely the solutions we have already developed are equally familiar. Currency exchange has been particularly susceptible to shifts in technology, yet with each shift we quickly find ways to handle bad actors. For centuries, nations have spent tremendous effort to combat counterfeiting of physical currencies. With the arrival of credit cards came credit card fraud, and entire divisions and agencies for fighting it. With the rise of the internet, and the prospect of exchanging currency digitally, came the entire ecosphere of secure digital transactions, backed by cryptographic methods. Never did we consider regressing to a cash-only society.

The solutions to protecting our money have also been readily used to protect our personal information. When you visit a website, your browser tells you whether the site is secure — that the information you are sending will be encrypted, and that the recipient of the data is actually who it claims to be, as verified by a trusted third party. Future developments in blockchain technology may render the trusted third party unnecessary, but the principle is the same. Cryptographic methods will continue to be the solution to source verification when transmitting information electronically.

Over the next few years, I see no way around moving to a communication model in which we cryptographically sign digital media meant to be sources of truth. If a video claims to be from the White House, it will be cryptographically signed by the White House, and there will be software to verify that. If an audio clip claims to be from Jordan Peterson, it will be signed using Dr. Peterson’s private key. If a media clip claims to be from a trusted source but doesn’t have a valid signature for that source, your media player will tell you that, and you can choose to ignore it accordingly.
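To make this less abstract, here is a minimal sketch of signing and verifying a media file with Ed25519 signatures, using the widely available Python `cryptography` library. It glosses over the hard parts (key distribution, certificate authorities or their blockchain equivalents, embedding signatures in media containers) and is meant only to show that the core primitive already exists and is cheap to use.

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# The publisher (a newsroom, a public figure, the White House) holds the private
# key; the matching public key is published so anyone can verify signatures.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

def sign_media(media_bytes: bytes) -> bytes:
    """Produce a signature to be distributed alongside the media file."""
    return private_key.sign(media_bytes)

def verify_media(media_bytes: bytes, signature: bytes) -> bool:
    """What a media player would run, against the claimed source's public key,
    before labeling a clip as authentic."""
    try:
        public_key.verify(signature, media_bytes)
        return True
    except InvalidSignature:
        return False

clip = b"...bytes of an audio or video file..."
signature = sign_media(clip)
assert verify_media(clip, signature)                     # genuine clip verifies
assert not verify_media(clip + b"tampered", signature)   # any alteration fails
```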

Whatever the details of how we implement these verification systems, the fundamental principle is straightforward and essential: we should treat any media we intend to consider a source of truth the same way we treat our money and personal information. Everything else should be considered entertainment, or require further verification from multiple sources.

In short, the solution to source verification is, at least in part, more technology, not litigation and suppression. We must also learn to be more skeptical of digital audio and video from untrusted sources. This opinion piece from the New York Times (coincidentally posted on the same day I released my site) shares the same sentiment. As Dr. Peterson has emphasized in many of his 12 Rules for Life lectures, the general populace is not dumb. There simply exist adjustment periods during the emergence of new technologies. We are entering one of those now. We will adapt intelligently, doomsayers be damned.

An Emerging Landscape

Assuming we do manage to adapt and avoid a post-truth dystopia, what will the landscape of content creation look like over the next few years? Within the domain of speech synthesis, it will be possible and inexpensive in the next three to five years to generate a perfect clone of someone’s voice from a few minutes or less of their speech. It will also be possible to create new synthetic voices by interpolating between existing voice models, allowing content creators to produce the full gamut of variability in human speech, including accents and intonations, in any language.

The voice acting industry will change dramatically as a result of this. A CGI movie can be made today without the use of human actors, with the exception of dialogue. With the rise of synthetic voices, films and video games will increasingly use software tools to generate the dialogue they need in much the same way that they now use graphics software tools for modeling, texturing, animation, and lighting.

Applications outside the creative industry will make extensive use of this technology as well. Call centers, digital assistants, content readers, and advertisements will all deliver highly personalized content using flawless human speech in a listener’s native language. This will also open large markets in localization — narration and dialogue will be instantaneously translatable into any language using the speaker’s native voice with an appropriate accent. Imagine Scarlett Johansson speaking perfect French, Spanish, German, and Italian in the European releases of her future movies.

To the extent that existing celebrities continue to maintain their personae in digital media, the content they produce will be increasingly machine-generated. Personalized content addressed directly to individual fans will be a staple of those stars who wish to keep up with this changing landscape. Machine learning technology will make this possible at greater scale without requiring any of a celebrity’s time. Those who resist this technology rather than embrace it will likely miss out on opportunities for new revenue streams.

More intriguing applications of voice synthesis technology include preserving cherished voices in today’s media, as well as resurrecting the voices of beloved celebrities that have passed on. Consider the joy of having David Attenborough and Morgan Freeman narrate our documentaries for another 100 years, and listening to future newscasts delivered by Walter Cronkite. Whether these scenarios come to pass remains to be seen, but it is certain that the technology will be available to achieve them in the near future.

The potential applications of this technology to consumers’ personal lives are numerous and thought-provoking as well. As an example, one user of my site who messaged me to compliment me on my work added that he couldn’t imagine how much his mom would be willing to pay to hear personalized messages spoken to her by her deceased husband. The “living portraits” of remembered loved ones as described in the Harry Potter novels will not be relegated to the world of fantasy for much longer. Does this concept tug at your heart strings, or sound like an episode of Black Mirror? Perhaps both? Progress in machine-generated media will raise increasingly many unusual questions like this in the coming years.

Moving Forward

I am currently in the process of starting a company to build next-generation content creation tools for storytellers. Our mission is to empower everyone to tell the best possible versions of their stories by leveraging machine learning to reduce the barrier to entry for creating professional-quality media. The experience I’ve gained building this prototype and witnessing people’s reactions to it has been invaluable. While the business models I am exploring don’t revolve around the use of well-known personalities, I still believe the issues discussed above are critical for us to address intelligently as a society. We must find a way to maintain the protections of free speech and parody while minimizing the potential harm from bad actors.

I wrote this post with the hope of stimulating further discussion about the implications of machine-generated content. I look forward to hearing from others who are thinking deeply about the issues I’ve addressed here, and learning from their perspectives.