As you may or may not know, and may or may not be interested in, I work at an artificial intelligence startup called OSARO. We use machine learning to develop intelligent software for applications in industrial automation. Our software, when installed on a physical system like a factory robot, learns to perform tasks that have thus far been too difficult to program using traditional software engineering methods. Essentially, we build brains for factory robots that until recently have had very little of them.
What you are less likely to know, and even less likely to be interested in, is what our company name means. This post is for those of you that are intrigued—both of you. OSARO is actually an acronym, each letter of which corresponds to a component of a continuous process of learning and decision-making that nearly every living organism with a nervous system, including you, engages in throughout the entirety of its life. This process is called reinforcement learning, or, more colloquially, trial-and-error learning.
In a nutshell, reinforcement learning consists of trying something out and seeing what happens. If the outcome is good, do it again. If the outcome is bad, try something else. If you encounter a situation similar to one you’ve been in before, do the thing that felt good, not the thing that felt bad. This is a gross oversimplification, but at its heart this principle describes how most living things that move in the world make their decisions. OSARO’s mission is to design software that mimics this process of learning and decision-making to produce robotic systems that exhibit intelligent and robust behavior.
So what does this have to do with OSARO’s name? The learning and decision-making process for systems that employ the reinforcement learning framework is often represented abstractly by a sequence of steps that repeat ad infinitum. OSARO’s name is an abbreviation of the components that comprise each step of that sequence: Observation, State, Action, Reward, Observation. OSARO. What does each of these components signify, and how does each relate to producing intelligent behavior?
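For readers who think in code, the repeating sequence can be sketched as a simple loop. This is a toy illustration of the abstract framework, not OSARO's software; the environment and agent here are deliberately trivial (a number line where the goal is to reach position 3).

```python
# A minimal sketch of the observation-state-action-reward loop.
# All names here are illustrative, not any real system's API.

class Environment:
    """A trivial world: an agent on a number line, goal at position 3."""
    def __init__(self):
        self.position = 0

    def observe(self):
        return self.position               # the raw observation

    def step(self, action):
        self.position += action            # the action changes the world
        reward = 1.0 if self.position == 3 else 0.0
        return self.observe(), reward


class Agent:
    """Holds a (trivial) state and picks actions; a real agent would learn."""
    def update_state(self, observation):
        self.state = observation           # here, state is just the observation

    def act(self):
        return 1                           # always move right; a learner would choose


env, agent = Environment(), Agent()
observation = env.observe()
for _ in range(3):                         # Observation, State, Action, Reward, ...
    agent.update_state(observation)        # O -> S
    action = agent.act()                   # S -> A
    observation, reward = env.step(action) # A -> R, O
```

Each pass through the loop is one step of the O-S-A-R-O sequence; in a living organism the loop never terminates.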
You experience the world around you as a rich array of sensations that begin as signals generated by your sensory organs—eyes, ears, skin, etc. Everything you are aware of (and much you are not aware of) is a summary of the data entering your body from the neurons in these organs. These sensory neurons pass information to your brain about the stimuli acting on them—light on your retina, changes in air pressure on your eardrum, temperature and pressure on your skin, tension in your muscles, etc. We collectively call these raw data observations. Without these data, your brain would have no information to process; there would be nothing to summarize, no world to make sense of. You’d be a brain in a jar. It would be very boring.
Your brain is continuously processing signals from billions of these sensory neurons hundreds of times per second and trying to make sense of it all. This is big data—very high-dimensional and very noisy—far too much to make decisions based on the raw data themselves. The world is too complex and unpredictable at this level of understanding, akin to drinking from a firehose. To make decisions quickly, you need some way of throwing away the irrelevant details, keeping the relevant ones and summarizing them concisely. Your brain must turn these fast-changing, highly-detailed data (e.g., light intensities signaled by the cells in your retina that change every few milliseconds) into slow-changing, high-level data that are easier to reason about (e.g., seeing a fly landing on your sandwich). In short, you need state.
All intelligent entities need some way to concisely represent the complex world they live in. In the same way that we can represent numbers in many different ways (counting beads, hash marks, Roman numerals, decimals), there are innumerable ways to summarize and represent the torrent of data (observations) coming from your sensory organs. Some representations are more useful than others, depending on your objectives. Evolution has selected for representations that can summarize sensory data very quickly in ways that improve an organism’s chances of survival and reproduction. Brains are thus, in part, data compression machines—naturally evolved summarizers.
The complex representations that humans and animals form of the world around them are determined in part by genetics, which determines the physical structure of the brain and its early neuronal connections, and in part by experience, which modifies those connections in response to sensory data. We call the latter modification learning. An infant has a primitive and not especially useful representation of the world because it hasn’t yet experienced enough data to have learned how to summarize them usefully. It must rely only on innate mechanisms of compression that evolved to function right out of the womb, and these are limited in scope. During the first few months of life, more sophisticated representations slowly take form out of the “blooming, buzzing confusion” of data that initially seem to have no coherent structure. While it’s difficult to imagine as an adult what it feels like to be an infant that can’t make sense of what it’s looking at, an example may help illustrate the point.
Take a quick look at the picture below. If you’ve seen it before, bear with me. If you haven’t, don’t look too hard at it just yet. Just glance at it. Do you see anything other than a bunch of black splotches on a white background? If not, that’s great; this example will be even more compelling to you. Before you look more closely, try to hold in your mind what it feels like not to recognize what’s in the picture—how it feels to just see splotches.
Ok. Now look closer. Do you see any common objects you can identify? Some patterns that look roughly like something you’ve seen before? Blurring your vision a little may help. How about the lower part of a big tree in the upper left, casting a large shadow on the ground below it? And what about in the middle of the scene? Do you see a dog, facing away from you? A Dalmatian perhaps? With its head lowered and sniffing the ground? Keep looking and you’ll eventually see it. If not, search “hidden dalmatian” in Google images and you’ll find some versions with the dog highlighted.
Now think about the moment when you recognized the dog. Did it feel like it sort of “snapped” into place? An “aha” moment? What happened there? The image didn’t change. Nothing got uncovered in the image or came into focus on your retina that wasn’t there before you recognized the dog. Your brain was processing the same pattern of information coming in through your retina, the same observations, trying desperately to make sense of those black splotches. Initially, it put forth the hypothesis that it really was just a picture of black splotches with no inherent meaning. But after some time it came up with another hypothesis that made more sense given your prior experiences with dogs and trees, and was still consistent with the data (your observations). You understood what you were looking at. Your state changed.
What’s even more compelling about the hypothesis your brain just came up with is just how strongly it will hold onto it once confirmed. Remember when I asked you to try to remember what it felt like not to see the dog? Try to go back to that mental state. Try to look at the image now and not see the dog. I’m willing to bet you can’t. You can remember that you couldn’t see it before, but you can’t not see it now.
This is a contrived example to make you consciously aware of perceptual changes of state, but the reality is that this is what your brain is doing all of the time. This process is just what it normally feels like to observe the world around you and understand it. You just don’t notice it because your brain is so good at it, and things in the real world are usually not so ambiguous—though indeed this is because your brain has learned to become very good at disambiguating things that you see often.
For the first few weeks of life, newborns likely interpret their surroundings in a manner similar to the way you first interpreted the image above—lots of splotchy patterns moving around their visual field, with no meaning or structure to any of it. Over time, as they are presented with more data from their environment, abstract representations of these data that are consistent with the world’s inherent structure slowly take form and become “obvious”.
The problem the brain is solving when forming these abstract representations is known as representation learning. There are many theories about exactly how the brain does this, but we are far from a comprehensive answer. Because of this lack of understanding, it’s also very difficult to reverse engineer an artificial system that solves this problem, despite massive research efforts in artificial intelligence over the last 70 years. We’re making slow but steady progress, however. Recent advances in a class of approaches based on neural network models, collectively termed “deep learning”, have produced several impressive results in just the past few years. There is still a long road ahead before we’ll be able to design systems that exhibit general intelligence in the way that humans do. In the meantime we can leverage these intermediate solutions to incrementally improve the capabilities and robustness of robotic systems—precisely the approach we take at OSARO.
As interesting as the problem of representation learning is, it’s even more interesting to ask why our brains should even bother to solve it. What’s the point of state? Why bother compressing the data we observe? Why is it important to summarize the complicated light patterns on your retina as “a fly landing on my sandwich”? The answer is provided by our next component of interest.
We wouldn’t bother to try to predict and understand the world around us if we couldn’t do anything with that understanding. We predict, we understand, so that we can act. We act in order to change our environment to improve our situation—that nasty fly on your sandwich must be swatted away. Action is the raison d’être of intelligence.
Consider living things that don’t act—such as plants. We generally don’t consider them to display much intelligence. Although plants have some physiological mechanisms for adapting to their environment minimally over long time scales, they generally can’t move in the sense that we commonly think of movement, and consequently don’t have nervous systems. There’s no reason for them to summarize their surroundings because there is nothing they can do with that information. Consider an even more curious organism, the sea squirt.
Its early life is spent as a free-swimming, tadpole-like larva, as shown in the upper image above. In this form, it has a primitive nervous system connected to a single eye and a long tail. During this phase of its life, it needs to navigate its watery environment searching for food and a place to settle for the second phase of its life. It thus needs to take in sensory information, process it, and do something appropriate (e.g., move toward food). Once it reaches its second phase of life, it attaches itself to a hard surface and becomes a filter feeder, destined never to move again. In this form, as shown in the lower image above, it simply consumes whatever food particles happen to float into its mouth. At this point, it promptly digests its brain and most of its nervous system, since they are now nothing but a drain on precious resources. Waste not, want not.
These examples highlight the fact that all of the work our brain does to learn a representation of our environment is in service of predicting what we will likely observe next—including the consequences of our own actions—so that we can act accordingly to influence our surroundings. The better and further into the future we can predict, the greater the level of control we have over our environment, and the easier it is to improve our situation. What differs between species of animals is the sophistication of those prediction mechanisms and how they are tailored to the body of the organism—the ways that it can sense its environment and affect its surroundings.
Now that we know why our brains try to summarize the world in ways that let us do useful things, it’s natural to ask what we should do at any given moment. Why choose any one action over another? What does it mean to improve our situation? This is where our next component of interest comes in.
There are a functionally infinite number of things you could be doing at any given moment of your life. You have hundreds of muscles you can use to affect the world around you, each of which you can contract and release in an astronomically large number of combinations every fraction of a second. How does your brain decide which of them to choose? We don’t just take random actions at every moment and hope for the best. We take deliberate actions to achieve goals. This is the hallmark of intelligence.
Where do these goals come from? The answer comes in part from specialized regions of the brain that produce reward signals. These signals tell the rest of the brain how “good” or “bad” your current state is. We choose actions that take us to states yielding positive rewards, and away from states yielding negative rewards. Our goals—acquiring food when we’re hungry, running from predators when threatened, etc.—derive from this principle.
But this raises a deeper question. Where do rewards come from? What determines what is good and what is bad? The process of evolution provides the bedrock here. Darwinian natural selection has spent hundreds of millions of years tuning the innate reward-generating circuits in brains to direct organisms’ behavior toward states that increase their likelihood of reproducing. Reproducing of course requires staying alive, and so being well-fed, not thirsty, and safe from predators are highly-rewarding states. Being in the good graces of attractive mates also increases an organism’s likelihood of passing on its genes, hence all of the positive rewards associated with sex.
Given these innate “ground truth” signals provided by evolution about goodness and badness, biological reinforcement learning systems learn, through trial and error, which actions to take so as to maximize the sum of the rewards they expect to receive in the future. This is the primary objective of a reinforcement learning system—accumulate the most reward possible over time by taking appropriate actions.
This is actually quite a difficult problem because rewards don’t often come immediately or reliably after taking a single action in a given state. Often very long sequences of actions must be taken, passing through many states, before any appreciable reward is received. A reinforcement learning system must figure out which of the actions it took along a path through a set of states were responsible for the rewards it received along that path—a problem known as credit assignment. There are well-established mathematical models of how this problem might be solved in biological systems, with evidence from experimental neuroscience to support them. The field of computational reinforcement learning, an area of active research for the past several decades, explores the implementation of these models in artificial systems such as robots.
Since evolution is not the guiding process when designing artificial systems in this paradigm, a critical question arises: where should the rewards in such AI systems come from? That’s the 64-trillion dollar question we must answer in the coming decades. Extracting the maximum economic value from truly intelligent robotic systems will center around solving what is known as the value alignment problem. In short, we must design AI systems such that their reward functions encourage behavior that is beneficial to humans. What “beneficial” means precisely and how to account for complex tradeoffs when these systems must interact with humans are open questions in the field of AI. These issues are too complex to address here, but there are many ideas about how we can begin to tackle this problem. The only thing most researchers agree on right now is that we had better get this right.
At OSARO, our robots perform their tasks in industrial environments with strict safety protocols and little direct human interaction. As such, the reward functions we employ can be fairly simple. In a moment I’ll explain in more detail how OSARO’s software incorporates the framework I’ve been outlining thus far. But first, let’s finish our journey through OSARO’s name, as we have now come to the final component of interest.
I know, I know. We covered this already. But this section is actually about something just as important as observations themselves. Something so important to our reality that Einstein spent much of his life contemplating it: time. Recall that the components I’ve been describing comprise each step in a sequence that spans your entire life; indeed, that sequence is your life. The steps in this sequence occur in your brain hundreds of times per second. Observation, state, action, reward, observation, state, action, reward, observation, state, action, reward… you get the picture. Listing observation again here emphasizes the central role that time plays in reinforcement learning.
To understand the world, we must learn representations that both summarize the history of our observations and allow us to make predictions about future observations, given some sequence of actions we intend to take. We must also correlate these sequences of action with the states and rewards they yield so that we will know how to behave in the future, and these correlations are learned through trial and error. All of these requirements for intelligent behavior are inextricably linked to the passage of time. Reinforcement learning is inherently a process.
Even state representations themselves often require a notion of time. Single observations are generally not sufficient to disambiguate what state you’re in at any given moment. In general we must experience several observations in sequence in order to understand what’s happening around us. Take the concept of object permanence, for example. This is simply the idea that an object in your visual field that moves out of view (e.g., behind another object) doesn’t cease to exist. Rather, your brain integrates the history of observations received when the object was in view, and uses that history to maintain the position of the object in your model of the world even after your observations no longer contain any data relating to it (i.e., when it’s no longer visible). This seems like quite an obvious concept, but it’s something that humans don’t learn until they are around 6 months old. When you think about it, that’s a pretty long time to think that your mom ceases to exist every time she plays peek-a-boo with you.
Environments in which a single observation is sufficient to tell you what state you’re in are said to have what is called the Markov property. This means that you can define your state representation to be your current observation without fear of misclassifying your true state. Reinforcement learning has been successfully applied to environments with the Markov property for decades now, in particular to games like Chess and Go. A fairly recent success was DeepMind’s AlphaZero, which used reinforcement learning to achieve super-human performance in Chess, Go, and Shogi, all in a matter of hours. While this was indeed an impressive result in the reinforcement learning community, the fact that these environments are Markovian certainly helped a lot.
The real world is decidedly non-Markovian. Take a look in front of you. Are your house keys in sight? If not, consider that they might be in your pocket, or in your desk drawer, or under the couch cushions. All of these possibilities might be valid states of your world right now, but they are all consistent with your current observations because your keys are not currently in view. You may remember where your keys are, but that’s because of history—previous observations you experienced that showed your keys to be in your pocket, or desk, or elsewhere. Time is an integral part of how we perceive the world and act in it, and the letters bookending OSARO’s name pay homage to this fact.
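A small numerical example of why history matters: suppose you observe only the height of a tossed ball. One snapshot cannot distinguish a ball on the way up from one on the way down, but a pair of consecutive snapshots can. The physics constants and function below are my own illustration, not anything from the text.

```python
# A non-Markov observation made (approximately) Markov by adding history.
# Observing only position: one snapshot can't distinguish "rising" from
# "falling", but two consecutive snapshots recover the velocity.

def next_position(prev, curr, dt=0.1, g=-9.8):
    """Predict the next position from TWO observations (position history)."""
    velocity = (curr - prev) / dt          # recovered from the history
    return curr + velocity * dt + 0.5 * g * dt * dt

# Two different histories that share the SAME current observation (curr = 1.0):
rising  = next_position(prev=0.5, curr=1.0)   # was lower  -> moving up
falling = next_position(prev=1.5, curr=1.0)   # was higher -> moving down
```

The current observation is identical in both cases, yet the predicted futures differ, so the single observation is not a Markov state; the pair `(prev, curr)` is. This is the same trick your brain plays with object permanence: it folds past observations into its state.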
So now you know what OSARO means, and that each of its letters stands for a critical component of a framework for learning and decision-making in biological and artificial intelligent systems. A big challenge for artificial reinforcement learning systems over the past couple of decades has been to apply them successfully to real-world problems—in particular robotic manipulation tasks. This is a challenge we deal with daily at OSARO. One of OSARO’s key solutions, our piece-picking system, learns to pick objects from a source bin full of consumer products and place them into destination bins that then get packed into boxes and shipped. Check out the clip below for an example of our piece-picking system doing its thing.
How does OSARO apply the reinforcement learning framework to the industrial automation tasks we solve for our customers? We can use our piece-picking solution as an example and take things one component at a time again to illustrate.
Our software runs on various robotic platforms, each of which provides our system with streams of information from which to learn. In the case of piece-picking, every robot cell has one or more cameras that provide the system with visual observations—a view of the source and destination bins from various angles. The robot arm also provides information about the position and orientation of the arm at each moment in time, as well as the amount of force being exerted on the end-effector (the tool attached to the end of the arm that actually does the picking). We often use suction-based end-effectors (as in the clip above), which additionally provide an observation signal indicating degree of air flow. This can be used to infer whether the end-effector is currently holding an object or not. All of these observation signals generate quite a lot of data for our algorithms to process. Deciding where and how to pick or place an object requires aggregating them into a compact, useful state representation.
Our software leverages state-of-the-art machine learning techniques to go from streams of observation data to meaningful, stable descriptors of the system’s state. For instance, images of the pick and place bins are passed through neural networks that learn to identify where the source and destination bins are in 3-dimensional space so that the system knows where it can move the arm to pick and place objects without colliding with the bins. Other neural networks are trained on images of the contents of the bins to learn things like how full the bins are, where the relevant objects that may need to be picked are located, and exactly where on those objects would be the most effective place for the robot to attempt a pick. These components of the state representation are aggregated together and used by the system to make decisions about where and how to pick and place objects safely and efficiently.
The system has many actions to choose from for any given pick request. These include selecting which object in the bin to grasp, which end-effector tool will be the most effective, the angle at which to pick up the object, and how close to get to the object during approach before slowing down to avoid damaging it. In other tasks, where precise placement position and orientation are also required, the system has even more actions to choose from during the placement motion, such as where to place the object and at what orientation, what height to release it from, and how fast to move when carrying it. The number of possible actions is too large to program a simple set of rules for selecting among them—hence the need to learn appropriate behavior via trial and error.
We guide the system to learn to behave as we would like it to by defining its reward function. The system receives positive rewards for successful picks, which occur when the target object is picked up and placed successfully in the destination bin without being damaged. Negative rewards are given for failures like unsuccessful pick attempts, dropping an object outside of a bin, picking up more than one object, or damaging an object. The system also receives more reward for doing its job faster, and is thus incentivized to pick as fast as possible without damaging or dropping objects.
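A reward function of this kind can be sketched as a simple scoring rule. The event names and weights below are hypothetical, chosen only to illustrate the structure described above; OSARO's actual reward functions are not published here.

```python
# A hypothetical piece-picking reward function in the spirit described above.
# Event names and weights are illustrative assumptions, not OSARO's values.

def pick_reward(success, dropped_outside_bin, multi_pick, damaged, seconds):
    """Scalar reward for one pick attempt."""
    reward = 0.0
    if success:
        reward += 1.0             # object placed intact in the destination bin
    else:
        reward -= 0.5             # failed pick attempt
    if dropped_outside_bin:
        reward -= 1.0
    if multi_pick:
        reward -= 0.5             # picked up more than one object
    if damaged:
        reward -= 2.0             # damaging product is the costliest failure
    reward -= 0.05 * seconds      # time penalty: faster picks earn more
    return reward

# A fast, clean pick beats a slow one; damage outweighs any speed gain.
fast_clean = pick_reward(True, False, False, False, seconds=2.0)
slow_clean = pick_reward(True, False, False, False, seconds=6.0)
fast_rough = pick_reward(True, False, False, True,  seconds=1.0)
```

The relative weights encode the tradeoffs: the time penalty makes speed worthwhile, but the damage penalty is large enough that reckless speed can never pay.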
Recall that learning and decision-making in our system is an ongoing process, unfolding over time. Each attempt at a pick provides a new opportunity for the system to learn about the effects of its actions. When it succeeds, it adjusts its behavior to perform similarly in similar situations. When it fails, it tries something different the next time it’s in a similar situation. The system initially fails often, but improves its speed and accuracy as it continues to attempt picks and stumbles across good solutions that it remembers for the future. This is the primary advantage of a learning-based system—as it performs its duties it continues to adapt and improve.
Hopefully this outline gives you a clear sense of how OSARO applies the reinforcement learning paradigm to real-world robotic systems. OSARO has deployed multiple instances of our piece-picking solution around the world, all of which are continuously collecting new data and improving the system as a whole. We are also actively working on adapting our software to solve more complex automation problems, including several challenging industrial assembly tasks, such as automotive assembly.
It’s an exciting time to be working on reinforcement learning systems that solve real problems facing many manufacturers and distributors around the world. These systems can be trained to perform repetitive yet difficult manipulation tasks efficiently and inexpensively. A growing number of countries are encountering shortages of reliable human labor for such tasks, threatening critical supply chains and economic stability. The recent devastating disruption to the global supply chain caused by the COVID-19 pandemic highlights the immense potential for intelligent robotic systems to bolster the stability of logistics infrastructure around the world, and consequently the global economy. OSARO is proud to be at the forefront of advancing the capabilities of these systems to help realize this potential to improve the standard of living for billions of people.