It’s official: machines that can see (and drive) are now among us. This week Waymo announced its self-driving cars are on the road in Arizona, and self-driving startup Embark has autonomous trucks transporting goods from Texas to LA. Computer vision has been a dream since serious research began in the 1950s, despite AI pioneer Marvin Minsky trying to jump-start it in 1966 by famously telling a grad student to “connect a camera to a computer, and have it describe what it sees.” Poor guy. Fifty years on, we’re still just scratching the surface.
So why has computer vision been such a tough nut to crack? First, because human vision is one of the most complex processes we’ve ever attempted to comprehend. Second, because beyond the macro steps, we still don’t understand how the body actually does it.
How, then, were we able to “connect a camera to a computer, and have it describe what it sees”? It comes down to solving three key parts of the process we call sight.
When someone throws a frisbee across a field, the process before the catcher can grab it is broadly this: the image of the frisbee passes through their retina, and some preliminary analysis happens before it gets sent along to the brain. The first stop is the visual cortex, which analyzes the image more thoroughly. Once the visual cortex has broken it down, the rest of the cortex compares the image against everything the brain already knows: classifying the objects in the image, judging size and dimension, putting what is happening in context, and so on, until the brain decides on a course of action: raising an arm and catching the frisbee.
From a computer science standpoint, these three stages of sight are problems that increase drastically in difficulty the further along you get. Replicating the eye is hard, replicating the visual cortex is really hard, and replicating the contextual understanding that sits on top of both (i.e., everything the brain already knows) is quite possibly the most complex task humans have ever attempted.
Recreating the eye is where we’ve been most successful. Cameras, sensors, and image processors have not only matched the human eye but exceeded it in many regards. We can see at vastly greater distances and with more clarity than ever thought possible, and even see in the dark or in wavelengths invisible to the human eye. Using ever-larger and more precisely manufactured lenses, combined with subpixels fabricated to nanometer precision, we can record thousands of images per second and see more than ever before.
However, despite the quality and scale of their optics, even the telescopes we use to observe other galaxies can’t tell what they are looking at without help. It is the software behind the lens that does the heavy lifting, and it is the more difficult piece to get right.
So how, then, do developers begin to write software that replicates the visual cortex? The first challenge is to differentiate objects and find patterns in the disorganized noise of an image. Our brains do this with neurons that excite one another when there is contrast along a line or rapid motion in a particular direction. A next layer of networks aggregates these patterns into meta-patterns, and the layering continues as further networks identify colors, textures, motion, and direction. As more information is layered on, a picture begins to form from the mess of complementary descriptions.
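That layered pattern-finding can be sketched in a few lines of code. This is a toy illustration, not any production vision system: the first "layer" responds to local contrast (a crude stand-in for edge-sensitive neurons), and the second aggregates those responses into coarser meta-patterns.

```python
import numpy as np

def detect_edges(image):
    """Layer 1: respond to contrast along horizontal and vertical lines,
    loosely analogous to edge-sensitive neurons."""
    gx = np.abs(np.diff(image, axis=1))  # horizontal contrast
    gy = np.abs(np.diff(image, axis=0))  # vertical contrast
    # pad back to the original size so the layers stack cleanly
    gx = np.pad(gx, ((0, 0), (0, 1)))
    gy = np.pad(gy, ((0, 1), (0, 0)))
    return gx + gy

def pool_patterns(feature_map, size=2):
    """Layer 2: aggregate local responses into coarser 'meta-patterns'
    by keeping the strongest response in each block (max pooling)."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# A tiny synthetic image: dark background with a bright square in the middle.
img = np.zeros((8, 8))
img[2:6, 2:6] = 1.0

edges = detect_edges(img)       # strong response along the square's border
summary = pool_patterns(edges)  # a coarser 4x4 map of where the edges are
```

Real networks stack many more such layers, learning the filters rather than hand-coding them, but the principle of contrast detection followed by aggregation is the same.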
Once you can find lines and distinguish objects, the next question becomes: what is the object? Early computer vision research tackled this with a problem straight out of the Cold War: how can we tell if there is a tank in the woods? Researchers started by describing to the computer what a tank should look like: a tank looks like /this/ and moves like /this/, except when you view it from the side, where it looks more like /this/, or when the turret is rotated, when it looks like /this/, and so on.
For select objects in controlled environments, this brute-force approach worked well. The problem is that for it to work at scale, every object must be observed from every possible angle, with variations in lighting, motion, and every other conceivable factor taken into account. It quickly became clear that the data required to correctly identify even a handful of objects would be impractically large.
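Some back-of-the-envelope arithmetic makes the scale problem concrete. The numbers below are illustrative assumptions, not measurements from any real system, but even these conservative choices blow up quickly:

```python
# Rough, illustrative numbers: how many stored template views would a
# brute-force recognizer need for a single object?
viewpoints = 360 * 90        # 1-degree steps in azimuth and elevation
lighting = 10                # a handful of lighting conditions
occlusion = 5                # a few partially hidden variants
templates_per_object = viewpoints * lighting * occlusion
print(templates_per_object)  # over a million views, for one object
```

Multiply that by the thousands of object categories a useful system must recognize, and the storage and matching costs become hopeless, which is exactly why the field moved on.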
Thankfully, the bottom-up approach our brains use has proven more practical. By applying a series of transformation algorithms (discovering edges, implying objects from those edges, finding perspective and movement across multiple images, and so on), computers can be trained to see things the way our brains do. Advancements in AI and in processing big data have been the key to doing the complex math necessary to pull this off accurately and at scale. The result has brought computer vision light-years ahead of where it was just a few years ago, to the point where computers can “tag” thousands of objects fairly accurately.
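The first of those transformation algorithms, edge discovery, is classically done by sliding a small hand-designed filter over the image. Here is a minimal sketch using the well-known Sobel kernels (the kernel flip of true convolution is omitted for simplicity, so this is technically cross-correlation):

```python
import numpy as np

def filter2d(image, kernel):
    """Minimal 'valid'-mode 2-D filtering: slide the kernel over the
    image and sum the elementwise products at each position."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Sobel kernels: classic hand-designed edge-discovery transforms.
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

img = np.zeros((6, 6))
img[:, 3:] = 1.0               # a vertical boundary between dark and bright

gx = filter2d(img, sobel_x)    # strong response at the vertical edge
gy = filter2d(img, sobel_y)    # near zero: there are no horizontal edges
magnitude = np.hypot(gx, gy)   # combined edge strength at each pixel
```

Modern systems learn thousands of such filters from data instead of designing them by hand, but each learned filter is applied in exactly this sliding-window fashion.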
Image from Purdue’s E-Lab showing examples of objects that look and behave similarly
Now that we have systems that can recognize many varieties and behaviors of objects, from multiple angles and in many situations, we reach the most difficult problem: teaching computers to comprehend what they see. Just because a computer can correctly identify a banana in any situation doesn’t mean it knows what a banana is, whether it is edible, or that it comes from a tropical climate.
To be effective, good hardware and software require an operating system. For people, the rest of the brain acts as that operating system, connecting and making sense of all its individual processes: memory, the other four senses, attention and focus, and the collective lessons of our experiences. All of it is connected in ways we barely understand, encoded in a language we can only attempt to comprehend, and living in a network of neurons more interconnected and frustratingly complex than anything else we’ve tried to uncover (except, perhaps, particle physics and string theory).
This is where the leading edges of computer science, general AI, psychology, neuroscience, and philosophy collide: understanding, on a functional level, how our minds work, and replicating those systems in machines.
Right now those siloed systems are producing incredible advances like self-driving cars, facial recognition, and safe, efficient factory robots. Barron’s estimates that by 2021 the value of computer vision for AI will top $3 billion, growing at a 30% compound annual rate. That creates a lot of incentive to begin tackling the deeper problems of context and intention. There’s still a long way to go, and the most complex problems still lie ahead, but considering the scale of the challenge, it’s incredible we’ve gotten this far.