What is machine perception? How artificial intelligence (AI) perceives the world
Table of contents
- Types of machine perception
- Which human senses can machines mimic well?
- Is machine perception hard?
- How do the major AI companies handle machine perception?
- How are startups and challengers approaching machine perception?
- What can’t machine perception do?
Machine perception is the capability of a computer to take in and process sensory information in a way that’s similar to how humans perceive the world. It may rely on sensors that mimic common human senses — sight, sound, touch, smell, taste — and it can also take in information in ways that humans cannot.
Sensing and processing information by a machine generally requires specialized hardware and software. Taking in raw data and then converting or translating it into something like the broad scan and selective focus by which humans (and animals) perceive their world is a multistep process.
Perception is also the first stage in many of the artificial intelligence (AI) sensory models. The algorithms convert the data gathered from the world into a raw model of what is being perceived. The next stage is building a larger understanding of the perceived world, a stage sometimes called cognition. After that comes strategizing and choosing how to act.
In some cases, the goal is not to make the machines think exactly like humans but just to think in similar ways. Many algorithms for medical diagnosis may provide better answers than humans because the computers have access to more precise images or data than humans can perceive. The goal is not to teach the AI algorithms to think exactly like the humans do, but to render useful insights into the disease that can help human doctors and nurses. That is to say, it is OK and sometimes even preferable for the machine to perceive differently than humans do.
Types of machine perception
Here are some types of machine perception, in varying stages of development:
- Machine or computer vision via optical camera
- Machine hearing (computer audition) via microphone
- Machine touch via tactile sensor
- Machine smell (olfactory) via electronic nose
- Machine taste via electronic tongue
- 3D imaging or scanning via LiDAR sensor or scanner
- Motion detection via accelerometer, gyroscope, magnetometer or fusion sensor
- Thermal imaging or object detection via infrared scanner
In theory, any direct, computer-based gleaning of information from the world is machine perception.
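The fusion sensor mentioned in the list above combines the readings of several motion sensors. One common approach, sketched here with made-up readings (the function name, sample values and blending weight are illustrative, not from any particular product), is a complementary filter that blends a gyroscope's fast but drift-prone angle estimate with an accelerometer's noisy but drift-free one:

```python
# Hypothetical complementary filter: fuse the gyroscope's integrated rate
# (responsive but drifts over time) with the accelerometer's tilt reading
# (noisy but anchored to gravity).
def complementary_filter(angle, gyro_rate, accel_angle, dt, alpha=0.98):
    """Blend the gyro-integrated angle with the accelerometer angle."""
    return alpha * (angle + gyro_rate * dt) + (1 - alpha) * accel_angle

# Simulated readings: the device is held steady at 10 degrees of tilt,
# so the gyro reports no rotation and the accelerometer reports 10.
angle = 0.0
for _ in range(200):
    angle = complementary_filter(angle, gyro_rate=0.0, accel_angle=10.0, dt=0.01)
print(round(angle, 1))  # the estimate converges toward 10 degrees
```

The weight `alpha` controls the trade-off: closer to 1 trusts the gyro's smooth short-term motion, while the small remainder slowly pulls the estimate back toward the accelerometer's absolute reference.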
Many of the areas usually considered challenges to developing good machine perception are those where humans do well, but that aren’t easy to encode as simple rules. For example, human handwriting often varies from word to word. Humans can discern a pattern, but it is harder to teach a computer to recognize the letters accurately because there are so many small variations.
Even understanding printed text can be a challenge, because of the different fonts and subtle variations in printing. Optical character recognition requires programming the computer to think about larger questions, like the basic shape of the letter, and adapt if the font stretches some of the aspects.
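One way to see why shape-based matching tolerates small variations is a toy recognizer that compares an unknown glyph against reference letter bitmaps and picks the closest one, so a few stretched or missing pixels don't change the answer. This is an illustrative sketch, not how production OCR engines work; the 5x5 bitmaps and letter set are invented for the example:

```python
# Toy character recognizer: match a 5x5 glyph bitmap against reference
# letter shapes by counting how many pixels differ, keeping the closest.
REFERENCE = {
    "I": "..#.." "..#.." "..#.." "..#.." "..#..",
    "L": "#...." "#...." "#...." "#...." "#####",
    "T": "#####" "..#.." "..#.." "..#.." "..#..",
}

def recognize(glyph: str) -> str:
    """Return the reference letter whose bitmap differs in the fewest pixels."""
    return min(REFERENCE, key=lambda letter: sum(
        a != b for a, b in zip(REFERENCE[letter], glyph)))

# A slightly distorted "T" (one corner pixel missing) still matches "T",
# because it differs from the reference by only one pixel.
distorted_t = "####." "..#.." "..#.." "..#.." "..#.."
print(recognize(distorted_t))  # -> T
```

Real fonts vary in far more dimensions than a 5x5 grid, which is why modern systems learn these comparisons statistically rather than from hand-built templates.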
Some researchers in machine perception want to build attachments to the computer that can really begin to duplicate the way humans sense the world. Some are building electronic noses and tongues that try to mimic or even duplicate the chemical reactions that are interpreted by the human brain.
In some cases, electronics offer better sensing than the equivalent human organs do. Many microphones can sense sound frequencies far outside the human range. They can also pick up sounds too soft for humans to detect. Still, the goal is to understand how to make the computer sense the world as a human does.
Some machine perception scientists focus on trying to simulate how humans are able to lock on to specific sounds. For example, the human brain is often able to track particular conversations in a noisy environment. Filtering out background noise is a challenge for computers because it requires identifying the salient features out of a sea of cacophony.
Which human senses can machines mimic well?
Computers rely upon many different sensors to let them connect with the world, but they all behave differently from the human organs that sense the same things. Some are more accurate and can capture more information about the environment with greater precision. Others aren’t as accurate.
Machine vision may be the most powerful sense, thanks to sophisticated cameras and optical lenses that can gather more light. While many of these cameras are deliberately tuned to duplicate the way the human eye responds to color, special cameras can pick up a wider range of colors, including some that the human eye can’t see. Infrared sensors, for example, are often used to search for heat leaks in houses.
The cameras are also more sensitive to subtle changes in the intensity of light, so it’s possible for computers to perceive slight changes better than humans. For example, cameras can pick up the subtle flush that comes with blood rushing through facial capillaries and thus track a person’s heartbeat.
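The heartbeat-tracking idea can be reduced to a sketch: the average redness of a face patch rises and falls with each pulse, so counting those peaks over a stretch of video yields beats per minute. The frame values below are simulated (a clean sine at 72 bpm standing in for the measured redness), so this only illustrates the counting step, not the harder image-processing work:

```python
import math

# Simulated camera-based pulse detection: average redness of a face patch,
# one value per frame, oscillating at 72 beats per minute at 30 frames/sec.
fps, seconds, bpm_true = 30, 10, 72
frames = [math.sin(2 * math.pi * (bpm_true / 60) * t / fps)
          for t in range(fps * seconds)]

# Count the local maxima (one per heartbeat) and scale to a per-minute rate.
peaks = sum(1 for i in range(1, len(frames) - 1)
            if frames[i - 1] < frames[i] > frames[i + 1])
print(peaks * 60 // seconds)  # -> 72
```

With real video the signal is far noisier, so practical systems filter the frame values first, but the final step is this same peak counting.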
Sound is often the next most successful type of machine perception. The microphones are small and often more sensitive than human ears, especially older human ears. They can detect frequencies well outside the human range, allowing computers to hear events and track sounds that humans literally can’t.
Microphones can also be placed in arrays, with the computer tracking multiple microphones simultaneously, allowing it to estimate the location of the source more efficiently than humans can. Arrays with three or more microphones can provide better estimates than humans who have only two ears.
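The location estimate comes from timing: a sound arriving at an angle reaches the far microphone slightly later, and that delay pins down the direction. A minimal two-microphone sketch (the 0.2 m spacing and the sample delay are assumed values for illustration):

```python
import math

# Direction-of-arrival from a two-microphone array: a sound at angle theta
# from broadside reaches the far microphone later by d * sin(theta) / c.
SPEED_OF_SOUND = 343.0  # m/s in air at room temperature
MIC_SPACING = 0.2       # meters between the two microphones (assumed)

def arrival_angle(delay_s):
    """Angle from broadside, in degrees, implied by the inter-mic delay."""
    return math.degrees(math.asin(SPEED_OF_SOUND * delay_s / MIC_SPACING))

# A delay of 0.292 ms corresponds to a source about 30 degrees off-center.
print(round(arrival_angle(0.000292)))  # -> 30
```

Two microphones leave a front-back ambiguity, which is one reason arrays of three or more can outperform a pair of human ears.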
Computers can sense touch, but usually only in special circumstances. The touchscreens or touchpads on phones and laptops can be very precise. They can detect multiple fingers and small movements. Developers have also worked to allow these sensors to detect differences in the length of a touch, so that actions like a long touch or a short tap can have different meanings.
Smell and taste are less commonly tackled by machine perception developers. There are few sensors that attempt to mimic these human senses, perhaps because these senses are based on such complex chemistry. In some labs, though, researchers have been able to break down the processes into enough small steps that some artificial intelligence algorithms can begin to smell or taste.
Is machine perception hard?
Artificial intelligence scientists learned quickly that some of the simplest tasks for humans can be maddeningly difficult for computers to learn to do. For example, looking at a room and searching for a place to sit down happens automatically for most of us. It’s still a difficult task for robots.
In the 1980s, Hans Moravec described the paradox this way: “It is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility.”
Some of this is because humans don’t notice how hard their brain is working to interpret its senses. Brain scientists often estimate that more than half of the brain works to understand what our eyes are gazing upon. We tend to see things without consciously deciding to look for them, at least in normal lighting. It’s only in the dark or in fog that humans search for visual clues about objects and where they might be.
Machine vision is just one area of machine perception, and scientists continue to struggle to duplicate even the simplest human tasks. When the algorithms work, they return answers that are straightforward, largely numeric and often lacking context or interpretation. The sensors may be able to spot a red object at a particular location, but identifying it or even determining whether it’s a part of another object is hard.
How do the major AI companies handle machine perception?
The major companies selling artificial intelligence algorithms all supply a variety of tools for sensing and processing types of human perception, from sight to language. They are most often differentiated by the software algorithms that process, analyze and present sensory findings and predictions. They offer raw tools for enterprises that want to work from a foundation, as well as domain-specific tools that tackle particular problems such as searching a video feed for anomalous actions or conversing with customers.
IBM has been a leader in improving its algorithms’ ability to see the world as humans do. Its Watson AI system, for example, begins with a sophisticated layer of natural language processing (NLP) that gives it a conversational interface. Clients can use IBM’s Watson Studio to analyze questions, propose hypothetical answers and then search through the evidence corpus for correct answers. The version that won games of Jeopardy against human champions is a good example of well-socialized algorithms that can interact with humans because they perceive words, more or less, as humans do.
Amazon offers a wide range of products and services, beginning with basic tools and also including specialized tools. Amazon Comprehend, for example, extracts information from natural language. A specialized version, Amazon Comprehend Medical, is focused on delivering the kind of automated analysis and coding needed by hospitals and doctors’ offices. Amazon HealthLake is a data storage product that folds in artificial intelligence routines to extract meaning and make predictions from the stored data.
Google offers a number of cloud products for basic and focused problem-solving. It has also been quietly adding better algorithms for machine perception to its standard products, making them more useful and often intuitive. Google Drive, for example, will quietly apply optical character recognition algorithms to read text in email or stored files. This lets users search successfully for words that may only be in an image or a meme. Google Photos will use higher-level classification algorithms to make it possible to search for images based upon their content.
Microsoft offers a wide variety of services to help clients build more perceptive tools. Azure Percept provides a collection of prebuilt AI models that can be customized and deployed with a simple Studio IDE. These edge products are designed to integrate both software and customized hardware in one product. Microsoft’s development tools are focused on understanding natural language as well as video and audio feeds that may be collected by internet of things (IoT) devices.
Meta also uses a variety of NLP algorithms to improve its basic product, its social network. The company is also starting to explore the metaverse and actively using natural language interfaces and machine vision algorithms to help users create and use the metaverse. For example, users want to decorate their personal spaces, and good AI interfaces make it simpler for people to create and explore different designs.
How are startups and challengers approaching machine perception?
A number of companies, startups as well as established challengers, are working to make their models perform as humans do.
One area where this is of great interest is autonomous transportation. When AIs are going to share the road with human drivers and pedestrians, the AIs will need to understand the world as humans do. Waymo, Pony AI, Aeye, Cruise Automation and Argo are a few of the well-funded startups and challengers already operating cars on the streets of some cities. They are integrating well-engineered AIs that can catalog and avoid obstacles on the road.
Some startups are more focused on building just the software that tracks objects and potential barriers for autonomous motion. Companies like aiMotive, StradVision, Phantom AI and CalmCar are just a few examples of companies that are creating “perception stacks” that manage all the information coming from a variety of sensors.
These systems are often better than humans in a variety of ways. Sometimes they rely on a collection of cameras that can see simultaneously in 360 degrees around the vehicle. In other cases, they use special controlled lighting, like lasers, to extract even more precise data about the location of objects.
Understanding words and going beyond basic keyword searching is a challenge some startups are tackling. Blackbird.ai, Basis Technology and Narrative Science (now part of Tableau) are good examples of companies that want to understand the intent of the human who is crafting the text. They talk about going beyond simply identifying the keywords, to detecting narratives.
Some are searching for a predictive way to anticipate what humans may be planning to do by looking for visual clues. Humanising Autonomy wants to reduce liability and eliminate crashes by constructing a predictive model of humans from a video feed.
Some companies are focused on solving particular practical problems. AMP Robotics, for instance, is building sorting machines that can separate recyclable materials out of waste streams. These machines use machine vision and learning algorithms to do what humans do in the sorting process.
Some are simply using AI to enhance humans’ experience by understanding what humans perceive. Pensa Systems, for example, uses video cameras to examine store shelves and look for poor displays. This “shelf intelligence” aims to improve visibility and placement to make it easier for customers to find what they want.
What can’t machine perception do?
Computers think differently from humans. They are especially adept at simple arithmetic calculations and remembering large collections of numbers or letters. But finding a set of algorithms that allow them to see, hear or feel the world around them as humans do is more challenging.
The level of success varies. Some tasks, like spotting objects in an image and distinguishing among them, are surprisingly complex and difficult. The algorithms that machine vision scientists have created can work, but they are still fragile and make mistakes that a toddler would avoid.
Much of this is because we don’t have solid, logical models of how we apprehend the world. The definition of an item like a chair is obvious to humans, but asking a computer to distinguish between a stool and a low table is a challenge.
The most successful algorithms are often largely statistical. The machine learning systems gather a great deal of data and then compute elaborate, adaptive statistical models that generate the right answer some of the time. These machine learning algorithms and neural networks are the basis for many of the classification algorithms that can recognize objects in an image.
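A minimal example in the statistical spirit described above is a nearest-centroid classifier: learn the average of each labeled group of feature vectors, then assign new points to the closest average. The toy 2-D "features" and labels below are invented for illustration; real vision systems learn far richer features, but the adaptive, data-driven principle is the same:

```python
# Minimal statistical classifier: compute the centroid (average point) of
# each labeled group, then label new points by the nearest centroid.
samples = {
    "chair": [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2)],
    "table": [(4.0, 0.5), (4.2, 0.7), (3.8, 0.3)],
}

centroids = {label: (sum(x for x, _ in pts) / len(pts),
                     sum(y for _, y in pts) / len(pts))
             for label, pts in samples.items()}

def classify(point):
    """Return the label whose centroid is closest to the point."""
    return min(centroids, key=lambda label:
               (point[0] - centroids[label][0]) ** 2 +
               (point[1] - centroids[label][1]) ** 2)

print(classify((1.1, 1.9)))  # -> chair
```

The model never defines what a chair *is*; it only summarizes where past examples fell, which is exactly why such systems answer correctly most of the time yet fail without warning on inputs far from their training data.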
For all their success, these statistical mechanisms are just approximations. They’re more like parlor tricks. They approximate how humans think, but they don’t actually think in the same way. That makes it quite difficult to predict when they’ll fail.
In general, machine perception algorithms are useful, but they will make mistakes and produce incorrect results at unpredictable moments. Much of this is because we don’t understand human perception very well. We have some good logical building blocks from physics and psychology, but they are just the beginning. We don’t really know how humans perceive the world and so we make do with the statistical models for now.
Sometimes it’s best to focus more on what machines do better. Many of the cameras and image sensors, for instance, can detect light in wavelengths that can’t be seen by the human eye. The Webb Space Telescope, for example, operates entirely with infrared light. The images we see are modified by the computer to appear in colors in the visible range. Instead of building something that duplicated what human perception could do, these scientists created a telescope that extended the human range to see things that couldn’t otherwise be seen.