Do computers have vision? Let computers "see" the world

1. The Birth of Vision

For billions of years after its birth, life on Earth changed very little. It simply "lay flat" at the bottom of the primitive ocean, unable to move on its own, hunt, or forage.

It was not until about 500 million years ago that evolution suddenly exploded. Over the following tens of millions of years, life experimented with a wide variety of body structures, covering almost all the major types of organisms alive today, and developed complex behaviors such as hunting, seeking light, and avoiding harm.

There are many proposed causes for this Cambrian explosion, but one of the most important is the emergence of vision. Vision gave organisms a leap in their ability to adapt to the environment, and it has remained the most important sense ever since.

At first glance, vision seems to be a function of the eyes, because we always use our eyes to see. In fact, the eyes are only sensory organs that passively receive light from the outside world. That information must be decoded in complex ways before the brain can understand it, letting us know what is happening around us and how we should react. The brain, therefore, is actually the most important visual organ.

It is not difficult for a computer to simulate the "eyes": a camera does that easily. What is very difficult is truly understanding visual information the way the brain's visual areas do.

When we humans are young, we only need to see a few cats to grasp their visual characteristics, and the next time we see an unfamiliar cat we recognize it at a glance. But it is hard to translate those characteristics into a form a computer can use. To us, different pictures of cats are obviously all cats; to a computer, their raw pixel values may have almost nothing in common.

That is why traditional vision algorithms, despite encoding large numbers of hand-crafted rules and extracting all kinds of image features, never really understood the content of images. They struggled even with tasks that are easy for humans, such as telling whether the animal in a picture is a cat or a dog.

2. The Power of Neural Network Algorithms

To measure how accurately algorithms could classify images, Fei-Fei Li, a computer scientist then teaching at Princeton University, built the huge image dataset ImageNet; the recognition challenge based on it, launched in 2010, covered a thousand object categories. That year, the most advanced algorithms could correctly identify only about 72% of the images.

But the emergence of deep learning changed everything. In 2012, Geoffrey Hinton of the University of Toronto and two of his students published the neural network AlexNet. This network immediately made a huge breakthrough on ImageNet, raising the accuracy to more than 84%.

A few years later, Hinton won the Turing Award, and another author of the paper, Ilya Sutskever, became a member of the founding team of OpenAI, but that’s another story.

How does a neural network recognize an image? Consider a simple example: recognizing a handwritten digit in a 28×28 image. We can stretch the image's pixels into a sequence of 784 numbers and feed that sequence to the neural network as input. The network's output layer has 10 neurons, one for each of the digits 0 through 9.
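
To make this concrete, here is a minimal sketch in Python (using NumPy) of such a network's forward pass. The layer sizes, the random weights, and the random "image" are all illustrative assumptions, not a real trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# A stand-in 28x28 grayscale image, stretched into 784 numbers.
image = rng.random((28, 28))
x = image.reshape(784)

# One hidden layer of 100 neurons, then 10 outputs (one per digit).
W1 = rng.standard_normal((784, 100)) * 0.01
b1 = np.zeros(100)
W2 = rng.standard_normal((100, 10)) * 0.01
b2 = np.zeros(10)

h = np.maximum(0, x @ W1 + b1)                  # hidden layer with ReLU
logits = h @ W2 + b2                            # one score per digit 0-9
probs = np.exp(logits) / np.exp(logits).sum()   # softmax: scores -> probabilities
print("predicted digit:", probs.argmax())
```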

At first, when an image is fed in, the output is essentially random. But if we train the network on a large amount of labeled data, letting it adjust its parameters according to the correct answers and continuously feeding the errors back, the network gradually learns to recognize digits correctly.
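
As a sketch of what that training loop might look like, here is a minimal example using PyTorch. The random tensors stand in for real labeled data (such as scanned digits), so what is learned here is meaningless; the point is only the predict, compare, adjust cycle.

```python
import torch
from torch import nn

# Same shape of network as above: 784 inputs -> 100 hidden -> 10 outputs.
model = nn.Sequential(nn.Linear(784, 100), nn.ReLU(), nn.Linear(100, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

images = torch.rand(64, 784)           # stand-in for real training images
labels = torch.randint(0, 10, (64,))   # stand-in for the correct digits

for step in range(100):
    logits = model(images)             # current predictions
    loss = loss_fn(logits, labels)     # how wrong they are
    optimizer.zero_grad()
    loss.backward()                    # feedback: gradient of the error
    optimizer.step()                   # nudge parameters to reduce the error
```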

But there are problems with this simple neural network.

3. The Emergence of New Problems

The first problem is that it has too many parameters. Even with only 100 neurons in a single middle layer between input and output, there are 784×100 + 100×10 = 79,400 connections. The images we want to process are usually much larger than 28×28 pixels, so the model ends up with far too many parameters and becomes difficult to train. The second problem is that flattening the image destroys the spatial arrangement of the pixels, which is not how humans look at pictures.
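
A quick back-of-the-envelope check of this blow-up (the 1000×1000 input size is just an illustrative assumption):

```python
# Connections in a fully connected net (bias terms ignored, as in the text).
hidden, classes = 100, 10
small = 28 * 28 * hidden + hidden * classes      # 79,400 for a 28x28 input
large = 1000 * 1000 * hidden + hidden * classes  # 100,001,000 for 1000x1000
print(small, large)
```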

How can these two problems be solved? Researchers observed two characteristics of images.

First, to identify an object in a picture, it is not necessary to scan every pixel; it is enough to check whether certain important features appear in key regions. For example, if we see a patch of black-and-white striped hide, we can often conclude directly that the animal in the picture is a zebra.

Second, the location of the feature in the image is not critical. A cat is a cat no matter where it appears in the photo.

Therefore, researchers no longer shuffle the pixels. Instead, they use a tool like a small window that slides across the image, capturing local features at different locations. Because each small window reuses the same set of parameters wherever it slides, the number of parameters drops sharply while every region of the image is still covered. Neural networks built from such "small windows" are called convolutional neural networks; AlexNet is in fact a simple convolutional neural network.
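
Here is a minimal NumPy sketch of this sliding-window (convolution) operation. The 3×3 edge-detecting kernel is just one illustrative choice of the shared parameters; a real network learns these values during training.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a small window (kernel) over the image, computing a weighted
    sum at each position. The same few weights are reused everywhere."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.default_rng(0).random((28, 28))
edge_kernel = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])  # responds to vertical edges
features = convolve2d(image, edge_kernel)
print(features.shape)  # (26, 26): one response per window position
```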

Subsequently, neural network technology was continuously refined: the number of neurons and network layers kept growing, and performance kept improving. Within a few years, accuracy on ImageNet exceeded 97%, at least approaching human level on this dataset.

However, image classification is only one of many computer vision tasks. Object detection is harder: it requires not only identifying the objects in an image but also marking where they are, and an image often contains more than one kind of object.

Object detection is widely used in autonomous driving, because a self-driving system must recognize many different kinds of objects: other cars, pedestrians, traffic lights, road signs, and so on.
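
As a sketch of what such a detector produces, the example below loads a pretrained Faster R-CNN from torchvision (one common off-the-shelf detector, not necessarily what any particular car uses). The `weights="DEFAULT"` argument assumes torchvision 0.13 or newer, and the random tensor stands in for a real road photo.

```python
import torch
from torchvision.models import detection

# Pretrained detector; the first call downloads the weights.
model = detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)  # stand-in for a real photo (C, H, W)
with torch.no_grad():
    (result,) = model([image])   # one result dict per input image

# Each detection comes with a bounding box, a class label, and a score.
for box, label, score in zip(result["boxes"], result["labels"], result["scores"]):
    if score > 0.5:
        print(label.item(), box.tolist(), round(score.item(), 2))
```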

In addition, we also need models that understand data from different “modalities” and combine them together. For example, a model that combines text and images can generate images based on text.

In addition to processing existing images, we also want machines to generate new images and videos. Institutions such as OpenAI, Google, and Baidu already have relatively mature image-generation tools, but video generation is still comparatively primitive, with much room for improvement.

There is also an open question in computer vision: can a general visual model be built, the way GPT-4 or ChatGPT serve as general language models? After all, visual understanding is an integral part of intelligence, and a large language model without visual capabilities cannot convince everyone that it embodies the whole of intelligence.

This article was produced by the Science Popularization China Starry Sky Project (Creation and Cultivation). Please credit the source when reprinting.

Author: Guan Xinyu, popular science author

Reviewer: Yu Yang, Head of Tencent Xuanwu Lab
