Application of Image Technology in Live Broadcasting (Part 2) - Image Recognition

In "Application of Image Technology in Live Broadcasting (Part 1)", we briefly described the principles and practical issues of Beauty Technology 1.0. At the beginning of the article, we mentioned the most critical technology of Beauty Technology 2.0 - face recognition. This is a complex but very popular technology. In this article, we will talk about image recognition, its principles and some specific practical issues. This sharing series is compiled from the speech of Tutu CTO at the Architect Salon.

1. A brief analysis of machine learning and deep learning: How to make machines understand the world?

Recently, the concepts of machine learning and deep learning have become very popular, especially this year, when AlphaGo defeated a top Korean Go player and caused a sensation around the world. The two concepts are easy to confuse, so much so that many media reports use the terms interchangeably. Since both are currently applied mainly in the field of images, we will distinguish between them only in terms of image recognition, and face recognition in particular.

The concept of machine learning was proposed relatively early. In the early 1990s, people began to realize that a more efficient way to build pattern recognition algorithms was to replace experts (people with a lot of image knowledge) with data (which can be collected through cheap labor). Deep learning can be regarded as a branch of machine learning, and it has only received widespread attention and development in the past decade.

Let’s talk about the specific differences below.

First of all, machine learning recognizes objects based on pixel features. We collect a large amount of image material, select an algorithm, use this algorithm to parse the data, learn from it, and then make decisions and predictions about events in the real world.

Deep learning can be regarded as a branch of machine learning, and it has only received widespread attention and development in the past decade. It differs from machine learning in that it simulates the way we humans recognize faces. Neuroscientists have found, for example, that when we recognize or observe something, edge-detecting neurons react first; in other words, when we look at an object, we always notice its edges first. After many such observations and experiments, scientists concluded that the core of human visual recognition is the capture of feature hierarchies: we move from simple levels to complex ones, and each transition is a step of abstraction.

Deep learning simulates this way of observing objects. First we gather a large amount of data, for example from the Internet; with this large body of samples we train the network, capture the core features, and build the model up layer by layer. Because deep learning builds a multi-layer neural network, there are necessarily many layers: simple networks may have only four or five, while complex ones, like the Google network mentioned later, have more than a hundred. Some of these layers perform general mathematical operations and some perform image operations, and as the layers go deeper, the features become more and more abstract.
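To make the "edges first" idea concrete, here is a minimal Python sketch (not from the original talk) that applies hand-written Sobel filters to an image; the first layers of a trained convolutional network typically end up learning filters that behave much like these. The random array simply stands in for a real photo.

```python
import numpy as np
from scipy.signal import convolve2d

# Sobel kernels approximate the edge-detecting behaviour that the first
# layers of a convolutional network tend to learn on their own.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float32)
sobel_y = sobel_x.T

def edge_map(gray_image: np.ndarray) -> np.ndarray:
    """Return the gradient magnitude of a grayscale image (H x W array)."""
    gx = convolve2d(gray_image, sobel_x, mode="same", boundary="symm")
    gy = convolve2d(gray_image, sobel_y, mode="same", boundary="symm")
    return np.sqrt(gx ** 2 + gy ** 2)

# Random data standing in for a real photo:
image = np.random.rand(64, 64).astype(np.float32)
print(edge_map(image).shape)  # (64, 64)
```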

Take recognizing a face, for example. If the face appears in a difficult environment, say in fog or partially blocked by trees, it becomes blurry, and machine learning based on pixel-level features cannot recognize it: it is too rigid and too easily disturbed by environmental conditions. Deep learning instead breaks the image down into elements and uses its neurons to "check" each one: the facial features, the typical proportions of a face, and so on. Finally, the neural network gives a considered guess based on these factors and their learned weights, namely how likely the image is to be a face.

Therefore, deep learning outperforms classical machine learning in face recognition and other recognition tasks, and can even exceed human recognition ability. For example, in 2015 Google released the FaceNet network for face recognition, claiming a recognition rate of more than 98%; the accuracy we humans achieve on the same kind of samples is actually lower than that of the most advanced deep learning algorithms.

On the machine learning side, the most popular face detection algorithms are the HOG algorithm and the LBF feature algorithm. The LBF implementation comes from OpenCV, a well-known open source library that contains a wide variety of image processing functions, such as special effects, face recognition, and object recognition. It is mentioned here because it is so widely used, but in our experience it performs poorly on mobile platforms and cannot achieve the results we want.
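As an illustration of this classical, non-deep-learning route, here is a short sketch using OpenCV's bundled Haar cascade detector (a different classical detector than HOG or LBF, chosen only because it ships with the opencv-python package); the image file names are hypothetical.

```python
import cv2

# Load one of OpenCV's bundled cascade models (the cv2.data path is
# available in the opencv-python wheels; adjust if your build differs).
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
detector = cv2.CascadeClassifier(cascade_path)

image = cv2.imread("input.jpg")                 # hypothetical input file
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)  # cascades work on grayscale

# detectMultiScale scans the image at several scales and returns
# (x, y, w, h) rectangles for every face candidate it accepts.
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("output.jpg", image)
```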

There are many open source frameworks for deep learning, such as Caffe and TensorFlow. These frameworks only provide the tools for building deep learning networks; the network itself is the most critical part. How do you build one? There are many ways. If you follow this area, you will often come across terms such as CNN and RNN. CNN is probably the most popular: it performs very well in face recognition and is currently the mainstream choice. Of course there are many other networks, such as RNNs or faster CNN variants, that perform better on certain specific problems.
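For a sense of what "building a multi-layer network" looks like in practice, here is a minimal CNN sketch in TensorFlow/Keras; the layer sizes and the 64x64 grayscale input are arbitrary choices for illustration, not the structure of any network mentioned above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# A deliberately small stack of convolutional layers: early layers pick up
# edges and simple textures, later layers respond to larger, more abstract
# patterns, and the final dense layer outputs a face / not-face score.
model = models.Sequential([
    layers.Input(shape=(64, 64, 1)),
    layers.Conv2D(16, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu", padding="same"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),   # probability the crop is a face
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```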

2. Some specific implementations of image recognition: taking intelligent pornography detection as an example

Once we have the relevant deep learning technology, we can build applications on the server side. Take intelligent pornography detection as an example: we take a video stream as input, decode each frame, identify the problematic regions, and process them, for example by applying a mosaic, or by saving the content and notifying the backend that a frame appears to contain something inappropriate. The frames are then re-encoded and output elsewhere, for example distributed to a CDN. Doing this screening manually is very expensive, so it has to be solved through technical means.
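A rough sketch of such a server-side, per-frame pipeline might look like the following; `looks_problematic` is a placeholder for a real trained classifier, and the mosaic step here coarsens the whole frame rather than just the detected region, purely to keep the example short.

```python
import cv2

def looks_problematic(frame) -> float:
    """Placeholder for the real classifier; returns a probability in [0, 1]."""
    return 0.0  # a trained model (e.g. a CNN) would be called here

def moderate_stream(in_path: str, out_path: str, threshold: float = 0.8):
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if looks_problematic(frame) > threshold:
            # crude "mosaic": shrink then enlarge the frame
            small = cv2.resize(frame, (w // 16, h // 16))
            frame = cv2.resize(small, (w, h), interpolation=cv2.INTER_NEAREST)
            # in a real system, also notify the backend and archive the frame
        writer.write(frame)

    cap.release()
    writer.release()
```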

Let me talk about the experience on the mobile side, using the test figures for face detection performance in Tutu's products. In the tests we ran on the iOS and Android platforms, an iPhone 6 takes 40 milliseconds to capture 40 feature points, which is equivalent to processing 25 frames per second. In practice it is not necessary to run detection that often. Because of persistence of vision, roughly 12 frames per second is the dividing line for the human eye: below 12 frames the picture feels stuttery, but above 12 frames it looks continuous. So we usually limit detection to 17 or 18 runs per second, which is enough on iOS.

On the Android side, performance is indeed worse than on iOS. Whether it is the API encapsulation or the overall hardware integration, the same GPU model may not reach the performance it achieves on iOS when used in an Android device; the iOS platform really is better in every respect. On the Xiaomi 5, a relatively new device, it takes about 60 milliseconds to capture 40 feature points.
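Throttling the detector to 17 or 18 runs per second, as described above, can be done with a small wrapper like the sketch below; `detect_fn` stands for whatever face-detection call the app uses, and the numbers are only illustrative.

```python
import time

class ThrottledDetector:
    """Run the (expensive) detector at most `max_per_second` times and
    reuse the last result for the frames in between."""

    def __init__(self, detect_fn, max_per_second: float = 18.0):
        self.detect_fn = detect_fn
        self.min_interval = 1.0 / max_per_second
        self.last_time = 0.0
        self.last_result = None

    def __call__(self, frame):
        now = time.monotonic()
        if now - self.last_time >= self.min_interval:
            self.last_result = self.detect_fn(frame)  # e.g. ~40 ms on an iPhone 6
            self.last_time = now
        return self.last_result
```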

3. The bottleneck of technology development: *** or hardware

Although deep learning APIs have been introduced on mobile phones, for example in iOS 9, with iOS 10 upgrading them to provide more functionality, we usually develop and train on PCs and only run the finished model on mobile devices. As mentioned earlier, training is a critical part of developing machine learning and deep learning systems.
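Before turning to training itself, here is a sketch of the "train on the PC, run on the phone" handoff. It uses TensorFlow Lite as an example export path (an assumption for illustration, not necessarily the toolchain the speaker's team used), and the file names are hypothetical.

```python
import tensorflow as tf

# Assume `face_model.h5` is a trained tf.keras model saved from the PC training run.
model = tf.keras.models.load_model("face_model.h5")   # hypothetical file name

# Convert to a compact format that can be shipped inside a mobile app.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # enable weight quantization
tflite_bytes = converter.convert()

with open("face_model.tflite", "wb") as f:
    f.write(tflite_bytes)
```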

What does training mean?

For example, I take 10,000 pictures and label all the faces in them. After processing these 10,000 samples, the system builds up "experience" about what the characteristics of a face are; say the problem involves 150 parameters, which gives me a function. After tuning, I get a model; I train and test it again and end up with a better one. Then I take a large set of test data, say another 10,000 samples, and evaluate the model on it. If it performs well, the model and the network behind it are reliable and can be used in practice.
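In code, that train-then-test loop boils down to fitting a model on labelled samples and evaluating it on data it has never seen. The sketch below uses scikit-learn with random stand-in data shaped like the example above (10,000 samples, 150 parameters); a real face model would of course be trained on images.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Stand-in data: 10,000 samples with 150 features each, labelled face / not-face.
X = np.random.rand(10_000, 150)
y = np.random.randint(0, 2, size=10_000)

# Hold out a test set that the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                      # "training": fit parameters to samples

accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {accuracy:.3f}")      # decide whether the model is usable
```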

But the training process is very time-consuming. A single training run may take 20 to 30 hours on a CPU, and that is for simple models; for complex ones, such as the 125-layer neural network released by Google, it may take three or four days. Only after all that time do you know whether the model is any good, and if it doesn't work and you change even a small parameter, it takes another three or four days. So there is really only one solution: upgrade the hardware, for example by replacing the CPU with a GPU for the computation.

Here are some concrete numbers. Some algorithms, for example, need to check each frame in RGB space for inappropriate content. On a GTX 980 Ti this takes less than 20 milliseconds per frame; on an i7 CPU it takes around 800 milliseconds, which is no comparison at all. The problem is that GPUs dedicated to training are very expensive: a 7,000 or 8,000 yuan GPU is not even considered good for model training. Moreover, to avoid wasting time on complex problems, such as training AlphaGo, the only option is to make up for it with a huge amount of hardware, and you can imagine the cost. That is why it is said that only companies with real financial strength can afford to do deep learning.
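If you want to measure the CPU/GPU gap on your own machine, a crude timing sketch like the one below will do; the MobileNetV2 architecture and the input size are arbitrary stand-ins, not the model discussed in the talk.

```python
import time
import tensorflow as tf

# Untrained network used purely to measure per-frame inference time.
model = tf.keras.applications.MobileNetV2(weights=None)
x = tf.random.normal((1, 224, 224, 3))

def ms_per_frame(device: str, runs: int = 50) -> float:
    with tf.device(device):
        model(x)  # warm-up so graph building is not counted
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        return (time.perf_counter() - start) / runs * 1000

print("CPU:", ms_per_frame("/CPU:0"), "ms per frame")
if tf.config.list_physical_devices("GPU"):
    print("GPU:", ms_per_frame("/GPU:0"), "ms per frame")
```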

At present, some major international companies, such as Microsoft, use FPGA solutions for many services, including cloud services; Baidu is developing its own chips built around computing units, and the Chinese Academy of Sciences is doing related research. Deep learning has been constrained by computation all along: computing power lags far behind what the software requires, and the era has turned into a hardware competition. This problem is not new, either. The concept of neural networks existed in the earliest days of artificial intelligence, but their contribution to "intelligence" was minimal at the time, mainly because computing power was insufficient. So we can foresee that once quantum computing becomes practical, the era of artificial intelligence will truly arrive.
