Difficulty of deep learning: The deeper the neural network, the harder the optimization problem

At its core, deep learning poses a very difficult optimization problem. In the decades after neural networks were first introduced, the difficulty of optimizing deep networks was a major reason they failed to go mainstream, and it contributed to their decline from the 1990s through the early 2000s. Today the problem has essentially been solved in practice. In this post I explore what makes optimizing neural networks "difficult" and how theory explains it. In short: the deeper the neural network, the harder the optimization problem.

The simplest neural network is a single-node perceptron, whose optimization problem is convex. The nice property of a convex problem is that every local minimum is also a global minimum. A variety of algorithms exist for solving convex optimization problems, and every few years better polynomial-time algorithms for convex optimization are discovered. With such algorithms it is easy to optimize the weights of a single neuron (see the figure below). Let's see what happens as we grow beyond a single neuron.

Figure 1. Left: a convex function. Right: a non-convex function. Finding the bottom of the surface is much easier for a convex function than for a non-convex one. (Source: Reza Zadeh)
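To make the convex case concrete, here is a minimal sketch (the toy data, sigmoid activation, cross-entropy loss, and learning rate are illustrative assumptions, not taken from the article) of optimizing a single neuron with plain gradient descent; because the loss is convex in the weights, the minimum it settles into is the global one:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # toy inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float)     # linearly separable labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, lr = np.zeros(2), 0.0, 0.5
for step in range(500):
    p = sigmoid(X @ w + b)                    # single-neuron output
    grad_w = X.T @ (p - y) / len(y)           # gradient of the convex cross-entropy loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

p = sigmoid(X @ w + b)
loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
print(f"final loss: {loss:.4f}, accuracy: {np.mean((p > 0.5) == y):.2f}")
```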

The natural next step is to add more neurons while keeping the network at a single layer. For a single-layer network of n perceptron nodes, if there exist edge weights under which the network correctly classifies the training set, then such weights can be found in time polynomial in n using linear programming, which is itself a special case of convex optimization. This raises a question: can we make similar guarantees for deeper networks with multiple layers? Unfortunately, no.
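That linear-programming formulation can be sketched directly (a rough illustration on made-up toy data, not code from the article): finding weights that classify a linearly separable training set is a linear feasibility problem, which scipy.optimize.linprog can solve.

```python
import numpy as np
from scipy.optimize import linprog

# Toy linearly separable data with labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
n, d = X.shape

# Feasibility problem: find (w, b) with y_i * (w . x_i + b) >= 1 for every i.
# Rewritten as -y_i * (w . x_i + b) <= -1 to match linprog's A_ub @ z <= b_ub form.
Xb = np.hstack([X, np.ones((n, 1))])              # append a column for the bias term
A_ub = -(y[:, None] * Xb)
b_ub = -np.ones(n)
c = np.zeros(d + 1)                               # zero objective: any feasible point will do

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + 1))
if res.success:
    w, b = res.x[:d], res.x[d]
    print("all points correctly classified:", np.all(np.sign(X @ w + b) == y))
```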

Provably solving the optimization problem for general neural networks with two or more layers would require algorithms that run up against some of the longest-standing open problems in computer science, so there is little hope that machine learning researchers will find one. The reason is that the problem is NP-hard: a provably polynomial-time solution would also solve thousands of other problems that have remained open for decades. In fact, J. Stephen Judd showed in 1988 that the following problem is NP-hard:

Given a general neural network and a set of training examples, is there a set of network edge weights such that the neural network produces correct outputs for all training examples?

Judd also showed that the problem remains NP-hard even if the network is only required to produce correct outputs for two-thirds of the training examples, which means that even approximately training a neural network is intrinsically hard in the worst case. In 1993, Blum and Rivest made the news worse: even training a simple neural network with just two layers and three nodes is NP-hard.

In theory, what separates deep learning from many simpler machine learning models (such as support vector machines and logistic regression) is that those models can be mathematically proven to be optimized in polynomial time. For them we have a guarantee: even an optimization algorithm allowed to run longer than polynomial time could not find a better model. The optimization algorithms we have for deep neural networks offer no such guarantee. After training a deep network, you do not know whether it is the best model you could have found under your current configuration, so you are left wondering whether continued training would produce something better.
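The contrast can be made concrete with logistic regression as the convex model (a hedged sketch on synthetic data; the dataset and optimizer choice are assumptions for illustration): because the loss is convex, a near-zero gradient at the returned weights certifies that no better weights exist anywhere, a certificate that no deep-network optimizer can offer.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
w_true = rng.normal(size=5)
X = rng.normal(size=(300, 5))
y = (X @ w_true + 0.5 * rng.normal(size=300) > 0).astype(float)   # noisy labels

def loss_and_grad(w):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    grad = X.T @ (p - y) / len(y)
    return loss, grad

res = minimize(loss_and_grad, np.zeros(5), jac=True, method="L-BFGS-B")
_, grad = loss_and_grad(res.x)
print("gradient norm at the solution:", np.linalg.norm(grad))
# Convex loss + (near-)zero gradient => no better weights exist anywhere.
# For a deep network, a zero gradient only certifies a stationary point.
```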

Fortunately, in practice we can get very close to these best results: running classic gradient descent methods gets us to a local minimum that is good enough to make huge progress on many common problems, such as image recognition, speech recognition, and machine translation. We simply ignore the question of optimality and run as many gradient descent iterations as time allows.
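In code, that recipe is nothing more than iterating gradient descent until the time budget runs out and keeping whatever local minimum it reaches. A minimal numpy sketch (the two-layer architecture, XOR-like toy data, and two-second budget are all illustrative assumptions):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)[:, None]     # XOR-like: not linearly separable

H, lr, budget = 16, 0.5, 2.0                            # hidden width, step size, seconds
W1 = rng.normal(scale=0.5, size=(2, H)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.5, size=(H, 1)); b2 = np.zeros(1)

start = time.time()
while time.time() - start < budget:                     # run as long as time allows
    h = np.tanh(X @ W1 + b1)                            # forward pass
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
    dlogits = (p - y) / len(y)                          # backward pass (cross-entropy loss)
    dW2, db2 = h.T @ dlogits, dlogits.sum(0)
    dh = (dlogits @ W2.T) * (1 - h ** 2)
    dW1, db1 = X.T @ dh, dh.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1                      # gradient step
    W2 -= lr * dW2; b2 -= lr * db2

h = np.tanh(X @ W1 + b1)
p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
print(f"training accuracy after {budget:.0f}s: {np.mean((p > 0.5) == y):.2f}")
# No certificate that these weights are the best possible, but the local
# minimum found this way is usually good enough in practice.
```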

The classic results from optimization theory may look grim, but we can work around them with engineering methods and mathematical tricks: heuristics, more machines, and new hardware such as GPUs. Research is actively exploring why, despite these grim theoretical results, the classic optimization algorithms work so well in practice.
