Difficulty of deep learning: The deeper the neural network, the harder the optimization problem

At its core, training a deep neural network means solving a very difficult optimization problem. In the decades after neural networks were introduced, the difficulty of optimizing deep networks was a major factor that kept them from becoming mainstream, and it contributed to their decline from the 1990s through the early 2000s. Today this problem has been essentially solved in practice. In this blog post, I will explore what makes optimizing neural networks "hard" and how theory explains it. In short: the deeper the neural network, the harder the optimization problem.

The simplest neural network is a single-node perceptron, whose optimization problem is convex. The benefit of a convex optimization problem is that every local minimum is also a global minimum. A wide variety of algorithms exist for solving convex optimization problems, and better polynomial-time algorithms continue to be discovered every few years. Using convex optimization, it is easy to optimize the weights of a single neuron (see the figure below). Let's see what happens when we go beyond a single neuron.
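To make this concrete, here is a minimal sketch (NumPy, with synthetic toy data of my own choosing) of training a single sigmoid neuron by gradient descent. Because the logistic loss is convex in the weights, the starting point does not matter: descent heads toward the global minimum.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # 200 toy examples, 3 features
true_w = np.array([1.5, -2.0, 0.5])            # assumed "ground truth" weights
noise = 0.5 * rng.normal(size=200)             # label noise so the data is not perfectly separable
y = (X @ true_w + noise > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(3)                                # any starting point works: the loss is convex
lr = 0.5
for _ in range(2000):
    p = sigmoid(X @ w)                         # predicted probabilities
    grad = X.T @ (p - y) / len(y)              # gradient of the average logistic loss
    w -= lr * grad

print("learned weights:", w)
```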

Figure 1. Left: a convex function. Right: a non-convex function. It is much easier to find the bottom of the surface of a convex function than of a non-convex one. (Source: Reza Zadeh)

The next natural step is to add more neurons while keeping the network a single layer. For a single layer of n perceptron nodes, if there exists a set of edge weights under which the network classifies the training set correctly, then such weights can be found in time polynomial in n using linear programming. Linear programming is itself a special case of convex optimization. This raises a question: can we make similar guarantees for deeper, multi-layer neural networks? Unfortunately, no.
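As an illustration of the single-layer claim above, here is a hedged sketch that poses "find separating weights" as a linear-programming feasibility problem with scipy.optimize.linprog; the tiny dataset and the margin of 1 are assumptions made purely for the example.

```python
import numpy as np
from scipy.optimize import linprog

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])  # toy inputs
y = np.array([1.0, 1.0, -1.0, -1.0])                                # labels in {+1, -1}

# Feasibility: find (w, b) with y_i * (w . x_i + b) >= 1 for every example.
# Rewritten into linprog's A_ub @ z <= b_ub form with z = [w1, w2, b]:
A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])
b_ub = -np.ones(len(X))
c = np.zeros(3)                                # no objective, pure feasibility

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 3)
print("separating weights found:", res.success, res.x)
```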

Provably solving the optimization problem for general neural networks with two or more layers would require algorithms that run into some of the hardest open problems in computer science, so we hold out little hope that machine learning researchers will find an algorithm that provably solves it. The reason is that the problem is NP-hard: if it could be provably solved in polynomial time, then thousands of other problems that have resisted solution for decades could be solved as well. In fact, J. Stephen Judd showed in 1988 that the following problem is NP-hard:

Given a general neural network and a set of training examples, is there a set of network edge weights such that the neural network produces correct outputs for all training examples?

Judd's research also showed that the problem remains NP-hard even if the network is only required to produce correct outputs for two-thirds of the training examples, which means that even approximately training a neural network is intrinsically difficult in the worst case. Blum and Rivest found something even worse in 1993: even training a simple neural network with only two layers and three nodes is NP-hard.

In theory, this is what separates deep learning from many simpler machine learning models, such as support vector machines and logistic regression: for these simpler models it can be mathematically proven that training can be completed in polynomial time. With them, we are guaranteed that even an optimization algorithm running longer than polynomial time cannot find a better model. Existing optimization algorithms for deep neural networks offer no such guarantee. After training a deep network, you do not know whether it is the best model you could have found under your current configuration, so you are left wondering whether continuing to train would yield a better one.

Fortunately, in practice we can get very close to these best results: classic gradient descent can reach a local minimum that is good enough to let us make huge progress on many common problems, such as image recognition, speech recognition, and machine translation. We simply ignore the question of global optimality and run as many gradient descent iterations as time allows.
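As a toy illustration of this recipe, here is a minimal NumPy sketch (the XOR problem, the network size, and the learning rate are arbitrary choices of mine) that trains a small two-layer network with plain gradient descent for a fixed iteration budget and accepts whatever minimum it reaches.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])          # XOR: not linearly separable

W1 = rng.normal(scale=0.5, size=(2, 4))         # input -> hidden weights
b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=(4, 1))         # hidden -> output weights
b2 = np.zeros(1)
lr = 0.5

for step in range(5000):                        # "as many iterations as time allows"
    h = np.tanh(X @ W1 + b1)                    # hidden layer
    out = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))  # sigmoid output
    # backpropagate the squared-error loss and take one gradient step
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * (1 - h ** 2)
    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0)

print("predictions:", out.ravel().round(2))     # near [0, 1, 1, 0] at a good minimum
```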

The classic results from optimization theory may seem grim, but in practice we circumvent them through engineering methods and mathematical tricks, such as heuristics, adding more machines, and new hardware such as GPUs. There is active research into why, despite these grim theoretical results, classic optimization algorithms nevertheless work so well.
