Difficulty of deep learning: The deeper the neural network, the harder the optimization problem

At its core, deep learning poses a very difficult optimization problem. In the decades after neural networks were first introduced, the difficulty of optimizing deep networks was a major reason they failed to go mainstream, and it contributed to their decline from the 1990s through the early 2000s. Today the problem has essentially been solved in practice. In this post I explore what makes optimizing neural networks "difficult" and how theory explains it. In short: the deeper the neural network, the harder the optimization problem.

The simplest neural network is a single-node perceptron, whose optimization problem is convex. The nice property of a convex problem is that every local minimum is also a global minimum. A variety of algorithms exist for solving convex optimization problems, and every few years better polynomial-time algorithms for convex optimization are discovered. With such algorithms it is easy to optimize the weights of a single neuron (see the figure below). Let's see what happens as we grow beyond a single neuron.

Figure 1. Left: a convex function. Right: a non-convex function. Finding the bottom of the surface is much easier for a convex function than for a non-convex one. (Source: Reza Zadeh)
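To make the convex case concrete, here is a minimal sketch (the toy data, sigmoid activation, cross-entropy loss, and learning rate are illustrative assumptions, not taken from the article) of optimizing a single neuron with plain gradient descent; because the loss is convex in the weights, the minimum it settles into is the global one:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # toy inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float)     # linearly separable labels

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, lr = np.zeros(2), 0.0, 0.5
for step in range(500):
    p = sigmoid(X @ w + b)                    # single-neuron output
    grad_w = X.T @ (p - y) / len(y)           # gradient of the convex cross-entropy loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

p = sigmoid(X @ w + b)
loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
print(f"final loss: {loss:.4f}, accuracy: {np.mean((p > 0.5) == y):.2f}")
```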

The natural next step is to add more neurons while keeping the network at a single layer. For a single-layer network of n perceptron nodes, if there exist edge weights under which the network correctly classifies the training set, then such weights can be found in time polynomial in n using linear programming, which is itself a special case of convex optimization. This raises a question: can we make similar guarantees for deeper networks with multiple layers? Unfortunately, no.
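That linear-programming formulation can be sketched directly (a rough illustration on made-up toy data, not code from the article): finding weights that classify a linearly separable training set is a linear feasibility problem, which scipy.optimize.linprog can solve.

```python
import numpy as np
from scipy.optimize import linprog

# Toy linearly separable data with labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
n, d = X.shape

# Feasibility problem: find (w, b) with y_i * (w . x_i + b) >= 1 for every i.
# Rewritten as -y_i * (w . x_i + b) <= -1 to match linprog's A_ub @ z <= b_ub form.
Xb = np.hstack([X, np.ones((n, 1))])              # append a column for the bias term
A_ub = -(y[:, None] * Xb)
b_ub = -np.ones(n)
c = np.zeros(d + 1)                               # zero objective: any feasible point will do

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + 1))
if res.success:
    w, b = res.x[:d], res.x[d]
    print("all points correctly classified:", np.all(np.sign(X @ w + b) == y))
```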

Provably solving the optimization problem for general neural networks with two or more layers would require algorithms that run up against some of the longest-standing open problems in computer science, so there is little hope that machine learning researchers will find one. The reason is that the problem is NP-hard: a provably polynomial-time solution would also solve thousands of other problems that have remained open for decades. In fact, J. Stephen Judd showed in 1988 that the following problem is NP-hard:

Given a general neural network and a set of training examples, is there a set of network edge weights such that the neural network produces correct outputs for all training examples?

Judd also showed that the problem remains NP-hard even if the network is only required to produce correct outputs for two-thirds of the training examples, which means that even approximately training a neural network is intrinsically hard in the worst case. In 1993, Blum and Rivest made the news worse: even training a simple neural network with just two layers and three nodes is NP-hard.

In theory, what separates deep learning from many simpler machine learning models (such as support vector machines and logistic regression) is that those models can be mathematically proven to be optimized in polynomial time. For them we have a guarantee: even an optimization algorithm allowed to run longer than polynomial time could not find a better model. The optimization algorithms we have for deep neural networks offer no such guarantee. After training a deep network, you do not know whether it is the best model you could have found under your current configuration, so you are left wondering whether continued training would produce something better.
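The contrast can be made concrete with logistic regression as the convex model (a hedged sketch on synthetic data; the dataset and optimizer choice are assumptions for illustration): because the loss is convex, a near-zero gradient at the returned weights certifies that no better weights exist anywhere, a certificate that no deep-network optimizer can offer.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
w_true = rng.normal(size=5)
X = rng.normal(size=(300, 5))
y = (X @ w_true + 0.5 * rng.normal(size=300) > 0).astype(float)   # noisy labels

def loss_and_grad(w):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    grad = X.T @ (p - y) / len(y)
    return loss, grad

res = minimize(loss_and_grad, np.zeros(5), jac=True, method="L-BFGS-B")
_, grad = loss_and_grad(res.x)
print("gradient norm at the solution:", np.linalg.norm(grad))
# Convex loss + (near-)zero gradient => no better weights exist anywhere.
# For a deep network, a zero gradient only certifies a stationary point.
```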

Fortunately, in practice we can get very close to these best results: running classic gradient descent methods gets us to a local minimum that is good enough to make huge progress on many common problems, such as image recognition, speech recognition, and machine translation. We simply ignore the question of optimality and run as many gradient descent iterations as time allows.
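In code, that recipe is nothing more than iterating gradient descent until the time budget runs out and keeping whatever local minimum it reaches. A minimal numpy sketch (the two-layer architecture, XOR-like toy data, and two-second budget are all illustrative assumptions):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)[:, None]     # XOR-like: not linearly separable

H, lr, budget = 16, 0.5, 2.0                            # hidden width, step size, seconds
W1 = rng.normal(scale=0.5, size=(2, H)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.5, size=(H, 1)); b2 = np.zeros(1)

start = time.time()
while time.time() - start < budget:                     # run as long as time allows
    h = np.tanh(X @ W1 + b1)                            # forward pass
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
    dlogits = (p - y) / len(y)                          # backward pass (cross-entropy loss)
    dW2, db2 = h.T @ dlogits, dlogits.sum(0)
    dh = (dlogits @ W2.T) * (1 - h ** 2)
    dW1, db1 = X.T @ dh, dh.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1                      # gradient step
    W2 -= lr * dW2; b2 -= lr * db2

h = np.tanh(X @ W1 + b1)
p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
print(f"training accuracy after {budget:.0f}s: {np.mean((p > 0.5) == y):.2f}")
# No certificate that these weights are the best possible, but the local
# minimum found this way is usually good enough in practice.
```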

The classic results from optimization theory may look grim, but we can work around them with engineering methods and mathematical tricks: heuristics, more machines, and new hardware such as GPUs. Research is actively exploring why, despite these grim theoretical results, the classic optimization algorithms work so well in practice.
