Paper link: https://arxiv.org/abs/1706.10207

Abstract: This paper introduces key models, algorithms, and open problems in applying optimization methods to machine learning. It is written for readers with some background, especially those who are familiar with basic optimization algorithms but new to machine learning. First, we derive a formulation of the supervised learning problem and show how, depending on the context and underlying assumptions, it gives rise to various optimization problems. We then discuss some salient features of these optimization problems, focusing on logistic regression and the training of deep neural networks. The second half of the paper covers several optimization algorithms, starting with convex logistic regression and then discussing first-order methods, including stochastic gradient methods (SGD) and variance-reducing stochastic methods, followed by second-order methods. Finally, we discuss how these methods can be applied to the training of deep neural networks, focusing on the difficulties caused by the complex, non-convex structure of these models.

1 Introduction

Over the past two decades, the fascinating algorithmic field of machine learning has grown at an almost unprecedented pace. Machine learning builds on statistics and computer science, with mathematical optimization methods at its core. In fact, many of the latest theoretical and practical advances in optimization have been driven by machine learning and other data-driven disciplines. Even with these connections, however, many barriers remain between statistics, computer science, and the optimization research devoted to machine learning problems. This article therefore attempts to lower these barriers by providing an overview of machine learning algorithms.

The purpose of this paper is to give an overview of some key issues and research problems in machine learning. Given the background typical of operations research, we assume that the reader is familiar with basic optimization theory, but we still introduce the relevant terms and concepts used in machine learning, hoping to ease communication between operations researchers and people from other contributing fields. To this end, the most important terms introduced and used in this paper are listed in Glossary 1.

1.1 Motivation
1.2 Learning and Optimization Problems
1.3 Learning Bounds, Overfitting, and Regularization

2 Optimization methods for solving logistic regression problems (optimization methods for shallow models)

When the loss L and the regularizer r are arbitrary convex functions of w, the methods discussed in this section can be used to solve problem (11):

min_w F(w) = (1/n) Σ_{i=1}^n L(w; xi, yi) + λ r(w).    (11)

This category includes many machine learning models, among them support vector machines, Lasso (Least Absolute Shrinkage and Selection Operator), and sparse inverse covariance selection; for details on these models, see [86] and the references therein. To make each step concrete, we take regularized logistic regression for binary classification as a running example. To simplify the notation, we assume without loss of generality that the bias term b0 is zero. (This omission can be compensated by appending to the input vector a feature that is always equal to 1.) With labels yi ∈ {−1, +1}, the problem reads

min_w F(w) = (1/n) Σ_{i=1}^n log(1 + exp(−yi wᵀxi)) + λ r(w).    (12)
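To make the running example concrete, here is a minimal NumPy sketch of the objective in (12) and its gradient, assuming the smooth regularizer r(w) = ||w||_2^2 and labels yi ∈ {−1, +1}. The function names are ours, for illustration only; the later sketches in this section reuse them.

```python
import numpy as np

def logistic_objective(w, X, y, lam):
    """F(w) = (1/n) sum_i log(1 + exp(-y_i w^T x_i)) + lam * ||w||_2^2."""
    margins = y * (X @ w)                          # y_i * w^T x_i, shape (n,)
    loss = np.mean(np.logaddexp(0.0, -margins))    # stable log(1 + e^{-m})
    return loss + lam * np.dot(w, w)

def logistic_gradient(w, X, y, lam):
    """Gradient of F at w; it touches all n examples, hence the O(nd)
    cost discussed in Section 2.1.1."""
    margins = y * (X @ w)
    # -y_i / (1 + exp(y_i w^T x_i)), computed stably via exp(-log(1 + e^m))
    coeffs = -y * np.exp(-np.logaddexp(0.0, margins))
    return (X.T @ coeffs) / len(y) + 2.0 * lam * w
```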
Since w and each xi are d-dimensional vectors, (12) is a concrete convex optimization problem. It is worth noting that for this type of problem the regularization term is indispensable. To see why, suppose there is a parameter vector w such that yi (wᵀxi) > 0 for all i ∈ {1,...,n}, and consider the unbounded ray {θw : θ > 0}. Then, as θ → ∞,

(1/n) Σ_{i=1}^n log(1 + exp(−θ yi wᵀxi)) → 0,

while the loss remains strictly positive for every finite θ. That is to say, without regularization the objective in (12) cannot attain its minimum. On the other hand, adding (enforcing) the regularization function r guarantees that problem (12) has an optimal solution.

For the regularization function r, the common choices are r(w) = ||w||_2^2 and r(w) = ||w||_1. For simplicity we usually choose the former, because it makes (12) continuously differentiable. In contrast, r(w) = ||w||_1 leads to a nonsmooth problem, whose minimization requires more sophisticated algorithms.

2.1 First-order methods

We first discuss first-order methods for solving problem (12), where "first-order" simply means that the methods use only the first partial derivatives of the objective F with respect to the parameters.

2.1.1 Gradient Descent

Conceptually, the simplest way to minimize a smooth convex objective is gradient descent, as discussed in [62]. Starting from an initial estimate w0, the method iteratively updates the weight estimates by

w_{k+1} = w_k − αk ∇F(w_k),

where αk > 0 is a step size parameter. The choice of the step size sequence {αk} directly determines the performance of the algorithm. In optimization research it is generally accepted that determining αk by a line search at each iteration yields robust algorithms for a wide variety of problems. For machine learning applications, however, this operation is expensive, because each evaluation of F requires a pass over the entire dataset, which is costly when n is large. Fortunately, when every αk is set to a sufficiently small positive constant α, convergence of the algorithm can still be guaranteed in theory. (This fixed step size constant is called the learning rate in machine learning; even when it is not constant, some authors call αk, or the entire sequence {αk}, the learning rate.) The convergence rate of the algorithm depends on whether F is strongly convex or merely convex: gradient descent converges linearly on strongly convex objectives and only sublinearly, at rate O(1/k), on merely convex ones. A minimal sketch of the fixed-step scheme appears at the end of this subsection.

The gradient descent and accelerated gradient descent extensions for solving the logistic regression problem with L1-norm regularization are called ISTA and FISTA, respectively. Note that in this case the objective is not strongly convex even when λ > 0; when the objective is (merely) convex [5], ISTA and FISTA attain the same sublinear convergence rates as their counterparts for smooth objectives.

An important feature of gradient descent in ML training is the computational cost of evaluating the gradient of F at each iteration, which is usually O(nd). This is easy to see in our example: with the regularization term r(w) = ||w||_2^2, the gradient of F at any given w is

∇F(w) = −(1/n) Σ_{i=1}^n yi xi / (1 + exp(yi wᵀxi)) + 2λw,

whose evaluation touches all n examples.
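As promised, here is a minimal gradient descent loop for (12). It is a sketch under our earlier assumptions (L2 regularization, the logistic_gradient helper defined above), with a constant learning rate alpha chosen by hand rather than by line search.

```python
def gradient_descent(w0, X, y, lam, alpha=0.1, iters=500):
    """Fixed-step gradient descent: w_{k+1} = w_k - alpha * grad F(w_k)."""
    w = w0.copy()
    for _ in range(iters):
        w -= alpha * logistic_gradient(w, X, y, lam)  # one full O(nd) pass
    return w
```

In theory alpha must be small enough relative to the Lipschitz constant of ∇F for this loop to converge; in practice it is simply tuned.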
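For the nonsmooth choice r(w) = ||w||_1 mentioned above, ISTA replaces the plain gradient step with a proximal step, which for the L1 norm is soft-thresholding. A sketch under the same assumptions, reusing logistic_gradient for the smooth part only:

```python
def ista(w0, X, y, lam, alpha=0.1, iters=500):
    """ISTA for min_w (1/n) sum_i log(1 + exp(-y_i w^T x_i)) + lam * ||w||_1.

    Each iteration: a gradient step on the smooth loss, then the proximal
    operator of lam * ||.||_1, i.e., soft-thresholding at level alpha*lam.
    """
    w = w0.copy()
    for _ in range(iters):
        z = w - alpha * logistic_gradient(w, X, y, lam=0.0)  # smooth part only
        w = np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)
    return w
```

FISTA adds a momentum-style extrapolation step on top of this iteration to obtain its accelerated rate.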
2.1.2 Stochastic Gradient Method

The stochastic gradient method is well known in operations research for minimizing stochastic objective functions, and it is also the workhorse optimization algorithm of the ML community. The algorithm was originally proposed by Robbins and Monro [67] for solving systems of stochastic equations. It is noteworthy that it minimizes such objectives with good convergence guarantees at a computational cost of only O(d) per iteration, instead of the O(nd) per iteration of gradient descent. At each iteration, the stochastic gradient method computes an unbiased estimate g_k of the gradient ∇F(w_k). This estimate can be computed at low cost; for example, writing (12) as F(w) = (1/n) Σ_{i=1}^n f_i(w) with f_i(w) = log(1 + exp(−yi wᵀxi)) + λ r(w), the stochastic gradient at iteration k can be taken as

g_k = (1/|S_k|) Σ_{i∈S_k} ∇f_i(w_k),

where S_k is called a minibatch and its elements are drawn uniformly from the full dataset {1,...,n}. The subsequent update mirrors gradient descent:

w_{k+1} = w_k − αk g_k.

Undoubtedly, the key to this algorithm lies in the choice of the step size sequence {αk}. Unlike gradient descent, a fixed step size (i.e., learning rate) does not guarantee convergence to the minimizer of a strongly convex F, but only convergence to a neighborhood of the minimizer. SGD also converges more slowly than gradient descent: when F is strongly convex, the algorithm only guarantees a solution of expected accuracy ε (i.e., one satisfying E[F(w_k)] − F(w*) ≤ ε) after k ≥ O(1/ε) iterations, and when F is merely convex, only after k ≥ O(1/ε^2) iterations [11]. On the other hand, as mentioned earlier, if |S_k| is bounded by a constant (independent of n and k), each SGD iteration is a factor of O(n) cheaper than a gradient descent iteration.

However, in practice standard SGD is not necessarily the most effective way to solve optimization problems arising in machine learning, and there has been much active research in the ML and optimization communities on improvements and alternatives to SGD. In the next two parts we discuss two such families: variance reduction methods and second-order methods. Many other methods exist besides these two; for example, SGD with momentum is an extension of SGD that has been found to perform better than standard SGD in practice. The basic stochastic gradient method is stated as Algorithm 1; a minimal minibatch sketch appears at the end of §2.1.3.

2.1.3 Variance reduction methods

Considering problem (11), it has been found that SGD can be improved by exploiting the structure of the objective F as a finite sum of n functions plus a simple convex term. Several such methods have been developed, including SAG [74], SAGA [22], SDCA [76], and SVRG [44]. For ease of reference, we state SVRG as Algorithm 2. The algorithm performs a full gradient computation at each outer iteration and then takes L inner steps along directions given by stochastically corrected versions of that full gradient,

g_j = ∇f_{ij}(w_j) − ∇f_{ij}(w~) + ∇F(w~),

where w~ is the outer iterate at which the full gradient was computed. The inner loop length L must satisfy certain conditions to guarantee convergence [44]. SVRG, short for Stochastic Variance-Reduced Gradient, owes its name to the fact that it can be viewed as a variance-reduced variant of SGD (specialized to finite-sum minimization).

Researchers have combined ideas from SVRG and SAGA to propose a new method called SARAH. Its inner iteration differs from SVRG's only in that the direction is defined recursively as

g_j = ∇f_{ij}(w_j) − ∇f_{ij}(w_{j−1}) + g_{j−1}.

This change results in E[g_j] ≠ ∇F(w_j), so the steps in SARAH are not based on unbiased gradient estimates. Nevertheless, SARAH achieves improved convergence properties relative to SVRG.
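To make these stochastic methods concrete, first a minimal sketch of the minibatch SGD update of §2.1.2, with the classical diminishing steps αk = α0/(k+1); the batch size and step schedule are illustrative choices, not prescriptions from the paper.

```python
def sgd(w0, X, y, lam, batch_size=32, iters=10000, alpha0=1.0):
    """Minibatch SGD: w_{k+1} = w_k - alpha_k * g_k, where g_k is an unbiased
    gradient estimate built from a uniformly sampled minibatch S_k.

    Each iteration costs O(|S_k| d), versus O(nd) for full gradient descent.
    """
    rng = np.random.default_rng(0)
    n = len(y)
    w = w0.copy()
    for k in range(iters):
        S = rng.integers(0, n, size=batch_size)     # minibatch S_k
        g = logistic_gradient(w, X[S], y[S], lam)   # E[g] = grad F(w)
        w -= (alpha0 / (k + 1)) * g
    return w
```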
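And here is a sketch of the SVRG outer/inner loop (Algorithm 2) on our running example, with the SARAH recursion noted in a comment; the inner loop length and step size are again illustrative.

```python
def svrg(w0, X, y, lam, alpha=0.05, outer=20, inner=100):
    """SVRG: one full gradient per outer iteration, then L = inner steps
    along the variance-reduced directions
    g_j = grad f_i(w_j) - grad f_i(w~) + grad F(w~)."""
    rng = np.random.default_rng(0)
    n = len(y)
    w_tilde = w0.copy()
    for _ in range(outer):
        full_grad = logistic_gradient(w_tilde, X, y, lam)  # grad F(w~), O(nd)
        w = w_tilde.copy()
        for _ in range(inner):
            i = rng.integers(n)
            Xi, yi = X[i:i + 1], y[i:i + 1]
            g = (logistic_gradient(w, Xi, yi, lam)
                 - logistic_gradient(w_tilde, Xi, yi, lam)
                 + full_grad)
            # SARAH instead recurses on the previous direction:
            #   g_j = grad f_i(w_j) - grad f_i(w_{j-1}) + g_{j-1},
            # which is no longer unbiased for grad F(w_j).
            w -= alpha * g
        w_tilde = w
    return w_tilde
```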
Table 2: Computational complexity of first-order methods for minimizing strongly convex functions
Table 3: Computational complexity of first-order methods for minimizing general convex functions

2.2 Second-order methods and quasi-Newton methods

Inspired by decades of research in deterministic optimization, one of the most active research directions in ML optimization concerns how to use second-order derivative (i.e., curvature) information to speed up training. Unfortunately, when n or d is large, computing and storing the Hessian matrix becomes prohibitively expensive in machine learning applications.

Another class of algorithms is based on a quadratic model of the objective of the form

q_k(w) = F(w_k) + ∇F(w_k)ᵀ(w − w_k) + (1/2)(w − w_k)ᵀ B_k (w − w_k),    (21)

where B_k is a symmetric positive definite approximation of ∇²F(w_k); quasi-Newton methods are of this type. Interestingly, these methods do not compute explicit second derivatives; instead, they construct a Hessian approximation entirely from first-order information by applying a low-rank update at each iteration. As an example, let us briefly describe the most popular quasi-Newton algorithm, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method. Minimizing (21) yields the step w_{k+1} = w_k − αk B_k^{-1} ∇F(w_k), so it is in fact convenient to maintain an approximation H_k to the inverse Hessian directly. With the displacement s_k = w_{k+1} − w_k and the gradient change y_k = ∇F(w_{k+1}) − ∇F(w_k), one chooses H_{k+1} to be as close as possible to H_k (in a carefully chosen weighted norm) subject to the secant equation s_k = H_{k+1} y_k. The solution of this problem can be written explicitly as

H_{k+1} = (I − ρ_k s_k y_kᵀ) H_k (I − ρ_k y_k s_kᵀ) + ρ_k s_k s_kᵀ,   where ρ_k = 1/(y_kᵀ s_k),

so that H_{k+1} differs from H_k only by a rank-two matrix. For ease of reference, the complete classical BFGS algorithm is stated as Algorithm 3.

Even with second-order information, stochastic optimization methods (without variance reduction) cannot achieve a convergence rate faster than sublinear. However, using second-order information is still attractive, because it can reduce the constants in the convergence rate and mitigate the effects of ill-conditioning if the Hessian approximations converge to the true Hessian at the solution. Unfortunately, despite the practical efficiency gains that have been observed, no truly stochastic second-order method has been shown to achieve such gains in theory. So far, most practical methods only guarantee the convergence (rate) properties of SGD, as long as the Hessian (approximation) matrices remain well behaved. For example, if the sequence {B_k} (not necessarily generated by BFGS updates) satisfies, for all k,

c1 I ⪯ B_k ⪯ c2 I   for some constants 0 < c1 ≤ c2,

then the resulting method has the same convergence rate properties as SGD. It is reasonable to expect such bounds to hold for the methods discussed above, provided appropriate safeguards are in place. Caution is needed in the quasi-Newton setting, however, because the stochastic gradient estimates may be correlated with the Hessian approximations.
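A compact sketch of the classical BFGS iteration (Algorithm 3) on our running example follows. The fixed step size stands in for the line search a practical implementation would use, and the skip rule for y_kᵀ s_k ≤ 0 is a standard safeguard, not something specific to the paper.

```python
def bfgs_update(H, s, yv):
    """Rank-two BFGS update of the inverse Hessian approximation:
    H_{k+1} = (I - rho s y^T) H (I - rho y s^T) + rho s s^T, rho = 1/(y^T s)."""
    rho = 1.0 / np.dot(yv, s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, yv)
    return V @ H @ V.T + rho * np.outer(s, s)

def bfgs(w0, X, y, lam, alpha=0.5, iters=100):
    """Quasi-Newton iteration w_{k+1} = w_k - alpha * H_k grad F(w_k)."""
    w = w0.copy()
    H = np.eye(len(w0))                    # initial inverse Hessian guess
    g = logistic_gradient(w, X, y, lam)
    for _ in range(iters):
        w_new = w - alpha * (H @ g)
        g_new = logistic_gradient(w_new, X, y, lam)
        s, yv = w_new - w, g_new - g       # displacement and gradient change
        if np.dot(yv, s) > 1e-10:          # curvature condition y^T s > 0
            H = bfgs_update(H, s, yv)      # keeps H positive definite
        w, g = w_new, g_new
    return w
```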
3 Deep Learning

Many of the major recent advances in machine learning stem from the use of deep neural networks (DNNs). The corresponding branch of machine learning is called deep learning (or hierarchical learning); it represents a class of algorithms that attempt to construct high-level abstractions of data using deep, multi-layered graphs composed of successive linear and nonlinear transformations [6, 51, 73, 37, 38, 23]. In recent years a variety of network types have been studied, including fully connected neural networks (FNNs) [84, 28], convolutional neural networks (CNNs) [50], and recurrent neural networks (RNNs) [41, 57, 52]. For our purposes we focus on the first two types, while keeping the others in view.

3.1 Problem Formulation

3.2 Stochastic Gradient Descent

We cite the following results to highlight how mixed the theoretical picture is when optimization algorithms are applied to training DNNs. First, there are results, for example in [11], showing that when SGD is applied to minimize a non-convex objective (with data drawn from the input × output space), the expected gradient norm is guaranteed to vanish at least along a subsequence, i.e.,

lim inf_{k→∞} E[||∇F(w_k)||] = 0.

This result is reassuring, as it suggests that SGD can achieve convergence guarantees similar to those of other state-of-the-art gradient-based optimization algorithms. However, despite the various guarantees in the literature, there are limitations: after all, while many gradient-based optimization algorithms ensure that the objective decreases monotonically, SGD does not operate this way. So if a subsequence converges to a stationary point, how can we be sure that this point is not a saddle point, a poorly performing local minimizer, or even a local maximizer whose objective value is worse than at the initial point? In fact, we cannot; such guarantees say only that SGD finds stationary points, not global minimizers. Moreover, SGD tends to converge slowly near stationary points, which may limit its effectiveness for deep neural networks. In general, convergence rates of SGD for non-convex problems are documented in [29, 30], but they are very limited; in particular, they do not apply to the discussion in §1.3. Therefore, we cannot argue in the same way that SGD is the best method for non-convex optimization problems in machine learning. In addition, the learning bounds of §1.3 are vacuous here, because for many DNNs and CNNs the complexity C of the class of functions produced by the network is much larger than the number of training examples n. Indeed, it was shown empirically in [90] that neural networks can easily fit typical datasets even when the data are randomly perturbed.

3.3 Hessian-free methods

It has been found that the backpropagation procedure of a DNN can be modified to compute Hessian-vector products, since such products can be viewed as directional derivatives [65]. The complexity of computing such a product is only a constant factor larger than that of computing the gradient. The resulting class of methods is often called Hessian-free optimization, since the Hessian information is accessed and used without ever storing an explicit Hessian matrix.

In the case of DNNs, the non-convexity of the objective raises additional difficulties, because the true Hessian may not be positive definite. In deterministic optimization, two common ways of dealing with this are modifying the Hessian and applying trust region methods; both have been explored in the context of training DNNs. For example, in [54, 55] a Gauss-Newton method was proposed, which approximates the Hessian of the function F in (11) (with the regularization term omitted) by the first term of

∇²F(w) = (1/n) Σ_{i=1}^n [ ∇p(w, xi)ᵀ H_ℓ(p(w, xi), yi) ∇p(w, xi) + Σ_{j=1}^{dy} [∂ℓ/∂pj](p(w, xi), yi) ∇²[pj(w, xi)] ],

where H_ℓ(·, ·) is the Hessian of the loss function ℓ(·, ·) with respect to its first argument, ∇p(w, xi) is the Jacobian of the dy-dimensional prediction function p(w, x) with respect to the weights w, and ∇²[pj(w, xi)], for j ∈ {1, ..., dy}, are the element-wise Hessians with respect to w.
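The following sketch illustrates the matrix-free idea on our convex running example. For a DNN one would compute exact Hessian-vector products with the modified backpropagation of [65]; here a finite-difference approximation of H(w)v stands in for that, and the damping constant is an illustrative safeguard for the non-convex case discussed above.

```python
def hessian_vector_product(w, v, X, y, lam, eps=1e-6):
    """Approximate H(w) v by a forward difference of gradients; this costs
    one extra gradient evaluation, and the Hessian is never formed or stored."""
    g0 = logistic_gradient(w, X, y, lam)
    g1 = logistic_gradient(w + eps * v, X, y, lam)
    return (g1 - g0) / eps

def newton_cg_step(w, X, y, lam, damping=1e-3, cg_iters=20):
    """One 'Hessian-free' Newton step: solve (H + damping*I) p = -grad F(w)
    by conjugate gradients, touching H only through products H v."""
    g = logistic_gradient(w, X, y, lam)
    p = np.zeros_like(w)
    r = -g.copy()                          # residual of the linear system
    d = r.copy()                           # CG search direction
    rs = np.dot(r, r)
    for _ in range(cg_iters):
        Ad = hessian_vector_product(w, d, X, y, lam) + damping * d
        a = rs / np.dot(d, Ad)
        p += a * d
        r -= a * Ad
        rs_new = np.dot(r, r)
        if rs_new < 1e-12:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return w + p
```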
3.4 Subsampled Hessian methods

Recently, in a series of papers [3, 15, 34], researchers have used a very general framework of stochastic models to analyze trust region, line search, and adaptive cubic regularization methods in both convex and non-convex settings. This work shows that standard optimization methods using stochastic, inexact gradient and Hessian information retain their convergence rates as long as the gradient and Hessian estimates are sufficiently accurate with some fixed positive probability. In the machine learning setting, with sampled Hessians and gradients, these results only require that the sample size |S_k| be chosen sufficiently large relative to the length of the step taken by the algorithm; for example, in [3, 34] the size of |S_k| is tied to the trust region radius. Note that accurate sampled Hessians can be obtained with much smaller sample sets than accurate sampled gradients, and the idea of pairing exact (or large-sample) gradients with subsampled Hessian estimates has led to powerful algorithms with strong theoretical support and good practical efficiency.
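As a closing illustration, here is a sketch of one subsampled Newton-CG step in the spirit just described: the gradient uses all n examples, while the Hessian-vector products inside CG are formed on a much smaller sample S_k. It reuses the helpers from the previous sketches; the sample size and damping are illustrative choices of ours.

```python
def subsampled_newton_step(w, X, y, lam, hess_sample=256,
                           damping=1e-3, cg_iters=10, seed=0):
    """Exact (full) gradient, subsampled Hessian: CG solves
    (H_{S_k} + damping*I) p = -grad F(w), where H_{S_k} is the Hessian of
    the objective restricted to the sample S_k."""
    rng = np.random.default_rng(seed)
    n = len(y)
    S = rng.integers(0, n, size=min(hess_sample, n))    # Hessian sample S_k
    Xs, ys = X[S], y[S]
    g = logistic_gradient(w, X, y, lam)                 # full gradient, O(nd)
    p = np.zeros_like(w)
    r, d = -g.copy(), -g.copy()
    rs = np.dot(r, r)
    for _ in range(cg_iters):
        Ad = hessian_vector_product(w, d, Xs, ys, lam) + damping * d
        a = rs / np.dot(d, Ad)
        p += a * d
        r -= a * Ad
        rs_new = np.dot(r, r)
        if rs_new < 1e-12:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return w + p
```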