Is the mathematical foundation of deep neural networks too difficult for you?

Deep Feedforward Network

We start from statistics: we assume there is an underlying function f, and the data samples are given by ⟨Xi, f(Xi)⟩, where Xi is typically a high-dimensional vector and f(Xi) takes values in {0,1} or in the real numbers. Our goal is to find a function f∗ that best describes the given data (without overfitting), so that it can make accurate predictions on new inputs.

Deep learning, generally speaking, is a branch of parametric statistics: there is a family of functions f(X;θ), where X is the input data and θ is the parameter (typically a collection of high-dimensional matrices). The goal is to find optimal parameters θ∗ so that f(X;θ∗) best describes the given data.

In a feedforward neural network, the function f is given by a composition of d functions:

f(X;θ) = f(d)( f(d−1)( ⋯ f(1)(X) ⋯ ) ).

Since the layers of most neural networks are vector-valued (high-dimensional), the network can also be represented by the following structural diagram:

Here the f(i)j are the components of the vector-valued function f(i), i.e. the components of the i-th layer of the network, and each of them is a function of the previous layer f(i−1). In the structure diagram above, the number of components of each layer function f(i) is called the width of layer i; the widths of different layers may differ. The number of layers d is called the depth of the network. Importantly, the d-th layer is different from the previous ones: it is the output layer. In the structure diagram above, the width of the output layer is 1, i.e. f = f(d) is scalar-valued.

Statisticians like linear functions best, but if we required every f(i) in the network to be linear, the overall composition f could only be linear, and it would be completely unable to fit high-dimensional complex data. Therefore, we usually use nonlinear functions as activation functions.
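
To make the composition concrete, here is a minimal NumPy sketch (an illustration, not code from the lecture) of a depth-d fully connected network with a ReLU nonlinearity; the widths and weights are arbitrary placeholders:

```python
import numpy as np

def relu(z):
    # ReLU activation, applied coordinate-wise
    return np.maximum(0.0, z)

def feedforward(x, weights, biases):
    """Compute f(x) = f^(d)(... f^(1)(x) ...) for a fully connected network.

    weights[i], biases[i] parameterize layer i+1; the last layer is the
    (linear) output layer.
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)                  # hidden layer: affine map + nonlinearity
    return weights[-1] @ h + biases[-1]      # output layer: linear, scalar if width 1

# Example: depth d = 3, widths 4 -> 5 -> 3 -> 1
rng = np.random.default_rng(0)
widths = [4, 5, 3, 1]
weights = [rng.normal(size=(m, n)) for n, m in zip(widths[:-1], widths[1:])]
biases = [rng.normal(size=m) for m in widths[1:]]
print(feedforward(rng.normal(size=4), weights, biases))
```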

The most commonly used activation functions are inspired by a model from neuroscience: each cell receives many signals, but the neuron decides whether or not to fire based on an aggregate of its inputs. Since the aggregated input to layer i can be written as the linear (affine) function W(i)x + b(i), for some nonlinear function g the layer function can be defined as:

f(i)(x) = g⊗( W(i)x + b(i) ),

where g⊗ denotes applying the nonlinear function g coordinate-wise to the output of the linear function.

Usually we want the function g to be nonlinear, and we also want it to be easy to differentiate. In practice the most common choice is the ReLU (rectified linear unit) g(z) = max(0, z). Other common activation functions g include the logistic function g(z) = 1/(1 + e−z) and the hyperbolic tangent g(z) = tanh(z) = (ez − e−z)/(ez + e−z).

The advantage of these two activation functions over ReLU is that they are both bounded functions.
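
As a quick side-by-side (a minimal sketch, independent of any framework), the three activation functions mentioned above can be written and compared as follows:

```python
import numpy as np

def relu(z):
    # Unbounded, piecewise linear, cheap to differentiate (derivative is 0 or 1)
    return np.maximum(0.0, z)

def logistic(z):
    # Bounded in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Bounded in (-1, 1); note tanh(z) = 2*logistic(2z) - 1
    return np.tanh(z)

z = np.linspace(-3, 3, 7)
print(relu(z), logistic(z), tanh(z), sep="\n")
```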

As mentioned before, the final output layer is different from the previous layers. First, it is usually a scalar value, and second, it usually has some statistical interpretation:

Its output can usually be regarded as the parameter of a classical statistical model, where the output h = f(d−1)(x) of layer d−1 forms the input of the output-layer activation. The output-layer activation can simply be linear:

f(d) = wTh + b.

This linear output can be interpreted as the conditional mean of a Gaussian distribution. Alternatively, one can use σ(wTh + b), where σ denotes the sigmoid function, that is

σ(z) = 1/(1 + e−z).

The sigmoid output treats y as a Bernoulli trial with P(y) ∝ exp(yz), where z = wTh + b. The more general softmax function is given by:

softmax(z)i = exp(zi) / Σj exp(zj).

Now the components of z correspond to the possible output values, and softmax(z)i represents the probability of output value i. For example, if we feed an image into a neural network, the outputs (softmax(z)1, softmax(z)2, softmax(z)3) can be interpreted as the probabilities of different categories (such as cat, dog, wolf).
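
A small, numerically stable version of this softmax output (the max-subtraction is a standard implementation trick, not something from the lecture) might look like:

```python
import numpy as np

def softmax(z):
    # Subtracting the max does not change the result but avoids overflow in exp
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])       # scores for (cat, dog, wolf)
p = softmax(z)
print(p, p.sum())                   # probabilities summing to 1
```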

Convolutional Networks

A convolutional network is a neural network in which the linear operators have a special geometric structure: they act as local convolution operators on the hidden layers. For example, suppose the k-th layer of the network is arranged as an m×m matrix:

( h(k)i,j ), 1 ≤ i, j ≤ m.

We define the function of the (k+1)-th layer by convolving the previous layer with a 2×2 filter and then applying the nonlinear function g:

h(k+1)i,j = g( a(k) h(k)i,j + b(k) h(k)i,j+1 + c(k) h(k)i+1,j + d(k) h(k)i+1,j+1 ).

The parameters a(k), b(k), c(k) and d(k) depend only on the filter at each level, not on the specific indices i, j. Although this constraint is not necessary in the general definition, it is reasonable for applications such as machine vision. Besides this parameter sharing, the definition of h also makes this type of network naturally sparse: each unit of layer k+1 depends only on a small neighbourhood of layer k.
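
Here is a small NumPy sketch of this 2×2 shared-parameter convolution; the filter values a, b, c, d are arbitrary placeholders:

```python
import numpy as np

def conv_layer(h, a, b, c, d, g=lambda z: np.maximum(0.0, z)):
    """Apply a 2x2 convolution with shared scalars a, b, c, d, then g.

    h is the m x m activation matrix of layer k; the output is (m-1) x (m-1).
    """
    m = h.shape[0]
    out = (a * h[:m-1, :m-1] + b * h[:m-1, 1:]
           + c * h[1:, :m-1] + d * h[1:, 1:])
    return g(out)

h_k = np.random.default_rng(1).normal(size=(5, 5))
h_k1 = conv_layer(h_k, a=0.5, b=-0.3, c=0.2, d=0.1)
print(h_k1.shape)   # (4, 4): every entry reuses the same four parameters
```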

Another common component of convolutional neural networks is the pooling operation. After performing the convolution and applying g to obtain the indexed layer values, we can replace each value with the mean or the maximum of the values in a small surrounding window, that is, set, for example:

h(k+1)i,j ← max{ h(k+1)i′,j′ : (i′, j′) in a small window around (i, j) }.

This technique also serves as a dimensionality-reduction operation (for example, pooling over non-overlapping windows makes the output smaller than the input).
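
A matching sketch of max pooling over non-overlapping 2×2 windows (illustrative only, and assuming even side lengths):

```python
import numpy as np

def max_pool_2x2(h):
    """Max over non-overlapping 2x2 windows; halves each spatial dimension."""
    m, n = h.shape
    assert m % 2 == 0 and n % 2 == 0, "sketch assumes even dimensions"
    return h.reshape(m // 2, 2, n // 2, 2).max(axis=(1, 3))

h = np.arange(16.0).reshape(4, 4)
print(max_pool_2x2(h))   # 2x2 output; each entry is the max of a 2x2 block
```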

Models and Optimization

Next, we need to understand how to obtain the parameters of the neural network, that is, which θ we should choose and how to evaluate it. For this, we usually use probabilistic modelling: the parameters θ of the neural network determine a conditional probability distribution Pθ(y|x), and we seek the θ that maximizes this likelihood. This is equivalent to minimizing the function:

J(θ) = −E(x,y) log Pθ(y|x).

Here the expectation is taken over the data distribution (in practice, it is replaced by an average over the training set). For example, if we model y as a Gaussian with mean f(x;θ) and identity covariance matrix, then minimizing J amounts to minimizing the mean squared error:

J(θ) = ½ E(x,y) ‖y − f(x;θ)‖² + const.
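
As a sketch of what this objective looks like in code, under the Gaussian model above and with a simple linear model standing in for the network f(x;θ):

```python
import numpy as np

def mse_loss(f, theta, X, Y):
    """Empirical version of J(theta) = 1/2 E ||y - f(x; theta)||^2."""
    preds = np.array([f(x, theta) for x in X])
    return 0.5 * np.mean(np.sum((Y - preds) ** 2, axis=-1))

# Toy example: a linear model f(x; theta) = theta @ x stands in for a network
f = lambda x, theta: theta @ x
theta = np.ones((1, 3))
X = np.random.default_rng(2).normal(size=(10, 3))
Y = X @ np.array([[1.0], [2.0], [-1.0]])
print(mse_loss(f, theta, X, Y))
```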

So now how do we optimize the loss function J to achieve the best performance? First of all, we need to know that there are four main difficulties in optimization:

  • The data and feature dimensionality are very high
  • The dataset is very large
  • The loss function J is non-convex
  • There are very many parameters (risk of overfitting)

Facing these challenges, the natural approach is gradient descent. For deep neural networks, the efficient way to compute the gradient is back-propagation, which applies the chain rule of differentiation layer by layer (a form of dynamic programming) to compute the partial derivatives and propagate the error backwards, updating the weights.
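
The sketch below shows one gradient-descent step for a network with a single ReLU hidden layer and squared loss, with the backward pass written out by hand via the chain rule; all shapes and the learning rate are illustrative assumptions:

```python
import numpy as np

def sgd_step(x, y, W1, b1, w2, b2, lr=0.1):
    """One gradient-descent step for a 1-hidden-layer ReLU network, squared loss."""
    # Forward pass
    z1 = W1 @ x + b1
    h = np.maximum(0.0, z1)          # ReLU
    y_hat = w2 @ h + b2              # scalar output
    loss = 0.5 * (y_hat - y) ** 2

    # Backward pass (chain rule, layer by layer)
    d_yhat = y_hat - y               # dL/dy_hat
    d_w2 = d_yhat * h                # dL/dw2
    d_b2 = d_yhat
    d_h = d_yhat * w2                # dL/dh
    d_z1 = d_h * (z1 > 0)            # ReLU derivative is 0 or 1
    d_W1 = np.outer(d_z1, x)
    d_b1 = d_z1

    # Gradient-descent update
    return (W1 - lr * d_W1, b1 - lr * d_b1,
            w2 - lr * d_w2, b2 - lr * d_b2, loss)

rng = np.random.default_rng(3)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
w2, b2 = rng.normal(size=4), 0.0
for _ in range(5):
    W1, b1, w2, b2, loss = sgd_step(rng.normal(size=3), 1.0, W1, b1, w2, b2)
```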

Another very important technique is regularization. Regularization addresses overfitting: we add a penalty term to the objective to discourage overly complex models. Convolutional networks address overfitting through parameter sharing; regularization provides another solution. Instead of optimizing J(θ), we optimize

J̃(θ) = J(θ) + Ω(θ).

Here Ω is a "complexity measure"; in essence, Ω penalizes "complex features" or "huge parameters". Common choices for Ω are the L2 or L1 norms of the parameters (which are convex), or the L0 penalty (which is not convex). In deep learning there are other ways to address overfitting. One is data augmentation, i.e. using existing data to generate more data: given a photo, we can crop, deform, or rotate it to produce additional training examples. Another is noise injection, i.e. adding some noise to the data or the parameters.
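
For example, with an L2 penalty the regularized objective J̃(θ) = J(θ) + Ω(θ) can be sketched as follows (λ is a hypothetical regularization strength):

```python
import numpy as np

def l2_penalty(params, lam=1e-3):
    """Omega(theta) = lam * sum of squared parameters (L2 regularization)."""
    return lam * sum(np.sum(p ** 2) for p in params)

def regularized_loss(data_loss, params, lam=1e-3):
    # J_tilde(theta) = J(theta) + Omega(theta)
    return data_loss + l2_penalty(params, lam)

params = [np.ones((4, 3)), np.ones(4)]
print(regularized_loss(0.25, params))   # 0.25 + 1e-3 * 16
```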

Generative Models: Deep Boltzmann Machines

Deep learning uses many probabilistic models. The first one we describe is a graphical model. A graphical model represents a probability distribution using a graph, where each edge encodes the dependence (correlation or causal influence) between the nodes it connects. Since a deep network is itself a layered graph with a weight on each edge, it is natural to express it as a graphical model. The deep Boltzmann machine is a graphical model in which the joint distribution is given in exponential (Boltzmann) form:

P(v, h(1), …, h(d)) ∝ exp( −E(v, h(1), …, h(d)) ),

where the energy E of a configuration is given by the following expression:

E(v, h(1), …, h(d)) = −vT W(1) h(1) − h(1)T W(2) h(2) − ⋯ − h(d−1)T W(d) h(d).

In general, the intermediate layers are real-valued vectors, while the top and bottom layers are discrete or real-valued.

The graph of the deep Boltzmann machine is layered: the vertices of each layer are connected only to the layers directly above and below it, so the graph between any two consecutive layers is bipartite.

This Markov property means that, conditioned on h(1), the components of v are independent of h(2), …, h(d) and of the other components of v. If v is discrete (binary), then

P(vi = 1 | h(1)) = σ( (W(1) h(1))i ),

where σ is the sigmoid function.

The same goes for other conditional probabilities.
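
Assuming binary units and the pairwise energy written above, the conditional distribution of one hidden layer given the layers directly below and above it can be sketched as the standard block-Gibbs update (the weight matrices here are placeholders):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def sample_layer(h_below, W_below, h_above, W_above, rng):
    """Sample a hidden layer of a DBM given the layers directly below and above.

    With energy -h_below^T W_below h - h^T W_above h_above, the units of h are
    conditionally independent with P(h_i = 1 | neighbours) = sigmoid(...).
    """
    activation = W_below.T @ h_below + W_above @ h_above
    p = sigmoid(activation)
    return (rng.random(p.shape) < p).astype(float)

rng = np.random.default_rng(4)
v = rng.integers(0, 2, size=5).astype(float)       # visible layer
h2 = rng.integers(0, 2, size=3).astype(float)      # layer above
W1 = rng.normal(size=(5, 4))                       # couples v and h1
W2 = rng.normal(size=(4, 3))                       # couples h1 and h2
h1 = sample_layer(v, W1, h2, W2, rng)
print(h1)
```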

Unfortunately, we do not know how to sample from or optimize such graphical models efficiently, which greatly limits the applicability of deep Boltzmann machines in deep learning.

Deep Belief Networks

Deep belief networks are computationally simpler, although their definition is a bit more involved. These "hybrid" networks are essentially directed graphical models with d layers, except that the top two layers are undirected: P(h(d−1), h(d)) is defined as

P(h(d−1), h(d)) ∝ exp( h(d−1)T W(d) h(d) ).   (1)

For the other layers,

P(h(k)i = 1 | h(k+1)) = σ( (W(k+1) h(k+1))i ).   (2)

Note that this is the opposite direction from before. However, the latent variables satisfy the following consistency property: if they are defined by formula (1), they also satisfy formula (2).

Using the formulas above, we know how to sample the lower layers conditioned on the layers above them; but to perform inference, we also need the conditional distribution of the output given the input.

Finally, we emphasize that while the k-th layer of a deep Boltzmann machine depends on both the (k+1)-th and (k−1)-th layers, in a deep belief network we can generate the k-th layer exactly by conditioning only on the (k+1)-th layer (without conditioning on the other layers).
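
Putting formulas (1) and (2) together, top-down sampling from such a network can be sketched as follows, assuming binary units, a short Gibbs chain to approximate a sample from the undirected top pair, and placeholder weights:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def sample_bernoulli(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def sample_dbn(weights, rng, gibbs_steps=50):
    """weights[k] couples layer k and layer k+1; layer 0 is the visible layer v."""
    W_top = weights[-1]
    # (1): approximate a sample from the undirected top pair by Gibbs sampling
    h_top = rng.integers(0, 2, size=W_top.shape[1]).astype(float)
    for _ in range(gibbs_steps):
        h_below = sample_bernoulli(sigmoid(W_top @ h_top), rng)
        h_top = sample_bernoulli(sigmoid(W_top.T @ h_below), rng)
    # (2): propagate downwards through the directed layers
    h = h_below
    for W in reversed(weights[:-1]):
        h = sample_bernoulli(sigmoid(W @ h), rng)
    return h   # a sample of the visible layer

rng = np.random.default_rng(5)
weights = [rng.normal(size=(6, 4)), rng.normal(size=(4, 3))]  # v-h1, h1-h2
print(sample_dbn(weights, rng))
```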

Lesson Plan

In this course, our main topics of discussion are:

  • Expressive power
  • Computational issues
  • Simple, analyzable generative models

The first topic emphasizes the expressiveness of neural networks: what types of functions can be approximated by the network? The papers we plan to discuss are:

  • Cybenko, “Approximation by Superpositions of a Sigmoidal Function” (89).
  • Hornik, “Approximation Capabilities of Multilayer Feedforward Networks” (91).
  • Telgarsky, “Representation Benefits of Deep Feedforward Networks” (15).
  • Safran and Shamir, “Depth Separation in ReLU Networks” (16).
  • Cohen, Sharir, and Shashua, “On the Expressive Power of Deep Learning: A Tensor Analysis” (15).

The first two papers (which we will discuss in detail, starting today) establish the idea that a single hidden layer can express essentially anything. However, the next few papers show that such a single layer must be very wide, as we will see later in the course.

Regarding the second topic, the complexity results we may discuss in this course include:

  • Livni, Shalev-Shwartz, and Shamir, “On the Computational Efficiency of Training Neural Networks” (14).
  • Daniely and Shalev-Shwartz, “Complexity Theoretic Limitations on Learning DNF's” (16).
  • Shamir, “Distribution-Specific Hardness of Learning Neural Networks” (16).

In terms of algorithms:

  • Janzamin, Sedghi, and Anandkumar, “Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods” (15).
  • Hardt, Recht, and Singer, “Train Faster, Generalize Better: Stability of Stochastic Gradient Descent” (16).

Finally, the papers we will read on generative models will include:

  • Arora, Bhaskara, Ge, and Ma, “Provable Bounds for Learning Some Deep Representations” (14).
  • Mossel, “Deep Learning and Hierarchical Generative Models” (16).

Today we will start looking at the first two papers on the first topic: those by Cybenko and Hornik.

Cybenko and Hornik's theorems

In his 1989 paper, Cybenko proved the following:

[Cybenko (89)] Let σ be a continuous function with limits limt→−∞ σ(t) = 0 and limt→+∞ σ(t) = 1. (For example, σ can be the sigmoid activation σ(t) = 1/(1 + e−t).) Then the family of functions of the form f(x) = Σj αj σ(wjT x + bj) is dense in Cn([0,1]).

Here Cn([0,1]) = C([0,1]n) denotes the space of continuous functions on [0,1]n, equipped with the metric d(f,g) = supx |f(x) − g(x)|.

Hornik proved the following extensions of Cybenko's result:

[Hornik (91)] Consider the same family of functions as in the theorem above, without imposing conditions on σ a priori. Then:

If σ is bounded and non-constant, then the family of functions is dense in the space Lp(μ), where μ is any finite measure on Rk.

If σ is additionally continuous, then the family of functions is dense in the space C(X), where C(X) is the space of all continuous functions on X and X ⊂ Rk is compact.

If in addition σ ∈ Cm(Rk), then the family of functions is dense in Cm(Rk) and also in Cm,p(μ) for every finite measure μ with compact support.

If additionally the derivatives of σ up to order m are bounded, then for any finite measure μ on Rk, the family of functions is dense in Cm,p(μ).

In the theorems above, Lp(μ) is the space of functions f satisfying ∫|f|p dμ < ∞, with metric d(f,g) = (∫|f − g|p dμ)1/p. Before starting the proof, we need a quick review of some functional analysis.
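
As an informal numerical illustration of the one-dimensional case (not part of the proof), one can fix random wj, bj, fit the coefficients αj of Σj αj σ(wj x + bj) to a target continuous function by least squares, and watch the sup-norm error on a grid shrink as the number of terms grows:

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def fit_sigmoid_sum(target, n_terms, n_grid=200, seed=0):
    """Least-squares fit of sum_j alpha_j * sigmoid(w_j x + b_j) on [0, 1]."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_grid)
    w = rng.normal(scale=30.0, size=n_terms)     # random slopes
    b = rng.uniform(-30.0, 30.0, size=n_terms)   # random offsets
    Phi = sigmoid(np.outer(x, w) + b)            # n_grid x n_terms design matrix
    alpha, *_ = np.linalg.lstsq(Phi, target(x), rcond=None)
    return np.max(np.abs(Phi @ alpha - target(x)))  # sup-norm error on the grid

f = lambda x: np.sin(2 * np.pi * x) + x ** 2
for n in (5, 20, 80):
    print(n, fit_sigmoid_sum(f, n))   # error typically shrinks as n grows
```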

Hahn-Banach extension theorem

If V is a normed vector space with a linear subspace U, and z ∈ V ∖ Ū, then there exists a continuous linear map L : V → K with L(x) = 0 for all x ∈ U, L(z) = 1, and ‖L‖ ≤ 1/d(U, z).

Why is this theorem useful? Cybenko's and Hornik's results are proved by contradiction using the Hahn-Banach extension theorem. We consider the subspace U given by {Σj αj σ(wjT x + bj)} and assume, for contradiction, that Ū is not the entire function space. We conclude that there exists a continuous linear map L on our function space that vanishes on Ū but is not identically zero. In other words, to prove the desired result it suffices to show that any continuous linear map L that vanishes on U must be the zero map.

Now, a classical result in functional analysis states that a continuous linear functional L on Lp(μ) can be expressed as

L(f) = ∫ f g dμ

for some g ∈ Lq(μ), where 1/p + 1/q = 1. A continuous linear functional L on C(X) can be expressed as

L(f) = ∫X f dμ,

where μ is a finite signed measure on X.

Similar representations of continuous linear functionals exist for the other spaces considered in Cybenko's and Hornik's theorems.

Before moving on to the general proof, consider the (easy) case where the function space is Lp(μ) and σ(x) = 1(x ≥ 0) is the step function. How can we show that if every f in the family defined by the theorem satisfies L(f) = 0, then the function g ∈ Lq(μ) associated with L must be zero? By translating and scaling σ we obtain the indicator of any interval, i.e. we can show that ∫ab g dμ = 0 for every a < b, and since μ is finite this forces g to be zero. With this example in mind, we now consider the general case of Cybenko's theorem. We want to show that

∫ σ(wT x + b) dμ(x) = 0 for all w and b

implies that μ = 0. First, we use the following Fourier-analytic trick to reduce the dimension to 1: for each a ∈ Rn, define the measure μa on R as the projection of μ along a, that is

μa(B) = μ({x : aT x ∈ B}) for Borel sets B ⊆ R.

We observe that

∫R σ(wt + b) dμa(t) = ∫ σ(w aT x + b) dμ(x) = 0.

Furthermore, if we can show that μa ≡ 0 for every a, then μ ≡ 0 ("a measure is determined by all of its projections"); indeed, in that case the Fourier transform of μ satisfies

μ̂(a) = ∫ exp(i aT x) dμ(x) = ∫R exp(it) dμa(t) = 0 for every a,

and hence μ = 0.

(Note that the finiteness of μ is used here.) Having reduced the problem to dimension one, we use another very useful trick (which again relies on the finiteness of the measure): the convolution trick. By convolving the one-dimensional measure with a small Gaussian kernel, we obtain a measure with a density h with respect to the Lebesgue measure. We now proceed to the rest of the proof. Using the convolution trick, the condition becomes

∫ σ(wt + b) h(t) dt = 0 for all w and b,   (3)

and we wish to prove that the density h = 0. Changing variables (u = wt + b), we can rewrite condition (3) as

∫ σ(u) h((u − b)/w) du = 0 for all w ≠ 0 and all b,

that is, σ is orthogonal to every translate and dilation of h.

To prove that h = 0, we use the following tool from abstract Fourier analysis. Let I be the closure of the linear span of all functions of the form h(wt + b). Since I is translation-invariant, it is invariant under convolution; in the language of abstract Fourier analysis, I is an ideal with respect to convolution. Let Z(I) denote the set of frequencies ω at which the Fourier transforms of all functions in I vanish. Then Z(I) is either R or {0}, because if g(t) belongs to the ideal, then so does g(wt) for every w ≠ 0. If Z(I) = R, then every function in the ideal is identically 0, and we are done. Otherwise Z(I) = {0}, and by Fourier analysis I is the set of all integrable functions f with f̂(0) = 0, i.e. all functions with ∫ f = 0. But the condition above says that σ is orthogonal to every function in I; if σ is orthogonal to all functions with zero integral, then σ must be constant, contradicting our assumptions on σ. We conclude that Z(I) = R, that is, h = 0, and the proof is complete.

Original link: http://elmos.scripts.mit.edu/mathofdeeplearning/2017/03/09/mathematics-of-deep-learning-lecture-1/
