Neural network basics: seven network units, four layer connection methods

In September 2016, Fjodor Van Veen wrote an article titled "The Neural Network Zoo" (see "A comprehensive review of neural network architectures with pictures and text: from basic principles to derivative relationships" for details), which surveyed a large number of neural network architectures and illustrated each with an intuitive schematic diagram. Recently, he published a follow-up titled "The Neural Network Zoo Prequel: Cells and Layers", a prequel to the earlier article that once again gives a graphical introduction, this time to the cells and layers that were mentioned there but not explored in depth.

Cells

The Neural Network Zoo article showed different types of cells and different styles of layer connections, but it did not go into detail on how each cell type works. Many cell types were originally given different colours to make the networks easier to tell apart, but I have since found that these cells all work in more or less the same way, so each one is described below.

The basic neural network cell, the kind found in regular feed-forward architectures, is fairly simple. The cell is connected to other neurons via weights, i.e. it can be connected to all the neurons of the previous layer. Each connection has its own weight, which often starts out as a random number. A weight can be negative, positive, tiny, huge, or zero. The value of each cell it is connected to is multiplied by the respective connection weight, and the resulting values are all added together. On top of this, a bias term is added. The bias term can prevent the cell from getting stuck at a zero output, speed up training, and reduce the number of neurons needed to solve a problem. The bias is also a number, sometimes a constant (often -1 or 1) and sometimes a variable. The sum is then passed through an activation function, and the resulting value becomes the cell's value.
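
As a minimal sketch (in Python with NumPy; the names, values and the tanh activation are chosen for illustration and are not from the original article), a single feed-forward cell is just a weighted sum plus a bias, passed through an activation function:

```python
import numpy as np

def basic_cell(inputs, weights, bias, activation=np.tanh):
    """Multiply each incoming value by its connection weight, add them up,
    add the bias, and pass the sum through the activation function."""
    return activation(np.dot(weights, inputs) + bias)

# A cell connected to three neurons of the previous layer.
rng = np.random.default_rng(0)
inputs = np.array([0.2, -0.5, 0.9])
weights = rng.normal(size=3)   # weights often start out as random numbers
bias = 1.0                     # sometimes a constant, sometimes learned
print(basic_cell(inputs, weights, bias))
```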

Convolutional cells are very similar to feed-forward cells, except that they are typically connected to only a few neurons of the previous layer. They are often used to preserve spatial information, because they are connected not to a few random cells but to all cells within a certain distance. This makes them well suited for data with lots of local information, such as images and audio (though mostly images). Deconvolutional cells are the opposite: they tend to decode spatial information by being locally connected to the next layer. Both kinds of cell often have many separately trained clones, each with its own weights, connected in the same way. These clones can be thought of as separate networks that all have the same structure. Both are essentially the same as regular cells, but are used differently.
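
A rough sketch of a convolutional cell, assuming a one-dimensional previous layer and a width-3 kernel (all names and numbers here are illustrative):

```python
import numpy as np

def conv_cell(prev_layer, kernel, bias, center, activation=np.tanh):
    """A convolutional cell: connected only to the cells of the previous
    layer that lie within a certain distance of `center`."""
    reach = len(kernel) // 2
    patch = prev_layer[center - reach:center + reach + 1]
    return activation(np.dot(kernel, patch) + bias)

prev_layer = np.array([0.1, 0.4, -0.2, 0.8, 0.3, -0.5, 0.7])
kernel = np.array([0.2, 0.5, 0.2])  # one clone's weights; other clones
                                    # (feature maps) would have their own
outputs = [conv_cell(prev_layer, kernel, 0.0, i) for i in range(1, 6)]
print(np.round(outputs, 3))
```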

Pooling and interpolating cells are frequently connected to convolutional cells. These cells are not really cells, but rather primitive operations. Pooling cells take the incoming connections and decide which connection gets passed through. In images, this can be thought of as zooming out on a picture: you can no longer see all the pixels, and the operation has to decide which pixels to keep and which to discard. Interpolating cells do the opposite: they take in some information and map it to more information. The extra information is made up, much like enlarging a low-resolution picture. Interpolating cells are not the exact inverse of pooling cells, but both are relatively common because they are fast and simple to implement. They are connected much like convolutional and deconvolutional cells, respectively.
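
A quick sketch of both operations on a one-dimensional signal, assuming max pooling and nearest-neighbour interpolation as the concrete choices (other variants exist):

```python
import numpy as np

def max_pool(x, size=2):
    """Pooling: keep only the strongest value in each window (zooming out)."""
    return x[:len(x) // size * size].reshape(-1, size).max(axis=1)

def interpolate(x, factor=2):
    """Interpolating: repeat each value, making up the extra information
    (like enlarging a low-resolution picture)."""
    return np.repeat(x, factor)

x = np.array([0.1, 0.9, 0.4, 0.3, 0.8, 0.2])
print(max_pool(x))                # [0.9 0.4 0.8]
print(interpolate(max_pool(x)))   # [0.9 0.9 0.4 0.4 0.8 0.8]
```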

Mean and standard deviation cells (which are almost exclusively found in pairs, as probabilistic cells) are used to represent probability distributions. The mean is the average value, and the standard deviation describes how far from that average samples can vary in either direction. For example, a probabilistic cell for images might contain the information of how much red there is in a particular pixel. Say the mean is 0.5 and the standard deviation is 0.2. To sample from these probabilistic cells, you feed these values into a Gaussian random number generator: values between 0.4 and 0.6 are fairly likely outcomes, while values far away from 0.5 are less likely (but still possible). Mean and standard deviation cells are often fully connected to the previous or next layer, and they have no bias.
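
A tiny sketch of sampling from such a mean/standard-deviation pair, using the example numbers above (NumPy's Gaussian generator stands in for whatever generator a given implementation uses):

```python
import numpy as np

rng = np.random.default_rng(0)
mean, std = 0.5, 0.2                 # the "redness" cell pair from the text

# Sampling: feed the mean and standard deviation into a Gaussian generator.
samples = rng.normal(loc=mean, scale=std, size=10_000)

# Values between 0.4 and 0.6 are fairly likely; values far from 0.5 are not.
print(np.mean((samples > 0.4) & (samples < 0.6)))  # roughly 0.38
print(np.mean(samples > 1.0))                      # roughly 0.006
```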

Recurrent cells have connections not only across layers, but also over time. Each cell internally stores its previous value. They are updated just like basic cells, but with extra weights: weights connected to the cell's own previous value and, most of the time, also to the previous values of all cells in the same layer. These weights between the current value and the stored previous value work much like volatile memory (like RAM), having a certain "state" and vanishing if not "fed". Because the previous value is a value that was passed through the activation function, and with each update this activated value is passed along with the other weights through the activation function again, information is constantly being lost. In fact, the retention rate is so low that after only 4-5 iterations, almost all of the information is lost.
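
A minimal sketch of one recurrent update, with hypothetical weight names and sizes (this is the plain recurrent cell described above, not an LSTM):

```python
import numpy as np

def rnn_step(x, h_prev, W_in, W_rec, bias):
    """Mix the current input with the previous values of the cells in the
    same layer, then pass the sum through the activation function
    (so a little information is lost at every step)."""
    return np.tanh(W_in @ x + W_rec @ h_prev + bias)

rng = np.random.default_rng(1)
n_in, n_hidden = 3, 4
W_in = rng.normal(scale=0.5, size=(n_hidden, n_in))
W_rec = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
bias = np.zeros(n_hidden)

h = np.zeros(n_hidden)        # the stored "previous value"
for t in range(5):            # after 4-5 iterations little of the first
    x = rng.normal(size=n_in) # input's information survives
    h = rnn_step(x, h, W_in, W_rec, bias)
print(np.round(h, 3))
```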

Long Short-Term Memory (LSTM) cells were designed to address the problem of rapid information loss that occurs in recurrent cells. LSTM cells are logic circuits, copying the way memory cells were designed for computers. Whereas an RNN cell stores two states, an LSTM cell stores four: the current and previous values of the output, and the current and previous values of the "memory cell" state. An LSTM cell contains three "gates": an input gate, an output gate, and a forget gate, as well as the regular input. Each of these gates has its own weights, which means that connecting to this type of cell requires setting up four weights (rather than just one). The gates function much like flow gates, not fence gates: they can let everything through, a little bit, nothing, or anything in between. This works by multiplying the incoming information by a value between 0 and 1, which is stored in the gate value. The input gate determines how much of the input is added to the cell value. The output gate determines how much of the output value can be seen by the rest of the network. The forget gate is not connected to the previous value of the output cell, but to the previous memory cell value; it determines how much of the last memory cell state to retain. Because it is not connected to the output, much less information is lost, since no activation function is placed in this loop.
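
A sketch of one LSTM update in the commonly used formulation (sigmoid gates, tanh on the regular input and on the memory state before output); the weight layout and names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM update. W, U and b hold the weights for the input, forget
    and output gates plus the regular input: four sets of weights."""
    z = W @ x + U @ h_prev + b                    # all four pre-activations
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # gate values between 0 and 1
    g = np.tanh(g)                                # the regular input
    c = f * c_prev + i * g   # forget gate keeps part of the old memory state,
                             # input gate adds part of the new input
    h = o * np.tanh(c)       # output gate decides how much the rest of the
                             # network gets to see
    return h, c

rng = np.random.default_rng(2)
n_in, n_hid = 3, 4
W = rng.normal(scale=0.5, size=(4 * n_hid, n_in))
U = rng.normal(scale=0.5, size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)

h = c = np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
print(np.round(h, 3), np.round(c, 3))
```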

Gated recurrent units (GRUs) are a variation of LSTM cells. They also use gates to prevent information loss, but there are only two kinds of gates: an update gate and a reset gate. This makes them slightly less expressive but also slightly faster, because they use fewer connections everywhere. In essence there are two differences between LSTM and GRU cells: GRU cells do not have a hidden cell state protected by an output gate, and they combine the input and forget gates into a single update gate. The idea is that if you want to take in a lot of new information, you can probably forget some old information (and vice versa).
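
And a matching sketch of a GRU update, again in the common formulation with illustrative names:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU update: only an update gate and a reset gate, and no
    separate memory cell protected by an output gate."""
    z = sigmoid(Wz @ x + Uz @ h_prev)        # update gate: new vs. old info
    r = sigmoid(Wr @ x + Ur @ h_prev)        # reset gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h_prev))
    return (1 - z) * h_prev + z * h_cand     # take some new, forget some old

rng = np.random.default_rng(3)
n_in, n_hid = 3, 4
Wz, Wr, Wh = (rng.normal(scale=0.5, size=(n_hid, n_in)) for _ in range(3))
Uz, Ur, Uh = (rng.normal(scale=0.5, size=(n_hid, n_hid)) for _ in range(3))

h = np.zeros(n_hid)
h = gru_step(rng.normal(size=n_in), h, Wz, Uz, Wr, Ur, Wh, Uh)
print(np.round(h, 3))
```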

Layers

The most basic way to connect neurons into a graph is to connect everything to everything, which can be seen in Hopfield networks and Boltzmann machines. Of course, this means the number of connections grows quadratically with the number of neurons, but the expressiveness is uncompromised. This is called complete connectivity.
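
For a sense of scale (a quick illustration, not from the article), a completely connected graph of n neurons has n(n-1)/2 undirected connections:

```python
# Number of undirected connections in a completely connected graph of n neurons.
for n in (10, 100, 1000):
    print(n, "neurons ->", n * (n - 1) // 2, "connections")
# 10 -> 45, 100 -> 4950, 1000 -> 499500
```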

Later, it was found useful to divide the network into layers, where a series or group of neurons within a layer are not connected to each other but are connected to neurons in other groups; an example is the layer structure of the restricted Boltzmann machine. Today, the idea of using layers has been generalised to any number of layers, and it can be seen in almost all architectures. This is (somewhat confusingly) also called fully connected, because completely connected networks are actually quite uncommon.

Convolutional connections are more constrained than fully connected layers: each neuron is connected only to neurons in other groups that are close to it. Images and audio contain a very large amount of information if fed into a network one-to-one (e.g., one neuron per pixel). The idea of convolutional connections comes from the observation that preserving spatial information is important. This turned out to be a good guess and is used in many neural-network-based image and speech applications. But this setup is less expressive than fully connected layers. In fact, it is a form of "importance" filtering, deciding which of these tightly packed bundles of information matter. Convolutional connections are also great for dimensionality reduction. Depending on the implementation, even very distant neurons can be connected, but a range higher than 4 or 5 neurons is rarely used. Note that "space" here usually refers to two-dimensional space, expressed as three-dimensional sheets of connected neurons; the connection range can be applied in all dimensions.
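
One way to picture the difference from full connectivity is as a connectivity mask; the sketch below builds such a mask for a one-dimensional layer with an illustrative connection range of 2:

```python
import numpy as np

def local_mask(n_out, n_in, reach=2):
    """Convolutional connection pattern: output neuron i is connected only
    to input neurons within `reach` positions of i."""
    mask = np.zeros((n_out, n_in), dtype=int)
    for i in range(n_out):
        lo, hi = max(0, i - reach), min(n_in, i + reach + 1)
        mask[i, lo:hi] = 1
    return mask

print(local_mask(6, 6, reach=2))
# In a fully connected layer, every entry of this mask would be 1.
```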

Another option, of course, is randomly connected neurons. This comes in two main variations: allowing some percentage of all possible connections, or connecting some percentage of the neurons between layers. Random connections tend to reduce performance only linearly, and they can be useful in large networks where fully connected layers run into performance problems. In some cases a sparsely connected layer with slightly more neurons performs better, especially when a lot of information needs to be stored but not much needs to be exchanged (a bit similar to the effectiveness of convolutionally connected layers, but then random). Very sparsely connected systems (1% or 2%) are also used, as seen in ELMs, ESNs, and LSMs. This makes particular sense in spiking networks, because the more connections a neuron has, the less energy each weight carries, which means less propagation and less pattern repetition.
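
A sparse random connection pattern can be sketched the same way, here keeping roughly 2% of all possible connections between two layers (the percentage is the one mentioned above; everything else is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

def random_mask(n_out, n_in, keep=0.02):
    """Sparse random connectivity: keep only a small fraction of all
    possible connections between two layers."""
    return (rng.random((n_out, n_in)) < keep).astype(int)

mask = random_mask(50, 50, keep=0.02)
print(mask.sum(), "of", mask.size, "possible connections kept")
```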

Delayed connections are connections between neurons that do not get information from the previous layer, but from the past (mostly previous iterations). This allows temporal information (time, timing) to be stored. These connections are sometimes reset manually to clear the "state" of the network. The main difference from regular connections is that these connections keep changing, even when the network is not being trained.

The following diagram shows a small sample of the networks and their connections described above. I use this when I don’t know what is connected to what (especially when doing LSTM or GRU cells):
