A recurrent neural network (RNN) is a type of artificial neural network in which additional weights create cycles in the network graph, allowing the network to maintain an internal state. The benefit of adding state is that the network can explicitly learn and exploit context in sequence prediction problems, i.e. problems with a sequential or temporal component. In this post, you will take a tour of recurrent neural networks for deep learning. After reading this article, you will know:

- How the top recurrent neural networks used for deep learning work, such as LSTMs, GRUs, and NTMs.
- How the top RNNs relate to the broader study of recurrence in artificial neural networks.
- How research with RNNs has led to state-of-the-art performance on a range of challenging problems.
Note that we will not cover every possible type of RNN, but rather focus on a few recurrent neural networks used in deep learning (LSTMs, GRUs, and NTMs) and the context needed to understand them. So let's get started!

Overview
We will first set the scene in the field of recurrent neural networks; then we will take a closer look at LSTMs, GRUs, and NTMs for deep learning; after that, we will cover some advanced topics related to RNNs for deep learning.
Recurrent Neural Networks
First, let's set the scene. It is popularly believed that recurrence imparts a memory to the network topology. A better way to think about it is that the training set contains examples whose inputs are supplemented with values from previous examples. The "conventional" setting, for example a traditional multilayer perceptron, maps the current input to the current output:

X(i) -> y(i)

In the "unconventional" setting, such as a recurrent neural network, each training example is supplemented with a set of inputs from the previous example (a concrete sketch of building such training pairs appears at the end of this section):

[X(i-1), X(i)] -> y(i)

As with all feedforward network paradigms, the key issues are how to connect the input layer to the output layer (including feedback activations) and how to train the structure to converge.

Now, let's look at a few different types of recurrent neural networks, starting with a very simple conception.

Fully Recurrent Networks
The layered topology of the multilayer perceptron is preserved, but every element has a weighted connection to every other element in the architecture and a single feedback connection to itself. Not all of these connections are trained, and the extreme nonlinearity of the error derivatives means that conventional backpropagation will not work, so Backpropagation Through Time or stochastic gradient descent (SGD) is used instead.
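To make the two conventions concrete, here is a minimal sketch in plain NumPy (using a made-up univariate series) that builds both kinds of training pairs; the two-column version supplements each example with the previous input, as in [X(i-1), X(i)] -> y(i):

```python
import numpy as np

# A hypothetical univariate series: each value depends on its recent past.
series = np.sin(np.linspace(0, 10, 100))

# Conventional convention: X(i) -> y(i)
X_conventional = series[:-1].reshape(-1, 1)   # one input per example
y_conventional = series[1:]                   # next value as the target

# Unconventional convention: [X(i-1), X(i)] -> y(i)
# Each example is supplemented with the input from the previous example.
X_unconventional = np.column_stack([series[:-2], series[1:-1]])
y_unconventional = series[2:]

print(X_conventional.shape)    # (99, 1)
print(X_unconventional.shape)  # (98, 2)
```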
Recursive Neural Networks
Recursive neural networks are a linear architectural variant of recurrent networks. Recursion promotes branching in hierarchical feature spaces, and the resulting network architecture mimics this as training proceeds. Training is achieved with sub-gradient methods using stochastic gradient descent. This is described in detail in R. Socher et al.'s 2011 paper "Parsing Natural Scenes and Natural Language with Recursive Neural Networks". See: http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Socher_125.pdf

Neural History Compressor
In 1991, Schmidhuber first reported a very deep learner that could perform credit assignment across hundreds of neural layers via unsupervised pre-training of a hierarchy of RNNs. Each RNN is trained unsupervised to predict its next input. Only inputs that produce prediction errors are then fed forward, conveying the new information to the next RNN in the hierarchy, which therefore operates at a slower, self-organizing timescale. No information is lost, it is only compressed: the RNN stack is a "deep generative model" of the data, and the data can be reconstructed from its compressed form. A toy sketch of this idea follows below.
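As a toy illustration of the compression idea (not Schmidhuber's actual architecture), the sketch below replaces each trained RNN with a trivial "expect the previous symbol again" predictor and passes only the prediction errors up to the next level:

```python
# A toy stand-in for one level of the history compressor: a trivial
# predictor ("expect the previous symbol again") takes the place of a
# trained RNN. Only symbols the predictor gets wrong (prediction errors)
# are passed up to the next level of the hierarchy.
def compress(sequence):
    passed_up, prev = [], None
    for t, sym in enumerate(sequence):
        if sym != prev:          # prediction error: genuinely new information
            passed_up.append((t, sym))
        prev = sym
    return passed_up

seq = "aaaabbbaaacc"
level2 = compress(seq)
print(level2)   # [(0, 'a'), (4, 'b'), (7, 'a'), (10, 'c')]
# The next level sees a much shorter sequence, yet the original can be
# reconstructed from these (position, symbol) pairs plus the total length:
# nothing is lost, the history is only compressed.
```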
When errors are backpropagated through very large structures, the extreme nonlinearity of the derivatives makes credit assignment difficult or even impossible, and backpropagation can fail.

Long Short-Term Memory Networks
With conventional Backpropagation Through Time (BPTT) or Real-Time Recurrent Learning (RTRL), error signals flowing backwards in time tend to either explode or vanish: the temporal evolution of the backpropagated error depends exponentially on the size of the weights. Exploding errors may lead to oscillating weights, while vanishing errors may make learning to bridge long time lags take a prohibitive amount of time, or fail entirely. A toy calculation of this exponential dependence follows below.
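The exponential dependence can be seen in a small NumPy experiment (with made-up weight scales): the error backpropagated through a linear recurrent layer is repeatedly multiplied by the recurrent weight matrix, so its norm shrinks or grows geometrically with the number of time steps:

```python
import numpy as np

rng = np.random.default_rng(0)

def backprop_norms(scale, steps=50, size=8):
    """Propagate an error vector backwards through `steps` time steps,
    multiplying by the transpose of a fixed recurrent weight matrix
    (the Jacobian of a linear RNN step; nonlinearities are ignored)."""
    W = scale * rng.standard_normal((size, size)) / np.sqrt(size)
    err = np.ones(size)
    norms = []
    for _ in range(steps):
        err = W.T @ err
        norms.append(np.linalg.norm(err))
    return norms

print(backprop_norms(0.5)[-1])   # near zero: the error vanishes
print(backprop_norms(2.0)[-1])   # enormous: the error explodes
```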
The error backflow problem can be overcome with an efficient, gradient-based algorithm whose architecture enforces constant (thus neither exploding nor vanishing) error flow through the internal states of special units. These units reduce the effects of "input weight conflict" and "output weight conflict".

Input weight conflict: If the input is non-zero, the same incoming weight must be used both for storing certain inputs and for ignoring others, so it often receives conflicting weight-update signals. These signals try to make the weight participate in storing the input while simultaneously protecting the stored contents. This conflict makes learning difficult and calls for a more context-sensitive mechanism for controlling "write operations" through input weights.

Output weight conflict: Whenever the output of a unit is non-zero, the weights on its outgoing connections attract conflicting weight-update signals generated during sequence processing. These signals try to make the outgoing weight participate in accessing the information stored in the unit while, at other times, protecting subsequent units from being disturbed by the unit's output.

These conflicts are not specific to long time lags and can affect short lags as well. Notably, as the lag increases, stored information must be protected from interference, especially during advanced stages of learning.

Network architecture: Different types of units convey useful information about the current state of the network. For example, an input gate (or output gate) may use input from other memory cells to decide whether to store (or access) specific information in its memory cell. Memory cells contain gates, and gates are specific to the connections they mediate: input gates remedy input weight conflicts, while output gates eliminate output weight conflicts.

Gates: Specifically, to mitigate input and output weight conflicts and interference, a multiplicative input gate unit protects the stored memory contents from perturbation by irrelevant inputs, and a multiplicative output gate unit protects other units from perturbation by currently irrelevant memory contents. A minimal sketch of this gated cell update appears at the end of this section.

[Figure] Example of an LSTM architecture with 8 input units, 4 output units, and 2 memory cell blocks of size 2. in1 marks the input gate, out1 marks the output gate, and cell1 = block1 denotes the first memory cell of block 1. From "Long Short-Term Memory", 1997.

Because of the diversity of processing elements and the feedback connections, the connectivity in an LSTM is more complex than in a multilayer perceptron.

Memory cell blocks: Memory cells sharing the same input gate and the same output gate form a structure called a memory cell block. Blocks facilitate information storage; as in traditional neural networks, encoding a distributed input within a single unit is not easy. A memory cell block of size 1 is just a simple memory cell.

Learning: A variant of Real-Time Recurrent Learning (RTRL) that accounts for the altered, multiplicative dynamics caused by the input and output gates is used, ensuring that non-decaying errors backpropagated through a memory cell's internal state to the cell's net inputs are not propagated further back in time.

Guessing: This stochastic approach can outperform many time-lag algorithms. It has been shown that many of the long time lag tasks used in previous work can be solved more quickly by simple random weight guessing than by the algorithms proposed for them.
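To make the gate mechanics concrete, here is a minimal, single-time-step sketch of the 1997-style memory cell in NumPy (no forget gate, hypothetical parameter shapes): the input gate scales what is written into the cell, and the output gate scales what is exposed to the rest of the network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x, h, c, params):
    """One time step of a simplified LSTM memory cell (no forget gate,
    as in the original 1997 formulation)."""
    Wi, Ui, bi = params["in_gate"]    # input gate parameters
    Wo, Uo, bo = params["out_gate"]   # output gate parameters
    Wc, Uc, bc = params["cell"]       # candidate (cell input) parameters

    i = sigmoid(Wi @ x + Ui @ h + bi)   # input gate: controls the "write"
    o = sigmoid(Wo @ x + Uo @ h + bo)   # output gate: controls the "read"
    g = np.tanh(Wc @ x + Uc @ h + bc)   # candidate cell input

    c_new = c + i * g            # constant error carrousel: additive state
    h_new = o * np.tanh(c_new)   # gated output protects downstream units
    return h_new, c_new

# Hypothetical sizes: 3 inputs, 2 hidden units.
rng = np.random.default_rng(1)
make = lambda *shape: rng.standard_normal(shape) * 0.1
params = {k: (make(2, 3), make(2, 2), np.zeros(2))
          for k in ("in_gate", "out_gate", "cell")}
h, c = np.zeros(2), np.zeros(2)
h, c = lstm_cell_step(rng.standard_normal(3), h, c, params)
```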
The most interesting applications of LSTM recurrent neural networks have been in language processing. For a more comprehensive description, see Gers' Ph.D. thesis, "Long Short-Term Memory in Recurrent Neural Networks" (2001).
Limitations of LSTMs
- Problems such as "strongly delayed XOR" cannot easily be solved by the efficient, truncated version of LSTM.
- Each memory cell block needs an input gate and an output gate, which are not necessary in other recurrent approaches.
- Constant error flow through the "Constant Error Carrousels" inside memory cells produces the same effect as a traditional feed-forward architecture that receives the entire input string at once.
- Like other feed-forward approaches, LSTM is affected by the concept of "regency". If precise counting of time steps is required, an additional counting mechanism may be needed.

Advantages of LSTMs
- The algorithm's ability to bridge long time lags comes from the constant error backpropagation in the memory cells of its architecture.
- LSTM can handle noisy problem domains, distributed representations, and continuous values.
- LSTM generalizes well over the problem domains it considers. This matters because some tasks cannot be solved by pre-existing recurrent networks.
- Fine-tuning of network parameters for the problem domain appears to be unnecessary.
- In terms of update complexity per weight and time step, LSTM is essentially equivalent to BPTT.
- LSTM is powerful and has achieved state-of-the-art results in areas such as machine translation.

Gated Recurrent Unit Neural Networks
Gated recurrent unit neural networks have been successfully applied to sequential and temporal data. They are best suited to speech recognition, natural language processing, and machine translation, and, together with LSTMs, they perform well in long-sequence problem domains.

Gating was considered in the LSTM topic above: a gating network generates signals that control how the current input and the previous memory interact to update the current activation, and hence the current network state. The gates are themselves weighted, and their weights are selectively updated throughout the learning phase. Gate networks introduce added computational complexity, and hence additional parameterization and computational cost.

The LSTM RNN architecture uses the computation of a simple RNN as an intermediate candidate for the internal memory cell (state). The Gated Recurrent Unit (GRU) RNN reduces the gating signals of the LSTM RNN model to two: an update gate and a reset gate. The gating mechanism of the GRU (and LSTM) RNN replicates that of the simple RNN in terms of parameterization. The weights corresponding to these gates are likewise updated using BPTT stochastic gradient descent to minimize a cost function.

Each parameter update involves information pertaining to the state of the overall network, which can have adverse effects. The concept of gating can be explored further using three new variants of the gating mechanism: GRU1, in which each gate is computed using only the previous hidden state and the bias; GRU2, in which each gate is computed using only the previous hidden state; and GRU3, in which each gate is computed using only the bias. A significant reduction in parameters is observed, with GRU3 yielding the smallest number.

The three variants and the baseline GRU RNN were benchmarked on the MNIST database of handwritten digits and the IMDB movie review dataset. Two different sequence lengths were generated from the MNIST dataset, and one from the IMDB dataset.
The main driving signal of the gates appears to be the (recurrent) state, since it contains essential information about the other signals. The use of stochastic gradient descent implicitly carries information about the network state, which may explain the relative success of using only the bias in the gate signals: their adaptive updates carry information about the state of the network. Gating variants such as these allow the gating mechanism to be explored through limited topology evaluations; a minimal sketch of the GRU step and these variants follows below. For more information, see R. Dey and F. M. Salem's 2017 paper "Gate-Variants of Gated Recurrent Unit (GRU) Neural Networks".
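Here is a minimal, single-time-step NumPy sketch (hypothetical shapes) of the GRU update, with a switch for the three reduced gate parameterizations described above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h, p, variant="gru"):
    """One GRU time step. `variant` selects how the two gate signals are
    computed: the full GRU, or the reduced GRU1/GRU2/GRU3 forms."""
    def gate(W, U, b):
        if variant == "gru":    # input, previous state, and bias
            return sigmoid(W @ x + U @ h + b)
        if variant == "gru1":   # previous state and bias only
            return sigmoid(U @ h + b)
        if variant == "gru2":   # previous state only
            return sigmoid(U @ h)
        if variant == "gru3":   # bias only
            return sigmoid(b)

    z = gate(*p["update"])                         # update gate
    r = gate(*p["reset"])                          # reset gate
    Wh, Uh, bh = p["candidate"]
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h) + bh)  # candidate state
    return (1 - z) * h + z * h_tilde               # interpolate old/new state

# Hypothetical sizes: 3 inputs, 2 hidden units.
rng = np.random.default_rng(2)
make = lambda *s: rng.standard_normal(s) * 0.1
p = {k: (make(2, 3), make(2, 2), np.zeros(2))
     for k in ("update", "reset", "candidate")}
h = np.zeros(2)
for variant in ("gru", "gru1", "gru2", "gru3"):
    print(variant, gru_step(rng.standard_normal(3), h, p, variant))
```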
Neural Turing Machines
Neural Turing Machines extend the capabilities of neural networks by coupling them to external memory resources, with which they interact through attention processes. (See also the Synced article "Deep Explanation of Neural Turing Machines: From the Basic Concepts of Turing Machines to Differentiable Neural Computers".)

The combined system is analogous to a Turing machine or a von Neumann architecture, but it is differentiable end to end, allowing it to be trained efficiently with gradient descent. Preliminary results show that Neural Turing Machines can infer basic algorithms such as copying, sorting, and associative recall from input and output examples.

What sets RNNs apart from other machine learning methods is their ability to learn and carry out complex transformations of data over extended timescales. Moreover, RNNs are known to be Turing-complete, so they can simulate arbitrary procedures if wired properly. The Neural Turing Machine (NTM) enriches the capabilities of standard RNNs to simplify the solution of algorithmic tasks, chiefly through a large, addressable memory; it is named by analogy with Turing's enrichment of finite-state machines with an infinite memory tape. Unlike a Turing machine, an NTM is a differentiable computer that can be trained by gradient descent, yielding a practical mechanism for learning programs.

[Figure] Neural Turing Machine architecture. During each update cycle, the controller network receives inputs from an external environment and emits outputs in response; it also reads from and writes to a memory matrix via a set of parallel read and write heads. The dashed line indicates the boundary between the NTM circuit and the outside world. From "Neural Turing Machines", 2014.

Crucially, every component of the architecture is differentiable, making it straightforward to train with gradient descent. This is achieved by defining "blurry" read and write operations that interact to a greater or lesser degree with all elements in memory (rather than addressing a single element, as a normal Turing machine or digital computer would). For more information, see A. Graves et al.'s 2014 paper "Neural Turing Machines" (https://arxiv.org/abs/1410.5401).
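The following is a minimal NumPy sketch of a "blurry", content-based read (only one of the NTM's addressing mechanisms; location-based addressing and writes are omitted): the read head touches every memory row at once, weighted by similarity to a key emitted by the controller:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def blurry_read(memory, key, beta):
    """Content-based read: attend to ALL memory rows at once, weighted by
    cosine similarity to `key`, sharpened by the strength `beta`."""
    sim = memory @ key / (np.linalg.norm(memory, axis=1)
                          * np.linalg.norm(key) + 1e-8)
    w = softmax(beta * sim)      # differentiable addressing weights
    return w @ memory, w         # weighted sum over memory rows

# Hypothetical memory: 8 locations, each holding a 4-dimensional vector.
rng = np.random.default_rng(3)
M = rng.standard_normal((8, 4))
key = M[2] + 0.01 * rng.standard_normal(4)  # noisy copy of row 2
r, w = blurry_read(M, key, beta=10.0)
print(w.round(2))   # weight mass should concentrate on location 2
```

Because the weights vary smoothly with the key, gradients flow through the read operation, which is what makes the whole system trainable end to end.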
NTM Experiments
The copy task tests whether an NTM can store and recall a long sequence of arbitrary information. The network is presented with an input sequence of random binary vectors followed by a delimiter flag. The network was trained to copy sequences of 8-bit random vectors, with sequence lengths randomized between 1 and 20. The target sequence is simply a copy of the input sequence (without the delimiter).

The repeat copy task extends the copy task by requiring the network to output the copied sequence a given number of times and then emit an end-of-sequence marker. Its main purpose is to see whether an NTM can learn a simple nested function. The network receives a sequence of random binary vectors of random length, followed by a scalar value indicating the desired number of copies, which appears on a separate input channel.

The associative recall task involves organizing data that arises indirectly, i.e. when one data item refers to another. A list of items is constructed so that querying with one of the items requires the network to return the subsequent item. Each item is a sequence of binary vectors, bounded on the left and right by delimiters. After several items have been propagated to the network, the network is queried with a randomly chosen item and must produce the next one.

The dynamic N-gram task tests whether an NTM can adapt quickly to new predictive distributions by using its memory as a rewritable table that keeps count of transition statistics, thereby emulating a conventional N-gram model. Consider the set of all possible 6-gram distributions over binary sequences. Each 6-gram distribution can be expressed as a table of 32 numbers, specifying the probability that the next bit is 1 given each possible 5-bit binary history. A particular training sequence was generated by drawing 200 consecutive bits using the current lookup table. The network observes the sequence one bit at a time and is asked to predict the next bit.

The priority sort task tests the NTM's sorting ability. The input to the network is a sequence of random binary vectors together with a scalar priority for each vector, drawn uniformly from the range [-1, 1]. The target sequence consists of the binary vectors sorted by priority.

In these experiments, NTMs with feed-forward and with LSTM controllers were compared against a standard LSTM network.

Summary
In this article, you learned about recurrent neural networks for deep learning. Specifically, you learned:

- How the top recurrent neural networks used for deep learning work, such as LSTMs, GRUs, and NTMs.
- How the top RNNs relate to the broader study of recurrence in artificial neural networks.
- How research with RNNs has achieved state-of-the-art performance on a range of challenging problems.
Statement: This article is reproduced from Machine Learning Mastery. The original author is Jason Brownlee; the translator is Panda.