Transfer Learning: How to Learn Deeply When Data Is Insufficient


The most common obstacle to solving a problem with deep learning is the massive amount of data needed to train the model. So much data is needed because the model has a very large number of parameters to learn. For a specific problem in a particular field, it is often impossible to obtain data at the scale required to build the model. However, the relationships a model learns for a certain type of data in one training task can often be reused for different problems in the same field. This is called transfer learning.

I think the difficulty of achieving artificial intelligence is no different from building a rocket. You need a powerful engine and a lot of fuel. If you only have a powerful engine but no fuel, the rocket will definitely not be able to take off. If you only have a weak engine, no matter how much fuel you have, it will not be able to take off. If you want to build a rocket, a powerful engine and a lot of fuel are indispensable. If we use this as an analogy for deep learning, the deep learning engine can be regarded as the rocket engine, and the massive amount of data we provide to the algorithm can be regarded as the fuel. — Andrew Ng

Deep learning has become popular recently and has achieved remarkable results in fields such as language translation, strategy games, and self-driving cars, all of which rely on millions of data points. The most common obstacle to solving a problem with deep learning is the huge amount of data needed to train the model, and so much data is needed because the model has a very large number of parameters to learn.

[Figure: common ranges of parameter counts in deep learning models]

Neural networks (aka deep learning) are layered structures that can be stacked together (like Lego blocks).

Deep learning is essentially a large neural network. We can think of this network as a flow chart, where data enters at one end and inference or knowledge comes out at the other. We can also split the neural network apart and take the output from any point we like. The result may not be meaningful, but it can be done; Google DeepDream, for example, does this.

Size (model) ∝ Size (data) ∝ Complexity (problem)

There is an interesting, approximately linear relationship between the size of the model and the size of the data required. The basic reasoning is that for a particular problem (e.g., a given number of categories), the model must be large enough to capture the relationships in the data (e.g., textures and shapes in images, grammar in text, and phonemes in speech). Early layers in the model capture simple, general relationships between different parts of the input (e.g., edges and patterns), and later layers capture the information needed to make the final decision, usually the information that distinguishes the possible outputs from one another. Therefore, if the complexity of the problem is high (e.g., image classification), the number of parameters and the amount of data required will be very large.

What AlexNet “sees” at each stage

Transfer learning is here!

For a specific problem in a particular field, it is often impossible to obtain data at the scale required to build a model. However, the relationships a model learns for a certain type of data in one training task can often be reused for different problems in the same field. This technique is called transfer learning.

Sinno Jialin Pan, Qiang Yang, “A Survey on Transfer Learning”, IEEE Transactions on Knowledge & Data Engineering, vol. 22, pp. 1345–1359, October 2010, doi:10.1109/TKDE.2009.191

Transfer learning is the best-kept secret that nobody is trying to keep. Everyone in the industry knows about it, but hardly anyone outside does.

Changes in search trends for the three keywords machine learning, deep learning, and transfer learning in Google searches

According to Awesome - Most Cited Deep Learning Papers, a list of the most important papers in the field of deep learning, more than 50% of the papers use some form of transfer learning or pre-training. For people with limited resources (data and computing power), transfer learning is becoming more important by the day, but the concept has not yet received the attention it deserves. The people who need this technology the most don’t even know that it exists.

If deep learning is the holy grail and data is the gatekeeper, then transfer learning is the key to the door.

With transfer learning, we can take a pre-trained model that was trained on a large and easily available dataset (usually for a completely different task, with the same kind of input but a different output). We then find the layers whose outputs can be reused, and use the outputs of those layers as input to train a smaller network with far fewer parameters. This small network only needs to learn the relationships specific to our problem, because the pre-trained model has already learned the patterns in the data. In this way, a model trained to detect cats can be reused to reproduce Van Gogh's paintings.
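To make the idea concrete, here is a minimal sketch in plain Python; it is not the article's code. A fixed random layer stands in for the reusable layers of a pre-trained model (an assumption for illustration only, since real pre-trained weights would come from training on a large dataset), and only a small logistic-regression head is trained on its outputs. The names `FROZEN_W`, `pretrained_features`, `train_head`, and `predict`, as well as the toy data, are hypothetical.

```python
import math
import random

random.seed(0)

# Stand-in for a pre-trained network: a fixed (frozen) layer that maps
# 4 raw inputs to 3 features. These random values are placeholders for
# weights that would normally come from training on a large dataset.
FROZEN_W = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(3)]


def pretrained_features(x):
    """Frozen feature extractor: its weights are never updated."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, labels, epochs=200, lr=0.5):
    """Train only a small logistic-regression head on the frozen features."""
    feats = [pretrained_features(x) for x in samples]
    w = [0.0] * len(feats[0])
    b = 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)
            g = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b


def predict(head, x):
    """Probability of the positive class for one raw input."""
    w, b = head
    f = pretrained_features(x)
    return sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)


# Toy labeled data (made up for this sketch).
samples = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]]
labels = [1, 0, 1, 0]
head = train_head(samples, labels)
print([round(predict(head, x), 2) for x in samples])
```

The head has only four trainable values (three weights and a bias), while the stand-in base is never updated; the small labeled dataset only has to fit the head.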

Another major benefit of transfer learning is that it can improve how well the model generalizes. Large models tend to overfit, that is, they model the training data more than the underlying phenomenon, and so they do not work as well on unseen data as they did in training. Because transfer learning lets the model see different types of data, it can learn better underlying rules.

Overfitting is more like rote learning. — James Faghmous

Transfer learning can reduce the amount of data

Suppose you want to settle the controversy over whether the dress is blue and black or white and gold. First, you need to collect a large number of pictures of dresses verified to be blue and black or white and gold. If you want to build an accurate model yourself and train it from scratch like the models mentioned above (with 140 million parameters!), you need to prepare at least 1.2 million pictures, which is basically impossible. This is where transfer learning comes in.

If transfer learning is used, the number of parameters required for training is calculated as follows:

Number of parameters = [size(input) + 1] * [size(output) + 1] = [2048 + 1] * [1 + 1] = 4098 parameters

The number of parameters required drops from 1.4*10⁸ to about 4*10³, a reduction of more than four orders of magnitude! Fewer than 100 pictures should be enough. What a relief!
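The arithmetic above can be checked with a few lines of Python. This is a sketch with hypothetical function names: `article_head_params` applies the formula exactly as quoted, and `dense_layer_params` gives the conventional count for a single dense layer with a bias (weights plus one bias per output) for comparison. Both are on the order of 10³.

```python
import math


def article_head_params(n_inputs, n_outputs):
    """Parameter count using the formula quoted in the article."""
    return (n_inputs + 1) * (n_outputs + 1)


def dense_layer_params(n_inputs, n_outputs):
    """Conventional count for one dense layer: weights plus one bias per output."""
    return (n_inputs + 1) * n_outputs


full_model = 1.4e8                          # parameter count quoted in the article
head = article_head_params(2048, 1)         # (2048 + 1) * (1 + 1)
conventional = dense_layer_params(2048, 1)  # (2048 + 1) * 1
orders = math.log10(full_model / head)      # size of the reduction in powers of ten

print(head, conventional, orders)
```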

If you don’t have the patience to keep reading and want to know the color of the dress right away, you can jump to the end of this article to see how to build such a model yourself.

A Step-by-Step Guide to Transfer Learning — Sentiment Analysis with Examples

This example uses 72 movie reviews:

62 reviews have no sentiment label and will be used to pre-train the model

8 reviews have a sentiment label and will be used to train the model

2 reviews have a sentiment label and will be used to test the model

Since there are only 8 labeled sentences (sentences with a sentiment label), we first pre-train the model on a context-prediction task. If we train the model on these 8 sentences alone, we get an accuracy of 50% (about the same as flipping a coin).

We will use transfer learning to solve this problem: first we train a model on the 62 unlabeled sentences, then we reuse part of that model to build a sentiment classifier on top of it. After training on the 8 labeled sentences, we test on the last 2 sentences and get 100% accuracy.

Step 1

We train a network that models the relationship between words. We pass in a word from a sentence and try to predict the other words that appear in the same sentence. The embedding matrix in the following code has size vocabulary_size x embedding_size and stores the vector representing each word (here each vector has size 4).

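    import math
    import tensorflow as tf  # TensorFlow 1.x API

    # batch_size, valid_examples, vocabulary_size, embedding_size and
    # num_sampled are assumed to be defined elsewhere in the full script.
    graph = tf.Graph()
    with graph.as_default():
        # Input word ids and the context word ids to predict.
        train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
        train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
        valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

        with tf.device('/cpu:0'):
            # One embedding_size-dimensional vector per vocabulary word.
            embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
            embed = tf.nn.embedding_lookup(embeddings, train_inputs)
            nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size], stddev=1.0 / math.sqrt(embedding_size)))
            nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

        # Noise-contrastive estimation loss. Keyword arguments avoid the
        # labels/inputs ordering difference between TensorFlow versions.
        loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights, biases=nce_biases, labels=train_labels, inputs=embed, num_sampled=num_sampled, num_classes=vocabulary_size))
        optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

        # Cosine similarity between the validation words and all words.
        norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
        normalized_embeddings = embeddings / norm
        valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
        similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)
        init = tf.global_variables_initializer()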


Step 2

We then train this graph so that words appearing in the same context get similar vector representations. We preprocess the sentences by removing all stop words and tokenizing them. We then pass in one word at a time, trying to minimize the distance between its vector and those of the surrounding words, and to maximize the distance between its vector and those of random words that are not in its context.

    with tf.Session(graph=graph) as session:
        init.run()
        average_loss = 0
        for step in range(10001):
            batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window)
            feed_dict = {train_inputs: batch_inputs, train_labels: batch_labels}
            # One gradient step on the NCE loss.
            _, loss_val, normalized_embeddings_np = session.run([optimizer, loss, normalized_embeddings], feed_dict=feed_dict)
            average_loss += loss_val
        # Keep the trained, normalized word vectors for the next step.
        final_embeddings = normalized_embeddings.eval()

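The preprocessing described in Step 2 (drop stop words, tokenize, then pair each word with its neighbors) can be sketched as follows. This is a simplified illustration, not the `generate_batch` function from the article's script: the helper names `tokenize` and `skipgram_pairs`, the stop-word list, and the example sentence are all hypothetical.

```python
# A tiny, illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "was", "of", "and", "it", "this"}


def tokenize(sentence):
    """Lowercase, strip surrounding punctuation, and drop stop words."""
    words = (w.strip(".,!?;:\"'").lower() for w in sentence.split())
    return [w for w in words if w and w not in STOP_WORDS]


def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for words within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = tokenize("The movie was a great surprise!")
pairs = skipgram_pairs(tokens, window=1)
print(tokens)
print(pairs)
```

Each pair is one training example: the center word is the input and the context word is the label to predict. Words that are not in the window play the role of the random negative words.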

Step 3

We then try to predict the sentiment of a sentence. We have 10 sentences (8 for training and 2 for testing) with positive and negative labels. Since the model from the previous step already contains a learned vector for every word, and the numerical properties of these vectors represent the context of the words, predicting sentiment becomes much simpler.

Instead of using the sentence directly, we set the sentence vector to the average of the vectors of all the words it contains (in a real task we would use something like an LSTM). The sentence vector is passed as input to the network, and the output is a score for whether the content is positive or negative. We use one hidden layer and train the model on the labeled sentences. As you can see, although only 10 samples are used in total, the model achieves 100% accuracy.

    import numpy
    import tensorflow as tf  # TensorFlow 1.x API

    # x_size, h_size, y_size and the train/test arrays are assumed to be
    # defined elsewhere in the full script.
    X = tf.placeholder("float", shape=[None, x_size])
    y = tf.placeholder("float", shape=[None, y_size])
    w_1 = tf.Variable(tf.random_normal((x_size, h_size), stddev=0.1))
    w_2 = tf.Variable(tf.random_normal((h_size, y_size), stddev=0.1))

    # One sigmoid hidden layer followed by a linear output layer.
    h = tf.nn.sigmoid(tf.matmul(X, w_1))
    yhat = tf.matmul(h, w_2)
    predict = tf.argmax(yhat, axis=1)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=yhat))
    updates = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

    sess = tf.InteractiveSession()
    init = tf.global_variables_initializer()
    sess.run(init)

    for epoch in range(1000):
        # Train on one labeled sentence at a time.
        for i in range(len(train_X)):
            sess.run(updates, feed_dict={X: train_X[i: i + 1], y: train_y[i: i + 1]})
        train_accuracy = numpy.mean(numpy.argmax(train_y, axis=1) == sess.run(predict, feed_dict={X: train_X, y: train_y}))
        test_accuracy = numpy.mean(numpy.argmax(test_y, axis=1) == sess.run(predict, feed_dict={X: test_X, y: test_y}))
        print("Epoch = %d, train accuracy=%.2f%%, test accuracy=%.2f%%" % (epoch + 1, 100. * train_accuracy, 100. * test_accuracy))

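The averaging step described in Step 3, which turns the word vectors of a sentence into a single input vector for the classifier, might look like the sketch below. The function name `sentence_vector` and the toy embedding values are made up for illustration; in the article's pipeline the vectors would come from the embeddings trained in Step 2.

```python
def sentence_vector(tokens, embeddings, dim):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim  # no known words: fall back to a zero vector
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


# Toy 4-dimensional embeddings (made-up values, not from the article).
embeddings = {
    "great": [1.0, 0.0, 2.0, 0.0],
    "movie": [0.0, 2.0, 0.0, 4.0],
}

vec = sentence_vector(["great", "movie", "unseen"], embeddings, dim=4)
print(vec)
```

Averaging discards word order, which is why a sequence model such as an LSTM would be the more usual choice on a real task.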

Although this is only a toy example, it shows that with the help of transfer learning, accuracy jumps from 50% to 100%. To view the complete example and code, visit the following address:

https://gist.github.com/prats226/9fffe8ba08e378e3d027610921c51a78

Some real-world examples of transfer learning

#### Image-related:

Image enhancement

Style transfer

Object detection

Skin cancer detection

#### Text-related:

Zero-shot translation

Sentiment classification

Difficulties in implementing transfer learning

Although models can be trained with less data, the technique demands more skill. Just look at the number of hard-coded parameters in the example above, and imagine having to keep adjusting them until the model works; that gives a sense of how difficult transfer learning can be to apply.

The problems currently faced when applying transfer learning include:

Finding a large-scale dataset for pre-training

Deciding which model to use for pre-training

Debugging when either of the two models does not work as expected

Not knowing how much additional data is enough to train the model

Deciding at which layer to stop using the pre-trained model

Deciding the number of layers and parameters for the model built on top of the pre-trained model

Hosting and serving the combined model

Updating the pre-trained model as more data or better techniques become available

Data scientists are hard to find. Finding people who can find data scientists is just as hard. — Krzysztof Zawadzki

NanoNets makes transfer learning easier

After experiencing these problems firsthand, we set out to solve them by building an easy-to-use, cloud-based deep learning service that supports transfer learning. The service includes a set of pre-trained models with millions of parameters that we have already trained. You only need to upload your own data (or search for data on the web), and the service will select the most suitable model for your specific task, build a new NanoNet on top of the existing pre-trained model, and fit the NanoNet to your data.

NanoNets is free for developers to learn with and to try out.

NanoNets’ transfer learning technology (this architecture is only a basic presentation)

Building NanoNet (Image Classification)

1. Select the category you want to process here.

2. Start searching the web and building models with one click (you can also upload your own images).

3. Resolve the controversy over the blue and gold dress (once the model is ready, we will allow you to upload test images through an easy-to-use web interface, and also provide an API that is not dependent on a specific language).

To get started building your first NanoNet, visit: www.nanonets.ai.

Original author: Sarthak Jain
