I know it's odd to use a negative word in a blog title, but a discussion came up the other day that touches on some of the issues I've been thinking about. It all started with a post by Jeff Leek about using deep learning in the small-sample regime. In short, he argues that when the sample size is small (which is common in biology), linear models with few parameters can outperform deep networks, even ones with only a few layers and hidden units. He goes on to show that a very simple linear predictor using the ten most informative features outperforms a simple deep network when classifying 0s and 1s from MNIST with only about 80 samples. This prompted Andrew Beam to write a rebuttal (see: "You can probably use deep learning even if your data isn't that big"), in which he shows that a properly trained deep network can beat a simple linear model even with very few training samples. Meanwhile, more and more biomedical informatics researchers are adopting deep learning for all sorts of problems. Is the hype real, or is a linear model all we need? As always, the answer is: it depends. In this article, I will present several machine-learning use cases where applying deep learning does not make much sense, and address some preconceptions that I think hinder the effective use of deep learning, especially for beginners.

Preconceptions about deep learning

First, let's address some preconceptions that outsiders to the field tend to hold as half-truths. There are two broad ones and one more technical one, which is somewhat of an extension of Andrew Beam's "misconceptions" section.

Deep learning can really be applied to small samples

Deep learning's reputation was built on large amounts of data (the first Google Brain project fed a deep network huge numbers of YouTube videos), and it has been publicized ever since as a complex algorithm running on big data. Unfortunately, the big-data/deep-learning pairing is sometimes read in reverse, producing the myth that deep learning cannot be applied to small samples. If you have only a few samples, training a neural network with a high parameter-to-sample ratio might superficially look like a recipe for overfitting. However, if you only consider the sample size and dimensionality of a given problem, whether supervised or unsupervised, you are modeling the data in a vacuum, without context. It is likely that you have data sources related to your problem, or that a domain expert can provide a strong prior, or that the data is organized in a special way. In all of these cases, deep learning has a chance to be an effective method; for example, you can encode useful representations from a larger related dataset and apply them to your problem. The classic illustration of this is common in natural language processing: you can learn word embeddings on a large corpus and use them as features for a supervised task on a smaller, narrower corpus. Taken to the extreme, you can have a set of neural networks jointly learn a representation and an effective way to reuse that representation on small sample sets. This is called one-shot learning, and it has been applied successfully in several high-dimensional data fields, such as computer vision and drug discovery.
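As a small illustration of that reuse idea, here is a hedged sketch (the GloVe file name, the toy texts, and the helper functions are placeholders I am adding, not part of the original post): it averages word vectors pre-trained on a large corpus to featurize a tiny labeled dataset and then fits a plain logistic regression on top.

```python
# Minimal sketch: reuse vectors pre-trained on a large corpus as features
# for a small labeled dataset. Paths and data below are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

def load_vectors(path):
    """Parse a GloVe-style text file: one word per line, then its floats."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vectors

def embed(text, vectors, dim=50):
    """Represent a document as the average of its known word vectors."""
    words = [vectors[w] for w in text.lower().split() if w in vectors]
    return np.mean(words, axis=0) if words else np.zeros(dim, dtype=np.float32)

vectors = load_vectors("glove.6B.50d.txt")  # assumed local copy of pre-trained vectors
texts = ["great product fast shipping", "terrible quality broke quickly"]  # toy data
labels = [1, 0]

X = np.stack([embed(t, vectors) for t in texts])
clf = LogisticRegression().fit(X, labels)
```

The representation does the heavy lifting here; the model that consumes it can stay very small.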
Deep learning is not the answer to everything

The second preconception I hear most often is the hype. Many would-be practitioners expect deep nets to give them a mythical performance boost simply because neural networks work well in other domains. Others are inspired by impressive work in modeling and manipulating images, music, and language, and rush headfirst into the field by trying to train the latest GAN architecture. The hype is real in many ways. Deep learning has become an undeniable force in machine learning and an important tool for anyone modeling data. Its popularity has spawned essential frameworks such as TensorFlow and PyTorch that are incredibly useful even outside of deep learning, and its underdog-to-superstar story has inspired researchers to revisit previously neglected methods such as evolution strategies and reinforcement learning. But it is not a panacea by any means. Besides the fact that there is no free lunch, deep learning models can be very nuanced and require careful, sometimes expensive, hyperparameter searches, tuning, and testing (more on this later). Moreover, there are many cases where deep learning simply has little practical value and simpler models work far better.

Deep learning is more than .fit()

There is another aspect of deep learning models that tends to get lost in translation when people come from other areas of machine learning. Most tutorials and introductory material describe deep learning models as hierarchically connected layers of nodes, where the first layer takes the input and the last layer produces the output, and note that you can train them with some form of stochastic gradient descent. After perhaps a brief mention of how stochastic gradient descent works and what backpropagation is, most of the explanation focuses on the rich variety of neural network types. The optimization methods themselves receive little further attention, which is unfortunate, because a good part of the reason deep learning works is likely due to these particular methods, and guidance on how to tune their parameters and how to partition the data to use them effectively is crucial to getting good results in a reasonable amount of time. Exactly why stochastic gradient descent is so crucial is still not well understood, but clues are emerging here and there. A favorite of mine is interpreting the method as part of performing Bayesian inference. In essence, whenever you do some form of numerical optimization, you are doing some Bayesian inference with particular assumptions and priors. Stochastic gradient descent is no different, and recent work suggests that the procedure is really a Markov chain that, under certain assumptions, has a stationary distribution which can be viewed as a sort of variational approximation to the posterior. So when you stop your SGD and take the final parameters, you are essentially sampling from this approximate distribution. I find this idea illuminating, because the optimizer's parameters make much more sense in that light. For example, as you increase the learning rate of SGD, the Markov chain becomes unstable until it finds wide local minima that it samples over a large area; that is, you increase the variance of the procedure. On the other hand, if you decrease the learning rate, the Markov chain slowly settles into narrower minima until it converges to a tight region; that is, you increase the bias toward a certain part of the space. Another SGD parameter, the batch size, also controls what type of region the algorithm converges to: wider regions for small batches and sharper regions for larger batches. Such complexity means that the optimizers of deep nets become first class: they are a core part of the model, just as important as the layer architecture. This is not the case with many other models in machine learning.
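This is not the cited analysis, just a toy numpy sketch of my own to make the learning-rate point tangible: run SGD on a simple quadratic loss with noisy minibatch gradients and watch how the spread of the iterates around the minimum grows with the learning rate.

```python
# Toy illustration: the spread of SGD iterates around a quadratic minimum
# grows with the learning rate (and shrinks with the batch size).
import numpy as np

rng = np.random.default_rng(0)

def sgd_tail(lr, batch_size, n_steps=5000, n_data=1000):
    # Data centered at 3.0; loss is mean squared error over a minibatch,
    # so the minibatch gradient at w is 2 * (w - mean(batch)).
    data = rng.normal(loc=3.0, scale=1.0, size=n_data)
    w, trace = 0.0, []
    for _ in range(n_steps):
        batch = rng.choice(data, size=batch_size, replace=False)
        w -= lr * 2.0 * (w - batch.mean())  # noisy gradient step
        trace.append(w)
    return np.array(trace[n_steps // 2:])   # discard the burn-in half

for lr in (0.005, 0.1):
    tail = sgd_tail(lr=lr, batch_size=10)
    print(f"lr={lr}: mean={tail.mean():.3f}, std={tail.std():.4f}")
# Larger learning rates leave the iterates bouncing in a wider region
# around the minimum; smaller ones concentrate them tightly.
```

With the larger learning rate the final iterates wander over a visibly wider region; shrinking it (or growing the batch) tightens them, mirroring the variance/bias trade-off described above.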
Linear models and support vector machines, by contrast, are convex optimization problems that don't really have this subtlety and have essentially one true answer. This is why people coming from other fields, or people used to tools like scikit-learn, get confused when they can't find a dead-simple API with a .fit() method.

Limitations of deep learning

So, when is deep learning really not suitable for a task? In my view, there are several main scenarios.

Low-budget or low-commitment problems

Deep learning models are very flexible, with a huge number of architecture and node types, optimizers, and regularization strategies. Depending on the application, your model might have convolutional or recurrent structure; it might be really deep or have only a few hidden layers; it might use rectified linear units or some other activation function; it might or might not have dropout, and the weights should probably be regularized. This is only a partial list, and there are many other kinds of nodes, connections, and even loss functions to try. That is a lot of hyperparameters to tweak and architectures to explore, while training even one large neural network is very time-consuming. Google recently touted its AutoML pipeline for automatically finding the best architecture, which is impressive, but it required over 800 GPUs running at full throttle for weeks, which is out of reach for almost everyone else. The point is that training deep networks carries a large cost in both computation and debugging time. Such expense makes no sense for many everyday prediction problems, where the return on tuning a deep net, even a small one, may be too low. Even if you have the budget and the commitment, there is no reason not to try other approaches first as a baseline. You might be pleasantly surprised to find that a linear support vector machine is all you really need.

Interpreting and communicating model parameters/feature importances to a general audience

Deep networks are notorious for being black boxes: high predictive power, low interpretability. Even though many tools proposed recently (saliency maps and the like) have shown great results in some areas, they do not transfer completely to all applications. These tools mostly work well when you want to make sure the network is not deceiving you by memorizing the dataset or fixating on particular spurious features, but it remains difficult to explain each feature's importance to the overall decision of the deep network. In this arena, nothing really beats linear models, because the learned coefficients have a direct relationship with the response. This is especially critical when conveying these explanations to a general audience. For example, physicians need to integrate all kinds of disparate data to reach a diagnosis. The simpler and more direct the relationship between a variable and the outcome, the better a physician can use it, without under- or over-estimating its value. Furthermore, in many cases the interpretation of the model matters more than its raw accuracy. A policy maker, for example, may want to know the effect of demographic variables on mortality, and is probably more interested in a direct approximation of that relationship than in the accuracy of the prediction. In both cases, deep learning is at a disadvantage compared to simpler, more transparent methods.
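To make the coefficient argument concrete, here is a minimal sketch (synthetic data and made-up feature names, added here purely for illustration) of the kind of direct readout a linear model offers; a linear SVM's coef_ can be inspected in exactly the same way.

```python
# Sketch: with a linear model, each learned coefficient maps directly onto
# a named input variable, which is easy to hand to a non-technical audience.
# The dataset and feature names below are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
feature_names = ["age", "blood_pressure", "cholesterol", "smoker"]

X = rng.normal(size=(200, len(feature_names)))
# Synthetic outcome driven mostly by the first and last features.
logits = 1.5 * X[:, 0] - 0.2 * X[:, 1] + 2.0 * X[:, 3]
y = (logits + rng.normal(scale=0.5, size=200)) > 0

model = LogisticRegression().fit(X, y)
for name, coef in sorted(zip(feature_names, model.coef_[0]),
                         key=lambda pair: -abs(pair[1])):
    print(f"{name:>15}: {coef:+.2f}")  # sign and magnitude per variable
```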
Establishing causal mechanisms

The extreme case of model interpretation is trying to build a mechanistic model, that is, a model that actually captures the phenomenon behind the data. Good examples include trying to estimate whether two molecules interact in a particular cellular environment, or hypothesizing how a particular marketing strategy actually affects sales. Experts in this area tend to agree that nothing really beats old-school Bayesian methods; they are our best way of representing and reasoning about causality. Vicarious recently published some strong work in this direction (https://www.vicarious.com/img/icml2017-schemas.pdf), demonstrating why this more principled approach generalizes better than deep learning on video-game tasks.

Learning from "non-structured" features

This one may be up for debate. One area where deep learning excels is finding useful representations of the data for a particular task. A good example is the word embeddings mentioned above. Natural language has a rich and complex structure that can be approximated by "context-aware" networks: each word is represented as a vector that encodes the context in which it is most often used. Using word embeddings learned on a large corpus can sometimes give you a boost on a specific natural language processing task on another corpus. However, they may be of little use if the corpus in question is completely non-structured. Say, for example, you are classifying objects by looking at unstructured lists of keywords. Since the keywords are not used in any particular structure (such as a sentence), word embeddings are unlikely to help much. In this case, the data truly is a bag of words, and that representation is likely sufficient for the task. A counter-argument is that word embeddings are not actually that expensive if you use pre-trained ones, and they might capture keyword similarity better. However, I would still prefer to start with a bag-of-words representation and see whether it already gives good predictions. After all, each dimension of a bag of words is easier to interpret than the corresponding word-embedding slot.

Deep learning is the future

Deep learning is hot, well funded, and moving incredibly fast. By the time you read a deep learning paper presented at a conference, there are probably two or three newer iterations of it. This adds a big caveat to the points I made above: deep learning may yet become super useful in these scenarios in the near future. The tools for interpreting deep learning models are getting better and better. Recent software such as Edward marries Bayesian modeling with deep network frameworks (see: "Deep Probabilistic Programming Language Edward: Combining Bayesian Methods, Deep Learning, and Probabilistic Programming"), allowing the uncertainty of neural network parameters to be quantified and Bayesian inference to be carried out easily via probabilistic programming and automatic variational inference (a rough sketch follows below). In the longer run, there may emerge a reduced modeling vocabulary that exposes the salient properties a deep network can have, shrinking the space of things that need to be tried. So keep refreshing your arXiv feed; this article may well be out of date in a month or two.
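For a flavor of what that looks like, here is a rough sketch loosely following the Bayesian linear regression example from Edward's documentation. It targets the old Edward 1.x / TensorFlow 1.x stack, and argument names have shifted between versions, so treat it as illustrative rather than something to copy verbatim.

```python
# Sketch in the style of Edward 1.x (requires TensorFlow 1.x); illustrative
# only -- exact argument names differ across Edward versions.
import edward as ed
import numpy as np
import tensorflow as tf
from edward.models import Normal

N, D = 40, 10  # a deliberately small dataset
X_train = np.random.randn(N, D).astype(np.float32)
w_true = np.random.randn(D).astype(np.float32)
y_train = (X_train.dot(w_true) + 0.1 * np.random.randn(N)).astype(np.float32)

# Model: Bayesian linear regression with standard normal priors.
X = tf.placeholder(tf.float32, [N, D])
w = Normal(loc=tf.zeros(D), scale=tf.ones(D))
b = Normal(loc=tf.zeros(1), scale=tf.ones(1))
y = Normal(loc=ed.dot(X, w) + b, scale=tf.ones(N))

# Variational approximation to the posterior over w and b.
qw = Normal(loc=tf.get_variable("qw/loc", [D]),
            scale=tf.nn.softplus(tf.get_variable("qw/scale", [D])))
qb = Normal(loc=tf.get_variable("qb/loc", [1]),
            scale=tf.nn.softplus(tf.get_variable("qb/scale", [1])))

# Automated variational inference (minimizing KL(q || p)).
inference = ed.KLqp({w: qw, b: qb}, data={X: X_train, y: y_train})
inference.run(n_samples=5, n_iter=500)

# The approximate posterior gives uncertainty estimates, not just point values.
sess = ed.get_session()
print(sess.run([qw.mean(), qw.stddev()]))
```

The appeal of tools in this family is that uncertainty over the parameters comes out of the same workflow that fits them.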
This article is reprinted from Machine Heart; the original article comes from hyperparameter.space, and the author is Pablo Cordero.