Facebook AI Director: Deep Learning Technology Trend Report

Xinzhiyuan original

Source: Yann LeCun

Compiled by: MiLi

Yann LeCun is the inventor of convolutional neural networks and the head of Facebook AI Research. The 150 slides below are LeCun's comprehensive and detailed thinking on the field of deep learning. LeCun is very optimistic about unsupervised learning and believes it is the only form of learning that can provide enough information to train neural networks with billions of parameters.

But LeCun also believes it is very hard to do well; after all, the world is not fully predictable. Let's see what surprises LeCun has for us in these 150 slides.

Yann LeCun: the 150 slides in full

To download the full slide deck, reply 0326 in the Xinzhiyuan subscription account.

Deep Learning

By Yann LeCun

Courant Institute of Mathematical Sciences, New York University

Facebook AI Research


Do we need to clone the brain to develop intelligent machines?

The brain is an existence proof of intelligent machines

- Birds and bats are existence proofs of heavier-than-air flight

The brain vs. today's high-speed processors

Can we develop artificial intelligence systems by replicating the brain?

Are computers really only 10,000 times less powerful than the brain? More likely 1 million times less: synapses are complicated. A factor of 1 million is 30 years of Moore's Law.

It is better to take inspiration from biology; but copying biology blindly, without understanding the underlying principles, is doomed to fail. Airplanes were inspired by birds and use the same basic principles of flight; however, airplanes do not flap their wings and do not have feathers.

Let’s draw inspiration from nature, but we don’t need to copy it.

It's good to imitate nature, but we also need to understand nature. For airplanes, we developed aerodynamics and compressible fluid dynamics, and we knew that feathers and wing flapping are not the key.

1957: Perceptron (first learning machine)

A simple simulated neuron with adaptive "synaptic weights" computes the weighted sum of its inputs and outputs +1 if the weighted sum is above a threshold, or -1 otherwise.

Perceptron learning algorithm
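As a concrete illustration, here is a minimal NumPy sketch of the perceptron described above together with its error-driven update rule; the toy data, learning rate, and function names are my own, for illustration only.

```python
import numpy as np

def perceptron_train(X, y, epochs=10, lr=1.0):
    """Train a perceptron: output +1 if the weighted sum exceeds the threshold, else -1."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):           # yi is +1 or -1
            pred = 1 if xi @ w + b > 0 else -1
            if pred != yi:                  # adjust the "synaptic weights" only on mistakes
                w += lr * yi * xi
                b += lr * yi
    return w, b

# Toy, linearly separable data (made up for illustration)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
print(w, b)
```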

Usual machine learning (supervised learning)

Design a machine with adjustable knobs (similar to the weights in a perceptron); select a training example, run it through the machine, and measure the error; figure out which direction the knob needs to be adjusted to reduce the error; repeat this operation using all training examples until the knob stabilizes.


Machine learning = function optimization

This is like walking in a foggy mountain, reaching the village in the valley by walking in the direction of the steepest downhill slope; but each sample gives us a noisy estimate of the direction, so our path is quite random.
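A small sketch of this idea, assuming a simple bowl-shaped loss: each step follows a noisy estimate of the steepest-descent direction, so the path wanders but still reaches the valley. All values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# A simple bowl-shaped loss: L(w) = 0.5 * ||w - target||^2
target = np.array([3.0, -2.0])

def noisy_gradient(w):
    # Each "training sample" gives a noisy estimate of the true downhill direction
    return (w - target) + rng.normal(scale=0.5, size=w.shape)

w = np.zeros(2)
lr = 0.1
for step in range(200):
    w -= lr * noisy_gradient(w)   # step along the (noisy) steepest-descent direction

print(w)   # ends up near the "village in the valley", i.e. close to `target`
```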

Generalization: Recognizing situations not seen during training

After training: test the machine on samples it has never seen before;


Supervised Learning

We can train a machine with many examples such as tables, chairs, dogs, cats, and people; but can the machine recognize tables, chairs, dogs, cats, and people that it has never seen before?

Machine Learning at Scale: The Reality

Billions of “knobs” (or “weights”), thousands of categories; millions of examples; identifying each example may require billions of operations; but these operations are just some simple multiplications and additions.

Traditional Model of Pattern Recognition

Traditional approach to pattern recognition (since the late 50s): fixed/engineered features (or a fixed kernel) + a trainable classifier; Perceptron (Cornell University, 1957)

Deep learning = the entire machine is trainable

Traditional pattern recognition: fixed and handcrafted feature extractors; mainstream modern pattern recognition: unsupervised mid-level features; deep learning: representations are hierarchical and trained;

Deep learning = learning hierarchical representation

Deep learning means more than one stage of nonlinear feature transformation; feature visualization of a convolutional network trained on ImageNet [Zeiler & Fergus 2013]

Trainable feature hierarchies

A hierarchy of representations with increasing levels of abstraction; each stage is a trainable feature transformation.

Image recognition: pixel → edge → texton → motif → part → object

Text: character → word → word group → clause → sentence → story

Speech: sample → spectral band → sound → … → phone → phoneme → word

Shallowness vs. Depth == Lookup Table vs. Multi-Step Algorithm

"Shallow and wide" vs "deep and narrow" == "more memory" vs "more time", lookup table vs algorithm; few functions can be done in two steps without an exponentially large lookup table; by exponential factor, "storage" can be reduced by more than two steps.

How does the brain interpret images?

The ventral (recognition) pathway in the visual cortex contains multiple stages: retina → LGN → V1 → V2 → V4 → PIT → AIT, etc.

Multi-layer neural network


Multiple layers of simple units; each unit computes a weighted sum of its inputs; the weighted sum passes through a nonlinear function; a learning algorithm changes the weights;

Typical multi-layer neural network architecture

  • Complex learning machines can be built by assembling modules into networks;
  • Linear module
  • Out = W · In + B
  • ReLU module (rectified linear unit)
  • out_i = 0 if in_i < 0
  • out_i = in_i otherwise
  • Cost module: squared distance
  • C = ||In1 - In2||²
  • Objective function
  • L(Θ) = (1/p) Σ_k C(X^k, Y^k, Θ)
  • Θ = (W1, B1, W2, B2, W3, B3)
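A minimal NumPy sketch of the modules listed above (linear, ReLU, squared-distance cost) assembled into the three-layer network implied by Θ = (W1, B1, W2, B2, W3, B3); the function names and shapes are mine, for illustration only.

```python
import numpy as np

def linear(x, W, b):              # Out = W · In + B
    return W @ x + b

def relu(x):                      # out_i = 0 if in_i < 0, else in_i
    return np.maximum(0.0, x)

def squared_cost(out, target):    # C = ||In1 - In2||^2
    return np.sum((out - target) ** 2)

def network(x, params):
    W1, b1, W2, b2, W3, b3 = params        # Θ = (W1, B1, W2, B2, W3, B3)
    h1 = relu(linear(x, W1, b1))
    h2 = relu(linear(h1, W2, b2))
    return linear(h2, W3, b3)

def objective(params, X, Y):               # L(Θ) = (1/p) Σ_k C(X^k, Y^k, Θ)
    return np.mean([squared_cost(network(x, params), y) for x, y in zip(X, Y)])
```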

Build a network by assembling modules

All major deep learning frameworks use modules (inspired by SN/Lush, 1991), Torch7, Theano, TensorFlow….

Computing gradients by backpropagation

Practical Application of the Chain Rule

Backpropagating the state gradients:

● dC/dX_{i-1} = dC/dX_i · dX_i/dX_{i-1}

● dC/dX_{i-1} = dC/dX_i · dF_i(X_{i-1}, W_i)/dX_{i-1}

Backpropagating the weight gradients:

● dC/dW_i = dC/dX_i · dX_i/dW_i

● dC/dW_i = dC/dX_i · dF_i(X_{i-1}, W_i)/dW_i
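A sketch of these two chain-rule equations for one module, here assuming F_i(X_{i-1}, W_i) = ReLU(W_i · X_{i-1}); the helper names and shapes are illustrative.

```python
import numpy as np

# One module: X_i = F_i(X_{i-1}, W_i) = ReLU(W_i @ X_{i-1})
def forward(x_prev, W):
    z = W @ x_prev
    return np.maximum(0.0, z), z

def backward(dC_dXi, x_prev, W, z):
    """Given dC/dX_i, apply the chain rule to get dC/dX_{i-1} and dC/dW_i."""
    delta = dC_dXi * (z > 0)           # dC/dz, using the ReLU derivative
    dC_dXprev = W.T @ delta            # dC/dX_{i-1} = dC/dX_i · dF_i/dX_{i-1}
    dC_dW = np.outer(delta, x_prev)    # dC/dW_i     = dC/dX_i · dF_i/dW_i
    return dC_dXprev, dC_dW
```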

Any architecture works

Any connection graph is allowed;

Any directed acyclic graph (DAG);

Recurrent networks need to be “unfolded in time”

Any module is allowed

As long as it is continuous and differentiable almost everywhere with respect to its parameters and its non-terminal inputs.

Most frameworks provide automatic differentiation;

Theano, Torch7+autograd,…

Programs become computation DAGs (directed acyclic graphs), and the backpropagation paths through them are derived automatically.

The objective function of a multi-layer network is non-convex.

1-1-1 Network

– Y = W1*W2*X

Objective function: quadratic loss, learning the identity function

A single example: X = 1, Y = 1, so L(W) = (1 - W1*W2)²
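Evaluating this loss at a few points shows why it is non-convex: there are two disconnected global minima (W1 = W2 = 1 and W1 = W2 = -1) with a saddle point at the origin between them.

```python
# L(W) = (1 - W1*W2)^2 for the 1-1-1 network Y = W1*W2*X with X = Y = 1
def loss(w1, w2):
    return (1.0 - w1 * w2) ** 2

print(loss(1.0, 1.0))    # 0.0 : one global minimum
print(loss(-1.0, -1.0))  # 0.0 : another, disconnected global minimum
print(loss(0.0, 0.0))    # 1.0 : a saddle point between them
```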

Convolutional Networks

(ConvNet or CNN for short)

Convolutional Network Architecture

Multiple convolutions

Animation: Andrej Karpathy URL: //cs231n.github.io/convolutional-networks/

Convolutional Network (created in 1990)

filter bank + tanh → pooling → filter bank + tanh → pooling → filter bank + tanh

Hubel and Wiesel's model of the visual cortex

Simple cells detect local features; complex cells "pool" the outputs of simple cells within a retinotopic neighborhood. [Fukushima 1982] [LeCun 1989, 1998] [Riesenhuber 1999], etc.

Overall architecture: multiple stages of normalization → filter bank → nonlinearity → pooling

Normalization: variations on whitening (optional)

Subtractive: mean removal, high-pass filtering

Divisive: local contrast normalization, variance normalization

Filter bank: dimension expansion, projection onto an overcomplete basis

Nonlinearity: sparsification, saturation, lateral inhibition, etc.

Rectification (ReLU), component-wise shrinkage, tanh, …

Pooling: aggregation over space or feature type
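A minimal PyTorch sketch of one such stage (normalization → filter bank → nonlinearity → pooling). The specific layer choices (local response normalization, 9×9 kernels, max pooling) and sizes are my assumptions for illustration, not the exact configuration on the slide.

```python
import torch
import torch.nn as nn

class ConvStage(nn.Module):
    """One ConvNet stage: normalization -> filter bank -> nonlinearity -> pooling."""
    def __init__(self, in_channels=3, out_channels=64):
        super().__init__()
        self.norm = nn.LocalResponseNorm(size=5)           # stand-in for local contrast normalization
        self.filters = nn.Conv2d(in_channels, out_channels, kernel_size=9)  # filter bank (dimension expansion)
        self.nonlin = nn.ReLU()                            # rectification
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)  # spatial pooling

    def forward(self, x):
        return self.pool(self.nonlin(self.filters(self.norm(x))))

x = torch.randn(1, 3, 96, 96)   # a dummy image batch
print(ConvStage()(x).shape)     # torch.Size([1, 64, 44, 44])
```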

LeNet1 demonstration in 1993

Multi-character recognition [Matan et al., 1992]

Each layer is a convolution

ConvNet sliding window + weighted finite state machine


Check Reader (Bell Labs, 1995)

A graph transformer network trained to read check amounts, trained end-to-end with a negative log-likelihood loss. 50% correct, 49% rejected, 1% error (detectable later in the process). Used by many banks in the US and Europe since 1996; in the early 2000s it processed roughly 10% to 20% of all handwritten checks in the US.

Face detection [Vaillant et al. 1993, 1994]

A ConvNet applied to large images; heatmaps at multiple scales; non-maximum suppression over candidates; 6 seconds per 256×256 image on a SPARCstation



Synchronized face detection and pose estimation


Convolutional Network Pedestrian Detection

Scene analysis and annotation

Scene parsing and annotation: Multi-scale ConvNet architecture

Each output sees a large input context; trained with supervision on fully labeled images

Method 1: Majority voting in superpixel regions


Scene parsing and annotation of RGB and depth images

Scene analysis and annotation

Without post-processing, frame by frame, the ConvNet runs at 50 ms per frame on Virtex-6 FPGA hardware, limited by Ethernet communication bandwidth.


ConvNet for long-range adaptive robot vision (DARPA LAGR project 2005-2008)


Long-range vision with a ConvNet

Preprocessing (125 ms): ground plane estimation, horizon leveling, conversion to YUV + local contrast normalization, scale-invariant pyramid of distance-normalized image bands

Convolutional Network Architecture

100 features per 3x12x25 input window; YUV image bands 20-36 pixels high, 36-500 pixels wide

Convolutional Networks for Visual Object Recognition

In the mid-2000s, ConvNets achieved quite good results in object classification, with the dataset "Caltech101": 101 categories, 30 training examples per category, but the results were slightly inferior to more "traditional" computer vision methods for the following reasons:

1. The dataset is too small;

2. The computer is too slow;

Then, two things happened . . .

ImageNet dataset [Fei-Fei et al., 2012]

1.2 million training samples

1000 categories

Fast and programmable general-purpose GPUs

Capable of 1 trillion operations per second

Extremely Deep ConvNet Object Recognition

100 million to 1 billion connections, 10 million to 1 billion parameters, 8 to 20 layers

Training extremely deep ConvNets on GPUs

ImageNet top-5 error rates:

15% [Krizhevsky et al. 2012]

13.8% [Sermanet et al. 2013]

7.3% VGGNet [Simonyan, Zisserman 2014]

6.6% GoogLeNet [Szegedy et al. 2014]

5.7% ResNet [He et al. 2015]

Extremely deep ConvNet architecture

Small kernels, not much subsampling (fractional subsampling)

Kernels: first layer (11×11)

First layer: 11×11 kernels, stride 4, RGB → 96 feature maps

Learning in Action

How are the first layer filters learned?

Deep learning = learning hierarchical representation

Deep means more than one stage of nonlinear feature transformation; feature visualization of a convolutional network trained on ImageNet [Zeiler & Fergus 2013]

ImageNet: Classification

Name the main objects in the image; top-5 error rate: if the correct label is not among the top 5 predictions, it counts as an error. Red: ConvNet, blue: not a ConvNet

ConvNets object recognition and localization

Classification + Localization: Multi-scale Sliding Window

Apply a ConvNet as a sliding window over the image at multiple scales; sliding a ConvNet over an image is cheap. For each window, predict a class and bounding-box parameters. Even if the object is not completely contained in the window, the ConvNet can predict what it thinks the object is.
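The reason sliding a ConvNet is cheap is that a network built only from convolutions can be applied to an image larger than its training window, producing a whole grid of window predictions in one forward pass. A small PyTorch sketch; the layer sizes are made up for illustration.

```python
import torch
import torch.nn as nn

# A tiny "classifier" built only from convolutions, so it can slide over any image size.
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=7, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
    nn.Conv2d(32, 10, kernel_size=5),          # 10 "class scores" per window
)

small = torch.randn(1, 3, 31, 31)   # the training-size window
large = torch.randn(1, 3, 63, 63)   # a larger image

print(net(small).shape)  # one prediction:            torch.Size([1, 10, 1, 1])
print(net(large).shape)  # a grid of window outputs:  torch.Size([1, 10, 9, 9])
```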

Results: pre-trained on ImageNet 1K, fine-tuned on ImageNet Detection


Detection examples

Deep Face

[Taigman et al. CVPR, 2014]

Alignment → ConvNet → metric learning

Used for automatic photo tagging on Facebook

8 million photos per day

Metric learning with a Siamese architecture

Contrastive objective function: similar objects should produce outputs that are close to each other, dissimilar objects outputs that are far apart; learns an invariant mapping / dimensionality reduction. [Chopra et al., CVPR 2005] [Hadsell et al., CVPR 2006]
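A short PyTorch sketch of a contrastive loss in the spirit of [Hadsell et al., CVPR 2006]: pairs labeled as similar are pulled together, dissimilar pairs are pushed at least a margin apart. The margin value, tensor shapes, and names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(out1, out2, same, margin=1.0):
    """out1, out2: embeddings of an image pair; same: 1.0 if the pair is similar, 0.0 otherwise."""
    d = F.pairwise_distance(out1, out2)             # distance between the two outputs
    pos = same * d.pow(2)                           # similar pairs: minimize the distance
    neg = (1.0 - same) * F.relu(margin - d).pow(2)  # dissimilar pairs: push beyond the margin
    return (pos + neg).mean()

a, b = torch.randn(8, 128), torch.randn(8, 128)     # a batch of 8 embedding pairs
same = torch.randint(0, 2, (8,)).float()
print(contrastive_loss(a, b, same))
```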

Person Recognition and Pose Prediction

Image Captioning: Generating Descriptive Sentences

C3D: 3D ConvNet Video Classification

Segmenting and localizing objects (DeepMask)

[Pinheiro, Collobert, Dollar ICCV 2015]

The ConvNet generates object masks

DeepMask++ proposals

Recognition pipeline

Training

Trained for 2.5 days on 8×4 Kepler GPUs using EASGD [Zhang, Choromanska, LeCun, NIPS 2015]



Results

Supervised ConvNets that generate images

Generating images with ConvNets

Generating chairs: "chair arithmetic" in feature space

ConvNets for Speech Recognition

Speech Recognition and Convolutional Networks (New York University/IBM)

Acoustic model: 7-layer ConvNet. 54.4 million parameters.

Classifies the acoustic signal into 3,000 context-dependent subphone categories

ReLU units + dropout for the last layers

Trained for 4 days on a GPU

Speech Recognition and Convolutional Networks (New York University/IBM)

Training samples: 40 MEL-frequency cepstral coefficients, windows of 40 frames, one frame every 10 milliseconds

Speech Recognition and Convolutional Networks (New York University/IBM)

First-layer convolution kernels: 64 kernels of size 9×9


Speech Recognition and Convolutional Networks (New York University/IBM)

Multilingual recognition, multi-scale input, large context window

ConvNets are everywhere (or will be soon)

ConvNet Chip

Currently, NVIDIA, Intel, Teradeep, Mobileye, Qualcomm and Samsung are developing ConvNet chips.

Many startups: Movidius, Nervana, etc.

In the near future, ConvNets will drive cars

NVIDIA: Driver assistance system based on ConvNet technology

Drive PX2: open platform for driver assistance systems

Embedded supercomputer: 42 TOPS (≈ 150 MacBook Pros)

MobilEye: A driver assistance system based on ConvNet technology

Configured in Tesla Model S and Model X

ConvNet Connectomics [Jain, Turaga, Seung, 2007]

A 3D ConvNet over volumetric images, using a 7×7×7 voxel neighborhood to label each voxel as "membrane" or "non-membrane"; this has become a standard method in connectomics

Brain tumor detection

Cascaded-input CNN architecture, 802,368 parameters, trained on 30 patients; results shown on BRATS 2013

Predicting DNA/RNA-protein binding with ConvNets

"Predicting DNA- and RNA-binding protein sequence specificity by deep learning" - Nature Biotechnology, July 2015, by B Alipanahi, A Delong, M Weirauch, B Frey

Deep Learning is Everywhere (ConvNets are Everywhere)

Many applications on Facebook, Google, Microsoft, Baidu, Twitter, IBM, etc.

Image recognition for photo collection search

Image/video content filtering: spam, nudity and violence.

Search and news feed ranking

People upload 800 million pictures to Facebook every day

(If we include Instagram, Messenger and WhatsApp, that’s 2 billion images per day)

Every photo uploaded to Facebook goes through two ConvNets within 2 seconds.

One is image recognition and annotation;

Another is facial recognition (not yet activated in Europe)

In the near future ConvNets will be everywhere:

Self-driving cars, medical imaging, augmented reality, mobile devices, smart cameras, robots, toys and more.

Embedding the world

Thought vectors

"My neighbor's Samoyed dog looks like a Siberian Husky"

Embedding the world

Embedding Instagram videos


Representing the world with "thought vectors"

Any object, concept or "idea" can be represented by a vector

[-0.2, 0.3, -4.2, 5.1, …..] represents the concept of "cat"

[-0.2, 0.4, -4.0, 5.1, …..] represents the concept of "dog"

These two vectors are very similar because cats and dogs share many common attributes.
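A tiny sketch using the two vectors from the slide: cosine similarity between the "cat" and "dog" vectors comes out close to 1, which is one common way of measuring how close two concept vectors are.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat = np.array([-0.2, 0.3, -4.2, 5.1])   # toy "cat" vector from the slide
dog = np.array([-0.2, 0.4, -4.0, 5.1])   # toy "dog" vector from the slide
print(cosine_similarity(cat, dog))        # close to 1.0: the concepts are similar
```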

Reasoning by manipulating thought vectors

Compare vectors for question answering, information extraction, and content filtering

Combine and transform vectors for reasoning, planning, and language translation

Memories store thought vectors

MemNN (Memory Network) is a good example

At FAIR, we want to "embed the world" in thought vectors.

Natural Language Understanding

Can text be embedded?

[Bengio 2003] [Collobert and Weston 2010]

Predict a word from the words before and after it

Compositional semantic properties

Tokyo - Japan = Berlin - Germany

Tokyo - Japan + Germany = Berlin
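A toy illustration of this vector arithmetic with made-up 3-dimensional embeddings (real word vectors have hundreds of dimensions and are learned from text, so these numbers are purely illustrative):

```python
import numpy as np

# Hypothetical 3-d word embeddings, invented for illustration only.
emb = {
    "Tokyo":   np.array([0.9, 0.1, 0.7]),
    "Japan":   np.array([0.8, 0.0, 0.2]),
    "Germany": np.array([0.1, 0.9, 0.2]),
    "Berlin":  np.array([0.2, 1.0, 0.7]),
}

query = emb["Tokyo"] - emb["Japan"] + emb["Germany"]   # Tokyo - Japan + Germany

# The nearest word vector to the query should be "Berlin"
nearest = min(emb, key=lambda w: np.linalg.norm(emb[w] - query))
print(nearest)
```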

Question answering systems

Language Translation with LSTM Networks

Multi-layer LSTM recurrent modules

Read and encode the English sentence

Generate the French sentence after the end of the English sentence

Very similar accuracy to the current state of the art
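A minimal PyTorch sketch of this encoder-decoder setup with multi-layer LSTMs; the vocabulary sizes, dimensions, and layer counts are placeholder values, and real systems add attention, beam search, and other refinements.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder-decoder with multi-layer LSTMs: read the source sentence,
    then generate the target sentence from the final encoder state."""
    def __init__(self, src_vocab, tgt_vocab, dim=256, layers=2):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, num_layers=layers, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, num_layers=layers, batch_first=True)
        self.readout = nn.Linear(dim, tgt_vocab)

    def forward(self, src_tokens, tgt_tokens):
        _, state = self.encoder(self.src_emb(src_tokens))        # encode the English sentence
        out, _ = self.decoder(self.tgt_emb(tgt_tokens), state)   # decode the French sentence
        return self.readout(out)                                 # scores over the target vocabulary

model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (1, 7))    # dummy token ids
tgt = torch.randint(0, 1000, (1, 9))
print(model(src, tgt).shape)            # torch.Size([1, 9, 1000])
```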

How do neural networks remember things?

Recurrent networks cannot remember things for long periods of time

The cortex can only remember things for 20 seconds

We need a hippocampus (a separate memory module)

LSTM [Hochreiter 1997], register

Memory Networks [Weston et al., 2014] (FAIR), Associative Memory

Stacked Augmented Recurrent Neural Networks [Joulin and Mikolov, 2014] (FAIR)

NTM [DeepMind, 2014], “Tapes”.

Memory/stack-augmented recurrent networks

Stack-augmented RNN

Weakly supervised MemNN:

Find an available storage location.

Memory Networks [Weston, Chopra, Bordes, 2014]

Adding short-term memory to the network
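A minimal sketch of the core operation such memory modules rely on: soft attention over a set of stored vectors. This is a simplification, not the exact MemNN of [Weston et al., 2014]; the names and sizes are mine.

```python
import torch
import torch.nn.functional as F

def memory_read(query, keys, values):
    """Soft lookup into an external memory: attend over memory slots
    with a softmax and return a weighted sum of their contents."""
    scores = keys @ query                 # similarity of the query to each memory slot
    weights = F.softmax(scores, dim=0)    # soft addressing
    return weights @ values               # retrieved vector

keys = torch.randn(10, 64)     # 10 memory slots, 64-d addresses
values = torch.randn(10, 64)   # contents stored in each slot
query = torch.randn(64)
print(memory_read(query, keys, values).shape)   # torch.Size([64])
```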


Obstacles to artificial intelligence

The four missing pieces of AI (besides computing power)

A theoretical understanding of deep learning

What is the geometry of objective functions in deep networks?

Why is the ConvNet architecture so good? [Mallat, Bruna, Tygert..]

Integrating representation/deep learning with reasoning, attention, planning, and memory

Much research has focused on reasoning/planning, attention, memory, and learning “algorithms”

Memory-augmented neural networks: "differentiable" algorithms

Combine supervised, unsupervised, and reinforcement learning into a single “algorithm”

If they work, Boltzmann machines could be very useful.

Stacked what-where autoencoders, ladder networks, etc.

Discovering the structure and regularities of the world through observation, the way animals and humans do.

The mysterious geometry of objective functions

Deep Networks with ReLUs and Max Pooling

A stack of linear transforms interspersed with max operators

Point-wise ReLUs

Max pooling

"Switches" from one layer to the next

Deep Networks and ReLUs: The objective function is a piecewise polynomial function

If we use a hinge loss, the delta depends on Yk.

A piecewise polynomial in W with random coefficients

A lot is known about the distribution of critical points of polynomials on the sphere with random (Gaussian) coefficients [Ben Arous et al.]

Random Matrix Theory of High-Order Spherical Spin Glasses

Random Matrix Theory

Deep Networks and ReLUs: The objective function is a piecewise polynomial function

Train a scaled-down (10×10) MNIST 2-layer network from multiple initial conditions. Measure the loss on the test set.

Reinforcement learning, supervised learning, unsupervised learning: three types of learning

Three types of learning

Reinforcement Learning

The machine occasionally predicts a scalar reward

A few bits for some samples

Supervised Learning

The machine predicts a category or a few numbers for each input

10 to 10,000 bits per sample

Unsupervised Learning

The machine predicts any part of its input from any observed part

Predicting future frames in videos

Millions of bits per sample

How much information does the machine need to predict?

Reinforcement Learning (the cherry)

The machine occasionally predicts a scalar reward

A few bits for some samples

Supervised Learning (the icing)

The machine predicts a category or a few numbers for each input

10 to 10,000 bits per sample

Unsupervised Learning (the cake)

The machine predicts any part of its input from any observed part

Predicting future frames in videos

Millions of bits per sample

Unsupervised learning is the "dark matter" of artificial intelligence

Almost all learning performed by animals and humans is unsupervised learning.

We learn about the workings of the world through observation;

We learn that the world is three-dimensional;

We know that objects can move independently of each other;

We know that objects are permanent.

We learn how to predict the world one second or one hour from now.

We build world models through predictive unsupervised learning

Such a prediction model gives us a "common sense" understanding

Unsupervised learning allows us to learn about the laws of the world.

Common sense acquired through unsupervised learning

Learning about the world’s prediction models gives us common sense;

If we say: “Gérard picks up his bag and leaves the room”, you can infer:

Gérard stood up, stretched out his arms, walked to the door, opened it, and walked out.

He and his bag are no longer in the room.

He couldn't have disappeared or flown away.

Unsupervised Learning

Energy-based unsupervised learning

Energy function: takes low values on the data manifold and higher values everywhere else

Push down on the energy of desired outputs;

Push up on everything else;

Generative Adversarial Networks


Laplacian GAN: LAPGAN (aka "EyeScream")

Learning to Generate Images [Denton et al., NIPS 2015]

The generator outputs the image represented by the Laplacian pyramid coefficients

The discriminator learns how to distinguish between real and fake Laplacian images.
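A minimal PyTorch sketch of one generator/discriminator training step in an ordinary GAN (fully connected here for brevity; LAPGAN itself works on Laplacian pyramid levels with convolutional nets). The sizes, learning rates, and the data batch are placeholders.

```python
import torch
import torch.nn as nn

# Toy generator and discriminator, for illustration only.
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 28 * 28), nn.Tanh())
D = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.rand(32, 28 * 28)          # stand-in for a batch of real images
noise = torch.randn(32, 16)
fake = G(noise)

# Discriminator step: push real toward 1, fake toward 0
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to fool the discriminator into calling fakes real
g_loss = bce(D(fake), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```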

"EyeScream"

"EyeScream"/"LAPGAN"

Discovering patterns

DCGAN: Generating Images via Adversarial Training

[Radford, Metz, Chintala, 2015]

Input: random numbers;

Output: Bedroom



Navigating the manifold

DCGAN: Generating Images via Adversarial Training

Training with comic characters

Interpolation between characters

Facial Algebra (in DCGAN space)

DCGAN: Generating Images via Adversarial Training

[Radford, Metz, Chintala, 2015]

Unsupervised Learning: Video Prediction

Unsupervised learning is the "dark matter" of artificial intelligence

Unsupervised learning is the only form of learning that can provide enough information to train neural networks with billions of parameters.

Supervised learning requires too much labeling effort

Reinforcement learning requires too many attempts

But we don't know how to do unsupervised learning (or even how to formalize it).

We have so many ideas and methods

But they don’t work very well

Why is it so difficult? Because the world is inherently unpredictable.

The predictor produces the average of all possible futures - a fuzzy image

ConvNet Multi-Scale Video Prediction

Input: 4 to 8 frames → ConvNet (no pooling) → output: 1 to 8 frames

Cannot use mean squared error: it produces blurry predictions

The world is inherently unpredictable, and MSE training predicts the average of possible future situations: blurred images
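A one-dimensional toy check of this point: if two futures are equally likely, the prediction that minimizes mean squared error is their average, not either sharp outcome.

```python
import numpy as np

# Two equally likely futures for the next frame (toy 1-pixel "images").
futures = np.array([0.0, 1.0])        # the pixel is either dark or bright

preds = np.linspace(-0.5, 1.5, 201)
mse = [np.mean((futures - p) ** 2) for p in preds]
print(preds[np.argmin(mse)])          # 0.5: the MSE-optimal prediction is the blurry average
```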

ConvNet Multi-Scale Video Prediction


ConvNet Multi-Scale Video Prediction



ConvNet Multi-Scale Video Prediction

Comparison with LSTM-based results [Srivastava et al., 2015]

Unsupervised learning prediction

Some results have been achieved in "adversarial training"

But we are still far from a complete solution.

Predictive Learning


Machine intelligence and artificial intelligence will be very different

What will artificial intelligence look like?

Human and animal behavior has evolutionary-innate drives

Fight/flight, hunger, self-preservation, pain avoidance, desire for social interaction, etc.

Much of the harm humans do to each other is caused by these drives.

Violent behavior when threatened, desire for material resources and social power, etc.

However, AI systems do not have these driving forces unless we configure them into the system.

It is difficult for us to imagine intelligent entities without a driving force.

Although we have many examples in the animal world.


How do we align the "ethical values" of AI with human values?

We will establish some fundamental, immutable, inherent drivers:

Human trainers will associate rewards with behaviors that make the humans around them happy and comfortable.

This is how children (and social animals) learn to behave in society.

Can we prevent unsafe AI?

Yes, just like we guard against potentially dangerous airplanes and cars.

How to produce artificial intelligence at the same level as humans?

The emergence of human-level AI will not be an isolated “event.”

It will be gradual

It doesn’t happen in isolation either.

No organization has a monopoly on good ideas.

Advanced artificial intelligence is now a scientific problem rather than a technological challenge.

Building unsupervised learning is our biggest challenge

Individual breakthroughs will be quickly replicated

Artificial intelligence research is a global community.

Most good ideas come from academia

Although the most impressive applications come from industry

It is important to distinguish between intelligence and autonomy

The smartest systems are not autonomous.

In conclusion

Deep learning is leading a wave of applications

Today: image recognition and video understanding (vision now works)

Today: better speech recognition (speech recognition now works)

In the near future: better language understanding, conversation and translation will be possible

Deep learning and convolutional networks are being widely used

Today: Image understanding capabilities are already widely used by Facebook, Google, Twitter, and Microsoft

In the near future: autonomous driving, medical image analysis, and robot perception will become possible

We need to find hardware (and software) for embedded applications

For digital cameras, mobile devices, cars, robots and toys.

We are still a long way from inventing truly intelligent machines.

We need to integrate reasoning with deep learning.

We need a good "episodic" (short-term) memory.

We need to find good theoretical principles to support unsupervised learning.

via: New Intelligence
