Source: Yann LeCun. Compiled by: MiLi.

Yann LeCun is the inventor of convolutional neural networks and the head of Facebook AI Research. The 150 slides below present his comprehensive and detailed thinking on the field of deep learning. LeCun is very optimistic about unsupervised learning and believes it is the only form of learning that can provide enough information to train neural networks with billions of parameters. But he also believes it is very hard to do well because, after all, the world is inherently unpredictable. Let's see what LeCun has in store for us in these 150 slides.

Deep Learning, by Yann LeCun, Courant Institute of Mathematical Sciences, New York University, and Facebook AI Research
The brain is an existence proof of intelligent machines, just as birds and bats were an existence proof of heavier-than-air flight. The brain versus today's high-speed processors: can we build artificial intelligence systems by copying the brain? Are computers only a factor of 10,000 away from the power of the brain? More likely a factor of 1 million, because synapses are complex. A factor of 1 million is 30 years of Moore's Law.

It is good to take inspiration from biology, but simply copying biology without understanding the underlying principles is doomed to fail. Airplanes were inspired by birds and use the same basic principles of flight, but airplanes do not flap their wings and do not have feathers.

Let's draw inspiration from nature, but we don't need to copy it. It is good to imitate nature, but we also need to understand it. For airplanes, we developed aerodynamics and compressible fluid dynamics, and we learned that feathers and wing flapping are not crucial.

1957: the Perceptron (the first learning machine). A simple simulated neuron with adaptive "synaptic weights" computes a weighted sum of its inputs and outputs +1 if the weighted sum is above a threshold, and -1 otherwise. Perceptron learning algorithm.

Usual machine learning (supervised learning): design a machine with adjustable knobs (like the weights in the perceptron); pick a training sample, run it through the machine, and measure the error; figure out in which direction to adjust the knobs to lower the error; repeat with all the training samples until the knobs stabilize.
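The perceptron rule described above can be sketched in a few lines. This is a minimal illustration, not the 1957 implementation: the threshold is folded into a bias term, and the function names, the learning rate, and the toy AND dataset are assumptions chosen for the example.

```python
def predict(weights, bias, x):
    """Output +1 if the weighted sum is above the threshold (folded into bias), else -1."""
    weighted_sum = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if weighted_sum > 0 else -1


def train_perceptron(samples, epochs=20, lr=1.0):
    """Adjust the 'knobs' only when a sample is misclassified."""
    n_inputs = len(samples[0][0])
    weights, bias = [0.0] * n_inputs, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            if predict(weights, bias, x) != y:
                # Move the weights toward the correct side of the boundary.
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
                mistakes += 1
        if mistakes == 0:  # the knobs have stabilized
            break
    return weights, bias


# Hypothetical toy data: logical AND with labels in {-1, +1}.
AND_DATA = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
weights, bias = train_perceptron(AND_DATA)
```

Because the AND data is linearly separable, the loop stops once every sample is classified correctly.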
Usual machine learning (supervised learning): design a machine with adjustable knobs; pick a training sample, run it through the machine, and measure the error; adjust the knobs to reduce the error; repeat until the knobs stabilize.

Machine learning = function optimization. It is like walking in the mountains in thick fog and reaching the village in the valley by always walking in the direction of steepest descent; but each sample gives us only a noisy estimate of that direction, so our path is quite random.

Generalization: recognizing situations not seen during training. After training, test the machine with samples it has never seen before.
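The "noisy downhill walk" can be made concrete with a one-knob machine. The sketch below is an assumption-laden toy, not taken from the slides: the machine has a single weight w, the per-sample error is (w - x)^2, and the step size shrinks as 0.5/t. With that particular schedule the knob is exactly the running mean of the samples seen so far, so the path is random but settles near the minimum of the average error.

```python
import random


def sgd_fit(samples, steps=5000, seed=0):
    """Stochastic gradient descent on the average of (w - x)^2 over the samples."""
    rng = random.Random(seed)
    w = 0.0
    for t in range(1, steps + 1):
        x = rng.choice(samples)   # pick one training sample at random
        grad = 2.0 * (w - x)      # noisy estimate of the downhill direction
        lr = 0.5 / t              # decreasing step size (an assumed schedule)
        w -= lr * grad
    return w


# The average error is minimized at the mean of the data, here 3.0.
data = [1.0, 2.0, 3.0, 4.0, 5.0]
w_final = sgd_fit(data)
```

Each individual step points toward a single sample rather than toward the true minimum, which is why the trajectory wanders before it settles.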
We can train a machine with many examples of tables, chairs, dogs, cats, and people; but can the machine recognize tables, chairs, dogs, cats, and people that it has never seen before?

Machine learning at scale, the reality: billions of "knobs" (or "weights"); thousands of categories; millions of training samples. Recognizing each sample may take billions of operations, but these operations are just simple multiplications and additions.

The traditional model of pattern recognition (since the late 1950s): fixed, hand-designed features (or a fixed kernel) + a trainable classifier. Perceptron (Cornell University, 1957).

Deep learning = the entire machine is trainable. Traditional pattern recognition: fixed, handcrafted feature extractor. Mainstream modern pattern recognition: unsupervised mid-level features. Deep learning: representations are hierarchical and trained.

Deep learning = learning hierarchical representations. A model is deep if it has more than one stage of nonlinear feature transformation. Feature visualization of a convolutional network trained on ImageNet [Zeiler & Fergus 2013].

Trainable feature hierarchy: the level of abstraction increases with each level of the hierarchy, and each stage is a trainable feature transform.
Image recognition: pixel → edge → texton → motif → part → object
Text: character → word → word group → clause → sentence → story
Speech: sample → spectral band → sound → ... → phone → phoneme → word

Shallow vs. deep == lookup table vs. multi-step algorithm. "Shallow and wide" vs. "deep and narrow" == "more memory" vs. "more time". Few functions can be computed in two steps without an exponentially large lookup table; using more than two steps can reduce the "memory" by an exponential factor.

How does the brain interpret images?
The ventral (recognition) pathway in the visual cortex has multiple stages: retina - LGN - V1 - V2 - V4 - PIT - AIT ...

Multi-layer neural networks: multiple layers of simple units. Each unit computes a weighted sum of its inputs; the weighted sum is passed through a nonlinear function; the learning algorithm changes the weights.

Typical multi-layer neural network architecture
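The forward computation of such a network, as described above, is short enough to write out. This is a minimal sketch with assumed names and with tanh picked as the nonlinearity; the slides do not prescribe a particular function or layer sizes.

```python
import math


def layer_forward(inputs, weights, biases):
    """Each unit computes a weighted sum of its inputs, then applies a nonlinearity."""
    return [
        math.tanh(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]


def network_forward(inputs, layers):
    """Run the inputs through a stack of (weights, biases) layers."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical 2-2-1 network with arbitrary weights.
example_layers = [
    ([[0.5, -0.3], [0.1, 0.8]], [0.0, 0.1]),  # hidden layer: 2 units
    ([[1.0, -1.0]], [0.2]),                   # output layer: 1 unit
]
output = network_forward([1.0, 2.0], example_layers)
```

A learning algorithm would then adjust the entries of example_layers to reduce the error measured on the output.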
Building a network by assembling modules: all major deep learning frameworks use modules (inspired by SN/Lush, 1991): Torch7, Theano, TensorFlow, ...

Computing gradients by back-propagation: a practical application of the chain rule.
Back-propagating the gradient with respect to the states:
● dC/dX_{i-1} = dC/dX_i · dX_i/dX_{i-1}
● dC/dX_{i-1} = dC/dX_i · dF_i(X_{i-1}, W_i)/dX_{i-1}
Back-propagating the gradient with respect to the weights:
● dC/dW_i = dC/dX_i · dX_i/dW_i
● dC/dW_i = dC/dX_i · dF_i(X_{i-1}, W_i)/dW_i

Does any architecture work? Any connection graph is permitted, as long as it is a directed acyclic graph (DAG); recurrent networks need to be "unfolded in time". Any module is permitted, as long as it is continuous and differentiable almost everywhere with respect to its parameters and its non-terminal inputs. Most frameworks provide automatic differentiation (Theano, Torch7+autograd, ...): programs are turned into computation DAGs and differentiated automatically.

The objective function of a multi-layer network is non-convex. Example: the 1-1-1 network Y = W1*W2*X. Objective: learn the identity function with a quadratic loss. With a single sample X=1, Y=1: L(W) = (1 - W1*W2)^2.

Convolutional networks (ConvNet or CNN for short). Convolutional network architecture; multiple convolutions. Animation: Andrej Karpathy, URL: //cs231n.github.io/convolutional-networks/

Convolutional network (vintage 1990): filters-tanh → pooling → filters-tanh → pooling → filters-tanh

Hubel and Wiesel's model of the visual cortex: simple cells detect local features, and complex cells "pool" the outputs of nearby simple cells in the visual cortex [Fukushima 1982] [LeCun 1989, 1998] [Riesenhuber 1999], etc.
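The chain-rule formulas and the 1-1-1 example above can be checked by hand. The sketch below applies back-propagation to Y = W1*W2*X with the quadratic loss; the function names and the choice of which weight plays the first layer are assumptions made for illustration.

```python
def loss(w1, w2, x=1.0, y=1.0):
    """Quadratic loss of the 1-1-1 network: L(W) = (y - w1*w2*x)^2."""
    return (y - w1 * w2 * x) ** 2


def gradients(w1, w2, x=1.0, y=1.0):
    """Back-propagate through X1 = w1*x and X2 = w2*X1."""
    # Forward pass.
    x1 = w1 * x
    x2 = w2 * x1
    # Backward pass: start from dC/dX2 and apply the chain rule.
    dc_dx2 = -2.0 * (y - x2)
    dc_dw2 = dc_dx2 * x1   # dC/dW2 = dC/dX2 * dX2/dW2
    dc_dx1 = dc_dx2 * w2   # dC/dX1 = dC/dX2 * dX2/dX1
    dc_dw1 = dc_dx1 * x    # dC/dW1 = dC/dX1 * dX1/dW1
    return dc_dw1, dc_dw2


# Non-convexity: (1, 1) and (-1, -1) both reach zero loss, yet their midpoint
# (0, 0) has loss 1 and zero gradient, so it is a saddle point, not a minimum.
minimum_a = loss(1.0, 1.0)
minimum_b = loss(-1.0, -1.0)
midpoint = loss(0.0, 0.0)
saddle_gradient = gradients(0.0, 0.0)
```

A convex function could never be higher at the midpoint of two minimizers than at the minimizers themselves, which is the sense in which this small objective is non-convex.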
Overall architecture: multiple stages of normalization → filter bank → nonlinearity → pooling.
Normalization: variations on whitening (optional). Subtractive: average removal, high-pass filtering. Divisive: local contrast normalization, variance normalization.
Filter bank: dimension expansion, projection onto an overcomplete basis.
Nonlinearity: sparsification, saturation, lateral inhibition, etc.; rectification (ReLU), component-wise shrinkage, tanh.
Pooling: aggregation over space or feature type.

LeNet1 demonstration, 1993.

Multi-character recognition [Matan et al., 1992]: every layer is a convolution. ConvNet sliding window + weighted finite-state machine (FSM).

Check reader (Bell Labs, 1995): a Graph Transformer Network trained to read check amounts, trained globally with a negative log-likelihood loss. 50% correct, 49% rejected, 1% error (detectable later in the process). Deployed in many banks in the US and Europe since 1996; processed about 10% to 20% of all handwritten checks in the US in the early 2000s.

Face detection [Vaillant et al. 1993, 1994]: ConvNet applied to large images, heatmaps at multiple scales, non-maximum suppression for candidates; 6 seconds for a 256×256 image on a SPARCstation.

Simultaneous face detection and pose estimation
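One stage of the normalization → filter bank → nonlinearity → pooling pipeline listed above can be sketched on a 1-D signal. This is a simplified illustration under stated assumptions: normalization is global rather than local, the "convolution" is the sliding dot product commonly used in ConvNets (no kernel flip), and the two edge-detecting kernels are made up for the example.

```python
def normalize(signal):
    """Subtractive then divisive normalization (global here, local in practice)."""
    mean = sum(signal) / len(signal)
    centered = [s - mean for s in signal]
    std = (sum(c * c for c in centered) / len(centered)) ** 0.5
    return [c / (std + 1e-8) for c in centered]


def convolve(signal, kernel):
    """Valid sliding dot product of the kernel over the signal."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu(values):
    """Rectification: keep positive responses, zero out the rest."""
    return [max(0.0, v) for v in values]


def max_pool(values, size):
    """Non-overlapping max pooling over windows of the given size."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


def stage(signal, filter_bank, pool_size=2):
    """One stage: normalize, apply each filter, rectify, pool. One feature map per filter."""
    normed = normalize(signal)
    return [max_pool(relu(convolve(normed, kernel)), pool_size) for kernel in filter_bank]


# Hypothetical filter bank: a rising-edge and a falling-edge detector.
feature_maps = stage([0, 0, 1, 1, 0, 0, 1, 1], [[-1.0, 1.0], [1.0, -1.0]])
```

Stacking several such stages, each feeding its feature maps to the next, gives the multi-stage architecture the slide describes.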
Scene parsing and labeling: multi-scale ConvNet architecture. Each output sees a large input context; trained with supervision on fully labeled images. Method 1: majority voting over superpixel regions.
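Majority voting over superpixel regions can be illustrated as follows. This is a sketch under assumptions: the per-pixel labels are taken as already predicted by the ConvNet, the superpixel assignment comes from some separate over-segmentation step, and both are flattened into plain lists for brevity.

```python
from collections import Counter


def majority_vote(pixel_labels, superpixel_ids):
    """Relabel every pixel with the most frequent label inside its superpixel."""
    votes = {}
    for label, region in zip(pixel_labels, superpixel_ids):
        votes.setdefault(region, Counter())[label] += 1
    winners = {region: counts.most_common(1)[0][0] for region, counts in votes.items()}
    return [winners[region] for region in superpixel_ids]


# Hypothetical example: six pixels, two superpixels, one noisy label in each.
labels = ["road", "road", "sky", "sky", "sky", "road"]
regions = [0, 0, 0, 1, 1, 1]
smoothed = majority_vote(labels, regions)
```

The vote removes isolated per-pixel mistakes while keeping region boundaries aligned with the superpixels.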
Scene parsing and labeling: with no post-processing, frame by frame, the ConvNet runs at 50 ms per frame on Virtex-6 FPGA hardware; the speed is limited by communication over Ethernet.
Pre-processing (125 ms): ground plane estimation, horizon leveling, conversion to YUV + local contrast normalization, scale-invariant pyramid of distance-normalized image "bands".

Convolutional network architecture: 100 features per 3×12×25 input window; YUV image bands 20-36 pixels tall and 36-500 pixels wide.

Convolutional networks for visual object recognition: in the mid-2000s, ConvNets achieved quite good results on object classification. Dataset: "Caltech101", 101 categories, 30 training samples per category. But the results were slightly worse than those of more "traditional" computer vision methods, for two reasons: 1. the dataset was too small; 2. computers were too slow.

Then, two things happened...
The ImageNet dataset [Fei-Fei et al., 2012]: 1.2 million training samples, 1000 categories.
Fast, programmable general-purpose GPUs: capable of 1 trillion operations per second.

Very deep ConvNets for object recognition: 100 million to 1 billion connections, 10 million to 1 billion parameters, 8 to 20 layers.

Very deep ConvNets trained on GPUs. ImageNet top-5 error rates: 15%; [Sermanet et al. 2013] 13.8%; VGGNet [Simonyan, Zisserman 2014] 7.3%; GoogLeNet [Szegedy et al. 2014] 6.6%; ResNet [He et al. 2015] 5.7%.

Very deep ConvNet architectures: small kernels, not much subsampling (fractional subsampling).

Kernels, first layer (11×11). First layer: 3×96 kernels, RGB → 96 feature maps, 11×11 kernels, stride 4.

Learning in action: how the first-layer filters are learned.

ImageNet classification: name the main objects in the image. Top-5 error rate: a prediction counts as an error if the correct class is not among the top 5 guesses.
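The top-5 error rate defined above is easy to compute from per-class scores. The function below is a minimal sketch with assumed names; k is left as a parameter so the same code gives top-1 error.

```python
def top_k_error(scores_per_image, true_labels, k=5):
    """Fraction of images whose true class is not among the k highest-scoring classes."""
    errors = 0
    for scores, truth in zip(scores_per_image, true_labels):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if truth not in ranked[:k]:
            errors += 1
    return errors / len(true_labels)


# Hypothetical scores for two images over three classes.
scores = [[0.1, 0.5, 0.4], [0.7, 0.2, 0.1]]
truth = [2, 1]
top1 = top_k_error(scores, truth, k=1)  # both first guesses are wrong
top2 = top_k_error(scores, truth, k=2)  # both true classes are in the top 2
```

Allowing five guesses makes the metric tolerant of images that contain several plausible objects.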
Red: ConvNet. Blue: not ConvNet.

ConvNets for object recognition and localization.

Classification + localization with a multi-scale sliding window: apply a ConvNet with a sliding window over the image at multiple scales; sliding a ConvNet over an image is cheap. For each window, predict a class and bounding-box parameters. Even if the object is not completely within the window, the ConvNet can predict what it thinks the object is.

Results: pre-trained on ImageNet1K, fine-tuned on ImageNet detection.

Detection examples.

DeepFace [Taigman et al., CVPR 2014]: alignment, ConvNet, metric learning. Deployed at Facebook for auto-tagging: 800 million photos per day.

Metric learning with a Siamese architecture: contrastive objective function. Similar objects should produce outputs that are close to each other, and dissimilar objects should produce outputs that are far apart. Dimensionality reduction by learning an invariant mapping [Chopra et al., CVPR 2005] [Hadsell et al., CVPR 2006].

Person recognition and pose estimation.

Image captioning: generating descriptive sentences.

C3D: video classification with 3D ConvNets.

Segmenting and localizing objects (DeepMask) [Pinheiro, Collobert, Dollar, ICCV 2015]: a ConvNet produces object masks. DeepMask++ proposals. Recognition pipeline. Training: 2.5 days on 8×4 Kepler GPUs with EASGD [Zhang, Choromanska, LeCun, NIPS 2015]. Results.

Supervised ConvNets that draw pictures: generating images with ConvNets. Drawing chairs; chair arithmetic in feature space.

ConvNets for speech recognition.

Speech recognition with convolutional networks (New York University/IBM). Acoustic model: a 7-layer ConvNet with 54.4 million parameters that classifies the acoustic signal into 3,000 context-dependent subphonemic categories. ReLU units + dropout in the last layers. Trained on a GPU for 4 days.

Training samples: 40 Mel-frequency cepstral coefficients; window: 40 frames of 10 milliseconds each.

First-layer convolution kernels: 64 kernels of size 9×9.
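The contrastive objective used with the Siamese architecture earlier in this section (similar pairs pulled together, dissimilar pairs pushed apart) can be written down directly. The sketch below follows the general form of the loss in Hadsell et al. (2006); the function names, the margin value, and the use of plain lists as embeddings are assumptions for illustration.

```python
import math


def euclidean(a, b):
    """Distance between the two outputs of the Siamese network."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(output_a, output_b, similar, margin=1.0):
    """Pull similar pairs together; push dissimilar pairs at least `margin` apart."""
    d = euclidean(output_a, output_b)
    if similar:
        return 0.5 * d * d
    return 0.5 * max(0.0, margin - d) ** 2


# Similar pair far apart: penalized. Dissimilar pair beyond the margin: no penalty.
pull = contrastive_loss([0.0, 0.0], [3.0, 4.0], similar=True)
push = contrastive_loss([0.0, 0.0], [3.0, 4.0], similar=False)
```

The margin keeps dissimilar pairs from being pushed apart indefinitely, so training effort concentrates on pairs that are still confusable.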
Multilingual recognizer, multi-scale input, large context window.

ConvNets are everywhere (or soon will be).

ConvNet chips: NVIDIA, Intel, Teradeep, Mobileye, Qualcomm, and Samsung are currently developing ConvNet chips, as are many startups: Movidius, Nervana, etc. In the near future, ConvNets will drive cars.

NVIDIA: ConvNet-based driver assistance. Drive-PX2: an open platform for driver assistance systems; an embedded supercomputer delivering 42 TOPS (= 150 MacBook Pros).

Mobileye: ConvNet-based driver assistance, deployed in the Tesla Model S and Model X.

ConvNets in connectomics [Jain, Turaga, Seung, 2007]: a 3D ConvNet on volumetric images uses a 7×7×7 neighborhood of voxels to label each voxel as "membrane" or "non-membrane"; this has become the standard method in connectomics.

Brain tumor detection: cascaded-input CNN architecture, 802,368 parameters, trained on 30 patients; results reported on BRATS 2013.

Predicting DNA/RNA-protein binding with ConvNets: "Predicting DNA- and RNA-binding protein sequence specificity by deep learning", Nature Biotechnology, July 2015, by B Alipanahi, A Delong, M Weirauch, B Frey.

Deep learning is everywhere (ConvNets are everywhere): many applications at Facebook, Google, Microsoft, Baidu, Twitter, IBM, etc. Image recognition for photo collection search. Image/video content filtering: spam, nudity, and violence. Search and newsfeed ranking.

People upload 800 million pictures to Facebook every day (2 billion images per day if we include Instagram, Messenger, and WhatsApp). Every photo on Facebook goes through two ConvNets within 2 seconds: one for image recognition and tagging, the other for face recognition (not activated in Europe).

In the near future, ConvNets will be everywhere: self-driving cars, medical imaging, augmented reality, mobile devices, smart cameras, robots, toys, and more.
Embedding the world: thought vectors. "My neighbor's Samoyed looks like a Siberian Husky."
Embedding the world: Instagram video embedding.
Any object, concept, or "idea" can be represented by a vector: [-0.2, 0.3, -4.2, 5.1, …..] represents the concept "cat", and [-0.2, 0.4, -4.0, 5.1, …..] represents the concept "dog". The two vectors are very similar because cats and dogs share many attributes.

Reasoning consists of manipulating thought vectors: comparing vectors for question answering, information retrieval, and content filtering; combining and transforming vectors for reasoning, planning, and language translation. Memories store thought vectors; MemNN (Memory Neural Network) is a good example. At FAIR, we want to "embed the world" in thought vectors.

Natural language understanding: can text be embedded? [Bengio 2003] [Collobert and Weston 2010] Predict a word from the words before and after it.

Compositional semantic properties: Tokyo - Japan = Berlin - Germany; Tokyo - Japan + Germany = Berlin.

Question answering systems.

Language translation with LSTM networks: multi-layer LSTM recurrent modules read and encode the English sentence, then generate the French sentence once the end of the English sentence is reached. Accuracy is very similar to the current state of the art.

How do neural networks remember things? Recurrent networks cannot remember things for long; the cortex can only remember things for 20 seconds. We need a "hippocampus" (a separate memory module): LSTM [Hochreiter 1997], registers; Memory Networks [Weston et al., 2014] (FAIR), associative memory; Stack-Augmented Recurrent Neural Networks [Joulin and Mikolov, 2014] (FAIR); NTM [DeepMind, 2014], "tape".

Memory/stack-augmented recurrent networks: stack-augmented RNN; weakly supervised MemNN, which looks up the memory location to use.
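Comparing and combining thought vectors, as described above, reduces to simple vector arithmetic. The sketch below is a toy: the "cat" and "dog" vectors use only the four components shown in the text (the remaining components are elided there and are not filled in), and the two-dimensional city and country embeddings are invented so that the analogy works out exactly.

```python
import math

# Only the four components given in the text; the rest are unknown.
CAT = [-0.2, 0.3, -4.2, 5.1]
DOG = [-0.2, 0.4, -4.0, 5.1]

# Hypothetical 2-D embeddings: axis 0 = "is a capital", axis 1 = which country.
EMBEDDINGS = {
    "Japan": (0.0, 1.0), "Tokyo": (1.0, 1.0),
    "Germany": (0.0, 2.0), "Berlin": (1.0, 2.0),
    "France": (0.0, 3.0), "Paris": (1.0, 3.0),
}


def cosine_similarity(a, b):
    """Close to 1 when two vectors point in nearly the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, embeddings):
    """Return the word nearest to a - b + c, excluding the three input words."""
    target = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]

    def distance(word):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(embeddings[word], target)))

    candidates = [w for w in embeddings if w not in (a, b, c)]
    return min(candidates, key=distance)


similarity = cosine_similarity(CAT, DOG)
answer = analogy("Tokyo", "Japan", "Germany", EMBEDDINGS)  # Tokyo - Japan + Germany
```

In learned embeddings the analogy only holds approximately, which is why the lookup returns the nearest word rather than testing for equality.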
Memory Networks [Weston, Chopra, Bordes, 2014]: adding a short-term memory to the network.

Obstacles to artificial intelligence: the four missing pieces of AI (besides computing power).
Theoretical understanding of deep learning: what is the geometry of the objective function in deep networks? Why does the ConvNet architecture work so well? [Mallat, Bruna, Tygert ...]
Integrating representation/deep learning with reasoning, attention, planning, and memory: a lot of recent work on reasoning/planning, attention, memory, and learning "algorithms"; memory-augmented neural networks; "differentiable" algorithms.
Integrating supervised, unsupervised, and reinforcement learning into a single "algorithm": Boltzmann machines would be nice if they worked; stacked what-where auto-encoders, ladder networks, etc.
Discovering the structure and regularities of the world by observing it and living in it, as animals and humans do.

The mysterious geometry of the objective function.

Deep networks with ReLUs and max pooling: a stack of linear transforms interspersed with max operators. Point-wise ReLUs; max pooling; switches from one layer to the next.

Deep networks with ReLUs: the objective function is a piecewise polynomial. With a hinge loss, the delta term depends on the label Yk, and the objective is a piecewise polynomial in W with random coefficients. A lot is known about the distribution of critical points of polynomials on the sphere with random (Gaussian) coefficients [Ben Arous et al.]: high-order spherical spin glasses, random matrix theory.

Experiment: train a 2-layer network on scaled-down (10×10) MNIST from multiple initial conditions, and measure the loss on the test set.
Three types of learning: reinforcement learning, supervised learning, and unsupervised learning.
Reinforcement learning: the machine predicts a scalar reward given once in a while; a few bits for some samples.
Supervised learning: the machine predicts a category or a few numbers for each input; 10 to 10,000 bits per sample.
Unsupervised learning: the machine predicts any part of its input for any observed part, for example predicting future frames in videos; millions of bits per sample.

How much information does the machine need to predict? Reinforcement learning is the cherry, supervised learning is the icing, and unsupervised learning is the cake.

Unsupervised learning is the "dark matter" of artificial intelligence. Almost all learning performed by animals and humans is unsupervised. We learn how the world works by observing it: we learn that the world is three-dimensional; we learn that objects can move independently of each other; we learn object permanence; we learn to predict what the world will look like one second or one hour from now. We build a model of the world through predictive unsupervised learning, and this predictive model gives us "common sense". Unsupervised learning lets us discover the regularities of the world.

Common sense through unsupervised learning: learning a predictive model of the world gives us common sense. If we say "Gérard picks up his bag and leaves the room", you can infer that Gérard stood up, extended his arm, walked to the door, opened it, and walked out; that he and his bag are no longer in the room; and that he did not disappear or fly away.
Unsupervised learning: energy-based unsupervised learning. The energy function takes low values on the data manifold and higher values everywhere else. Push down on the energy of desired outputs; push up on everything else.

Generative adversarial networks.

Laplacian GAN: LAPGAN (aka EYESCREAM). Learning to generate images [Denton et al., NIPS 2015]: the generator outputs the coefficients of a Laplacian pyramid representation of the image, and the discriminator learns to distinguish real from fake Laplacian images. "EyeScream"/"LAPGAN": discovering regularities.

DCGAN: generating images via adversarial training [Radford, Metz, Chintala, 2015]. Input: random numbers; output: bedrooms. Navigating the manifold. Trained on manga characters; interpolating between characters. Face algebra (in DCGAN space).

Unsupervised learning: video prediction.

Unsupervised learning is the "dark matter" of artificial intelligence. It is the only form of learning that provides enough information to train neural networks with billions of parameters: supervised learning requires too much labeling effort, and reinforcement learning requires too many trials. But we do not know how to do unsupervised learning (or even how to formulate it). We have many ideas and methods, but they do not work very well. Why is it so hard? Because the world is inherently unpredictable: a predictor produces the average of all possible futures, which is a blurry image.

ConvNet multi-scale video prediction: 4 to 8 input frames → ConvNet without pooling → 1 to 8 output frames. Squared error cannot be used because it gives blurry predictions: the world is inherently unpredictable, and training with MSE predicts the average of the possible futures, hence blurry images.
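Why squared error leads to blurry predictions can be seen on a single pixel. In the toy below (an illustration, not the video model from the slides), a pixel will be either black (0.0) or white (1.0) in the next frame with equal probability; the prediction that minimizes mean squared error is the gray value 0.5, which matches neither possible future.

```python
def mse(prediction, outcomes):
    """Mean squared error of one fixed prediction against all possible outcomes."""
    return sum((prediction - o) ** 2 for o in outcomes) / len(outcomes)


def best_constant_prediction(outcomes, candidates):
    """Pick the candidate prediction with the lowest mean squared error."""
    return min(candidates, key=lambda p: mse(p, outcomes))


futures = [0.0, 1.0]                              # two equally likely futures
candidates = [i / 100 for i in range(101)]        # gray levels from 0.00 to 1.00
best = best_constant_prediction(futures, candidates)  # the average of the futures
```

Applied to every pixel of a frame, this averaging is what produces the blurred image; adversarial training is one way to penalize such "in-between" predictions.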
ConvNet multi-scale video prediction: comparison with the LSTM-based approach of [Srivastava et al., 2015].

Unsupervised predictive learning: "adversarial training" has produced some results, but we are still far from a complete solution.

Predictive learning
What will artificial intelligence look like? Human and animal behavior is driven by innate drives shaped by evolution: fight or flight, hunger, self-preservation, pain avoidance, the desire for social interaction, etc. Many of the bad things humans do to each other are caused by these drives: violence when threatened, the desire for material resources and social power, etc. AI systems, however, will not have these drives unless we build them in. It is hard for us to imagine an intelligent entity without these drives, although we have plenty of examples in the animal world.
We will build in a few basic, immutable, hardwired drives: human trainers will associate rewards with behaviors that make the surrounding humans happy and comfortable. This is how children (and social animals) learn to behave in society. Can we prevent unsafe AI? Yes, just as we guard against potentially dangerous airplanes and cars.

How will human-level artificial intelligence come about? The emergence of human-level AI will not be a singular "event"; it will be gradual. Nor will it happen in isolation: no single organization has a monopoly on good ideas. Advancing AI is currently a scientific question, not a technological challenge, and formulating unsupervised learning is our biggest challenge. Individual breakthroughs will be quickly reproduced, because AI research is a global community. Most good ideas come from academia, even though the most impressive applications come from industry. It is important to distinguish intelligence from autonomy: most intelligent systems are not autonomous.

Conclusion
Deep learning is enabling a new wave of applications. Today: image recognition and video understanding work. Today: better speech recognition works. In the near future: better language understanding, dialog, and translation.
Deep learning and convolutional networks are widely deployed. Today: image understanding is already widely used at Facebook, Google, Twitter, and Microsoft. In the near future: autonomous driving, medical image analysis, and robot perception.
We need to find hardware (and software) for embedded applications: digital cameras, mobile devices, cars, robots, and toys.
We are still a long way from inventing truly intelligent machines. We need to integrate reasoning with deep learning. We need a good "episodic" (short-term) memory. We need to find good theoretical principles to support unsupervised learning.
via: New Intelligence