MIT proposes a network dissection framework to automatically peek into the black box of neural network training

New MIT technology helps illuminate the inner workings of neural networks trained on visual data.

Neural networks learn how to perform computational tasks by analyzing large training data sets, and they are responsible for many of today's best-performing artificial intelligence systems, such as speech recognition systems, automatic translators, and self-driving cars. But neural networks are black boxes, and once they are trained, even their designers do not understand how they work: what data they process and how they process it.

Two years ago, a team of computer vision researchers from MIT’s CSAIL lab described a way to peer into the black box of neural network training to recognize visual scenes. The method provided some interesting insights, but required sending the data to human reviewers via Amazon’s Mechanical Turk crowdsourcing service.

At this year's CVPR conference, CSAIL researchers will present a fully automated version of that system. Where the previous paper analyzed one type of neural network trained on one task, the new paper analyzes four networks trained on more than 20 tasks, including recognizing scenes and objects, colorizing grayscale images, and solving puzzles. Some of the newer networks are so large that analyzing them with the old method would have been prohibitively expensive.

The researchers also conducted several sets of experiments on these networks, which not only revealed characteristics of various computer-vision and computational-photography algorithms, but also provided some evidence about how the human brain itself is organized.

Neural networks take their name from a loose analogy to the human nervous system, which contains a large number of relatively simple but densely interconnected information-processing nodes. Like neurons, the nodes of a neural network receive signals from neighboring nodes and then either "fire," emitting their own signals, or remain silent. And, as with neurons, the strength of a node's firing response can vary.

In both papers, the MIT researchers took neural networks trained to perform computer-vision tasks and recorded how each node responded to different input images. They then selected the 10 input images that most strongly activated each node.
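The selection step described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the paper's code: it assumes each node's response to each image has already been summarized as a single number (for example, its maximum activation over the image), and all names here are hypothetical.

```python
import numpy as np

def top_activating_images(activations, k=10):
    """For each node (unit), return the indices of the k images
    that activate it most strongly.

    activations: array of shape (num_images, num_units), where each
        entry summarizes one unit's response to one image.
    Returns an array of shape (num_units, k), strongest image first.
    """
    # argsort sorts images in ascending order of activation per unit;
    # keep the last k rows and reverse them to get descending order.
    order = np.argsort(activations, axis=0)
    return order[-k:][::-1].T
```

The same index arrays can then be used to pull out the actual images for inspection, whether by a human reviewer or by the automated labeling system.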

In the previous paper, the researchers sent the images to humans hired through Mechanical Turk and had them identify what the images had in common. In the new paper, the researchers used a computer system to do the same.

“We catalogued more than 1,100 visual concepts, such as green, earth texture, wood, human face, bicycle wheel, snowy mountain, and so on,” said David Bau, an MIT graduate student. “We took multiple datasets that others had developed and combined them with datasets that were densely labeled with visual concepts, and we got many, many labels, and we knew which pixel corresponded to which label.”

Other authors of the paper include co-first author Bolei Zhou, Antonio Torralba, a professor of electrical engineering and computer science at MIT, Aude Oliva, a senior research scientist at CSAIL, and Aditya Khosla, a Ph.D. student of Torralba who is now CTO of the medical computing company PathAI.

The dense labeling also let the researchers determine which pixels in which images elicited the strongest response from a given network node. Today's neural networks are organized into layers: data are fed into the first layer, which processes them and passes the results to the next layer, and so on. With visual data, the input image is broken into small patches, each of which is fed to a separate input node.

For any node in a given layer of one of their networks, the researchers could trace back the pattern of activations that triggered it and thus identify the specific image pixels it was responding to. Because their system could frequently match labels to those exact groups of pixels, it could characterize a node's behavior in great detail.
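In the spirit of the Network Dissection paper, the match between a node's high-activation region and the labeled pixels for a concept can be scored with intersection-over-union (IoU). The sketch below is a simplified illustration under assumed inputs (the activation map is already upsampled to the label mask's resolution, and the threshold is given); the function names are hypothetical.

```python
import numpy as np

def concept_iou(activation_map, concept_mask, threshold):
    """Score how well a unit's highly active region overlaps the
    pixels labeled with a given concept.

    activation_map: 2-D float array of the unit's activations,
        upsampled to the resolution of the label mask.
    concept_mask: 2-D boolean array, True where the concept appears.
    threshold: activation level above which the unit counts as 'on'.
    """
    active = activation_map > threshold
    intersection = np.logical_and(active, concept_mask).sum()
    union = np.logical_or(active, concept_mask).sum()
    return intersection / union if union > 0 else 0.0

def best_concept(activation_map, concept_masks, threshold):
    """Return the (label, IoU) pair of the concept whose labeled
    pixels best overlap the unit's active region."""
    scores = {name: concept_iou(activation_map, mask, threshold)
              for name, mask in concept_masks.items()}
    return max(scores.items(), key=lambda kv: kv[1])
```

A unit would then be reported as, say, a "wheel detector" if the wheel mask achieves the highest IoU across the labeled dataset.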

In the dataset, the researchers organized the visual concepts into a hierarchy: colors and textures at the lowest level, then materials, parts, objects, and scenes. Generally speaking, the lower layers of a neural network respond to simple visual features such as color and texture, while the higher layers respond to more complex concepts.

The hierarchy also let the researchers quantify where a neural network's attention goes when it is trained for a particular task. For example, a network trained to colorize black-and-white images devotes a large share of its nodes to recognizing textures. Likewise, a network trained to track objects across video frames focuses more on object recognition than a network trained to recognize scenes does; in the tracking network, many nodes are in effect dedicated to object recognition.

The researchers' experiments also shed light on a puzzle in neuroscience. Studies of human subjects who had electrodes implanted to treat neurological disorders have shown that individual neurons in the brain can fire in response to specific visual stimuli. This idea, once known as the grandmother-neuron hypothesis, is more familiar to today's neuroscientists as the Jennifer Aniston neuron hypothesis: researchers coined the name after finding that neurons in several patients tended to respond only to depictions of particular Hollywood celebrities.

Many neuroscientists dispute this interpretation. They argue that clusters of neurons, not individual neurons, underlie sensory recognition in the brain. On this view, the Jennifer Aniston neuron is just one of many neurons that fire together in response to images of Jennifer Aniston, and it may belong to many other clusters that respond to stimuli which simply haven't been tested.

Because the MIT researchers' analysis technique is fully automated, they were able to test whether something similar happens in neural networks trained to recognize visual scenes. In addition to identifying individual network nodes tuned to particular visual concepts, they also considered randomly selected combinations of nodes. But the combinations picked out far fewer visual concepts than the individual nodes did, roughly 80 percent fewer.
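The random-combination test can be sketched as follows: instead of scoring a single node's activation map, build a "virtual unit" as a random linear combination of the nodes' maps and score it against the concept labels in exactly the same way. The snippet below is a hypothetical illustration of that construction, not the paper's code.

```python
import numpy as np

def random_combination_map(unit_maps, rng):
    """Combine per-unit activation maps with random weights to form
    a 'virtual unit', to be scored like a real unit.

    unit_maps: array of shape (num_units, H, W), one spatial
        activation map per unit.
    rng: a NumPy random Generator supplying the weights.
    Returns a single (H, W) activation map.
    """
    weights = rng.standard_normal(unit_maps.shape[0])
    # Weighted sum over the units axis collapses (num_units, H, W)
    # down to one (H, W) map.
    return np.tensordot(weights, unit_maps, axes=1)
```

If individual units were no more interpretable than arbitrary directions in activation space, such virtual units would match labeled concepts about as often as real units do; the paper's finding is that they match far less often.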

“To me, this suggests that the neural network is actually trying to approximate a grandmother neuron,” Bau said. “It’s not trying to smear the idea of grandmother all over the place; it’s trying to assign it to a single neuron. That’s an interesting hint of a structure that most people don’t believe is that simple.”

Paper: Network Dissection: Quantifying Interpretability of Deep Visual Representations

Paper link: http://netdissect.csail.mit.edu/final-network-dissection.pdf

We propose a general framework, Network Dissection, to quantify the interpretability of CNN hidden representations by evaluating the correspondence between individual hidden units and a set of semantic concepts. Given a CNN model, our proposed method scores the semantics of each hidden unit in the intermediate convolutional layers against a large dataset of visual concepts. The semantically matched units are assigned a wide range of labels, from objects, parts, and scenes to textures, materials, and colors. We use the proposed method to test the hypothesis that the interpretability of a unit is equivalent to that of a random linear combination of units, and then apply it to compare the latent representations of different networks trained to solve different supervised and self-supervised tasks. We further analyze the effect of training iterations, compare networks trained from different initializations, examine the impact of network depth and width, and measure the effect of dropout and batch normalization on the interpretability of deep visual representations. We show that the proposed method can reveal properties of CNN models and training methods that go beyond measures of their discriminative power.

