By Yoav Hollander. Compiled by Machine Heart (participants: Wu Pan, Yan Qi).

In May, researchers from Columbia University and Lehigh University published a paper titled "DeepXplore: Automated Whitebox Testing of Deep Learning Systems," proposing an automated whitebox testing method for deep learning systems called DeepXplore. See Machine Heart's earlier report, "Academia | New research proposes DeepXplore: the first systematic whitebox framework for testing real deep learning systems." Recently, Yoav Hollander gave an in-depth interpretation of this research in a blog post, and also discussed the broader topic of verifying machine-learning-based systems. Machine Heart compiled and presents that article below.

The paper "DeepXplore: Automated Whitebox Testing of Deep Learning Systems" describes a new and (I think) quite important approach to verifying machine-learning-based systems, one that goes a long way toward breaking down the boundary between the world of machine learning and the world of dynamic verification such as CDV (coverage-driven verification).

As I was reading the paper, I kept saying to myself, "Well, that's great, but they seem to have missed something." So I contacted the authors, and it turns out they have a pretty good idea of what that something is (and they plan to address some of it in future research). I will quote some of their answers in this post.

About DeepXplore

The paper describes a method for verifying systems based on deep neural networks (DNNs). From the abstract:

We present DeepXplore, a whitebox framework for systematically testing real-world deep learning systems. The framework addresses two challenges: (1) generating inputs that trigger different parts of a deep learning system's logic, and (2) identifying incorrect behaviors of deep learning systems without manual effort. First, we introduce neuron coverage for systematically measuring the parts of a deep learning system exercised by the test inputs. Next, we leverage multiple deep learning systems with similar functionality as cross-references, thus avoiding manual inspection of incorrect behaviors. We show how finding inputs that trigger differential behaviors while achieving high neuron coverage can be formulated as a joint optimization problem and solved efficiently using gradient-based optimization techniques. DeepXplore finds thousands of incorrect corner-case behaviors (such as self-driving cars crashing into guardrails and malware masquerading as benign software) in state-of-the-art deep learning models trained on five popular datasets, including driving data collected by Udacity in Mountain View and ImageNet data.

There are four main ideas here that I like:

1. Using neuron coverage as a coverage metric for DNNs
2. Checking DNNs by comparing them to other implementations
3. Driving execution to find errors while maximizing coverage and obeying constraints
4. Doing all of the above via efficient, gradient-based joint optimization on the DNNs
But for each of these, I have questions. This is not a bad thing: in fact, the fact that this paper raises so many questions is a big plus, and I am already looking forward to the follow-up research. My questions mainly relate to the "whitebox-only" nature of DeepXplore (which is, of course, also a strength). Let me go through each of the four main ideas in order, discussing why I like them and what the associated issues are.

1. Using neuron coverage

For a set of DNN runs, their coverage metric is "what proportion of neurons (DNN nodes) were activated at least once during these runs". The underlying idea is that for a given DNN input, each neuron either activates (i.e., exceeds its threshold) or stays at 0. As the paper says:

Recent studies have shown that each neuron in a DNN tends to be responsible for extracting a specific feature of its input... This finding intuitively explains why neuron coverage is a good metric for the comprehensiveness of DNN testing.

Note that these nodes correspond to features that are not necessarily describable in human language (e.g., "an object with two eyes"), but the optimization process of DNN training usually does make them correspond to some "reusable feature", i.e., something that makes the DNN useful across many different inputs. Researchers have tried to automatically activate specific neurons before, for instance in the paper "Understanding Neural Networks Through Deep Visualization", which visualizes what individual neurons do, but using neuron coverage for verification appears to be a new, and good, idea.

Possible issues: Note that neuron coverage (analogous to code coverage in software verification) is really implementation coverage, and implementation coverage is not enough, as far as we know, mainly because it does not help with bugs of omission: if your receive/transmit software module forgot to implement transmit, or forgot about "receive while transmitting", you can still achieve 100% code coverage and call it a day, until someone actually writes a transmit test (or a user tries to transmit). See these discussions of the various kinds of coverage (implementation, functional, etc.): https://blog.foretellix.com/2016/12/23/verification-coverage-and-maximization-the-big-picture/

The same is true for DNNs: if you forget to train your driving DNN on "keep right", careful turns, or people painting graffiti on buses, you can reach 100% neuron coverage on a system that still has a lot of bugs.

Author response: We completely agree with you. Full neuron coverage (like code coverage) does not guarantee that all possible bugs will be found. That said, we are also thinking about extending the definition of neuron coverage to include other kinds of coverage (such as neuron-path coverage, analogous to path coverage in traditional software).

I think extending neuron coverage can help somewhat. Suppose (and this is a big simplification) our driving DNN has one neuron for detecting "dog on the road" and another for "cat on the road". Then, in addition to covering each of them individually, we would also want to cover their negations ("no dog"), their combinations ("a dog and a cat", "a cat and no dog") and their sequences ("a cat and then a dog" - this is roughly the neuron-path coverage mentioned above). However, all of this is still implementation coverage; see further comments below.
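To make the basic metric concrete, here is a minimal sketch of how neuron coverage could be tracked for a small PyTorch model. This is my own illustration, not the authors' implementation; the toy network, the choice of which layers count as "neurons", and the 0.0 activation threshold are all assumptions.

```python
import torch
import torch.nn as nn

# Toy stand-in for a real DNN under test (illustrative only).
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

threshold = 0.0          # a neuron counts as "fired" if its output exceeds this
covered = {}             # (layer_idx, neuron_idx) -> ever activated?

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # output shape: (batch, n_neurons); mark neurons that fired for any input
        fired = (output > threshold).any(dim=0)
        for j, f in enumerate(fired.tolist()):
            key = (layer_idx, j)
            covered[key] = covered.get(key, False) or f
    return hook

# Attach hooks to the layers whose outputs we treat as "neurons".
for idx, layer in enumerate(model):
    if isinstance(layer, nn.ReLU):
        layer.register_forward_hook(make_hook(idx))

# Run a batch of test inputs and report coverage.
with torch.no_grad():
    model(torch.randn(32, 8))

coverage = sum(covered.values()) / len(covered)
print(f"neuron coverage: {coverage:.2%}")
```

The bookkeeping idea is the same for real models such as the Udacity steering networks: a neuron counts as covered once any test input pushes it above the threshold.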
2. Checking DNNs by comparing them to other implementations

The way DeepXplore checks behavior is to compare multiple, more or less different DNN-based implementations of the same task and see whether they agree. This is an extension of software "differential testing", which (for those of you coming from hardware verification) is essentially the old idea of "checking against a reference model".

Potential issues: Reference-model checking usually assumes that the reference model is (almost) golden, i.e., that it can be trusted. That is not the case here, which raises some questions. For instance, will this find the bugs we care about? It only exercises cases that some of the DNNs happened to take into account during training and others did not, and I doubt that is enough. The paper says:

If all the tested DNNs make the same mistake, DeepXplore cannot generate the corresponding test case. However, we found that this is not a significant problem in practice, because most DNNs are built and trained independently, and the probability that they all make the same mistake is low.

But this assumes (a) that you can actually get hold of many genuinely different implementations in a realistic, competitive market, and (b) that they will not share the same blind spots; at least one of the authors seems to agree that this is not necessarily realistic.

Author response: This is indeed a limitation of differential testing. One way to test a model in a standalone fashion is to use adversarial DNN testing techniques, which currently only allow slight perturbations that are invisible to the human eye. One could probably extend adversarial testing to use a wider range of realistic constraints, for example changing lighting conditions. However, the main problem with this approach is the lack of labels: it is hard to determine whether the DNN's classification is still correct, because in principle any perturbation could change the true label of the image.

Actually, there is a reasonably good way to get an "independent" check: the CDV option (which admittedly means a lot of work) of having humans write a separate, manual set of checks/assertions. For example, for the Udacity driving example, one could write checks like "never steer towards another car". These are necessarily partial, somewhat probabilistic checks, so writing them is a non-trivial job, but people working on self-driving cars will probably have to do it anyway; safety issues will never be handled by machine learning alone. Those same people will also want to augment implementation coverage with functional coverage, e.g., "Did we take a left turn? An unprotected left turn? A left turn with no other cars on our road? A left turn in the rain?", as I mentioned earlier.

There is another potential problem with differential checking: suppose the implementations overlap heavily, but each of them handles some cases the others do not. Running DeepXplore will flag all of those cases as "inconsistent", and you then have to go through every one of them to figure out what is going on (is this a bug in my implementation, or just a case somebody else forgot?). In addition, there are often cases with no single right answer: the car could legitimately turn either left or right, or the decision boundary could simply sit in a different place ("can I start moving now?"). When an inconsistency only means "there might be a bug here", we may be in for a long, manual triage process.
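Here is a minimal sketch of the cross-referencing step itself, assuming we already have several trained classifiers with the same input/output interface (the model and input variables are placeholders, not part of DeepXplore). Note how little it decides on its own: it only flags disagreements, and the triage described above is left to a human.

```python
import torch

def differential_test(models, test_inputs):
    """Return the inputs on which the K models disagree about the label."""
    disagreements = []
    with torch.no_grad():
        for x in test_inputs:
            # Each model votes for a class on the same input.
            labels = {int(m(x.unsqueeze(0)).argmax(dim=1)) for m in models}
            if len(labels) > 1:          # at least two implementations disagree
                disagreements.append((x, labels))
    return disagreements

# Usage (illustrative): models = [model_a, model_b, model_c]
# suspects = differential_test(models, test_inputs)
# Each suspect still needs triage: a real bug in one model,
# or a case that the other models simply never learned?
```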
3. Driving execution to find errors while maximizing coverage and obeying constraints

This is probably the most interesting part. As the authors describe it:

Finally, we show how the problem of generating test inputs that maximize neuron coverage of a deep learning system while also exposing as many differential behaviors as possible (i.e., differences between multiple similar deep learning systems) can be formulated as a joint optimization problem, which can be solved efficiently for large-scale, real-world deep learning classifiers. Unlike traditional programs, the functions approximated by most popular deep neural networks (DNNs) used in deep learning systems are differentiable. Therefore, given whitebox access to the corresponding model, the gradients with respect to the inputs can be computed exactly.

I really like the fact that this combines what we would call coverage maximization and bug hunting in a single tool. They also try to do it under constraints: for example, in the vision tasks they only darken the image or occlude a small corner of it, and in the malware-detection task they enforce constraints on the structure of the file.

Potential issues: There is a big problem here, which I call the "constraint problem": because the allowed constraints are not flexible enough, the inputs you get are essentially "the inputs used for training, plus some small modifications". What you would really like is to specify flexible constraints that allow both small modifications and fairly drastic (but realistic) changes, like driving on the left or graffiti painted on a bus, while still ruling out completely unrealistic changes, like a car floating in mid-air. Is that possible? It certainly sounds hard, but the malware example shows that constraints can be customized to some extent.

My favorite flavor of verification is some variant of CDV (coverage-driven verification), so I would love to see randomized, model-based input generation for verifying all kinds of systems, including DNN-based ones. The DeepXplore folks only discuss taking existing inputs and mutating them; finding some combination of the two approaches should be interesting.

Author response: You are right, the constraints we propose in the image setting are still not flexible enough. We chose them because they can be guided efficiently by the gradient. There are many other data-augmentation techniques: for images, we can rotate them, flip them, or even apply semantic transformations (e.g., change a BMW X6 into a BMW X1). However, we cannot use gradients to search these transformations efficiently; we can only apply them randomly under the constraints and hope that some of them induce differential behavior between the models. Characterizing which types of constraints can be guided efficiently by the gradient is an interesting question.

...Coming up with realistic models/constraints for data (e.g., images) is a hard problem in itself. In DeepXplore we start from realistic seed inputs, hoping that the mutated samples remain valid, i.e., could appear in the real world. We also tried starting from random samples and found that DeepXplore could still find difference-inducing inputs, but they did not look like realistic images. In the machine-learning world, a popular way to approach this is Generative Adversarial Networks (GANs): given a random vector, a GAN learns to generate realistic inputs that are hard to distinguish from real ones (e.g., https://github.com/jayleicn/animeGAN). We are also studying how to use this technique to generate inputs.
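As a rough illustration of the GAN direction the authors mention (my own sketch, not DeepXplore's code), one could perturb the latent vector of a pretrained generator G rather than the raw pixels, so that the mutated inputs tend to stay on the manifold of realistic images. The generator G, the latent dimension, and the divergence threshold are all assumptions here.

```python
import torch

# Assumed: a pretrained generator G mapping latent vectors z -> images,
# and two driving models whose steering predictions we want to compare.
latent_dim = 128

def generate_candidates(G, model_a, model_b, n=64, noise_scale=0.1):
    """Sample realistic-looking inputs from G, lightly perturb them in
    latent space, and keep those on which the two models diverge."""
    candidates = []
    with torch.no_grad():
        z = torch.randn(n, latent_dim)
        for z_i in z:
            z_perturbed = z_i + noise_scale * torch.randn(latent_dim)
            image = G(z_perturbed.unsqueeze(0))       # stays near the data manifold
            a, b = model_a(image), model_b(image)     # e.g. steering angles
            if (a - b).abs().max() > 0.1:             # divergence threshold (arbitrary)
                candidates.append(image)
    return candidates
```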
I have written about "GANs for verification" on my blog, and I agree that GANs can help generate realistic verification inputs in this direct way. But I suspect the amount of "novelty" you get that way is limited, so it is probably not good enough. What I would really like to see is a pipeline where the inputs are generated from some constrained, higher-level description. For example, consider the following (completely speculative) setup: suppose we have a text-to-image neural network; we tell it "a car is coming from the left", and it gives us images matching that description. I describe (in the post mentioned above) how people have combined a GAN and an RNN to do this kind of thing, but that is probably not the only way. Now feed the resulting image (or sequence of images) into Udacity's driving DNN and get the "required steering angle" output. Instead of, say, adding darkening noise to the image, you could perturb the text input, and you could also impose logical constraints on the text. If this works (and it probably could), you might get even more complexity and variety by adding another DNN that combines the outputs of multiple such text-to-image DNNs into composite scenes ("a car coming from the left, plus a dog standing on the right, plus a traffic light that just turned green"). This may not work exactly as stated; I am just trying to outline a direction that is closer to "real CDV (coverage-driven verification)".

4. Efficient, gradient-based joint optimization on DNNs

In an earlier post I expressed my hope that machine learning would be used more in the verification of intelligent, autonomous systems. One of the reasons I gave:

Deep neural networks are built out of "differentiable" computations, a property that makes it much easier to manipulate them towards your goals. I will discuss this further in a later post.

Well, that "later post" is the one you are reading. Let's see how the authors explain the way they compute error-causing inputs:

Note that, at a high level, our gradient computation is similar to the backpropagation performed while training a DNN, with one key difference: unlike ours, backpropagation treats the input values as constants and the weight parameters as variables.

In other words, they are using the same efficient linear-algebra machinery that underlies DNN training to do something else: to find inputs that will hit a coverage point, or an error, or both. Think about how hard that is in regular hardware and software verification: to reach a specific error or coverage point, you throw lots of techniques at the problem, such as smart random generation combined with a lot of manual work, or concolic generation, or (for sufficiently small systems) model checking; you can find references at the end of this post (https://blog.foretellix.com/2016/09/01/machine-learning-for-coverage-maximization/). For DNNs, by contrast, it is comparatively simple. By the way, the results DeepXplore reports are only for static, stateless DNNs, but the technique could presumably be extended to handle things like reinforcement learning, where the inputs form a sequence.
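To make the "gradients with respect to the input" point concrete, here is a minimal PyTorch sketch (mine, not the paper's code) of one ascent step on a joint objective: increase the disagreement between two implementations while also pushing a not-yet-covered neuron above its threshold. The weighting term, the learning rate, and the uncovered_activation helper are placeholders.

```python
import torch

def joint_ascent_step(x, model_a, model_b, uncovered_activation, lam=1.0, lr=0.01):
    """One gradient step on the input x (the weights stay fixed)."""
    x = x.clone().detach().requires_grad_(True)   # the input is the variable here

    # Objective: make the two implementations disagree...
    diff = (model_a(x) - model_b(x)).abs().sum()
    # ...while also activating a neuron we have not covered yet.
    # uncovered_activation(x) should return that neuron's scalar output.
    objective = diff + lam * uncovered_activation(x)

    # Same backprop machinery as training, but w.r.t. x, not the weights.
    grad, = torch.autograd.grad(objective, x)
    with torch.no_grad():
        x_new = x + lr * grad.sign()     # simple unconstrained ascent step
        # A domain constraint could be applied here, e.g. darkening only:
        # x_new = torch.minimum(x_new, x)
    return x_new.detach()
```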
I feel this machinery could be used for many other things (see the paper section titled "Other applications of DNN gradients"). For example, here is a preliminary, vague idea: I have written before about "explainable AI" and why it would help verification (among many other things). Perhaps we could use DeepXplore-style techniques (finding precise decision boundaries and example/counterexample pairs around them) to characterize local regions of the input space ("if this input changes in this way, the answer should change in that way"), and let users understand DNNs that way.

Other comments

Here are some scattered comments that came to mind while reading the paper.

Machine learning alone will probably not solve all safety issues: As I mentioned in "Reinforcement learning and safety" (https://blog.foretellix.com/2017/03/28/misc-stuff-mobileye-simulations-and-test-tracks/), most people assume that DNNs by themselves cannot handle all safety-related corner cases (for many reasons: for example, further training on bad data can push the system away from safe behavior, and you would like to be able to convince yourself, by reading the code, that the system meets its safety requirements). So people will need some kind of non-DNN "safety wrapper" (I mentioned three such wrappers in that post). If that is right, then DeepXplore (restricted as it is to the DNN part of the system) can only do part of the job: for every "error" it finds, we still have to hand it to the larger system and see whether the wrapper catches it. This could be easy, but not necessarily.

Author response: You are right. Any real-world self-driving car will have some kind of wrapper to handle questionable decisions of the DNN, rather than relying directly on the DNN for all error cases. So the wrapper should itself be a good target for testing.

About adversarial examples: I discussed adversarial examples in the post "Using machine learning to verify machine learning?", where I said:

The paper "Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples" (https://arxiv.org/pdf/1605.07277.pdf) shows that the same fooling examples often transfer across different kinds of machine learning systems (deep neural networks, support vector machines, decision trees, and so on). Also, the paper "Explaining and Harnessing Adversarial Examples" tries to explain why fooling examples should often transfer to all roughly linear models. In addition, it shows that correctly classified inputs form only a "thin manifold" in the input space: they are surrounded by fooling examples, while most of the input space consists of "junk samples" that are completely unrelated to what the training focused on.

If, for a given task, adversarial examples that fool one DNN implementation also tend to fool other DNN implementations, then perhaps plain erroneous examples behave the same way. If that is true, checking by comparing K implementations will not work unless K is very large.

Author response: Transferability is indeed an interesting observation. In our experiments we found that while some difference-inducing inputs are transferable, most are not; in fact, the goal of differential testing is precisely to find the non-transferable ones.

Also, note that if the space of "junk samples" really is very large, we probably need to split it into "not considered in training, but should have been" (like "driving on the left") and "not considered in training, for good reason" (like "one car flying above another in space"). Which brings us back to the very hard "constraint problem" discussed above.
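If one wanted to quantify the transferability worry empirically, a rough sketch might look as follows; the models, the fooling inputs, and their true labels are all assumed to be available, and this is not part of DeepXplore.

```python
import torch

def transfer_rate(other_models, fooling_inputs, true_labels):
    """fooling_inputs are assumed to already fool some reference model;
    return the fraction that also fool every other implementation.
    A high rate would undermine checking by comparing K implementations."""
    transferred = 0
    with torch.no_grad():
        for x, y in zip(fooling_inputs, true_labels):
            preds = [int(m(x.unsqueeze(0)).argmax(dim=1)) for m in other_models]
            if all(p != y for p in preds):   # every other model is wrong too
                transferred += 1
    return transferred / max(len(fooling_inputs), 1)
```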
Original link: https://blog.foretellix.com/2017/06/06/deepxplore-and-new-ideas-for-verifying-ml-systems/