How to Evolve Neural Networks with Automatic Machine Learning

For most people working in machine learning, designing a neural network feels more like an art than a science. We usually start from a common architecture and then repeatedly adjust and optimize it until we find a good combination of layers, activation functions, regularizers, and optimization parameters. Guided by well-known architectures such as VGG, Inception, ResNets, and DenseNets, we keep tweaking the network's variables until it reaches the speed and accuracy we expect. As processing power continues to grow, it is becoming more and more feasible to automate this optimization process.

For shallow models such as Random Forests and SVMs, we can already automate hyperparameter optimization. Common toolkits such as scikit-learn provide methods for searching the hyperparameter space. In its simplest form, this is either a grid search over all combinations of candidate parameter values, or a random search that samples from parameter distributions. Both approaches face two problems: first, they waste resources searching in poor regions of the parameter space; second, they handle large, dynamically-sized parameter sets poorly, which makes it quite difficult to vary the architecture of the network itself. There are seemingly more efficient methods, such as Bayesian optimization, but while Bayesian optimization mitigates the first problem, it does little for the second; in addition, exploring different model architectures within a Bayesian optimization setting remains difficult.
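
To make the difference between grid search and random search concrete, here is a minimal scikit-learn sketch. The SVC classifier, the digits dataset, and the candidate parameter values are illustrative choices, not something taken from the original article.

    from sklearn.datasets import load_digits
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)
    param_space = {'C': [0.1, 1, 10, 100], 'gamma': [1e-4, 1e-3, 1e-2]}

    # grid search: try every combination in the space
    grid = GridSearchCV(SVC(), param_space, cv=3).fit(X, y)

    # random search: sample a fixed budget of candidates from the same space
    rand = RandomizedSearchCV(SVC(), param_space, n_iter=5, cv=3,
                              random_state=0).fit(X, y)

    print(grid.best_params_, grid.best_score_)
    print(rand.best_params_, rand.best_score_)

Neither of these searches can change the model's architecture, which is exactly the limitation the rest of this article tries to address.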

The idea of automatically identifying well-performing networks is not new, and with recent advances in processing power it is easier than ever to put into practice.

Problem Setting

One way to think about hyperparameter optimization is as a “meta-learning problem”.

Can we create an algorithm that can be used to judge whether a network is performing well?

Note: I will keep using the term “meta-learning” in what follows, even though describing this problem as “meta-learning” can be a bit confusing; it should not be confused with other “learning to learn” methods.

Our goal is to define the number of hidden layers (green) in the network and the parameters of each hidden layer.

Specifically, we want to explore the space of model architectures and model parameters in order to optimize performance on a given dataset. This problem is complex and hard to solve, and the rewards are sparse. They are sparse because a network must first be trained sufficiently and then evaluated, and only after training and evaluation are complete do we get a single score as the reward. That score reflects the performance of the entire system, and this kind of reward is not a differentiable function! Does this remind you of anything? Yes, this is a typical "reinforcement learning" scenario.

Wikipedia defines "reinforcement learning":

"Reinforcement learning" (RL) is an important machine learning method, which is inspired by the behaviorist theory of psychology. Specifically, "reinforcement learning" is about how an organism (agent) can maximize the cumulative reward under the stimulation of the environment (environment).

"Reinforcement learning" differs from standard supervised learning in that correct input/output pairs are never presented, nor are sub-optimal actions explicitly corrected. In addition, the focus is on online performance, which means finding a balance between exploring unknown territory and exploiting existing knowledge.

The agent in this scenario is a model, and the environment is the dataset we use for training and evaluation. The interpreter is the process that analyzes each action and sets the agent's state (in our scenario, the interpreter sets the network parameters).

Typically, a "reinforcement learning" problem is defined as a Markov decision process, whose goal is to optimize the agent's total reward. At every step the agent must decide whether to exploit what the current model suggests or to explore a new action. Based on the feedback it receives from the environment, the agent forms a policy and continuously improves its behavior.

Note: This topic is beyond the scope of this article; Reinforcement Learning: An Introduction by R. Sutton and A. Barto is probably the best introductory book on the subject.

Evolutionary Algorithms

Another approach to solving the "reinforcement learning" problem is "evolutionary algorithms". Inspired by biological evolution, an evolutionary algorithm creates a population of solutions that covers the solution space; it then evaluates each solution and keeps adjusting the population based on the evaluation scores. Evolution here involves selecting and mutating the best members of the population, so the population keeps evolving, improves its overall fitness, and eventually yields a viable solution to the problem.

The left side of the figure above illustrates the process of evolution. Designing an "evolutionary algorithm" involves two parts: "selection", and the "crossover" or "mutation" strategy to be followed.

"Selection": For "selection", we usually select the best individuals and some random individuals to achieve diversity. A more advanced selection method is to set up different "sub-groups" under the population, that is, "species"; then select the best individuals in the species to protect its diversity. Another popular method is "competitive selection", that is, randomly select some individuals to participate in the competition and select the winner (the individual with superior genes).

"Crossover": "Crossover" is also called "crossover", which refers to the crossover and mixing of two or more sets of parents to produce offspring. "Crossover" is highly dependent on the way the problem is structured. A common approach is to describe the parents with a list of items (usually numerical values), and then select arbitrary parts from the parents to generate new genetic combinations.

"Variation": "Variation" or "mutation" refers to the process of arbitrarily changing the genome. This is a major development factor and helps maintain the diversity of the population.

Implementation

Our implementation of the "evolutionary algorithm" uses PyTorch to build an agent that explores DNNs for a simple classification task. The experiment uses MNIST because it is small and fast to train on, even on a CPU. We will build a population of DNN models and evolve it for N steps.

The flavor of "evolution" we use here is in fact an implementation of "natural selection". The complete high-level "evolutionary algorithm" is as follows:

    new_population = []
    while size(new_population) < population_size:
        choose k(tournament) individuals from the population at random
        choose the best from pool/tournament with probability p1
        choose the second best individual with probability p2
        choose the third best individual with probability p3
        mutate and append selected to the new_population

Side note: Crossover gets quite complicated when it comes to merging architectures. How exactly do you merge two parent architectures, and what happens to patterns that were learned and trained in the context of one parent's overall structure? A recent paper by Miikkulainen et al. proposes a solution called CoDeepNEAT. Following ideas similar to Evolino, an architecture is composed of unit modules, each of which is itself subject to evolution, and the overall architecture is a blueprint that assembles those components. In that setting, mixing components of the parents makes sense, because each component is a complete mini-network. To keep this article concise and easy to follow, I avoid crossover in this implementation and simply point to solutions such as NEAT (or CoDeepNEAT). (I plan to cover these in detail in the next article.)

Basic building blocks

The first thing we need to define is the solution space for each model, where each individual represents an architecture. For simplicity, we stack n layers, each of which has three parameters: a) the number of hidden units; b) the activation type; c) the dropout rate. For the network-wide parameters, we choose among different optimizers, learning rates, weight decay values, and numbers of layers.

    # definition of a space
    # lower bound - upper bound, type param, mutation rate
    LAYER_SPACE = dict()
    LAYER_SPACE['nb_units'] = (128, 1024, 'int', 0.15)
    LAYER_SPACE['dropout_rate'] = (0.0, 0.7, 'float', 0.2)
    LAYER_SPACE['activation'] =\
        (0, ['linear', 'tanh', 'relu', 'sigmoid', 'elu'], 'list', 0.2)

    NET_SPACE = dict()
    NET_SPACE['nb_layers'] = (1, 3, 'int', 0.15)
    NET_SPACE['lr'] = (0.0001, 0.1, 'float', 0.15)
    NET_SPACE['weight_decay'] = (0.00001, 0.0004, 'float', 0.2)
    NET_SPACE['optimizer'] =\
        (0, ['sgd', 'adam', 'adadelta', 'rmsprop'], 'list', 0.2)

Having defined the model space, we next need three basic functions:

Generating a random network

    import random

    def random_value(space):
        """Sample a random value from the given space."""
        val = None
        if space[2] == 'int':
            val = random.randint(space[0], space[1])
        if space[2] == 'list':
            val = random.sample(space[1], 1)[0]
        if space[2] == 'float':
            val = ((space[1] - space[0]) * random.random()) + space[0]
        return {'val': val, 'id': random.randint(0, 2**10)}


    def randomize_network(bounded=True):
        """Create a random network."""
        global NET_SPACE, LAYER_SPACE
        net = dict()
        for k in NET_SPACE.keys():
            net[k] = random_value(NET_SPACE[k])

        if bounded:
            # bounded networks start out with a single layer
            net['nb_layers']['val'] = min(net['nb_layers']['val'], 1)

        layers = []
        for i in range(net['nb_layers']['val']):
            layer = dict()
            for k in LAYER_SPACE.keys():
                layer[k] = random_value(LAYER_SPACE[k])
            layers.append(layer)
        net['layers'] = layers
        return net

First, we randomly sample the number of layers and the parameters of each layer, with the sampled values falling within the predefined ranges. When initializing a parameter we also generate a random parameter id. It is not used yet, but it lets us keep track of every layer: when a new model is mutated, the layers that did not change could be warm-started from their parent's weights and only the mutated layers re-initialized, which should significantly speed up and stabilize training.
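
That warm-start idea could look something like the sketch below: a hypothetical helper (not part of the article's repository) that matches layers by their random id and copies over the weights of layers that survived mutation unchanged, so only freshly mutated layers start from scratch.

    def warm_start(child_info, child_fc_layers, parent_info, parent_fc_layers):
        """Hypothetical helper: copy weights of layers whose ids survived mutation.

        `*_info` are the network dicts defined above; `*_fc_layers` are the
        corresponding lists of nn.Linear modules, in layer order."""
        parent_by_id = {layer['nb_units']['id']: module
                        for layer, module in zip(parent_info['layers'], parent_fc_layers)}
        for layer, module in zip(child_info['layers'], child_fc_layers):
            source = parent_by_id.get(layer['nb_units']['id'])
            # same id means nb_units was not resampled; also check the input
            # dimension still matches in case the previous layer changed size
            if source is not None and source.weight.shape == module.weight.shape:
                module.load_state_dict(source.state_dict())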

Note: Depending on the nature of the problem, we may need different constraints, such as the total number of parameters or the total number of layers.

Mutating the network

    # (continues net_builder.py from above: uses random, random_value, NET_SPACE, LAYER_SPACE)
    def mutate_net(net):
        """Mutate a network."""
        global NET_SPACE, LAYER_SPACE

        # mutate optimizer settings
        for k in ['lr', 'weight_decay', 'optimizer']:
            if random.random() < NET_SPACE[k][-1]:
                net[k] = random_value(NET_SPACE[k])

        # mutate layers
        for layer in net['layers']:
            for k in LAYER_SPACE.keys():
                if random.random() < LAYER_SPACE[k][-1]:
                    layer[k] = random_value(LAYER_SPACE[k])

        # mutate number of layers -- randomly add or remove one
        if random.random() < NET_SPACE['nb_layers'][-1]:
            if net['nb_layers']['val'] < NET_SPACE['nb_layers'][1]:
                if random.random() < 0.5:
                    layer = dict()
                    for k in LAYER_SPACE.keys():
                        layer[k] = random_value(LAYER_SPACE[k])
                    net['layers'].append(layer)
                    # value & id update
                    net['nb_layers']['val'] = len(net['layers'])
                    net['nb_layers']['id'] += 1
                else:
                    if net['nb_layers']['val'] > 1:
                        net['layers'].pop()
                        net['nb_layers']['val'] = len(net['layers'])
                        net['nb_layers']['id'] -= 1
        return net

Each element of the network has a chance of mutating; each mutation resamples the corresponding parameter space, changing that parameter's value.

Building a network

    import torch.nn as nn
    import torch.optim as optim


    class Flatten(nn.Module):
        """Flatten 28x28 input images into vectors (assumed helper; the repository defines its own)."""
        def forward(self, x):
            return x.view(x.size(0), -1)


    class CustomModel():
        def __init__(self, build_info, CUDA=True):
            previous_units = 28 * 28
            self.model = nn.Sequential()
            self.model.add_module('flatten', Flatten())
            for i, layer_info in enumerate(build_info['layers']):
                i = str(i)
                self.model.add_module(
                    'fc_' + i,
                    nn.Linear(previous_units, layer_info['nb_units']['val'])
                )
                self.model.add_module(
                    'dropout_' + i,
                    nn.Dropout(p=layer_info['dropout_rate']['val'])
                )
                if layer_info['activation']['val'] == 'tanh':
                    self.model.add_module('tanh_' + i, nn.Tanh())
                if layer_info['activation']['val'] == 'relu':
                    self.model.add_module('relu_' + i, nn.ReLU())
                if layer_info['activation']['val'] == 'sigmoid':
                    self.model.add_module('sigm_' + i, nn.Sigmoid())
                if layer_info['activation']['val'] == 'elu':
                    self.model.add_module('elu_' + i, nn.ELU())
                previous_units = layer_info['nb_units']['val']

            self.model.add_module(
                'classification_layer',
                nn.Linear(previous_units, 10)
            )
            self.model.add_module('softmax', nn.LogSoftmax(dim=1))
            self.model.cpu()

            # learning rate and weight decay come from the evolved genome
            if build_info['optimizer']['val'] == 'adam':
                optimizer = optim.Adam(self.model.parameters(),
                                       lr=build_info['lr']['val'],
                                       weight_decay=build_info['weight_decay']['val'])
            elif build_info['optimizer']['val'] == 'adadelta':
                optimizer = optim.Adadelta(self.model.parameters(),
                                           lr=build_info['lr']['val'],
                                           weight_decay=build_info['weight_decay']['val'])
            elif build_info['optimizer']['val'] == 'rmsprop':
                optimizer = optim.RMSprop(self.model.parameters(),
                                          lr=build_info['lr']['val'],
                                          weight_decay=build_info['weight_decay']['val'])
            else:
                optimizer = optim.SGD(self.model.parameters(),
                                      lr=build_info['lr']['val'],
                                      weight_decay=build_info['weight_decay']['val'],
                                      momentum=0.9)
            self.optimizer = optimizer
            self.cuda = False
            if CUDA:
                self.model.cuda()
                self.cuda = True

The above class instantiates a model from its "genome" (the build_info dictionary).

Now that we have the basic building blocks to build an arbitrary network, change its architecture, and train it, the next step is to build the "genetic algorithm" that selects and mutates individuals. Each model is trained independently, without any information about the other individuals, which allows the optimization process to scale almost linearly with the number of available processing nodes.

Coding the GP optimizer

  1. """Genetic programming algorithms."""
  2. from __future__ import absolute_import
  3. import random
  4. import numpy as np
  5. from operator import itemgetter
  6. import torch.multiprocessing as mp
  7. from net_builder import randomize_network
  8. import copy
  9. from worker import CustomWorker, Scheduler
  10. class TournamentOptimizer:
  11. """Define a tournament play selection process."""
  12. def __init__(self, population_sz, init_fn, mutate_fn, nb_workers=2, use_cuda=True):
  13. """
  14. Initialize optimizer.
  15. params::
  16. init_fn: initialize a model
  17. mutate_fn: mutate function - mutates a model
  18. nb_workers: number of workers
  19. """
  20. self.init_fn = init_fn
  21. self.mutate_fn = mutate_fn
  22. self.nb_workers = nb_workers
  23. self.use_cuda = use_cuda
  24. # population
  25. self.population_sz = population_sz
  26. self.population = [init_fn() for i in range(population_sz)]
  27. self.evaluations = np.zeros(population_sz)
  28. #bookkeeping
  29. self.elite = []
  30. self.stats = []
  31. self.history = []
  32. def step(self):
  33. """Tournament evolution step."""
  34. print('\nPopulation sample:')
  35. for i in range(0,self.population_sz,2):
  36. print(self.population[i]['nb_layers'],
  37. self.population[i]['layers'][0]['nb_units'])
  38. self.evaluate()
  39. children = []
  40. print('\nPopulation mean:{} max:{}'.format(
  41. np.mean(self.evaluations), np.max(self.evaluations)))
  42. n_elite = 2
  43. sorted_pop = np.argsort(self.evaluations)[::-1]
  44. elite = sorted_pop[:n_elite]
  45. # print top@n_elite scores
  46. # elites always included in the next population
  47. self.elite = []
  48. print('\nTop performers:')
  49. for i,e in enumerate(elite):
  50. self.elite.append((self.evaluations[e], self.population[e]))
  51. print("{}-score:{}".format( str(i), self.evaluations[e]))
  52. children.append(self.population[e])
  53. #tournament probabilities:
  54. # first p
  55. # second p*(1-p)
  56. # third p*((1-p)^2)
  57. # etc...
  58. p = 0.85 # winner probability
  59. tournament_size = 3
  60. probs = [p*((1-p)**i) for i in range(tournament_size-1)]
  61. # a little trick to certify that probs is adding up to 1.0
  62. probs.append(1-np.sum(probs))
  63. while len(children) < self.population_sz:
  64. pop = range(len(self.population))
  65. sel_k = random.sample(pop, k=tournament_size)
  66. fitness_k = list(np.array(self.evaluations)[sel_k])
  67. selected = zip(sel_k, fitness_k)
  68. rank = sorted(selected, key=itemgetter(1), reverse=True)
  69. pick = np.random.choice(tournament_size, size=1, p=probs)[0]
  70. best = rank[pick][0]
  71. model = self.mutate_fn(self.population[best])
  72. children.append(model)
  73. self.population = children
  74. # if we want to do a completely completely random search per epoch
  75. # self.population = [randomize_network(bounded=False) for i in range(self.population_sz) ]
  76. def evaluate(self):
  77. """evaluate the models."""
  78. workerids = range(self.nb_workers)
  79. workerpool = Scheduler(workerids, self.use_cuda)
  80. self.population, returns = workerpool.start(self.population)
  81. self.evaluations = returns
  82. self.stats.append(copy.deepcopy(returns))
  83. self.history.append(copy.deepcopy(self.population))

“Evolutionary algorithms” seem pretty simple, right? They are! And they can be very successful, especially if you define good mutation or crossover functions for the individuals.

The repository also contains some additional utility classes, such as the worker and scheduler classes, which let the GP optimizer train and evaluate models independently and in parallel.
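
Those classes are not reproduced here, but a minimal single-process stand-in for a worker might look like the sketch below: it trains a CustomModel on roughly 10,000 MNIST examples and returns the test accuracy that serves as the fitness score. The data-loading details and batch sizes are illustrative assumptions, not the repository's actual worker/scheduler code.

    import torch
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    def evaluate_model(custom_model, n_train=10000, batch_size=64):
        """Train briefly on MNIST and return test accuracy as the fitness score."""
        train_set = datasets.MNIST('data', train=True, download=True,
                                   transform=transforms.ToTensor())
        test_set = datasets.MNIST('data', train=False, download=True,
                                  transform=transforms.ToTensor())
        train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
        test_loader = DataLoader(test_set, batch_size=256)

        model, optimizer = custom_model.model, custom_model.optimizer
        criterion = torch.nn.NLLLoss()  # the model ends with LogSoftmax

        model.train()
        seen = 0
        for x, y in train_loader:          # only ~10k of the 60k training examples
            if custom_model.cuda:
                x, y = x.cuda(), y.cuda()
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
            seen += x.size(0)
            if seen >= n_train:
                break

        model.eval()
        correct = 0
        with torch.no_grad():
            for x, y in test_loader:
                if custom_model.cuda:
                    x, y = x.cuda(), y.cuda()
                correct += (model(x).max(1)[1] == y).sum().item()
        return 100.0 * correct / len(test_set)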

Run the code

Putting the pieces above together, we can run the experiment:

  1. """Tournament play experiment."""
  2. from __future__ import absolute_import
  3. import net_builder
  4. import gp
  5. import cPickle
  6. # Use cuda ?
  7. CUDA_ = True
  8. if __name__ == '__main__':
  9. # setup a tournament!
  10. nb_evolution_steps = 10
  11. tournament = \
  12. gp.TournamentOptimizer(
  13. population_sz=50,
  14. init_fn=net_builder.randomize_network,
  15. mutate_fn=net_builder.mutate_net,
  16. nb_workers=3,
  17. use_cuda=True)
  18. for i in range(nb_evolution_steps):
  19. print('\nEvolution step:{}'.format(i))
  20. print('================')
  21. tournament.step()
  22. # keep track of the experiment results & corresponding architectures
  23. name = "tourney_{}".format(i)
  24. cPickle.dump(tournament.stats, open(name + '.stats','wb'))
  25. cPickle.dump(tournament.history, open(name +'.pop','wb'))

Next, let’s take a look at the results of the operation!

Here are the scores for a population of 50 solutions with a tournament size of 3. The models were trained on only 10,000 examples and then evaluated. At first glance the evolutionary algorithm doesn't seem to be doing much, since solutions are already close to optimal after the first evolution step, and by the seventh step the solutions reach their peak performance. In the figure below, a box plot shows the quartiles of the solutions at each step. We can see that most solutions perform well, and that the box shrinks as the solutions evolve.

Each box in the figure spans the middle two quartiles of the solutions, and the whiskers extend to show the rest of the distribution. The black dot marks the mean score, which we can see rising as the population evolves.
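
The box plot itself is straightforward to reproduce from the stats pickled by the experiment script; here is a possible matplotlib sketch, assuming the final dump (tourney_9.stats), which contains the scores of every evolution step.

    import cPickle  # use `pickle` on Python 3
    import matplotlib.pyplot as plt

    # one list of population scores per evolution step
    stats = cPickle.load(open('tourney_9.stats', 'rb'))

    plt.boxplot(stats, showmeans=True)   # one box per step; the mean is marked with a dot
    plt.xlabel('evolution step')
    plt.ylabel('test score')
    plt.show()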

To better understand the performance and behavior of this method, we compare it with a completely random population search: no evolution takes place between steps, and every solution is re-initialized at random.

The evolutionary algorithm performs better by a relatively small margin (93.66% vs 93.22%). While the random population search does produce some good solutions, the variance of the models is much larger, which means resources are wasted searching suboptimal architectures. Comparing this with the evolution plot, we can see that evolution does generate more consistently useful solutions and successfully improves the structures it finds, which in turn achieve better performance.

  • MNIST is a fairly simple dataset, and even a single-layer network can achieve high accuracy.

  • Optimizers like Adam are less sensitive to the learning rate and will find good solutions as long as the network has enough parameters.

  • During training, the model only looks at 10,000 examples (1/5 of the total training data). If we train longer, a good architecture may achieve higher accuracy.

  • The limited number of samples also matters for the number of layers that can be learned, since deeper models need more data. To address this, the mutation can also remove a layer, allowing the population to regulate its own depth.

The scale of this experiment is not large enough to highlight the full advantage of the method; the experiments in the papers listed below are much larger and use more complex datasets.

We have just implemented a simple evolutionary algorithm that nicely illustrates the theme of "natural selection": our algorithm only selects winning solutions and mutates them to produce offspring. The next step is to use more advanced methods to generate and evolve the population of solutions. Here are some suggestions for improvement:

  • Reuse parent weights for common layers

  • Merge layers from two potential parents

  • The architecture does not have to be sequential; you can explore different kinds of connections between layers (splitting or merging paths, etc.)

  • Add extra layers on top and then fine-tune them.

All of the above are active topics in AI research. One of the more popular methods is NEAT and its extensions. NEAT variants use the evolutionary algorithm not only to evolve the network's structure but also to set its weights. Evolving the agent's weights is quite feasible in a typical reinforcement learning scenario; however, when (x, y) input pairs are available, gradient-descent methods perform better.

Related articles

Evolino: Hybrid Neuroevolution / Optimal Linear Search for Sequence Learning

Evolving Deep Neural Networks — This is a very interesting approach that co-evolves whole networks and blocks within the network; it is very similar to the Evolino method, but for CNNs.

Large-Scale Evolution of Image Classifiers

Convolution by Evolution
