How to Evolve Neural Networks with Automatic Machine Learning

For most people working in machine learning, designing a neural network feels more like an art than a science. We usually start from a common architecture and then repeatedly adjust and optimize it until we find a good combination of layers, activation functions, regularizers, and optimization parameters. Guided by well-known architectures such as VGG, Inception, ResNets, and DenseNets, we keep tweaking the network's variables until it reaches the speed and accuracy we expect. As processing power continues to grow, it is becoming more and more feasible to automate this optimization process.

For shallow models such as Random Forests and SVMs, we can already automate hyperparameter optimization. Common toolkits such as scikit-learn provide methods for searching the hyperparameter space. In its simplest form, this is either a grid search over all combinations of candidate parameter values, or a random search that samples from parameter distributions. Both approaches face two problems: first, they waste resources searching in poor regions of the parameter space; second, they handle large, dynamically-sized parameter sets poorly, which makes it quite difficult to vary the architecture of the network itself. There are seemingly more efficient methods, such as Bayesian optimization, but while Bayesian optimization mitigates the first problem, it does little for the second; in addition, exploring different model architectures within a Bayesian optimization setting remains difficult.
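
To make the difference between grid search and random search concrete, here is a minimal scikit-learn sketch. The SVC classifier, the digits dataset, and the candidate parameter values are illustrative choices, not something taken from the original article.

    from sklearn.datasets import load_digits
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)
    param_space = {'C': [0.1, 1, 10, 100], 'gamma': [1e-4, 1e-3, 1e-2]}

    # grid search: try every combination in the space
    grid = GridSearchCV(SVC(), param_space, cv=3).fit(X, y)

    # random search: sample a fixed budget of candidates from the same space
    rand = RandomizedSearchCV(SVC(), param_space, n_iter=5, cv=3,
                              random_state=0).fit(X, y)

    print(grid.best_params_, grid.best_score_)
    print(rand.best_params_, rand.best_score_)

Neither of these searches can change the model's architecture, which is exactly the limitation the rest of this article tries to address.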

The idea of automatically identifying well-performing networks is not new, and with recent advances in processing power it is easier than ever to put into practice.

Problem Setting

One way to think about hyperparameter optimization is as a “meta-learning problem”.

Can we create an algorithm that can be used to judge whether a network is performing well?

Note: I will keep using the term “meta-learning” in what follows, even though describing this problem as “meta-learning” can be a bit confusing; it should not be confused with other “learning to learn” methods.

Our goal is to define the number of hidden layers (green) in the network and the parameters of each hidden layer.

Specifically, we want to explore the space of model architectures and model parameters in order to optimize performance on a given dataset. This problem is complex and hard to solve, and the rewards are sparse. They are sparse because a network must first be trained sufficiently and then evaluated, and only after training and evaluation are complete do we get a single score as the reward. That score reflects the performance of the entire system, and this kind of reward is not a differentiable function! Does this remind you of anything? Yes, this is a typical "reinforcement learning" scenario.

Wikipedia defines "reinforcement learning":

"Reinforcement learning" (RL) is an important machine learning method, which is inspired by the behaviorist theory of psychology. Specifically, "reinforcement learning" is about how an organism (agent) can maximize the cumulative reward under the stimulation of the environment (environment).

"Reinforcement learning" differs from standard supervised learning in that correct input/output pairs are never presented, nor are sub-optimal actions explicitly corrected. In addition, the focus is on online performance, which means finding a balance between exploring unknown territory and exploiting existing knowledge.

The agent in this scenario is a model, and the environment is the dataset we use for training and evaluation. The interpreter is the process that analyzes each action and sets the agent's state (in our scenario, the interpreter sets the network parameters).

Typically, a "reinforcement learning" problem is defined as a Markov decision process, whose goal is to optimize the agent's total reward. At every step the agent must decide whether to exploit what the current model suggests or to explore a new action. Based on the feedback it receives from the environment, the agent forms a policy and continuously improves its behavior.

Note: This topic is beyond the scope of this article; Reinforcement Learning: An Introduction by R. Sutton and A. Barto is probably the best introductory book on the subject.

Evolutionary Algorithms

Another approach to solving the "reinforcement learning" problem is "evolutionary algorithms". Inspired by biological evolution, an evolutionary algorithm creates a population of solutions that covers the solution space; it then evaluates each solution and keeps adjusting the population based on the evaluation scores. Evolution here involves selecting and mutating the best members of the population, so the population keeps evolving, improves its overall fitness, and eventually yields a viable solution to the problem.

The left side of the figure above illustrates the process of evolution. Designing an "evolutionary algorithm" involves two parts: "selection", and the "crossover" or "mutation" strategy to be followed.

"Selection": For "selection", we usually select the best individuals and some random individuals to achieve diversity. A more advanced selection method is to set up different "sub-groups" under the population, that is, "species"; then select the best individuals in the species to protect its diversity. Another popular method is "competitive selection", that is, randomly select some individuals to participate in the competition and select the winner (the individual with superior genes).

"Crossover": "Crossover" is also called "crossover", which refers to the crossover and mixing of two or more sets of parents to produce offspring. "Crossover" is highly dependent on the way the problem is structured. A common approach is to describe the parents with a list of items (usually numerical values), and then select arbitrary parts from the parents to generate new genetic combinations.

"Variation": "Variation" or "mutation" refers to the process of arbitrarily changing the genome. This is a major development factor and helps maintain the diversity of the population.

Implementation

Our implementation of the "evolutionary algorithm" uses PyTorch to build an agent that explores DNNs for a simple classification task. The experiment uses MNIST because it is small and fast to train on, even on a CPU. We will build a population of DNN models and evolve it for N steps.

The flavor of "evolution" we use here is in fact an implementation of "natural selection". The complete high-level "evolutionary algorithm" is as follows:

    new_population = []
    while size(new_population) < population_size:
        choose k(tournament) individuals from the population at random
        choose the best from pool/tournament with probability p1
        choose the second best individual with probability p2
        choose the third best individual with probability p3
        mutate and append selected to the new_population

Side note: Crossover gets quite complicated when it comes to merging architectures. How exactly do you merge two parent architectures, and what happens to patterns that were learned and trained in the context of one parent's overall structure? A recent paper by Miikkulainen et al. proposes a solution called CoDeepNEAT. Following ideas similar to Evolino, an architecture is composed of unit modules, each of which is itself subject to evolution, and the overall architecture is a blueprint that assembles those components. In that setting, mixing components of the parents makes sense, because each component is a complete mini-network. To keep this article concise and easy to follow, I avoid crossover in this implementation and simply point to solutions such as NEAT (or CoDeepNEAT). (I plan to cover these in detail in the next article.)

Basic building blocks

The first thing we need to define is the solution space for each model, where each individual represents an architecture. For simplicity, we stack n layers, each of which has three parameters: a) the number of hidden units; b) the activation type; c) the dropout rate. For the network-wide parameters, we choose among different optimizers, learning rates, weight decay values, and numbers of layers.

    # definition of a space
    # lower bound - upper bound, type param, mutation rate
    LAYER_SPACE = dict()
    LAYER_SPACE['nb_units'] = (128, 1024, 'int', 0.15)
    LAYER_SPACE['dropout_rate'] = (0.0, 0.7, 'float', 0.2)
    LAYER_SPACE['activation'] =\
        (0, ['linear', 'tanh', 'relu', 'sigmoid', 'elu'], 'list', 0.2)

    NET_SPACE = dict()
    NET_SPACE['nb_layers'] = (1, 3, 'int', 0.15)
    NET_SPACE['lr'] = (0.0001, 0.1, 'float', 0.15)
    NET_SPACE['weight_decay'] = (0.00001, 0.0004, 'float', 0.2)
    NET_SPACE['optimizer'] =\
        (0, ['sgd', 'adam', 'adadelta', 'rmsprop'], 'list', 0.2)

Having defined the model space, we next need three basic functions:

Generating a random network

    import random

    def random_value(space):
        """Sample a random value from the given space."""
        val = None
        if space[2] == 'int':
            val = random.randint(space[0], space[1])
        if space[2] == 'list':
            val = random.sample(space[1], 1)[0]
        if space[2] == 'float':
            val = ((space[1] - space[0]) * random.random()) + space[0]
        return {'val': val, 'id': random.randint(0, 2**10)}


    def randomize_network(bounded=True):
        """Create a random network."""
        global NET_SPACE, LAYER_SPACE
        net = dict()
        for k in NET_SPACE.keys():
            net[k] = random_value(NET_SPACE[k])

        if bounded:
            # bounded networks start out with a single layer
            net['nb_layers']['val'] = min(net['nb_layers']['val'], 1)

        layers = []
        for i in range(net['nb_layers']['val']):
            layer = dict()
            for k in LAYER_SPACE.keys():
                layer[k] = random_value(LAYER_SPACE[k])
            layers.append(layer)
        net['layers'] = layers
        return net

First, we randomly sample the number of layers and the parameters of each layer, with the sampled values falling within the predefined ranges. When initializing a parameter we also generate a random parameter id. It is not used yet, but it lets us keep track of every layer: when a new model is mutated, the layers that did not change could be warm-started from their parent's weights and only the mutated layers re-initialized, which should significantly speed up and stabilize training.
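
That warm-start idea could look something like the sketch below: a hypothetical helper (not part of the article's repository) that matches layers by their random id and copies over the weights of layers that survived mutation unchanged, so only freshly mutated layers start from scratch.

    def warm_start(child_info, child_fc_layers, parent_info, parent_fc_layers):
        """Hypothetical helper: copy weights of layers whose ids survived mutation.

        `*_info` are the network dicts defined above; `*_fc_layers` are the
        corresponding lists of nn.Linear modules, in layer order."""
        parent_by_id = {layer['nb_units']['id']: module
                        for layer, module in zip(parent_info['layers'], parent_fc_layers)}
        for layer, module in zip(child_info['layers'], child_fc_layers):
            source = parent_by_id.get(layer['nb_units']['id'])
            # same id means nb_units was not resampled; also check the input
            # dimension still matches in case the previous layer changed size
            if source is not None and source.weight.shape == module.weight.shape:
                module.load_state_dict(source.state_dict())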

Note: Depending on the nature of the problem, we may need different constraints, such as the total number of parameters or the total number of layers.

Mutating the network

    # (continues net_builder.py from above: uses random, random_value, NET_SPACE, LAYER_SPACE)
    def mutate_net(net):
        """Mutate a network."""
        global NET_SPACE, LAYER_SPACE

        # mutate optimizer settings
        for k in ['lr', 'weight_decay', 'optimizer']:
            if random.random() < NET_SPACE[k][-1]:
                net[k] = random_value(NET_SPACE[k])

        # mutate layers
        for layer in net['layers']:
            for k in LAYER_SPACE.keys():
                if random.random() < LAYER_SPACE[k][-1]:
                    layer[k] = random_value(LAYER_SPACE[k])

        # mutate number of layers -- randomly add or remove one
        if random.random() < NET_SPACE['nb_layers'][-1]:
            if net['nb_layers']['val'] < NET_SPACE['nb_layers'][1]:
                if random.random() < 0.5:
                    layer = dict()
                    for k in LAYER_SPACE.keys():
                        layer[k] = random_value(LAYER_SPACE[k])
                    net['layers'].append(layer)
                    # value & id update
                    net['nb_layers']['val'] = len(net['layers'])
                    net['nb_layers']['id'] += 1
                else:
                    if net['nb_layers']['val'] > 1:
                        net['layers'].pop()
                        net['nb_layers']['val'] = len(net['layers'])
                        net['nb_layers']['id'] -= 1
        return net

Each element of the network has a chance of mutating; each mutation resamples the corresponding parameter space, changing that parameter's value.

Building a network

    import torch.nn as nn
    import torch.optim as optim


    class Flatten(nn.Module):
        """Flatten 28x28 input images into vectors (assumed helper; the repository defines its own)."""
        def forward(self, x):
            return x.view(x.size(0), -1)


    class CustomModel():
        def __init__(self, build_info, CUDA=True):
            previous_units = 28 * 28
            self.model = nn.Sequential()
            self.model.add_module('flatten', Flatten())
            for i, layer_info in enumerate(build_info['layers']):
                i = str(i)
                self.model.add_module(
                    'fc_' + i,
                    nn.Linear(previous_units, layer_info['nb_units']['val'])
                )
                self.model.add_module(
                    'dropout_' + i,
                    nn.Dropout(p=layer_info['dropout_rate']['val'])
                )
                if layer_info['activation']['val'] == 'tanh':
                    self.model.add_module('tanh_' + i, nn.Tanh())
                if layer_info['activation']['val'] == 'relu':
                    self.model.add_module('relu_' + i, nn.ReLU())
                if layer_info['activation']['val'] == 'sigmoid':
                    self.model.add_module('sigm_' + i, nn.Sigmoid())
                if layer_info['activation']['val'] == 'elu':
                    self.model.add_module('elu_' + i, nn.ELU())
                previous_units = layer_info['nb_units']['val']

            self.model.add_module(
                'classification_layer',
                nn.Linear(previous_units, 10)
            )
            self.model.add_module('softmax', nn.LogSoftmax(dim=1))
            self.model.cpu()

            # learning rate and weight decay come from the evolved genome
            if build_info['optimizer']['val'] == 'adam':
                optimizer = optim.Adam(self.model.parameters(),
                                       lr=build_info['lr']['val'],
                                       weight_decay=build_info['weight_decay']['val'])
            elif build_info['optimizer']['val'] == 'adadelta':
                optimizer = optim.Adadelta(self.model.parameters(),
                                           lr=build_info['lr']['val'],
                                           weight_decay=build_info['weight_decay']['val'])
            elif build_info['optimizer']['val'] == 'rmsprop':
                optimizer = optim.RMSprop(self.model.parameters(),
                                          lr=build_info['lr']['val'],
                                          weight_decay=build_info['weight_decay']['val'])
            else:
                optimizer = optim.SGD(self.model.parameters(),
                                      lr=build_info['lr']['val'],
                                      weight_decay=build_info['weight_decay']['val'],
                                      momentum=0.9)
            self.optimizer = optimizer
            self.cuda = False
            if CUDA:
                self.model.cuda()
                self.cuda = True

The above class instantiates a model from its "genome" (the build_info dictionary).

Now that we have the basic building blocks to build an arbitrary network, change its architecture, and train it, the next step is to build the "genetic algorithm" that selects and mutates individuals. Each model is trained independently, without any information about the other individuals, which allows the optimization process to scale almost linearly with the number of available processing nodes.

Coding the GP optimizer

  1. """Genetic programming algorithms."""
  2. from __future__ import absolute_import
  3. import random
  4. import numpy as np
  5. from operator import itemgetter
  6. import torch.multiprocessing as mp
  7. from net_builder import randomize_network
  8. import copy
  9. from worker import CustomWorker, Scheduler
  10. class TournamentOptimizer:
  11. """Define a tournament play selection process."""
  12. def __init__(self, population_sz, init_fn, mutate_fn, nb_workers=2, use_cuda=True):
  13. """
  14. Initialize optimizer.
  15. params::
  16. init_fn: initialize a model
  17. mutate_fn: mutate function - mutates a model
  18. nb_workers: number of workers
  19. """
  20. self.init_fn = init_fn
  21. self.mutate_fn = mutate_fn
  22. self.nb_workers = nb_workers
  23. self.use_cuda = use_cuda
  24. # population
  25. self.population_sz = population_sz
  26. self.population = [init_fn() for i in range(population_sz)]
  27. self.evaluations = np.zeros(population_sz)
  28. #bookkeeping
  29. self.elite = []
  30. self.stats = []
  31. self.history = []
  32. def step(self):
  33. """Tournament evolution step."""
  34. print('\nPopulation sample:')
  35. for i in range(0,self.population_sz,2):
  36. print(self.population[i]['nb_layers'],
  37. self.population[i]['layers'][0]['nb_units'])
  38. self.evaluate()
  39. children = []
  40. print('\nPopulation mean:{} max:{}'.format(
  41. np.mean(self.evaluations), np.max(self.evaluations)))
  42. n_elite = 2
  43. sorted_pop = np.argsort(self.evaluations)[::-1]
  44. elite = sorted_pop[:n_elite]
  45. # print top@n_elite scores
  46. # elites always included in the next population
  47. self.elite = []
  48. print('\nTop performers:')
  49. for i,e in enumerate(elite):
  50. self.elite.append((self.evaluations[e], self.population[e]))
  51. print("{}-score:{}".format( str(i), self.evaluations[e]))
  52. children.append(self.population[e])
  53. #tournament probabilities:
  54. # first p
  55. # second p*(1-p)
  56. # third p*((1-p)^2)
  57. # etc...
  58. p = 0.85 # winner probability
  59. tournament_size = 3
  60. probs = [p*((1-p)**i) for i in range(tournament_size-1)]
  61. # a little trick to certify that probs is adding up to 1.0
  62. probs.append(1-np.sum(probs))
  63. while len(children) < self.population_sz:
  64. pop = range(len(self.population))
  65. sel_k = random.sample(pop, k=tournament_size)
  66. fitness_k = list(np.array(self.evaluations)[sel_k])
  67. selected = zip(sel_k, fitness_k)
  68. rank = sorted(selected, key=itemgetter(1), reverse=True)
  69. pick = np.random.choice(tournament_size, size=1, p=probs)[0]
  70. best = rank[pick][0]
  71. model = self.mutate_fn(self.population[best])
  72. children.append(model)
  73. self.population = children
  74. # if we want to do a completely completely random search per epoch
  75. # self.population = [randomize_network(bounded=False) for i in range(self.population_sz) ]
  76. def evaluate(self):
  77. """evaluate the models."""
  78. workerids = range(self.nb_workers)
  79. workerpool = Scheduler(workerids, self.use_cuda)
  80. self.population, returns = workerpool.start(self.population)
  81. self.evaluations = returns
  82. self.stats.append(copy.deepcopy(returns))
  83. self.history.append(copy.deepcopy(self.population))

“Evolutionary algorithms” seem pretty simple, right? They are! And they can be very successful, especially if you define good mutation or crossover functions for the individuals.

The repository also contains some additional utility classes, such as the worker and scheduler classes, which let the GP optimizer train and evaluate models independently and in parallel.
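
Those classes are not reproduced here, but a minimal single-process stand-in for a worker might look like the sketch below: it trains a CustomModel on roughly 10,000 MNIST examples and returns the test accuracy that serves as the fitness score. The data-loading details and batch sizes are illustrative assumptions, not the repository's actual worker/scheduler code.

    import torch
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    def evaluate_model(custom_model, n_train=10000, batch_size=64):
        """Train briefly on MNIST and return test accuracy as the fitness score."""
        train_set = datasets.MNIST('data', train=True, download=True,
                                   transform=transforms.ToTensor())
        test_set = datasets.MNIST('data', train=False, download=True,
                                  transform=transforms.ToTensor())
        train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
        test_loader = DataLoader(test_set, batch_size=256)

        model, optimizer = custom_model.model, custom_model.optimizer
        criterion = torch.nn.NLLLoss()  # the model ends with LogSoftmax

        model.train()
        seen = 0
        for x, y in train_loader:          # only ~10k of the 60k training examples
            if custom_model.cuda:
                x, y = x.cuda(), y.cuda()
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
            seen += x.size(0)
            if seen >= n_train:
                break

        model.eval()
        correct = 0
        with torch.no_grad():
            for x, y in test_loader:
                if custom_model.cuda:
                    x, y = x.cuda(), y.cuda()
                correct += (model(x).max(1)[1] == y).sum().item()
        return 100.0 * correct / len(test_set)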

Run the code

Putting the pieces above together, we can run the experiment:

  1. """Tournament play experiment."""
  2. from __future__ import absolute_import
  3. import net_builder
  4. import gp
  5. import cPickle
  6. # Use cuda ?
  7. CUDA_ = True
  8. if __name__ == '__main__':
  9. # setup a tournament!
  10. nb_evolution_steps = 10
  11. tournament = \
  12. gp.TournamentOptimizer(
  13. population_sz=50,
  14. init_fn=net_builder.randomize_network,
  15. mutate_fn=net_builder.mutate_net,
  16. nb_workers=3,
  17. use_cuda=True)
  18. for i in range(nb_evolution_steps):
  19. print('\nEvolution step:{}'.format(i))
  20. print('================')
  21. tournament.step()
  22. # keep track of the experiment results & corresponding architectures
  23. name = "tourney_{}".format(i)
  24. cPickle.dump(tournament.stats, open(name + '.stats','wb'))
  25. cPickle.dump(tournament.history, open(name +'.pop','wb'))

Next, let’s take a look at the results of the operation!

Here are the scores for a population of 50 solutions with a tournament size of 3. The models were trained on only 10,000 examples and then evaluated. At first glance the evolutionary algorithm doesn't seem to be doing much, since solutions are already close to optimal after the first evolution step, and by the seventh step the solutions reach their peak performance. In the figure below, a box plot shows the quartiles of the solutions at each step. We can see that most solutions perform well, and that the box shrinks as the solutions evolve.

Each box in the figure spans the middle two quartiles of the solutions, and the whiskers extend to show the rest of the distribution. The black dot marks the mean score, which we can see rising as the population evolves.
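
The box plot itself is straightforward to reproduce from the stats pickled by the experiment script; here is a possible matplotlib sketch, assuming the final dump (tourney_9.stats), which contains the scores of every evolution step.

    import cPickle  # use `pickle` on Python 3
    import matplotlib.pyplot as plt

    # one list of population scores per evolution step
    stats = cPickle.load(open('tourney_9.stats', 'rb'))

    plt.boxplot(stats, showmeans=True)   # one box per step; the mean is marked with a dot
    plt.xlabel('evolution step')
    plt.ylabel('test score')
    plt.show()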

To better understand the performance and behavior of this method, we compare it with a completely random population search: no evolution takes place between steps, and every solution is re-initialized at random.

The evolutionary algorithm performs better by a relatively small margin (93.66% vs 93.22%). While the random population search does produce some good solutions, the variance of the models is much larger, which means resources are wasted searching suboptimal architectures. Comparing this with the evolution plot, we can see that evolution does generate more consistently useful solutions and successfully improves the structures it finds, which in turn achieve better performance.

  • MNIST is a fairly simple dataset, and even a single-layer network can achieve high accuracy.

  • Optimizers like Adam are less sensitive to the learning rate and will find good solutions as long as the network has enough parameters.

  • During training, the model only looks at 10,000 examples (1/5 of the total training data). If we train longer, a good architecture may achieve higher accuracy.

  • The limited number of samples also matters for the number of layers that can be learned, since deeper models need more data. To address this, the mutation can also remove a layer, allowing the population to regulate its own depth.

The scale of this experiment is not large enough to highlight the full advantage of the method; the experiments in the papers listed below are much larger and use more complex datasets.

We have just implemented a simple evolutionary algorithm that nicely illustrates the theme of "natural selection": our algorithm only selects winning solutions and mutates them to produce offspring. The next step is to use more advanced methods to generate and evolve the population of solutions. Here are some suggestions for improvement:

  • Reuse parent weights for common layers

  • Merge layers from two potential parents

  • The architecture does not have to be sequential; you can explore different kinds of connections between layers (splitting or merging paths, etc.)

  • Add extra layers on top and then fine-tune them.

All of the above are active topics in AI research. One of the more popular methods is NEAT and its extensions. NEAT variants use the evolutionary algorithm not only to evolve the network's structure but also to set its weights. Evolving the agent's weights is quite feasible in a typical reinforcement learning scenario; however, when (x, y) input pairs are available, gradient-descent methods perform better.

Related articles

Evolino: Hybrid Neuroevolution / Optimal Linear Search for Sequence Learning

Evolving Deep Neural Networks — This is a very interesting approach that co-evolves whole networks and blocks within the network; it is very similar to the Evolino method, but for CNNs.

Large-Scale Evolution of Image Classifiers

Convolution by Evolution
