This article is centered on a practical question: how can the applied reinforcement learning community move beyond collections of scripts and one-off solutions towards a reinforcement learning API, a tf.learn or scikit-learn for reinforcement learning? Before discussing the TensorForce framework, we will go over the observations and ideas that motivated the project; if you only want to learn about the API, you can skip this part. We want to emphasize that this article is not an introduction to deep reinforcement learning itself, nor does it propose new models or survey the latest and best algorithms, so pure researchers may not find it very interesting.

Start the engine

Suppose you are a researcher in computer systems, natural language processing, or another application area: you have a basic understanding of reinforcement learning and are interested in applying deep reinforcement learning (deep RL) to control some aspect of your system. There are already many introductory articles on deep reinforcement learning, DQN, vanilla policy gradients, A3C, and so on; Karpathy's post (http://karpathy.github.io/2016/05/31/rl/), for example, describes the intuition behind policy gradient methods very well. You can also find plenty of code to help you get started, such as OpenAI's starter agents (https://github.com/openai/baselines), rllab (https://github.com/openai/rllab), and many implementations of specific algorithms on GitHub. However, we found that there is still a large gap between reinforcement learning research frameworks and practical applications. In practical applications, we may face the following problems:
To be clear, these questions are not meant to criticize researchers for writing code that was never intended to be an API or to be reused in other applications. What we present here is the perspective of researchers who want to apply reinforcement learning in different domains.

TensorForce API

TensorForce provides a declarative interface to robust implementations of deep reinforcement learning algorithms. It is meant to be used as a library in applications that want to employ deep reinforcement learning, and it lets users experiment with different configurations and network architectures without worrying about all the underlying plumbing. We fully understand that current deep reinforcement learning methods tend to be brittle and require a lot of fine-tuning, but that does not mean we cannot build a general software infrastructure for reinforcement learning solutions.

TensorForce is not a collection of raw algorithm implementations, because those serve research simulations rather than applications, and it would take a lot of extra work to use raw implementations in real-world applications. Any such framework inevitably contains architectural decisions that make non-standard things more annoying (leaky abstractions), which is why hardcore reinforcement learning researchers may prefer to build their models from scratch. With TensorForce, our goal is to capture the overall direction of the current best research, including emerging insights and standards. Next, we dive into the fundamental aspects of the TensorForce API and discuss our design choices.

Creating and configuring agents

We start by creating a reinforcement learning agent with the TensorForce API:

    from tensorforce import Configuration
    from tensorforce.agents import DQNAgent
    from tensorforce.core.networks import layered_network_builder

    # Define a network builder from an ordered list of layers
    layers = [dict(type='dense', size=32),
              dict(type='dense', size=32)]
    network = layered_network_builder(layers_config=layers)

    # Define a state
    states = dict(shape=(10,), type='float')

    # Define an action (models internally assert whether
    # they support continuous and/or discrete control)
    actions = dict(continuous=False, num_actions=5)

    # The agent is configured with a single configuration object
    agent_config = Configuration(
        batch_size=8,
        learning_rate=0.001,
        memory_capacity=800,
        first_update=80,
        repeat_update=4,
        target_update_frequency=20,
        states=states,
        actions=actions,
        network=network
    )
    agent = DQNAgent(config=agent_config)

The states and actions in this example are short-forms of more general state/action specifications. For example, a multi-modal input consisting of an image and a caption can be defined as shown below; multi-output actions can be defined in the same way (a sketch of such an action specification is given at the end of this subsection). Note that the short-form for a single state/action must then be used consistently throughout the code when communicating with the agent.

    states = dict(
        image=dict(shape=(64, 64, 3), type='float'),
        caption=dict(shape=(20,), type='int')
    )

Configuration parameters depend on the base agent and model used. A full list of parameters for each agent can be found in the example configurations: https://github.com/reinforceio/tensorforce/tree/master/examples/configs
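As a rough illustration of the multi-output case mentioned above, a specification along the following lines could be used. This is a hedged sketch: the action names (move, fire) and the combination of fields are made up for this example, and whether a given model supports mixing continuous and discrete outputs depends on the model.

    # Hypothetical multi-output action specification (the names are illustrative only)
    actions = dict(
        move=dict(continuous=True),                 # a continuous action output
        fire=dict(continuous=False, num_actions=2)  # a discrete action output with two choices
    )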
TensorForce currently provides the following reinforcement learning algorithms:

The last item means that there is no A3CAgent as such, because A3C actually describes a mechanism for asynchronous updates rather than a specific agent. The asynchronous update mechanism based on distributed TensorFlow is therefore part of the common Model base class from which all agents are derived. As described in the paper "Asynchronous Methods for Deep Reinforcement Learning", A3C is implicitly obtained by setting the distributed flag for the VPGAgent. It should be noted that A3C is not the optimal distributed update strategy for every model (and for some models it makes no sense at all); we will discuss implementing other methods (such as PAAC) at the end of this article. The important point is to conceptually separate agent and update semantics from execution semantics.

We also want to point out the difference between models and agents. The Agent class defines the interface for using reinforcement learning as an API and manages tasks such as passing in observations, preprocessing, and exploration. The two key methods are agent.act(state) and agent.observe(reward, terminal). agent.act(state) returns an action, while agent.observe(reward, terminal) updates the model according to the agent's mechanism, for example off-policy memory replay (MemoryAgent) or on-policy batching (BatchAgent). Note that these functions must be called alternately for the agent's internal mechanisms to work correctly. The Model class implements the core reinforcement learning algorithm and provides the necessary interface through its get_action and update methods, which the agent calls internally at the relevant points. For example, DQNAgent is a MemoryAgent with a DQNModel and one extra line (for target network updates):

    def observe(self, reward, terminal):
        super(DQNAgent, self).observe(reward, terminal)
        if self.timestep >= self.first_update \
                and self.timestep % self.target_update_frequency == 0:
            self.model.update_target()

Neural network configuration

A key problem in reinforcement learning is designing an effective value function. Conceptually, we think of the model as a description of the update mechanism, which is distinct from the thing that is actually being updated: in the case of deep reinforcement learning, one or more neural networks. Therefore there is no hard-coded network in the model; instead, different networks are instantiated depending on the configuration.

In the example above, we programmatically created a network configuration as a list of dictionaries describing each layer. Such a configuration can also be given as JSON, which is then turned into a network builder by a utility function. Here is an example JSON network specification:

    [
        {
            "type": "conv2d",
            "size": 32,
            "window": 8,
            "stride": 4
        },
        {
            "type": "conv2d",
            "size": 64,
            "window": 4,
            "stride": 2
        },
        {
            "type": "flatten"
        },
        {
            "type": "dense",
            "size": 512
        }
    ]

As before, this configuration must be added to the agent's configuration object:

    from tensorforce.core.networks import from_json

    agent_config = Configuration(
        ...
        network=from_json('configs/network_config.json')
        ...
    )

The default activation is relu, but other activation functions are available (currently elu, selu, softmax, tanh, and sigmoid). Other properties of a layer can also be modified; for example, a dense layer could look like this:

    [
        {
            "type": "dense",
            "size": 64,
            "bias": false,
            "activation": "selu",
            "l2_regularization": 0.001
        }
    ]
We chose not to use existing layer implementations (e.g. from tf.layers) so that we can exert explicit control over the internal operations and make sure they integrate correctly with the rest of TensorForce. We also wanted to avoid dependencies on dynamic wrapper libraries and therefore only depend on lower-level TensorFlow operations. Our layer library currently provides only a small number of basic layer types, but it will be expanded in the future.

In addition, you can easily integrate your own layers. Here is an example of a batch normalization layer:

    import tensorflow as tf

    def batch_normalization(x, variance_epsilon=1e-6):
        mean, variance = tf.nn.moments(x, axes=tuple(range(x.shape.ndims - 1)))
        x = tf.nn.batch_normalization(x, mean=mean, variance=variance,
                                      offset=None, scale=None,
                                      variance_epsilon=variance_epsilon)
        return x

    {
        "type": "[YOUR_MODULE].batch_normalization",
        "variance_epsilon": 1e-9
    }

So far we have shown TensorForce's ability to create layered networks, that is, networks that take a single input state tensor and apply a sequence of layers to produce a single output tensor. In some cases, however, it may be necessary or preferable to deviate from this stacked-layer structure, most notably when multiple input states have to be processed, which cannot naturally be accomplished with a single sequence of layers. We currently do not provide a higher-level configuration interface that automatically creates the corresponding network builder, so for such cases you have to define the network builder function programmatically and add it to the agent configuration as before. For the earlier multimodal input example (image and caption), we could define a network as follows:

    def network_builder(inputs):
        image = inputs['image']      # 64x64x3-dim, float
        caption = inputs['caption']  # 20-dim, int

        with tf.variable_scope('cnn'):
            weights = tf.Variable(tf.random_normal(shape=(3, 3, 3, 16), stddev=0.01))
            image = tf.nn.conv2d(image, filter=weights, strides=(1, 1, 1, 1), padding='SAME')
            image = tf.nn.relu(image)
            image = tf.nn.max_pool(image, ksize=(1, 2, 2, 1), strides=(1, 2, 2, 1), padding='SAME')

            weights = tf.Variable(tf.random_normal(shape=(3, 3, 16, 32), stddev=0.01))
            image = tf.nn.conv2d(image, filter=weights, strides=(1, 1, 1, 1), padding='SAME')
            image = tf.nn.relu(image)
            image = tf.nn.max_pool(image, ksize=(1, 2, 2, 1), strides=(1, 2, 2, 1), padding='SAME')

            image = tf.reshape(image, shape=(-1, 16 * 16, 32))
            image = tf.reduce_mean(image, axis=1)

        with tf.variable_scope('lstm'):
            weights = tf.Variable(tf.random_normal(shape=(30, 32), stddev=0.01))
            caption = tf.nn.embedding_lookup(params=weights, ids=caption)
            # 32 LSTM units so the caption embedding matches the 32-dim image embedding
            # used in the element-wise product below
            lstm = tf.contrib.rnn.LSTMCell(num_units=32)
            caption, _ = tf.nn.dynamic_rnn(cell=lstm, inputs=caption, dtype=tf.float32)
            caption = tf.reduce_mean(caption, axis=1)

        return tf.multiply(image, caption)

    agent_config = Configuration(
        ...
        network=network_builder
        ...
    )

Internal State and Episode Management

Unlike the classic supervised learning setting, in which instances and neural network calls are considered independent, the time steps of an episode in reinforcement learning depend on previous actions and also influence subsequent states. So, in addition to its state input and action output at each time step, it is conceivable that the neural network has internal states within the episode, with corresponding inputs/outputs at each time step. (The original article includes a diagram showing how such a network works over time; the sketch below conveys the same idea.) The management of these internal states, i.e. propagating them forward between time steps and resetting them when a new episode starts, can be handled entirely by TensorForce's agent and model classes.
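The following pseudocode is a rough, purely illustrative sketch of that idea: the helper names (internal_state_init, network_apply) are made up here and do not correspond to TensorForce's actual internals.

    # Illustrative sketch only; not TensorForce's actual internal code.
    internals = internal_state_init()        # e.g. zeroed LSTM state at the start of an episode
    state = environment.reset()
    while True:
        # The network output and the updated internal state are threaded through together.
        action, internals = network_apply(state=state, internals=internals)
        state, reward, terminal = environment.execute(action=action)
        if terminal:
            internals = internal_state_init()  # reset the internal state for the next episode
            state = environment.reset()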
Note that this handles all relevant use cases (a single episode within a batch, multiple episodes within a batch, and episodes without terminals within a batch). So far, the LSTM layer type makes use of this functionality:

    [
        {
            "type": "dense",
            "size": 32
        },
        {
            "type": "lstm"
        }
    ]

In this example architecture, the output of the dense layer is fed into an LSTM cell, which then produces the final output for that time step. When the LSTM is advanced by one step, its internal state is updated and returned as the internal state output. At the next time step, the network receives the new state input together with this internal state, advances the LSTM one more step, and outputs the actual output and the new internal LSTM state, and so on.

For custom implementations of layers with internal state, the function must return not only the output of the layer, but also a list of internal state input placeholders, the corresponding internal state output tensors, and a list of internal state initialization tensors (all of the same length and in that order). The following snippet shows a simplified version of our LSTM layer implementation and illustrates how a custom layer with internal state can be defined:

    import numpy as np

    def lstm(x):
        size = x.get_shape()[1].value
        internal_input = tf.placeholder(dtype=tf.float32, shape=(None, 2, size))
        lstm = tf.contrib.rnn.LSTMCell(num_units=size)
        state = tf.contrib.rnn.LSTMStateTuple(internal_input[:, 0, :], internal_input[:, 1, :])
        x, state = lstm(inputs=x, state=state)
        internal_output = tf.stack(values=(state.c, state.h), axis=1)
        internal_init = np.zeros(shape=(2, size))
        return x, [internal_input], [internal_output], [internal_init]

Preprocessing states

We can define preprocessing steps to be applied to states (or to multiple states, if specified as a dictionary of lists), for example to downsample visual input. The following example gives the Arcade Learning Environment preprocessing stack used by most DQN implementations:

    config = Configuration(
        ...
        preprocessing=[
            dict(
                type='image_resize',
                kwargs=dict(width=84, height=84)
            ),
            dict(
                type='grayscale'
            ),
            dict(
                type='center'
            ),
            dict(
                type='sequence',
                kwargs=dict(
                    length=4
                )
            )
        ]
        ...
    )

Each preprocessor in the stack has a type and, optionally, a list of args and/or a dictionary of kwargs. For example, the sequence preprocessor takes the four most recent states (i.e. frames) and stacks them together to approximate the Markov property. As an aside, this is obviously not necessary when using, for example, the LSTM layer described above, since the LSTM layer can model and communicate temporal dependencies through its internal state.

Exploration

Exploration can be defined in the configuration object and is applied by the agent to the actions its model decides on (to handle multiple actions, again, a dictionary of specifications can be given). For example, to use Ornstein-Uhlenbeck exploration for continuous action outputs, the following specification would be added to the configuration:

    config = Configuration(
        ...
        exploration=dict(
            type='OrnsteinUhlenbeckProcess',
            kwargs=dict(
                sigma=0.1,
                mu=0,
                theta=0.1
            )
        )
        ...
    )

The following lines add epsilon exploration for discrete actions, decaying over time to a final value:

    config = Configuration(
        ...
        exploration=dict(
            type='EpsilonDecay',
            kwargs=dict(
                epsilon=1,
                epsilon_final=0.01,
                epsilon_timesteps=1e6
            )
        )
        ...
    )
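For intuition, the schedule configured above can be read as follows. This is a hedged interpretation that assumes a simple linear decay, which is the common convention; check the TensorForce source for the exact behaviour.

    # Assumed (linear) epsilon decay schedule implied by the configuration above.
    def epsilon_at(timestep, epsilon=1.0, epsilon_final=0.01, epsilon_timesteps=1e6):
        fraction = min(timestep / float(epsilon_timesteps), 1.0)
        return epsilon + fraction * (epsilon_final - epsilon)

    # epsilon_at(0) == 1.0, epsilon_at(5e5) == 0.505, epsilon_at(1e6) == 0.01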
Using the Agent with the Runner Utility Function

Let's put the agent to use. The following code runs an agent on our minimal test environment (https://github.com/reinforceio/tensorforce/blob/master/tensorforce/environments/minimal_test.py), which we use for continuous integration: a minimal environment that verifies the action, observation, and update mechanics of a given agent/model. Note that all of our environment implementations (OpenAI Gym, OpenAI Universe, DeepMind Lab) use the same interface, so it is straightforward to run the tests against another environment.

The Runner utility facilitates running an agent on an environment. Given any agent and environment instance, it manages the number of episodes, the maximum length of each episode, termination conditions, and so on. The Runner also accepts a cluster_spec argument, in which case it manages distributed execution (TensorFlow supervisors/sessions/etc.). Via the optional episode_finished argument you can periodically report results and stop execution before the maximum number of episodes is reached.

    # Import paths assumed from the repository layout linked above.
    from tensorforce import Configuration
    from tensorforce.agents import DQNAgent
    from tensorforce.core.networks import layered_network_builder
    from tensorforce.environments.minimal_test import MinimalTest
    from tensorforce.execution import Runner

    environment = MinimalTest(continuous=False)

    network_config = [
        dict(type='dense', size=32)
    ]
    agent_config = Configuration(
        batch_size=8,
        learning_rate=0.001,
        memory_capacity=800,
        first_update=80,
        repeat_update=4,
        target_update_frequency=20,
        states=environment.states,
        actions=environment.actions,
        network=layered_network_builder(network_config)
    )
    agent = DQNAgent(config=agent_config)
    runner = Runner(agent=agent, environment=environment)

    def episode_finished(runner):
        if runner.episode % 100 == 0:
            print(sum(runner.episode_rewards[-100:]) / 100)
        return runner.episode < 100 \
            or not all(reward >= 1.0 for reward in runner.episode_rewards[-100:])

    runner.run(episodes=1000, episode_finished=episode_finished)

For completeness, we also give the minimal loop for running an agent on an environment explicitly:

    # max_timesteps and max_episodes are assumed to be defined elsewhere.
    episode = 0
    episode_rewards = list()

    while True:
        state = environment.reset()
        agent.reset()

        timestep = 0
        episode_reward = 0
        while True:
            action = agent.act(state=state)
            state, reward, terminal = environment.execute(action=action)
            agent.observe(reward=reward, terminal=terminal)

            timestep += 1
            episode_reward += reward

            if terminal or timestep == max_timesteps:
                break

        episode += 1
        episode_rewards.append(episode_reward)

        if all(reward >= 1.0 for reward in episode_rewards[-100:]) \
                or episode == max_episodes:
            break

As mentioned in the introduction, whether to use a runner class in a given application scenario depends on the flow of control. If it is reasonable for TensorForce to query state information (for instance from a queue or a network service) and to return actions (to another queue or service), then the environment interface can be implemented and the runner utility can be used (or extended). A more common situation is that TensorForce is used as a library driven by an external application, so no environment handle can be provided. For researchers this may not matter much, but in fields such as computer systems it is a typical deployment problem, and it is the root cause of why many research scripts can only be used in simulation and cannot be applied in practice.

Another point worth mentioning is that the declarative central configuration object lets us directly expose all components of the reinforcement learning model, in particular the network architecture, to hyperparameter optimization.
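As an illustration of that last point, the sketch below runs a simple random search over a couple of hyperparameters by constructing a fresh Configuration per trial, reusing the classes and imports from the Runner example above. This is only a sketch: the parameter ranges and trial count are made up for this example, and a real search would also vary the network and use a proper tuning library.

    # Hedged sketch: random search over two hyperparameters via the declarative
    # Configuration object. Ranges and trial count are illustrative only.
    import random

    def run_trial(learning_rate, layer_size, episodes=500):
        environment = MinimalTest(continuous=False)
        network = layered_network_builder([dict(type='dense', size=layer_size)])
        config = Configuration(
            batch_size=8,
            learning_rate=learning_rate,
            memory_capacity=800,
            first_update=80,
            repeat_update=4,
            target_update_frequency=20,
            states=environment.states,
            actions=environment.actions,
            network=network
        )
        agent = DQNAgent(config=config)
        runner = Runner(agent=agent, environment=environment)
        runner.run(episodes=episodes)
        # Score the trial by the mean reward of its last 100 episodes.
        return sum(runner.episode_rewards[-100:]) / 100.0

    results = []
    for trial in range(10):
        lr = 10 ** random.uniform(-4, -2)    # sample the learning rate log-uniformly
        size = random.choice([16, 32, 64])   # sample a hidden layer size
        score = run_trial(learning_rate=lr, layer_size=size)
        results.append((score, lr, size))

    print(max(results))  # best (score, learning_rate, layer_size) found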
Further Thoughts

We hope you find TensorForce useful. So far our focus has been on getting the architecture right, which we believe will let us implement different RL concepts and new approaches more consistently, and spare us inconveniences when exploring deep RL use cases in new domains. In such a rapidly evolving field, it can be difficult to decide which functionality to include in an actual library. There is a plethora of algorithms and concepts, and it seems that every week new ideas achieve better results on a subset of the Arcade Learning Environment (ALE) environments. There is a problem, however: many ideas only work well in environments that are easy to parallelize or that have a particular episode structure, and we do not yet have a good understanding of which environment properties matter and how they relate to the different approaches. Nevertheless, we can see some clear trends:
In general, we are tracking these developments and will include existing techniques that we have missed so far (there are probably quite a few), as well as new ideas once we believe they have the potential to become robust standard methods. In this sense, we are not explicitly competing with research frameworks, but rather aiming for a higher level of coverage.

This article is republished from Machine Heart; the original text comes from reinforce.io, and the authors are Michael Schaarschmidt, Alexander Kuhnle, and Kai Fricke.