Building a reinforcement learning API based on TensorFlow: How was TensorForce created?

This article centers on a practical question: how can the applied reinforcement learning community move one step beyond collections of scripts and one-off examples toward a reinforcement learning API - a tf-learn or scikit-learn for reinforcement learning? Before discussing the TensorForce framework, we will talk about the observations and ideas that inspired the project. If you just want to learn about the API, you can skip this part. We want to emphasize that this article is not an introduction to deep reinforcement learning itself, nor does it propose new models or survey the latest and best algorithms, so pure researchers may find it less interesting.

Start the engine

Suppose you are a researcher in computer systems, natural language processing, or another application area: you have a basic understanding of reinforcement learning and are interested in applying deep reinforcement learning (deep RL) to control some aspect of your system.

There are already many introductory articles on deep reinforcement learning, DQN, vanilla policy gradients, A3C, etc. For example, Karpathy's article (http://karpathy.github.io/2016/05/31/rl/) describes the intuition behind policy gradient methods very well. In addition, you can also find a lot of code to help you get started, such as OpenAI's starter agents (https://github.com/openai/baselines), rllab (https://github.com/openai/rllab), and many specific algorithms on GitHub.

However, we found that there is still a huge gap between the development of reinforcement learning research frameworks and practical applications. In practical applications, we may face the following problems:

  • Tight coupling of RL logic to simulation handles: Simulation environment APIs are very convenient; they let us create an environment object and use it in a for loop while managing its internal update logic (e.g. by collecting output features). This is reasonable if the goal is to evaluate an RL idea, but it is much harder to separate the RL code from the simulation environment. It also raises a question of control flow: does the RL code call the environment when it is ready, or does the environment call the RL agent when it needs a decision? For an applied RL library used across many domains, we often need the latter.
  • Fixed network architectures: Most implementations hard-code the neural network architecture. This is usually not a big problem, since it is straightforward to add or remove layers as needed. Still, it would be much better if an RL library offered this functionality through a declarative interface, without requiring modifications to the library code. Moreover, there are cases where changing the architecture is (surprisingly) much harder, such as when managing internal state (see below).
  • Incompatible state/action interfaces: A lot of early open source code uses the popular OpenAI Gym environments, with a simple interface of a flat state input and a single discrete or continuous action output. DeepMind Lab, however, uses a dictionary format, typically with multiple states and actions, and OpenAI Universe uses named key events. Ideally, we would like reinforcement learning agents that can handle an arbitrary number of states and actions, of potentially different types and shapes. For example, one of the TensorForce authors is using reinforcement learning for NLP and wants to handle multimodal inputs, where a state conceptually consists of two inputs - an image and a corresponding caption.
  • Opaque execution setup and performance issues: When writing TensorFlow code, it is natural to focus on the logic first. This can lead to a lot of repeated or unnecessary operations and unnecessary intermediate values. Furthermore, the goals of distributed/asynchronous/parallel reinforcement learning are somewhat fluid, and distributed TensorFlow requires a fair amount of manual tuning for a specific hardware setup. It would be nice if there were eventually an execution configuration that only required declaring the available devices or machines, with everything else handled internally - for example, two machines with different IPs running asynchronous VPG.

To be clear, these points are not meant to criticize researchers for writing code that was never intended to be used as an API or in other applications. What we present here is the perspective of researchers who want to apply reinforcement learning to different domains.

TensorForce API

TensorForce provides a declarative interface to robust implementations of deep reinforcement learning algorithms. It is meant to be used as a library in applications that want to employ deep reinforcement learning, and it lets users experiment with different configurations and network architectures without having to worry about all the underlying details. We fully understand that current deep reinforcement learning methods tend to be brittle and require a lot of fine-tuning, but this does not mean we cannot build a general software infrastructure for reinforcement learning solutions.

TensorForce is not a collection of raw algorithm implementations, because the goal is not research simulation, and raw implementations take a lot of extra work to use in real-world applications. Any framework of this kind will inevitably contain architectural decisions that make non-standard things more annoying (leaky abstractions). This is why hardcore reinforcement learning researchers may prefer to build their models from scratch. With TensorForce, our goal is to capture the overall direction of the current best research, including emerging insights and standards.

Next, we'll dive into fundamental aspects of the TensorForce API and discuss our design choices.

Creating and configuring agents

We first start by creating a reinforcement learning agent using the TensorForce API:

from tensorforce import Configuration
from tensorforce.agents import DQNAgent
from tensorforce.core.networks import layered_network_builder

# Define a network builder from an ordered list of layers
layers = [dict(type='dense', size=32),
          dict(type='dense', size=32)]
network = layered_network_builder(layers_config=layers)

# Define a state
states = dict(shape=(10,), type='float')

# Define an action (models internally assert whether
# they support continuous and/or discrete control)
actions = dict(continuous=False, num_actions=5)

# The agent is configured with a single configuration object
agent_config = Configuration(
    batch_size=8,
    learning_rate=0.001,
    memory_capacity=800,
    first_update=80,
    repeat_update=4,
    target_update_frequency=20,
    states=states,
    actions=actions,
    network=network
)
agent = DQNAgent(config=agent_config)

The states and actions in this example are short forms of more general state/action definitions. For example, a multimodal input consisting of an image and a caption can be defined as follows; multi-output actions can be defined similarly (an illustrative example follows the state definition below). Note that throughout the code, the short form of a single state/action must be used consistently when communicating with the agent.

states = dict(
    image=dict(shape=(64, 64, 3), type='float'),
    caption=dict(shape=(20,), type='int')
)
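
For instance, a combination of two discrete actions could be specified as follows (a sketch; the action names "move" and "fire" are made up for illustration and only reuse the fields shown in the single-action example above):

actions = dict(
    move=dict(continuous=False, num_actions=5),
    fire=dict(continuous=False, num_actions=2)
)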

Configuration parameters depend on the base agent and model used. A full list of parameters for each agent can be found in this example config: https://github.com/reinforceio/tensorforce/tree/master/examples/configs

TensorForce currently provides the following reinforcement learning algorithms:

  • Random Agent Baseline (RandomAgent)
  • Vanilla policy gradient with generalized advantage estimation (VPGAgent)
  • Trust Region Policy Optimization (TRPOAgent)
  • Deep Q-Learning / Double Deep Q-Learning (DQNAgent)
  • Normalized Advantage Functions (NAFAgent)
  • Deep Q-Learning from Demonstrations (DQFDAgent)
  • Asynchronous Advantage Actor-Critic (A3C) (can be used implicitly via distributed)

The last item means that there is no A3CAgent, because A3C actually describes an asynchronous update mechanism rather than a specific agent. The asynchronous update mechanism built on distributed TensorFlow is therefore part of the common Model base class from which all agents are derived. A3C, as described in the paper "Asynchronous Methods for Deep Reinforcement Learning", is obtained implicitly by setting the distributed flag for VPGAgent. It should be noted that A3C is not the optimal distributed update strategy for every model (and for some models it makes no sense at all); we will discuss implementing other approaches (such as PAAC) at the end of this article. The important point is to conceptually separate the question of agent and update semantics from that of execution semantics.

We also want to point out the difference between models and agents. The Agent class defines the interface for using reinforcement learning as an API and manages tasks such as passing in observations, preprocessing, and exploration. Its two key methods are agent.act(state) and agent.observe(reward, terminal): agent.act(state) returns an action, while agent.observe(reward, terminal) updates the model according to the agent's mechanism, such as off-policy memory replay (MemoryAgent) or on-policy batching (BatchAgent). Note that these functions must be called alternately for the agent's internal mechanisms to work correctly. The Model class implements the core reinforcement learning algorithm and exposes the necessary interface via the get_action and update methods, which the agent calls internally at the relevant points. For example, DQNAgent is a MemoryAgent with a DQNModel and one extra line (for the target network update):

def observe(self, reward, terminal):
    super(DQNAgent, self).observe(reward, terminal)
    if self.timestep >= self.first_update \
            and self.timestep % self.target_update_frequency == 0:
        self.model.update_target()
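
To make the required alternation of act() and observe() concrete, here is a minimal interaction sketch (the random state is just a placeholder matching the (10,)-float specification from the first example; in a real application, states and rewards come from the controlled system):

import numpy as np

# One control step: request an action for the current state, then report
# the observed outcome before the next act() call.
state = np.random.rand(10).astype(np.float32)
action = agent.act(state=state)
agent.observe(reward=0.0, terminal=False)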

Neural network configuration

A key problem in reinforcement learning is designing an effective value function. Conceptually, we think of the model as a description of the update mechanism, which is distinct from what is actually updated - in the case of deep reinforcement learning, one (or more) neural networks. Therefore, there is no hard-coded network in the model, but rather different instantiations depending on the configuration.

In the example above, we programmatically created a network configuration as a list of dictionaries describing each layer. Such a configuration can also be given via JSON, which is then turned into a network constructor using a utility function. Here is an example of a JSON network specification:

[
    {
        "type": "conv2d",
        "size": 32,
        "window": 8,
        "stride": 4
    },
    {
        "type": "conv2d",
        "size": 64,
        "window": 4,
        "stride": 2
    },
    {
        "type": "flatten"
    },
    {
        "type": "dense",
        "size": 512
    }
]

As before, this configuration must be added to the agent's configuration object:

from tensorforce.core.networks import from_json

agent_config = Configuration(
    ...
    network=from_json('configs/network_config.json')
    ...
)

The default activation function is relu, but other activation functions are available (currently elu, selu, softmax, tanh, and sigmoid). You can also modify other properties of a layer; for example, a dense layer can be changed to look like this:

[
    {
        "type": "dense",
        "size": 64,
        "bias": false,
        "activation": "selu",
        "l2_regularization": 0.001
    }
]

We chose not to use existing layer implementations (e.g. from tf.layers) so that we could exert explicit control over the internal operations and ensure that they would integrate correctly with the rest of TensorForce. We wanted to avoid dependencies on dynamic wrapper libraries and thus only depend on lower-level TensorFlow operations.

Our layer library currently provides only a very small number of basic layer types, but it will be expanded in the future. In addition, you can easily integrate your own layers. Here is an example of a batch normalization layer:

import tensorflow as tf

def batch_normalization(x, variance_epsilon=1e-6):
    # Normalize over all axes except the last (feature) axis.
    mean, variance = tf.nn.moments(x, axes=tuple(range(x.shape.ndims - 1)))
    x = tf.nn.batch_normalization(x, mean=mean, variance=variance,
                                  offset=None, scale=None,
                                  variance_epsilon=variance_epsilon)
    return x

The custom layer can then be referenced in a network configuration via its module path:

{
    "type": "[YOUR_MODULE].batch_normalization",
    "variance_epsilon": 1e-9
}

So far, we've shown TensorForce's ability to create layered networks, that is, a network that takes a single input state tensor, and has a sequence of layers that produce a single output tensor. However, in some cases, it may be necessary or more appropriate to deviate from this layer stacking structure. This is most notably required when there are multiple input states to process, which cannot be naturally accomplished using a single sequence of processing layers.

We currently do not provide a higher-level configuration interface for automatically creating the corresponding network builder. For such cases, the network builder function therefore has to be defined programmatically and added to the agent configuration, as before. For the earlier multimodal input (image and caption) example, a network could be defined as follows:

import tensorflow as tf

def network_builder(inputs):
    image = inputs['image']      # 64x64x3-dim, float
    caption = inputs['caption']  # 20-dim, int

    with tf.variable_scope('cnn'):
        weights = tf.Variable(tf.random_normal(shape=(3, 3, 3, 16), stddev=0.01))
        image = tf.nn.conv2d(image, filter=weights, strides=(1, 1, 1, 1), padding='SAME')
        image = tf.nn.relu(image)
        image = tf.nn.max_pool(image, ksize=(1, 2, 2, 1), strides=(1, 2, 2, 1), padding='SAME')

        weights = tf.Variable(tf.random_normal(shape=(3, 3, 16, 32), stddev=0.01))
        image = tf.nn.conv2d(image, filter=weights, strides=(1, 1, 1, 1), padding='SAME')
        image = tf.nn.relu(image)
        image = tf.nn.max_pool(image, ksize=(1, 2, 2, 1), strides=(1, 2, 2, 1), padding='SAME')

        # Two 2x2 poolings reduce 64x64 to 16x16; average over spatial positions.
        image = tf.reshape(image, shape=(-1, 16 * 16, 32))
        image = tf.reduce_mean(image, axis=1)

    with tf.variable_scope('lstm'):
        weights = tf.Variable(tf.random_normal(shape=(30, 32), stddev=0.01))
        caption = tf.nn.embedding_lookup(params=weights, ids=caption)
        # 32 units so the caption vector matches the 32-dim image vector
        # for the elementwise product below.
        lstm = tf.contrib.rnn.LSTMCell(num_units=32)
        caption, _ = tf.nn.dynamic_rnn(cell=lstm, inputs=caption, dtype=tf.float32)
        caption = tf.reduce_mean(caption, axis=1)

    return tf.multiply(image, caption)


agent_config = Configuration(
    ...
    network=network_builder
    ...
)

Internal State and Episode Management

Unlike the classic supervised learning setting, where instances and neural network calls are considered independent, the time steps of a reinforcement learning episode depend on previous actions and also influence subsequent states. It is therefore conceivable that, in addition to its state input and action output at each time step, the neural network has internal states whose per-time-step inputs and outputs are carried forward within the episode.

The management of these internal states (i.e., propagating them forward between time steps and resetting them when starting a new episode) can be handled entirely by TensorForce's agent and model classes. Note that this handles all relevant use cases (one episode within a batch, multiple episodes within a batch, and episodes without terminals within a batch). So far, the LSTM layer type takes advantage of this functionality:

[
    {
        "type": "dense",
        "size": 32
    },
    {
        "type": "lstm"
    }
]

In this example architecture, the output of the dense layer is fed into an LSTM cell, which then produces the final output for that time step. When the LSTM advances one step, its internal state is updated and returned as the internal state output. At the next time step, the network receives the new state input together with this internal state, advances the LSTM one further step, and outputs the actual output and the new internal LSTM state, and so on.

For custom implementations of layers with internal state, the function must return not only the output of the layer, but also a list of internal state input placeholders, the corresponding internal state output tensors, and a list of internal state initialization tensors (all of the same length and in that order). The following code snippet shows our LSTM layer implementation (a simplified version) and illustrates how a custom layer with internal state is defined:

import numpy as np
import tensorflow as tf

def lstm(x):
    size = x.get_shape()[1].value
    # Internal state: stacked LSTM cell state and hidden state per batch entry.
    internal_input = tf.placeholder(dtype=tf.float32, shape=(None, 2, size))
    lstm = tf.contrib.rnn.LSTMCell(num_units=size)
    state = tf.contrib.rnn.LSTMStateTuple(internal_input[:, 0, :],
                                          internal_input[:, 1, :])
    x, state = lstm(inputs=x, state=state)
    internal_output = tf.stack(values=(state.c, state.h), axis=1)
    internal_init = np.zeros(shape=(2, size))
    return x, [internal_input], [internal_output], [internal_init]

Preprocessing states

We can define preprocessing steps to be applied to states (or to multiple states, if specified as a dictionary of lists), for example to downsample visual inputs. The following example gives the Arcade Learning Environment preprocessing stack used by most DQN implementations:

config = Configuration(
    ...
    preprocessing=[
        dict(
            type='image_resize',
            kwargs=dict(width=84, height=84)
        ),
        dict(
            type='grayscale'
        ),
        dict(
            type='center'
        ),
        dict(
            type='sequence',
            kwargs=dict(
                length=4
            )
        )
    ]
    ...
)

Each preprocessor in the stack has a type, plus an optional list of args and/or a dictionary of kwargs. For example, the sequence preprocessor takes the four most recent states (i.e. frames) and stacks them together to approximate the Markov property. As an aside: this is obviously not necessary when using, say, the LSTM layer mentioned above, since the LSTM layer can model and carry temporal dependencies through its internal state.
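
As a rough standalone illustration of what the sequence preprocessor produces (a plain numpy sketch, not TensorForce's internal implementation; the 84x84 frames match the resize step above):

import collections
import numpy as np

frames = collections.deque(maxlen=4)  # keep only the four most recent frames
for _ in range(4):
    frame = np.zeros((84, 84))        # placeholder for a preprocessed grayscale frame
    frames.append(frame)
state = np.stack(list(frames), axis=-1)  # shape (84, 84, 4), passed to the network as one state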

Exploration

Exploration can be defined in the configuration object and is applied by the agent to the actions its model decides on (to handle multiple actions, a dictionary of specifications is again given; see the sketch after the two examples below). For instance, to use an Ornstein-Uhlenbeck process for exploring a continuous action output, the following specification would be added to the configuration.

config = Configuration(
    ...
    exploration = dict(
        type='OrnsteinUhlenbeckProcess',
        kwargs=dict(
            sigma=0.1,
            mu=0,
            theta=0.1
        )
    )
    ...
)

The following configuration adds epsilon exploration for discrete actions, with epsilon decaying over time to a final value:

config = Configuration(
    ...
    exploration = dict(
        type='EpsilonDecay',
        kwargs=dict(
            epsilon=1,
            epsilon_final=0.01,
            epsilon_timesteps=1e6
        )
    )
    ...
)
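
For multiple named actions, the exploration specification is presumably given as a dictionary keyed by action name, mirroring the action definition (a sketch under that assumption, reusing the hypothetical action names from the multi-action example above; it is then added to the Configuration as before):

exploration = dict(
    move=dict(
        type='EpsilonDecay',
        kwargs=dict(epsilon=1, epsilon_final=0.01, epsilon_timesteps=1e6)
    ),
    fire=dict(
        type='EpsilonDecay',
        kwargs=dict(epsilon=1, epsilon_final=0.1, epsilon_timesteps=1e6)
    )
)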

Using the Agent with the Runner Utility Function

Let's put the agent to use. The following code runs an agent on our minimal test environment (https://github.com/reinforceio/tensorforce/blob/master/tensorforce/environments/minimal_test.py), which we use for continuous integration: a minimal environment that verifies the action, observation, and update mechanics of a given agent/model. Note that all of our environment implementations (OpenAI Gym, OpenAI Universe, DeepMind Lab) use the same interface, so it is straightforward to run the tests against another environment.

The Runner utility facilitates running an agent on an environment. Given any agent and environment instance, it manages the number of episodes, the maximum length of each episode, termination conditions, and so on. The Runner also accepts a cluster_spec argument, in which case it manages distributed execution (TensorFlow supervisors/sessions/etc.). Through the optional episode_finished argument, you can also report results periodically and stop execution before the maximum number of episodes is reached.

from tensorforce import Configuration
from tensorforce.agents import DQNAgent
from tensorforce.core.networks import layered_network_builder
from tensorforce.execution import Runner
# Module path as referenced by the URL above; exact imports may differ across versions.
from tensorforce.environments.minimal_test import MinimalTest

environment = MinimalTest(continuous=False)

network_config = [
    dict(type='dense', size=32)
]
agent_config = Configuration(
    batch_size=8,
    learning_rate=0.001,
    memory_capacity=800,
    first_update=80,
    repeat_update=4,
    target_update_frequency=20,
    states=environment.states,
    actions=environment.actions,
    network=layered_network_builder(network_config)
)

agent = DQNAgent(config=agent_config)
runner = Runner(agent=agent, environment=environment)

def episode_finished(runner):
    if runner.episode % 100 == 0:
        print(sum(runner.episode_rewards[-100:]) / 100)
    return runner.episode < 100 \
        or not all(reward >= 1.0 for reward in runner.episode_rewards[-100:])

runner.run(episodes=1000, episode_finished=episode_finished)

For completeness, we explicitly give the minimal loop for running an agent on an environment:

# Upper bounds referenced below (illustrative values).
max_episodes = 1000
max_timesteps = 1000

episode = 0
episode_rewards = list()

while True:
    state = environment.reset()
    agent.reset()

    timestep = 0
    episode_reward = 0
    while True:
        action = agent.act(state=state)
        state, reward, terminal = environment.execute(action=action)
        agent.observe(reward=reward, terminal=terminal)

        timestep += 1
        episode_reward += reward

        if terminal or timestep == max_timesteps:
            break

    episode += 1
    episode_rewards.append(episode_reward)

    if all(reward >= 1.0 for reward in episode_rewards[-100:]) \
            or episode == max_episodes:
        break

As mentioned in the introduction, whether the runner class is useful in a given application scenario depends on the flow of control. If the application is set up so that TensorForce can reasonably query state information (e.g. through a queue or network service) and return actions (to another queue or service), then that setup can be used to implement the environment interface and thus to use (or extend) the runner utility.
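
As a rough sketch of what such an environment adapter might look like (based only on the reset/execute interface used in the loop above; the queue-based transport and the class itself are hypothetical, and a real implementation would follow TensorForce's environment interface):

class QueueEnvironment(object):
    """Hypothetical adapter exposing an external system through the
    reset/execute interface used by the Runner and the loop above."""

    def __init__(self, state_queue, action_queue):
        self.state_queue = state_queue    # incoming (state, reward, terminal) tuples from the application
        self.action_queue = action_queue  # outgoing actions back to the application

    @property
    def states(self):
        return dict(shape=(10,), type='float')

    @property
    def actions(self):
        return dict(continuous=False, num_actions=5)

    def reset(self):
        # Block until the application provides an initial state.
        state, _, _ = self.state_queue.get()
        return state

    def execute(self, action):
        # Hand the decision back to the application and wait for the next transition.
        self.action_queue.put(action)
        state, reward, terminal = self.state_queue.get()
        return state, reward, terminal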

A more common situation may be that TensorForce is used as an externally driven library inside the application, so that no environment handle can be provided. For researchers this may seem like a minor point, but in fields such as computer systems it is a typical deployment problem, and it is the root cause of many research scripts being usable only in simulation rather than in practice.

Another point worth mentioning is that the declarative, central configuration object lets us directly expose all components of the reinforcement learning model, in particular the network architecture, to hyperparameter optimization.
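
For example, a crude grid search over learning rates and layer sizes only needs to construct different configuration objects (a sketch reusing the names from the Runner example above; real hyperparameter-optimization tooling would replace the nested loops):

for learning_rate in (1e-2, 1e-3, 1e-4):
    for layer_size in (32, 64):
        # Only the network architecture and learning rate vary; everything else is fixed.
        network_config = [dict(type='dense', size=layer_size)]
        agent_config = Configuration(
            batch_size=8,
            learning_rate=learning_rate,
            memory_capacity=800,
            first_update=80,
            repeat_update=4,
            target_update_frequency=20,
            states=environment.states,
            actions=environment.actions,
            network=layered_network_builder(network_config)
        )
        agent = DQNAgent(config=agent_config)
        runner = Runner(agent=agent, environment=environment)
        runner.run(episodes=1000, episode_finished=episode_finished)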

Further Thoughts

We hope you find TensorForce useful. So far, our focus has been on getting the architecture in place first, which we think will allow us to more consistently implement different RL concepts and new methods, and avoid the inconvenience of exploring deep RL use cases in new domains.

In such a rapidly evolving field, it can be difficult to decide what functionality to include in an actual library. There is a plethora of algorithms and concepts out there, and it seems like every week new ideas achieve better results on a subset of the Arcade Learning Environment (ALE) environments. There is a problem, however: many ideas only work in environments that are easily parallelizable or have a certain episode structure - we don't yet have a good idea of the properties of the environment and how they relate to different approaches. However, we can see some clear trends:

  • Hybrids of Policy Gradient and Q-learning methods for sample efficiency (PGQ, Q-Prop, etc.): This is a logical thing to do, and we think this will become the next "standard approach", although we don't know yet which hybrid strategy will prevail. We are very interested in understanding the usefulness of these methods in different application domains (data-rich/data-sparse). Our very subjective view is that most applied researchers tend to use variants of vanilla policy gradient because they are easy to understand, implement, and more importantly, more robust than new algorithms, which may require a lot of fine-tuning to deal with potential numerical instabilities. A different view is that non-RL researchers may just not know about the relevant new methods, or are unwilling to go through the trouble of implementing them. This is what motivated the development of TensorForce. Finally, it is worth considering that the update mechanism in the application domain is often not as important as modeling states, actions and rewards, and the network architecture.
  • Better use of GPUs and other devices that can be used for parallel/one-step/distributed methods (PAAC, GA3C, etc.): One problem with methods in this area is the implicit assumptions about the time it takes to collect data and update. In non-simulation domains, these assumptions may not hold, and understanding how environmental properties affect device execution semantics requires more research. We are still using feed_dicts, but are also considering improving the performance of input processing.
  • Exploration modes (e.g. count-based exploration, parameter space noise, etc.)
  • Large discrete action spaces, hierarchical models, and decomposition of subgoals. For example, Dulac-Arnold et al.'s Deep Reinforcement Learning in Large Discrete Action Spaces. Complex discrete spaces (e.g., many state-dependent sub-options) are highly relevant in application domains, but are currently difficult to consume through APIs. We expect a lot of work in the next few years.
  • Internal modules and novel model-based approaches for state prediction: for example, the paper The Predictron: End-To-End Learning and Planning.
  • Bayesian Deep Reinforcement Learning and Reasoning Under Uncertainty

In general, we are tracking these developments and will include existing techniques that we have missed so far (there are probably quite a few); and once we believe that a new idea has the potential to become a robust standard method, we will include it as well. In this sense, we are not explicitly competing with research frameworks, but rather aiming to provide a higher level of coverage.

This article is reproduced from Machine Heart; the original text comes from reinforce.io, written by Michael Schaarschmidt, Alexander Kuhnle, and Kai Fricke.
