Developer’s statement: This is how I understand reinforcement learning

Definition

Reinforcement learning is an important branch of machine learning that sits at the intersection of many disciplines and fields. Its essence is decision making: learning to make decisions automatically and continuously.

It mainly consists of four elements: an agent, the environment state, actions, and rewards. The goal of reinforcement learning is to obtain the largest possible cumulative reward.

Let's use a child learning to walk as an example:

The child wants to walk, but first he has to stand up and keep his balance. Then he has to step forward with one leg, left or right, and after one step he has to take the next.

The child is the agent. He tries to manipulate the environment (the walking surface) by taking actions (walking) and transitions from one state to another (with every step he takes). When he completes a subtask (walking a few steps), he is rewarded (given chocolate to eat); when he cannot walk, he gets no chocolate.

How reinforcement learning differs from supervised and unsupervised learning

In machine learning we are most familiar with supervised learning and unsupervised learning. Besides these, there is another major category: reinforcement learning.

The difference between reinforcement learning and supervised learning:

Supervised learning is like having a tutor guide you while you study: the tutor knows what is right and what is wrong. In many practical problems, however, such as chess and Go, the number of possible combinations is astronomically large, and no tutor can know every possible outcome.

Reinforcement learning instead tries actions without any labels, observes the result, and adjusts its earlier behavior based on feedback about whether that result was good or bad. By adjusting in this way over and over, the algorithm learns which behavior to choose in which situation to get the best result.

It's like training a puppy: every time it messes up the house you cut back its treats (punishment), and every time it behaves well you double them (reward). The puppy eventually learns that messing up the living room is bad behavior.

Both methods learn a mapping from input to output. In supervised learning the algorithm is told exactly which output corresponds to which input; in reinforcement learning the machine is given only a reward signal that judges whether its behavior was good or bad.

In addition, feedback in reinforcement learning is delayed: it may take many steps before you know whether an earlier choice was good or bad. In supervised learning, the algorithm gets feedback on each choice immediately.

Moreover, the inputs reinforcement learning faces keep changing: every action the algorithm takes affects the input to its next decision, whereas the inputs in supervised learning are independent and identically distributed.

Through reinforcement learning, an agent can trade off exploration against exploitation and choose actions so as to maximize its reward.

Exploration is trying lots of different things to see if they work better than what has been tried before.

Exploitation involves trying the behaviors that have worked best in the past.

Typical supervised learning algorithms do not consider this balance; they are purely exploitative.
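
As a minimal sketch of this trade-off (the bandit setting and the win rates below are made-up assumptions, not from the article), an epsilon-greedy agent mostly exploits the action that looks best so far but occasionally explores a random one:

```python
import random

# A minimal sketch of the exploration/exploitation trade-off on a
# 3-armed bandit. The win rates below are made up for illustration.
true_win_rates = [0.3, 0.5, 0.7]   # unknown to the agent
estimates = [0.0, 0.0, 0.0]        # the agent's running value estimates
counts = [0, 0, 0]
epsilon = 0.1                      # fraction of the time spent exploring

for step in range(10_000):
    if random.random() < epsilon:
        arm = random.randrange(3)                        # explore
    else:
        arm = max(range(3), key=lambda i: estimates[i])  # exploit
    reward = 1.0 if random.random() < true_win_rates[arm] else 0.0
    counts[arm] += 1
    # Incremental average: nudge the estimate toward the observed reward.
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(estimates)   # each estimate approaches the true win rate of its arm
```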

The difference between reinforcement learning and unsupervised learning:

Unsupervised learning learns patterns rather than a mapping from input to output. For example, in the task of recommending news articles to users, unsupervised learning would find articles similar to those the user has read before and recommend one of them. Reinforcement learning would instead recommend a small amount of news first, keep collecting feedback from the user, and gradually build up a "knowledge graph" of the articles the user is likely to enjoy.

Main algorithms and classifications

From the perspective of several elements of reinforcement learning, the methods can be divided into the following categories:

  • Policy based, the focus is on finding the optimal policy.

  • Value based, the focus is on finding the optimal sum of rewards.

  • Action based, the focus is on the optimal action at each step.

We can illustrate this with the well-known traveling salesman example.

We want to go from A to F. The number on each edge between two nodes represents the cost of that road, and we want to choose a path with the lowest possible total cost:

The major elements are:

  • states, which are the nodes {A, B, C, D, E, F}

  • actions, which are moves from one node to the next {A -> B, C -> D, etc.}

  • the reward function, which is the cost on each edge

  • the policy, which is the entire path taken to complete the task {A -> C -> F}

One way to proceed is this: at A you can choose among (B, C, D, E). If D looks best, you go to D. From D you can choose among (B, C, F). If F looks best, you go to F, and the task is complete.

This kind of decision making is a greedy strategy; the epsilon-greedy method used in many reinforcement learning algorithms works the same way, except that with a small probability it picks a random action instead so that it keeps exploring. Of course, the path found this way is not necessarily the optimal one, as the sketch below shows.
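
As a minimal sketch of this greedy way of choosing (the graph and the edge costs below are made-up assumptions), always following the cheapest next edge does not guarantee the cheapest overall path:

```python
# A sketch of greedy path selection. The edge costs are made up for illustration.
costs = {
    'A': {'B': 2, 'C': 4, 'D': 1},
    'B': {'F': 9},
    'C': {'F': 2},
    'D': {'F': 8},
}

node, path, total = 'A', ['A'], 0
while node != 'F':
    # Greedy: always follow the cheapest outgoing edge from the current node.
    node, cost = min(costs[node].items(), key=lambda kv: kv[1])
    path.append(node)
    total += cost

print(path, total)   # ['A', 'D', 'F'] with cost 9, while A -> C -> F costs only 6
```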

In addition, the classification can be made more detailed from different angles:

There are four such ways of classifying reinforcement learning methods, each corresponding to its own main algorithms:

Model-free: the agent does not try to understand the environment. It accepts whatever the environment gives it, waits for real-world feedback step by step, and takes the next action based on that feedback.

Model-based: the agent first learns what the real world is like and builds a model to simulate its feedback. It can then use this imagination to predict all the situations that might happen next, choose the best of these imagined situations, and take its next action accordingly. Compared with model-free methods, it has an extra virtual environment and this ability to imagine.

Policy-based: from its perception of the environment, the agent directly outputs the probability of each action it might take next and then acts according to those probabilities.

Value-based: the agent outputs a value for every action and selects the action with the highest value. This approach cannot handle continuous actions.

Monte-Carlo update: wait until an episode (a game) ends, then review everything that happened in that round and update the policy.

Temporal-difference update: the policy is updated at every step during the game, so there is no need to wait for the episode to end; the agent learns while playing.

On-policy: the agent must learn while playing itself; it learns only from the actions it actually takes.

Off-policy: the agent can learn from its own play, or it can watch others play and learn their behavior from the observation.

The main algorithms are briefly described below:

1. SARSA

Q is the action-value (action-utility) function, used to evaluate how good or bad it is to take a given action in a given state. It can be thought of as the agent's brain.

SARSA relies on the Markov property and uses only the information from the next step: the system explores under the guidance of its policy and updates the state-action value at every step of the exploration. The update formula is as follows:
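
In standard notation, the SARSA update is:

Q(s, a) ← Q(s, a) + α [ r + γ Q(s', a') − Q(s, a) ]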

Here s is the current state, a is the current action, s' is the next state, a' is the action taken in the next state, r is the reward the system receives, α is the learning rate, and γ is the discount (decay) factor.

2. Q-learning

The algorithmic framework of Q-learning is similar to SARSA's: the system explores under the guidance of its policy and updates the state-action value at every step of the exploration. The key difference is the update formula. The Q-learning update is as follows:
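
In standard notation:

Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]

The only change from SARSA is the max over the next action: Q-learning bootstraps from the best action available in the next state rather than from the action the policy actually takes, which is what makes it off-policy.

A minimal tabular sketch of the two updates (the action set, hyperparameters, and epsilon-greedy helper are illustrative assumptions, not from the article):

```python
import random
from collections import defaultdict

# Minimal tabular SARSA vs. Q-learning updates. The action set and
# hyperparameters below are illustrative assumptions.
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
ACTIONS = [0, 1, 2, 3]                 # hypothetical discrete action set
Q = defaultdict(float)                 # Q[(state, action)] -> estimated value

def epsilon_greedy(state):
    """Pick a random action with probability epsilon, else the best known one."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def sarsa_update(s, a, r, s_next, a_next):
    # On-policy: bootstraps from the action a_next the policy actually takes.
    target = r + GAMMA * Q[(s_next, a_next)]
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

def q_learning_update(s, a, r, s_next):
    # Off-policy: bootstraps from the best action in s_next, whatever is taken next.
    target = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])
```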

3. Policy Gradients

The system starts from a fixed or random initial state. The policy gradient method lets the system explore the environment and generate a state-action-reward sequence from the initial state to the terminal state: s_1, a_1, r_1, ..., s_T, a_T, r_T. At time step t, we take g_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... as an estimate of q(s_t, a_t), and use it to solve the policy gradient optimization problem.
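
In the usual REINFORCE form (stated here as a sketch of that optimization, not necessarily the exact variant meant here), the policy parameters θ are then pushed along the gradient

∇_θ J(θ) ≈ Σ_t g_t ∇_θ log π_θ(a_t | s_t)

so that actions which led to high returns g_t become more probable.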

4. Actor-Critic

The algorithm is divided into two parts: the Actor and the Critic. The Actor updates the policy, and the Critic updates the value estimate. The Critic can use the SARSA or Q-learning algorithm introduced earlier.
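
One common concrete form (a sketch in standard notation, not necessarily the exact variant meant here): the Critic computes a TD error δ = r + γ V(s') − V(s) and uses it to improve its value estimate, while the Actor updates its policy parameters θ in the direction δ ∇_θ log π_θ(a | s), making actions with positive δ more likely.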

5. Monte-Carlo learning

Explore a complete state-action-reward sequence using the current policy:

s_1, a_1, r_1, ..., s_k, a_k, r_k ~ π

When a state s is encountered for the first time (or on every occurrence, depending on the variant) in the sequence, its discounted return is calculated:
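
In the usual notation, the discounted return from time t is:

G_t = r_{t+1} + γ r_{t+2} + ... + γ^{T−1} r_T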

Finally, the state value is updated:
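
A standard incremental form of this update (stated here as a sketch) is:

N(s) ← N(s) + 1

V(s) ← V(s) + (1/N(s)) (G_t − V(s))

so that V(s) tracks the running mean of the returns observed from s.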

6. Deep Q-Network (DQN)

The main idea of the DQN algorithm is experience replay: the data the system gathers while exploring the environment is stored and later sampled at random to update the parameters of a deep neural network. Like Q-learning, it seeks the maximum reward for each action and environment state, but it adds several improvements, including experience replay and the dueling network architecture.
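
A minimal sketch of an experience replay buffer (the transition format, capacity, and the surrounding training loop are illustrative assumptions; the network update itself is only indicated in comments):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions and hands back random minibatches for training."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive steps.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# Typical training-loop skeleton (environment and Q-network are assumed):
#     a = epsilon_greedy(q_network, s)
#     s_next, r, done = env.step(a)
#     buffer.push(s, a, r, s_next, done)
#     if len(buffer) >= batch_size:
#         batch = buffer.sample(batch_size)
#         # build targets r + gamma * max_a' Q_target(s', a') and take a
#         # gradient step on the Q-network
```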

Application Examples

Reinforcement learning has many applications. In addition to self-driving cars, AlphaGo, and game playing, there are practical examples in areas such as the following:

1. Manufacturing

For example, the Japanese company Fanuc has a factory robot that records video of itself as it picks up objects. It remembers each action and whether the operation succeeded or failed, accumulating experience so that it can act faster and more accurately the next time.

2. Inventory Management

Inventory management is difficult because of obstacles such as large inventory volumes, big fluctuations in demand, and slow replenishment. By building a reinforcement learning algorithm, we can reduce inventory turnover time and improve space utilization.

3. Dynamic pricing

Q-learning in reinforcement learning can be used to deal with dynamic pricing problems.

4. Customer Delivery

When transporting goods to customers, manufacturers want to reduce the total cost of their fleet while meeting every customer's demand. With a multi-agent system and Q-learning, both delivery time and the number of vehicles required can be reduced.

5. E-commerce Personalization

In e-commerce, reinforcement learning algorithms can also be used to learn and analyze customer behavior and customize products and services to meet the personalized needs of customers.

6. Ad Serving

For example, LinUCB (a bandit-style reinforcement learning algorithm) will try serving a wider range of ads, even ones that have rarely been viewed in the past, and can therefore produce a better estimate of the actual click-through rate.

For example, in the Double 11 recommendation scenario, Alibaba used deep reinforcement learning and adaptive online learning. Through continuous machine learning and model optimization it built a decision engine that analyzes massive user behavior and tens of billions of product features in real time, helping every user discover products quickly and improving the efficiency of matching people with products. Reinforcement learning was also used to raise the click-through rate of mobile users by 10-20%.

7. Financial Investment Decisions

For example, the company Pit.ai applies reinforcement learning to evaluate trading strategies, helping users build strategies and reach their investment goals.

8. Medical Industry

Dynamic treatment regimes (DTR) are a topic in medical research aimed at finding effective treatments for patients. For example, in cancer treatment, which requires long-term medication, reinforcement learning algorithms can take a patient's various clinical indicators as input and produce a treatment strategy.

Study Materials

The above briefly introduces the concept, differences, and main algorithms of reinforcement learning. Here are some learning resources for reference:

  • Udacity courses: Machine Learning: Reinforcement Learning and Reinforcement Learning;

  • Classic textbook: Sutton & Barto, Reinforcement Learning: An Introduction (cited more than 20,000 times)

    http://t.cn/Raif2sl

  • A classic introductory course assignment developed by UC Berkeley - Programming to play the "Pac-Man" game: Berkeley Pac-Man Project (CS188 Intro to AI)

  • Stanford developed an introductory course assignment - Simplified version of self-driving car driving: Car Tracking (CS221 AI: Principles and Techniques)

  • CS 294: Deep Reinforcement Learning, Fall 2015 (UC Berkeley).

  • David Silver Reinforcement Learning:

    http://t.cn/Rw0rwtU

References

http://www.jianshu.com/p/14625de78455

http://www.jianshu.com/p/2100cc577a46

https://www.marutitech.com/businesses-reinforcement-learning/

https://www.analyticsvidhya.com/blog/2017/01/introduction-to-reinforcement-learning-implementation/

https://morvanzhou.github.io/tutorials/machine-learning/ML-intro/4-02-RL-methods/

https://www.zhihu.com/question/41775291

http://www.algorithmdog.com/reinforcement-learning-model-free-learning

This article is reproduced from Leifeng.com. If you need to reprint, please go to Leifeng.com official website to apply for authorization. This article is written by Yang Xi, and the original text comes from his personal blog.
