Even though it is the weekend, I am still recharging my batteries. Today I will take a look at reinforcement learning. I am not going to use it to play games, though; I think it has great applications in manufacturing, inventory management, e-commerce, advertising, recommendation, finance, medical care, and other fields closely related to our lives, so of course I have to learn about it. This article is organized as follows:

1. Definition
2. Differences from supervised and unsupervised learning
3. Main algorithms and classifications
4. Application examples
1. Definition

Reinforcement learning is an important branch of machine learning, born at the intersection of multiple disciplines and fields. Its essence is solving decision-making problems, that is, making decisions automatically and continuously. It consists of four main elements: the agent, the environment state, actions, and rewards, and its goal is to obtain the largest possible cumulative reward.

Let's use a child learning to walk as an example. The child wants to walk, but before that he needs to stand up and keep his balance; then he has to put one leg out, either the left or the right, and after taking one step, take the next. The child is the agent. He tries to manipulate the environment (the surface he walks on) by taking actions (walking), and he transitions from one state to another (with every step he takes). When he completes a subtask of the overall task (walking a few steps), he receives a reward (chocolate to eat); when he cannot walk, no chocolate is given. A minimal sketch of this interaction loop follows.
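To make the four elements concrete, here is a minimal Python sketch of the agent-environment loop. Everything in it (the action names, the balance probability, the reward values) is made up for illustration; it shows the shape of the loop, not a real training algorithm.

```python
import random

# A minimal sketch of the agent-environment loop for the child-walking
# story above. The action names, balance probability, and rewards are
# all made up for illustration.

def take_action(state):
    # The agent chooses an action; here it simply guesses at random.
    return random.choice(["step_left", "step_right"])

def environment_step(state, action):
    # The environment returns the next state and a reward:
    # +1 ("a piece of chocolate") if the child keeps his balance, else 0.
    balanced = random.random() < 0.5
    next_state = state + 1 if balanced else state
    reward = 1 if balanced else 0
    return next_state, reward

state, total_reward = 0, 0          # state counts steps walked so far
for t in range(10):                 # one short episode
    action = take_action(state)     # the agent acts on the environment
    state, reward = environment_step(state, action)
    total_reward += reward          # the goal: maximize cumulative reward
print("steps walked:", state, "chocolate earned:", total_reward)
```

The point is the cycle: the agent observes a state, acts, and the environment answers with a new state and a reward whose cumulative sum the agent tries to maximize.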
2. Differences from supervised and unsupervised learning

In machine learning we are most familiar with supervised and unsupervised learning, but there is a third major category: reinforcement learning.

The difference between reinforcement learning and supervised learning: supervised learning is like studying with a tutor who knows what is right and what is wrong. In many practical problems, however, such as chess and Go, the number of possible combinations is enormous, and no tutor can know every possible outcome. Reinforcement learning instead tries actions without any labels, obtains a result, and adjusts its previous behavior according to feedback on whether that result was right or wrong. By constantly adjusting in this way, the algorithm learns which behavior to choose in which circumstances to get the best result. It is like training a puppy: every time it messes up the house you cut back its treats (punishment), and every time it behaves well you double them (reward). The puppy eventually learns that messing up the living room is bad behavior.

Both methods learn a mapping from input to output, but supervised learning learns the relationship between inputs and outputs, telling the algorithm which input corresponds to which output, while reinforcement learning learns from a reward signal given to the machine, used to judge whether a behavior is good or bad. In addition, the feedback in reinforcement learning is delayed: it may take many steps before you know whether an earlier choice was good or bad, whereas a supervised learner is given feedback immediately. Moreover, the inputs a reinforcement learner faces keep changing: every action the algorithm takes affects the input to its next decision, while the inputs of supervised learning are independent and identically distributed. Finally, through reinforcement learning an agent can trade off exploration against exploitation and choose whichever maximizes the reward, whereas typical supervised learning algorithms never consider this balance and are purely exploitative. (A common mechanism for this trade-off, ε-greedy action selection, is sketched at the end of this section.)

The difference between reinforcement learning and unsupervised learning: unsupervised learning learns not a mapping from input to output but the patterns in the data. For example, in the task of recommending news articles to users, unsupervised learning finds articles similar to those the user has read before and recommends one of them, while reinforcement learning first recommends a small amount of news, continuously collects feedback from the user, and gradually builds up a "knowledge graph" of the articles the user is likely to enjoy.
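Here is the ε-greedy rule mentioned above, the simplest common way to balance exploration and exploitation. It is not described in the article itself; the value estimates and ε below are made-up inputs for illustration.

```python
import random

# A sketch of the exploration/exploitation trade-off using the common
# epsilon-greedy rule. The value estimates and epsilon are illustrative.

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon, explore a random action; otherwise
    exploit the action with the highest current value estimate."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploit

# Three actions whose current value estimates are 0.2, 0.5, and 0.1:
# most of the time this picks action 1, but it sometimes tries the others.
print(epsilon_greedy([0.2, 0.5, 0.1]))
```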
3. Main algorithms and classifications

Viewed through the elements of reinforcement learning described above, the methods can be divided into several categories. Let's use the well-known traveling-salesman-style example: we want to go from A to F, the number on each road between two points represents the cost of that road, and we want to choose the path with the lowest possible total cost. The major elements map onto the problem like this: the agent is the traveler, the states are the points A through F, the actions are moves along a road, and the reward is the negative of the road's cost.
One way to walk the graph is this: at A you can choose among (B, C, D, E); if D turns out to be the best choice, you go to D. From D you can choose among (B, C, F); if F is the best, you go to F and the task is complete. A small tabular Q-learning sketch of this path-finding problem follows.
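Below is a minimal tabular Q-learning sketch of the A-to-F problem. The graph and its road costs are invented for illustration (the article's original figure with the real costs is not reproduced here); the reward is the negative road cost, so maximizing reward minimizes total cost.

```python
import random

# Hypothetical road costs; the article's original figure is not shown.
costs = {
    "A": {"B": 7, "C": 6, "D": 3, "E": 8},
    "B": {"F": 5},
    "C": {"F": 4},
    "D": {"B": 2, "C": 4, "F": 2},
    "E": {"F": 2},
}
Q = {(s, a): 0.0 for s in costs for a in costs[s]}
alpha, gamma, epsilon = 0.5, 1.0, 0.2   # learning rate, discount, exploration

for episode in range(500):
    s = "A"
    while s != "F":
        moves = list(costs[s])
        if random.random() < epsilon:                 # explore
            a = random.choice(moves)
        else:                                         # exploit
            a = max(moves, key=lambda m: Q[(s, m)])
        r = -costs[s][a]                              # reward = negative road cost
        future = 0.0 if a == "F" else max(Q[(a, m)] for m in costs[a])
        Q[(s, a)] += alpha * (r + gamma * future - Q[(s, a)])
        s = a                                         # the chosen point becomes the new state

# Read off the greedy, i.e. cheapest, path learned by the table.
s, path = "A", ["A"]
while s != "F":
    s = max(costs[s], key=lambda m: Q[(s, m)])
    path.append(s)
print(path)   # with these made-up costs: ['A', 'D', 'F']
```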
Beyond this, the classification can be made more detailed from different angles, and each way of classifying the methods corresponds to certain main algorithms. The main ones are briefly described below.

1. SARSA

Q is the action-utility (action-value) function, used to evaluate how good it is to take a particular action in a particular state; it can be understood as the brain of the agent. SARSA exploits the Markov property and uses only the information from the next step, letting the system explore under the guidance of its policy and update the state value at every step of the exploration. The update formula is:

Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') − Q(s, a)]

where s is the current state, a is the current action, s' is the next state, a' is the action taken in the next state, r is the reward the system receives, α is the learning rate, and γ is the decay (discount) factor.

2. Q-Learning

The algorithmic framework of Q-Learning is similar to SARSA's: it also lets the system explore under the guidance of its policy and update the state value at every step of the exploration. The key difference lies in the update formula, which for Q-Learning is:

Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]

In other words, SARSA bootstraps from the action the policy actually takes next, while Q-Learning bootstraps from the best action available in the next state; a sketch contrasting the two updates follows.
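Here is a minimal sketch contrasting the two update rules. Q is assumed to be a dict mapping (state, action) pairs to estimated values; the argument names follow the formulas above.

```python
# A minimal sketch of the two updates; Q maps (state, action) to values.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy: bootstraps from the action a' the policy actually
    # takes in the next state s'.
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, next_actions, alpha, gamma):
    # Off-policy: bootstraps from the best action available in s',
    # regardless of which action is actually taken next.
    best = max(Q[(s_next, a2)] for a2 in next_actions)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
```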
3. Policy Gradients

The system starts from a fixed or random starting state. The policy-gradient method lets the system explore the environment and generate a state-action-reward sequence from the start state to the terminal state: s1, a1, r1, ..., sT, aT, rT. At time step t, we take the discounted return

g_t = r_t + γ r_{t+1} + γ² r_{t+2} + ...

as the estimate of q(s_t, a_t), and thereby solve the policy-gradient optimization problem.

4. Actor-Critic

The algorithm is divided into two parts: the Actor and the Critic. The Actor updates the policy and the Critic updates the value estimate; the Critic can use the SARSA or Q-Learning algorithm introduced above.

5. Monte Carlo learning

Use the current policy to explore and generate a complete state-action-reward sequence. For the first (or for every) occurrence of a state s in the sequence, compute its discounted return

G_t = r_t + γ r_{t+1} + γ² r_{t+2} + ...

and finally update the state value:

V(s) ← V(s) + α (G_t − V(s))

6. Deep Q-Network (DQN)

The main idea of the DQN algorithm is experience replay: the data the system gathers while exploring the environment is stored, and random samples are later drawn from it to update the parameters of a deep neural network. Like Q-Learning, it pursues the maximum reward for each action and environment state, but it adds several improvements, including experience replay and the dueling network architecture. A sketch of the replay buffer follows.
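Here is a minimal sketch of the experience-replay idea just described. The capacity and batch size are illustrative; in a real DQN the sampled batch would be fed into the deep network's parameter update.

```python
import random
from collections import deque

# A minimal replay buffer; capacity and batch size are illustrative.

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest experience is dropped first

    def store(self, state, action, reward, next_state, done):
        # Each exploration step is stored as one transition.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Random sampling breaks the correlation between consecutive
        # steps, which stabilizes the network updates.
        return random.sample(self.buffer, batch_size)
```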
4. Application Examples

Reinforcement learning has many applications. Beyond self-driving cars, AlphaGo, and game playing, there are practical examples in the following areas:

1. Manufacturing: The Japanese company Fanuc has a factory robot that captures video of the process every time it picks up an object, remembering each action and whether the operation succeeded or failed, accumulating experience so that it can act faster and more accurately next time.

2. Inventory management: Inventory management is difficult because of large inventory volumes, large fluctuations in demand, and slow replenishment. Reinforcement learning algorithms can reduce inventory turnover time and improve space utilization.

3. Dynamic pricing: Q-learning can be used to handle dynamic pricing problems.

4. Customer delivery: When transporting goods to customers, manufacturers want to reduce the total cost of their fleet while meeting all customer demand. With a multi-agent system and Q-learning, both delivery time and the number of vehicles can be reduced.

5. E-commerce personalization: In e-commerce, reinforcement learning algorithms can learn from and analyze customer behavior, tailoring products and services to customers' individual needs.

6. Ad serving: For example, the algorithm LinUCB (a bandit-style reinforcement learning algorithm) tries serving a wider range of ads, even ones that have not been viewed much in the past, and can thereby better estimate their actual click-through rates. In the Double 11 recommendation scenario, Alibaba used deep reinforcement learning and adaptive online learning: through continuous learning and model optimization it built a decision engine that analyzes massive user behavior and tens of billions of product features in real time, helping every user quickly discover products and improving the efficiency of matching people with products. Reinforcement learning also raised the click-through rate for mobile users by 10-20%.

7. Financial investment decisions: The company Pit.ai applies reinforcement learning to evaluate trading strategies, helping users build strategies and achieve their investment goals.

8. Medical industry: A dynamic treatment regime (DTR) is a topic in medical research that aims to find effective treatments for patients. For cancer treatment, for example, which requires long-term medication, reinforcement learning algorithms can take a patient's clinical indicators as input and work out a treatment strategy.

The above briefly introduced the concept of reinforcement learning, how it differs from supervised and unsupervised learning, its main algorithms, and some of its applications. Here are some learning resources for reference:

Reference articles:
- TensorFlow-11 - Policy Network: using TensorFlow to build a policy-network-based agent to solve the CartPole problem.
- What is reinforcement learning: a simple illustration of DQN.
- https://www.marutitech.com/businesses-reinforcement-learning/