Dear friends, after artificial intelligence (AI) conquered chess, Go, and Dota, pen spinning is now a skill that AI robots have learned too. The robot above, which spins a pen remarkably smoothly, owes that dexterity to an agent called Eureka, a research project from NVIDIA, the University of Pennsylvania, the California Institute of Technology, and the University of Texas at Austin.

With Eureka's "guidance", robots can also open drawers and cabinets, throw and catch balls, and use scissors. According to NVIDIA, Eureka has trained 10 different types of robots to perform 29 different tasks. Bear in mind that, before this, pen spinning alone could not be achieved so smoothly through manual programming by human experts.

[Figure: a robot hand rotating walnuts in its palm]

Eureka independently writes reward algorithms to train robots, and its coding ability is strong: its self-written reward programs surpass those of human experts on 83% of tasks, and they improve robot performance by an average of 52%.

Eureka also pioneers a new, gradient-free way of learning from human feedback: it can readily absorb reward results and textual feedback provided by humans to further improve its own reward-generation mechanism.

Specifically, Eureka leverages OpenAI's GPT-4 to write reward programs for the robot's trial-and-error learning, which means the system relies on neither task-specific prompts nor preset reward patterns. Eureka uses GPU-accelerated simulation in Isaac Gym to quickly evaluate the quality of a large batch of candidate rewards, which makes training far more efficient. Eureka then compiles a summary of the key statistics from the training results and uses it to guide the LLM (large language model) toward better reward functions. In this way, the AI agent independently improves the instructions it gives to the robot.

[Figure: the Eureka framework]

The researchers also found that the more complex the task, the more GPT-4's reward code outperformed the instructions written by human "reward engineers." The researchers involved in the study even called Eureka a "superhuman reward engineer."

Eureka successfully bridges the gap between high-level reasoning (coding) and low-level motor control. It uses a so-called "hybrid-gradient architecture": a black-box, inference-only LLM guides a learnable neural network. In this architecture, the outer loop runs GPT-4 to optimize the reward function (gradient-free), while the inner loop runs reinforcement learning to train the robot's controller (gradient-based).

- Linxi "Jim" Fan, senior research scientist at NVIDIA

Eureka can also incorporate human feedback to adjust its rewards so they better match developer expectations. NVIDIA calls this process "in-context RLHF" (in-context reinforcement learning from human feedback).

It is worth noting that NVIDIA's research team has open-sourced Eureka's AI algorithm library, so individuals and institutions can explore and experiment with these algorithms through NVIDIA Isaac Gym. Isaac Gym is built on the NVIDIA Omniverse platform, a development framework based on OpenUSD for creating 3D tools and applications.
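To make the hybrid-gradient design concrete, here is a minimal Python sketch of an Eureka-style loop. It is an illustration under stated assumptions, not NVIDIA's implementation: `query_llm` and `train_policy_with_reward` are hypothetical stand-ins for a GPT-4 API call and a GPU-accelerated Isaac Gym PPO run, with stub bodies that return dummy values so the sketch runs.

```python
import random
from dataclasses import dataclass

@dataclass
class TrainStats:
    fitness: float  # task metric reached by the trained policy
    summary: str    # text summary of key training statistics

# --- Hypothetical stand-ins (assumptions for this sketch, not a real API) ---
def query_llm(prompt: str) -> str:
    """Placeholder for a GPT-4 call that returns reward-function source code."""
    return f"def reward(obs, action): ...  # candidate {random.random():.3f}"

def train_policy_with_reward(reward_code: str) -> TrainStats:
    """Placeholder for one gradient-based RL (PPO) run in simulation."""
    return TrainStats(fitness=random.random(), summary="success_rate=...")
# -----------------------------------------------------------------------------

def eureka_loop(env_source: str, task: str, iters: int = 5, k: int = 16) -> str:
    """Outer loop: gradient-free LLM search over reward code.
    Inner loop (inside train_policy_with_reward): gradient-based RL."""
    best_code, best_fit, feedback = "", float("-inf"), ""
    for _ in range(iters):
        # 1. Zero-shot generation: the raw environment source code is the
        #    context, so no task-specific prompt template is required.
        prompt = (f"Environment code:\n{env_source}\nTask: {task}\n"
                  f"Feedback on the previous batch:\n{feedback}\n"
                  "Write an executable Python reward function.")
        candidates = [query_llm(prompt) for _ in range(k)]

        # 2. Evaluate every candidate by training a policy against it,
        #    which GPU-accelerated simulation makes cheap to run at scale.
        results = [train_policy_with_reward(c) for c in candidates]

        # 3. Evolutionary selection: keep the best reward found so far.
        for code, stats in zip(candidates, results):
            if stats.fitness > best_fit:
                best_code, best_fit = code, stats.fitness

        # 4. Reward reflection: feed a text summary of the training statistics
        #    back to the LLM so the next batch can improve on this one. Human
        #    textual feedback can be appended here ("in-context RLHF").
        feedback = max(results, key=lambda s: s.fitness).summary

    return best_code

print(eureka_loop("class PenSpinEnv: ...", "spin the pen between the fingers"))
```

Note the division of labor: no gradients ever flow through the LLM, which is why a black-box model such as GPT-4 can sit in the outer loop, while all gradient-based learning stays inside the inner reinforcement-learning loop.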
Paper link: https://arxiv.org/pdf/2310.12931.pdf
Project link: https://eureka-research.github.io/
Code link: https://github.com/eureka-research/Eureka

How should we evaluate Eureka? Reinforcement learning has achieved great success over the past decade, but we must acknowledge that persistent challenges remain, and reward design is one of them.

Although there had been earlier attempts to introduce similar techniques, Eureka stands out from L2R (Language to Rewards), which also uses a large language model (LLM) to assist reward design, because it eliminates the need for task-specific prompt templates: Eureka can generate free-form, expressive reward programs and uses the environment's source code as its background context.

The NVIDIA research team also investigated whether starting from a human reward function offers any advantage. The goal of the experiment was to see whether the original human reward function could successfully be substituted for the output of a first Eureka iteration.

In these tests, the team optimized all final reward functions, per task, with the same reinforcement learning algorithm and the same hyperparameters. To ensure these task-specific hyperparameters were well tuned for the manually designed rewards, they used a fully tuned proximal policy optimization (PPO) implementation from prior work without any modification. For each reward, the researchers ran five independent PPO trainings and reported, as the measure of reward performance, the average across runs of the maximum task-metric value achieved by any policy checkpoint (a short sketch of this statistic appears at the end of this article).

The results show that human designers generally have a good grasp of the relevant state variables but may lack proficiency at composing them into effective rewards.

This groundbreaking research by NVIDIA opens new frontiers in reinforcement learning and reward design. Its general reward-design algorithm, Eureka, leverages the power of large language models and in-context evolutionary search to generate human-level rewards across a wide range of robotics tasks, without task-specific prompts or human intervention, and it has meaningfully changed our understanding of AI and machine learning.
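As promised above, here is a small Python sketch of the reported evaluation statistic. It is one reading of the protocol described in this article, with `run_ppo` as a hypothetical stand-in for one full PPO training run that returns the task-metric value measured at each policy checkpoint.

```python
import random

def run_ppo(reward_code: str, seed: int) -> list[float]:
    """Hypothetical stand-in for one independent PPO training run: returns
    the task metric measured at each policy checkpoint during training."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(10)]  # dummy checkpoint metrics

def reward_performance(reward_code: str, n_runs: int = 5) -> float:
    """Average, over independent runs, of each run's best checkpoint metric."""
    best_per_run = [max(run_ppo(reward_code, seed)) for seed in range(n_runs)]
    return sum(best_per_run) / n_runs

print(f"reward performance: {reward_performance('def reward(...): ...'):.3f}")
```

Taking the maximum per run credits a reward for ever reaching good performance during training, while averaging across five independent runs controls for seed-to-seed variance.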