Dear friends, after artificial intelligence (AI) conquered chess, Go, and Dota, pen spinning is now a skill that AI robots have learned too. The robot above, which spins a pen remarkably smoothly, owes that dexterity to an agent called Eureka, a research project from NVIDIA, the University of Pennsylvania, the California Institute of Technology, and the University of Texas at Austin.

With Eureka's "guidance", robots can also open drawers and cabinets, throw and catch balls, and use scissors. According to NVIDIA, Eureka has trained 10 different types of robots to perform 29 different tasks. Bear in mind that, before this, pen spinning alone could not be achieved so smoothly through manual programming by human experts.

[Figure: a robot hand rotating walnuts in its palm]

Eureka independently writes reward algorithms to train robots, and its coding ability is strong: its self-written reward programs surpass those of human experts on 83% of tasks, and they improve robot performance by an average of 52%.

Eureka also pioneers a new, gradient-free way of learning from human feedback: it can readily absorb reward results and textual feedback provided by humans to further improve its own reward-generation mechanism.

Specifically, Eureka leverages OpenAI's GPT-4 to write reward programs for the robot's trial-and-error learning, which means the system relies on neither task-specific prompts nor preset reward patterns. Eureka uses GPU-accelerated simulation in Isaac Gym to quickly evaluate the quality of a large batch of candidate rewards, which makes training far more efficient. Eureka then compiles a summary of the key statistics from the training results and uses it to guide the LLM (large language model) toward better reward functions. In this way, the AI agent independently improves the instructions it gives to the robot.

[Figure: the Eureka framework]

The researchers also found that the more complex the task, the more GPT-4's reward code outperformed the instructions written by human "reward engineers." The researchers involved in the study even called Eureka a "superhuman reward engineer."

Eureka successfully bridges the gap between high-level reasoning (coding) and low-level motor control. It uses a so-called "hybrid-gradient architecture": a black-box, inference-only LLM guides a learnable neural network. In this architecture, the outer loop runs GPT-4 to optimize the reward function (gradient-free), while the inner loop runs reinforcement learning to train the robot's controller (gradient-based).

- Linxi "Jim" Fan, senior research scientist at NVIDIA

Eureka can also incorporate human feedback to adjust its rewards so they better match developer expectations. NVIDIA calls this process "in-context RLHF" (in-context reinforcement learning from human feedback).

It is worth noting that NVIDIA's research team has open-sourced Eureka's AI algorithm library, so individuals and institutions can explore and experiment with these algorithms through NVIDIA Isaac Gym. Isaac Gym is built on the NVIDIA Omniverse platform, a development framework based on OpenUSD for creating 3D tools and applications.
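To make the hybrid-gradient design concrete, here is a minimal Python sketch of an Eureka-style loop. It is an illustration under stated assumptions, not NVIDIA's implementation: `query_llm` and `train_policy_with_reward` are hypothetical stand-ins for a GPT-4 API call and a GPU-accelerated Isaac Gym PPO run, with stub bodies that return dummy values so the sketch runs.

```python
import random
from dataclasses import dataclass

@dataclass
class TrainStats:
    fitness: float  # task metric reached by the trained policy
    summary: str    # text summary of key training statistics

# --- Hypothetical stand-ins (assumptions for this sketch, not a real API) ---
def query_llm(prompt: str) -> str:
    """Placeholder for a GPT-4 call that returns reward-function source code."""
    return f"def reward(obs, action): ...  # candidate {random.random():.3f}"

def train_policy_with_reward(reward_code: str) -> TrainStats:
    """Placeholder for one gradient-based RL (PPO) run in simulation."""
    return TrainStats(fitness=random.random(), summary="success_rate=...")
# -----------------------------------------------------------------------------

def eureka_loop(env_source: str, task: str, iters: int = 5, k: int = 16) -> str:
    """Outer loop: gradient-free LLM search over reward code.
    Inner loop (inside train_policy_with_reward): gradient-based RL."""
    best_code, best_fit, feedback = "", float("-inf"), ""
    for _ in range(iters):
        # 1. Zero-shot generation: the raw environment source code is the
        #    context, so no task-specific prompt template is required.
        prompt = (f"Environment code:\n{env_source}\nTask: {task}\n"
                  f"Feedback on the previous batch:\n{feedback}\n"
                  "Write an executable Python reward function.")
        candidates = [query_llm(prompt) for _ in range(k)]

        # 2. Evaluate every candidate by training a policy against it,
        #    which GPU-accelerated simulation makes cheap to run at scale.
        results = [train_policy_with_reward(c) for c in candidates]

        # 3. Evolutionary selection: keep the best reward found so far.
        for code, stats in zip(candidates, results):
            if stats.fitness > best_fit:
                best_code, best_fit = code, stats.fitness

        # 4. Reward reflection: feed a text summary of the training statistics
        #    back to the LLM so the next batch can improve on this one. Human
        #    textual feedback can be appended here ("in-context RLHF").
        feedback = max(results, key=lambda s: s.fitness).summary

    return best_code

print(eureka_loop("class PenSpinEnv: ...", "spin the pen between the fingers"))
```

Note the division of labor: no gradients ever flow through the LLM, which is why a black-box model such as GPT-4 can sit in the outer loop, while all gradient-based learning stays inside the inner reinforcement-learning loop.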
Paper link: https://arxiv.org/pdf/2310.12931.pdf
Project link: https://eureka-research.github.io/
Code link: https://github.com/eureka-research/Eureka

How should we evaluate Eureka? Reinforcement learning has achieved great success over the past decade, but we must acknowledge that persistent challenges remain, and reward design is one of them.

Although there had been earlier attempts to introduce similar techniques, Eureka stands out from L2R (Language to Rewards), which also uses a large language model (LLM) to assist reward design, because it eliminates the need for task-specific prompt templates: Eureka can generate free-form, expressive reward programs and uses the environment's source code as its background context.

The NVIDIA research team also investigated whether starting from a human reward function offers any advantage. The goal of the experiment was to see whether the original human reward function could successfully be substituted for the output of a first Eureka iteration.

In these tests, the team optimized all final reward functions, per task, with the same reinforcement learning algorithm and the same hyperparameters. To ensure these task-specific hyperparameters were well tuned for the manually designed rewards, they used a fully tuned proximal policy optimization (PPO) implementation from prior work without any modification. For each reward, the researchers ran five independent PPO trainings and reported, as the measure of reward performance, the average across runs of the maximum task-metric value achieved by any policy checkpoint (a short sketch of this statistic appears at the end of this article).

The results show that human designers generally have a good grasp of the relevant state variables but may lack proficiency at composing them into effective rewards.

This groundbreaking research by NVIDIA opens new frontiers in reinforcement learning and reward design. Its general reward-design algorithm, Eureka, leverages the power of large language models and in-context evolutionary search to generate human-level rewards across a wide range of robotics tasks, without task-specific prompts or human intervention, and it has meaningfully changed our understanding of AI and machine learning.
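As promised above, here is a small Python sketch of the reported evaluation statistic. It is one reading of the protocol described in this article, with `run_ppo` as a hypothetical stand-in for one full PPO training run that returns the task-metric value measured at each policy checkpoint.

```python
import random

def run_ppo(reward_code: str, seed: int) -> list[float]:
    """Hypothetical stand-in for one independent PPO training run: returns
    the task metric measured at each policy checkpoint during training."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(10)]  # dummy checkpoint metrics

def reward_performance(reward_code: str, n_runs: int = 5) -> float:
    """Average, over independent runs, of each run's best checkpoint metric."""
    best_per_run = [max(run_ppo(reward_code, seed)) for seed in range(n_runs)]
    return sum(best_per_run) / n_runs

print(f"reward performance: {reward_performance('def reward(...): ...'):.3f}")
```

Taking the maximum per run credits a reward for ever reaching good performance during training, while averaging across five independent runs controls for seed-to-seed variance.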