[Editor's Note] Scientists and entrepreneurs have been training artificial intelligence (AI) systems to work the way humans think, hoping that machines can learn like humans and solve real-world problems for them. In the view of Will Whitney, a research scientist at Google DeepMind, however, treating the model as a person greatly limits how we think about interacting with it and keeps us from exploring the full potential of large models. In an article titled "Computing inside an AI", Whitney analyzed the shortcomings of "model-as-person" and proposed the concept of "model-as-computer".

He compared the two using the example of building a website. Under the "model-as-person" metaphor, the large model is treated as a contractor or collaborator that keeps "revising" the site's style, content, and functionality in response to a long list of increasingly picky requests, so the cost of communication is high. Under "model-as-computer", the interaction is different: the large model does not build the website directly, but provides a "generative user interface" that is generated on demand and responds in real time, helping you communicate more effectively and giving you more control over the final product in less time.

Some of his views are as follows:

- Model-as-person creates distance between the user and the model, like the communication gap between two people, which can be narrowed but never completely closed.
- Given the overhead of communication, model-as-person systems are most useful when they can complete a whole chunk of work on their own.
- A good tool tells a human what it can be used for, and manipulating it directly is faster than writing out a request in words.
- Under the model-as-computer metaphor, a "computer application" becomes a way for the model to reveal itself to us, and you can have more control over the final product in less time.
- Generative user interfaces have the potential to replace the operating system entirely, generating and managing interfaces and windows on the fly as needed.

Academic Headlines has edited the content of the original article without changing its main idea. The content is as follows:

Since the launch of ChatGPT, exploration in the field of artificial intelligence (AI) has surged along two directions. The first is technical capability: how big a model can we train, how well can it answer SAT (Scholastic Assessment Test) questions, how efficiently can we serve it? The second is interaction design: how do we communicate with the model, how do we use it to do useful work, and what metaphors do we use to reason about it?

The first direction has received a great deal of attention and investment, and for good reason: advances in technical capability are fundamental to every possible application. But the second direction is just as critical to the field, and it holds enormous unknowns. We are only a few years into the era of large models; how likely is it that we have already figured out the best way to use them?

I propose a new mode of interaction, in which the model plays the role of an application on a computer (or a phone): providing a graphical interface, interpreting user input, and updating its state. In this mode, AI is no longer an "agent" that uses a computer on our behalf; instead, it can provide us with a richer and more powerful computing environment.

Interaction metaphors

At the heart of interaction are metaphors, which guide a user's expectations of a system.
Early computing translated metaphors like "desktop," "typewriter," "spreadsheet," and "letter" into digital equivalents that let users reason about their actions: you could put things on your desk and come back to them; you needed an address to mail a letter. As our cultural knowledge of these devices evolved, the need for these particular metaphors faded, and with them the skeuomorphic interface designs that reinforced them. Like the trash can or the pencil, the computer itself is now a metaphor.

Today, the dominant metaphor for large models is "model-as-person". This is an effective metaphor because people have a wide range of abilities, and we have strong intuitions about those abilities. It implies that we can talk to the model and ask it questions, that the model can work with us on a document or a piece of code, and that we can hand it a task and let it complete the work on its own.

However, thinking of models as people greatly limits how we imagine interacting with them. Human interaction is inherently slow and linear, limited by the bandwidth of speech and the turn-taking nature of conversation. We have all experienced conversations in which communicating a complex idea is difficult and information gets lost. When we need precision, we reach for tools, using direct manipulation and high-bandwidth visual interfaces to make diagrams, write code, and design CAD models. Because we conceptualize models as people, we use them through slow conversations, even though they are perfectly capable of accepting fast, direct input and producing visual results. The metaphors we use constrain the experiences we build, and "model-as-person" is keeping us from exploring the full potential of large models.

For many use cases, especially production work, I believe in another metaphor: "model-as-computer".

Using AI as a computer

Under the model-as-computer metaphor, we interact with the large model based on our intuitions about a computer application (whether on a desktop, tablet, or phone). Note that this does not mean the model becomes a traditional application; rather, a "computer application" is a way for the model to present itself to us. The model will act not like a "person" but like a "computer", and acting like a computer means producing a graphical interface. Instead of ChatGPT's charmingly teletype-like linear stream of text, a model-as-computer system would generate something resembling a modern application interface: buttons, sliders, tabs, images, drawings, and all the rest. This addresses the main limitations of the model-as-person chat interface:

Discoverability. A good tool tells a human what it can be used for. When the only interface is an empty text box, it is the user's responsibility to figure out what to do and to discover the boundaries of the system. The editing sidebar in Lightroom is a great way to learn photo editing because it shows you not only what the program can do with a photo but also what you might want to do. Similarly, a model-as-computer interface for DALL-E could surface new possibilities for image generation. If you ask for a sketch-style image, it could generate radio buttons for the drawing medium (pencil, marker, pastel...), a slider for the level of detail in the sketch, a toggle between color and black and white, and a few illustrated buttons for choosing the perspective (2D, isometric, two-point perspective...).
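To make this concrete, here is a minimal, purely illustrative sketch of how such a system might describe a generated interface as structured data that a client renders with ordinary widgets. The control vocabulary, the field names, and the DALL-E settings shown are assumptions made for the sake of the example, not part of any real product or API.

```typescript
// Hypothetical vocabulary of controls a model-as-computer system might emit.
// The model returns a UISpec as structured output; the client renders it with
// ordinary widgets and reports interactions back to the model.

type Control =
  | { kind: "radio"; id: string; label: string; options: string[]; selected: string }
  | { kind: "slider"; id: string; label: string; min: number; max: number; value: number }
  | { kind: "toggle"; id: string; label: string; on: boolean };

interface UISpec {
  title: string;
  controls: Control[];
}

// What the model might generate after a request for "a sketch-style image":
const sketchControls: UISpec = {
  title: "Sketch settings",
  controls: [
    { kind: "radio", id: "medium", label: "Drawing medium",
      options: ["pencil", "marker", "pastel"], selected: "pencil" },
    { kind: "slider", id: "detail", label: "Level of detail", min: 0, max: 10, value: 5 },
    { kind: "toggle", id: "color", label: "Color", on: false },
    { kind: "radio", id: "perspective", label: "Perspective",
      options: ["2D", "isometric", "two-point"], selected: "2D" },
  ],
};

// A stand-in renderer: a real client would map each control kind to a widget.
function describe(spec: UISpec): string {
  return [spec.title, ...spec.controls.map((c) => `- ${c.label} (${c.kind})`)].join("\n");
}

console.log(describe(sketchControls));
```

The point of the sketch is only that the interface is data the model can generate and revise on the fly, rather than code a developer shipped in advance.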
Efficiency. Direct manipulation is faster than writing out a request in words. Continuing with the Lightroom example, imagine trying to edit a photo by telling another person which slider to move and by how much: you would spend all day asking for a slightly lower exposure and a slightly higher vibrance just to see the result. In the model-as-computer metaphor, the model can build tools that let you express your intent more efficiently and therefore finish the task faster. In the DALL-E case, by clicking those options and dragging those sliders, you can explore the space of generated sketches in real time.

Unlike a traditional application, this graphical interface is generated on demand by the model. That means every part of the interface you see is relevant to what you are doing right now, including the specific content of your work (the subject of this drawing, the tone of this text). It also means that if you want more interface, or a different one, you can simply ask for it. You could ask DALL-E to make some editable presets for its settings inspired by famous sketch artists. Clicking the Da Vinci preset would set the sliders for a highly detailed black-ink perspective drawing; clicking the Charles Schulz preset would select a low-detail, colored 2D comic style.
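The real-time exploration described above can be sketched as a simple feedback loop: every control change goes back to the model together with the full control state, and the model returns a fresh result and, optionally, a revised interface. The regenerate() and showPreview() functions below are hypothetical stand-ins, not real DALL-E or LLM APIs.

```typescript
// A minimal sketch of the real-time loop, assuming a hypothetical regenerate()
// endpoint: each control change re-queries the model with the full state.

type ControlState = Record<string, string | number | boolean>;

interface ModelResponse {
  imageUrl: string;   // the regenerated sketch
  revisedUi?: string; // optionally, an updated interface description
}

// Stub standing in for the model; a real system would call an image/LLM API.
async function regenerate(state: ControlState): Promise<ModelResponse> {
  return { imageUrl: `sketch-${Object.values(state).join("-")}.png` };
}

function showPreview(imageUrl: string): void {
  console.log("preview:", imageUrl);
}

async function onControlChange(
  state: ControlState,
  controlId: string,
  value: string | number | boolean
): Promise<ControlState> {
  const next = { ...state, [controlId]: value }; // apply the user's change
  const response = await regenerate(next);       // ask the model for a new result
  showPreview(response.imageUrl);                // update the preview immediately
  return next;
}

// A preset is just a batch of control changes, e.g. a hypothetical "Da Vinci" preset:
const daVinciPreset: ControlState = { medium: "ink", detail: 9, color: false, perspective: "two-point" };

// Example: start from the preset, then drag the detail slider.
(async () => {
  let state: ControlState = { ...daVinciPreset };
  state = await onControlChange(state, "detail", 6);
})();
```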
A shape-shifting bicycle for the mind

Model-as-person has a strange tendency to create distance between the user and the model, like the communication gap between two people, which can be narrowed but never completely closed. Because communicating in language is difficult and expensive, people tend to break work into large chunks that are as independent as possible. Model-as-person interfaces follow the same pattern: it is not worth telling the model to add a return statement to a function if it is faster to type it yourself. Given the communication overhead, model-as-person systems are most useful when they can complete a whole chunk of work on their own. They do things for you.

This is in stark contrast to how we interact with computers and other tools. Tools produce visual feedback in real time and are controlled through direct manipulation. They have so little communication overhead that there is no need to carve out discrete chunks of work; it makes more sense to keep the human in the loop, directing the tool moment to moment. Like seven-league boots, tools let you cover more ground with each step, but you are still the one doing the work. They let you complete tasks faster.

Consider the task of building a website with a large model. With today's interfaces, you might treat the model as a contractor or a collaborator. You describe in words, as best you can, how the site should look, what it should say, and how it should behave. The model generates a first version, you run it, and you give feedback: "make the logo bigger," "center the first hero image," "there should be a login button in the header." You send a long list of increasingly nitpicky requests to get everything exactly the way you want it.

Interaction under model-as-computer looks different: instead of building the site, the model generates an interface for you to build it, and every input you make in that interface goes to the large model behind it. Perhaps when you describe your needs, it produces an interface with a sidebar and a preview window. At first the sidebar contains only a few layout sketches you can choose as a starting point. When you click one, the model writes the HTML for a page using that layout and shows it in the preview window. Now that you have a page to work with, the sidebar gains options that affect the whole page, such as font pairings and color schemes. The preview acts like a WYSIWYG editor, letting you grab elements, move them, edit their content, and so on. All of this is powered by the model, which sees your actions and rewrites the page to match the changes you make. Because the model can generate an interface that helps you communicate more effectively, you get more control over the final product in less time.

Model-as-computer encourages us to think of the model as a real-time interactive tool rather than a collaborator we hand tasks to. Rather than a replacement for an intern or a tutor, it is a versatile, shape-shifting bicycle for the mind, always tailored to you and the terrain you plan to traverse.
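The website-building loop just described can be sketched in the same spirit: the model receives the current page plus the user's latest action and returns a rewritten page and an updated sidebar. Everything here, including rewritePage(), the action types, and the stub rendering functions, is a hypothetical illustration of the idea, not a real API.

```typescript
// A rough sketch of the website-builder loop, assuming the model can take the
// current page plus a user action (a click, a drag, an edit) and return a
// rewritten page and sidebar. rewritePage() stands in for a real LLM call.

interface UserAction {
  kind: "choose-layout" | "move-element" | "edit-text" | "set-style";
  target: string;   // e.g. "layout-2" or "#hero"
  detail?: string;  // e.g. new text, or "align: center"
}

interface PageState {
  html: string;       // the page shown in the preview pane
  sidebar: string[];  // the options the model currently offers
}

// Stub standing in for the model; a real system would prompt an LLM with the
// current HTML and the action, then parse its structured output.
async function rewritePage(state: PageState, action: UserAction): Promise<PageState> {
  return {
    html: state.html + `<!-- applied ${action.kind} on ${action.target} -->`,
    sidebar: [...state.sidebar, "font pairing", "color scheme"],
  };
}

function renderPreview(html: string): void { console.log("preview:", html); }
function renderSidebar(options: string[]): void { console.log("sidebar:", options); }

async function handleAction(state: PageState, action: UserAction): Promise<PageState> {
  const next = await rewritePage(state, action); // the model rewrites the page
  renderPreview(next.html);                      // the WYSIWYG preview updates
  renderSidebar(next.sidebar);                   // the interface itself evolves
  return next;
}

// Example: pick a starting layout, then center the hero image.
(async () => {
  let state: PageState = { html: "<main></main>", sidebar: ["layout sketches"] };
  state = await handleAction(state, { kind: "choose-layout", target: "layout-2" });
  state = await handleAction(state, { kind: "move-element", target: "#hero", detail: "align: center" });
})();
```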
A new paradigm in computing?

Models that generate interfaces on demand are an entirely new frontier in computing. By bypassing the existing application model, they could constitute a new paradigm altogether. Giving end users the ability to create and modify applications on the fly fundamentally changes how we interact with computers. Instead of monolithic, static applications built by developers, the model generates applications tailored to the user and their immediate needs. Instead of business logic implemented in code, the model interprets user input and updates the user interface. Such generative user interfaces could even replace the operating system entirely, generating and managing interfaces and windows on the fly as needed.

At first, generative UIs will be a "toy", really useful only for creative exploration and a few other niche applications. After all, no one wants an email app that occasionally sends messages to your ex and lies to you about the state of your inbox. But over time these models will get better. Even as they push into the space of entirely new experiences, they will gradually become reliable enough to be used for real work.

We are already seeing glimpses of this future. A few years ago, Jonas Degrave showed that ChatGPT could do a decent job of emulating a Linux command line. Similarly, websim.ai uses an LLM to generate websites on demand as you browse them. Oasis, GameNGen, and DIAMOND train action-conditioned video models on individual video games, letting you play a game like Doom inside a large model. And Genie 2 generates playable video games from text prompts. Generative UIs may still be a crazy idea, but they are not that crazy.

There are still plenty of unanswered questions about what this will look like. Where will generative UIs first prove useful? How will we share the experiences we build with models if they exist only inside the context of a large model? Will we even want to? What new kinds of experiences will there be? How will all of this actually work? Should models generate interfaces as code, or render them directly as raw pixels? I don't know the answers yet. We'll have to experiment to find out!

Original link: https://willwhitney.com/computing-inside-ai.html
Translation: Li Wenjing
This article only expresses the author's views and does not represent the position of Academic Headlines.