The most powerful text-to-image model! How stunning is the visual quality of Stable Diffusion 3?

Last month, Stability AI released its third-generation text-to-image model, Stable Diffusion 3. The model has demonstrated powerful performance that surpasses existing text-to-image generation systems, bringing a major breakthrough in text-to-image generation technology.

Recently, Stability AI finally released the Stable Diffusion 3 technical report, which helps us take a peek at the technical details behind Stable Diffusion 3. The key points of the report are as follows:

As we all know, Stable Diffusion 3 performs well in aspects such as typography and prompt following, surpassing state-of-the-art text-to-image generation systems such as DALL·E 3, Midjourney v6, and Ideogram v1. Among them:

Compared to other open models and closed-source systems, Stable Diffusion 3 excels in areas such as visual aesthetics, prompt following, and typography.

Stable Diffusion 3 uses a reweighted Rectified Flow formulation to improve model performance, which is more stable than previously proposed rectified flow variants.

The new Multimodal Diffusion Transformer (MMDiT) architecture uses independent sets of weights to process image and language representations, improving text understanding and spelling capabilities compared to previous versions.

The MMDiT architecture combines the DiT and Rectified Flow (RF) formulations. It uses two separate transformers to process text and image embeddings and joins the sequences of the two modalities in the attention operation.

The MMDiT architecture is not only suitable for text-to-image generation, but can also be extended to multimodal data such as videos.

Removing the memory-intensive T5 text encoder significantly reduces SD3's memory requirements with only a small performance loss.

Figure | High-resolution samples from the 8B rectified flow model, demonstrating its capabilities in typography, precise prompt following and spatial reasoning, attention to detail, and high image quality across a variety of styles.

Full technical report link:

https://stabilityai-public-packages.s3.us-west-2.amazonaws.com/Stable+Diffusion+3+Paper.pdf

Next, let’s take a look at the technical details behind Stable Diffusion 3 based on the report.

MMDiT architecture: the key technology behind Stable Diffusion 3

The MMDiT architecture is one of the key technologies behind Stable Diffusion 3. Compared with traditional single-modality processing methods, the MMDiT architecture can better handle the relationship between text and images, thereby achieving more accurate and higher-quality image generation.

Figure | Model architecture.

This architecture uses independent sets of weights for image and language representations: for the two input modalities, text and image, MMDiT applies separate weight parameters for encoding and processing, which allows it to better capture the characteristics and information of each modality.

In the MMDiT architecture, the representations of text and images are encoded separately through pre-trained models. Specifically, MMDiT uses three different text embedders (two CLIP models and a T5 model), as well as an improved auto-encoding model to encode image tokens. These encoders are able to convert text and image inputs into a format that the model can understand and process, providing a basis for the subsequent image generation process.
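Based on this description, here is a minimal sketch of how the outputs of the three text embedders might be combined into a single conditioning sequence. The function name and exact tensor shapes are illustrative assumptions, not the official implementation:

```python
import torch

# A sketch of combining SD3's three text encoders, following the report:
# token embeddings from the two CLIP models are concatenated channel-wise,
# zero-padded to the T5 width, then concatenated sequence-wise with the
# T5 tokens. Shapes shown are illustrative assumptions.

def build_text_conditioning(clip_l_tokens,  # (B, 77, 768)   CLIP ViT-L token embeddings
                            clip_g_tokens,  # (B, 77, 1280)  OpenCLIP bigG token embeddings
                            t5_tokens):     # (B, 77, 4096)  T5-XXL token embeddings
    # Channel-wise concat of the two CLIP token streams -> (B, 77, 2048)
    clip_tokens = torch.cat([clip_l_tokens, clip_g_tokens], dim=-1)
    # Zero-pad the CLIP channels up to the T5 width -> (B, 77, 4096)
    pad = t5_tokens.shape[-1] - clip_tokens.shape[-1]
    clip_tokens = torch.nn.functional.pad(clip_tokens, (0, pad))
    # Sequence-wise concat with the T5 tokens -> (B, 154, 4096)
    return torch.cat([clip_tokens, t5_tokens], dim=1)
```

Padding the CLIP channels to the T5 width lets all text tokens live in one sequence, which the transformer blocks can then attend over jointly with the image tokens.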

Figure | T5 is very important for complex prompts, e.g., those involving high levels of detail or long spellings (rows 2 and 3). However, for most prompts, removing T5 at inference time still achieves competitive performance.
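As a rough illustration of what removing T5 at inference could look like under the conditioning layout sketched above, the T5 portion of the sequence can simply be zeroed out. This helper is hypothetical, not from the report:

```python
import torch

# Hypothetical helper: with the conditioning layout sketched above
# (CLIP tokens first, T5 tokens after), "removing" T5 at inference
# could amount to zeroing its slots in the sequence.

def drop_t5(cond, num_clip_tokens=77):
    cond = cond.clone()
    cond[:, num_clip_tokens:] = 0.0  # zero out the T5 portion
    return cond
```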

In terms of model structure, the MMDiT architecture builds on the Diffusion Transformer (DiT). Because text and image representations are conceptually different, MMDiT uses two independent sets of weight parameters to handle the two modalities. In this way, the model can operate in each modality's own representation space while still accounting for the correlation between them, achieving better information transfer and integration, as the sketch below illustrates.
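Here is a simplified sketch of a single MMDiT block, assuming the structure described above: each modality has its own projection weights, and a single attention operation runs over the concatenated text and image token sequences. Class and parameter names are illustrative, and the real block also includes timestep modulation and MLP sublayers omitted here:

```python
import torch
import torch.nn as nn

class MMDiTBlock(nn.Module):
    """Simplified MMDiT block: per-modality weights, joint attention."""

    def __init__(self, dim, num_heads):
        super().__init__()
        # Independent weight sets for the two modalities
        self.txt_qkv = nn.Linear(dim, 3 * dim)
        self.img_qkv = nn.Linear(dim, 3 * dim)
        self.txt_out = nn.Linear(dim, dim)
        self.img_out = nn.Linear(dim, dim)
        self.num_heads = num_heads

    def forward(self, txt, img):  # txt: (B, Lt, D), img: (B, Li, D)
        B, Lt, D = txt.shape
        Li = img.shape[1]
        H = self.num_heads

        # Each modality computes q, k, v with its own weights
        tq, tk, tv = self.txt_qkv(txt).chunk(3, dim=-1)
        iq, ik, iv = self.img_qkv(img).chunk(3, dim=-1)

        # Concatenate the two sequences so attention runs jointly over both
        q = torch.cat([tq, iq], dim=1).view(B, Lt + Li, H, D // H).transpose(1, 2)
        k = torch.cat([tk, ik], dim=1).view(B, Lt + Li, H, D // H).transpose(1, 2)
        v = torch.cat([tv, iv], dim=1).view(B, Lt + Li, H, D // H).transpose(1, 2)

        out = nn.functional.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(B, Lt + Li, D)

        # Split back and project each modality with its own output weights
        return self.txt_out(out[:, :Lt]), self.img_out(out[:, Lt:])
```

The key point is that attention is the only place the two streams mix; every other operation is modality-specific.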

Performance surpasses other text-to-image models

Compared with other text-to-image generation models, Stable Diffusion 3 shows a clear advantage. In terms of visual aesthetics, prompt following, and typography, Stable Diffusion 3 surpasses the most advanced systems, including DALL·E 3, Midjourney v6, and Ideogram v1.

This advantage is mainly due to the MMDiT architecture's independent processing of image and text representations, which enables the model to better understand and express text prompts and generate high-quality images to match them. In human evaluations of example outputs, evaluators were asked to select the best result based on the aesthetic quality of the image, and the results show that Stable Diffusion 3 outperforms the other models on this measure.

This whimsical, creative image depicts a creature that is a mix of a waffle and a hippopotamus. The imaginative creature has the bulky body of a hippopotamus, but its appearance is that of a golden-brown, crispy waffle; its skin has a waffle texture and a syrupy sheen. It is set in a surreal environment that playfully combines the hippopotamus's natural water habitat with a breakfast table, including oversized cutlery or plates in the background. The image evokes a sense of playful absurdity and culinary fantasy.

Evaluators assessed the model's prompt-following ability based on how consistent the model's output is with the given prompt. The test results show that Stable Diffusion 3 performs well in prompt following and can more accurately generate image content corresponding to the prompt.

Typography refers to the layout, formatting, and appearance of text in images generated by the model. According to the evaluators’ choices, Stable Diffusion 3 also performed well in typography, better presenting the text information in a given prompt and making the generated images more readable and attractive.

In addition, Stable Diffusion 3 also demonstrates excellent flexibility in terms of performance on different hardware devices.

For example, on a device such as an RTX 4090, the largest model (8B parameters) can generate a 1024x1024 image within 34 seconds. In the initial preview stage, a range of model sizes will also be offered, from 800M to 8B parameters, to further reduce hardware limitations.

On consumer-level hardware, Stable Diffusion 3 still has a fast inference speed and high resource utilization.

In addition, the technology provides a variety of model scale options to meet the needs of different users and application scenarios, enhancing its scalability and applicability.

Stable Diffusion 3 not only focuses on the quality of image generation, but also focuses on alignment and consistency with text. Its improved Prompt Following function enables the model to better understand the input text and create images based on it, rather than simply generating images. This flexibility enables Stable Diffusion 3 to generate diverse images based on different input texts to meet different themes and needs.

Stable Diffusion 3 uses an improved Rectified Flow (RF) formulation that connects data and noise along linear trajectories, making the inference path straighter and allowing sampling in fewer steps. At the same time, Stable Diffusion 3 introduces a new trajectory sampling schedule that assigns more weight to the middle of the trajectory, where the prediction task is hardest. This approach improves the performance of the model and achieves better results in text-to-image generation tasks.
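As a rough illustration of this formulation, the sketch below implements a rectified-flow training loss with logit-normal timestep sampling, which concentrates weight on the middle of the trajectory. `model` and `rf_loss` are placeholder names, not from the report's code:

```python
import torch

# A minimal sketch of a rectified-flow training step: the noisy sample
# lies on a straight line between data and noise, and the target is the
# constant velocity along that line. `model` is a placeholder denoiser.

def rf_loss(model, x0):  # x0: a batch of clean latents
    # Logit-normal sampling puts more weight on mid-trajectory timesteps
    t = torch.sigmoid(torch.randn(x0.shape[0], device=x0.device))
    t = t.view(-1, *([1] * (x0.dim() - 1)))

    noise = torch.randn_like(x0)
    z_t = (1 - t) * x0 + t * noise  # linear (rectified) trajectory
    target = noise - x0             # constant velocity dz_t/dt

    return ((model(z_t, t) - target) ** 2).mean()
```

Because the trajectory is a straight line, the target velocity is constant along it, which is what makes sampling in fewer steps viable.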

In the field of text-to-image generation, the release of Stable Diffusion 3 marks a significant technological advance. Through the innovative MMDiT architecture, the optimized Rectified Flow formulation, and flexible options across hardware and model scales, Stable Diffusion 3 delivers outstanding performance in visual aesthetics, prompt following, and typography, surpassing current state-of-the-art text-to-image generation systems.

The birth of Stable Diffusion 3 not only improves the quality and accuracy of generated images, but also brings new possibilities for future creative industries, personalized content generation, auxiliary creation tools, and augmented reality and virtual reality applications.

In the future, as this technology further develops and becomes more popular, we can expect to see more innovative application scenarios and solutions.

Reference Links:

https://stability.ai/news/stable-diffusion-3-research-paper
