
Easy Generative AI on Google Colaboratory: Part 8 - Stable Diffusion 3 Medium for Image Generation


Introduction

In this article, we will create an environment that makes it easy for beginners to use generative AI on Google Colaboratory.
(This also serves largely as a personal memo for building a voice dialogue system.)

In Part 8, we will set up an image generation AI. The model we will use this time is Stable Diffusion 3.
As of July 2024, it is the latest image generation AI model.

Although this is Part 8, the content is written so that it can be understood without having read Part 1 to Part 7.

Since this is image generation AI, it may seem unrelated to voice dialogue systems, but I am studying it with the idea that it could be used, for example, to display appropriate images to match the AI's speech.
(I have introduced a voice dialogue system I implemented in the past below. I would be happy if you could take a look if you are interested.)
https://zenn.dev/asap/articles/5b1b7553fcaa76

I am conducting this within the scope of what is possible on the free version of Google Colaboratory for beginners, so please feel free to try it out.

In this article, I also conduct experiments such as reconstructing images from latent representations in the middle of the diffusion process, so I would be grateful if those interested could read until the end!

This time, I am writing with reference to the following article.
https://huggingface.co/blog/sd3

What is Stable Diffusion 3?

Stable Diffusion 3 (SD3) is one of the image generation AIs developed by Stability AI.
As of July 2024, it is the latest model.

There are several models in Stable Diffusion 3 as follows:

  • Stable Diffusion 3 Medium
  • Stable Diffusion 3 Large
  • Stable Diffusion 3 Large Turbo
  • Stable Image Ultra

Stable Diffusion 3 Medium has 2B parameters, and Large is an 8B model.
Large Turbo is, according to the official announcement, an 8B-parameter model with reduced inference time. I suspect it may be a distilled model in which the parameter count stays the same but the number of inference steps is reduced, though I don't know the details.

And Stable Image Ultra is technically a service name rather than a model name, but it is said to be a model that performs image generation using the highest performance model available at the moment. The details are not well known.

Among the above, the one that can generate the most beautiful images is Stable Image Ultra, but this service is only provided via API, and the same applies to Stable Diffusion 3 Large and Stable Diffusion 3 Large Turbo.
Unfortunately, they cannot be run locally, so this time I would like to try Stable Diffusion 3 Medium, which can be run locally (the model weights are public).

Another point is that when using Stable Diffusion, the mainstream way is to use a WebUI. However, since the use of WebUI is prohibited on Google Colab (due to terms of service), we will generate images within a Python script as before.
(We will use a library called Diffusers.)

In this article, Stable Diffusion 3 Medium will be referred to as SD3 from here on.

Deliverables

Please see the following repository.
https://github.com/personabb/colab_AI_sample/tree/main/colab_SD3_sample

Experiments this Time

Since this is my first time working with SD3, I am conducting various experiments following my curiosity.
(The technologies I introduced before were ones I had already used quite a bit, so I only gave brief introductions; my apologies.)

The details of the experiments conducted are listed below. The experimental results are introduced at the end.

  • Experiment 1
    • Try the implementation exactly as in the article.
      • Dropping the T5 encoder
      • guided_scale at 7.0
      • shift at 3.0, etc.
  • Experiment 2
    • Try changing guided_scale to 4.0
  • Experiment 3-3.5
    • Try changing shift to 1.0 or 6.0
  • Experiment 4
    • Try model compile
  • Experiment 5
    • Try introducing T5 with CPU offloading
  • Experiment 6
    • Try introducing T5 with quantization
  • Experiment 7
    • Try compile with T5 introduced
  • Experiment 8
    • Check latent representations at each process
  • Experiment 9-12
    • Decrease the number of generation steps and check the generated images
  • Experiment 13-15
    • Increase the number of generation steps and check the generated images

Preparation

Obtaining and Registering Hugging Face Login Token

To make the SD3 model available locally, you need to obtain a login token from Hugging Face.
Please refer to the following article for how to obtain a login token.
https://zenn.dev/protoout/articles/73-hugging-face-setup

Also, you need to register the obtained login token in Google Colab.
Please refer to the following article for registration.
https://note.com/npaka/n/n79bb63e17685

Please register it with the name HF_LOGIN.

Preparing Access to SD3 Weights

Next, enable access to Stable Diffusion weights on Hugging Face.

https://huggingface.co/stabilityai/stable-diffusion-3-medium

When you access it for the first time, a screen like the one above should appear, so please log in with the account you created.

An input form should then be displayed.
By filling out and submitting that form, you will be able to access the model weights.

Explanation

I will provide an explanation as follows.
First, please clone the repository above.

./
git clone https://github.com/personabb/colab_AI_sample.git

After that, please place the cloned folder "colab_AI_sample" in an appropriate location on your My Drive.

Directory Structure

The directory structure on Google Drive is assumed to be as follows:

MyDrive/
    └ colab_AI_sample/
          └ colab_SD3_sample/
                  ├ configs/
                  |    └ config.ini
                  ├ outputs/
                  ├ module/
                  |    └ module_sd3.py
                  └ StableDiffusion3_sample.ipynb

  • The colab_AI_sample folder name is arbitrary; anything is fine. It does not need to be a single-level directory and can be multiple levels deep as shown below:
    • MyDrive/hogehoge/spamspam/hogespam/colab_AI_sample
  • The outputs folder is where generated images are stored. It is initially empty.
    • If you perform generations consecutively, it is recommended to download the images or change the names beforehand because previous content will be overwritten.

Usage Instructions

Open StableDiffusion3_sample.ipynb with the Google Colaboratory app. If you right-click the file, an "Open with" option will appear; select Google Colaboratory from there.

If it's not available, go to the app store from "Add apps," search for "Google Colaboratory," and install it.

Once opened in the Google Colaboratory app, refer to the notes in StableDiffusion3_sample.ipynb and execute the cells in order from the top. You should be able to run it until the end and generate images without issues.

Also, after running to the end, if you wish to change the parameters and run it again, click "Runtime" -> "Restart session and run all."

Code Explanation

I will mainly explain the important files StableDiffusion3_sample.ipynb and module/module_sd3.py.

StableDiffusion3_sample.ipynb

The corresponding code is below.
https://github.com/personabb/colab_AI_sample/blob/main/colab_SD3_sample/StableDiffusion3_sample.ipynb

I will explain each cell below.

Cell 1

./colab_AI_sample/colab_SD3_sample/StableDiffusion3_sample.ipynb
# Installation of necessary modules for SD3
!pip install -Uqq diffusers transformers ftfy accelerate bitsandbytes

Here, the necessary modules are installed.
Since basic deep learning packages like PyTorch are already installed in Google Colab, you only need to install the ones listed above.

Cell 2

./colab_AI_sample/colab_SD3_sample/StableDiffusion3_sample.ipynb
# Mount Google Drive folder (authentication required)
from google.colab import drive
drive.mount('/content/drive')

from huggingface_hub import login
from google.colab import userdata
HF_LOGIN = userdata.get('HF_LOGIN')
login(HF_LOGIN)

# Change current directory to the directory where this file exists.
import glob
import os
pwd = os.path.dirname(glob.glob('/content/drive/MyDrive/**/colab_SD3_sample/StableDiffusion3_sample.ipynb', recursive=True)[0])
print(pwd)

%cd $pwd
!pwd

Here, the contents of Google Drive are mounted.
By mounting, you can read from and write to files stored in Google Drive.

When mounting, you need to grant permission from Colab.
A popup will appear, so follow the instructions to grant access.

Also, it reads the HF_LOGIN token registered in Google Colab and logs into Hugging Face.

Additionally, the current directory is changed from / to /content/drive/MyDrive/**/colab_SD3_sample.
(** is a wildcard that matches any (multiple) directories).
While changing the current directory isn't strictly necessary, it makes specifying folder paths easier in the subsequent steps.
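As a small aside, the recursive `**` matching used above can be sketched outside of Colab like this (the directory names are just placeholders borrowed from the earlier example, built in a temporary directory rather than Google Drive):

```python
import glob
import os
import tempfile

# Minimal sketch (in a temporary directory, not Google Drive) of how "**"
# with recursive=True matches any number of intermediate directories.
root = tempfile.mkdtemp()
target = os.path.join(root, "hogehoge", "spamspam", "colab_SD3_sample")
os.makedirs(target)
open(os.path.join(target, "StableDiffusion3_sample.ipynb"), "w").close()

# Without recursive=True, "**" behaves like a single "*" and would not
# match across the two intermediate directories here.
hits = glob.glob(os.path.join(root, "**", "colab_SD3_sample", "*.ipynb"),
                 recursive=True)
pwd = os.path.dirname(hits[0])
print(pwd)
```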

Cell 3

./colab_AI_sample/colab_SD3_sample/StableDiffusion3_sample.ipynb
# Import modules
from module.module_sd3 import SD3
import time

Import the SD3 class from module/module_sd3.py as a module.
The details of this content will be explained in a later chapter.

The time module is also loaded to measure the execution time.

Cell 4

./colab_AI_sample/colab_SD3_sample/StableDiffusion3_sample.ipynb
# Configure the model.

config_text = """
[SD3]
device = auto
n_steps=28
seed=42
shift = 3.0

model_path = stabilityai/stable-diffusion-3-medium-diffusers

guided_scale = 7.0
width = 1024
height = 1024

use_cpu_offload = False
use_text_encoder_3 = False
use_t5_quantization = False
use_model_compile = False
save_latent = False

"""

with open("configs/config.ini", "w", encoding="utf-8") as f:
  f.write(config_text)

In this cell, the content of the configuration file configs/config.ini is overwritten with the content of config_text.
SD3 operates according to these settings.
The various experiments described in later chapters are conducted by slightly modifying this configuration file.

Here is a brief explanation of each setting:

  • device = auto
    • Setting for whether to use GPU or CPU.
    • If set to auto, it will always use the GPU if available, so there is basically no need to change this.
  • n_steps=28
    • Setting for the number of steps in the diffusion process. Default is 28.
  • seed=42
    • The seed value for random numbers. Fixing this value allows the same image to be generated.
  • shift = 3.0
    • One of the values input into FlowMatchEulerDiscreteScheduler.
      • The recommended value suggested is 3.0.
    • A Scheduler is a Solver that solves ordinary differential equations (ODEs).
      • Since the (reverse) diffusion process can be formulated as solving an ODE, an ODE solver can be used to generate images.
    • By the way, this is the only Solver I know of that works with SD3 in Diffusers, so if there are other compatible Solvers, please let me know.
      • I tried DPMSolverMultistepScheduler, but it didn't result in an image.
  • model_path = stabilityai/stable-diffusion-3-medium-diffusers
    • Specifies the model used for SD3.
    • If there is a new model available for SD3 in Diffusers, you can use it by specifying it here.
      • Please let me know if you find a good model.
  • guided_scale = 7.0
    • This is the CFG scale. It represents the prompt adherence in Classifier Free Guidance.
    • While the article uses 7.0, other experts' articles testing SD3 suggest 3.5-4.5 is recommended (to be tested in the experiments described later).
  • width = 1024 and height = 1024
    • Indicates the image resolution. These are the recommended values. It is possible to change them if you want to create portrait images, etc.
  • use_cpu_offload = False
    • Whether to use CPU offloading. Set to True to use it.
      • When using a large Text Encoder like T5, the entire model may not fit in VRAM, so this is used. In that case, a large amount of system RAM is required instead of VRAM.
  • use_text_encoder_3 = False
    • Whether to use the large Text Encoder called T5. Set to True to use it.
      • T5 is a high-performance encoder capable of reading very long prompts.
    • If it is not used, the two remaining CLIP Text Encoders (which were also used in conventional models) are used instead.
      • In that case, the prompt is limited to 77 tokens.
  • use_t5_quantization = False
    • Whether to use the large Text Encoder T5 quantized to 8-bit. Set to True to quantize it.
    • Quantization may allow it to be used even with a small amount of VRAM.
    • If quantized, the model is automatically CPU-offloaded, so the use_cpu_offload setting has no effect.
  • use_model_compile = False
    • Whether to compile the model. Set to True to compile.
    • Compiling optimizes calculations, which can reduce execution time.
    • If CPU offloading is functioning (e.g., when use_cpu_offload = True or use_t5_quantization = True), it (basically) cannot be compiled.
      • There are ways to compile it, which are briefly mentioned in the article.
  • save_latent = False
    • Whether to reconstruct images from latent representations during the diffusion process. Set to True to reconstruct and save.
    • By setting this to True, you can create a GIF like the one at the top of the article.
      • In practice, a number of images equal to n_steps will be saved (from pure noise to the final generated image).

Actually, the commonly recommended setting of the scheduler's shift value to 3.0 is reportedly already applied inside StableDiffusion3Pipeline, so there is no need to specify it separately; it automatically becomes 3.0 if you use the default.
(I realized this while conducting the experiments.)

(Looking at the documentation, the default value for shift in FlowMatchEulerDiscreteScheduler is 1.0, which is confusing, but you can confirm that shift is 3.0 by checking as follows.)

(The code below will be introduced later, but I will show it here as a preview.)

./colab_AI_sample/colab_SD3_sample/module/module_sd3.py
class SD3:
    ...
    def preprepare_model(self):
        ...
        if self.use_cpu_offload:
            pipe.enable_model_cpu_offload()
        else:
            pipe = pipe.to("cuda")
            
        print(pipe.scheduler.config)

Cell 5

./colab_AI_sample/colab_SD3_sample/StableDiffusion3_sample.ipynb
# Configure the model.
# Set the prompts to be read.

main_prompt = """
Anime style, An anime digital illustration of young woman with striking red eyes, standing outside surrounded by falling snow and cherry blossoms. Her long, flowing silver hair is intricately braided, adorned an accessory made of lace, white tripetal flowers, and small, white pearls, cascading around her determined face. She wears a high-collared, form-fitting white dress with intricate lace details and cut-out sections on the sleeves, adding to the air of elegance and strength. The background features out-of-focus branches dotted with vibrant pink blossoms, set against a clear blue sky speckled with snow and bokeh, creating a dreamlike, serene atmosphere. The lighting casts a soft, almost divine glow on her, emphasizing her resolute expression. The artwork captures her from the waist up. Image is an anime style digital artwork. trending on artstation, pixiv, anime-style, anime screenshot
"""

negative_prompt=""

In this cell, we specify the prompts for generation.

I am using the prompt shared in the post below. Thank you very much.
https://x.com/Lykon4072/status/1801036331973275843

With SD3, there is no need to specify a negative prompt, so it is set to an empty string.

Also, a very long prompt is specified for main_prompt.
As a result, when you execute the image generation part of the model using this prompt, you will see a warning like the following:

Token indices sequence length is longer than the specified maximum sequence length for this model (188 > 77). Running this sequence through the model will result in indexing errors
The following part of your input was truncated because CLIP can only handle sequences up to 77 tokens: ['with intricate lace details and cut - out sections on the sleeves, adding to the air of elegance and strength. the background features out - of - focus branches dotted with vibrant pink blossoms, set against a clear blue sky speckled with snow and bokeh, creating a dreamlike, serene atmosphere. the lighting casts a soft, almost divine glow on her, emphasizing her resolute expression. the artwork captures her from the waist up. image is an anime style digital artwork. trending on artstation, pixiv, anime - style, anime screenshot']

This warning indicates that CLIP, which is used for the first two of the three Text Encoders, can only process 77 tokens, so the latter half of the prompt will be truncated.
On the other hand, the third Text Encoder, T5, can process very long prompts, so when using T5, the entire prompt can be processed without issues.
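As a rough way to anticipate this warning before running the model, you can compare the prompt's word count against the 77-token limit. This is only a heuristic sketch: CLIP actually tokenizes into BPE subwords, so the real token count is usually higher than the word count, and an exact count would require running the actual CLIP tokenizer from transformers.

```python
# Heuristic sketch only: CLIP uses BPE subword tokens, so the true token
# count is usually higher than the whitespace word count. An exact count
# would require the actual CLIP tokenizer.
def roughly_exceeds_clip_limit(prompt: str, limit: int = 77) -> bool:
    return len(prompt.split()) >= limit

print(roughly_exceeds_clip_limit("a short prompt"))       # False
print(roughly_exceeds_clip_limit(" ".join(["w"] * 150)))  # True
```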

I plan to test how the output changes between using and not using T5 in the experiments.

Cell 6

./colab_AI_sample/colab_SD3_sample/StableDiffusion3_sample.ipynb
sd = SD3()

Here, an instance of the SD3 class is created.

Cell 7

./colab_AI_sample/colab_SD3_sample/StableDiffusion3_sample.ipynb
for i in range(3):
      start = time.time()
      image = sd.generate_image(main_prompt, neg_prompt = negative_prompt)
      print("generate image time: ", time.time()-start)
      image.save("./outputs/SD3_result_{}.png".format(i))

Here, images are generated using SD3.
By running the for loop three times, three images are generated and saved in the outputs folder.
Additionally, the time taken to generate each image is measured and displayed.

The execution results (generated images), etc., will be presented in the later section on experimental results.

The images are saved in the outputs folder, but please note that they will be saved with the same names each time you run it, so previous images will be overwritten.
(Unless you change the seed value, the exact same image will be generated every time. The seed can be changed in the configuration file.)
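If you want to keep results from consecutive runs, one simple option is to include a timestamp in the filename instead of reusing SD3_result_{i}.png. This is a hedged sketch, not part of the repository; the name pattern is just an assumption:

```python
import time

# Sketch: include a timestamp so consecutive runs do not overwrite each
# other. The "./outputs" directory follows the notebook, but the exact
# name format here is an assumption.
def unique_output_path(index: int, outdir: str = "./outputs") -> str:
    stamp = time.strftime("%Y%m%d_%H%M%S", time.localtime())
    return f"{outdir}/SD3_result_{stamp}_{index}.png"

path = unique_output_path(0)
print(path)
```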

module/module_sd3.py

Next, I will explain the contents of the module loaded from StableDiffusion3_sample.ipynb.

The full code is shown below.

Full Code
./colab_AI_sample/colab_SD3_sample/module/module_sd3.py
import torch
from diffusers import StableDiffusion3Pipeline, AutoencoderTiny , FlowMatchEulerDiscreteScheduler, DPMSolverMultistepScheduler
from diffusers.pipelines.stable_diffusion_3.pipeline_output import StableDiffusion3PipelineOutput
from transformers import T5EncoderModel, BitsAndBytesConfig
from PIL import Image

import os
import configparser
# Module for checking file existence
import errno
import time

class SD3config:
    def __init__(self, config_ini_path = './configs/config.ini'):
        # Load the ini file
        self.config_ini = configparser.ConfigParser()
        
        # Raise an error if the specified ini file does not exist
        if not os.path.exists(config_ini_path):
            raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), config_ini_path)
        
        self.config_ini.read(config_ini_path, encoding='utf-8')
        SD3_items = self.config_ini.items('SD3')
        self.SD3_config_dict = dict(SD3_items)

class SD3:
    def __init__(self,device = None, config_ini_path = './configs/config.ini'):
        
        SD3_config = SD3config(config_ini_path = config_ini_path)
        config_dict = SD3_config.SD3_config_dict


        if device is not None:
            self.device = device
        else:
            device = config_dict["device"]

            self.device = "cuda" if torch.cuda.is_available() else "cpu"
            if device != "auto":
                self.device = device
                
        self.n_steps = int(config_dict["n_steps"])
        self.seed = int(config_dict["seed"])
        self.generator = torch.Generator(device=self.device).manual_seed(self.seed)
        self.width = int(config_dict["width"])
        self.height = int(config_dict["height"])
        self.guided_scale = float(config_dict["guided_scale"])
        self.shift = float(config_dict["shift"])
        
        self.use_cpu_offload = config_dict["use_cpu_offload"]
        if self.use_cpu_offload == "True":
            self.use_cpu_offload = True
        else:
            self.use_cpu_offload = False
            
        self.use_text_encoder_3 = config_dict["use_text_encoder_3"]
        if self.use_text_encoder_3 == "True":
            self.use_text_encoder_3 = True
        else:
            self.use_text_encoder_3 = False
            
        self.use_T5_quantization = config_dict["use_t5_quantization"]
        if self.use_T5_quantization == "True":
            self.use_T5_quantization = True
        else:
            self.use_T5_quantization = False
        
        self.use_model_compile = config_dict["use_model_compile"]
        if self.use_model_compile == "True":
            self.use_model_compile = True
        else:
            self.use_model_compile = False
        
        self.save_latent = config_dict["save_latent"]
        if self.save_latent == "True":
            self.save_latent = True
        else:
            self.save_latent = False
            
        self.model_path = config_dict["model_path"]
        
    
        if self.use_model_compile:
            self.pipe  = self.preprepare_compile_model()
        else:
            self.pipe  = self.preprepare_model()
        

    def preprepare_model(self):
        
        pipe = None
        
        sampler = FlowMatchEulerDiscreteScheduler(
                    shift = self.shift
                    )

        
        if self.use_text_encoder_3:
            if self.use_T5_quantization:
                quantization_config = BitsAndBytesConfig(load_in_8bit=True)

                text_encoder = T5EncoderModel.from_pretrained(
                    self.model_path,
                    subfolder="text_encoder_3",
                    quantization_config=quantization_config,
                )
                pipe = StableDiffusion3Pipeline.from_pretrained(
                    self.model_path,
                    scheduler = sampler,
                    text_encoder_3=text_encoder,
                    device_map="balanced",
                    torch_dtype=torch.float16
                )
                
            else:
                pipe = StableDiffusion3Pipeline.from_pretrained(self.model_path, scheduler = sampler, torch_dtype=torch.float16)
        else:
            pipe = StableDiffusion3Pipeline.from_pretrained(
                        self.model_path,
                        scheduler = sampler,
                        text_encoder_3=None,
                        tokenizer_3=None,
                        torch_dtype=torch.float16)
        
        
        if self.use_T5_quantization:
            pass
        elif self.use_cpu_offload:
            pipe.enable_model_cpu_offload()
        else:
            pipe = pipe.to("cuda")
            
        print(pipe.scheduler.config)

        return pipe
    
    
    def preprepare_compile_model(self):

        torch.set_float32_matmul_precision("high")

        torch._inductor.config.conv_1x1_as_mm = True
        torch._inductor.config.coordinate_descent_tuning = True
        torch._inductor.config.epilogue_fusion = False
        torch._inductor.config.coordinate_descent_check_all_directions = True
        
        pipe = None
        sampler = FlowMatchEulerDiscreteScheduler(
                    shift = self.shift
                    )
        
        if self.use_text_encoder_3:
            if self.use_T5_quantization:
                quantization_config = BitsAndBytesConfig(load_in_8bit=True)

                text_encoder = T5EncoderModel.from_pretrained(
                    self.model_path,
                    subfolder="text_encoder_3",
                    quantization_config=quantization_config,
                )
                pipe = StableDiffusion3Pipeline.from_pretrained(
                    self.model_path,
                    scheduler = sampler,
                    text_encoder_3=text_encoder,
                    device_map="balanced",
                    torch_dtype=torch.float16
                )
                
            else:
                pipe = StableDiffusion3Pipeline.from_pretrained(self.model_path, scheduler = sampler, torch_dtype=torch.float16)
        else:
            pipe = StableDiffusion3Pipeline.from_pretrained(
                        self.model_path,
                        scheduler = sampler,
                        text_encoder_3=None,
                        tokenizer_3=None,
                        torch_dtype=torch.float16)
        
        
        if self.use_T5_quantization:
            pass
        elif self.use_cpu_offload:
            pipe.enable_model_cpu_offload()
        else:
            pipe = pipe.to("cuda")
            
        print(pipe.scheduler.config)
            
        pipe.set_progress_bar_config(disable=True)
        pipe.transformer.to(memory_format=torch.channels_last)
        pipe.vae.to(memory_format=torch.channels_last)
        
        pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
        pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=True)

        return pipe
        
            
    
    def generate_image(self, prompt, prompt_2 = None, prompt_3 = None, neg_prompt = "", neg_prompt_2 = None, neg_prompt_3 = None,seed = None):
        
        def decode_tensors(pipe, step, timestep, callback_kwargs):
            latents = callback_kwargs["latents"]
        
            image = latents_to_rgb(latents,pipe)
            gettime = time.time()
            formatted_time_human_readable = time.strftime("%Y%m%d_%H%M%S", time.localtime(gettime))
            image.save(f"./outputs/latent_{formatted_time_human_readable}_{step}_{timestep}.png")
        
        
            return callback_kwargs
            
        def latents_to_rgb(latents,pipe):

            latents = (latents / pipe.vae.config.scaling_factor) + pipe.vae.config.shift_factor

            img = pipe.vae.decode(latents, return_dict=False)[0]
            img = pipe.image_processor.postprocess(img, output_type="pil")
        
            return StableDiffusion3PipelineOutput(images=img).images[0]
        
        if seed is not None:
            self.generator = torch.Generator(device=self.device).manual_seed(seed)
            
        if prompt_2 is None:
            prompt_2 = prompt
        if prompt_3 is None:
            prompt_3 = prompt
        if neg_prompt_2 is None:
            neg_prompt_2 = neg_prompt
        if neg_prompt_3 is None:
            neg_prompt_3 = neg_prompt
            
        image = None
        if self.save_latent:
            image = self.pipe(
                prompt=prompt, 
                prompt_2=prompt_2, 
                prompt_3=prompt_3, 
                negative_prompt=neg_prompt,
                negative_prompt_2 = neg_prompt_2,
                negative_prompt_3 = neg_prompt_3,
                height = self.height,
                width = self.width,
                num_inference_steps=self.n_steps,
                guidance_scale=self.guided_scale,
                generator=self.generator,
                callback_on_step_end=decode_tensors,
                callback_on_step_end_tensor_inputs=["latents"],
                ).images[0]
        else:
            image = self.pipe(
                prompt=prompt, 
                prompt_2=prompt_2, 
                prompt_3=prompt_3, 
                negative_prompt=neg_prompt,
                negative_prompt_2 = neg_prompt_2,
                negative_prompt_3 = neg_prompt_3,
                height = self.height,
                width = self.width,
                num_inference_steps=self.n_steps,
                guidance_scale=self.guided_scale,
                generator=self.generator
                ).images[0]
        
        
        return image

Now, let's explain each part one by one.

SD3config Class

./colab_AI_sample/colab_SD3_sample/module/module_sd3.py

class SD3config:
    def __init__(self, config_ini_path = './configs/config.ini'):
        # Load the ini file
        self.config_ini = configparser.ConfigParser()
        
        # Raise an error if the specified ini file does not exist
        if not os.path.exists(config_ini_path):
            raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), config_ini_path)
        
        self.config_ini.read(config_ini_path, encoding='utf-8')
        SD3_items = self.config_ini.items('SD3')
        self.SD3_config_dict = dict(SD3_items)

Here, the configuration file specified by config_ini_path = './configs/config.ini' is loaded as SD3_config_dict. Since it is loaded as a dictionary type, it becomes possible to read the contents of the configuration file as a Python dictionary.
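The behavior of this class can be sketched with a self-contained snippet (using read_string instead of a file on disk): every value comes back as a string, and configparser lower-cases the option names.

```python
import configparser

# Self-contained sketch of what SD3config does, using read_string
# instead of a file on disk.
config_text = """
[SD3]
device = auto
n_steps=28
use_text_encoder_3 = False
"""

parser = configparser.ConfigParser()
parser.read_string(config_text)
config_dict = dict(parser.items("SD3"))

print(config_dict["device"])   # auto
print(config_dict["n_steps"])  # 28  (note: a string, not an int)
```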

SD3 Class init Method

./colab_AI_sample/colab_SD3_sample/module/module_sd3.py

class SD3:
    def __init__(self,device = None, config_ini_path = './configs/config.ini'):
        
        SD3_config = SD3config(config_ini_path = config_ini_path)
        config_dict = SD3_config.SD3_config_dict


        if device is not None:
            self.device = device
        else:
            device = config_dict["device"]

            self.device = "cuda" if torch.cuda.is_available() else "cpu"
            if device != "auto":
                self.device = device
                
        self.n_steps = int(config_dict["n_steps"])
        self.seed = int(config_dict["seed"])
        self.generator = torch.Generator(device=self.device).manual_seed(self.seed)
        self.width = int(config_dict["width"])
        self.height = int(config_dict["height"])
        self.guided_scale = float(config_dict["guided_scale"])
        self.shift = float(config_dict["shift"])
        
        self.use_cpu_offload = config_dict["use_cpu_offload"]
        if self.use_cpu_offload == "True":
            self.use_cpu_offload = True
        else:
            self.use_cpu_offload = False
            
        self.use_text_encoder_3 = config_dict["use_text_encoder_3"]
        if self.use_text_encoder_3 == "True":
            self.use_text_encoder_3 = True
        else:
            self.use_text_encoder_3 = False
            
        self.use_T5_quantization = config_dict["use_t5_quantization"]
        if self.use_T5_quantization == "True":
            self.use_T5_quantization = True
        else:
            self.use_T5_quantization = False
        
        self.use_model_compile = config_dict["use_model_compile"]
        if self.use_model_compile == "True":
            self.use_model_compile = True
        else:
            self.use_model_compile = False
        
        self.save_latent = config_dict["save_latent"]
        if self.save_latent == "True":
            self.save_latent = True
        else:
            self.save_latent = False
            
        self.model_path = config_dict["model_path"]
        
    
        if self.use_model_compile:
            self.pipe  = self.preprepare_compile_model()
        else:
            self.pipe  = self.preprepare_model()

First, the contents of the configuration file are stored in config_dict. Since this is a dictionary type, it is possible to retrieve the contents of the configuration file as strings in the form of config_dict["device"]. Note that since everything is retrieved as a string, you need to change the type accordingly if you want it to be an int or bool type.
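As an aside, the repeated comparisons against the string "True" could also be replaced by configparser's built-in getboolean(), which accepts True/False, yes/no, on/off, and 1/0 (case-insensitively) and returns a real bool. A minimal sketch (not the author's code):

```python
import configparser

# Sketch of configparser's built-in boolean parsing as an alternative to
# comparing each flag against the string "True" by hand.
config_text = """
[SD3]
use_cpu_offload = False
use_text_encoder_3 = True
"""

parser = configparser.ConfigParser()
parser.read_string(config_text)

use_cpu_offload = parser.getboolean("SD3", "use_cpu_offload")
use_text_encoder_3 = parser.getboolean("SD3", "use_text_encoder_3")
print(use_cpu_offload, use_text_encoder_3)  # False True
```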

Next, the following steps are performed:

  • Specify the device to run the model.
  • Retrieve various settings from the configuration file.
  • Define the model.
    • Define the appropriate model according to the configuration file.
    • Define it using either the self.preprepare_model() method or the self.preprepare_compile_model() method based on the use_model_compile setting.

SD3 Class preprepare_model Method

./colab_AI_sample/colab_SD3_sample/module/module_sd3.py
class SD3:
    ...
    def preprepare_model(self):
        
        pipe = None
        
        sampler = FlowMatchEulerDiscreteScheduler(
                    shift = self.shift
                    )

        
        if self.use_text_encoder_3:
            if self.use_T5_quantization:
                quantization_config = BitsAndBytesConfig(load_in_8bit=True)

                text_encoder = T5EncoderModel.from_pretrained(
                    self.model_path,
                    subfolder="text_encoder_3",
                    quantization_config=quantization_config,
                )
                pipe = StableDiffusion3Pipeline.from_pretrained(
                    self.model_path,
                    scheduler = sampler,
                    text_encoder_3=text_encoder,
                    device_map="balanced",
                    torch_dtype=torch.float16
                )
                
            else:
                pipe = StableDiffusion3Pipeline.from_pretrained(self.model_path, scheduler = sampler, torch_dtype=torch.float16)
        else:
            pipe = StableDiffusion3Pipeline.from_pretrained(
                        self.model_path,
                        scheduler = sampler,
                        text_encoder_3=None,
                        tokenizer_3=None,
                        torch_dtype=torch.float16)
        
        
        if self.use_T5_quantization:
            pass
        elif self.use_cpu_offload:
            pipe.enable_model_cpu_offload()
        else:
            pipe = pipe.to("cuda")
            
        print(pipe.scheduler.config)

        return pipe

This method is called by the init method to define and load the model. It is used when use_model_compile = False, in other words, when model compilation is not performed.

It configures the Scheduler, Text Encoder, and StableDiffusion3Pipeline according to the configuration file and loads the model onto the necessary device.

As mentioned earlier in the article (explanation for Cell 4 of the ipynb),

print(pipe.scheduler.config)

allows you to check the Scheduler settings.

The implementation is basically based on the following article:
https://huggingface.co/blog/sd3

SD3 Class preprepare_compile_model Method

./colab_AI_sample/colab_SD3_sample/module/module_sd3.py
class SD3:
    ...
    def preprepare_compile_model(self):

        torch.set_float32_matmul_precision("high")

        torch._inductor.config.conv_1x1_as_mm = True
        torch._inductor.config.coordinate_descent_tuning = True
        torch._inductor.config.epilogue_fusion = False
        torch._inductor.config.coordinate_descent_check_all_directions = True
        
        pipe = None
        sampler = FlowMatchEulerDiscreteScheduler(
                    shift = self.shift
                    )
        
        if self.use_text_encoder_3:
            if self.use_T5_quantization:
                quantization_config = BitsAndBytesConfig(load_in_8bit=True)

                text_encoder = T5EncoderModel.from_pretrained(
                    self.model_path,
                    subfolder="text_encoder_3",
                    quantization_config=quantization_config,
                )
                pipe = StableDiffusion3Pipeline.from_pretrained(
                    self.model_path,
                    scheduler = sampler,
                    text_encoder_3=text_encoder,
                    device_map="balanced",
                    torch_dtype=torch.float16
                )
                
            else:
                pipe = StableDiffusion3Pipeline.from_pretrained(self.model_path, scheduler = sampler, torch_dtype=torch.float16)
        else:
            pipe = StableDiffusion3Pipeline.from_pretrained(
                        self.model_path,
                        scheduler = sampler,
                        text_encoder_3=None,
                        tokenizer_3=None,
                        torch_dtype=torch.float16)
        
        
        if self.use_T5_quantization:
            pass
        elif self.use_cpu_offload:
            pipe.enable_model_cpu_offload()
        else:
            pipe = pipe.to("cuda")
            
        print(pipe.scheduler.config)
            
        pipe.set_progress_bar_config(disable=True)
        pipe.transformer.to(memory_format=torch.channels_last)
        pipe.vae.to(memory_format=torch.channels_last)
        
        pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
        pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=True)

        return pipe

This method is called from the init method when performing model compilation.
Compilation cannot be combined with quantization or CPU offloading, so ideally this method should refuse those configurations. However, I only found out through the experiments described later that compilation fails in those cases, so at this point it is implemented identically to the non-compiled method.
(When implementing this yourself, please make sure compilation is not attempted while offloading or quantization is enabled, as it will raise an error.)
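When adapting this yourself, a small guard along these lines (a hypothetical helper, not part of the repository code) keeps compilation from being attempted in configurations where it will fail:

```python
def can_compile(use_t5_quantization: bool, use_cpu_offload: bool) -> bool:
    # torch.compile needs every module resident on the GPU; quantized T5
    # (loaded with device_map="balanced") and CPU offloading both leave
    # modules on other devices, so compiling would raise a RuntimeError.
    return not (use_t5_quantization or use_cpu_offload)

print(can_compile(False, False))  # True
```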

As usual, I've implemented this with reference to the following article:
https://huggingface.co/blog/sd3

SD3 Class generate_image Method

./colab_AI_sample/colab_SD3_sample/module/module_sd3.py
class SD3:
    ...
    def generate_image(self, prompt, prompt_2 = None, prompt_3 = None, neg_prompt = "", neg_prompt_2 = None, neg_prompt_3 = None,seed = None):
        
        if seed is not None:
            self.generator = torch.Generator(device=self.device).manual_seed(seed)
            
        if prompt_2 is None:
            prompt_2 = prompt
        if prompt_3 is None:
            prompt_3 = prompt
        if neg_prompt_2 is None:
            neg_prompt_2 = neg_prompt
        if neg_prompt_3 is None:
            neg_prompt_3 = neg_prompt
            
        image = None

        image = self.pipe(
                prompt=prompt, 
                prompt_2=prompt_2, 
                prompt_3=prompt_3, 
                negative_prompt=neg_prompt,
                negative_prompt_2 = neg_prompt_2,
                negative_prompt_3 = neg_prompt_3,
                height = self.height,
                width = self.width,
                num_inference_steps=self.n_steps,
                guidance_scale=self.guided_scale,
                generator=self.generator
                ).images[0]
        
        
        return image

This method actually generates an image using the model and settings loaded so far. This model has three Text_Encoders. If only one prompt is specified, the same prompt is input into all encoders, but it is also possible to set different prompts for each encoder individually.
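That fallback logic, extracted as a stand-alone sketch (resolve_prompts is a hypothetical name, not part of the repository code):

```python
def resolve_prompts(prompt, prompt_2=None, prompt_3=None):
    # Encoders without their own prompt fall back to the primary one.
    if prompt_2 is None:
        prompt_2 = prompt
    if prompt_3 is None:
        prompt_3 = prompt
    return prompt, prompt_2, prompt_3

print(resolve_prompts("a cat", prompt_3="a photorealistic cat"))
# ('a cat', 'a cat', 'a photorealistic cat')
```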

Specifying a seed in this method's arguments overrides the seed from the configuration file, so passing a random number here lets you generate a different image each time.

Additionally, if a seed is specified in the argument:

self.generator = torch.Generator(device=self.device).manual_seed(seed)

is called, recreating the generator with that seed. If no seed is passed, the generator created in the init method from the configuration file's seed is reused; since its internal state advances with each generation, repeated calls still produce different images.
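Python's random module can serve as a stand-in for torch.Generator to illustrate how a single seeded generator yields different values across calls, while reseeding restores reproducibility:

```python
import random

g = random.Random(42)   # seeded once, like the generator created in __init__
a = g.random()
b = g.random()          # internal state has advanced, so b differs from a

h = random.Random(42)   # reseeding reproduces the first draw exactly
print(a != b, h.random() == a)  # True True
```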

As always, I've implemented this referring to the following article:
https://huggingface.co/blog/sd3

Experimental Results

From here on, I will describe the details of various experiments conducted on Google Colab by changing parameters using the code above.

Experiment 1

First, I conducted an experiment using the default settings in the Dropping the T5 Text Encoder during Inference section mentioned in this article.
(I believe this is the lightest setting that uses the least amount of VRAM.)

Settings

[SD3]
device = auto
n_steps=28
seed=42
shift = 3.0

model_path = stabilityai/stable-diffusion-3-medium-diffusers

guided_scale = 7.0
width = 1024
height = 1024

use_cpu_offload = False
use_text_encoder_3 = False
use_t5_quantization = False
use_model_compile = False
save_latent = False

Results

Execution time

generate image time:  35.947511196136475
generate image time:  34.840919733047485
generate image time:  33.75922894477844

Since it was run on Google Colaboratory, the execution time fluctuates with each run, so please take this as a reference, but it felt like it could generate images in about 30-40 seconds.

The generated images are shown below. Honestly, I thought the quality of images generated by SDXL was higher.



However, for some reason, the images also seem to reflect the latter part of the prompt, which should have been truncated by the CLIP encoders' 77-token limit.

For example, the following part of the prompt:

The background features out-of-focus branches dotted with vibrant pink blossoms, set against a clear blue sky speckled with snow and bokeh, creating a dreamlike, serene atmosphere. 

Information from this part also seems to be reflected in the generated images. I wonder why...
If anyone knows, I would be happy if you could tell me.

Experiment 2

Next, I would like to change the guided_scale from 7, used in the article, to the range of 3.5-4.5 recommended by others.
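For context, guidance_scale is the classifier-free guidance weight. Conceptually, each denoising step combines an unconditional and a text-conditioned prediction (a sketch of the idea, not the library's exact implementation):

```python
def apply_cfg(uncond_pred, cond_pred, guidance_scale):
    # Classifier-free guidance: extrapolate from the unconditional
    # prediction toward the text-conditioned one; larger scales follow
    # the prompt more strongly at the cost of naturalness.
    return uncond_pred + guidance_scale * (cond_pred - uncond_pred)

print(apply_cfg(0.0, 1.0, 4.0))  # 4.0
print(apply_cfg(0.0, 1.0, 1.0))  # 1.0 (scale 1.0 keeps only the conditional prediction)
```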

I conducted the experiment with:

guided_scale = 4.0
shift = 3.0

Settings

[SD3]
device = auto
n_steps=28
seed=42
shift = 3.0

model_path = stabilityai/stable-diffusion-3-medium-diffusers

guided_scale = 4.0
width = 1024
height = 1024

use_cpu_offload = False
use_text_encoder_3 = False
use_t5_quantization = False
use_model_compile = False
save_latent = False

Results

Execution time

generate image time:  36.00275254249573
generate image time:  34.94155263900757
generate image time:  33.83055567741394

The generated images are shown below.



Since this seems to have better image quality, I will conduct subsequent experiments with guided_scale = 4.0.

Experiment 3

Next, I would like to check how the images change with variations in the scheduler's shift value.
While using a value around 3.0 is recommended by many, I would like to see what happens when using 1.0, which is the default specified in the original FlowMatchEulerDiscreteScheduler.
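For reference, as far as I can tell from the diffusers source, shift remaps each sigma in the noise schedule as follows (shift = 1.0 leaves the schedule unchanged):

```python
def shift_sigma(sigma: float, shift: float) -> float:
    # Timestep shifting as used by FlowMatchEulerDiscreteScheduler:
    # larger shift values keep intermediate steps at higher noise levels.
    return shift * sigma / (1 + (shift - 1) * sigma)

print(shift_sigma(0.5, 1.0))  # 0.5 (unchanged)
print(shift_sigma(0.5, 3.0))  # 0.75 (pushed toward the noisy end)
```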

I conducted the experiment with:

guided_scale = 4.0
shift = 1.0

Settings

[SD3]
device = auto
n_steps=28
seed=42
shift = 1.0

model_path = stabilityai/stable-diffusion-3-medium-diffusers

guided_scale = 4.0
width = 1024
height = 1024

use_cpu_offload = False
use_text_encoder_3 = False
use_t5_quantization = False
use_model_compile = False
save_latent = False

Results

Execution time

generate image time:  35.75459671020508
generate image time:  34.87065076828003
generate image time:  33.78454518318176

The generated images are shown below.



Surprisingly, there were times when shift=1.0 resulted in better image quality.
It might be best to change this through trial and error according to the image being generated.

Experiment 3.5

Furthermore, I would like to continue checking the image changes based on variations in the scheduler's shift value.
I set it to 6.0, which is said to receive high marks in human evaluations.

I conducted the experiment with:

guided_scale = 4.0
shift = 6.0

Settings

[SD3]
device = auto
n_steps=28
seed=42
shift = 6.0

model_path = stabilityai/stable-diffusion-3-medium-diffusers

guided_scale = 4.0
width = 1024
height = 1024

use_cpu_offload = False
use_text_encoder_3 = False
use_t5_quantization = False
use_model_compile = False
save_latent = False

Results

Execution time

generate image time:  33.36610460281372
generate image time:  32.84304141998291
generate image time:  32.25549674034119

The generated images are shown below.



It might seem like the image quality became slightly sharper, but I didn't feel there was a very significant change.
That said, since the generated images do change slightly, I'd like to continue experimenting with these settings through trial and error.

After trying various values, I didn't see a huge difference in image quality, so I will use the recommended 3.0 for subsequent experiments.

Experiment 4

Next, I would like to try model compile.
The article mentioned above states that it can improve inference latency, so we can expect a reduction in execution time.

I conducted the experiment with:

guided_scale = 4.0
shift = 3.0
use_model_compile = True

Settings

[SD3]
device = auto
n_steps=28
seed=42
shift = 3.0

model_path = stabilityai/stable-diffusion-3-medium-diffusers

guided_scale = 4.0
width = 1024
height = 1024

use_cpu_offload = False
use_text_encoder_3 = False
use_t5_quantization = False
use_model_compile = True
save_latent = False

Results

Execution time

generate image time:  654.5856740474701
generate image time:  31.83198046684265
generate image time:  29.641842126846313

The first run takes a very long time, which suggests compilation happens during the first generation; after that, the third run was the fastest of all the runs so far.

The generated images are shown below.


Checking the images, we can see that the generated images do not change just because they were compiled.
(There might be differences at the pixel level, but I think they look the same.)

I also tested how much the execution time changes when repeating the process many more times.
I ran it 20 times.

Execution time

generate image time:  126.06288695335388
generate image time:  31.938198804855347
generate image time:  30.529218196868896
generate image time:  29.582070112228394
generate image time:  30.36764097213745
generate image time:  30.32581877708435
generate image time:  30.02663564682007
generate image time:  30.31864595413208
generate image time:  30.280086517333984
generate image time:  30.2621488571167
generate image time:  30.286171436309814
generate image time:  30.32650923728943
generate image time:  30.289822578430176
generate image time:  30.310792207717896
generate image time:  30.356089115142822
generate image time:  30.34636688232422
generate image time:  30.17577314376831
generate image time:  30.156757831573486
generate image time:  30.116050958633423
generate image time:  30.14829444885254

As you can see, the images are generated in approximately 30 seconds.
It seems to generate images about 5 seconds faster compared to the non-compiled pattern.

Experiment 5

Next, I will try introducing T5, a large-scale Text Encoder.

I conducted the experiment with:

guided_scale = 4.0
shift = 3.0
use_cpu_offload = True
use_text_encoder_3 = True

Using only use_text_encoder_3 = True results in an Out of Memory error on the free version of Google Colaboratory's GPU, so I am also using CPU offloading.

CPU offloading is a technique where, instead of keeping the entire model on the GPU, the weights normally reside in system RAM and only the modules needed for the current computation are moved into the GPU's VRAM. Because data is transferred in and out of VRAM for each computation, processing time increases.

Settings

[SD3]
device = auto
n_steps=28
seed=42
shift = 3.0

model_path = stabilityai/stable-diffusion-3-medium-diffusers

guided_scale = 4.0
width = 1024
height = 1024

use_cpu_offload = True
use_text_encoder_3 = True
use_t5_quantization = False
use_model_compile = False
save_latent = False

Results

The T5 model was so large that even with CPU offloading, it exhausted the 12.7GB of system RAM on Google Colaboratory and crashed during the execution of the SD3 class's generate_image method, making it impossible to run.

Experiment 6

Since CPU offloading alone was insufficient due to a lack of computational resources, I will also try T5 quantization.

I conducted the experiment with:

guided_scale = 4.0
shift = 3.0
use_text_encoder_3 = True
use_t5_quantization = True

Using only use_text_encoder_3 = True results in an Out of Memory error on the free version of Google Colaboratory's GPU, so I am also using quantization.
Also, since this quantization uses the bitsandbytes library, the model is already stored on the appropriate device (meaning CPU offloading is already happening).
Therefore, I implemented it so that changing the use_cpu_offload setting has no effect in this case.

Settings

[SD3]
device = auto
n_steps=28
seed=42
shift = 3.0

model_path = stabilityai/stable-diffusion-3-medium-diffusers

guided_scale = 4.0
width = 1024
height = 1024

use_cpu_offload = False
use_text_encoder_3 = True
use_t5_quantization = True
use_model_compile = False
save_latent = False

Results

Execution time

generate image time:  74.37092518806458
generate image time:  60.2897834777832
generate image time:  60.057732582092285

Since I am using a large-scale Text Encoder, T5 (quantized version), the execution time is longer compared to the previous experiments where T5 was dropped.

The generated images are shown below.


It seems that having T5 might have improved the adherence to the prompt.
Somehow, the image quality feels higher as well...
I wonder if the Text Encoder was trained with weights that also affect image quality.
If anyone is knowledgeable about this, please let me know.

Experiment 7

Since introducing T5 increases execution time, I will try compile as well, hoping for a reduction in execution time.

I conducted the experiment with:

guided_scale = 4.0
shift = 3.0
use_text_encoder_3 = True
use_t5_quantization = True
use_model_compile = True

In addition to T5 quantization, I will run compilation. I'm eager to see how much the processing time will be shortened.

Settings

[SD3]
device = auto
n_steps=28
seed=42
shift = 3.0

model_path = stabilityai/stable-diffusion-3-medium-diffusers

guided_scale = 4.0
width = 1024
height = 1024

use_cpu_offload = False
use_text_encoder_3 = True
use_t5_quantization = True
use_model_compile = True
save_latent = False

Results

Unfortunately, the following error occurred:

--> 197         pipe.transformer.to(memory_format=torch.channels_last)
    198         pipe.vae.to(memory_format=torch.channels_last)
    199 

/usr/local/lib/python3.10/dist-packages/accelerate/big_modeling.py in wrapper(*args, **kwargs)
    453                 for param in model.parameters():
    454                     if param.device == torch.device("meta"):
--> 455                         raise RuntimeError("You can't move a model that has some modules offloaded to cpu or disk.")
    456                 return fn(*args, **kwargs)
    457 

RuntimeError: You can't move a model that has some modules offloaded to cpu or disk.

As you can see, it seems that if the model is quantized, the data is also offloaded, so compilation cannot be performed.
Therefore, I found that T5 quantization and compilation cannot coexist.

Of course, if you use T5 without quantization, compilation is also possible, but please understand that I cannot experiment with that as I do not have a GPU capable of running it.

Experiment 8

Next, let's change our approach and look at the images in the middle of generation at each step.

However, as SD3 does not apply the diffusion process to the image itself, but rather to the latent representation of the VAE, only the latent representation during generation is obtained at each step.

Since the SD3 latent representation has a shape of (1,16,128,128), it cannot be displayed as an image as is.
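This shape follows from the VAE's 8x spatial downsampling and SD3's 16-channel latent space (as I understand the model's configuration):

```python
width, height = 1024, 1024
vae_scale_factor = 8      # each spatial dimension is downsampled 8x by the VAE
latent_channels = 16      # SD3 uses a 16-channel latent space
latent_shape = (1, latent_channels, height // vae_scale_factor, width // vae_scale_factor)
print(latent_shape)  # (1, 16, 128, 128)
```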

Therefore, we will try to display what is restored by inputting the obtained latent representations (including intermediate ones with noise) into the SD3 VAE Decoder.

To do this, we modify the generate_image method of the SD3 class in module/module_sd3.py as follows.
(Please read the explanation only if you are interested.)

Explanation of the changes

The changes are as follows.
(The code stored on GitHub is already modified, so no action is required.)

./colab_AI_sample/colab_SD3_sample/module/module_sd3.py
class SD3:
    ...
    def generate_image(self, prompt, prompt_2 = None, prompt_3 = None, neg_prompt = "", neg_prompt_2 = None, neg_prompt_3 = None,seed = None):
        
        def decode_tensors(pipe, step, timestep, callback_kwargs):
            latents = callback_kwargs["latents"]        
            image = latents_to_rgb(latents,pipe)
            gettime = time.time()
            formatted_time_human_readable = time.strftime("%Y%m%d_%H%M%S", time.localtime(gettime))
            image.save(f"./outputs/latent_{formatted_time_human_readable}_{step}_{timestep}.png")
            return callback_kwargs
            
        def latents_to_rgb(latents,pipe):
            latents = (latents / pipe.vae.config.scaling_factor) + pipe.vae.config.shift_factor
            img = pipe.vae.decode(latents, return_dict=False)[0]
            img = pipe.image_processor.postprocess(img, output_type="pil")
            return StableDiffusion3PipelineOutput(images=img).images[0]
        
        if seed is not None:
            self.generator = torch.Generator(device=self.device).manual_seed(seed)
            
        if prompt_2 is None:
            prompt_2 = prompt
        if prompt_3 is None:
            prompt_3 = prompt
        if neg_prompt_2 is None:
            neg_prompt_2 = neg_prompt
        if neg_prompt_3 is None:
            neg_prompt_3 = neg_prompt
            
        image = None
        if self.save_latent:
            image = self.pipe(
                prompt=prompt, 
                prompt_2=prompt_2, 
                prompt_3=prompt_3, 
                negative_prompt=neg_prompt,
                negative_prompt_2 = neg_prompt_2,
                negative_prompt_3 = neg_prompt_3,
                height = self.height,
                width = self.width,
                num_inference_steps=self.n_steps,
                guidance_scale=self.guided_scale,
                generator=self.generator,
                callback_on_step_end=decode_tensors,
                callback_on_step_end_tensor_inputs=["latents"],
                ).images[0]
        else:
            image = self.pipe(
                prompt=prompt, 
                prompt_2=prompt_2, 
                prompt_3=prompt_3, 
                negative_prompt=neg_prompt,
                negative_prompt_2 = neg_prompt_2,
                negative_prompt_3 = neg_prompt_3,
                height = self.height,
                width = self.width,
                num_inference_steps=self.n_steps,
                guidance_scale=self.guided_scale,
                generator=self.generator
                ).images[0]
        
        
        return image
        

When calling pipe, we add callback_on_step_end and callback_on_step_end_tensor_inputs as new arguments.
callback_on_step_end is an argument that allows you to specify a callback to be called at the end of each step. The callback function specified by this argument is called with the information specified by callback_on_step_end_tensor_inputs as arguments.

For information that can be retrieved from the SD3 model for the callback function, please see the code below.
https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3.py#L30
I believe it's possible to obtain latents, prompt_embeds, negative_prompt_embeds, and negative_pooled_prompt_embeds stored in callback_outputs described in the __call__ method of the above code.

The callback function specified here is decode_tensors. Within it, the acquired latents (i.e., the latent representations during the diffusion process) are converted to images and saved.
(The save destination is hardcoded.)

We define latents_to_rgb as a function to convert latent representations into images.
Here, we retrieve the decode method of the SD3 VAE and convert it to an image.
(The code is written with reference to pipeline_stable_diffusion_3.py mentioned above.)
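The division and shift there invert the normalization applied when an image is encoded into latent space. Using the scaling_factor and shift_factor values from the SD3 VAE config (1.5305 and 0.0609, as I understand them), the round trip looks like:

```python
SCALING_FACTOR = 1.5305  # pipe.vae.config.scaling_factor for SD3 (as I understand it)
SHIFT_FACTOR = 0.0609    # pipe.vae.config.shift_factor

def normalize(latent):
    # Applied when encoding an image into the latent space.
    return (latent - SHIFT_FACTOR) * SCALING_FACTOR

def denormalize(latent):
    # Inverse transform, as done in latents_to_rgb before VAE decoding.
    return latent / SCALING_FACTOR + SHIFT_FACTOR

x = 0.25
print(abs(denormalize(normalize(x)) - x) < 1e-9)  # True
```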

This process is called as a callback function and executed at the end of each diffusion process.

I conducted the experiment with:

save_latent = True

(By using this setting, you can reconstruct and save images from the latent representations during the generation process.)

Settings

[SD3]
device = auto
n_steps=28
seed=42
shift = 3.0

model_path = stabilityai/stable-diffusion-3-medium-diffusers

guided_scale = 4.0
width = 1024
height = 1024

use_cpu_offload = False
use_text_encoder_3 = False
use_t5_quantization = False
use_model_compile = False
save_latent = True

Results

Reconstructed images from the latent representations during generation using VAE are shown below.
Since there are many images, I am displaying a GIF of them arranged in the order of generation.

Experiment 9

Next, I'll look at the changes in images based on the number of steps.
Since SD3 is trained with a rectified-flow objective, which encourages nearly straight (close to linear) ODE trajectories, I thought it might still generate images without issues even with a reduced step count. (I'm not entirely sure about this, so please correct me if my understanding is wrong.)

I conducted the experiment with:

n_steps=14

This is half of the recommended step count.

Settings

[SD3]
device = auto
n_steps=14
seed=42
shift = 3.0

model_path = stabilityai/stable-diffusion-3-medium-diffusers

guided_scale = 4.0
width = 1024
height = 1024

use_cpu_offload = False
use_text_encoder_3 = False
use_t5_quantization = False
use_model_compile = False
save_latent = False

Results

Execution time (14 steps)

generate image time:  17.692556381225586
generate image time:  17.215643167495728
generate image time:  16.85261869430542

Since the number of steps was halved, the generation time was also roughly halved.

The generated images are shown below. I think the images are quite beautiful even with 14 steps.



I especially like the second one. I felt that the representation of the hands was slightly less precise compared to the 28-step version.

Experiment 10

Next, I'll try 7 steps.
I conducted the experiment with:

n_steps=7

Settings

[SD3]
device = auto
n_steps=7
seed=42
shift = 3.0

model_path = stabilityai/stable-diffusion-3-medium-diffusers

guided_scale = 4.0
width = 1024
height = 1024

use_cpu_offload = False
use_text_encoder_3 = False
use_t5_quantization = False
use_model_compile = False
save_latent = False

Results

Execution time (7 steps)

generate image time:  11.650048971176147
generate image time:  8.618378162384033
generate image time:  8.722979307174683

The generated images are shown below.



They started to feel a bit blurry, but I think they are still beautiful images. I'm a bit surprised that 7 steps can result in such good quality.

Experiment 11

Next is 4 steps. This is the number of steps used in SDXL Turbo.

I conducted the experiment with:

n_steps=4

Settings

[SD3]
device = auto
n_steps=4
seed=42
shift = 3.0

model_path = stabilityai/stable-diffusion-3-medium-diffusers

guided_scale = 4.0
width = 1024
height = 1024

use_cpu_offload = False
use_text_encoder_3 = False
use_t5_quantization = False
use_model_compile = False
save_latent = False

Results

Execution time (4 steps)

generate image time:  7.127336740493774
generate image time:  5.429738998413086
generate image time:  5.420844078063965

The processing time has become considerably shorter. The generated images are shown below.



At 4 steps, the quality is falling apart quite a bit. The second one barely maintains the face, but the background is completely blurred, and the other images are blurry overall.

Experiment 12

I know it won't work well, but since it's the last one...
I conducted the experiment with:

n_steps=1

Settings

[SD3]
device = auto
n_steps=1
seed=42
shift = 3.0

model_path = stabilityai/stable-diffusion-3-medium-diffusers

guided_scale = 4.0
width = 1024
height = 1024

use_cpu_offload = False
use_text_encoder_3 = False
use_t5_quantization = False
use_model_compile = False
save_latent = False

Results

Execution time (1 step)

generate image time:  3.160846710205078
generate image time:  2.0624654293060303
generate image time:  2.028672933578491

The generated images are shown below.



Unfortunately, with 1 step, the results are mostly noise-like images. It can't be helped...

Experiment 13

Next, I'll try increasing the number of steps from the standard 28. First, I'll increase it just a little.

I conducted the experiment with:

n_steps=35

Settings

[SD3]
device = auto
n_steps=35
seed=42
shift = 3.0

model_path = stabilityai/stable-diffusion-3-medium-diffusers

guided_scale = 4.0
width = 1024
height = 1024

use_cpu_offload = False
use_text_encoder_3 = False
use_t5_quantization = False
use_model_compile = False
save_latent = False

Results

Execution time (35 steps)

generate image time: 41.11394023895264
generate image time: 42.15063142776489
generate image time: 43.37374567985535

The generated images are shown below.



Although subtle, the resolution feels slightly higher compared to Experiment 2. It's a matter of whether you accept the approximately 10-second increase in processing time for this improvement in resolution.

Experiment 14

Next, I conducted the experiment with:

n_steps=50

This is the standard step count for SDXL.

Settings

[SD3]
device = auto
n_steps=50
seed=42
shift = 3.0

model_path = stabilityai/stable-diffusion-3-medium-diffusers

guided_scale = 4.0
width = 1024
height = 1024

use_cpu_offload = False
use_text_encoder_3 = False
use_t5_quantization = False
use_model_compile = False
save_latent = False

Results

Execution time (50 steps)

generate image time: 61.70908522605896
generate image time: 60.577409982681274
generate image time: 60.86574172973633

The generated images are shown below.



Hmm? It feels like the quality of the generated images might have actually decreased at 50 steps... or maybe not? It's a bit hard to judge.

Experiment 15

Finally, I conducted the experiment with:

n_steps=80

I just picked a random number here.

Settings

[SD3]
device = auto
n_steps=80
seed=42
shift = 3.0

model_path = stabilityai/stable-diffusion-3-medium-diffusers

guided_scale = 4.0
width = 1024
height = 1024

use_cpu_offload = False
use_text_encoder_3 = False
use_t5_quantization = False
use_model_compile = False
save_latent = False

Results

Execution time (80 steps)

generate image time: 97.74183487892151
generate image time: 96.70283508300781
generate image time: 96.58707213401794

The execution time increased significantly.

The generated images are shown below.



As a result, it turned out to be almost the same as the 50-step version. At least, I cannot tell the difference with my eyes.

Summary

This time, we created an environment to easily use generative AI on Google Colaboratory for beginners.

In Part 8, we enabled the use of Stable Diffusion 3 Medium, one of the generative AI models for image generation.

Since it was my first time using this model, I conducted various experiments.

What I learned through these experiments is:

  • Use a shift of 3.0 and a guided_scale of 4.0.
  • It is better to use T5 whenever possible, as it results in higher image quality.
  • If you want to prioritize processing time, you can probably reduce the step count to around 7-14 while still maintaining decent image quality.

Please let me know if there are any errors in the descriptions. Thank you!

In the next part, I'd like to try out ControlNet with SD3!
