
Simple Generative AI on Google Colab [Part 10: SDXL + ANIMAGINE + ControlNet + LoRA (All-in-One)]


Introduction

In this article, we will build an environment that lets beginners use generative AI easily on Google Colaboratory.

This Part 10 will likely be the last installment on image generation AI.
We will try the highest-quality setup possible using the methods explained so far. (It is the "full set" with everything included.)
(SD3 would normally be the highest quality, but for the anime illustrations I like, fine-tuned models specialized on SDXL are "still" higher quality than the SD3 base model at this point, so we will use this set.)

For an overview of the technologies used, please refer to:
Stable Diffusion XL (SDXL): Part 6 article
ANIMAGINE XL 3.1: Part 6.5 article
ControlNet: Part 9 article
(As for other articles, you don't necessarily need to read them to follow this one. I will write this post so that it can be understood even if you haven't read the articles above.)

Regarding LoRA, I have provided a brief description below.

What is LoRA?

LoRA (Low-Rank Adaptation) is a technique that allows for high-quality customization of models with minimal training resources.

LoRA is a technology for model fine-tuning.
"ANIMAGINE XL 3.1," which is also used in this article, is a model created through fine-tuning. However, that is what we call "full-parameter tuning": all weights in the model are made trainable, and the model is trained on a massive number of captioned images using large-scale computing resources.

On the other hand, LoRA does not involve training the model weights themselves. Instead, small trainable modules are connected in parallel to the model weights. While the original model weights are fixed, only the weights of the additionally connected modules are made trainable, allowing for learning from a small number of images.

By doing this, you can add features to the original model, such as adding a specific character or converting the art style to a particular manga or anime style.
Furthermore, because it is something added to the base model, it is a highly flexible technology that can be used not only with the default SDXL base model but also with other fine-tuned models like "ANIMAGINE XL 3.1."
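The parallel-branch idea above can be sketched in a few lines of NumPy. This is an illustrative toy, not the actual SDXL implementation; the dimensions and variable names are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                       # model dimension and LoRA rank (r << d)
W = rng.normal(size=(d, d))       # frozen base weight (not trained)
A = rng.normal(size=(r, d))       # trainable down-projection
B = np.zeros((d, r))              # trainable up-projection, zero-initialized
                                  # so the LoRA branch has no effect at start

x = rng.normal(size=d)
y = W @ x + B @ (A @ x)           # base output + parallel LoRA branch

# Only A and B are trained: far fewer parameters than W itself.
print(A.size + B.size, "trainable vs", W.size, "frozen")
```

Because B starts at zero, training begins from the unmodified base model and gradually learns the added behavior, which is why only a small number of images is needed.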

Various LoRAs have already been created. In this article, I will try just one LoRA that I liked intuitively, but other LoRAs are also usable.

The model used this time is below:
https://civitai.com/models/255143/dreamyvibes-artsyle-sdxl-lora

Of course, since compatibility with the model matters, there may be LoRA models that do not produce effective results. Please try them through trial and error.

Deliverables

Please see the repository below.
https://github.com/personabb/colab_AI_sample/tree/main/colab_SDXLControlNet_sample

This Experiment

The details of the experiments conducted are described below. The experimental results are introduced at the end.

  • Experiment 1
    • Experiment without LoRA and without ControlNet
  • Experiment 2
    • Without LoRA
    • Experiments with various ControlNet modes
      • OpenPose
      • OpenPose Face
      • OpenPose Face only
      • OpenPose Full
      • Canny
      • Depth
      • Zoe Depth
      • Tile
  • Experiment 3
    • Introduction of LoRA
    • Experiment without ControlNet
  • Experiment 4
    • Introduction of LoRA
    • Experiment with ControlNet Zoe Depth

Preparation

Saving the LoRA Model to Use

In this experiment, we will use the following LoRA model:
https://civitai.com/models/255143/dreamyvibes-artsyle-sdxl-lora

Feel free to search for and choose any model you like.

There are two important points:

  1. Download this model and place it in the "inputs" folder within the directory structure described later.
  2. Add the "Trigger Words" for this LoRA model to the prompt.

Both can be obtained from the page linked above.

Download it via the button shown below on that page:

Copy the "Trigger Words" using the copy button in the section shown below.
While the text appears in uppercase on the page, it will be copied as something like "Dreamyvibes Artstyle", so please use that.

Downloading the Reference Image

Obtain the image you want to use for ControlNet input and place it in the "inputs" folder within the directory structure described later.

If you want to use the same image as in this article, save the image, name it "refer.webp", and place it in the "inputs" folder.

Explanation

The following section provides the explanation.
First, clone the repository mentioned above.

./
git clone https://github.com/personabb/colab_AI_sample.git

After cloning, place the "colab_AI_sample" folder in a suitable location on your My Drive.

Directory Structure

The following directory structure is assumed on Google Drive:

MyDrive/
    └ colab_AI_sample/
          └ colab_SDXLControlNet_sample/
                  ├ configs/
                  |    └ config.ini
                  ├ inputs/
                  |    | refer.webp
                  |    └ DreamyvibesartstyleSDXL.safetensors
                  ├ outputs/
                  ├ module/
                  |    └ module_sdc.py
                  └ SDXLControlNet_sample.ipynb

  • The colab_AI_sample folder name is arbitrary. It can be anything. It doesn't have to be at the top level and can be multiple levels deep like this:
    • MyDrive/hogehoge/spamspam/hogespam/colab_AI_sample
  • The outputs folder stores the generated images. It is empty initially.
    • Since continuous generation will overwrite previous results, it is recommended to download them or rename the folder/files.
  • The inputs folder contains the reference images used for ControlNet. Details are provided later.
    • Additionally, place the LoRA model you downloaded earlier here.
      • I renamed it because I felt uncomfortable with spaces in the filename.
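If the overwriting mentioned above bothers you, one option is to timestamp the output filenames instead of reusing the same names on every run. A minimal helper, hypothetical and not part of the repository:

```python
import time

def timestamped_name(prefix="SDXLC_result", index=0, ext="png"):
    # e.g. SDXLC_result_20240101_120000_0.png: a new run never
    # collides with files written by an earlier run
    stamp = time.strftime("%Y%m%d_%H%M%S", time.localtime())
    return f"{prefix}_{stamp}_{index}.{ext}"

name = timestamped_name(index=2)
print(name)
```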

How to Use

Open SDXLControlNet_sample.ipynb with the Google Colaboratory app.
Right-click the file to find the "Open with" option, and select the Google Colaboratory app from there.

If it is not available, go to the app store via "Connect more apps," search for "Google Colaboratory," and install it.

After opening it in the Google Colaboratory app, refer to the notes in SDXLControlNet_sample.ipynb and run the cells in order from the top. It should work correctly through to the end, allowing you to generate images.

Additionally, if you want to change parameters and run it again after completion, click "Runtime" -> "Restart session and run all."

Code Explanation

I will mainly explain the key files: SDXLControlNet_sample.ipynb and module/module_sdc.py.

SDXLControlNet_sample.ipynb

The corresponding code can be found at the following link:
https://github.com/personabb/colab_AI_sample/blob/main/colab_SDXLControlNet_sample/SDXLControlNet_sample.ipynb

Below is a cell-by-cell explanation.

First Cell

./colab_AI_sample/colab_SDXLControlNet_sample/SDXLControlNet_sample.ipynb

# Installing necessary modules for SDXL
!pip install -U peft tensorflow-metadata diffusers transformers scikit-learn ftfy accelerate invisible_watermark safetensors controlnet-aux mediapipe timm

In this cell, we install the required modules.
Since basic deep learning packages like PyTorch are already pre-installed in Google Colab, you only need to install the ones listed above.
(Note: I am reusing past installation commands, so there is a high chance that some unnecessary modules are included... I included tensorflow-metadata because version errors occur if it is not present.)

Second Cell

./colab_AI_sample/colab_SDXLControlNet_sample/SDXLControlNet_sample.ipynb

# Mount Google Drive folder (requires authentication)
from google.colab import drive
drive.mount('/content/drive')

# Change the current directory to the directory where this file is located.
import glob
import os
pwd = os.path.dirname(glob.glob('/content/drive/MyDrive/colabzenn/**/colab_SDXLControlNet_sample/SDXLControlNet_sample.ipynb', recursive=True)[0])
print(pwd)

%cd $pwd
!pwd

Here, we mount the contents of your Google Drive.
By mounting, you gain the ability to read from and write to files stored within your Google Drive.

When mounting, you will need to authorize access through Colab. A popup will appear; please follow the instructions to grant the necessary permissions.

Subsequently, we change the current directory from / to /content/drive/MyDrive/**/colab_SDXLControlNet_sample.
(The ** is a wildcard representing any directory or multiple directories.)
While changing the current directory is not strictly required, it makes specifying folder paths much easier for the following steps.
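The recursive ** matching can be demonstrated with a throwaway directory tree (the paths here are made up for the demo):

```python
import glob
import os
import tempfile

root = tempfile.mkdtemp()
deep = os.path.join(root, "hogehoge", "spamspam", "colab_SDXLControlNet_sample")
os.makedirs(deep)
target = os.path.join(deep, "SDXLControlNet_sample.ipynb")
open(target, "w").close()

# "**" with recursive=True matches zero or more intermediate directories,
# so the notebook is found however deeply the folder is nested.
hits = glob.glob(os.path.join(root, "**", "SDXLControlNet_sample.ipynb"),
                 recursive=True)
print(os.path.dirname(hits[0]))
```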

Third Cell

./colab_AI_sample/colab_SDXLControlNet_sample/SDXLControlNet_sample.ipynb
# Import the module
from module.module_sdc import SDXLC
import time

Here, we import the SDXLC class from module/module_sdc.py as a module.
Details of its contents will be explained in a later chapter.

Additionally, the time module is imported to measure execution time.

Fourth Cell

./colab_AI_sample/colab_SDXLControlNet_sample/SDXLControlNet_sample.ipynb

# Configure the model settings.

# Since the SDXL model is loaded with variant = "fp16", use a model with "fp16" in its safetensor name.

config_text = """
[SDXLC]
device = auto
n_steps=28
high_noise_frac=None
seed=42

vae_model_path = None
base_model_path = Asahina2K/Animagine-xl-3.1-diffuser-variant-fp16
refiner_model_path = None

;controlnet_path = xinsir/controlnet-openpose-sdxl-1.0
;controlnet_path = diffusers/controlnet-canny-sdxl-1.0
controlnet_path = diffusers/controlnet-depth-sdxl-1.0
;controlnet_path = diffusers/controlnet-zoe-depth-sdxl-1.0
;controlnet_path = xinsir/controlnet-tile-sdxl-1.0

;control_mode = openpose
;control_mode = openpose_face
;control_mode = openpose_faceonly
;control_mode = openpose_full
;control_mode = canny
control_mode = depth
;control_mode = zoe_depth
;control_mode = tile

lora_weight_path = ./inputs/DreamyvibesartstyleSDXL.safetensors
lora_scale = 1.0

use_karras_sigmas = True
scheduler_algorithm_type = dpmsolver++
solver_order = 2

cfg_scale = 7.0
width = 832
height = 1216
output_type = pil
aesthetic_score = 6
negative_aesthetic_score = 2.5

save_latent_simple = False
save_latent_overstep = False
save_latent_approximation = False

"""

with open("configs/config.ini", "w", encoding="utf-8") as f:
  f.write(config_text)

In this cell, we configure the settings. We take advantage of the fact that lines starting with ";" are treated as comments to include frequently used ControlNet models alongside the active settings.
Note that while we are using the "Animagine-xl-3.1" model this time, I have confirmed that other models, such as the base SDXL model, also work without issues.
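The ";" comment behavior comes from Python's standard configparser, which the module uses to read config.ini. A quick check:

```python
import configparser

config_text = """
[SDXLC]
control_mode = depth
;control_mode = canny
"""

cfg = configparser.ConfigParser()
cfg.read_string(config_text)
# Lines starting with ";" are ignored, so only the active value survives.
active = cfg["SDXLC"]["control_mode"]
print(active)
```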

For individual explanations of each setting, please refer to previous articles.

The parameters I probably haven't explained before are lora_weight_path and lora_scale.

For lora_weight_path, specify the path to the model stored in the inputs folder. You can also download and use other LoRA models.

lora_scale is a parameter that controls how much the LoRA parameters influence the output, though I don't fully understand the detailed mechanism. (If anyone is knowledgeable about this, please let me know!)

It is likely controlled in the following sections.
(Please expand only if you are interested.)

lora_scale

First, lora_scale is passed to cross_attention_kwargs in StableDiffusionXLPipeline, which is then used in self.unet.

diffusers/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl.py

class StableDiffusionXLPipeline(
...
    def __call__(
      ...
        noise_pred = self.unet(
                    latent_model_input,
                    t,
                    encoder_hidden_states=prompt_embeds,
                    timestep_cond=timestep_cond,
                    cross_attention_kwargs=self.cross_attention_kwargs,
                    added_cond_kwargs=added_cond_kwargs,
                    return_dict=False,
                )[0]

...

Here, self.unet uses UNet2DConditionModel by default.

Furthermore, it is used in the forward method of the UNet2DConditionModel class as follows.

diffusers/src/diffusers/models/unets/unet_2d_condition.py

class UNet2DConditionModel(
    ...
    def forward(
        ...
        if cross_attention_kwargs is not None:
            cross_attention_kwargs = cross_attention_kwargs.copy()
            lora_scale = cross_attention_kwargs.pop("scale", 1.0)
        else:
            lora_scale = 1.0

        if USE_PEFT_BACKEND:
            # weight the lora layers by setting `lora_scale` for each PEFT layer
            scale_lora_layers(self, lora_scale)

        ...

In the code above, the scale key is retrieved from cross_attention_kwargs.
When calling the pipeline, it is used as shown below, so the value of lora_scale is stored in this scale key.

cross_attention_kwargs={"scale": self.lora_scale},

The resulting value is then used in the scale_lora_layers function. This function is defined here.

diffusers/src/diffusers/utils/peft_utils.py
def scale_lora_layers(model, weight):
    """
    Adjust the weightage given to the LoRA layers of the model.

    Args:
        model (`torch.nn.Module`):
            The model to scale.
        weight (`float`):
            The weight to be given to the LoRA layers.
    """
    from peft.tuners.tuners_utils import BaseTunerLayer

    if weight == 1.0:
        return

    for module in model.modules():
        if isinstance(module, BaseTunerLayer):
            module.scale_layer(weight)

Looking at this, the scale_layer method is executed on the module, which is a class instance of BaseTunerLayer within the model.
This scale_layer method is defined here.

peft/src/peft/tuners/lora/layer.py

class LoraLayer(BaseTunerLayer):
    ...
    def scale_layer(self, scale_factor: float) -> None:
        if scale_factor != 1:
            for active_adapter in self.active_adapters:
                if active_adapter not in self.lora_A.keys():
                    continue
                alpha = self.lora_alpha[active_adapter]
                r = self.r[active_adapter]
                self.scaling[active_adapter] = (alpha / r) * scale_factor

I haven't traced it further than this, but at the bottom of peft/src/peft/tuners/lora/layer.py there are usage examples for convolutional and fully connected layers. Judging from those, the output of the LoRA path is multiplied by self.scaling[active_adapter] and then added to the original output. In other words, if lora_scale is 1.0, the output of the original model and the output of the LoRA module are added together with a 1:1 weight ratio.

scaling = self.scaling[active_adapter]
result += lora_B(lora_A(dropout(x))) * scaling
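Putting the snippets together, the multiplier applied to the LoRA branch appears to be (alpha / r) * lora_scale. A small sketch of that arithmetic, based on my reading of the peft code above rather than any official formula:

```python
def effective_scaling(lora_alpha, r, lora_scale):
    # per-adapter scaling stored by scale_layer: (alpha / r) * scale_factor
    return (lora_alpha / r) * lora_scale

# With alpha == r and lora_scale == 1.0, the LoRA output is added to the
# base output with weight 1.0, i.e. a 1:1 mix.
base_out, lora_out = 1.0, 0.5
y = base_out + lora_out * effective_scaling(16, 16, 1.0)
print(y)
```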

Fifth Cell

./colab_AI_sample/colab_SDXLControlNet_sample/SDXLControlNet_sample.ipynb

# Set the prompt to be read.

main_prompt = """
1 girl ,Yellowish-white hair ,short hair ,red small ribbon,red eyes,red hat ,school uniform ,solo ,smile ,upper body ,Anime ,Japanese,best quality,high quality,ultra highres,ultra quality
"""

use_lora = False
if use_lora:
  main_prompt += ", Dreamyvibes Artstyle"

negative_prompt="""
nsfw, lowres, (bad), text, error, fewer, extra, missing, worst quality, jpeg artifacts, low quality, watermark, unfinished, displeasing, oldest, early, chromatic aberration, signature, extra digits, artistic error, username, scan, [abstract]
"""

input_refer_image_path = "./inputs/refer.webp"
output_refer_image_path = "./inputs/refer.png"

In this cell, the prompt for generation is specified.
The prompt used is reused from Part 6.5.

Also, the code adds the trigger words to the prompt only when LoRA is used, but frankly you could simply include them in the prompt from the start.

In addition, the reference image path is specified with input_refer_image_path.
When using ControlNet, it is normally necessary to prepare inputs such as OpenPose skeletons or line drawings, which most users do not have on hand. Therefore, if you specify a regular image here, the reference image actually used in this run (an OpenPose skeleton, line drawing, etc.) is generated from it and saved to the path specified in output_refer_image_path.

Specifically, by inputting an image like the one below, it can be automatically converted into an OpenPose skeleton, line drawing, etc., and used in ControlNet. (The image was created using SD3 Medium)

Reference Image

OpenPose Full

Line Drawing

Depth Map

As you can see from the images above, the reference image is resized.
This resizing is performed according to the resolution set in the 4th cell.
The original reference image resolution is 1024x1024, and the image input into ControlNet is 832x1216.

Sixth Cell

./colab_AI_sample/colab_SDXLControlNet_sample/SDXLControlNet_sample.ipynb

sd = SDXLC()
sd.prepare_referimage(input_refer_image_path = input_refer_image_path, output_refer_image_path = output_refer_image_path, low_threshold = 100, high_threshold = 200)

Here, an instance of the SDXLC class is created.

On top of that, as explained in the 5th cell, the prepare_referimage method is executed to convert a normal reference image into an OpenPose skeleton or line drawing image.
Executing this saves the image to be input into ControlNet at the path output_refer_image_path = "./inputs/refer.png".

The low_threshold = 100, high_threshold = 200 values are the thresholds for obtaining Canny edges from the image for line drawings. It yields reasonably good results for many images even without modification.
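The two thresholds implement hysteresis: gradients above high_threshold become edges outright, those below low_threshold are discarded, and values in between survive only if connected to a strong edge. A toy 1-D illustration of that rule (cv2.Canny additionally performs smoothing and non-maximum suppression, which are omitted here):

```python
def hysteresis(grads, low, high):
    # strong pixels: definitely edges
    keep = [g >= high for g in grads]
    changed = True
    while changed:
        changed = False
        for i, g in enumerate(grads):
            # weak pixels (low <= g < high) are kept only when they
            # touch an already-kept neighbour
            if not keep[i] and g >= low:
                if (i > 0 and keep[i - 1]) or (i + 1 < len(grads) and keep[i + 1]):
                    keep[i] = True
                    changed = True
    return keep

edges = hysteresis([10, 120, 150, 250, 130, 40, 180, 90], low=100, high=200)
print(edges)
```

The isolated 180 is dropped despite exceeding low_threshold because it never touches a strong edge, which is how Canny suppresses noise while keeping connected contours.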

Seventh Cell

./colab_AI_sample/colab_SDXLControlNet_sample/SDXLControlNet_sample.ipynb

for i in range(3):
      start = time.time()
      image = sd.generate_image(main_prompt, neg_prompt = negative_prompt,image_path = output_refer_image_path, controlnet_conditioning_scale = 0.5)
      print("generate image time: ", time.time()-start)
      image.save("./outputs/SDXLC_result_{}.png".format(i))

Here, images are generated with SDXL + ControlNet.
By running the for loop 3 times, three images are generated and saved in the outputs folder.
Also, the time taken to generate one image is measured and displayed.

The execution results (generated images) will be presented in the later chapter on experimental results.

The images are saved in the outputs folder, but please note that they will overwrite previously saved images if run again with the same names.
(Unless you change the seed value, which seeds the random number generator, the exact same image is generated on every run. The seed can be changed in the configuration file in the 4th cell.)
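The role of the seed can be illustrated with Python's own random module; torch.Generator(...).manual_seed(seed) in the module plays the same role for the diffusion noise:

```python
import random

# Two generators with the same seed produce identical sequences,
# which is why the same seed always yields the same image.
gen_a = random.Random(42)
gen_b = random.Random(42)
seq_a = [gen_a.random() for _ in range(3)]
seq_b = [gen_b.random() for _ in range(3)]
print(seq_a == seq_b)

# A different seed diverges immediately.
gen_c = random.Random(7)
seq_c = [gen_c.random() for _ in range(3)]
print(seq_a == seq_c)
```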

In addition, by changing the value of the argument controlnet_conditioning_scale = 0.5, you can specify how much the reference image influences the generated image.
1.0 is the maximum value and 0 is the minimum. Specifying 0 means the reference image has no influence at all, and it will be generated based only on the prompt. At 1.0, the influence of the reference image is strong.
If you want the prompt to be reflected well, I recommend setting it around 0.5 to 0.7.

module/module_sdc.py

Next, I will explain the contents of the module loaded from SDXLControlNet_sample.ipynb.

Below is the full code.

Full Code
./colab_AI_sample/colab_SDXLControlNet_sample/module/module_sdc.py
from diffusers import DiffusionPipeline, AutoencoderKL, StableDiffusionXLControlNetPipeline, ControlNetModel
from diffusers.utils import load_image
from diffusers.pipelines.stable_diffusion_xl.pipeline_output import StableDiffusionXLPipelineOutput
import torch
from diffusers.schedulers import DPMSolverMultistepScheduler
from controlnet_aux.processor import Processor

import os
import configparser
# Module for checking file existence
import errno
import cv2
from PIL import Image
import time
import numpy as np

class SDXLCconfig:
    def __init__(self, config_ini_path = './configs/config.ini'):
        # Loading the ini file
        self.config_ini = configparser.ConfigParser()

        # Raise error if the specified ini file does not exist
        if not os.path.exists(config_ini_path):
            raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), config_ini_path)

        self.config_ini.read(config_ini_path, encoding='utf-8')
        SDXLC_items = self.config_ini.items('SDXLC')
        self.SDXLC_config_dict = dict(SDXLC_items)

class SDXLC:
    def __init__(self,device = None, config_ini_path = './configs/config.ini'):

        SDXLC_config = SDXLCconfig(config_ini_path = config_ini_path)
        config_dict = SDXLC_config.SDXLC_config_dict


        if device is not None:
            self.device = device
        else:
            device = config_dict["device"]

            self.device = "cuda" if torch.cuda.is_available() else "cpu"
            if device != "auto":
                self.device = device

        self.last_latents = None
        self.last_step = -1
        self.last_timestep = 1000

        self.n_steps = int(config_dict["n_steps"])
        if not config_dict["high_noise_frac"] == "None":
          self.high_noise_frac = float(config_dict["high_noise_frac"])
        else:
          self.high_noise_frac = None
        self.seed = int(config_dict["seed"])
        self.generator = torch.Generator(device=self.device).manual_seed(self.seed)

        self.controlnet_path = config_dict["controlnet_path"]

        self.control_mode = config_dict["control_mode"]
        if self.control_mode == "None":
            self.control_mode = None

        self.vae_model_path = config_dict["vae_model_path"]
        self.VAE_FLAG = True
        if self.vae_model_path == "None":
            self.vae_model_path = None
            self.VAE_FLAG = False

        self.base_model_path = config_dict["base_model_path"]

        self.REFINER_FLAG = True
        self.refiner_model_path = config_dict["refiner_model_path"]
        if self.refiner_model_path == "None":
            self.refiner_model_path = None
            self.REFINER_FLAG = False


        self.LORA_FLAG = True
        self.lora_weight_path = config_dict["lora_weight_path"]
        if self.lora_weight_path == "None":
          self.lora_weight_path = None
          self.LORA_FLAG = False
        self.lora_scale = float(config_dict["lora_scale"])

        self.use_karras_sigmas = config_dict["use_karras_sigmas"]
        if self.use_karras_sigmas == "True":
            self.use_karras_sigmas = True
        else:
            self.use_karras_sigmas = False
        self.scheduler_algorithm_type = config_dict["scheduler_algorithm_type"]
        if config_dict["solver_order"] != "None":
            self.solver_order = int(config_dict["solver_order"])
        else:
            self.solver_order = None

        self.cfg_scale = float(config_dict["cfg_scale"])
        self.width = int(config_dict["width"])
        self.height = int(config_dict["height"])
        self.output_type = config_dict["output_type"]
        self.aesthetic_score = float(config_dict["aesthetic_score"])
        self.negative_aesthetic_score = float(config_dict["negative_aesthetic_score"])

        self.save_latent_simple = config_dict["save_latent_simple"]
        if self.save_latent_simple == "True":
            self.save_latent_simple = True
            print("use callback save_latent_simple")
        else:
            self.save_latent_simple = False

        self.save_latent_overstep = config_dict["save_latent_overstep"]
        if self.save_latent_overstep == "True":
            self.save_latent_overstep = True
            print("use callback save_latent_overstep")
        else:
            self.save_latent_overstep = False

        self.save_latent_approximation = config_dict["save_latent_approximation"]
        if self.save_latent_approximation == "True":
            self.save_latent_approximation = True
            print("use callback save_latent_approximation")
        else:
            self.save_latent_approximation = False

        self.use_callback = False
        if self.save_latent_simple or self.save_latent_overstep or self.save_latent_approximation:
            self.use_callback = True

        if self.save_latent_simple and self.save_latent_overstep:
            raise ValueError("save_latent_simple and save_latent_overstep cannot be set at the same time")

        self.base , self.refiner = self.preprepare_model()


    def preprepare_model(self):
        controlnet = ControlNetModel.from_pretrained(
                self.controlnet_path,
                use_safetensors=True,
                torch_dtype=torch.float16)

        if self.VAE_FLAG:
            vae = AutoencoderKL.from_pretrained(
                self.vae_model_path,
                torch_dtype=torch.float16)

            base = StableDiffusionXLControlNetPipeline.from_pretrained(
                self.base_model_path,
                controlnet=controlnet,
                vae=vae,
                torch_dtype=torch.float16,
                variant="fp16",
                use_safetensors=True
            )
            base.to(self.device)

            if self.REFINER_FLAG:
                refiner = DiffusionPipeline.from_pretrained(
                    self.refiner_model_path,
                    text_encoder_2=base.text_encoder_2,
                    vae=vae,
                    requires_aesthetics_score=True,
                    torch_dtype=torch.float16,
                    variant="fp16",
                    use_safetensors=True
                )

                refiner.enable_model_cpu_offload()
            else:
                refiner = None

        else:
            base = StableDiffusionXLControlNetPipeline.from_pretrained(
                self.base_model_path,
                controlnet=controlnet,
                torch_dtype=torch.float16,
                variant="fp16",
                use_safetensors=True
            )
            base.to(self.device, torch.float16)

            if self.REFINER_FLAG:
                refiner = DiffusionPipeline.from_pretrained(
                    self.refiner_model_path,
                    text_encoder_2=base.text_encoder_2,
                    requires_aesthetics_score=True,
                    torch_dtype=torch.float16,
                    variant="fp16",
                    use_safetensors=True
                )

                refiner.enable_model_cpu_offload()
            else:
                refiner = None

        if self.LORA_FLAG:
            base.load_lora_weights(self.lora_weight_path)

        if self.solver_order is not None:
            base.scheduler = DPMSolverMultistepScheduler.from_config(
                    base.scheduler.config,
                    use_karras_sigmas=self.use_karras_sigmas,
                    algorithm_type=self.scheduler_algorithm_type,
                    solver_order=self.solver_order,
                    )
            return base, refiner
        else:
            base.scheduler = DPMSolverMultistepScheduler.from_config(
                    base.scheduler.config,
                    use_karras_sigmas=self.use_karras_sigmas,
                    algorithm_type=self.scheduler_algorithm_type,
                    )
            return base, refiner

    def prepare_referimage(self,input_refer_image_path,output_refer_image_path, low_threshold = 100, high_threshold = 200):

        mode = None
        if self.control_mode is not None:
            mode = self.control_mode
        else:
            raise ValueError("control_mode is not set")

        def prepare_openpose(input_refer_image_path,output_refer_image_path, mode):

            # Preparing the initial image
            init_image = load_image(input_refer_image_path)
            init_image = init_image.resize((self.width, self.height))

            processor = Processor(mode)
            processed_image = processor(init_image, to_pil=True)

            processed_image.save(output_refer_image_path)




        def prepare_canny(input_refer_image_path,output_refer_image_path, low_threshold = 100, high_threshold = 200):
            init_image = load_image(input_refer_image_path)
            init_image = init_image.resize((self.width, self.height))

            # Method to create the control image
            def make_canny_condition(image, low_threshold = 100, high_threshold = 200):
                image = np.array(image)
                image = cv2.Canny(image, low_threshold, high_threshold)
                image = image[:, :, None]
                image = np.concatenate([image, image, image], axis=2)
                return Image.fromarray(image)

            control_image = make_canny_condition(init_image, low_threshold, high_threshold)
            control_image.save(output_refer_image_path)

        def prepare_depthmap(input_refer_image_path,output_refer_image_path):

            # Preparing the initial image
            init_image = load_image(input_refer_image_path)
            init_image = init_image.resize((self.width, self.height))
            processor = Processor("depth_midas")
            depth_image = processor(init_image, to_pil=True)
            depth_image.save(output_refer_image_path)

        def prepare_zoe_depthmap(input_refer_image_path,output_refer_image_path):

            torch.hub.help(
                "intel-isl/MiDaS",
                "DPT_BEiT_L_384",
                force_reload=True
                )
            model_zoe_n = torch.hub.load(
                "isl-org/ZoeDepth",
                "ZoeD_NK",
                pretrained=True
                ).to("cuda")

            init_image = load_image(input_refer_image_path)
            init_image = init_image.resize((self.width, self.height))

            depth_numpy = model_zoe_n.infer_pil(init_image)  # return: numpy.ndarray

            from zoedepth.utils.misc import colorize
            colored = colorize(depth_numpy) # numpy.ndarray => numpy.ndarray

            # gamma correction
            img = colored / 255
            img = np.power(img, 2.2)
            img = (img * 255).astype(np.uint8)

            Image.fromarray(img).save(output_refer_image_path)


        if "openpose" in mode:
            prepare_openpose(input_refer_image_path,output_refer_image_path, mode)
        elif mode == "canny":
            prepare_canny(input_refer_image_path,output_refer_image_path, low_threshold = low_threshold, high_threshold = high_threshold)
        elif mode == "depth":
            prepare_depthmap(input_refer_image_path,output_refer_image_path)
        elif mode == "zoe_depth":
            prepare_zoe_depthmap(input_refer_image_path,output_refer_image_path)
        elif mode == "tile" or mode == "scribble":
            init_image = load_image(input_refer_image_path)
            init_image.save(output_refer_image_path)
        else:
            raise ValueError("control_mode is not set")


    def generate_image(self, prompt, neg_prompt, image_path, seed = None, controlnet_conditioning_scale = 1.0):
        def decode_tensors(pipe, step, timestep, callback_kwargs):
            if self.save_latent_simple:
                callback_kwargs = decode_tensors_simple(pipe, step, timestep, callback_kwargs)
            elif self.save_latent_overstep:
                callback_kwargs = decode_tensors_residual(pipe, step, timestep, callback_kwargs)
            else:
                raise ValueError("save_latent_simple or save_latent_overstep must be set or 'save_latent_approximation = False'")
            return callback_kwargs


        def decode_tensors_simple(pipe, step, timestep, callback_kwargs):
            latents = callback_kwargs["latents"]
            image = None
            if self.save_latent_simple and not self.save_latent_approximation:
                image = latents_to_rgb_vae(latents,pipe)
            elif self.save_latent_approximation:
                image = latents_to_rgb_approximation(latents,pipe)
            else:
                raise ValueError("save_latent_simple or save_latent_approximation is not set")
            gettime = time.time()
            formatted_time_human_readable = time.strftime("%Y%m%d_%H%M%S", time.localtime(gettime))
            image.save(f"./outputs/latent_{formatted_time_human_readable}_{step}_{timestep}.png")

            return callback_kwargs

        def decode_tensors_residual(pipe, step, timestep, callback_kwargs):
            latents = callback_kwargs["latents"]
            if step > 0:
                residual = latents - self.last_latents
                goal = self.last_latents + residual * ((self.last_timestep) / (self.last_timestep - timestep))
                #print( ((self.last_timestep) / (self.last_timestep - timestep)))
            else:
                goal = latents

            if self.save_latent_overstep and not self.save_latent_approximation:
                image = latents_to_rgb_vae(goal,pipe)
            elif self.save_latent_approximation:
                image = latents_to_rgb_approximation(goal,pipe)
            else:
                raise ValueError("save_latent_simple or save_latent_approximation is not set")

            gettime = time.time()
            formatted_time_human_readable = time.strftime("%Y%m%d_%H%M%S", time.localtime(gettime))
            image.save(f"./outputs/latent_{formatted_time_human_readable}_{step}_{timestep}.png")

            self.last_latents = latents
            self.last_step = step
            self.last_timestep = timestep

            if timestep == 0:
                self.last_latents = None
                self.last_step = -1
                self.last_timestep = 1000

            return callback_kwargs

        def latents_to_rgb_vae(latents,pipe):

            pipe.upcast_vae()
            latents = latents.to(next(iter(pipe.vae.post_quant_conv.parameters())).dtype)
            images = pipe.vae.decode(latents / pipe.vae.config.scaling_factor, return_dict=False)[0]
            images = pipe.image_processor.postprocess(images, output_type='pil')
            pipe.vae.to(dtype=torch.float16)

            return StableDiffusionXLPipelineOutput(images=images).images[0]

        def latents_to_rgb_approximation(latents, pipe):
            weights = (
                (60, -60, 25, -70),
                (60,  -5, 15, -50),
                (60,  10, -5, -35)
            )

            weights_tensor = torch.t(torch.tensor(weights, dtype=latents.dtype).to(latents.device))
            biases_tensor = torch.tensor((150, 140, 130), dtype=latents.dtype).to(latents.device)
            rgb_tensor = torch.einsum("...lxy,lr -> ...rxy", latents, weights_tensor) + biases_tensor.unsqueeze(-1).unsqueeze(-1)
            image_array = rgb_tensor.clamp(0, 255)[0].byte().cpu().numpy()
            image_array = image_array.transpose(1, 2, 0)  # Change the order of dimensions

            return Image.fromarray(image_array)

        if seed is not None:
            self.generator = torch.Generator(device=self.device).manual_seed(seed)

        control_image = load_image(image_path)

        image = None
        if self.use_callback:
            if self.LORA_FLAG:
                if self.REFINER_FLAG:
                    image = self.base(
                        prompt=prompt,
                        negative_prompt=neg_prompt,
                        image=control_image,
                        cfg_scale=self.cfg_scale,
                        controlnet_conditioning_scale=controlnet_conditioning_scale,
                        num_inference_steps=self.n_steps,
                        denoising_end=self.high_noise_frac,
                        output_type="latent",
                        width = self.width,
                        height = self.height,
                        generator=self.generator,
                        cross_attention_kwargs={"scale": self.lora_scale},
                        callback_on_step_end=decode_tensors,
                        callback_on_step_end_tensor_inputs=["latents"],
                        ).images[0]
                    image = self.refiner(
                        prompt=prompt,
                        negative_prompt=neg_prompt,
                        cfg_scale=self.cfg_scale,
                        aesthetic_score = self.aesthetic_score,
                        negative_aesthetic_score = self.negative_aesthetic_score,
                        num_inference_steps=self.n_steps,
                        denoising_start=self.high_noise_frac,
                        callback_on_step_end=decode_tensors,
                        callback_on_step_end_tensor_inputs=["latents"],
                        image=image[None, :]
                        ).images[0]
                # If refiner is not used
                else:
                    image = self.base(
                        prompt=prompt,
                        negative_prompt=neg_prompt,
                        image=control_image,
                        cfg_scale=self.cfg_scale,
                        controlnet_conditioning_scale=controlnet_conditioning_scale,
                        num_inference_steps=self.n_steps,
                        denoising_end=self.high_noise_frac,
                        output_type=self.output_type,
                        width = self.width,
                        height = self.height,
                        generator=self.generator,
                        callback_on_step_end=decode_tensors,
                        callback_on_step_end_tensor_inputs=["latents"],
                        cross_attention_kwargs={"scale": self.lora_scale},
                        ).images[0]
            # If LoRA is not used
            else:
                if self.REFINER_FLAG:
                    image = self.base(
                        prompt=prompt,
                        negative_prompt=neg_prompt,
                        image=control_image,
                        cfg_scale=self.cfg_scale,
                        controlnet_conditioning_scale=controlnet_conditioning_scale,
                        num_inference_steps=self.n_steps,
                        denoising_end=self.high_noise_frac,
                        output_type="latent",
                        width = self.width,
                        height = self.height,
                        callback_on_step_end=decode_tensors,
                        callback_on_step_end_tensor_inputs=["latents"],
                        generator=self.generator
                        ).images[0]
                    image = self.refiner(
                        prompt=prompt,
                        negative_prompt=neg_prompt,
                        cfg_scale=self.cfg_scale,
                        aesthetic_score = self.aesthetic_score,
                        negative_aesthetic_score = self.negative_aesthetic_score,
                        num_inference_steps=self.n_steps,
                        denoising_start=self.high_noise_frac,
                        callback_on_step_end=decode_tensors,
                        callback_on_step_end_tensor_inputs=["latents"],
                        image=image[None, :]
                        ).images[0]
                # If refiner is not used
                else:
                    image = self.base(
                        prompt=prompt,
                        negative_prompt=neg_prompt,
                        image=control_image,
                        cfg_scale=self.cfg_scale,
                        controlnet_conditioning_scale=controlnet_conditioning_scale,
                        num_inference_steps=self.n_steps,
                        denoising_end=self.high_noise_frac,
                        output_type=self.output_type,
                        width = self.width,
                        height = self.height,
                        callback_on_step_end=decode_tensors,
                        callback_on_step_end_tensor_inputs=["latents"],
                        generator=self.generator
                        ).images[0]
        # If latents are not saved
        else:
            if self.LORA_FLAG:
                if self.REFINER_FLAG:
                    image = self.base(
                        prompt=prompt,
                        negative_prompt=neg_prompt,
                        image=control_image,
                        cfg_scale=self.cfg_scale,
                        controlnet_conditioning_scale=controlnet_conditioning_scale,
                        num_inference_steps=self.n_steps,
                        denoising_end=self.high_noise_frac,
                        output_type="latent",
                        width = self.width,
                        height = self.height,
                        generator=self.generator,
                        cross_attention_kwargs={"scale": self.lora_scale},
                        ).images[0]
                    image = self.refiner(
                        prompt=prompt,
                        negative_prompt=neg_prompt,
                        cfg_scale=self.cfg_scale,
                        aesthetic_score = self.aesthetic_score,
                        negative_aesthetic_score = self.negative_aesthetic_score,
                        num_inference_steps=self.n_steps,
                        denoising_start=self.high_noise_frac,
                        image=image[None, :]
                        ).images[0]
                # If refiner is not used
                else:
                    image = self.base(
                        prompt=prompt,
                        negative_prompt=neg_prompt,
                        image=control_image,
                        cfg_scale=self.cfg_scale,
                        controlnet_conditioning_scale=controlnet_conditioning_scale,
                        num_inference_steps=self.n_steps,
                        denoising_end=self.high_noise_frac,
                        output_type=self.output_type,
                        width = self.width,
                        height = self.height,
                        generator=self.generator,
                        cross_attention_kwargs={"scale": self.lora_scale},
                        ).images[0]
            # If LoRA is not used
            else:
                if self.REFINER_FLAG:
                    image = self.base(
                        prompt=prompt,
                        negative_prompt=neg_prompt,
                        image=control_image,
                        cfg_scale=self.cfg_scale,
                        controlnet_conditioning_scale=controlnet_conditioning_scale,
                        num_inference_steps=self.n_steps,
                        denoising_end=self.high_noise_frac,
                        output_type="latent",
                        width = self.width,
                        height = self.height,
                        generator=self.generator
                        ).images[0]
                    image = self.refiner(
                        prompt=prompt,
                        negative_prompt=neg_prompt,
                        cfg_scale=self.cfg_scale,
                        aesthetic_score = self.aesthetic_score,
                        negative_aesthetic_score = self.negative_aesthetic_score,
                        num_inference_steps=self.n_steps,
                        denoising_start=self.high_noise_frac,
                        image=image[None, :]
                        ).images[0]
                # If refiner is not used
                else:
                    image = self.base(
                        prompt=prompt,
                        negative_prompt=neg_prompt,
                        image=control_image,
                        cfg_scale=self.cfg_scale,
                        controlnet_conditioning_scale=controlnet_conditioning_scale,
                        num_inference_steps=self.n_steps,
                        denoising_end=self.high_noise_frac,
                        output_type=self.output_type,
                        width = self.width,
                        height = self.height,
                        generator=self.generator
                        ).images[0]

        return image

Now, let's explain each part one by one.

SDXLCconfig Class

./colab_AI_sample/colab_SDXLControlNet_sample/module/module_sdc.py
class SDXLCconfig:
    def __init__(self, config_ini_path = './configs/config.ini'):
        # Loading the ini file
        self.config_ini = configparser.ConfigParser()

        # Raise error if the specified ini file does not exist
        if not os.path.exists(config_ini_path):
            raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), config_ini_path)

        self.config_ini.read(config_ini_path, encoding='utf-8')
        SDXLC_items = self.config_ini.items('SDXLC')
        self.SDXLC_config_dict = dict(SDXLC_items)

Here, the configuration file specified by config_ini_path = './configs/config.ini' is loaded into SDXLC_config_dict. Since it is loaded as a dictionary, the contents of the configuration file can be accessed as a Python dictionary.
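This loading step can be reproduced in isolation. A minimal sketch (the section name SDXLC follows the article; the keys and values here are made up for illustration):

```python
# Minimal, self-contained sketch of what SDXLCconfig does:
# an ini section is read and exposed as a plain Python dict.
import configparser

ini_text = """
[SDXLC]
device = auto
n_steps = 40
cfg_scale = 7.0
"""

config_ini = configparser.ConfigParser()
config_ini.read_string(ini_text)  # the class uses read(path) on a real file instead
config_dict = dict(config_ini.items("SDXLC"))

print(config_dict)  # {'device': 'auto', 'n_steps': '40', 'cfg_scale': '7.0'}
```

Note that configparser returns every value as a string, which is why the SDXLC class casts them explicitly later.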

init Method of the SDXLC Class

./colab_AI_sample/colab_SDXLControlNet_sample/module/module_sdc.py

class SDXLC:
    def __init__(self,device = None, config_ini_path = './configs/config.ini'):

        SDXLC_config = SDXLCconfig(config_ini_path = config_ini_path)
        config_dict = SDXLC_config.SDXLC_config_dict


        if device is not None:
            self.device = device
        else:
            device = config_dict["device"]

            self.device = "cuda" if torch.cuda.is_available() else "cpu"
            if device != "auto":
                self.device = device

        self.last_latents = None
        self.last_step = -1
        self.last_timestep = 1000

        self.n_steps = int(config_dict["n_steps"])
        if not config_dict["high_noise_frac"] == "None":
          self.high_noise_frac = float(config_dict["high_noise_frac"])
        else:
          self.high_noise_frac = None
        self.seed = int(config_dict["seed"])
        self.generator = torch.Generator(device=self.device).manual_seed(self.seed)

        self.controlnet_path = config_dict["controlnet_path"]

        self.control_mode = config_dict["control_mode"]
        if self.control_mode == "None":
            self.control_mode = None

        self.vae_model_path = config_dict["vae_model_path"]
        self.VAE_FLAG = True
        if self.vae_model_path == "None":
            self.vae_model_path = None
            self.VAE_FLAG = False

        self.base_model_path = config_dict["base_model_path"]

        self.REFINER_FLAG = True
        self.refiner_model_path = config_dict["refiner_model_path"]
        if self.refiner_model_path == "None":
            self.refiner_model_path = None
            self.REFINER_FLAG = False


        self.LORA_FLAG = True
        self.lora_weight_path = config_dict["lora_weight_path"]
        if self.lora_weight_path == "None":
          self.lora_weight_path = None
          self.LORA_FLAG = False
        self.lora_scale = float(config_dict["lora_scale"])

        self.use_karras_sigmas = config_dict["use_karras_sigmas"]
        if self.use_karras_sigmas == "True":
            self.use_karras_sigmas = True
        else:
            self.use_karras_sigmas = False
        self.scheduler_algorithm_type = config_dict["scheduler_algorithm_type"]
        if config_dict["solver_order"] != "None":
            self.solver_order = int(config_dict["solver_order"])
        else:
            self.solver_order = None

        self.cfg_scale = float(config_dict["cfg_scale"])
        self.width = int(config_dict["width"])
        self.height = int(config_dict["height"])
        self.output_type = config_dict["output_type"]
        self.aesthetic_score = float(config_dict["aesthetic_score"])
        self.negative_aesthetic_score = float(config_dict["negative_aesthetic_score"])

        self.save_latent_simple = config_dict["save_latent_simple"]
        if self.save_latent_simple == "True":
            self.save_latent_simple = True
            print("use callback save_latent_simple")
        else:
            self.save_latent_simple = False

        self.save_latent_overstep = config_dict["save_latent_overstep"]
        if self.save_latent_overstep == "True":
            self.save_latent_overstep = True
            print("use callback save_latent_overstep")
        else:
            self.save_latent_overstep = False

        self.save_latent_approximation = config_dict["save_latent_approximation"]
        if self.save_latent_approximation == "True":
            self.save_latent_approximation = True
            print("use callback save_latent_approximation")
        else:
            self.save_latent_approximation = False

        self.use_callback = False
        if self.save_latent_simple or self.save_latent_overstep or self.save_latent_approximation:
            self.use_callback = True

        if self.save_latent_simple and self.save_latent_overstep:
            raise ValueError("save_latent_simple and save_latent_overstep cannot be set at the same time")

        self.base , self.refiner = self.preprepare_model()

First, the contents of the configuration file are stored in config_dict. Since this is a dictionary, any setting can be retrieved as a string with a key such as config_dict["device"].
Note that every value comes back as a string, so you need to cast to int, float, or bool as necessary.
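The casting, including the literal string "None" used as a sentinel for settings such as high_noise_frac and solver_order, boils down to a small pattern like the following (the helper name is hypothetical; the class inlines the same checks):

```python
# Hypothetical helper mirroring the inline checks in __init__:
# ini values arrive as strings, and the literal "None" means "not set".
def parse_optional_float(value: str):
    return None if value == "None" else float(value)

print(parse_optional_float("None"))  # None
print(parse_optional_float("0.8"))   # 0.8
```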

Next, the following processes are performed:

  • Specify the device to run the model.
  • Retrieve various settings from the configuration file.
  • Define the model.
    • Define the appropriate model according to the configuration file.
    • This is defined in the self.preprepare_model() method.
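The device-selection step above can be sketched as a pure function (the function name is hypothetical; the class inlines this logic): an explicit device from the argument or config wins, while "auto" falls back to CUDA when available and CPU otherwise.

```python
# Hypothetical sketch of the device selection in __init__.
def resolve_device(config_value: str, cuda_available: bool) -> str:
    device = "cuda" if cuda_available else "cpu"
    if config_value != "auto":
        device = config_value  # explicit setting overrides auto-detection
    return device

print(resolve_device("auto", True))   # cuda
print(resolve_device("auto", False))  # cpu
print(resolve_device("cpu", True))    # cpu
```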

preprepare_model Method of the SDXLC Class

./colab_AI_sample/colab_SDXLControlNet_sample/module/module_sdc.py

class SDXLC:
    ...
    def preprepare_model(self):
        controlnet = ControlNetModel.from_pretrained(
                self.controlnet_path,
                use_safetensors=True,
                torch_dtype=torch.float16)

        if self.VAE_FLAG:
            vae = AutoencoderKL.from_pretrained(
                self.vae_model_path,
                torch_dtype=torch.float16)

            base = StableDiffusionXLControlNetPipeline.from_pretrained(
                self.base_model_path,
                controlnet=controlnet,
                vae=vae,
                torch_dtype=torch.float16,
                variant="fp16",
                use_safetensors=True
            )
            base.to(self.device)

            if self.REFINER_FLAG:
                refiner = DiffusionPipeline.from_pretrained(
                    self.refiner_model_path,
                    text_encoder_2=base.text_encoder_2,
                    vae=vae,
                    requires_aesthetics_score=True,
                    torch_dtype=torch.float16,
                    variant="fp16",
                    use_safetensors=True
                )

                refiner.enable_model_cpu_offload()
            else:
                refiner = None

        else:
            base = StableDiffusionXLControlNetPipeline.from_pretrained(
                self.base_model_path,
                controlnet=controlnet,
                torch_dtype=torch.float16,
                variant="fp16",
                use_safetensors=True
            )
            base.to(self.device, torch.float16)

            if self.REFINER_FLAG:
                refiner = DiffusionPipeline.from_pretrained(
                    self.refiner_model_path,
                    text_encoder_2=base.text_encoder_2,
                    requires_aesthetics_score=True,
                    torch_dtype=torch.float16,
                    variant="fp16",
                    use_safetensors=True
                )

                refiner.enable_model_cpu_offload()
            else:
                refiner = None

        if self.LORA_FLAG:
            base.load_lora_weights(self.lora_weight_path)

        if self.solver_order is not None:
            base.scheduler = DPMSolverMultistepScheduler.from_config(
                    base.scheduler.config,
                    use_karras_sigmas=self.use_karras_sigmas,
                    algorithm_type=self.scheduler_algorithm_type,
                    solver_order=self.solver_order,
                    )
            return base, refiner
        else:
            base.scheduler = DPMSolverMultistepScheduler.from_config(
                    base.scheduler.config,
                    use_karras_sigmas=self.use_karras_sigmas,
                    algorithm_type=self.scheduler_algorithm_type,
                    )
            return base, refiner

This is essentially the same as in the previous articles on SD3 (+ ControlNet) (Part 8, Part 9) and SDXL (Part 6, Part 6.5).
The one change is that the LoRA weights are loaded at the following point:

if self.LORA_FLAG:
    base.load_lora_weights(self.lora_weight_path)
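The "small trainable modules connected in parallel" idea from the LoRA description earlier can be illustrated in a few lines of numpy. This is a toy sketch of the math, not the diffusers implementation (load_lora_weights handles the real weight wiring inside the pipeline):

```python
# Toy numpy sketch of LoRA: the frozen weight W is left untouched,
# and a scaled low-rank update B @ A is added in parallel.
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                      # model width and LoRA rank (r << d)
W = rng.normal(size=(d, d))      # frozen pretrained weight
A = rng.normal(size=(r, d))      # trainable down-projection
B = np.zeros((d, r))             # trainable up-projection, initialized to zero
scale = 0.8                      # plays the role of lora_scale in this article

x = rng.normal(size=d)
y = W @ x + scale * (B @ (A @ x))  # frozen path + parallel low-rank path

# Before any training (B is zero) the adapted output equals the base output.
assert np.allclose(y, W @ x)
```

Only A and B (d*r parameters each) would be trained, which is why LoRA gets by with far fewer images and far less memory than full fine-tuning.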

prepare_referimage Method of the SDXLC Class

./colab_AI_sample/colab_SDXLControlNet_sample/module/module_sdc.py

class SDXLC:
    ...
    def prepare_referimage(self,input_refer_image_path,output_refer_image_path, low_threshold = 100, high_threshold = 200):

        mode = None
        if self.control_mode is not None:
            mode = self.control_mode
        else:
            raise ValueError("control_mode is not set")

        def prepare_openpose(input_refer_image_path,output_refer_image_path, mode):

            # Preparing the initial image
            init_image = load_image(input_refer_image_path)
            init_image = init_image.resize((self.width, self.height))

            processor = Processor(mode)
            processed_image = processor(init_image, to_pil=True)

            processed_image.save(output_refer_image_path)




        def prepare_canny(input_refer_image_path,output_refer_image_path, low_threshold = 100, high_threshold = 200):
            init_image = load_image(input_refer_image_path)
            init_image = init_image.resize((self.width, self.height))

            # Method to create the control image
            def make_canny_condition(image, low_threshold = 100, high_threshold = 200):
                image = np.array(image)
                image = cv2.Canny(image, low_threshold, high_threshold)
                image = image[:, :, None]
                image = np.concatenate([image, image, image], axis=2)
                return Image.fromarray(image)

            control_image = make_canny_condition(init_image, low_threshold, high_threshold)
            control_image.save(output_refer_image_path)

        def prepare_depthmap(input_refer_image_path,output_refer_image_path):

            # Preparing the initial image
            init_image = load_image(input_refer_image_path)
            init_image = init_image.resize((self.width, self.height))
            processor = Processor("depth_midas")
            depth_image = processor(init_image, to_pil=True)
            depth_image.save(output_refer_image_path)

        def prepare_zoe_depthmap(input_refer_image_path,output_refer_image_path):

            torch.hub.help(
                "intel-isl/MiDaS",
                "DPT_BEiT_L_384",
                force_reload=True
                )
            model_zoe_n = torch.hub.load(
                "isl-org/ZoeDepth",
                "ZoeD_NK",
                pretrained=True
                ).to("cuda")

            init_image = load_image(input_refer_image_path)
            init_image = init_image.resize((self.width, self.height))

            depth_numpy = model_zoe_n.infer_pil(init_image)  # return: numpy.ndarray

            from zoedepth.utils.misc import colorize
            colored = colorize(depth_numpy) # numpy.ndarray => numpy.ndarray

            # gamma correction
            img = colored / 255
            img = np.power(img, 2.2)
            img = (img * 255).astype(np.uint8)

            Image.fromarray(img).save(output_refer_image_path)


        if "openpose" in mode:
            prepare_openpose(input_refer_image_path,output_refer_image_path, mode)
        elif mode == "canny":
            prepare_canny(input_refer_image_path,output_refer_image_path, low_threshold = low_threshold, high_threshold = high_threshold)
        elif mode == "depth":
            prepare_depthmap(input_refer_image_path,output_refer_image_path)
        elif mode == "zoe_depth":
            prepare_zoe_depthmap(input_refer_image_path,output_refer_image_path)
        elif mode == "tile" or mode == "scribble":
            init_image = load_image(input_refer_image_path)
            init_image.save(output_refer_image_path)
        else:
            raise ValueError(f"unsupported control_mode: {mode}")

This method converts an ordinary image, like the one below, into an OpenPose skeleton, a line drawing, or a depth map.

Reference Image

OpenPose Full

Line Drawing

Depth Map

First, it sets how the reference image should be converted here:

        mode = None
        if self.control_mode is not None:
            mode = self.control_mode
        else:
            raise ValueError("control_mode is not set")

Then, it generates the reference image based on the value of mode as follows:

if "openpose" in mode:
    prepare_openpose(input_refer_image_path,output_refer_image_path, mode)
elif mode == "canny":
    prepare_canny(input_refer_image_path,output_refer_image_path, low_threshold = low_threshold, high_threshold = high_threshold)
elif mode == "depth":
    prepare_depthmap(input_refer_image_path,output_refer_image_path)
elif mode == "zoe_depth":
    prepare_zoe_depthmap(input_refer_image_path,output_refer_image_path)
elif mode == "tile" or mode == "scribble":
    init_image = load_image(input_refer_image_path)
    init_image.save(output_refer_image_path)
else:
    raise ValueError(f"unsupported control_mode: {mode}")

For openpose, the OpenPose skeleton image is saved to output_refer_image_path.
For canny, the Canny edges are extracted and saved in the same way.
In this way, the reference image is converted into the conditioning image that will be fed to ControlNet.
In each case, the reference image is first resized to the resolution you intend to generate at.
generate_image Method of the SDXLC Class

./colab_AI_sample/colab_SDXLControlNet_sample/module/module_sdc.py

class SDXLC:
    ...
    def generate_image(self, prompt, neg_prompt, image_path, seed = None, controlnet_conditioning_scale = 1.0):
        def decode_tensors(pipe, step, timestep, callback_kwargs):
            if self.save_latent_simple:
                callback_kwargs = decode_tensors_simple(pipe, step, timestep, callback_kwargs)
            elif self.save_latent_overstep:
                callback_kwargs = decode_tensors_residual(pipe, step, timestep, callback_kwargs)
            else:
                raise ValueError("save_latent_simple or save_latent_overstep must be set or 'save_latent_approximation = False'")
            return callback_kwargs


        def decode_tensors_simple(pipe, step, timestep, callback_kwargs):
            latents = callback_kwargs["latents"]
            image = None
            if self.save_latent_simple and not self.save_latent_approximation:
                image = latents_to_rgb_vae(latents,pipe)
            elif self.save_latent_approximation:
                image = latents_to_rgb_approximation(latents,pipe)
            else:
                raise ValueError("save_latent_simple or save_latent_approximation is not set")
            gettime = time.time()
            formatted_time_human_readable = time.strftime("%Y%m%d_%H%M%S", time.localtime(gettime))
            image.save(f"./outputs/latent_{formatted_time_human_readable}_{step}_{timestep}.png")

            return callback_kwargs

        def decode_tensors_residual(pipe, step, timestep, callback_kwargs):
            latents = callback_kwargs["latents"]
            if step > 0:
                residual = latents - self.last_latents
                goal = self.last_latents + residual * ((self.last_timestep) / (self.last_timestep - timestep))
                #print( ((self.last_timestep) / (self.last_timestep - timestep)))
            else:
                goal = latents

            if self.save_latent_overstep and not self.save_latent_approximation:
                image = latents_to_rgb_vae(goal,pipe)
            elif self.save_latent_approximation:
                image = latents_to_rgb_approximation(goal,pipe)
            else:
                raise ValueError("save_latent_simple or save_latent_approximation is not set")

            gettime = time.time()
            formatted_time_human_readable = time.strftime("%Y%m%d_%H%M%S", time.localtime(gettime))
            image.save(f"./outputs/latent_{formatted_time_human_readable}_{step}_{timestep}.png")

            self.last_latents = latents
            self.last_step = step
            self.last_timestep = timestep

            if timestep == 0:
                self.last_latents = None
                self.last_step = -1
                self.last_timestep = 1000

            return callback_kwargs

        def latents_to_rgb_vae(latents,pipe):

            pipe.upcast_vae()
            latents = latents.to(next(iter(pipe.vae.post_quant_conv.parameters())).dtype)
            images = pipe.vae.decode(latents / pipe.vae.config.scaling_factor, return_dict=False)[0]
            images = pipe.image_processor.postprocess(images, output_type='pil')
            pipe.vae.to(dtype=torch.float16)

            return StableDiffusionXLPipelineOutput(images=images).images[0]

        def latents_to_rgb_approximation(latents, pipe):
            weights = (
                (60, -60, 25, -70),
                (60,  -5, 15, -50),
                (60,  10, -5, -35)
            )

            weights_tensor = torch.t(torch.tensor(weights, dtype=latents.dtype).to(latents.device))
            biases_tensor = torch.tensor((150, 140, 130), dtype=latents.dtype).to(latents.device)
            rgb_tensor = torch.einsum("...lxy,lr -> ...rxy", latents, weights_tensor) + biases_tensor.unsqueeze(-1).unsqueeze(-1)
            image_array = rgb_tensor.clamp(0, 255)[0].byte().cpu().numpy()
            image_array = image_array.transpose(1, 2, 0)  # Change the order of dimensions

            return Image.fromarray(image_array)

        if seed is not None:
            self.generator = torch.Generator(device=self.device).manual_seed(seed)

        control_image = load_image(image_path)

        image = None
        if self.use_callback:
            if self.LORA_FLAG:
                if self.REFINER_FLAG:
                    image = self.base(
                        prompt=prompt,
                        negative_prompt=neg_prompt,
                        image=control_image,
                        cfg_scale=self.cfg_scale,
                        controlnet_conditioning_scale=controlnet_conditioning_scale,
                        num_inference_steps=self.n_steps,
                        denoising_end=self.high_noise_frac,
                        output_type="latent",
                        width = self.width,
                        height = self.height,
                        generator=self.generator,
                        cross_attention_kwargs={"scale": self.lora_scale},
                        callback_on_step_end=decode_tensors,
                        callback_on_step_end_tensor_inputs=["latents"],
                        ).images[0]
                    image = self.refiner(
                        prompt=prompt,
                        negative_prompt=neg_prompt,
                        cfg_scale=self.cfg_scale,
                        aesthetic_score = self.aesthetic_score,
                        negative_aesthetic_score = self.negative_aesthetic_score,
                        num_inference_steps=self.n_steps,
                        denoising_start=self.high_noise_frac,
                        callback_on_step_end=decode_tensors,
                        callback_on_step_end_tensor_inputs=["latents"],
                        image=image[None, :]
                        ).images[0]
                # Case when refiner is not used
                else:
                    image = self.base(
                        prompt=prompt,
                        negative_prompt=neg_prompt,
                        image=control_image,
                        guidance_scale=self.cfg_scale,
                        controlnet_conditioning_scale=controlnet_conditioning_scale,
                        num_inference_steps=self.n_steps,
                        denoising_end=self.high_noise_frac,
                        output_type=self.output_type,
                        width=self.width,
                        height=self.height,
                        generator=self.generator,
                        callback_on_step_end=decode_tensors,
                        callback_on_step_end_tensor_inputs=["latents"],
                        cross_attention_kwargs={"scale": self.lora_scale},
                        ).images[0]
            # Case when LoRA is not used
            else:
                if self.REFINER_FLAG:
                    image = self.base(
                        prompt=prompt,
                        negative_prompt=neg_prompt,
                        image=control_image,
                        guidance_scale=self.cfg_scale,
                        controlnet_conditioning_scale=controlnet_conditioning_scale,
                        num_inference_steps=self.n_steps,
                        denoising_end=self.high_noise_frac,
                        output_type="latent",
                        width=self.width,
                        height=self.height,
                        callback_on_step_end=decode_tensors,
                        callback_on_step_end_tensor_inputs=["latents"],
                        generator=self.generator
                        ).images[0]
                    image = self.refiner(
                        prompt=prompt,
                        negative_prompt=neg_prompt,
                        guidance_scale=self.cfg_scale,
                        aesthetic_score=self.aesthetic_score,
                        negative_aesthetic_score=self.negative_aesthetic_score,
                        num_inference_steps=self.n_steps,
                        denoising_start=self.high_noise_frac,
                        callback_on_step_end=decode_tensors,
                        callback_on_step_end_tensor_inputs=["latents"],
                        image=image[None, :]
                        ).images[0]
                # Case when refiner is not used
                else:
                    image = self.base(
                        prompt=prompt,
                        negative_prompt=neg_prompt,
                        image=control_image,
                        guidance_scale=self.cfg_scale,
                        controlnet_conditioning_scale=controlnet_conditioning_scale,
                        num_inference_steps=self.n_steps,
                        denoising_end=self.high_noise_frac,
                        output_type=self.output_type,
                        width=self.width,
                        height=self.height,
                        callback_on_step_end=decode_tensors,
                        callback_on_step_end_tensor_inputs=["latents"],
                        generator=self.generator
                        ).images[0]
        # Case when latents are not saved
        else:
            if self.LORA_FLAG:
                if self.REFINER_FLAG:
                    image = self.base(
                        prompt=prompt,
                        negative_prompt=neg_prompt,
                        image=control_image,
                        guidance_scale=self.cfg_scale,
                        controlnet_conditioning_scale=controlnet_conditioning_scale,
                        num_inference_steps=self.n_steps,
                        denoising_end=self.high_noise_frac,
                        output_type="latent",
                        width=self.width,
                        height=self.height,
                        generator=self.generator,
                        cross_attention_kwargs={"scale": self.lora_scale},
                        ).images[0]
                    image = self.refiner(
                        prompt=prompt,
                        negative_prompt=neg_prompt,
                        guidance_scale=self.cfg_scale,
                        aesthetic_score=self.aesthetic_score,
                        negative_aesthetic_score=self.negative_aesthetic_score,
                        num_inference_steps=self.n_steps,
                        denoising_start=self.high_noise_frac,
                        image=image[None, :]
                        ).images[0]
                # Case when refiner is not used
                else:
                    image = self.base(
                        prompt=prompt,
                        negative_prompt=neg_prompt,
                        image=control_image,
                        guidance_scale=self.cfg_scale,
                        controlnet_conditioning_scale=controlnet_conditioning_scale,
                        num_inference_steps=self.n_steps,
                        denoising_end=self.high_noise_frac,
                        output_type=self.output_type,
                        width=self.width,
                        height=self.height,
                        generator=self.generator,
                        cross_attention_kwargs={"scale": self.lora_scale},
                        ).images[0]
            # Case when LoRA is not used
            else:
                if self.REFINER_FLAG:
                    image = self.base(
                        prompt=prompt,
                        negative_prompt=neg_prompt,
                        image=control_image,
                        guidance_scale=self.cfg_scale,
                        controlnet_conditioning_scale=controlnet_conditioning_scale,
                        num_inference_steps=self.n_steps,
                        denoising_end=self.high_noise_frac,
                        output_type="latent",
                        width=self.width,
                        height=self.height,
                        generator=self.generator
                        ).images[0]
                    image = self.refiner(
                        prompt=prompt,
                        negative_prompt=neg_prompt,
                        guidance_scale=self.cfg_scale,
                        aesthetic_score=self.aesthetic_score,
                        negative_aesthetic_score=self.negative_aesthetic_score,
                        num_inference_steps=self.n_steps,
                        denoising_start=self.high_noise_frac,
                        image=image[None, :]
                        ).images[0]
                # Case when refiner is not used
                else:
                    image = self.base(
                        prompt=prompt,
                        negative_prompt=neg_prompt,
                        image=control_image,
                        guidance_scale=self.cfg_scale,
                        controlnet_conditioning_scale=controlnet_conditioning_scale,
                        num_inference_steps=self.n_steps,
                        denoising_end=self.high_noise_frac,
                        output_type=self.output_type,
                        width=self.width,
                        height=self.height,
                        generator=self.generator
                        ).images[0]

        return image

This method performs the actual image generation using the models and settings loaded so far. It branches on whether latents are saved during generation (use_callback), whether LoRA is used (LORA_FLAG), and whether the Refiner is used (REFINER_FLAG), and switches which pipeline call is executed accordingly.
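The branching above duplicates each pipeline call almost verbatim. The same logic can be collapsed by assembling the keyword arguments conditionally before a single call. A minimal sketch under the same flags (plain Python; the flag and attribute names mirror the class above, and no diffusers objects are involved):

```python
# Sketch: build the base pipeline's kwargs once instead of
# duplicating the call in every branch. The flag names mirror
# the ones used in the class above.
def build_base_kwargs(prompt, neg_prompt, control_image,
                      lora_flag, refiner_flag, use_callback,
                      lora_scale=1.0, decode_tensors=None):
    kwargs = dict(
        prompt=prompt,
        negative_prompt=neg_prompt,
        image=control_image,
    )
    # The refiner consumes latents, so the base must output them.
    kwargs["output_type"] = "latent" if refiner_flag else "pil"
    if lora_flag:
        kwargs["cross_attention_kwargs"] = {"scale": lora_scale}
    if use_callback:
        kwargs["callback_on_step_end"] = decode_tensors
        kwargs["callback_on_step_end_tensor_inputs"] = ["latents"]
    return kwargs
```

With something like this, `self.base(**build_base_kwargs(...))` would replace all eight base calls, and the refiner call could be guarded by a single `if` on REFINER_FLAG.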

Experimental Results

From here, I will describe the details of various experiments conducted on Google Colab by changing parameters using the code above.

Experiment 1

First, we will conduct an experiment in a normal state without introducing LoRA or ControlNet.
Subsequent experiments will use the results of this Experiment 1 as a baseline for comparison.

Settings


config_text = """
[SDXLC]
device = auto
n_steps=28
high_noise_frac=None
seed=42

vae_model_path = None
base_model_path = Asahina2K/Animagine-xl-3.1-diffuser-variant-fp16
refiner_model_path = None


controlnet_path = xinsir/controlnet-openpose-sdxl-1.0

control_mode = openpose

lora_weight_path = None
lora_scale = 0.0

use_karras_sigmas = True
scheduler_algorithm_type = dpmsolver++
solver_order = 2


cfg_scale = 7.0
width = 832
height = 1216
output_type = pil
aesthetic_score = 6
negative_aesthetic_score = 2.5

save_latent_simple = False
save_latent_overstep = False
save_latent_approximation = False

"""

with open("configs/config.ini", "w", encoding="utf-8") as f:
 f.write(config_text)

use_lora = False
 controlnet_conditioning_scale = 0.0

By specifying controlnet_conditioning_scale = 0.0 as in the 7th cell, ControlNet itself is loaded and used for calculations, but its influence can be made zero.
(Note: Since the calculation time will be longer, it is better to use the content from the Part 6 or 6.5 articles if you really do not intend to use ControlNet.)
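The reason a scale of 0.0 nullifies the influence is that diffusers multiplies the ControlNet residuals by controlnet_conditioning_scale before adding them to the UNet's feature maps. Schematically (plain lists stand in for the actual tensors):

```python
def apply_controlnet_scale(unet_features, controlnet_residuals, scale):
    # Each ControlNet residual is scaled, then added to the matching
    # UNet feature map; scale=0.0 leaves the UNet output unchanged.
    return [f + scale * r for f, r in zip(unet_features, controlnet_residuals)]
```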

Also, by specifying lora_weight_path = None in the 4th cell, the setting will be configured not to use LoRA.
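Note that configparser returns every value as a string, so an entry like `lora_weight_path = None` arrives as the literal string "None" and has to be converted back before use. A minimal sketch of one way to read such values (the helper name is my own, not from the article's code):

```python
import configparser

def read_optional(section, key):
    """Return None for the literal string 'None', else the raw string value."""
    value = section.get(key)
    return None if value == "None" else value

config = configparser.ConfigParser()
config.read_string("""
[SDXLC]
lora_weight_path = None
base_model_path = Asahina2K/Animagine-xl-3.1-diffuser-variant-fp16
""")
section = config["SDXLC"]
lora_weight_path = read_optional(section, "lora_weight_path")  # becomes None
base_model_path = read_optional(section, "base_model_path")
```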

Results

Execution time

generate image time:  36.22908902168274
generate image time:  34.59099340438843
generate image time:  33.817437410354614

The generated images are shown below.



If you compare them with the Part 6.5 article, you can see that exactly the same images were generated, confirming that a conditioning scale of 0.0 leaves the output unaffected and that the images were generated as intended.
From here on, we will experiment with ControlNet and other features based on these images.

Experiment 2

We will try various ControlNet modes without using LoRA.

Settings

4th Cell

config_text = """
[SDXLC]
device = auto
n_steps=28
high_noise_frac=None
seed=42

vae_model_path = None
base_model_path = Asahina2K/Animagine-xl-3.1-diffuser-variant-fp16
refiner_model_path = None

controlnet_path = xinsir/controlnet-openpose-sdxl-1.0
;controlnet_path = diffusers/controlnet-canny-sdxl-1.0
;controlnet_path = diffusers/controlnet-depth-sdxl-1.0
;controlnet_path = diffusers/controlnet-zoe-depth-sdxl-1.0
;controlnet_path = xinsir/controlnet-tile-sdxl-1.0

control_mode = openpose
;control_mode = openpose_face
;control_mode = openpose_faceonly
;control_mode = openpose_full
;control_mode = canny
;control_mode = depth
;control_mode = zoe_depth
;control_mode = tile


lora_weight_path = None
lora_scale = 1.0

use_karras_sigmas = True
scheduler_algorithm_type = dpmsolver++
solver_order = 2

cfg_scale = 7.0
width = 832
height = 1216
output_type = pil
aesthetic_score = 6
negative_aesthetic_score = 2.5

save_latent_simple = False
save_latent_overstep = False
save_latent_approximation = False

"""

with open("configs/config.ini", "w", encoding="utf-8") as f:
  f.write(config_text)
5th Cell
use_lora = False
7th Cell
 controlnet_conditioning_scale = 0.5

Subsequent experiments will be conducted while changing the control_mode and controlnet_path.
Since controlnet_conditioning_scale = 0.5, the generated images will not be overly constrained by ControlNet.

Results 2-1

Settings

controlnet_path = xinsir/controlnet-openpose-sdxl-1.0
control_mode = openpose

Reference Image Before Conversion

Reference Image After Conversion

Execution Time

generate image time:  36.401485204696655
generate image time:  34.710044145584106
generate image time:  33.80799460411072

Generated Images



The pose is roughly similar to the reference image.

Results 2-2

Settings

controlnet_path = xinsir/controlnet-openpose-sdxl-1.0
control_mode = openpose_face

Reference Image Before Conversion

Same as above

Reference Image After Conversion

Execution Time

generate image time:  34.58041715621948
generate image time:  33.101900815963745
generate image time:  34.591469526290894

Generated Images



Compared to the previous result, I think the face is now more directly facing the front.

Results 2-3

Settings

controlnet_path = xinsir/controlnet-openpose-sdxl-1.0
control_mode = openpose_faceonly

Reference Image Before Conversion

Same as above

Reference Image After Conversion

Execution Time

generate image time:  34.97057867050171
generate image time:  33.347087383270264
generate image time:  33.509255170822144

Generated Images



This time, the reference was not reflected well. Although the face is facing forward, its position was not really maintained.

Results 2-4

Settings

controlnet_path = xinsir/controlnet-openpose-sdxl-1.0
control_mode = openpose_full

Reference Image Before Conversion

Same as above

Reference Image After Conversion

Execution Time

generate image time:  36.971638202667236
generate image time:  34.228318214416504
generate image time:  33.53951096534729

Generated Images



The pose is captured reasonably well, but the hand shapes were not reflected. In the OpenPose series, this might be the safest one to use.

Results 2-5

Settings

controlnet_path = diffusers/controlnet-canny-sdxl-1.0
control_mode = canny

Reference Image Before Conversion

Same as above

Reference Image After Conversion

Execution Time

generate image time:  35.687371253967285
generate image time:  34.79979658126831
generate image time:  34.74956750869751

Generated Images



It reflected the prompt content while maintaining the reference image. Because of the resizing, the face looks quite long.

Results 2-6

Settings

controlnet_path = diffusers/controlnet-depth-sdxl-1.0
control_mode = depth

Reference Image Before Conversion

Same as above

Reference Image After Conversion

Execution Time

generate image time:  35.34699082374573
generate image time:  35.2352192401886
generate image time:  33.86982488632202

Generated Images



Compared to Canny, this has higher flexibility, resulting in more varied images. I felt this would be the easiest to use if you want to reflect the prompt while keeping the original image's composition. (This is because Canny can be pulled too much by the line drawings, and OpenPose Full sometimes fails to reflect details like hands.)

Results 2-7

Settings

controlnet_path = diffusers/controlnet-zoe-depth-sdxl-1.0
control_mode = zoe_depth

Reference Image Before Conversion

Same as above

Reference Image After Conversion

Execution Time

generate image time:  34.49560046195984
generate image time:  34.84618639945984
generate image time:  33.521161794662476

Generated Images



Compared to standard Depth, I feel like the faces are generated better here, though it might just be a coincidence. The process of creating the ControlNet input image from the reference image takes a little time, which makes it slightly less convenient. If you want to generate even higher-quality images, it might be better to use this instead of Depth.

Results 2-8

Settings

controlnet_path = xinsir/controlnet-tile-sdxl-1.0
control_mode = tile

Reference Image Before Conversion

Same as above

Reference Image After Conversion

Execution Time

generate image time:  35.9168918132782
generate image time:  34.8120903968811
generate image time:  33.69835591316223

Generated Images



The resulting images look like the original image with the prompt's content reflected onto it. This could be useful for tasks like converting a real-life image into an anime-style one.

Experiment 3

Next, we will introduce LoRA.
We will conduct the experiment without using ControlNet.

Settings

4th Cell

config_text = """
[SDXLC]
device = auto
n_steps=28
high_noise_frac=None
seed=42

vae_model_path = None
base_model_path = Asahina2K/Animagine-xl-3.1-diffuser-variant-fp16
refiner_model_path = None

controlnet_path = diffusers/controlnet-depth-sdxl-1.0

control_mode = depth

lora_weight_path = ./inputs/DreamyvibesartstyleSDXL.safetensors
lora_scale = 1.0

use_karras_sigmas = True
scheduler_algorithm_type = dpmsolver++
solver_order = 2

cfg_scale = 7.0
width = 832
height = 1216
output_type = pil
aesthetic_score = 6
negative_aesthetic_score = 2.5

save_latent_simple = False
save_latent_overstep = False
save_latent_approximation = False

"""

with open("configs/config.ini", "w", encoding="utf-8") as f:
  f.write(config_text)

5th Cell
use_lora = True
7th Cell
 controlnet_conditioning_scale = 0.0

By specifying controlnet_conditioning_scale = 0.0 as in the 7th cell, ControlNet itself is loaded and used for calculations, but its influence can be made zero.

Also, by specifying lora_weight_path = ./inputs/DreamyvibesartstyleSDXL.safetensors in the 4th cell, we use the LoRA model stored in the specified path.

Furthermore, by setting use_lora = True in the 5th cell, the trigger words are added to the existing prompt.
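Concretely, this just appends the LoRA's trigger phrase to the prompt before generation. A minimal sketch (the trigger phrase commonly listed for DreamyvibesartstyleSDXL is "Dreamyvibes artstyle"; treat the exact wording as an assumption and check the model's distribution page):

```python
TRIGGER_WORDS = "Dreamyvibes artstyle"  # assumed trigger phrase for this LoRA

def apply_trigger(prompt: str, use_lora: bool) -> str:
    """Append the LoRA trigger phrase to the prompt when LoRA is enabled."""
    return f"{prompt}, {TRIGGER_WORDS}" if use_lora else prompt
```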

Results

Execution time

generate image time:  43.232848167419434
generate image time:  40.214457988739014
generate image time:  40.839759826660156

The generated images are shown below.


I feel like the images have become even cuter than those from the original model.

Experiment 4

Next, we will try one ControlNet mode with LoRA introduced.

Settings

4th Cell

config_text = """
[SDXLC]
device = auto
n_steps=28
high_noise_frac=None
seed=42

vae_model_path = None
base_model_path = Asahina2K/Animagine-xl-3.1-diffuser-variant-fp16
refiner_model_path = None

controlnet_path = diffusers/controlnet-depth-sdxl-1.0
control_mode = depth

lora_weight_path = ./inputs/DreamyvibesartstyleSDXL.safetensors
lora_scale = 1.0

use_karras_sigmas = True
scheduler_algorithm_type = dpmsolver++
solver_order = 2

cfg_scale = 7.0
width = 832
height = 1216
output_type = pil
aesthetic_score = 6
negative_aesthetic_score = 2.5

save_latent_simple = False
save_latent_overstep = False
save_latent_approximation = False

"""

with open("configs/config.ini", "w", encoding="utf-8") as f:
  f.write(config_text)

5th Cell
use_lora = True
7th Cell
 controlnet_conditioning_scale = 0.7

Reference Image


(Since the image used was a copyrighted work, I am only providing the depth image. The original image was one of the images created with the prompt used in the official tutorial provided by ANIMAGINE XL 3.1 (likely around seed 42-47).)

Results

Execution time

generate image time:  43.19255384778535
generate image time:  40.45602893829346
generate image time:  40.45509457588196

The generated images are shown below.


Although there were some unexpected results, such as the hair color turning red, I was surprised that such high-quality, cute art could be generated while specifying the composition to this extent.
It feels like almost any composition, character, or style could be generated by combining ControlNet and LoRA.

Summary

In this article, we created an environment for beginners to easily use Generative AI on Google Colaboratory.

I truly realized how high the performance of SDXL is.
I hope the community gets excited about SD3 as well, and that high-performance fine-tuned models and LoRA models for it are released. I look forward to the future.

I'm reaching the limit of 80,000 characters around here, so I'll wrap it up!
Thank you very much!
