xinsir/controlnet-canny-sdxl-1.0

xinsir committed on May 11, 2024

Commit 535f0a1 · verified · 1 Parent(s): 3207ab5

Update README.md

Files changed (1)

README.md +22 -1


@@ -88,6 +88,24 @@ import torch
 import numpy as np
 import cv2
 controlnet_conditioning_scale = 1.0
 prompt = "your prompt, the longer the better, you can describe it as detail as possible"
 negative_prompt = 'longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality'
@@ -143,7 +161,10 @@ images[0].save(f"your image save path, png format is usually better than jpg or
 ## Training Details
-The model is trained using high quality data, only 1 stage training. The resolution setting is the same with sdxl-base, 1024*1024
 ### Training Data

 import numpy as np
 import cv2
+def HWC3(x):
+    assert x.dtype == np.uint8
+    if x.ndim == 2:
+        x = x[:, :, None]
+    assert x.ndim == 3
+    H, W, C = x.shape
+    assert C == 1 or C == 3 or C == 4
+    if C == 3:
+        return x
+    if C == 1:
+        return np.concatenate([x, x, x], axis=2)
+    if C == 4:
+        color = x[:, :, 0:3].astype(np.float32)
+        alpha = x[:, :, 3:4].astype(np.float32) / 255.0
+        y = color * alpha + 255.0 * (1.0 - alpha)
+        y = y.clip(0, 255).astype(np.uint8)
+        return y
 controlnet_conditioning_scale = 1.0
 prompt = "your prompt, the longer the better, you can describe it as detail as possible"
 negative_prompt = 'longbody, lowres, bad anatomy, bad hands, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality'
 ## Training Details
+The model is trained on high-quality data in a single stage, at the same resolution as sdxl-base, 1024*1024. Following Lvmin Zhang, we generate the canny images with random thresholds. Choosing suitable hyperparameters
+for this data augmentation is essential: conditions that are too easy or too hard both hurt model performance. In addition, we apply a random mask that removes a random percentage of the canny images, which forces the model to learn more of the semantic relationship between the prompt and the lines.
+We use over 10,000,000 carefully annotated images; CogVLM proved to be a powerful image captioning model [https://github.com/THUDM/CogVLM?tab=readme-ov-file]. For comic images, we recommend using the waifu tagger to generate special tags
+[https://huggingface.co/spaces/SmilingWolf/wd-tagger]. More than 64 A100s are used to train the model, and the effective batch size is 2560 with accumulate_grad_batches.
 ### Training Data
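
The two augmentations described in the added Training Details text (random Canny thresholds and random masking of the edge map) could look roughly like the sketch below. This is not the author's training code: the function names, the threshold ranges, the 64-pixel patch size, and the choice to mask square patches within each image are all assumptions made for illustration. Only `cv2.Canny` and the NumPy calls are real library APIs.

```python
import numpy as np


def sample_canny_thresholds(rng, low_range=(50, 150), gap_range=(50, 150)):
    """Draw a random (low, high) hysteresis threshold pair with low < high.

    The ranges are illustrative assumptions, not values from the README.
    """
    low = int(rng.integers(low_range[0], low_range[1] + 1))
    high = low + int(rng.integers(gap_range[0], gap_range[1] + 1))
    return low, high


def random_patch_mask(edges, rng, max_fraction=0.5, patch=64):
    """Zero out a random share (0..max_fraction) of patch x patch blocks.

    Patch-wise masking is an assumed reading of "random mask"; the README
    does not say how the masked region is chosen.
    """
    h, w = edges.shape[:2]
    rows, cols = -(-h // patch), -(-w // patch)  # ceiling division
    fraction = rng.uniform(0.0, max_fraction)
    drop = rng.random((rows, cols)) < fraction  # True = block is masked out
    # Expand the block grid to pixel resolution, then crop to the image size.
    pixel_mask = np.repeat(np.repeat(drop, patch, axis=0), patch, axis=1)[:h, :w]
    out = edges.copy()
    out[pixel_mask] = 0
    return out


def augment_canny(image_uint8, rng):
    """Build one augmented edge map from a uint8 image (hypothetical helper)."""
    import cv2  # imported lazily so the helpers above work without OpenCV

    low, high = sample_canny_thresholds(rng)
    edges = cv2.Canny(image_uint8, low, high)
    return random_patch_mask(edges, rng)
```

With `rng = np.random.default_rng()`, calling `augment_canny(image, rng)` once per training sample would give each sample a different edge density and a different masked region. The resulting single-channel map can then go through the `HWC3` helper added in this commit to obtain a three-channel control image.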