elismasilva committed on
Commit
62561bb
·
1 Parent(s): 6dc45e0

add differential diffusion inpaint

.gitignore CHANGED
@@ -17,4 +17,5 @@ bk/
17
  outputs/
18
  original/
19
  Makefile
20
- pyproject.toml
 
 
17
  outputs/
18
  original/
19
  Makefile
20
+ pyproject.toml
21
+ README_.md
README.md CHANGED
@@ -41,29 +41,54 @@ pip install -r requirements.txt
41
 
42
  ## 🚀 Usage
43
 
44
- ## 📂 Repository Structure
45
 
46
- * `./transformer/z_image_turbo_control_unified_v2.1_q4_k_m.gguf`: The unified, quantized Q4_K_M model weights.
47
- * `./transformer/z_image_turbo_control_unified_v2.1_q8_0.gguf`: The unified, quantized Q8_0 model weights.
48
- * `infer_controlnet.py`: Script for running controlnet inference.
49
- * `infer_inpaint.py`: Script for running inpaint inference.
50
- * `infer_t2i.py`: Script for running text-to-image inference.
51
- * `infer_i2i.py`: Script for running image-to-image inference.
52
- * `diffusers_local/`: Custom pipeline code (`ZImageControlUnifiedPipeline`) and transformer logic.
53
- * `requirements.txt`: Python dependencies.
54
 
55
- The primary script for inference is `infer_controlnet.py`, which is designed to handle all supported generation modes.
56
 
57
- ### Option 1: Low VRAM (GGUF) - Recommended
58
- Use this version if you have limited VRAM (e.g., 6GB - 8GB). It loads the model from a quantized **GGUF** file (`z_image_turbo_control_unified_v2.1_q4_k_m.gguf`). Simply configure the `infer_controlnet.py` script to point to the GGUF file.
59
 
60
- **Key Features of this mode:**
61
- * Loads the unified transformer from a single 4-bit quantized file.
62
- * Enables aggressive `group_offload` to fit large models in consumer GPUs.
63
 
64
- ### Option 2: High Precision (Diffusers/BF16)
65
- Use this version if you have ample VRAM (e.g., 24GB+). Configure `infer_controlnet.py` to load the model using the standard `from_pretrained` directory structure for full **BFloat16** precision.
66
 
67
 
68
  ## 🛠️ Model Features & Configuration (V2)
69
 
@@ -76,7 +101,7 @@ Use this version if you have ample VRAM (e.g., 24GB+). Configure `infer_controln
76
 
77
  This optimized V2 model introduces several new features and parameters for enhanced control and flexibility:
78
 
79
- * **Unified Pipeline:** A single pipeline now handles Text-to-Image, Image-to-Image, ControlNet, and Inpainting tasks.
80
  * **Refiner Scale (`controlnet_refiner_conditioning_scale`):** It provides fine-grained control over the influence of the initial refining layers, allowing for isolated adjustments without the influence of the controlnet's conditioning force.
81
  * **Optional Refiner (`add_control_noise_refiner=False`):** You can now disable the control noise refiner layers when loading the model to save memory or for different stylistic results.
82
  * **Inpainting Blur (`mask_blur_radius`):** A parameter to soften the edges of the inpainting mask for smoother transitions.
@@ -98,12 +123,18 @@ The new `controlnet_refiner_conditioning_scale` parameter allows for fine-tuning
98
 
99
  <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
100
  <tr>
101
- <td>Pose + Inpaint</td>
102
- <td>Output</td>
 
 
 
103
  </tr>
104
  <tr>
105
- <td><img src="assets/inpaint.jpg" width="100%" /><img src="assets/mask_inpaint.jpg" width="100%" /></td>
106
- <td><img src="results/new_tests/result_inpaint.png" width="100%" /></td>
 
 
 
107
  </tr>
108
  </table>
109
  <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
@@ -188,3 +219,16 @@ The table below shows the generation results under different combinations of Dif
188
  | **20** | ![](results/scale_test/20_scale_0.65.png) | ![](results/scale_test/20_scale_0.70.png) | ![](results/scale_test/20_scale_0.75.png) | ![](results/scale_test/20_scale_0.8.png) | ![](results/scale_test/20_scale_0.9.png) | ![](results/scale_test/20_scale_1.0.png) |
189
  | **30** | ![](results/scale_test/30_scale_0.65.png) | ![](results/scale_test/30_scale_0.70.png) | ![](results/scale_test/30_scale_0.75.png) | ![](results/scale_test/30_scale_0.8.png) | ![](results/scale_test/30_scale_0.9.png) | ![](results/scale_test/30_scale_1.0.png) |
190
  | **40** | ![](results/scale_test/40_scale_0.65.png) | ![](results/scale_test/40_scale_0.70.png) | ![](results/scale_test/40_scale_0.75.png) | ![](results/scale_test/40_scale_0.8.png) | ![](results/scale_test/40_scale_0.9.png) | ![](results/scale_test/40_scale_1.0.png) |
41
 
42
  ## 🚀 Usage
43
 
44
+ This repository provides separate, easy-to-use scripts for each generation task.
45
+
46
+ ### High-Level Scripts
47
+ * `infer_t2i.py`: For Text-to-Image generation.
48
+ * `infer_i2i.py`: For Image-to-Image generation.
49
+ * `infer_controlnet.py`: For ControlNet-guided generation (Pose, Canny, Depth, etc.).
50
+ * `infer_inpaint.py`: For all inpainting tasks.
51
+
52
+ ### Hardware Options
53
+
54
+ #### Option 1: Low VRAM (GGUF) - Recommended
55
+ Use this version if you have limited VRAM (e.g., 6GB - 8GB). It loads the model from a quantized **GGUF** file. To use it, set `use_gguf = True` in the desired inference script and provide the path to the `.gguf` file.
56
+
57
+ **Key Features:**
58
+ * Loads the unified transformer from a single 4-bit or 8-bit quantized file.
59
+ * Enables aggressive `group_offload` to fit large models on consumer GPUs.
60
 
61
+ #### Option 2: High Precision (Diffusers/BF16)
62
+ Use this version if you have ample VRAM (e.g., 24GB+). Set `use_gguf = False` in the script to load the model using the standard `from_pretrained` directory structure for full **BFloat16** precision.
63
 
64
+ ## 🎨 Inpainting Guide
65
 
66
+ The `infer_inpaint.py` script uses the unified inpainting pipeline, which supports multiple modes selected via the `inpaint_mode` parameter.
 
67
 
68
+ ### Preparing Your Mask
69
+ For best results, especially when removing objects or dealing with complex edges, it's recommended to pre-process your mask. We provide a utility script for this.
70
+
71
+ **`prepare_mask.py`**
72
+ This script expands the white areas of your mask and applies a feather (blur) to the edges. This helps to completely cover artifacts from the old image and ensures a smooth, seamless blend with the new generated content.
73
+
74
+ **Usage:**
75
+ ```bash
76
+ python prepare_mask.py <input_mask_path> <output_mask_path> --expand 15 --feather 10
77
+ ```
78
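+ * `--expand`: Number of pixels by which to expand (dilate) the white area of the mask, so it fully covers "ghosting" left over from the old image.
79
+ * `--feather`: Radius of the Gaussian blur that creates a soft gradient at the mask edge for seamless blending.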
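
As a rough illustration of what such a pre-processing step could look like, here is a minimal Pillow sketch. It is an assumption about the approach (dilate with `ImageFilter.MaxFilter`, then soften with `ImageFilter.GaussianBlur`), not the actual contents of `prepare_mask.py`; the function name `expand_and_feather_sketch` is hypothetical.

```python
from PIL import Image, ImageFilter


def expand_and_feather_sketch(mask: Image.Image, expand_pixels: int = 15, feather_radius: int = 10) -> Image.Image:
    """Hypothetical helper: grow the white area of a mask, then blur its edge."""
    # Work in 8-bit grayscale so both filters operate on a single channel.
    mask = mask.convert("L")

    if expand_pixels > 0:
        # MaxFilter requires an odd kernel size; 2 * n + 1 grows white areas by n pixels.
        mask = mask.filter(ImageFilter.MaxFilter(2 * expand_pixels + 1))

    if feather_radius > 0:
        # A Gaussian blur turns the hard edge into a soft gradient.
        mask = mask.filter(ImageFilter.GaussianBlur(feather_radius))

    return mask
```

With the values from the command above, this would be called as `expand_and_feather_sketch(Image.open(path), 15, 10)`; setting either argument to `0` skips that stage.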
80
 
81
+ ### Inpainting Modes in `infer_inpaint.py`
82
+ You can choose the inpainting method by setting the `inpaint_mode` variable in the script:
83
 
84
+ 1. **`inpaint_mode = "default"`**
85
+ * Uses the standard ControlNet-based inpainting. Good for general-purpose tasks.
86
+
87
+ 2. **`inpaint_mode = "diff"`**
88
+ * Uses the "Differential Diffusion" inpainting technique. This method is designed to preserve the original background texture and lighting while generating new content in the masked area. It works by blending the original image's noised latents back into the preserved region at each step of the diffusion process.
89
+
90
+ 3. **`inpaint_mode = "diff+inpaint"`**
91
+ * Combines both methods. It uses the `diff` mode for background preservation while also feeding the inpainting context to the ControlNet layers. This can be useful for complex scenes where both structural guidance and texture preservation are needed.
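
The way the three modes map onto internal behavior can be sketched as below. The flag names mirror the booleans this commit introduces in the pipeline's `__call__` (`is_inpaint_control_mode`, `is_diff_mode`, `is_img2img_mode`); the wrapper function `resolve_inpaint_flags`, the `InpaintFlags` tuple, and the `ValueError` check are hypothetical additions for illustration only.

```python
from typing import NamedTuple


class InpaintFlags(NamedTuple):
    is_inpaint_control_mode: bool  # feed image + mask to the ControlNet layers
    is_diff_mode: bool             # blend original latents back in at each step
    is_img2img_mode: bool          # image given without a mask


def resolve_inpaint_flags(has_image: bool, has_mask: bool, inpaint_mode: str = "default") -> InpaintFlags:
    """Hypothetical helper mirroring the mode booleans set in the pipeline."""
    if inpaint_mode not in ("default", "diff", "diff+inpaint"):
        # Validation is an assumption of this sketch, not shown in the commit.
        raise ValueError(f"Unknown inpaint_mode: {inpaint_mode!r}")

    has_inpaint_inputs = has_image and has_mask
    return InpaintFlags(
        is_inpaint_control_mode=has_inpaint_inputs and inpaint_mode in ("default", "diff+inpaint"),
        is_diff_mode=has_inpaint_inputs and inpaint_mode in ("diff", "diff+inpaint"),
        is_img2img_mode=has_image and not has_inpaint_inputs,
    )
```

Note that the inpainting flags only become active when both an image and a mask are supplied; an image alone falls back to image-to-image behavior regardless of `inpaint_mode`.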
92
 
93
  ## 🛠️ Model Features & Configuration (V2)
94
 
 
101
 
102
  This optimized V2 model introduces several new features and parameters for enhanced control and flexibility:
103
 
104
+ * **Unified Pipeline:** A single pipeline now handles Text-to-Image, Image-to-Image, ControlNet, and multiple Inpainting modes.
105
  * **Refiner Scale (`controlnet_refiner_conditioning_scale`):** It provides fine-grained control over the influence of the initial refining layers, allowing for isolated adjustments without the influence of the controlnet's conditioning force.
106
  * **Optional Refiner (`add_control_noise_refiner=False`):** You can now disable the control noise refiner layers when loading the model to save memory or for different stylistic results.
107
  * **Inpainting Blur (`mask_blur_radius`):** A parameter to soften the edges of the inpainting mask for smoother transitions.
 
123
 
124
  <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
125
  <tr>
126
+ <td>Pose + Inpaint Image</td>
127
+ <td>Inpaint Mask</td>
128
+ <td>Model Inpaint</td>
129
+ <td>Diff Inpaint</td>
130
+ <td>Diff + Model Inpaint</td>
131
  </tr>
132
  <tr>
133
+ <td><img src="assets/pose.jpg" width="100%" /><img src="assets/inpaint.jpg" width="100%" /></td>
134
+ <td><img src="assets/inpaint_mask.jpg" width="100%" /></td>
135
+ <td><img src="results/new_tests/result_inpaint_default.png" width="100%" /></td>
136
+ <td><img src="results/new_tests/result_inpaint_diff.png" width="100%" /></td>
137
+ <td><img src="results/new_tests/result_inpaint_diffinpaint.png" width="100%" /></td>
138
  </tr>
139
  </table>
140
  <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
 
219
  | **20** | ![](results/scale_test/20_scale_0.65.png) | ![](results/scale_test/20_scale_0.70.png) | ![](results/scale_test/20_scale_0.75.png) | ![](results/scale_test/20_scale_0.8.png) | ![](results/scale_test/20_scale_0.9.png) | ![](results/scale_test/20_scale_1.0.png) |
220
  | **30** | ![](results/scale_test/30_scale_0.65.png) | ![](results/scale_test/30_scale_0.70.png) | ![](results/scale_test/30_scale_0.75.png) | ![](results/scale_test/30_scale_0.8.png) | ![](results/scale_test/30_scale_0.9.png) | ![](results/scale_test/30_scale_1.0.png) |
221
  | **40** | ![](results/scale_test/40_scale_0.65.png) | ![](results/scale_test/40_scale_0.70.png) | ![](results/scale_test/40_scale_0.75.png) | ![](results/scale_test/40_scale_0.8.png) | ![](results/scale_test/40_scale_0.9.png) | ![](results/scale_test/40_scale_1.0.png) |
222
+
223
+ ---
224
+
225
+ ## 📂 Repository Structure
226
+
227
+ * `./transformer/`: Directory for model weights (GGUF or standard).
228
+ * `infer_controlnet.py`: Script for ControlNet inference.
229
+ * `infer_inpaint.py`: Script for inpainting inference.
230
+ * `infer_t2i.py`: Script for Text-to-Image inference.
231
+ * `infer_i2i.py`: Script for Image-to-Image inference.
232
+ * `prepare_mask.py`: Utility script to process masks for inpainting.
233
+ * `diffusers_local/`: Custom pipeline code.
234
+ * `requirements.txt`: Python dependencies.
assets/inpaint_mask.jpg ADDED

Git LFS Details

  • SHA256: ceb6d79bacb39bf5378af4ac953522ccd7500cffbbe04d8861643ece5427e8e4
  • Pointer size: 130 Bytes
  • Size of remote file: 31.6 kB
assets/mask_1.jpg ADDED

Git LFS Details

  • SHA256: c294c68a191353e60b40c43a9922e3d1ec4017f86a34865570fc5acbbc01858c
  • Pointer size: 130 Bytes
  • Size of remote file: 73.4 kB
assets/{mask_inpaint.jpg → mask_2.jpg} RENAMED
File without changes
diffusers_local/pipeline_z_image_control_unified.py CHANGED
@@ -15,11 +15,12 @@
15
 
16
 
17
  import inspect
18
- from typing import Any, Callable, Dict, List, Optional, Tuple, Union
19
 
20
  import numpy as np
21
  import torch
22
  import torch.nn.functional as F
 
23
  from diffusers import AutoencoderKL, DiffusionPipeline, FlowMatchEulerDiscreteScheduler
24
  from diffusers.image_processor import PipelineImageInput, VaeImageProcessor
25
  from diffusers.loaders import FromSingleFileMixin, ZImageLoraLoaderMixin
@@ -467,6 +468,8 @@ class ZImageControlUnifiedPipeline(DiffusionPipeline, ZImageLoraLoaderMixin, Fro
467
  reference_latents_shape: Tuple,
468
  device: torch.device,
469
  dtype: torch.dtype,
 
 
470
  ) -> torch.Tensor:
471
  """
472
  Processes a MASK using the mask_processor, inverts it, resizes it, and formats it for the control_context.
@@ -494,13 +497,18 @@ class ZImageControlUnifiedPipeline(DiffusionPipeline, ZImageLoraLoaderMixin, Fro
494
  )
495
  return torch.zeros(placeholder_shape, device=device, dtype=dtype)
496
 
497
- mask_condition = self.mask_processor.preprocess(mask_image, height=height, width=width).to(device=device, dtype=dtype)
 
 
 
 
498
 
499
- mask_for_inpainting = 1.0 - mask_condition
500
-
501
- mask_latents = F.interpolate(mask_for_inpainting, size=reference_latents_shape[-2:], mode="nearest")
502
-
503
- return mask_latents.unsqueeze(2)
 
504
 
505
  def prepare_control_latents(
506
  self, image: PipelineImageInput, width: int, height: int, batch_size: int, num_images_per_prompt: int, device: torch.device, dtype: torch.dtype
@@ -595,7 +603,8 @@ class ZImageControlUnifiedPipeline(DiffusionPipeline, ZImageLoraLoaderMixin, Fro
595
  prompt: Union[str, List[str]],
596
  image: Optional[PipelineImageInput] = None,
597
  mask_image: Optional[PipelineImageInput] = None,
598
- mask_blur_radius: float = 4.0,
 
599
  control_image: Optional[PipelineImageInput] = None,
600
  height: Optional[int] = None,
601
  width: Optional[int] = None,
@@ -630,7 +639,10 @@ class ZImageControlUnifiedPipeline(DiffusionPipeline, ZImageLoraLoaderMixin, Fro
630
  The initial image for image-to-image or inpainting modes.
631
  mask_image (`PipelineImageInput`, *optional*):
632
  The mask image for inpainting. White areas are preserved, black areas are inpainted.
633
- mask_blur_radius (`float`, *optional*, defaults to 4.0):
 
 
 
634
  The radius for blurring the edges of the inpainting mask to create a smoother transition.
635
  control_image (`PipelineImageInput`, *optional*):
636
  The conditioning image for control modes (e.g., Canny, depth).
@@ -640,21 +652,21 @@ class ZImageControlUnifiedPipeline(DiffusionPipeline, ZImageLoraLoaderMixin, Fro
640
  The width in pixels of the generated image.
641
  num_inference_steps (`int`, *optional*, defaults to 20):
642
  The number of denoising steps. More denoising steps usually lead to a higher quality image at the
643
- expense of slower inference.
644
  sigmas (`List[float]`, *optional*):
645
  Custom sigmas to use for the denoising process. If not defined, the scheduler's default behavior
646
- will be used.
647
  strength (`float`, *optional*, defaults to 1.0):
648
  Denoising strength for image-to-image. A value of 1.0 means the initial image is fully replaced,
649
- while a lower value preserves more of the original image structure. Only used in img2img mode.
650
  guidance_scale (`float`, *optional*, defaults to 4.0):
651
  The scale for classifier-free guidance. A value > 1 enables it. Higher values encourage images
652
- closer to the prompt, potentially at the cost of quality.
653
  cfg_normalization (`bool`, *optional*, defaults to False):
654
  Whether to apply normalization to the guidance, which can prevent oversaturation.
655
  cfg_truncation (`float`, *optional*, defaults to 1.0):
656
  A value between 0.0 and 1.0 that disables CFG for the final portion of the denoising steps,
657
- specified as a fraction of total steps. For example, 0.8 disables CFG for the last 20% of steps.
658
  negative_prompt (`str` or `List[str]`, *optional*):
659
  The prompt or prompts not to guide the image generation.
660
  num_images_per_prompt (`int`, *optional*, defaults to 1):
@@ -698,8 +710,12 @@ class ZImageControlUnifiedPipeline(DiffusionPipeline, ZImageLoraLoaderMixin, Fro
698
  is_two_stage_control_model = self.transformer.control_in_dim > self.transformer.in_channels if hasattr(self.transformer, "control_in_dim") else False
699
  device = self._execution_device
700
  dtype = self.transformer.dtype
701
- vae_scale = self.vae_scale_factor * 2
702
-
 
 
 
 
703
  ref_image = control_image or image
704
  image_height = None
705
  image_width = None
@@ -742,22 +758,23 @@ class ZImageControlUnifiedPipeline(DiffusionPipeline, ZImageLoraLoaderMixin, Fro
742
  prompt_embeds_model_input = prompt_embeds + negative_prompt_embeds
743
  else:
744
  prompt_embeds_model_input = prompt_embeds
745
-
746
- is_inpaint_mode = image is not None and mask_image is not None
747
- is_img2img_mode = image is not None and not is_inpaint_mode
748
-
749
- if control_image is not None or is_inpaint_mode:
750
  control_latents = self.prepare_control_latents(control_image, width, height, batch_size, num_images_per_prompt, device, dtype)
751
 
752
- if is_two_stage_control_model:
753
- mask_to_use = self._apply_mask_blur(mask_image, mask_blur_radius, is_inpaint_mode)
 
 
 
 
754
 
755
  inpaint_latents = self._prepare_image_latents(
756
- image, mask_to_use, width, height, batch_size, num_images_per_prompt, device, dtype, do_preprocess=True
757
  )
758
-
759
  mask_latents = self._prepare_mask_latents(
760
- mask_to_use,
761
  width,
762
  height,
763
  batch_size,
@@ -765,6 +782,8 @@ class ZImageControlUnifiedPipeline(DiffusionPipeline, ZImageLoraLoaderMixin, Fro
765
  inpaint_latents.shape,
766
  device,
767
  dtype,
 
 
768
  )
769
  control_context = torch.cat([control_latents, mask_latents, inpaint_latents], dim=1)
770
  else:
@@ -783,7 +802,7 @@ class ZImageControlUnifiedPipeline(DiffusionPipeline, ZImageLoraLoaderMixin, Fro
783
  timesteps, num_inference_steps = retrieve_timesteps(self.scheduler, num_inference_steps, device, sigmas, mu=mu)
784
  self._num_timesteps = len(timesteps)
785
 
786
- if is_img2img_mode and not is_inpaint_mode:
787
  strength = min(strength, 1.0)
788
  else:
789
  strength = 1.0
@@ -798,7 +817,8 @@ class ZImageControlUnifiedPipeline(DiffusionPipeline, ZImageLoraLoaderMixin, Fro
798
 
799
  latent_timestep = timesteps[:1].repeat(effective_batch_size) if strength < 1.0 else None
800
 
801
- use_image_for_latents = is_img2img_mode and not is_inpaint_mode
 
802
  latents = self.prepare_latents(
803
  effective_batch_size,
804
  self.transformer.in_channels,
@@ -811,33 +831,78 @@ class ZImageControlUnifiedPipeline(DiffusionPipeline, ZImageLoraLoaderMixin, Fro
811
  timestep=latent_timestep if use_image_for_latents else None,
812
  latents=latents,
813
  )
814
-
815
  num_warmup_steps = len(timesteps) - num_steps_to_run * self.scheduler.order
816
  with torch.inference_mode():
817
  with self.progress_bar(total=num_steps_to_run) as progress_bar:
818
  for i, t in enumerate(timesteps):
819
  if self.interrupt:
820
  continue
821
-
822
  timestep = t.expand(latents.shape[0])
823
  timestep = (1000 - timestep) / 1000
 
824
  t_norm = timestep[0].item()
825
-
826
  current_guidance_scale = self.guidance_scale
827
  if self.do_classifier_free_guidance and self._cfg_truncation is not None and float(self._cfg_truncation) <= 1:
828
  if t_norm > self._cfg_truncation:
829
  current_guidance_scale = 0.0
830
-
831
  apply_cfg = self.do_classifier_free_guidance and current_guidance_scale > 0
832
 
833
  if apply_cfg:
834
- latents_typed = latents.to(self.transformer.dtype)
835
- latent_model_input = latents_typed.repeat(2, 1, 1, 1)
836
  timestep_model_input = timestep.repeat(2)
837
  else:
838
- latent_model_input = latents.to(self.transformer.dtype)
839
  timestep_model_input = timestep
840
 
 
841
  latent_model_input = latent_model_input.unsqueeze(2)
842
  latent_model_input_list = list(latent_model_input.unbind(dim=0))
843
 
 
15
 
16
 
17
  import inspect
18
+ from typing import Any, Callable, Dict, List, Literal, Optional, Tuple, Union
19
 
20
  import numpy as np
21
  import torch
22
  import torch.nn.functional as F
23
+ import torchvision.transforms as T
24
  from diffusers import AutoencoderKL, DiffusionPipeline, FlowMatchEulerDiscreteScheduler
25
  from diffusers.image_processor import PipelineImageInput, VaeImageProcessor
26
  from diffusers.loaders import FromSingleFileMixin, ZImageLoraLoaderMixin
 
468
  reference_latents_shape: Tuple,
469
  device: torch.device,
470
  dtype: torch.dtype,
471
+ invert_mask: bool = False,
472
+ do_unsqueeze: bool = True,
473
  ) -> torch.Tensor:
474
  """
475
  Processes a MASK using the mask_processor, optionally inverts it (`invert_mask`), resizes it, and formats it for the control_context.
 
497
  )
498
  return torch.zeros(placeholder_shape, device=device, dtype=dtype)
499
 
500
+ mask_tensor = self.mask_processor.preprocess(mask_image, height=height, width=width)
501
+ mask_tensor = mask_tensor.to(device=device, dtype=dtype)
502
+
503
+ if invert_mask:
504
+ mask_tensor = 1.0 - mask_tensor
505
 
506
+ mask_latents = F.interpolate(mask_tensor, size=reference_latents_shape[-2:], mode="nearest")
507
+
508
+ if do_unsqueeze:
509
+ mask_latents = mask_latents.unsqueeze(2)
510
+
511
+ return mask_latents
512
 
513
  def prepare_control_latents(
514
  self, image: PipelineImageInput, width: int, height: int, batch_size: int, num_images_per_prompt: int, device: torch.device, dtype: torch.dtype
 
603
  prompt: Union[str, List[str]],
604
  image: Optional[PipelineImageInput] = None,
605
  mask_image: Optional[PipelineImageInput] = None,
606
+ inpaint_mode: Literal["default", "diff", "diff+inpaint"] = "default",
607
+ mask_blur_radius: float = 8.0,
608
  control_image: Optional[PipelineImageInput] = None,
609
  height: Optional[int] = None,
610
  width: Optional[int] = None,
 
639
  The initial image for image-to-image or inpainting modes.
640
  mask_image (`PipelineImageInput`, *optional*):
641
  The mask image for inpainting. White areas are preserved, black areas are inpainted.
642
+ inpaint_mode (`str`, *optional*, defaults to `"default"`):
643
+ The inpainting mode. Can be "default", "diff", or "diff+inpaint". Determines how the inpainting
644
+ process is handled.
645
+ mask_blur_radius (`float`, *optional*, defaults to 8.0):
646
  The radius for blurring the edges of the inpainting mask to create a smoother transition.
647
  control_image (`PipelineImageInput`, *optional*):
648
  The conditioning image for control modes (e.g., Canny, depth).
 
652
  The width in pixels of the generated image.
653
  num_inference_steps (`int`, *optional*, defaults to 20):
654
  The number of denoising steps. More denoising steps usually lead to a higher quality image at the
655
+ expense of slower inference.
656
  sigmas (`List[float]`, *optional*):
657
  Custom sigmas to use for the denoising process. If not defined, the scheduler's default behavior
658
+ will be used.
659
  strength (`float`, *optional*, defaults to 1.0):
660
  Denoising strength for image-to-image. A value of 1.0 means the initial image is fully replaced,
661
+ while a lower value preserves more of the original image structure. Only used in img2img mode.
662
  guidance_scale (`float`, *optional*, defaults to 4.0):
663
  The scale for classifier-free guidance. A value > 1 enables it. Higher values encourage images
664
+ closer to the prompt, potentially at the cost of quality.
665
  cfg_normalization (`bool`, *optional*, defaults to False):
666
  Whether to apply normalization to the guidance, which can prevent oversaturation.
667
  cfg_truncation (`float`, *optional*, defaults to 1.0):
668
  A value between 0.0 and 1.0 that disables CFG for the final portion of the denoising steps,
669
+ specified as a fraction of total steps. For example, 0.8 disables CFG for the last 20% of steps.
670
  negative_prompt (`str` or `List[str]`, *optional*):
671
  The prompt or prompts not to guide the image generation.
672
  num_images_per_prompt (`int`, *optional*, defaults to 1):
 
710
  is_two_stage_control_model = self.transformer.control_in_dim > self.transformer.in_channels if hasattr(self.transformer, "control_in_dim") else False
711
  device = self._execution_device
712
  dtype = self.transformer.dtype
713
+ vae_scale = self.vae_scale_factor * 2
714
+ has_inpaint_inputs = image is not None and mask_image is not None
715
+ is_inpaint_control_mode = has_inpaint_inputs and inpaint_mode in ["default", "diff+inpaint"]
716
+ is_diff_mode = has_inpaint_inputs and inpaint_mode in ["diff", "diff+inpaint"]
717
+ is_img2img_mode = image is not None and not has_inpaint_inputs
718
+
719
  ref_image = control_image or image
720
  image_height = None
721
  image_width = None
 
758
  prompt_embeds_model_input = prompt_embeds + negative_prompt_embeds
759
  else:
760
  prompt_embeds_model_input = prompt_embeds
761
+
762
+ if control_image is not None or is_inpaint_control_mode:
 
 
 
763
  control_latents = self.prepare_control_latents(control_image, width, height, batch_size, num_images_per_prompt, device, dtype)
764
 
765
+ if is_two_stage_control_model:
766
+ image_for_inpaint = None if is_diff_mode and not is_inpaint_control_mode else image
767
+ mask_for_inpaint = None if is_diff_mode and not is_inpaint_control_mode else mask_image
768
+
769
+ if is_inpaint_control_mode:
770
+ mask_for_inpaint = self._apply_mask_blur(mask_for_inpaint, mask_blur_radius, True)
771
 
772
  inpaint_latents = self._prepare_image_latents(
773
+ image_for_inpaint, mask_for_inpaint, width, height, batch_size, num_images_per_prompt, device, dtype
774
  )
775
+
776
  mask_latents = self._prepare_mask_latents(
777
+ mask_for_inpaint,
778
  width,
779
  height,
780
  batch_size,
 
782
  inpaint_latents.shape,
783
  device,
784
  dtype,
785
+ invert_mask=is_inpaint_control_mode,
786
+ do_unsqueeze=True,
787
  )
788
  control_context = torch.cat([control_latents, mask_latents, inpaint_latents], dim=1)
789
  else:
 
802
  timesteps, num_inference_steps = retrieve_timesteps(self.scheduler, num_inference_steps, device, sigmas, mu=mu)
803
  self._num_timesteps = len(timesteps)
804
 
805
+ if is_img2img_mode:
806
  strength = min(strength, 1.0)
807
  else:
808
  strength = 1.0
 
817
 
818
  latent_timestep = timesteps[:1].repeat(effective_batch_size) if strength < 1.0 else None
819
 
820
+ use_image_for_latents = is_img2img_mode
821
+
822
  latents = self.prepare_latents(
823
  effective_batch_size,
824
  self.transformer.in_channels,
 
831
  timestep=latent_timestep if use_image_for_latents else None,
832
  latents=latents,
833
  )
834
+
835
+ if is_diff_mode:
836
+ original_image_tensor = self.image_processor.preprocess(image, height=height, width=width).to(device=device, dtype=self.vae.dtype)
837
+ with torch.no_grad():
838
+ original_clean_latents = retrieve_latents(self.vae.encode(original_image_tensor), sample_mode="argmax")
839
+ original_clean_latents = (original_clean_latents - self.vae.config.shift_factor) * self.vae.config.scaling_factor
840
+ original_clean_latents = original_clean_latents.to(dtype)
841
+
842
+ noise = randn_tensor(original_clean_latents.shape, generator=generator, device=device, dtype=dtype)
843
+ latents_list = []
844
+ step_indices = [(self.scheduler.timesteps == t).nonzero().item() for t in timesteps]
845
+ for i in step_indices:
846
+ sigma = self.scheduler.sigmas[i]
847
+ noisy_latent = (1.0 - sigma) * original_clean_latents + sigma * noise
848
+ latents_list.append(noisy_latent)
849
+
850
+ original_latents_trajectory = torch.cat(latents_list, dim=0)
851
+ blurred_mask_image = self._apply_mask_blur(mask_image, mask_blur_radius, True)
852
+ map_processed = self._prepare_mask_latents(
853
+ blurred_mask_image,
854
+ width,
855
+ height,
856
+ batch_size,
857
+ num_images_per_prompt,
858
+ latents.shape,
859
+ device,
860
+ dtype,
861
+ invert_mask=True,
862
+ do_unsqueeze=False,
863
+ )
864
+
865
+ thresholds = torch.arange(len(timesteps), device=device, dtype=dtype) / len(timesteps)
866
+ thresholds = thresholds.view(-1, 1, 1, 1)
867
+ time_masks = map_processed > thresholds
868
+
869
  num_warmup_steps = len(timesteps) - num_steps_to_run * self.scheduler.order
870
  with torch.inference_mode():
871
  with self.progress_bar(total=num_steps_to_run) as progress_bar:
872
  for i, t in enumerate(timesteps):
873
  if self.interrupt:
874
  continue
875
+
876
+ if is_diff_mode:
877
+ if i == 0:
878
+ latents = original_latents_trajectory[:1]
879
+ else:
880
+ current_mask = time_masks[i].to(latents.dtype)
881
+ current_original_latent = original_latents_trajectory[i:i+1]
882
+
883
+ if current_mask.ndim == 3:
884
+ current_mask = current_mask.unsqueeze(1)
885
+
886
+ latents = current_original_latent * current_mask + latents * (1 - current_mask)
887
+
888
  timestep = t.expand(latents.shape[0])
889
  timestep = (1000 - timestep) / 1000
890
+
891
  t_norm = timestep[0].item()
 
892
  current_guidance_scale = self.guidance_scale
893
  if self.do_classifier_free_guidance and self._cfg_truncation is not None and float(self._cfg_truncation) <= 1:
894
  if t_norm > self._cfg_truncation:
895
  current_guidance_scale = 0.0
 
896
  apply_cfg = self.do_classifier_free_guidance and current_guidance_scale > 0
897
 
898
  if apply_cfg:
899
+ latent_model_input = latents.repeat(2, 1, 1, 1)
 
900
  timestep_model_input = timestep.repeat(2)
901
  else:
902
+ latent_model_input = latents
903
  timestep_model_input = timestep
904
 
905
+ latent_model_input = latent_model_input.to(self.transformer.dtype)
906
  latent_model_input = latent_model_input.unsqueeze(2)
907
  latent_model_input_list = list(latent_model_input.unbind(dim=0))
908
 
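
The `is_diff_mode` branch added above can be summarized with a small NumPy sketch. It follows the same three ideas as the diff: a noised trajectory of the original latents built as `(1 - sigma) * clean + sigma * noise`, per-step boolean masks obtained by comparing the inverted, blurred mask against thresholds `i / num_steps`, and a blend that resets preserved regions to the original trajectory at each step. The function names, the 1-D toy shapes, and the use of NumPy instead of PyTorch are assumptions made for illustration; this is not the pipeline code itself.

```python
import numpy as np


def noisy_trajectory(clean: np.ndarray, noise: np.ndarray, sigmas) -> np.ndarray:
    """One noised copy of the clean latents per step: (1 - sigma) * clean + sigma * noise."""
    return np.stack([(1.0 - s) * clean + s * noise for s in sigmas])


def time_masks(preserve_map: np.ndarray, num_steps: int) -> np.ndarray:
    """Boolean mask per step: True where the original latents are still enforced."""
    thresholds = np.arange(num_steps, dtype=preserve_map.dtype) / num_steps
    # Broadcast thresholds over the spatial dimensions of the map.
    thresholds = thresholds.reshape(-1, *([1] * preserve_map.ndim))
    return preserve_map[None, ...] > thresholds


def blend_step(latents: np.ndarray, trajectory: np.ndarray, masks: np.ndarray, i: int) -> np.ndarray:
    """Compose latents before denoising step i, as in the diff-mode branch."""
    if i == 0:
        # The first step starts from the fully noised original.
        return trajectory[0].copy()
    m = masks[i].astype(latents.dtype)
    return trajectory[i] * m + latents * (1.0 - m)
```

In this scheme a map value of `1.0` (fully preserved) exceeds every threshold and is reset at every step, a value of `0.0` is never reset, and intermediate values from the blurred edge are released progressively, which is what produces the soft transition.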
infer_inpaint.py CHANGED
@@ -11,16 +11,14 @@ from diffusers_local import patch # Apply necessary patches for local diffusers
11
  from diffusers_local.pipeline_z_image_control_unified import ZImageControlUnifiedPipeline
12
  from diffusers_local.z_image_control_transformer_2d import ZImageControlTransformer2DModel
13
 
14
-
15
  os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True,garbage_collection_threshold:0.7,max_split_size_mb:1024"
16
 
17
-
18
  def main():
19
  # 1. Set params
20
- BASE_MODEL_ID = "."
21
  GGUF_MODEL_FILE = "./transformer/z_image_turbo_control_unified_v2.1_q4_k_m.gguf"
22
  GGUF_MODEL_FILE = "./transformer/z_image_turbo_control_unified_v2.1_q8_0.gguf"
23
-
24
  use_gguf = True
25
 
26
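  # prompt="A young woman stands on a sunlit coastline, her white dress fluttering gently in the light sea breeze, its hem lifting lightly. She has long, vivid purple hair dancing lightly in the wind, with a delicate black bow tied in it, contrasting sharply with the soft azure sky behind her. Her face is delicate, with refined features and fair, smooth skin, giving off a sweet, youthful air; her expression is gentle and slightly shy, her gaze quietly fixed on the distant horizon, her hands naturally folded in front of her, fingers clearly visible, all five fingers intact, knuckles natural, posture elegant and relaxed, as if lost in thought. The background is a vast, boundless, sparkling sea; sunlight spills across the water, casting a warm golden glow, waves gently lap the beach, and the sky is deep blue with thin clouds. The overall image is high-definition and sharp, richly detailed, vividly colored, in clear focus, 8K resolution, masterpiece, best quality, no blur, no noise, no distortion, natural lighting, cinematic rendering."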
@@ -29,13 +27,14 @@ def main():
29
 
30
  target_height = 1728
31
  target_width = 992
32
- num_inference_steps = 20
33
  guidance_scale = 0 # 2.5
34
  controlnet_conditioning_scale = 0.7
35
  controlnet_conditioning_refiner_scale = 0.75
36
- mask_blur_radius = 8.0
37
- seed = 42
38
  shift = 3.0
 
39
  generator = torch.Generator("cuda").manual_seed(seed)
40
 
41
  print("Loading Pipeline...")
@@ -74,8 +73,7 @@ def main():
74
 
75
  pose_image = load_image("assets/pose.jpg")
76
  inpaint_image = load_image("assets/inpaint.jpg")
77
- mask_image = load_image("assets/mask_inpaint.jpg")
78
-
79
  start_inference_time = time.time()
80
 
81
  generated_image = pipe(
@@ -84,7 +82,8 @@ def main():
84
  image=inpaint_image,
85
  control_image=pose_image,
86
  mask_image=mask_image,
87
- mask_blur_radius=mask_blur_radius,
 
88
  height=target_height,
89
  width=target_width,
90
  num_inference_steps=num_inference_steps,
 
11
  from diffusers_local.pipeline_z_image_control_unified import ZImageControlUnifiedPipeline
12
  from diffusers_local.z_image_control_transformer_2d import ZImageControlTransformer2DModel
13
 
 
14
  os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True,garbage_collection_threshold:0.7,max_split_size_mb:1024"
15
 
 
16
  def main():
17
  # 1. Set params
18
+ BASE_MODEL_ID = "."
19
  GGUF_MODEL_FILE = "./transformer/z_image_turbo_control_unified_v2.1_q4_k_m.gguf"
20
  GGUF_MODEL_FILE = "./transformer/z_image_turbo_control_unified_v2.1_q8_0.gguf"
21
+
22
  use_gguf = True
23
 
24
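  # prompt="A young woman stands on a sunlit coastline, her white dress fluttering gently in the light sea breeze, its hem lifting lightly. She has long, vivid purple hair dancing lightly in the wind, with a delicate black bow tied in it, contrasting sharply with the soft azure sky behind her. Her face is delicate, with refined features and fair, smooth skin, giving off a sweet, youthful air; her expression is gentle and slightly shy, her gaze quietly fixed on the distant horizon, her hands naturally folded in front of her, fingers clearly visible, all five fingers intact, knuckles natural, posture elegant and relaxed, as if lost in thought. The background is a vast, boundless, sparkling sea; sunlight spills across the water, casting a warm golden glow, waves gently lap the beach, and the sky is deep blue with thin clouds. The overall image is high-definition and sharp, richly detailed, vividly colored, in clear focus, 8K resolution, masterpiece, best quality, no blur, no noise, no distortion, natural lighting, cinematic rendering."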
@@ -29,13 +27,14 @@ def main():
 
     target_height = 1728
     target_width = 992
+    num_inference_steps = 25
     guidance_scale = 0 # 2.5
     controlnet_conditioning_scale = 0.7
     controlnet_conditioning_refiner_scale = 0.75
+    mask_blur_radius = 12
+    seed = 48
     shift = 3.0
+    inpaint_mode = "diff+inpaint" # ("default", "diff", "diff+inpaint")
     generator = torch.Generator("cuda").manual_seed(seed)
 
     print("Loading Pipeline...")
@@ -74,8 +73,7 @@ def main():
 
     pose_image = load_image("assets/pose.jpg")
     inpaint_image = load_image("assets/inpaint.jpg")
+    mask_image = load_image("assets/inpaint_mask.jpg")
     start_inference_time = time.time()
 
     generated_image = pipe(
@@ -84,7 +82,8 @@ def main():
         image=inpaint_image,
         control_image=pose_image,
         mask_image=mask_image,
+        mask_blur_radius=mask_blur_radius,
+        inpaint_mode=inpaint_mode,
         height=target_height,
         width=target_width,
         num_inference_steps=num_inference_steps,
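The hunks above add an `inpaint_mode` option with the values `"default"`, `"diff"`, and `"diff+inpaint"`, but they do not show how the pipeline implements those modes. As a rough illustration only, the sketch below shows the general thresholding idea usually associated with differential diffusion: a soft mask is read as a per-pixel change strength, and stronger pixels are released to the model earlier in the denoising schedule. The function name `free_pixels_per_step` and the linear threshold are assumptions made for this example, not code from this repository.

```python
def free_pixels_per_step(change_map, num_steps):
    """Return, for each step, which positions may change (True) or stay pinned (False).

    Hypothetical helper for illustration; change_map holds strengths in [0, 1].
    """
    schedule = []
    for step in range(num_steps):
        # Assumed linear schedule: the threshold falls from 1.0 toward 0.0.
        threshold = 1.0 - step / num_steps
        schedule.append([strength >= threshold for strength in change_map])
    return schedule


# Three "pixels": fully protected, half strength, fully editable.
change_map = [0.0, 0.5, 1.0]
schedule = free_pixels_per_step(change_map, num_steps=4)
for step, free in enumerate(schedule):
    print(step, free)
```

Under this assumed schedule, a pixel with strength 1.0 is free from the first step, a pixel with strength 0.5 is released halfway through, and a pixel with strength 0.0 is never released. This is why a blurred (feathered) mask can produce a gradual transition instead of a hard seam.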
prepare_mask.py ADDED
@@ -0,0 +1,101 @@
+import argparse
+from PIL import Image, ImageFilter
+
+def expand_and_feather_mask(mask_image: Image.Image, expand_pixels: int = 10, feather_radius: int = 8) -> Image.Image:
+    """
+    Expands the white area of a mask and then smooths its edges using Pillow filters.
+
+    This is useful for preparing inpainting masks to ensure complete coverage of the
+    area to be replaced and to create a smooth blend with the surrounding image.
+
+    Args:
+        mask_image (PIL.Image.Image): The input mask (black and white). It's
+            expected to be a PIL Image.
+        expand_pixels (int): The number of pixels to expand (dilate) the white
+            area. This helps to cover any "ghosting" from the old image.
+        feather_radius (int): The radius of the Gaussian blur used to create the
+            soft edge (feathering) effect.
+
+    Returns:
+        PIL.Image.Image: The processed mask with expanded and feathered edges.
+    """
+    # Ensure the mask is in 'L' (grayscale) mode for the filters to work correctly.
+    mask = mask_image.convert("L")
+
+    # 1. Expansion (Dilation)
+    # The MaxFilter finds the brightest pixel in a kernel window and replaces the
+    # center pixel with it. For a black and white image, this causes the white
+    # areas to expand.
+    if expand_pixels > 0:
+        # The filter size must be an odd number. The formula (pixels * 2 + 1)
+        # creates a kernel of the correct odd size.
+        expand_size = expand_pixels * 2 + 1
+        print(f"Expanding mask by {expand_pixels} pixels (filter size: {expand_size}x{expand_size})...")
+        mask = mask.filter(ImageFilter.MaxFilter(size=expand_size))
+
+    # 2. Feathering (Gaussian Blur)
+    # Applies a Gaussian blur to the expanded mask, creating a smooth
+    # gradient from white to black at the edges.
+    if feather_radius > 0:
+        print(f"Feathering mask with a radius of {feather_radius} pixels...")
+        mask = mask.filter(ImageFilter.GaussianBlur(radius=feather_radius))
+
+    return mask
+
+def main():
+    """Main function to parse arguments and process the mask."""
+    parser = argparse.ArgumentParser(description="Expand and feather an inpainting mask.")
+
+    parser.add_argument(
+        "input_path",
+        type=str,
+        help="Path to the input mask image file."
+    )
+    parser.add_argument(
+        "output_path",
+        type=str,
+        help="Path to save the processed output mask image file."
+    )
+    parser.add_argument(
+        "--expand",
+        type=int,
+        default=10,
+        help="Number of pixels to expand the white areas of the mask. Default is 10."
+    )
+    parser.add_argument(
+        "--feather",
+        type=int,
+        default=8,
+        help="Radius in pixels for the Gaussian blur (feathering) effect. Default is 8."
+    )
+
+    args = parser.parse_args()
+
+    try:
+        # Load the input mask
+        print(f"Loading mask from: {args.input_path}")
+        original_mask = Image.open(args.input_path)
+    except FileNotFoundError:
+        print(f"Error: Input file not found at '{args.input_path}'")
+        return
+    except Exception as e:
+        print(f"Error loading image: {e}")
+        return
+
+    # Process the mask using the function
+    processed_mask = expand_and_feather_mask(
+        original_mask,
+        expand_pixels=args.expand,
+        feather_radius=args.feather
+    )
+
+    # Save the final mask
+    try:
+        print(f"Saving processed mask to: {args.output_path}")
+        processed_mask.save(args.output_path)
+        print("Done!")
+    except Exception as e:
+        print(f"Error saving image: {e}")
+
+if __name__ == "__main__":
+    main()
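The comments in `prepare_mask.py` state that a max filter with a window of `expand_pixels * 2 + 1` grows the white area by `expand_pixels` on every side. The sketch below checks that arithmetic on a tiny grid in plain Python, without Pillow. The helper `max_filter` is a hypothetical stand-in written for this example, and clamping the window at the borders is an assumption, not a description of Pillow's edge handling.

```python
def max_filter(grid, expand_pixels):
    """Dilate a 2D grid of 0-255 values with a square window of side 2 * expand_pixels + 1."""
    height, width = len(grid), len(grid[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Clamp the window to the image bounds (assumed edge handling).
            y0, y1 = max(0, y - expand_pixels), min(height, y + expand_pixels + 1)
            x0, x1 = max(0, x - expand_pixels), min(width, x + expand_pixels + 1)
            # Each output pixel takes the brightest value inside its window.
            out[y][x] = max(grid[yy][xx] for yy in range(y0, y1) for xx in range(x0, x1))
    return out


# 7x7 black mask with a single white pixel in the centre.
mask = [[0] * 7 for _ in range(7)]
mask[3][3] = 255

dilated = max_filter(mask, expand_pixels=2)
white_columns = [x for x in range(7) if dilated[3][x] == 255]
print(white_columns)  # [1, 2, 3, 4, 5]
```

With `expand_pixels=2` the single white pixel becomes a 5x5 white square, so the white region has grown by two pixels in each direction, which matches the `pixels * 2 + 1` window size used in the script.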
results/new_tests/{result_inpaint.png → result_inpaint_2.png} RENAMED
File without changes
results/new_tests/result_inpaint_default.png ADDED

Git LFS Details

  • SHA256: baf856d1e4e581cbced169a801c0e90efc00b37d117be19bee892d1865d511c2
  • Pointer size: 132 Bytes
  • Size of remote file: 1.92 MB
results/new_tests/result_inpaint_diff.png ADDED

Git LFS Details

  • SHA256: 1bc7cc72bc1959e3ab65bbc2b23d3c57ebf93ea4ca78671fbdc9a8ab38e8e6bd
  • Pointer size: 132 Bytes
  • Size of remote file: 1.7 MB
results/new_tests/result_inpaint_diffinpaint.png ADDED

Git LFS Details

  • SHA256: 60099065d5601a81ea53518d49c8f8d6902174dea9e39d4eb5d3ea1b34c9cf5f
  • Pointer size: 132 Bytes
  • Size of remote file: 1.71 MB