add differential diffusion inpaint

Files changed (12)

.gitignore +2 -1
README.md +66 -22
assets/inpaint_mask.jpg +3 -0
assets/mask_1.jpg +3 -0
assets/{mask_inpaint.jpg → mask_2.jpg} +0 -0
diffusers_local/pipeline_z_image_control_unified.py +100 -35
infer_inpaint.py +9 -10
prepare_mask.py +101 -0
results/new_tests/{result_inpaint.png → result_inpaint_2.png} +0 -0
results/new_tests/result_inpaint_default.png +3 -0
results/new_tests/result_inpaint_diff.png +3 -0
results/new_tests/result_inpaint_diffinpaint.png +3 -0

.gitignore CHANGED

@@ -17,4 +17,5 @@ bk/
 outputs/
 original/
 Makefile
-pyproject.toml

 outputs/
 original/
 Makefile
+pyproject.toml
+README_.md

README.md CHANGED

@@ -41,29 +41,54 @@ pip install -r requirements.txt
 ## 🚀 Usage
-## 📂 Repository Structure
-*   `./transformer/z_image_turbo_control_unified_v2.1_q4_k_m.gguf`: The unified, quantized Q4_K_M model weights.
-*   `./transformer/z_image_turbo_control_unified_v2.1_q8_0.gguf`: The unified, quantized Q8_0 model weights.
-*   `infer_controlnet.py`: Script for running controlnet inference.
-*   `infer_inpaint.py`: Script for running inpaint inference.
-*   `infer_t2i.py`: Script for running text-to-image inference.
-*   `infer_i2i.py`: Script for running image-to-image inference.
-*   `diffusers_local/`: Custom pipeline code (`ZImageControlUnifiedPipeline`) and transformer logic.
-*   `requirements.txt`: Python dependencies.
-The primary script for inference is `infer_controlnet.py`, which is designed to handle all supported generation modes.
-### Option 1: Low VRAM (GGUF) - Recommended
-Use this version if you have limited VRAM (e.g., 6GB - 8GB). It loads the model from a quantized **GGUF** file (`z_image_turbo_control_unified_v2.1_q4_k_m.gguf`). Simply configure the `infer_controlnet.py` script to point to the GGUF file.
-**Key Features of this mode:**
-*   Loads the unified transformer from a single 4-bit quantized file.
-*   Enables aggressive `group_offload` to fit large models in consumer GPUs.
-### Option 2: High Precision (Diffusers/BF16)
-Use this version if you have ample VRAM (e.g., 24GB+). Configure `infer_controlnet.py` to load the model using the standard `from_pretrained` directory structure for full **BFloat16** precision.
 ## 🛠️ Model Features & Configuration (V2)
@@ -76,7 +101,7 @@ Use this version if you have ample VRAM (e.g., 24GB+). Configure `infer_controln
 This optimized V2 model introduces several new features and parameters for enhanced control and flexibility:
-*   **Unified Pipeline:** A single pipeline now handles Text-to-Image, Image-to-Image, ControlNet, and Inpainting tasks.
 *   **Refiner Scale (`controlnet_refiner_conditioning_scale`):** Provides fine-grained control over the influence of the initial refining layers, so they can be adjusted independently of the ControlNet conditioning scale.
 *   **Optional Refiner (`add_control_noise_refiner=False`):** You can now disable the control noise refiner layers when loading the model to save memory or for different stylistic results.
 *   **Inpainting Blur (`mask_blur_radius`):** A parameter to soften the edges of the inpainting mask for smoother transitions.
@@ -98,12 +123,18 @@ The new `controlnet_refiner_conditioning_scale` parameter allows for fine-tuning
 <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
   <tr>
-    <td>Pose + Inpaint</td>
-    <td>Output</td>
   </tr>
   <tr>
-    <td><img src="assets/inpaint.jpg" width="100%" /><img src="assets/mask_inpaint.jpg" width="100%" /></td>
-    <td><img src="results/new_tests/result_inpaint.png" width="100%" /></td>
   </tr>
 </table>
 <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
@@ -188,3 +219,16 @@ The table below shows the generation results under different combinations of Dif
 | **20** | ![](results/scale_test/20_scale_0.65.png) | ![](results/scale_test/20_scale_0.70.png) | ![](results/scale_test/20_scale_0.75.png) | ![](results/scale_test/20_scale_0.8.png) | ![](results/scale_test/20_scale_0.9.png) | ![](results/scale_test/20_scale_1.0.png) |
 | **30** | ![](results/scale_test/30_scale_0.65.png) | ![](results/scale_test/30_scale_0.70.png) | ![](results/scale_test/30_scale_0.75.png) | ![](results/scale_test/30_scale_0.8.png) | ![](results/scale_test/30_scale_0.9.png) | ![](results/scale_test/30_scale_1.0.png) |
 | **40** | ![](results/scale_test/40_scale_0.65.png) | ![](results/scale_test/40_scale_0.70.png) | ![](results/scale_test/40_scale_0.75.png) | ![](results/scale_test/40_scale_0.8.png) | ![](results/scale_test/40_scale_0.9.png) | ![](results/scale_test/40_scale_1.0.png) |

 ## 🚀 Usage
+This repository provides separate, easy-to-use scripts for each generation task.
+### High-Level Scripts
+*   `infer_t2i.py`: For Text-to-Image generation.
+*   `infer_i2i.py`: For Image-to-Image generation.
+*   `infer_controlnet.py`: For ControlNet-guided generation (Pose, Canny, Depth, etc.).
+*   `infer_inpaint.py`: For all inpainting tasks.
+### Hardware Options
+#### Option 1: Low VRAM (GGUF) - Recommended
+Use this version if you have limited VRAM (e.g., 6GB - 8GB). It loads the model from a quantized **GGUF** file. To use it, set `use_gguf = True` in the desired inference script and provide the path to the `.gguf` file.
+**Key Features:**
+*   Loads the unified transformer from a single 4-bit or 8-bit quantized file.
+*   Enables aggressive `group_offload` to fit large models on consumer GPUs.
+#### Option 2: High Precision (Diffusers/BF16)
+Use this version if you have ample VRAM (e.g., 24GB+). Set `use_gguf = False` in the script to load the model using the standard `from_pretrained` directory structure for full **BFloat16** precision.
+## 🎨 Inpainting Guide
+The `infer_inpaint.py` script supports three inpainting modes, selected with the `inpaint_mode` parameter.
+### Preparing Your Mask
+For best results, especially when removing objects or dealing with complex edges, it's recommended to pre-process your mask. We provide a utility script for this.
+**`prepare_mask.py`**
+This script expands the white areas of your mask and applies a feather (blur) to the edges. This helps to completely cover artifacts from the old image and ensures a smooth, seamless blend with the new generated content.
+**Usage:**
+```bash
+python prepare_mask.py <input_mask_path> <output_mask_path> --expand 15 --feather 10
+```
+*   `--expand`: Grows the white mask area outward so it covers residual "ghosting" left around the region being replaced.
+*   `--feather`: Creates a soft gradient for seamless blending.
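
The two operations can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the contents of `prepare_mask.py`: the function names `expand_mask` and `feather_mask` are hypothetical, the mask is assumed to be a 2-D float array in `[0, 1]` with white (`1.0`) marking the area to regenerate, and a simple box blur stands in for whatever blur the script actually applies.

```python
import numpy as np


def _window_reduce(img: np.ndarray, radius: int, axis: int, reduce) -> np.ndarray:
    """Reduce a (2 * radius + 1)-wide sliding window along one axis."""
    pad = [(0, 0), (0, 0)]
    pad[axis] = (radius, radius)
    padded = np.pad(img, pad, mode="edge")
    n = img.shape[axis]
    # Stack every shifted view of the image, then reduce across the shifts.
    shifted = [np.take(padded, np.arange(k, k + n), axis=axis) for k in range(2 * radius + 1)]
    return reduce(np.stack(shifted), axis=0)


def expand_mask(mask: np.ndarray, radius: int) -> np.ndarray:
    """Grow the white region by `radius` pixels (square max filter, applied per axis)."""
    grown = _window_reduce(mask, radius, axis=0, reduce=np.max)
    return _window_reduce(grown, radius, axis=1, reduce=np.max)


def feather_mask(mask: np.ndarray, radius: int) -> np.ndarray:
    """Soften mask edges with a box blur (assumption: stand-in for the real blur)."""
    blurred = _window_reduce(mask, radius, axis=0, reduce=np.mean)
    return _window_reduce(blurred, radius, axis=1, reduce=np.mean)


# Toy example: a single white pixel, expanded by 1 and then feathered by 1.
mask = np.zeros((9, 9))
mask[4, 4] = 1.0
soft = feather_mask(expand_mask(mask, radius=1), radius=1)
print(np.round(soft[2:7, 2:7], 2))
```

Expanding before feathering matters: the blur pulls the edge values below 1.0, so growing the mask first keeps the fully white core over the whole object.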
+### Inpainting Modes in `infer_inpaint.py`
+You can choose the inpainting method by setting the `inpaint_mode` variable in the script:
+1.  **`inpaint_mode = "default"`**
+    *   Uses the standard ControlNet-based inpainting. Good for general-purpose tasks.
+2.  **`inpaint_mode = "diff"`**
+    *   Uses the "Differential Diffusion" inpainting technique. This method aims to preserve the original background texture and lighting while generating new content in the masked area. It works by compositing the noised original latents back into the preserved region at each step of the diffusion process.
+3.  **`inpaint_mode = "diff+inpaint"`**
+    *   Combines both methods. It uses the `diff` mode for background preservation while also feeding the inpainting context to the ControlNet layers. This can be useful for complex scenes where both structural guidance and texture preservation are needed.
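
The three values reduce to two independent switches: whether the inpaint context is fed to the ControlNet layers, and whether latents are composited against the original at each step. Below is a minimal sketch of that mapping, mirroring the flag logic this commit adds to the pipeline. The `InpaintFlags` type, the `resolve_inpaint_flags` helper, and the explicit validation are illustrative additions, not part of the repository.

```python
from typing import NamedTuple

VALID_MODES = ("default", "diff", "diff+inpaint")


class InpaintFlags(NamedTuple):
    use_control_inpaint: bool   # feed image + mask to the ControlNet context
    use_diff_compositing: bool  # blend original latents back in at each step
    is_img2img: bool            # image given without a mask


def resolve_inpaint_flags(has_image: bool, has_mask: bool, inpaint_mode: str = "default") -> InpaintFlags:
    """Map the inputs and `inpaint_mode` to the switches used during generation."""
    if inpaint_mode not in VALID_MODES:
        # Assumption: the real pipeline may not validate this value.
        raise ValueError(f"inpaint_mode must be one of {VALID_MODES}, got {inpaint_mode!r}")
    has_inpaint_inputs = has_image and has_mask
    return InpaintFlags(
        use_control_inpaint=has_inpaint_inputs and inpaint_mode in ("default", "diff+inpaint"),
        use_diff_compositing=has_inpaint_inputs and inpaint_mode in ("diff", "diff+inpaint"),
        is_img2img=has_image and not has_inpaint_inputs,
    )


for mode in VALID_MODES:
    print(mode, resolve_inpaint_flags(has_image=True, has_mask=True, inpaint_mode=mode))
```

Note that without a mask the mode has no effect: an image alone is treated as image-to-image regardless of `inpaint_mode`.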
 ## 🛠️ Model Features & Configuration (V2)
 This optimized V2 model introduces several new features and parameters for enhanced control and flexibility:
+*   **Unified Pipeline:** A single pipeline now handles Text-to-Image, Image-to-Image, ControlNet, and multiple Inpainting modes.
 *   **Refiner Scale (`controlnet_refiner_conditioning_scale`):** Provides fine-grained control over the influence of the initial refining layers, so they can be adjusted independently of the ControlNet conditioning scale.
 *   **Optional Refiner (`add_control_noise_refiner=False`):** You can now disable the control noise refiner layers when loading the model to save memory or for different stylistic results.
 *   **Inpainting Blur (`mask_blur_radius`):** A parameter to soften the edges of the inpainting mask for smoother transitions.
 <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
   <tr>
+    <td>Pose + Inpaint Image</td>
+    <td>Inpaint Mask</td>
+    <td>Model Inpaint</td>
+    <td>Diff Inpaint</td>
+    <td>Diff + Model Inpaint</td>
   </tr>
   <tr>
+    <td><img src="assets/pose.jpg" width="100%" /><img src="assets/inpaint.jpg" width="100%" /></td>
+    <td><img src="assets/inpaint_mask.jpg" width="100%" /></td>
+    <td><img src="results/new_tests/result_inpaint_default.png" width="100%" /></td>
+    <td><img src="results/new_tests/result_inpaint_diff.png" width="100%" /></td>
+    <td><img src="results/new_tests/result_inpaint_diffinpaint.png" width="100%" /></td>
   </tr>
 </table>
 <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
 | **20** | ![](results/scale_test/20_scale_0.65.png) | ![](results/scale_test/20_scale_0.70.png) | ![](results/scale_test/20_scale_0.75.png) | ![](results/scale_test/20_scale_0.8.png) | ![](results/scale_test/20_scale_0.9.png) | ![](results/scale_test/20_scale_1.0.png) |
 | **30** | ![](results/scale_test/30_scale_0.65.png) | ![](results/scale_test/30_scale_0.70.png) | ![](results/scale_test/30_scale_0.75.png) | ![](results/scale_test/30_scale_0.8.png) | ![](results/scale_test/30_scale_0.9.png) | ![](results/scale_test/30_scale_1.0.png) |
 | **40** | ![](results/scale_test/40_scale_0.65.png) | ![](results/scale_test/40_scale_0.70.png) | ![](results/scale_test/40_scale_0.75.png) | ![](results/scale_test/40_scale_0.8.png) | ![](results/scale_test/40_scale_0.9.png) | ![](results/scale_test/40_scale_1.0.png) |
+---
+## 📂 Repository Structure
+*   `./transformer/`: Directory for model weights (GGUF or standard).
+*   `infer_controlnet.py`: Script for ControlNet inference.
+*   `infer_inpaint.py`: Script for inpainting inference.
+*   `infer_t2i.py`: Script for Text-to-Image inference.
+*   `infer_i2i.py`: Script for Image-to-Image inference.
+*   `prepare_mask.py`: Utility script to process masks for inpainting.
+*   `diffusers_local/`: Custom pipeline code.
+*   `requirements.txt`: Python dependencies.

assets/inpaint_mask.jpg ADDED

Git LFS Details

SHA256: ceb6d79bacb39bf5378af4ac953522ccd7500cffbbe04d8861643ece5427e8e4
Pointer size: 130 Bytes
Size of remote file: 31.6 kB

assets/mask_1.jpg ADDED

Git LFS Details

SHA256: c294c68a191353e60b40c43a9922e3d1ec4017f86a34865570fc5acbbc01858c
Pointer size: 130 Bytes
Size of remote file: 73.4 kB

assets/{mask_inpaint.jpg → mask_2.jpg} RENAMED

File without changes

diffusers_local/pipeline_z_image_control_unified.py CHANGED

@@ -15,11 +15,12 @@
 import inspect
-from typing import Any, Callable, Dict, List, Optional, Tuple, Union
 import numpy as np
 import torch
 import torch.nn.functional as F
 from diffusers import AutoencoderKL, DiffusionPipeline, FlowMatchEulerDiscreteScheduler
 from diffusers.image_processor import PipelineImageInput, VaeImageProcessor
 from diffusers.loaders import FromSingleFileMixin, ZImageLoraLoaderMixin
@@ -467,6 +468,8 @@ class ZImageControlUnifiedPipeline(DiffusionPipeline, ZImageLoraLoaderMixin, Fro
         reference_latents_shape: Tuple,
         device: torch.device,
         dtype: torch.dtype,
     ) -> torch.Tensor:
         """
         Processes a MASK using the mask_processor, inverts it, resizes it, and formats it for the control_context.
@@ -494,13 +497,18 @@ class ZImageControlUnifiedPipeline(DiffusionPipeline, ZImageLoraLoaderMixin, Fro
             )
             return torch.zeros(placeholder_shape, device=device, dtype=dtype)
-        mask_condition = self.mask_processor.preprocess(mask_image, height=height, width=width).to(device=device, dtype=dtype)
-        mask_for_inpainting = 1.0 - mask_condition
-        mask_latents = F.interpolate(mask_for_inpainting, size=reference_latents_shape[-2:], mode="nearest")
-        return mask_latents.unsqueeze(2)
     def prepare_control_latents(
         self, image: PipelineImageInput, width: int, height: int, batch_size: int, num_images_per_prompt: int, device: torch.device, dtype: torch.dtype
@@ -595,7 +603,8 @@ class ZImageControlUnifiedPipeline(DiffusionPipeline, ZImageLoraLoaderMixin, Fro
         prompt: Union[str, List[str]],
         image: Optional[PipelineImageInput] = None,
         mask_image: Optional[PipelineImageInput] = None,
-        mask_blur_radius: float = 4.0,
         control_image: Optional[PipelineImageInput] = None,
         height: Optional[int] = None,
         width: Optional[int] = None,
@@ -630,7 +639,10 @@ class ZImageControlUnifiedPipeline(DiffusionPipeline, ZImageLoraLoaderMixin, Fro
                 The initial image for image-to-image or inpainting modes.
             mask_image (`PipelineImageInput`, *optional*):
                 The mask image for inpainting. White areas are preserved, black areas are inpainted.
-            mask_blur_radius (`float`, *optional*, defaults to 4.0):
                 The radius for blurring the edges of the inpainting mask to create a smoother transition.
             control_image (`PipelineImageInput`, *optional*):
                 The conditioning image for control modes (e.g., Canny, depth).
@@ -640,21 +652,21 @@ class ZImageControlUnifiedPipeline(DiffusionPipeline, ZImageLoraLoaderMixin, Fro
                 The width in pixels of the generated image.
             num_inference_steps (`int`, *optional*, defaults to 20):
                 The number of denoising steps. More denoising steps usually lead to a higher quality image at the
-                expense of slower inference.
             sigmas (`List[float]`, *optional*):
                 Custom sigmas to use for the denoising process. If not defined, the scheduler's default behavior
-                will be used.
             strength (`float`, *optional*, defaults to 1.0):
                 Denoising strength for image-to-image. A value of 1.0 means the initial image is fully replaced,
-                while a lower value preserves more of the original image structure. Only used in img2img mode.
             guidance_scale (`float`, *optional*, defaults to 4.0):
                 The scale for classifier-free guidance. A value > 1 enables it. Higher values encourage images
-                closer to the prompt, potentially at the cost of quality.
             cfg_normalization (`bool`, *optional*, defaults to False):
                 Whether to apply normalization to the guidance, which can prevent oversaturation.
             cfg_truncation (`float`, *optional*, defaults to 1.0):
                 A value between 0.0 and 1.0 that disables CFG for the final portion of the denoising steps,
-                specified as a fraction of total steps. For example, 0.8 disables CFG for the last 20% of steps.
             negative_prompt (`str` or `List[str]`, *optional*):
                 The prompt or prompts not to guide the image generation.
             num_images_per_prompt (`int`, *optional*, defaults to 1):
@@ -698,8 +710,12 @@ class ZImageControlUnifiedPipeline(DiffusionPipeline, ZImageLoraLoaderMixin, Fro
         is_two_stage_control_model = self.transformer.control_in_dim > self.transformer.in_channels if hasattr(self.transformer, "control_in_dim") else False
         device = self._execution_device
         dtype = self.transformer.dtype
-        vae_scale = self.vae_scale_factor * 2
         ref_image = control_image or image
         image_height = None
         image_width = None
@@ -742,22 +758,23 @@ class ZImageControlUnifiedPipeline(DiffusionPipeline, ZImageLoraLoaderMixin, Fro
             prompt_embeds_model_input = prompt_embeds + negative_prompt_embeds
         else:
             prompt_embeds_model_input = prompt_embeds
-        is_inpaint_mode = image is not None and mask_image is not None
-        is_img2img_mode = image is not None and not is_inpaint_mode
-        if control_image is not None or is_inpaint_mode:
             control_latents = self.prepare_control_latents(control_image, width, height, batch_size, num_images_per_prompt, device, dtype)
-            if is_two_stage_control_model:
-                mask_to_use = self._apply_mask_blur(mask_image, mask_blur_radius, is_inpaint_mode)
                 inpaint_latents = self._prepare_image_latents(
-                    image, mask_to_use, width, height, batch_size, num_images_per_prompt, device, dtype, do_preprocess=True
                 )
                 mask_latents = self._prepare_mask_latents(
-                    mask_to_use,
                     width,
                     height,
                     batch_size,
@@ -765,6 +782,8 @@ class ZImageControlUnifiedPipeline(DiffusionPipeline, ZImageLoraLoaderMixin, Fro
                     inpaint_latents.shape,
                     device,
                     dtype,
                 )
                 control_context = torch.cat([control_latents, mask_latents, inpaint_latents], dim=1)
             else:
@@ -783,7 +802,7 @@ class ZImageControlUnifiedPipeline(DiffusionPipeline, ZImageLoraLoaderMixin, Fro
         timesteps, num_inference_steps = retrieve_timesteps(self.scheduler, num_inference_steps, device, sigmas, mu=mu)
         self._num_timesteps = len(timesteps)
-        if is_img2img_mode and not is_inpaint_mode:
             strength = min(strength, 1.0)
         else:
             strength = 1.0
@@ -798,7 +817,8 @@ class ZImageControlUnifiedPipeline(DiffusionPipeline, ZImageLoraLoaderMixin, Fro
         latent_timestep = timesteps[:1].repeat(effective_batch_size) if strength < 1.0 else None
-        use_image_for_latents = is_img2img_mode and not is_inpaint_mode
         latents = self.prepare_latents(
             effective_batch_size,
             self.transformer.in_channels,
@@ -811,33 +831,78 @@ class ZImageControlUnifiedPipeline(DiffusionPipeline, ZImageLoraLoaderMixin, Fro
             timestep=latent_timestep if use_image_for_latents else None,
             latents=latents,
         )
         num_warmup_steps = len(timesteps) - num_steps_to_run * self.scheduler.order
         with torch.inference_mode():
             with self.progress_bar(total=num_steps_to_run) as progress_bar:
                 for i, t in enumerate(timesteps):
                     if self.interrupt:
                         continue
                     timestep = t.expand(latents.shape[0])
                     timestep = (1000 - timestep) / 1000
                     t_norm = timestep[0].item()
                     current_guidance_scale = self.guidance_scale
                     if self.do_classifier_free_guidance and self._cfg_truncation is not None and float(self._cfg_truncation) <= 1:
                         if t_norm > self._cfg_truncation:
                             current_guidance_scale = 0.0
                     apply_cfg = self.do_classifier_free_guidance and current_guidance_scale > 0
                     if apply_cfg:
-                        latents_typed = latents.to(self.transformer.dtype)
-                        latent_model_input = latents_typed.repeat(2, 1, 1, 1)
                         timestep_model_input = timestep.repeat(2)
                     else:
-                        latent_model_input = latents.to(self.transformer.dtype)
                         timestep_model_input = timestep
                     latent_model_input = latent_model_input.unsqueeze(2)
                     latent_model_input_list = list(latent_model_input.unbind(dim=0))

 import inspect
+from typing import Any, Callable, Dict, List, Literal, Optional, Tuple, Union
 import numpy as np
 import torch
 import torch.nn.functional as F
+import torchvision.transforms as T
 from diffusers import AutoencoderKL, DiffusionPipeline, FlowMatchEulerDiscreteScheduler
 from diffusers.image_processor import PipelineImageInput, VaeImageProcessor
 from diffusers.loaders import FromSingleFileMixin, ZImageLoraLoaderMixin
         reference_latents_shape: Tuple,
         device: torch.device,
         dtype: torch.dtype,
+        invert_mask: bool = False,
+        do_unsqueeze: bool = True,
     ) -> torch.Tensor:
         """
         Processes a MASK using the mask_processor, optionally inverts it, resizes it, and formats it for the control_context.
             )
             return torch.zeros(placeholder_shape, device=device, dtype=dtype)
+        mask_tensor = self.mask_processor.preprocess(mask_image, height=height, width=width)
+        mask_tensor = mask_tensor.to(device=device, dtype=dtype)
+        if invert_mask:
+            mask_tensor = 1.0 - mask_tensor
+        mask_latents = F.interpolate(mask_tensor, size=reference_latents_shape[-2:], mode="nearest")
+        if do_unsqueeze:
+            mask_latents = mask_latents.unsqueeze(2)
+        return mask_latents
     def prepare_control_latents(
         self, image: PipelineImageInput, width: int, height: int, batch_size: int, num_images_per_prompt: int, device: torch.device, dtype: torch.dtype
         prompt: Union[str, List[str]],
         image: Optional[PipelineImageInput] = None,
         mask_image: Optional[PipelineImageInput] = None,
+        inpaint_mode: Literal["default", "diff", "diff+inpaint"] = "default",
+        mask_blur_radius: float = 8.0,
         control_image: Optional[PipelineImageInput] = None,
         height: Optional[int] = None,
         width: Optional[int] = None,
                 The initial image for image-to-image or inpainting modes.
             mask_image (`PipelineImageInput`, *optional*):
                 The mask image for inpainting. White areas are preserved, black areas are inpainted.
+            inpaint_mode (`str`, *optional*, defaults to `"default"`):
+                The inpainting mode. `"default"` uses ControlNet-based inpainting, `"diff"` uses Differential
+                Diffusion latent compositing, and `"diff+inpaint"` combines both.
+            mask_blur_radius (`float`, *optional*, defaults to 8.0):
                 The radius for blurring the edges of the inpainting mask to create a smoother transition.
             control_image (`PipelineImageInput`, *optional*):
                 The conditioning image for control modes (e.g., Canny, depth).
                 The width in pixels of the generated image.
             num_inference_steps (`int`, *optional*, defaults to 20):
                 The number of denoising steps. More denoising steps usually lead to a higher quality image at the
+                expense of slower inference.
             sigmas (`List[float]`, *optional*):
                 Custom sigmas to use for the denoising process. If not defined, the scheduler's default behavior
+                will be used.
             strength (`float`, *optional*, defaults to 1.0):
                 Denoising strength for image-to-image. A value of 1.0 means the initial image is fully replaced,
+                while a lower value preserves more of the original image structure. Only used in img2img mode.
             guidance_scale (`float`, *optional*, defaults to 4.0):
                 The scale for classifier-free guidance. A value > 1 enables it. Higher values encourage images
+                closer to the prompt, potentially at the cost of quality.
             cfg_normalization (`bool`, *optional*, defaults to False):
                 Whether to apply normalization to the guidance, which can prevent oversaturation.
             cfg_truncation (`float`, *optional*, defaults to 1.0):
                 A value between 0.0 and 1.0 that disables CFG for the final portion of the denoising steps,
+                specified as a fraction of total steps. For example, 0.8 disables CFG for the last 20% of steps.
             negative_prompt (`str` or `List[str]`, *optional*):
                 The prompt or prompts not to guide the image generation.
             num_images_per_prompt (`int`, *optional*, defaults to 1):
         is_two_stage_control_model = self.transformer.control_in_dim > self.transformer.in_channels if hasattr(self.transformer, "control_in_dim") else False
         device = self._execution_device
         dtype = self.transformer.dtype
+        vae_scale = self.vae_scale_factor * 2
+        has_inpaint_inputs = image is not None and mask_image is not None
+        is_inpaint_control_mode = has_inpaint_inputs and inpaint_mode in ["default", "diff+inpaint"]
+        is_diff_mode = has_inpaint_inputs and inpaint_mode in ["diff", "diff+inpaint"]
+        is_img2img_mode = image is not None and not has_inpaint_inputs
         ref_image = control_image or image
         image_height = None
         image_width = None
             prompt_embeds_model_input = prompt_embeds + negative_prompt_embeds
         else:
             prompt_embeds_model_input = prompt_embeds
+        if control_image is not None or is_inpaint_control_mode:
             control_latents = self.prepare_control_latents(control_image, width, height, batch_size, num_images_per_prompt, device, dtype)
+            if is_two_stage_control_model:
+                image_for_inpaint = None if is_diff_mode and not is_inpaint_control_mode else image
+                mask_for_inpaint = None if is_diff_mode and not is_inpaint_control_mode else mask_image
+                if is_inpaint_control_mode:
+                    mask_for_inpaint = self._apply_mask_blur(mask_for_inpaint, mask_blur_radius, True)
                 inpaint_latents = self._prepare_image_latents(
+                    image_for_inpaint, mask_for_inpaint, width, height, batch_size, num_images_per_prompt, device, dtype
                 )
                 mask_latents = self._prepare_mask_latents(
+                    mask_for_inpaint,
                     width,
                     height,
                     batch_size,
                     inpaint_latents.shape,
                     device,
                     dtype,
+                    invert_mask=is_inpaint_control_mode,
+                    do_unsqueeze=True,
                 )
                 control_context = torch.cat([control_latents, mask_latents, inpaint_latents], dim=1)
             else:
         timesteps, num_inference_steps = retrieve_timesteps(self.scheduler, num_inference_steps, device, sigmas, mu=mu)
         self._num_timesteps = len(timesteps)
+        if is_img2img_mode:
             strength = min(strength, 1.0)
         else:
             strength = 1.0
         latent_timestep = timesteps[:1].repeat(effective_batch_size) if strength < 1.0 else None
+        use_image_for_latents = is_img2img_mode
         latents = self.prepare_latents(
             effective_batch_size,
             self.transformer.in_channels,
             timestep=latent_timestep if use_image_for_latents else None,
             latents=latents,
         )
+        if is_diff_mode:
+            original_image_tensor = self.image_processor.preprocess(image, height=height, width=width).to(device=device, dtype=self.vae.dtype)
+            with torch.no_grad():
+                original_clean_latents = retrieve_latents(self.vae.encode(original_image_tensor), sample_mode="argmax")
+            original_clean_latents = (original_clean_latents - self.vae.config.shift_factor) * self.vae.config.scaling_factor
+            original_clean_latents = original_clean_latents.to(dtype)
+            noise = randn_tensor(original_clean_latents.shape, generator=generator, device=device, dtype=dtype)
+            latents_list = []
+            step_indices = [(self.scheduler.timesteps == t).nonzero().item() for t in timesteps]
+            for i in step_indices:
+                sigma = self.scheduler.sigmas[i]
+                noisy_latent = (1.0 - sigma) * original_clean_latents + sigma * noise
+                latents_list.append(noisy_latent)
+            original_latents_trajectory = torch.cat(latents_list, dim=0)
+            blurred_mask_image = self._apply_mask_blur(mask_image, mask_blur_radius, True)
+            map_processed = self._prepare_mask_latents(
+                blurred_mask_image,
+                width,
+                height,
+                batch_size,
+                num_images_per_prompt,
+                latents.shape,
+                device,
+                dtype,
+                invert_mask=True,
+                do_unsqueeze=False,
+            )
+            thresholds = torch.arange(len(timesteps), device=device, dtype=dtype) / len(timesteps)
+            thresholds = thresholds.view(-1, 1, 1, 1)
+            time_masks = map_processed > thresholds
         num_warmup_steps = len(timesteps) - num_steps_to_run * self.scheduler.order
         with torch.inference_mode():
             with self.progress_bar(total=num_steps_to_run) as progress_bar:
                 for i, t in enumerate(timesteps):
                     if self.interrupt:
                         continue
+                    if is_diff_mode:
+                        if i == 0:
+                            latents = original_latents_trajectory[:1]
+                        else:
+                            current_mask = time_masks[i].to(latents.dtype)
+                            current_original_latent = original_latents_trajectory[i:i+1]
+                            if current_mask.ndim == 3:
+                                current_mask = current_mask.unsqueeze(1)
+                            latents = current_original_latent * current_mask + latents * (1 - current_mask)
                     timestep = t.expand(latents.shape[0])
                     timestep = (1000 - timestep) / 1000
                     t_norm = timestep[0].item()
                     current_guidance_scale = self.guidance_scale
                     if self.do_classifier_free_guidance and self._cfg_truncation is not None and float(self._cfg_truncation) <= 1:
                         if t_norm > self._cfg_truncation:
                             current_guidance_scale = 0.0
                     apply_cfg = self.do_classifier_free_guidance and current_guidance_scale > 0
                     if apply_cfg:
+                        latent_model_input = latents.repeat(2, 1, 1, 1)
                         timestep_model_input = timestep.repeat(2)
                     else:
+                        latent_model_input = latents
                         timestep_model_input = timestep
+                    latent_model_input = latent_model_input.to(self.transformer.dtype)
                     latent_model_input = latent_model_input.unsqueeze(2)
                     latent_model_input_list = list(latent_model_input.unbind(dim=0))
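
The diff-mode logic in the hunk above is spread across setup code and the denoising loop, so here is a minimal NumPy sketch of the same three ideas: noising the clean original to a given sigma, deriving one boolean mask per step from a soft preserve map, and compositing. It assumes a single-channel `(H, W)` map where `1.0` means "keep the original" (the state after `invert_mask=True`). The helper names are hypothetical, and NumPy stands in for the pipeline's torch tensors.

```python
import numpy as np


def noised_original(clean: np.ndarray, noise: np.ndarray, sigma: float) -> np.ndarray:
    """Flow-matching interpolation between the clean latents and pure noise."""
    return (1.0 - sigma) * clean + sigma * noise


def build_time_masks(preserve_map: np.ndarray, num_steps: int) -> np.ndarray:
    """Return a (num_steps, H, W) boolean array; True means 'reset to the original'.

    Step i uses threshold i / num_steps, so higher preserve values stay locked
    to the original for more steps.
    """
    thresholds = np.arange(num_steps, dtype=np.float64) / num_steps
    return preserve_map[None, :, :] > thresholds[:, None, None]


def composite(latents: np.ndarray, original_at_step: np.ndarray, keep_mask: np.ndarray) -> np.ndarray:
    """Take the noised original where keep_mask is True, the current latents elsewhere."""
    keep = keep_mask.astype(latents.dtype)
    return original_at_step * keep + latents * (1.0 - keep)


# Toy 1x3 "latent": fully preserved, half preserved (feathered edge), fully regenerated.
preserve_map = np.array([[1.0, 0.5, 0.0]])
masks = build_time_masks(preserve_map, num_steps=4)
print(masks.sum(axis=0))  # number of steps each position is reset to the original

current = np.zeros((1, 3))   # stand-in for the model's latents at some step
original = np.ones((1, 3))   # stand-in for the noised original at that step
print(composite(current, original, masks[2]))
```

With `num_steps = 4`, the position with preserve value `0.5` is held to the noised original for the first two steps and left to the model afterwards, which is what lets a feathered mask edge blend gradually instead of producing a hard seam.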

infer_inpaint.py CHANGED

@@ -11,16 +11,14 @@ from diffusers_local import patch # Apply necessary patches for local diffusers
 from diffusers_local.pipeline_z_image_control_unified import ZImageControlUnifiedPipeline
 from diffusers_local.z_image_control_transformer_2d import ZImageControlTransformer2DModel
 os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True,garbage_collection_threshold:0.7,max_split_size_mb:1024"
 def main():
     # 1. Set params
-    BASE_MODEL_ID = "."
     GGUF_MODEL_FILE = "./transformer/z_image_turbo_control_unified_v2.1_q4_k_m.gguf"
     GGUF_MODEL_FILE = "./transformer/z_image_turbo_control_unified_v2.1_q8_0.gguf"
     use_gguf = True
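     # prompt="A young woman stands on a sunlit coastline, her white dress fluttering gently in the light sea breeze, the hem lifting lightly. She has long, vivid purple hair dancing in the wind, tied with a delicate black bow that contrasts sharply with the soft azure sky behind her. Her face is delicate, with refined features and fair, smooth skin, giving off a sweet, youthful air; her expression is gentle and slightly shy as she gazes quietly at the distant horizon, hands naturally folded in front of her, fingers clearly visible, all five fingers intact, knuckles natural, posture elegant and relaxed, as if lost in thought. The background is a vast, boundless, sparkling sea; sunlight falls on the water, casting a warm golden glow, waves gently lap the beach, and the sky is deep blue with thin clouds. The overall image is high-definition and sharp, rich in detail, vividly colored, in clear focus, 8K resolution, masterpiece, best quality, no blur, no noise, no distortion, natural lighting, cinematic rendering."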
@@ -29,13 +27,14 @@ def main():
     target_height = 1728
     target_width = 992
-    num_inference_steps = 20
     guidance_scale = 0  # 2.5
     controlnet_conditioning_scale = 0.7
     controlnet_conditioning_refiner_scale = 0.75
-    mask_blur_radius = 8.0
-    seed = 42
     shift = 3.0
     generator = torch.Generator("cuda").manual_seed(seed)
     print("Loading Pipeline...")
@@ -74,8 +73,7 @@ def main():
     pose_image = load_image("assets/pose.jpg")
     inpaint_image = load_image("assets/inpaint.jpg")
-    mask_image = load_image("assets/mask_inpaint.jpg")
     start_inference_time = time.time()
     generated_image = pipe(
@@ -84,7 +82,8 @@ def main():
         image=inpaint_image,
         control_image=pose_image,
         mask_image=mask_image,
-        mask_blur_radius=mask_blur_radius,
         height=target_height,
         width=target_width,
         num_inference_steps=num_inference_steps,

 from diffusers_local.pipeline_z_image_control_unified import ZImageControlUnifiedPipeline
 from diffusers_local.z_image_control_transformer_2d import ZImageControlTransformer2DModel
 os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True,garbage_collection_threshold:0.7,max_split_size_mb:1024"
 def main():
     # 1. Set params
+    BASE_MODEL_ID = "."
     GGUF_MODEL_FILE = "./transformer/z_image_turbo_control_unified_v2.1_q4_k_m.gguf"
     GGUF_MODEL_FILE = "./transformer/z_image_turbo_control_unified_v2.1_q8_0.gguf"
     use_gguf = True
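     # prompt="A young woman stands on a sunlit coastline, her white dress fluttering gently in the light sea breeze, the hem lifting lightly. She has long, vivid purple hair dancing lightly in the wind, tied with a delicate black bow that contrasts sharply with the soft azure sky behind her. Her face is delicate, with fine features and fair, smooth skin, giving off a sweet, youthful air; her expression is gentle and slightly shy, her gaze resting quietly on the distant horizon, her hands naturally folded in front of her, fingers clearly visible, all five fingers intact, natural knuckles, posture elegant and relaxed, as if lost in thought. The background is a vast, boundless, shimmering sea; sunlight falls on the water, casting a warm golden glow, waves lap gently at the beach, and the sky is deep blue with thin clouds. The overall image is high-definition and sharp, richly detailed, vividly colored, in clear focus, 8K resolution, masterpiece, best quality, no blur, no noise, no distortion, natural lighting, cinematic rendering."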
     target_height = 1728
     target_width = 992
+    num_inference_steps = 25
     guidance_scale = 0  # 2.5
     controlnet_conditioning_scale = 0.7
     controlnet_conditioning_refiner_scale = 0.75
+    mask_blur_radius = 12
+    seed = 48
     shift = 3.0
+    inpaint_mode = "diff+inpaint"  # ("default", "diff", "diff+inpaint")
     generator = torch.Generator("cuda").manual_seed(seed)
     print("Loading Pipeline...")
     pose_image = load_image("assets/pose.jpg")
     inpaint_image = load_image("assets/inpaint.jpg")
+    mask_image = load_image("assets/inpaint_mask.jpg")
     start_inference_time = time.time()
     generated_image = pipe(
         image=inpaint_image,
         control_image=pose_image,
         mask_image=mask_image,
+        mask_blur_radius=mask_blur_radius,
+        inpaint_mode=inpaint_mode,
         height=target_height,
         width=target_width,
         num_inference_steps=num_inference_steps,
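
The new `inpaint_mode` values refer to differential diffusion, per the commit title. The pipeline's own implementation is not shown in this part of the diff, so the following is only a sketch of the general idea under stated assumptions: a soft mask is compared against a per-step threshold, so strongly masked pixels are released for repainting early and weakly masked pixels late. `editable_schedule` and `blend` are hypothetical names, and the convention (1.0 = fully repaint, 0.0 = keep the source) is an assumption.

```python
def editable_schedule(change_strength, num_steps):
    """For one pixel, return a list of booleans: True where a step may repaint it.

    Assumption: change_strength is in [0, 1], with 1.0 meaning a white mask
    pixel (fully repaint) and 0.0 a black one (keep the source image).
    """
    schedule = []
    for step in range(num_steps):
        threshold = step / num_steps
        # The pixel stays pinned to the (re-noised) source while
        # (1 - strength) is still above the current threshold.
        pinned = (1.0 - change_strength) > threshold
        schedule.append(not pinned)
    return schedule


def blend(source_value, generated_value, editable):
    """Per-pixel blend: use the denoised value only where the step is editable."""
    return generated_value if editable else source_value


if __name__ == "__main__":
    for strength in (0.0, 0.5, 1.0):
        print(strength, editable_schedule(strength, num_steps=4))
```

A feathered mask edge (see `mask_blur_radius`) yields intermediate strengths, which is what lets this kind of schedule blend the repainted region into its surroundings gradually rather than at a hard boundary.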

prepare_mask.py ADDED

@@ -0,0 +1,101 @@

+import argparse
+from PIL import Image, ImageFilter
+def expand_and_feather_mask(mask_image: Image.Image, expand_pixels: int = 10, feather_radius: int = 8) -> Image.Image:
+    """
+    Expands the white area of a mask and then smooths its edges using Pillow filters.
+    This is useful for preparing inpainting masks to ensure complete coverage of the
+    area to be replaced and to create a smooth blend with the surrounding image.
+    Args:
+        mask_image (PIL.Image.Image): The input black-and-white mask; white marks
+            the area to be replaced.
+        expand_pixels (int): The number of pixels to expand (dilate) the white
+            area. This helps to cover any "ghosting" from the old image.
+        feather_radius (int): The radius of the Gaussian blur used to create the
+            soft edge (feathering) effect.
+    Returns:
+        PIL.Image.Image: The processed mask with expanded and feathered edges.
+    """
+    # Ensure the mask is in 'L' (grayscale) mode for the filters to work correctly.
+    mask = mask_image.convert("L")
+    # 1. Expansion (Dilation)
+    # The MaxFilter finds the brightest pixel in a kernel window and replaces the
+    # center pixel with it. For a black and white image, this causes the white
+    # areas to expand.
+    if expand_pixels > 0:
+        # The filter size must be an odd number. The formula (pixels * 2 + 1)
+        # creates a kernel of the correct odd size.
+        expand_size = expand_pixels * 2 + 1
+        print(f"Expanding mask by {expand_pixels} pixels (filter size: {expand_size}x{expand_size})...")
+        mask = mask.filter(ImageFilter.MaxFilter(size=expand_size))
+    # 2. Feathering (Gaussian Blur)
+    # Applies a Gaussian blur to the expanded mask, creating a smooth
+    # gradient from white to black at the edges.
+    if feather_radius > 0:
+        print(f"Feathering mask with a radius of {feather_radius} pixels...")
+        mask = mask.filter(ImageFilter.GaussianBlur(radius=feather_radius))
+    return mask
+def main():
+    """Main function to parse arguments and process the mask."""
+    parser = argparse.ArgumentParser(description="Expand and feather an inpainting mask.")
+    parser.add_argument(
+        "input_path",
+        type=str,
+        help="Path to the input mask image file."
+    )
+    parser.add_argument(
+        "output_path",
+        type=str,
+        help="Path to save the processed output mask image file."
+    )
+    parser.add_argument(
+        "--expand",
+        type=int,
+        default=10,
+        help="Number of pixels to expand the white areas of the mask. Default is 10."
+    )
+    parser.add_argument(
+        "--feather",
+        type=int,
+        default=8,
+        help="Radius in pixels for the Gaussian blur (feathering) effect. Default is 8."
+    )
+    args = parser.parse_args()
+    try:
+        # Load the input mask
+        print(f"Loading mask from: {args.input_path}")
+        original_mask = Image.open(args.input_path)
+    except FileNotFoundError:
+        print(f"Error: Input file not found at '{args.input_path}'")
+        return
+    except Exception as e:
+        print(f"Error loading image: {e}")
+        return
+    # Process the mask using the function
+    processed_mask = expand_and_feather_mask(
+        original_mask,
+        expand_pixels=args.expand,
+        feather_radius=args.feather
+    )
+    # Save the final mask
+    try:
+        print(f"Saving processed mask to: {args.output_path}")
+        processed_mask.save(args.output_path)
+        print("Done!")
+    except Exception as e:
+        print(f"Error saving image: {e}")
+if __name__ == "__main__":
+    main()
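
The kernel-size formula `expand_pixels * 2 + 1` in `expand_and_feather_mask` can be checked on a single row of pixels. The sketch below is a pure-Python stand-in for a max filter, not Pillow's implementation (border handling may differ); `dilate_1d` is a hypothetical helper used only for illustration.

```python
def dilate_1d(row, expand_pixels):
    """Apply a max filter with a (2 * expand_pixels + 1)-wide window to one row."""
    size = expand_pixels * 2 + 1  # same odd-kernel formula as prepare_mask.py
    half = size // 2
    out = []
    for i in range(len(row)):
        # Clamp the window at the left border; slicing clamps the right one.
        window = row[max(0, i - half): i + half + 1]
        out.append(max(window))
    return out


if __name__ == "__main__":
    row = [0, 0, 0, 0, 255, 255, 0, 0, 0, 0]
    print(dilate_1d(row, expand_pixels=2))
```

With `expand_pixels=2`, the two-pixel white run grows by two pixels on each side, which is why the window has to be centered and therefore odd.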

results/new_tests/{result_inpaint.png → result_inpaint_2.png} RENAMED

File without changes

results/new_tests/result_inpaint_default.png ADDED

Git LFS Details

SHA256: baf856d1e4e581cbced169a801c0e90efc00b37d117be19bee892d1865d511c2
Pointer size: 132 Bytes
Size of remote file: 1.92 MB

results/new_tests/result_inpaint_diff.png ADDED

Git LFS Details

SHA256: 1bc7cc72bc1959e3ab65bbc2b23d3c57ebf93ea4ca78671fbdc9a8ab38e8e6bd
Pointer size: 132 Bytes
Size of remote file: 1.7 MB

results/new_tests/result_inpaint_diffinpaint.png ADDED

Git LFS Details

SHA256: 60099065d5601a81ea53518d49c8f8d6902174dea9e39d4eb5d3ea1b34c9cf5f
Pointer size: 132 Bytes
Size of remote file: 1.71 MB