TY - JOUR
T1 - Dissecting and Mitigating Semantic Discrepancy in Stable Diffusion for Image-to-Image Translation
AU - Yuan, Yifan
AU - Yang, Guanqun
AU - Wang, James Z.
AU - Zhang, Hui
AU - Shan, Hongming
AU - Wang, Fei-Yue
AU - Zhang, Junping
N1 - Publisher Copyright:
© 2014 Chinese Association of Automation.
PY - 2025
Y1 - 2025
N2 - Finding suitable initial noise that retains the original image's information is crucial for image-to-image (I2I) translation using text-to-image (T2I) diffusion models. A common approach is to add random noise directly to the original image, as in SDEdit. However, we have observed that this can result in “semantic discrepancy” issues, wherein T2I diffusion models misinterpret the semantic relationships and generate content not present in the original image. We identify that the noise introduced by SDEdit disrupts the semantic integrity of the image, leading to unintended associations between unrelated regions after U-Net upsampling. Building on the widely used latent diffusion model, Stable Diffusion, we propose a training-free, plug-and-play method to alleviate semantic discrepancy and enhance the fidelity of the translated image. By leveraging the deterministic nature of denoising diffusion implicit model (DDIM) inversion, we correct the erroneous features and correlations from the original generative process with accurate ones from DDIM inversion. This approach alleviates semantic discrepancy and surpasses recent DDIM-inversion-based methods such as PnP with fewer priors, achieving an 11.2-times speedup in experiments conducted on the COCO, ImageNet, and ImageNet-R datasets across multiple I2I translation tasks.
AB - Finding suitable initial noise that retains the original image's information is crucial for image-to-image (I2I) translation using text-to-image (T2I) diffusion models. A common approach is to add random noise directly to the original image, as in SDEdit. However, we have observed that this can result in “semantic discrepancy” issues, wherein T2I diffusion models misinterpret the semantic relationships and generate content not present in the original image. We identify that the noise introduced by SDEdit disrupts the semantic integrity of the image, leading to unintended associations between unrelated regions after U-Net upsampling. Building on the widely used latent diffusion model, Stable Diffusion, we propose a training-free, plug-and-play method to alleviate semantic discrepancy and enhance the fidelity of the translated image. By leveraging the deterministic nature of denoising diffusion implicit model (DDIM) inversion, we correct the erroneous features and correlations from the original generative process with accurate ones from DDIM inversion. This approach alleviates semantic discrepancy and surpasses recent DDIM-inversion-based methods such as PnP with fewer priors, achieving an 11.2-times speedup in experiments conducted on the COCO, ImageNet, and ImageNet-R datasets across multiple I2I translation tasks.
UR - https://www.scopus.com/pages/publications/105002829482
UR - https://www.scopus.com/pages/publications/105002829482#tab=citedBy
U2 - 10.1109/JAS.2024.124800
DO - 10.1109/JAS.2024.124800
M3 - Article
AN - SCOPUS:105002829482
SN - 2329-9266
VL - 12
SP - 705
EP - 718
JO - IEEE/CAA Journal of Automatica Sinica
JF - IEEE/CAA Journal of Automatica Sinica
IS - 4
ER -