Difference Inversion: Interpolate and Isolate the Difference with Token Consistency

Hyunsoo Kim1,2, Donghyun Kim1†, Suhyun Kim3†
1Korea University   2Korea Institute of Science and Technology   3Kyung Hee University

CVPR 2025

Teaser: Comparison of A → A' → B → B' results across methods

Figure: The goal of image analogy generation is to generate a plausible B' from the image triplet {A, A', B}, such that it satisfies the image analogy formulation A:A'::B:B'.

Abstract

How can we generate an image B' that satisfies A:A'::B:B', given the input images A, A', and B? Recent works have tackled this challenge through approaches like visual in-context learning or visual instruction. However, these methods are typically limited to specific models (e.g., InstructPix2Pix, inpainting models) rather than general diffusion models (e.g., Stable Diffusion, SDXL). This dependency may lead to inherited biases or reduced editing capability. In this paper, we propose Difference Inversion, a method that isolates only the difference between A and A' and applies it to B to generate a plausible B'. To address model dependency, it is crucial to structure prompts in the form of a "Full Prompt" suitable for input to Stable Diffusion models, rather than an "Instruction Prompt". To this end, we accurately extract the Difference between A and A' and combine it with the prompt of B, enabling plug-and-play application of the difference. To extract a precise difference, we first identify it through 1) Delta Interpolation. Additionally, to ensure accurate training, we propose 2) a Token Consistency Loss and 3) Zero Initialization of Token Embeddings. Our extensive experiments demonstrate that Difference Inversion outperforms existing baselines both quantitatively and qualitatively, indicating its ability to generate a more feasible B' in a model-agnostic manner.
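The mechanism can be made concrete with a short sketch. The PyTorch snippet below is a minimal illustration under stated assumptions, not the paper's implementation: the embedding width, the number of difference tokens, the MSE pairing in the consistency loss, and the concatenation used to form the "Full Prompt" are all illustrative choices. For intuition on the prompt distinction: an instruction prompt such as "add snow" presumes an instruction-tuned editor, whereas a full prompt such as "a snowy mountain village" can condition any text-to-image diffusion model.

import torch
import torch.nn.functional as F

# Illustrative sizes (assumptions, not the paper's hyperparameters):
# a CLIP-style text-embedding width and a handful of difference tokens.
EMBED_DIM = 768
NUM_DIFF_TOKENS = 4

# 3) Zero Initialization of Token Embeddings: the learnable difference
# tokens start at zero, so appending them to B's prompt initially leaves
# the generation unchanged.
diff_tokens = torch.nn.Parameter(torch.zeros(NUM_DIFF_TOKENS, EMBED_DIM))

def delta_interpolation(emb_a: torch.Tensor, emb_a_prime: torch.Tensor,
                        alpha: float) -> torch.Tensor:
    # 1) Delta Interpolation (sketch): interpolate between the embeddings
    # of A and A', so supervision traces the A -> A' difference rather
    # than either endpoint alone.
    return (1.0 - alpha) * emb_a + alpha * emb_a_prime

def token_consistency_loss(tokens_i: torch.Tensor,
                           tokens_j: torch.Tensor) -> torch.Tensor:
    # 2) Token Consistency Loss (sketch): penalize drift between the
    # difference tokens recovered under different interpolation steps,
    # encouraging one difference that explains the whole transition.
    return F.mse_loss(tokens_i, tokens_j)

# Toy usage with random vectors standing in for real text embeddings.
emb_a, emb_a_prime = torch.randn(EMBED_DIM), torch.randn(EMBED_DIM)
target_mid = delta_interpolation(emb_a, emb_a_prime, alpha=0.5)
tokens_i = diff_tokens + 0.01 * torch.randn(NUM_DIFF_TOKENS, EMBED_DIM)
tokens_j = diff_tokens + 0.01 * torch.randn(NUM_DIFF_TOKENS, EMBED_DIM)
loss = token_consistency_loss(tokens_i, tokens_j)

# Plug-and-play composition (sketch): append the learned difference tokens
# to B's encoded prompt to form a "Full Prompt" for a standard
# text-to-image diffusion model. 77 is a stand-in prompt length.
prompt_b_emb = torch.randn(77, EMBED_DIM)
full_prompt_emb = torch.cat([prompt_b_emb, diff_tokens], dim=0)

Under this reading, zero initialization makes the appended tokens a no-op before optimization, so whatever they learn is attributable to the A → A' difference rather than to the initialization.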

Quantitative Results

Figure: Quantitative comparison
  • Top: CLIP and DINO directional similarity, which measures how well the generated B' follows the A:A' transformation in embedding space, indicating semantic and visual fidelity (a sketch of this metric follows this list).
  • Bottom-left: Human evaluation, which collects perceptual ratings from human annotators to assess the realism of B' and its adherence to the intended analogy, providing a subjective quality check.
  • Bottom-right: Large-scale VLM evaluation, which uses large vision–language models to automatically score how plausibly B' satisfies the A:A'::B:B' relationship, enabling scalable, automated assessment.
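For reference, the top-panel metric can be reproduced in a few lines. The sketch below computes CLIP directional similarity as the cosine similarity between the edit direction E(A') − E(A) and the generated direction E(B') − E(B) in CLIP image-embedding space, using the Hugging Face transformers CLIP API; the checkpoint name and image paths are placeholder assumptions, and the DINO variant is analogous with a DINO image encoder in place of CLIP.

import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder checkpoint and file names (assumptions for illustration).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

paths = ["A.png", "A_prime.png", "B.png", "B_prime.png"]
images = [Image.open(p).convert("RGB") for p in paths]

with torch.no_grad():
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)  # shape: (4, embed_dim)
    feats = F.normalize(feats, dim=-1)

# Directional similarity: cosine between the A -> A' edit direction and
# the B -> B' generated direction.
dir_edit = F.normalize(feats[1] - feats[0], dim=0)
dir_gen = F.normalize(feats[3] - feats[2], dim=0)
directional_similarity = (dir_edit * dir_gen).sum().item()
print(f"CLIP directional similarity: {directional_similarity:.4f}")

Higher values indicate that B → B' moves through embedding space in the same direction as A → A'.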

BibTeX

@inproceedings{kim2025difference,
  title     = {Difference Inversion: Interpolate and Isolate the Difference with Token Consistency for Image Analogy Generation},
  author    = {Kim, Hyunsoo and Kim, Donghyun and Kim, Suhyun},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025},
}