ResEdit: Residual embeddings for precise generative image editing

1University of Cambridge    2Adobe Research
EGSR 2026 (Journal Track)
ResEdit overview

Our framework enables high-fidelity generative editing by isolating image identity into a learned residual image embedding. Unlike traditional inversion methods that struggle with "baked-in" condition features, ResEdit explicitly separates identity from the physical condition. This makes it straightforward to manipulate geometry and material in intrinsic space and to relight an image from a reference. By shifting the burden of reconstruction from the noise latent to this dedicated residual channel, we achieve a better balance between identity preservation and responsive editability, without requiring model surgery.

Abstract

Conditional diffusion image generators can be repurposed for editing through inversion, without the need for large-scale paired fine-tuning data. However, producing high-quality, targeted edits while maintaining image identity and global consistency remains challenging, as weakly conditioned inversion often embeds conflicting image features into the noise. We demonstrate that incorporating a residual image encoding as additional conditioning improves both identity preservation and editability. We optimize this residual encoding to provide a strong conditioning signal for reconstruction, which reduces the reliance on inversion and, with it, the susceptibility to these pitfalls. To ensure that the residual does not interfere with desired edits, we use a gradient-reversal-based optimization strategy that disentangles the residual from the edited condition. We illustrate our method's ability to produce high-fidelity results across precise intrinsic-based editing and relighting, and show proof-of-concept text-guided manipulation.

Method

ResEdit method overview
Method overview. Our framework enables high-fidelity editing through a three-stage process: (1) Given an input image and its corresponding condition (e.g., albedo), we optimize a residual embedding to capture the input identity. An adversarial loss disentangles the residual from the condition, so that the residual encodes only the identity information missing from the condition. (2) The input image is inverted under the joint conditioning signal (condition plus residual) to obtain a noise latent. Because the residual narrows the generative distribution, this inversion is more stable and often optional. (3) The edited condition is combined with the fixed residual and noise to produce the final edited image, which preserves the source identity while faithfully reflecting the conditional manipulation.
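The adversarial disentanglement in stage (1) relies on gradient reversal. The snippet below is a minimal, framework-free sketch of that general technique, not the authors' implementation. The residual is a toy vector, the condition is reduced to a single scalar, the adversary is linear, and the names `GradientReversal`, `adversary_loss`, `adversarial_step`, `strength`, and `lr` are all hypothetical. The reconstruction loss that would normally be optimized alongside is omitted.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = List[float]


@dataclass
class GradientReversal:
    """Identity in the forward pass; negates and scales gradients in the backward pass."""

    strength: float = 1.0  # hypothetical weighting, not a value from the paper

    def forward(self, x: Vector) -> Vector:
        return list(x)

    def backward(self, grad_output: Vector) -> Vector:
        return [-self.strength * g for g in grad_output]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def adversary_loss(weights: Vector, residual: Vector, condition: float) -> float:
    """Squared error of a toy linear adversary that predicts the condition from the residual."""
    return (dot(weights, residual) - condition) ** 2


def adversarial_step(
    weights: Vector,
    residual: Vector,
    condition: float,
    grl: GradientReversal,
    lr: float = 0.05,
) -> Tuple[Vector, Vector]:
    """One simultaneous gradient step on the adversary and the residual.

    The adversary descends its loss. The residual receives the reversed
    gradient, so the same descent update pushes it to make the condition
    harder to predict.
    """
    features = grl.forward(residual)
    error = dot(weights, features) - condition
    grad_weights = [2.0 * error * f for f in features]
    grad_features = [2.0 * error * w for w in weights]
    grad_residual = grl.backward(grad_features)  # sign flipped here
    new_weights = [w - lr * g for w, g in zip(weights, grad_weights)]
    new_residual = [r - lr * g for r, g in zip(residual, grad_residual)]
    return new_weights, new_residual
```

Calling `adversarial_step([1.0, 0.0], [1.0, 1.0], 0.0, GradientReversal())` lowers the adversary's loss when only the weights are updated and raises it when only the residual is updated, which is the opposing pressure that discourages the residual from carrying condition information.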

Results

Material editing result

Material editing. We showcase diverse surface appearance edits on intrinsic channels, including normal (top), albedo (middle), and roughness (bottom). Our method (2nd col.) produces more plausible results that better align with the user edit than IntrinsicEdit (3rd col.). Text-based methods (right) struggle to produce precise edits because natural language cannot specify them exactly. Best viewed zoomed in.

Object removal, insertion and translation results

Editing examples. The rows show object removal, insertion, and translation, respectively. The initial and edited intrinsic channels (e.g., edited albedo) are shown in the insets. Compared to IntrinsicEdit, our method faithfully reconstructs unedited areas while resynthesizing plausible details in edited regions (e.g., shadows and reflections). The comparison with two text-based methods, Flux.1-Kontext and Nano Banana (rightmost columns), shows that text-based editing does not allow precise control (e.g., the rainbow pattern on the mug) and does not faithfully follow even detailed descriptions (apple example).

Intrinsic consistency analysis

Intrinsic-consistency trade-off analysis. The two axes show edit error and identity error, with lower values preferred. Our full method lies on the favorable trade-off frontier, achieving the lowest edit error and identity preservation comparable to the strongest baseline.
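To make the notion of a trade-off frontier concrete, the following sketch computes which methods are Pareto-optimal when both edit error and identity error should be low. The function name `pareto_frontier` and the numbers in the usage example are placeholders for illustration only; they are not measurements from the paper.

```python
from typing import Dict, List, Tuple


def pareto_frontier(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the names of points that no other point dominates.

    Each point is (edit_error, identity_error); lower is better on both axes.
    A point is dominated if another point is at least as good on both axes
    and strictly better on at least one.
    """
    frontier = []
    for name, (edit_err, id_err) in points.items():
        dominated = any(
            other != name
            and o_edit <= edit_err
            and o_id <= id_err
            and (o_edit < edit_err or o_id < id_err)
            for other, (o_edit, o_id) in points.items()
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Placeholder values, not results from the paper.
example = {
    "method_a": (0.10, 0.30),
    "method_b": (0.40, 0.25),
    "method_c": (0.45, 0.35),
}
frontier = pareto_frontier(example)
```

In this made-up example, `method_c` is dominated by `method_b`, while `method_a` and `method_b` each win on one axis and therefore both lie on the frontier.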

Relighting results

Relighting. Given a lighting description, either a text prompt (first two examples) or a reference illumination (rightmost example), we encode it into the UniLight latent space. We then use the resulting lighting tokens as input conditions for our method, achieving plausible, realistic relighting. The vanilla UniLight approach suffers from identity drift (e.g., the vase material in the bottom left) because it is based on a simple RGB→X→RGB pipeline that relies solely on intrinsic channels for identity preservation.