Fixing Diffusion Models’ Limited Understanding Of Mirrors And Reflections

Since generative AI began to garner public interest, the computer vision research field has deepened its interest in developing AI models capable of understanding and replicating physical laws; however, the challenge of teaching machine learning systems to simulate phenomena such as gravity and fluid dynamics has been a significant focus of research efforts for at least the past five years.

Since latent diffusion models (LDMs) came to dominate the generative AI scene in 2022, researchers have increasingly focused on the LDM architecture's limited capacity to understand and reproduce physical phenomena. Now, this issue has gained further prominence with the landmark development of OpenAI's generative video model Sora, and the (arguably) more consequential recent release of the open source video models Hunyuan Video and Wan 2.1.

Reflecting Badly

Most research aimed at improving LDM understanding of physics has focused on areas such as gait simulation, particle physics, and other aspects of Newtonian motion. These areas have attracted attention because inaccuracies in basic physical behaviors would immediately undermine the authenticity of AI-generated video.

However, a small but growing strand of research concentrates on one of LDMs' biggest weaknesses – their relative inability to produce accurate reflections.

From the January 2025 paper ‘Reflecting Reality: Enabling Diffusion Models to Produce Faithful Mirror Reflections', examples of ‘reflection failure' versus the researchers' own approach. Source: https://arxiv.org/pdf/2409.14677

This issue was also a challenge during the CGI era and remains so in the field of video gaming, where ray-tracing algorithms simulate the path of light as it interacts with surfaces. Ray-tracing calculates how virtual light rays bounce off or pass through objects to create realistic reflections, refractions, and shadows.

However, because each additional bounce greatly increases computational cost, real-time applications must trade off latency against accuracy by limiting the number of allowed light-ray bounces.

A representation of a virtually-calculated light-beam in a traditional 3D-based (i.e., CGI) scenario, using technologies and principles first developed in the 1960s, and which came to fruition between 1982-93 (the span between ‘Tron' [1982] and ‘Jurassic Park' [1993]). Source: https://www.unrealengine.com/en-US/explainers/ray-tracing/what-is-real-time-ray-tracing

For instance, depicting a chrome teapot in front of a mirror could involve a ray-tracing process where light rays bounce repeatedly between reflective surfaces, creating an almost infinite loop with little practical benefit to the final image. In most cases, a reflection depth of two to three bounces already exceeds what the viewer can perceive. A single bounce would result in a black mirror, since the light must complete at least two journeys to form a visible reflection.

Each additional bounce sharply increases computational cost, often doubling render times, making faster handling of reflections one of the most significant opportunities for improving ray-traced rendering quality.
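To make this trade-off concrete, the minimal, self-contained Python sketch below traces a toy scene – a single diffuse sphere above a mirror floor, nothing like production ray-tracing code – and simply discards the contribution of any reflection beyond MAX_BOUNCES:

```python
# Toy sketch of a depth-limited reflection trace: one diffuse sphere above a
# mirror floor. Raising MAX_BOUNCES roughly doubles the work per mirror hit.
import numpy as np

MAX_BOUNCES = 3  # a typical real-time budget is two to three bounces

def reflect(d, n):
    """Mirror a (unit) ray direction about a unit surface normal."""
    return d - 2.0 * np.dot(d, n) * n

def hit_sphere(o, d, centre, radius):
    """Return the nearest positive intersection distance, or None (d is unit length)."""
    oc = o - centre
    b = np.dot(oc, d)
    disc = b * b - (np.dot(oc, oc) - radius * radius)
    if disc < 0:
        return None
    t = -b - np.sqrt(disc)
    return t if t > 1e-4 else None

def trace(o, d, depth=0):
    """Return a colour for one ray, recursing only while depth < MAX_BOUNCES."""
    if depth >= MAX_BOUNCES:                               # bounce budget exhausted
        return np.zeros(3)
    t_sphere = hit_sphere(o, d, np.array([0.0, 1.0, -3.0]), 1.0)
    t_floor = -o[1] / d[1] if d[1] < -1e-6 else None       # mirror plane at y = 0
    if t_sphere is not None and (t_floor is None or t_sphere < t_floor):
        return np.array([0.9, 0.2, 0.2])                   # diffuse red sphere: stop here
    if t_floor is not None:
        p = o + t_floor * d
        n = np.array([0.0, 1.0, 0.0])
        # Mirror floor: 80% of the colour comes from the reflected ray.
        return 0.8 * trace(p + 1e-4 * n, reflect(d, n), depth + 1)
    return np.array([0.5, 0.7, 1.0])                       # 'sky' colour

d = np.array([0.0, -0.4, -1.0])
print(trace(np.array([0.0, 1.0, 0.0]), d / np.linalg.norm(d)))
```

In a real renderer the recursion happens per pixel and per light path, which is why capping the depth (or handling reflections more cheaply) has such an outsized effect on frame times.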

Naturally, reflections occur, and are essential to photorealism, in far less obvious scenarios – such as the reflective surface of a city street or a battlefield after the rain; the reflection of the opposing street in a shop window or glass doorway; or in the glasses of depicted characters, where objects and environments may be required to appear.

A simulated twin-reflection achieved via traditional compositing for an iconic scene in ‘The Matrix' (1999).

Image Problems

For this reason, frameworks that were popular prior to the advent of diffusion models, such as Neural Radiance Fields (NeRF), and some more recent challengers such as Gaussian Splatting, have had their own struggles to enact reflections in a natural way.

The REF2-NeRF project (pictured below) proposed a NeRF-based modeling method for scenes containing a glass case. In this method, refraction and reflection were modeled using elements that were dependent on, and independent of, the viewer's perspective. This approach allowed the researchers to estimate the surfaces where refraction occurred, specifically glass surfaces, and enabled the separation and modeling of both direct and reflected light components.

Examples from the Ref2Nerf paper. Source: https://arxiv.org/pdf/2311.17116

Other NeRF-facing reflection solutions of the past 4-5 years have included NeRFReN, Reflecting Reality, and Meta's 2024 Planar Reflection-Aware Neural Radiance Fields project.

For GSplat, papers such as Mirror-3DGS, Reflective Gaussian Splatting, and RefGaussian have offered solutions to the reflection problem, while the 2023 Nero project proposed a bespoke method of incorporating reflective qualities into neural representations.

MirrorVerse

Getting a diffusion model to respect reflection logic is arguably more difficult than with explicitly structural, non-semantic approaches such as Gaussian Splatting and NeRF. In diffusion models, a rule of this kind is only likely to become reliably embedded if the training data contains many varied examples across a wide range of scenarios, making it heavily dependent on the distribution and quality of the original dataset.

Traditionally, adding particular behaviors of this kind is the purview of a LoRA or the fine-tuning of the base model; but these are not perfect solutions, since a LoRA tends to skew output towards its own training data, even without prompting, while fine-tunes – besides being costly – can fork a major model irrevocably away from the mainstream, and engender a host of related custom tools that will never work with any other strain of the model, including the original one.

In general, improving diffusion models requires that the training data pay greater attention to the physics of reflection. However, many other areas are also in need of similar special attention. In the context of hyperscale datasets, where custom curation is costly and difficult, addressing every single weakness in this way is impractical.

Nonetheless, solutions to the LDM reflection problem do crop up now and again. One recent such effort, from India, is the MirrorVerse project, which offers an improved dataset and training method capable of improving on the state-of-the-art in this particular challenge in diffusion research.

Rightmost, the results from MirrorVerse pitted against two prior approaches (central two columns). Source: https://arxiv.org/pdf/2504.15397

As we can see in the example above (the feature image in the PDF of the new study), MirrorVerse improves on recent offerings tackling the same problem, but is far from perfect.

In the upper right image, we see that the ceramic jars are slightly to the right of where they should be, and in the image below, which should technically not feature a reflection of the cup at all, an inaccurate reflection has been shoehorned into the right-hand area, against the logic of natural reflective angles.

Therefore we'll take a look at the new method not so much because it may represent the current state-of-the-art in diffusion-based reflection, but rather to illustrate the extent to which this may prove to be an intractable issue for latent diffusion models, static and video alike, since the requisite data examples of reflectivity are most likely to be entangled with particular actions and scenarios.

Therefore this particular functionality of LDMs may continue to fall short of structure-specific approaches such as NeRF, GSplat, and also traditional CGI.

The new paper is titled MirrorVerse: Pushing Diffusion Models to Realistically Reflect the World, and comes from three researchers across the Vision and AI Lab, IISc Bangalore, and the Samsung R&D Institute at Bangalore. The paper has an associated project page, as well as a dataset at Hugging Face, with source code released at GitHub.

Method

The researchers note from the outset the difficulty that models such as Stable Diffusion and Flux have in respecting reflection-based prompts, illustrating the issue adroitly:

From the paper: current state-of-the-art text-to-image models, SD3.5 and Flux, exhibiting significant challenges in producing consistent and geometrically accurate reflections when prompted to generate them in a scene.
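Readers can probe this failure mode for themselves; the sketch below assumes access to the publicly released SD3.5 weights via the diffusers library, and uses a model ID, prompt and seeds of our own choosing rather than settings from the paper:

```python
# Illustrative probe of a base model with a reflection-heavy prompt; the model
# ID, prompt and seeds below are our own choices, not taken from the paper.
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.bfloat16
).to("cuda")

prompt = ("a red ceramic vase on a wooden floor directly in front of a large "
          "wall mirror, showing a geometrically correct reflection of the vase")

# Fixed seeds make the geometric errors easy to compare across runs.
for seed in (0, 1, 2, 3):
    generator = torch.Generator(device="cuda").manual_seed(seed)
    image = pipe(prompt, generator=generator).images[0]
    image.save(f"reflection_probe_seed{seed}.png")
```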

The researchers have developed MirrorFusion 2.0, a diffusion-based generative model aimed at improving the photorealism and geometric accuracy of mirror reflections in synthetic imagery. Training for the model was based on the researchers' own newly-curated dataset, titled MirrorGen2, designed to address the generalization weaknesses observed in previous approaches.

MirrorGen2 expands on earlier methodologies by introducing random object positioning, randomized rotations, and explicit object grounding, with the goal of ensuring that reflections remain plausible across a wider range of object poses and placements relative to the mirror surface.

Schema for the generation of synthetic data in MirrorVerse: the dataset generation pipeline applied central augmentations by randomly positioning, rotating, and grounding objects within the scene using the 3D-Positioner. Objects are also paired in semantically consistent combinations to simulate complex spatial relationships and occlusions, allowing the dataset to capture more realistic interactions in multi-object scenes.

To further strengthen the model's ability to handle complex spatial arrangements, the MirrorGen2 pipeline incorporates paired object scenes, enabling the system to better represent occlusions and interactions between multiple elements in reflective settings.

The paper states:

‘Categories are manually paired to ensure semantic coherence – for instance, pairing a chair with a table. During rendering, after positioning and rotating the primary [object], an additional [object] from the paired category is sampled and arranged to prevent overlap, ensuring distinct spatial regions within the scene.'

In regard to explicit object grounding, here the authors ensured that the generated objects were ‘anchored' to the ground in the output synthetic data, rather than ‘hovering' inappropriately, which can happen when synthetic data is generated at scale, or with highly automated methods.

Since dataset innovation is central to the novelty of the paper, we will proceed earlier than usual to this section of the coverage.

Data and Tests

SynMirrorV2

The researchers' SynMirrorV2 dataset was conceived to improve the diversity and realism of mirror reflection training data, featuring 3D objects sourced from the Objaverse and Amazon Berkeley Objects (ABO) datasets, with these selections subsequently refined through OBJECT 3DIT, as well as the filtering process from the V1 MirrorFusion project, to eliminate low-quality assets. This resulted in a refined pool of 66,062 objects.

Examples from the Objaverse dataset, used in the creation of the curated dataset for the new system. Source: https://arxiv.org/pdf/2212.08051

Scene construction involved placing these objects onto textured floors from CC-Textures and HDRI backgrounds from the PolyHaven CGI repository, using either full-wall or tall rectangular mirrors. Lighting was standardized with an area-light positioned above and behind the objects, at a forty-five degree angle. Objects were scaled to fit within a unit cube and positioned using a precomputed intersection of the mirror and camera viewing frustums, ensuring visibility.

Randomized rotations were applied around the y-axis, and a grounding technique used to prevent ‘floating artifacts'.

To simulate more complex scenes, the dataset also incorporated multiple objects arranged according to semantically coherent pairings based on ABO categories. Secondary objects were placed to avoid overlap, creating 3,140 multi-object scenes designed to capture varied occlusions and depth relationships.

Examples of rendered views from the authors' dataset containing multiple (more than two) objects, with illustrations of object segmentation and depth map visualizations seen below.
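Putting these placement rules together – unit-cube normalization, random y-axis rotation, grounding to the floor plane, and non-overlapping paired placement – a rough, hypothetical Python sketch of the kind of logic involved might look like the following. These are not the authors' released scripts, and all function and variable names are illustrative:

```python
# Hypothetical sketch of unit-cube scaling, y-axis rotation, grounding, and
# non-overlapping paired placement for synthetic mirror scenes.
import random
import numpy as np

def normalize_to_unit_cube(vertices):
    """Scale and centre an Nx3 vertex array so it fits inside a unit cube."""
    vmin, vmax = vertices.min(axis=0), vertices.max(axis=0)
    scale = 1.0 / (vmax - vmin).max()
    return (vertices - (vmin + vmax) / 2.0) * scale

def rotate_y(vertices, angle):
    """Rotate vertices about the vertical (y) axis."""
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return vertices @ rot.T

def place_object(vertices, floor_y=0.0, region=(-1.0, 1.0)):
    """Randomly rotate and position an object, then ground it on the floor plane."""
    verts = rotate_y(normalize_to_unit_cube(vertices), random.uniform(0.0, 2.0 * np.pi))
    verts = verts + np.array([random.uniform(*region), 0.0, random.uniform(*region)])
    verts[:, 1] -= verts[:, 1].min() - floor_y   # grounding: no floating objects
    return verts

def place_pair(primary, secondary, min_gap=0.1, tries=50):
    """Place a semantically paired secondary object without overlapping the primary."""
    placed_primary = place_object(primary)
    p_min = placed_primary[:, [0, 2]].min(0)
    p_max = placed_primary[:, [0, 2]].max(0)
    for _ in range(tries):
        candidate = place_object(secondary)
        c_min, c_max = candidate[:, [0, 2]].min(0), candidate[:, [0, 2]].max(0)
        # Crude separation test on the horizontal (x, z) bounding boxes.
        if np.any(c_min > p_max + min_gap) or np.any(c_max < p_min - min_gap):
            return placed_primary, candidate
    return placed_primary, None                  # no non-overlapping placement found

# Example: two random point clouds standing in for paired assets (e.g. chair + table).
chair, table = np.random.rand(100, 3), np.random.rand(200, 3)
chair_verts, table_verts = place_pair(chair, table)
```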

Training Process

Acknowledging that synthetic realism alone was insufficient for robust generalization to real-world data, the researchers developed a three-stage curriculum learning process for training MirrorFusion 2.0.

In Stage 1, the authors initialized the weights of both the conditioning and generation branches with the Stable Diffusion v1.5 checkpoint, and fine-tuned the model on the single-object training split of the SynMirrorV2 dataset. Unlike the above-mentioned Reflecting Reality project, the researchers did not freeze the generation branch. They then trained the model for 40,000 iterations.

In Stage 2, the model was fine-tuned for a further 10,000 iterations on the multiple-object training split of SynMirrorV2, in order to teach the system to handle occlusions, and the more complex spatial arrangements found in realistic scenes.

Finally, in Stage 3, a further 10,000 iterations of fine-tuning were conducted using real-world data from the MSD dataset, using depth maps generated by the Matterport3D monocular depth estimator.

Examples from the MSD dataset, with real-world scenes analyzed into depth and segmentation maps. Source: https://arxiv.org/pdf/1908.09101

During training, text prompts were omitted for 20 percent of the training time in order to encourage the model to make optimal use of the available depth information (i.e., a ‘masked' approach).

Training took place on four NVIDIA A100 GPUs for all stages (the VRAM spec is not supplied, though it would have been 40GB or 80GB per card). A learning rate of 1e-5 was used with a batch size of 4 per GPU, under the AdamW optimizer.

This training strategy progressively increased the difficulty of tasks presented to the model, beginning with simpler synthetic scenes and advancing toward more challenging compositions, with the intention of developing robust real-world transferability.
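For illustration only, the three-stage schedule and prompt dropout might be organized along the following lines; the model, data and loss here are toy stand-ins rather than the authors' released training code, which fine-tunes the Stable Diffusion v1.5 branches described above:

```python
# Toy skeleton of the three-stage curriculum with 20% prompt dropout; the model,
# batches and loss are placeholders, not the MirrorFusion 2.0 implementation.
import random
import torch
from torch import nn
from torch.optim import AdamW

model = nn.Linear(32, 32)                               # placeholder for the real network
optimizer = AdamW(model.parameters(), lr=1e-5)          # learning rate from the paper

def get_batch(stage_name, batch_size=4):                # placeholder loader, 4 per GPU
    return {"latents": torch.randn(batch_size, 32),
            "depth": torch.randn(batch_size, 32),
            "prompt": [f"a mirror scene ({stage_name})"] * batch_size}

# Curriculum stages with the iteration counts reported in the paper.
STAGES = [("synmirrorv2_single_object", 40_000),        # Stage 1
          ("synmirrorv2_multi_object", 10_000),         # Stage 2
          ("msd_real_world", 10_000)]                   # Stage 3

for stage_name, num_iterations in STAGES:
    for step in range(num_iterations):
        batch = get_batch(stage_name)
        prompts = batch["prompt"]
        if random.random() < 0.2:                       # omit text prompts ~20% of the time,
            prompts = [""] * len(prompts)               # encouraging use of depth conditioning
        # Toy objective standing in for the diffusion denoising loss, which in the
        # real model would be conditioned on `prompts` and the depth map.
        loss = model(batch["latents"] + batch["depth"]).pow(2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```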

Testing

The authors evaluated MirrorFusion 2.0 against the previous state-of-the-art, MirrorFusion, which served as the baseline, and conducted experiments on the MirrorBenchV2 dataset, covering both single and multi-object scenes.

Additional qualitative tests were conducted on samples from the MSD dataset, and the Google Scanned Objects (GSO) dataset.

The evaluation used 2,991 single-object images from seen and unseen categories, and 300 two-object scenes from ABO. Performance was measured using Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS) scores, to assess reflection quality on the masked mirror region. CLIP similarity was used to evaluate textual alignment with the input prompts.
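As an indication of how such masked-region scoring can be implemented with common libraries (scikit-image for PSNR and SSIM, the lpips package for LPIPS), a brief sketch follows; this is not the authors' own evaluation code, and the helper names are ours:

```python
# Sketch of masked-mirror-region scoring with common libraries; not the authors'
# evaluation code.
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_model = lpips.LPIPS(net="alex")                    # perceptual distance network

def to_lpips_tensor(image):
    """HxWx3 uint8 -> 1x3xHxW float tensor in [-1, 1], as LPIPS expects."""
    t = torch.from_numpy(image).permute(2, 0, 1).unsqueeze(0).float()
    return t / 127.5 - 1.0

def masked_mirror_metrics(generated, reference, mirror_mask):
    """Score only the mirror region: images are HxWx3 uint8, mask is boolean HxW."""
    mask3 = np.repeat(mirror_mask[..., None], 3, axis=-1)
    gen = np.where(mask3, generated, 0).astype(np.uint8)
    ref = np.where(mask3, reference, 0).astype(np.uint8)

    psnr = peak_signal_noise_ratio(ref, gen, data_range=255)
    ssim = structural_similarity(ref, gen, channel_axis=-1, data_range=255)
    with torch.no_grad():
        lp = lpips_model(to_lpips_tensor(gen), to_lpips_tensor(ref)).item()
    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lp}
```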

In quantitative tests, the authors generated images using four seeds for a specific prompt, and selected the resulting image with the best SSIM score. The two reported tables of results for the quantitative tests are shown below.

Left, quantitative results for single object reflection generation quality on the MirrorBenchV2 single object split. MirrorFusion 2.0 outperformed the baseline, with the best results shown in bold. Right, quantitative results for multiple object reflection generation quality on the MirrorBenchV2 multiple object split. MirrorFusion 2.0 trained with multiple objects outperformed the version trained without them, with the best results shown in bold.

The authors comment:

‘[The results] show that our method outperforms the baseline method and finetuning on multiple objects improves the results on complex scenes.'

The bulk of results, and those emphasized by the authors, regard qualitative testing. Due to the dimensions of these illustrations, we can only partially reproduce the paper's examples.

Comparison on MirrorBenchV2: the baseline failed to maintain accurate reflections and spatial consistency, showing incorrect chair orientation and distorted reflections of multiple objects, whereas (the authors contend) MirrorFusion 2.0 correctly renders the chair and the sofas, with accurate position, orientation, and structure.

Of these subjective results, the researchers opine that the baseline model failed to accurately render object orientation and spatial relationships in reflections, often producing artifacts such as incorrect rotation and floating objects. MirrorFusion 2.0, trained on SynMirrorV2, the authors contend, preserves correct object orientation and positioning in both single-object and multi-object scenes, resulting in more realistic and coherent reflections.

Below we see qualitative results on the aforementioned GSO dataset:

Comparison on the GSO dataset. The baseline misrepresents object structure and produces incomplete, distorted reflections, while MirrorFusion 2.0, the authors contend, preserves spatial integrity and generates accurate geometry, color, and detail, even on out-of-distribution objects.

Here the authors comment:

‘MirrorFusion 2.0 generates significantly more accurate and realistic reflections. For instance, in Fig. 5 (a – above), MirrorFusion 2.0 correctly reflects the drawer handles (highlighted in green), while the baseline model produces an implausible reflection (highlighted in red).

‘Likewise, for the “White-Yellow mug” in Fig. 5 (b), MirrorFusion 2.0 delivers a convincing geometry with minimal artifacts, unlike the baseline, which fails to accurately capture the object's geometry and appearance.'

The final qualitative test was against the aforementioned real-world MSD dataset (partial results shown below):

Real-world scene results comparing MirrorFusion, MirrorFusion 2.0, and MirrorFusion 2.0 fine-tuned on the MSD dataset. MirrorFusion 2.0, the authors contend, captures complex scene details more accurately, including cluttered objects on a table, and the presence of multiple mirrors within a three-dimensional environment. Only partial results are shown here, due to the dimensions of the results in the original paper, to which we refer the reader for full results and better resolution.

Here the authors observe that while MirrorFusion 2.0 performed well on MirrorBenchV2 and GSO data, it initially struggled with complex real-world scenes in the MSD dataset. Fine-tuning the model on a subset of MSD improved its ability to handle cluttered environments and multiple mirrors, resulting in more coherent and detailed reflections on the held-out test split.

Additionally, a user study was conducted, where 84% of users are reported to have preferred generations from MirrorFusion 2.0 over the baseline method.

Results of the user study.

Since details of the user study have been relegated to the appendix of the paper, we refer the reader to that for the specifics of the study.

Conclusion

Although several of the results shown in the paper are impressive improvements on the state-of-the-art, the state-of-the-art for this particular pursuit is so abysmal that even an unconvincing aggregate solution can win out with a modicum of effort. The fundamental architecture of a diffusion model is so inimical to the reliable learning and demonstration of consistent physics that the problem itself is genuinely posed, and not apparently disposed toward an elegant solution.

Further, adding data to existing models is already the standard method of remedying shortfalls in LDM performance, with all the disadvantages listed earlier. It is reasonable to presume that if future high-scale datasets were to pay more attention to the distribution (and annotation) of reflection-related data points, we could expect that the resulting models would handle this scenario better.

Yet the same is true of multiple other bugbears in LDM output – who can say which of them most deserves the effort and money involved in the kind of solution that the authors of the new paper propose here?

First published Monday, April 28, 2025
