Generic 3D Diffusion Adapter Using Controlled Multi-View Editing

1Stanford University   2UCSD   3Apparate Labs

TLDR: MVEdit is a training-free 3D Adapter that enables 3D generation/editing
using off-the-shelf 2D Stable Diffusion models. And nope, it's not SDS.


Open-domain 3D object synthesis has been lagging behind image synthesis due to limited data and higher computational complexity. To bridge this gap, recent works have investigated multi-view diffusion but often fall short in either 3D consistency, visual quality, or efficiency. This paper proposes MVEdit, which functions as a 3D counterpart of SDEdit, employing ancestral sampling to jointly denoise multi-view images and output high-quality textured meshes. Built on off-the-shelf 2D diffusion models, MVEdit achieves 3D consistency through a training-free 3D Adapter, which lifts the 2D views of the last timestep into a coherent 3D representation, then conditions the 2D views of the next timestep using rendered views, without compromising visual quality. With an inference time of only 2-5 minutes, this framework achieves a better trade-off between quality and speed than score distillation. MVEdit is highly versatile and extendable, with a wide range of applications including text/image-to-3D generation, 3D-to-3D editing, and high-quality texture synthesis. In particular, evaluations demonstrate state-of-the-art performance in both image-to-3D and text-guided texture generation tasks. Additionally, we introduce a method for fine-tuning 2D latent diffusion models on small 3D datasets with limited resources, enabling fast low-resolution text-to-3D initialization.
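The sampling loop described above can be sketched schematically. The helper functions below are toy stand-ins invented for illustration (the actual pipeline uses Stable Diffusion UNets, NeRF/mesh optimization, and differentiable rendering); the sketch only shows how the training-free 3D Adapter threads a shared 3D representation through ancestral sampling so that the per-view denoising steps stay mutually consistent.

```python
# Schematic sketch of the MVEdit sampling loop; NOT the authors' code.
# Scalars stand in for images, and a single shared scalar stands in for
# the 3D representation (NeRF/mesh) that the 3D Adapter maintains.
import random

def denoise_step(views, cond_views=None):
    """Toy 2D denoising step: nudge each view toward its 3D-rendered condition."""
    if cond_views is None:
        return [v * 0.9 for v in views]  # unconditioned step at the first timestep
    return [0.5 * v + 0.5 * c for v, c in zip(views, cond_views)]

def lift_to_3d(views):
    """Toy 3D lifting: fuse all views into one shared representation (here a mean)."""
    return sum(views) / len(views)

def render_views(rep3d, n_views):
    """Toy rendering: every camera observes the shared 3D representation."""
    return [rep3d] * n_views

def mvedit_sample(n_views=4, t_start=10, seed=0):
    random.seed(seed)
    # Ancestral sampling starts from noisy multi-view images.
    views = [random.gauss(0.0, 1.0) for _ in range(n_views)]
    rep3d = None
    for _ in range(t_start):
        # Condition this timestep's 2D views on renders of the last timestep's 3D lift.
        cond = render_views(rep3d, n_views) if rep3d is not None else None
        views = denoise_step(views, cond)
        rep3d = lift_to_3d(views)  # training-free 3D Adapter
    return views, rep3d
```

Because each step pulls every view toward the shared rendered condition, the views converge toward 3D consistency, while different seeds still yield different samples.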


Controllable Extent of Editing


Text-guided 3D-to-3D editing using the same seed but different starting timesteps t_start.
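The role of t_start follows SDEdit: the input is noised up to timestep t_start and then denoised for the remaining steps, so a larger t_start means a stronger edit and a smaller t_start preserves more of the input. A one-dimensional toy sketch (all names and numbers are illustrative, not from the paper):

```python
# Toy illustration of how t_start controls the extent of editing.
import random

def edit(x0, t_start, n_steps=10, target=0.0, seed=0):
    """Noise the input up to t_start, then denoise toward a 'prompt' target.

    Larger t_start -> more injected noise and more denoising steps -> a
    stronger edit; t_start = 0 returns the input unchanged.
    """
    random.seed(seed)
    sigma = t_start / n_steps          # noise level grows with t_start
    x = x0 + random.gauss(0.0, sigma)  # forward diffusion to timestep t_start
    for _ in range(t_start):           # denoise only the remaining steps
        x = 0.8 * x + 0.2 * target     # each step pulls toward the prompt
    return x
```

In this toy model, t_start = 0 reproduces the input exactly, while large t_start values converge toward the prompt target regardless of the input, mirroring the controllable editing extent shown above.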

Diverse Samples


Unlike SDS approaches that exhibit a mode-seeking behavior, MVEdit can generate variations from the exact same input using different random seeds.

Comparison on Image-to-3D


Comparison on image-to-3D generation. SyncDreamer and DreamCraft3D are not evaluated on the 248 objects due to slow inference.

Comparison on Texture Generation


Comparison on text-guided texture generation. *Our ablation study without skip connections resembles the method of TexFusion.



Citation

@article{chen2024mvedit,
    title={Generic 3D Diffusion Adapter Using Controlled Multi-View Editing},
    author={Hansheng Chen and Ruoxi Shi and Yulin Liu and Bokui Shen and Jiayuan Gu and Gordon Wetzstein and Hao Su and Leonidas Guibas},
}