MVEdit: Generic 3D Diffusion Adapter Using Controlled Multi-View Editing

¹Stanford University ²UCSD ³Apparate Labs

Abstract

Open-domain 3D object synthesis has been lagging behind image synthesis due to limited data and higher computational complexity. To bridge this gap, recent works have investigated multi-view diffusion but often fall short in 3D consistency, visual quality, or efficiency. This paper proposes MVEdit, which functions as a 3D counterpart of SDEdit, employing ancestral sampling to jointly denoise multi-view images and output high-quality textured meshes. Built on off-the-shelf 2D diffusion models, MVEdit achieves 3D consistency through a training-free 3D Adapter, which lifts the 2D views of the last timestep into a coherent 3D representation, then conditions the 2D views of the next timestep using rendered views, without compromising visual quality. With an inference time of only 2-5 minutes, this framework achieves a better trade-off between quality and speed than score distillation. MVEdit is highly versatile and extendable, with a wide range of applications including text/image-to-3D generation, 3D-to-3D editing, and high-quality texture synthesis. In particular, evaluations demonstrate state-of-the-art performance in both image-to-3D and text-guided texture generation tasks. Additionally, we introduce a method for fine-tuning 2D latent diffusion models on small 3D datasets with limited resources, enabling fast low-resolution text-to-3D initialization.
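The control flow described in the abstract (denoise the views, lift them into a shared representation, render it back, and use the renders to steer the next step) can be illustrated with a toy sketch. This is not the authors' implementation: every function name below is hypothetical, the "2D diffusion model" is a closed-form denoiser for unit-Gaussian data, "lifting" is a plain average over views, and conditioning is reduced to using the render directly as the denoised prediction within the same step. The ancestral update is the standard Euler ancestral step, assumed here for illustration.

```python
import numpy as np

def toy_denoise(noisy_views, sigma):
    # Hypothetical stand-in for an off-the-shelf 2D diffusion model.
    # For unit-Gaussian "images" with x = x0 + sigma * eps, this is the
    # exact posterior mean of the clean view.
    return noisy_views / (1.0 + sigma ** 2)

def toy_lift(denoised_views):
    # Stand-in for lifting views into a shared representation.
    # A real pipeline would fit a 3D representation; averaging is a toy proxy.
    return denoised_views.mean(axis=0)

def toy_render(shared, num_views):
    # Stand-in for rendering the shared representation into every view.
    return np.repeat(shared[None, ...], num_views, axis=0)

def sample_with_toy_adapter(init_views, sigmas, seed=0):
    """Jointly denoise all views with Euler ancestral steps (toy version)."""
    rng = np.random.default_rng(seed)
    num_views = init_views.shape[0]
    # SDEdit-style start: perturb the input views to the first noise level.
    views = init_views + sigmas[0] * rng.standard_normal(init_views.shape)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        denoised = toy_denoise(views, sigma)
        rendered = toy_render(toy_lift(denoised), num_views)
        # Toy conditioning (assumption): use the consistent render directly
        # as the denoised prediction for this step.
        conditioned = rendered
        # Euler ancestral step: deterministic move plus fresh noise.
        sigma_up = np.sqrt(sigma_next**2 * (sigma**2 - sigma_next**2) / sigma**2)
        sigma_down = np.sqrt(sigma_next**2 - sigma_up**2)
        direction = (views - conditioned) / sigma
        views = views + direction * (sigma_down - sigma)
        views = views + sigma_up * rng.standard_normal(views.shape)
    return views

# Four 8x8 "views"; the noise schedule must end at 0 and be decreasing.
views_in = np.random.default_rng(1).standard_normal((4, 8, 8))
views_out = sample_with_toy_adapter(views_in, [1.0, 0.5, 0.2, 0.0], seed=0)
print(views_out.shape)
```

Because the last step lands exactly on the rendered views, all output views agree in this toy, which mirrors the role the rendered views play in enforcing consistency. Changing `seed` changes the result.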

Gallery

Zero123++ v1.1 Image-to-3D

Text-Guided Re-Texturing: An NBA basketball player

Text-Guided Re-Texturing: A robot

Text-Guided Re-Texturing: A yellow vintage car

StableSSDNeRF Text-to-3D + Text-Guided Re-Texturing: A futuristic racing car

StableSSDNeRF Text-to-3D: A Ferrari 458 GT3 racing car

StableSSDNeRF Text-to-3D: A LEGO Volkswagen Beetle

StableSSDNeRF Text-to-3D: Game ready 3D model of a Porsche 911 police car, police light bar, the word 'police' on the side of the car

Text-Guided 3D-to-3D + Text-Guided Re-Texturing: Tomb Raider Lara Croft, wearing a backpack

Instruct 3D-to-3D + Text-Guided Re-Texturing: As a Zelda cosplay, blue outfit

Instruct 3D-to-3D: Make it a marble roman sculpture

Zero123++ v1.1 Image-to-3D

Zero123++ v1.1 Image-to-3D

Zero123++ v1.1 Image-to-3D

Zero123++ v1.1 Image-to-3D

BibTeX

@misc{mvedit2024,
  title={Generic 3D Diffusion Adapter Using Controlled Multi-View Editing},
  author={Hansheng Chen and Ruoxi Shi and Yulin Liu and Bokui Shen and Jiayuan Gu and Gordon Wetzstein and Hao Su and Leonidas Guibas},
  year={2024},
  eprint={2403.12032},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

TLDR: MVEdit is a training-free 3D Adapter that enables 3D generation/editing using off-the-shelf 2D Stable Diffusion models. And nope, it's not SDS.

Examples showcasing MVEdit’s generality across various 3D tasks, with associated inference times (on an RTX A6000) and the number of timesteps. For image-to-3D, note that the initial views by Zero123++ are not strictly 3D consistent, an issue remedied by MVEdit.

The initialization and ancestral sampling process of MVEdit. The original single-image SDEdit is shown in blue, the additional 3D Adapter in red, and extra conditioning in orange. For brevity, only the first view is depicted, and VAE encoding/decoding is omitted in cases involving latent diffusion.

Controllable Extent of Editing

Text-guided 3D-to-3D using the same seed but different starting timesteps 𝑡^start.
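As a rough illustration of why the starting timestep controls the extent of editing, the sketch below perturbs an input in the SDEdit manner and reports how much of the original signal survives. It assumes a standard DDPM variance-preserving forward process with a linear beta schedule over 1000 steps; the schedule, the function names, and the chosen timesteps are illustrative assumptions, not values taken from MVEdit.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_min=1e-4, beta_max=0.02):
    # Cumulative signal-retention factor of a linear-beta DDPM schedule
    # (assumed schedule, for illustration only).
    betas = np.linspace(beta_min, beta_max, num_steps)
    return np.cumprod(1.0 - betas)

def perturb_to_t_start(x0, t_start, alpha_bar, seed=0):
    # SDEdit-style initialization: mix the input with Gaussian noise
    # according to the noise level at t_start.
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(x0.shape)
    a = alpha_bar[t_start]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise

alpha_bar = make_alpha_bar()
x0 = np.ones((4, 4))  # placeholder "image"
for t_start in (200, 500, 800):
    x_t = perturb_to_t_start(x0, t_start, alpha_bar, seed=0)
    signal_weight = float(np.sqrt(alpha_bar[t_start]))
    print(f"t_start={t_start}: weight on the input = {signal_weight:.3f}")
```

With the seed held fixed, the injected noise pattern is identical across runs, so only the mixing weights change: a larger starting timestep keeps less of the input and leaves more room for the sampler to change it.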

Diverse Samples

Unlike SDS approaches that exhibit a mode-seeking behavior, MVEdit can generate variations from the exact same input using different random seeds.

Comparison on Image-to-3D

Comparison on image-to-3D generation. SyncDreamer and DreamCraft3D are not evaluated on the 248 objects due to slow inference.

Comparison on Texture Generation

Comparison on text-guided texture generation. *Our ablation study without skip connections resembles the method of TexFusion.
