Consistent Flow Distillation for Text-to-3D Generation

Abstract

Score Distillation Sampling (SDS) has made significant strides in distilling image-generative models for 3D generation. However, its maximum-likelihood-seeking behavior often leads to degraded visual quality and diversity, limiting its effectiveness in 3D applications. In this work, we propose Consistent Flow Distillation (CFD), which addresses these limitations. We begin by leveraging the gradient of the diffusion ODE or SDE sampling process to guide the 3D generation. From the gradient-based sampling perspective, we find that the consistency of 2D image flows across different viewpoints is important for high-quality 3D generation. To achieve this, we introduce multi-view consistent Gaussian noise on the 3D object, which can be rendered from various viewpoints to compute the flow gradient. Our experiments demonstrate that CFD, through consistent flows, significantly outperforms previous methods in text-to-3D generation.

Method Overview

Overview of CFD update

Computing the multiview consistent Gaussian noise

Overview of multiview consistent Gaussian Noise \(\tilde{\boldsymbol{\epsilon}}(\theta,\boldsymbol{c})\) computation

"Heavily modified car with oversized wheels, reinforced steel plating, snarling front grille, spikes, and chains"

"Towering white and black rocket launching with smoke and flames, detailed paneling, and NASA logo"

Visualization of multiview consistent Gaussian Noise \(\tilde{\boldsymbol{\epsilon}}(\theta,\boldsymbol{c})\).

Pause the video to see the noise in a fixed view: the noise at different pixels are i.i.d. Gaussian Noise.

Clean Flow ODE/SDE

noisy variable \(\boldsymbol{x}_t\)

ground-truth variable \(\hat{\boldsymbol{x}}^{\text{gt}}_t\)

clean variable \(\hat{\boldsymbol{x}}^{\text{c}}_t\)

Visualization of different variables in a DDPM (Diffusion SDE) image generation process

For a forward diffusion process \(p_t(\boldsymbol{x}_t|\boldsymbol{x}_0)=\mathcal{N}(\alpha_t\boldsymbol{x}_0, \sigma_t^2\boldsymbol{I}),\;\boldsymbol{x}_0\sim p_0(\boldsymbol{x}_0)\), a diffusion PF-ODE [1] yields the same marginal distribution as the forward diffusion process at any time \(t\). The PF-ODE defined on the noisy variable \(\boldsymbol{x}_t\) that starts from pure Gaussian Noise \(\boldsymbol{x}_T \sim \mathcal{N}(\boldsymbol{0},\sigma_T^2 \boldsymbol{I})\) takes the form of \[ \mathrm{d} \left(\frac{\boldsymbol{x}_t}{\alpha_t}\right) = \textcolor[rgb]{0.7, 0.36, 0.14}{\underbrace{{\mathrm{d} \left(\frac{\sigma_t}{\alpha_t}\right)}}_{-lr}} \cdot \textcolor[rgb]{0.42, 0.28, 0.7}{\underbrace{{\boldsymbol{\epsilon}_\phi(\boldsymbol{x}_t, t, y)}}_{\nabla \mathcal{L}}}, \] where \(\boldsymbol{\epsilon}_\phi(\boldsymbol{x}_t, t, y) \approx - \sigma_t \nabla_{\boldsymbol{x}_t} \log p_t(\boldsymbol{x}_t|y)\) is a pre-trained \(\epsilon\)-prediction Diffusion Model. Following FSD [2], we can change the variable of the PF-ODE to (i) the ground-truth variable \(\hat{\boldsymbol{x}}^{\text{gt}}_t \triangleq \frac{\boldsymbol{x}_t-\sigma_t \boldsymbol{\epsilon}_\phi(\boldsymbol{x}_t, t, y)}{\alpha_t}\), which is also known as the sample-prediction of the Diffusion Model, or (ii) the clean variable \(\hat{\boldsymbol{x}}^{\text{c}}_t \triangleq \frac{\boldsymbol{x}_t-\sigma_t \tilde{\boldsymbol{\epsilon}}}{\alpha_t}\), where \(\tilde{\boldsymbol{\epsilon}}\) is a constant for each PF-ODE trajectory and equals the initial noise \(\frac{\boldsymbol{x}_T}{\sigma_T}\). Notably, under such definition, (i) \(\hat{\boldsymbol{x}}^{\text{c}}_t\) are visually non-noisy images for all \(t \in [0,T]\), therefore, it can be substituted with the rendered clean images \(\boldsymbol{g}_\theta(\boldsymbol{c})\). (ii) \(\hat{\boldsymbol{x}}^{\text{c}}_t\) achieves zero initialization \(\hat{\boldsymbol{x}}^{\text{c}}_T=\boldsymbol{0}\), which is consistent with the typical NeRF initialization. And (iii) the end point of the new ODE trajectory \(\hat{\boldsymbol{x}}^{\text{c}}_0 = \boldsymbol{x}_0\) is a sample from the target distribution \(p_0(\boldsymbol{x}_0)\) and is completely determined by the constant \(\tilde{\boldsymbol{\epsilon}}\) (thus \(\tilde{\boldsymbol{\epsilon}}\) can be viewed as the identity of the trajectory). The reformulated PF-ODE on the clean variable \(\hat{\boldsymbol{x}}^{\text{c}}_t\) is \[ \mathrm{d} \hat{\boldsymbol{x}}^{\text{c}}_t = \textcolor[rgb]{0.7, 0.36, 0.14}{\underbrace{{\mathrm{d} \left(\frac{\sigma_t}{\alpha_t}\right)}}_{-lr}} \cdot \textcolor[rgb]{0.42, 0.28, 0.7}{\underbrace{\left(\boldsymbol{\epsilon}_\phi(\alpha_t \hat{\boldsymbol{x}}^{\text{c}}_t + \sigma_t \tilde{\boldsymbol{\epsilon}}, t, y)-\tilde{\boldsymbol{\epsilon}}\right)}_{\nabla \mathcal{L}}}, \] which we also refer to as the clean flow. We further showed that the SDE presented by Song et al. [1] also has a similar form as the above equation, which implies the change-of-variable is potentially general (Refer to paper for details). In order to use Diffusion PF-ODE as a guidance for 3D generation, we can directly replace the clean variable \(\hat{\boldsymbol{x}}^{\text{c}}_t\) with the rendered image \(\boldsymbol{g}_\theta(\boldsymbol{c})\) at camera view \(\boldsymbol{c}\) from a 3D representation \(\theta\) by \(\hat{\boldsymbol{x}}^{\text{c}}_t = \boldsymbol{g}_\theta(\boldsymbol{c})\). (It's also reasonable to replace the ground-truth variable \(\hat{\boldsymbol{x}}^{\text{gt}}_t\), but this faces difficulties as discussed in Appendix of the paper.) We use the rendered images from the 3D representation \(\theta\) at optimization time \(\tau\) to simulate the clean variable \(\hat{\boldsymbol{x}}^{\text{c}}_t\) at diffusion timestep \(t(\tau)\), where \(t(\tau)\) is a predefined monotonically decreasing timestep annealing function. By mimicking \(\mathrm{d} (\sigma_t/\alpha_t)\) with the learning rate of an optimizer and seeing \(\boldsymbol{\epsilon}_\phi(\alpha_t \hat{\boldsymbol{x}}^{\text{c}}_t + \sigma_t \tilde{\boldsymbol{\epsilon}}, t, y)-\tilde{\boldsymbol{\epsilon}}\) as the gradient of the loss, we can update the 3D representation \(\theta\) by propogating the gradient of the loss through Jacobian matrix of the rendering function \(\boldsymbol{g}_\theta\): \[ \nabla_\theta \mathcal{L}_{\text{CFD}}(\theta) = \mathbb{E}_{\boldsymbol{c}} \left[\textcolor[rgb]{0.42, 0.28, 0.7}{\big(\boldsymbol{\epsilon}_\phi(\alpha_t \boldsymbol{g}_\theta(\boldsymbol{c}) + \sigma_t \tilde{\boldsymbol{\epsilon}}, t(\tau), y)-\tilde{\boldsymbol{\epsilon}}\big)} \frac{\partial \boldsymbol{g}_\theta(\boldsymbol{c})}{\partial \theta} \right]. \] In this formula, \(\tilde{\boldsymbol{\epsilon}} = \tilde{\boldsymbol{\epsilon}}(\theta,\boldsymbol{c})\) is a determined view and object surface dependent noise function. In the 2D image clean flow, a fixed noise \(\tilde{\boldsymbol{\epsilon}}\) is added to the clean variable \(\hat{\boldsymbol{x}}^{\text{c}}_t\) during the reverse diffusion process. When generalizing clean flow to 3D, we also follow the same idea and compute a multi-view consistent Gaussian noise \(\tilde{\boldsymbol{\epsilon}}(\theta,\boldsymbol{c})\), so that similar local noise patterns are added to the same local regions of rendered images even at different camera views.

By using the clean flow, we disentangle the noise from the noisy variable \(\boldsymbol{x}_t\). We believe this is essential for using image PF-ODE/diffusion SDE as 3D guidance. Clean images and Gaussian noise follow different transport equations. While clean images can be transported using a common 3D representation, Gaussian noise needs be transported using the Noise Transport Equation [3], which is incompatible with common 3D representations.

noisy variable \(\boldsymbol{x}_t\)

clean variable \(\hat{\boldsymbol{x}}^{\text{c}}_t\)

Visualization of different variable spaces in DDIM image generation processes with random prompts and at random timesteps

Diverse Results

"A treasure chest full of gold coins and jewels, high resolution, sharp"

"A steampunk owl with mechanical wings"

"An astronaut riding a horse"

"A llama in a tuxedo at a fancy gala"

"A cowboy raccoon with a lasso"

"A 3D model of mecha vampire girl chibi"

Text-to-3D Gallery

"A samurai panda with a bamboo sword"

"A painter hedgehog with a palette"

"A knight fox in shining armor"

"A wizard frog with a spellbook"

"A 3D model of a DSLR camera, photography, box modeling, Maya"

"A manga magical girl with magic wand"

"A 3D model of a medieval house with grass, vines, stone, wood, and medieval decor"

"A 3D anime-style dragon girl with shimmering scales, horns, and a confident expression"

Note: NeRF samples in the Gallery sections are generated using the 2 stage pipline. Stage-1: CFD distilling MVDream (a text-to-multiview image diffusion model); Stage-2: CFD distilling Stable Diffusion. NeRF samples in the Comparison section are generated by only distilling Stable Diffusion.

Consistent Flow Distillation
for Text-to-3D Generation

We can generate high-quality 3D models by distilling pre-trained image diffusion models with the multi-view consistent Gaussian noise. With the consistent noise, the update direction of the 3D model is more consistent with the diffusion PF-ODE, thus the generation results are of higher-quality.

Abstract

Method Overview

Clean Flow ODE/SDE

Diverse Results

Text-to-3D Gallery

References

BiBTex

Consistent Flow Distillationfor Text-to-3D Generation

We can generate high-quality 3D models by distilling pre-trained image diffusion models with the multi-view consistent Gaussian noise. With the consistent noise, the update direction of the 3D model is more consistent with the diffusion PF-ODE, thus the generation results are of higher-quality.

Abstract

Method Overview

Clean Flow ODE/SDE

Diverse Results

Text-to-3D Gallery

References

BiBTex

Consistent Flow Distillation
for Text-to-3D Generation