Fewer parameters and inference steps, more aesthetically pleasing results
Text-to-image uses AI to understand your words and convert them to a unique image each time.
Stable Diffusion: a milestone work on latent space. It is a framework that trains diffusion models in latent space, and is a scaled-up version of the Latent Diffusion Model (LDM).
Two-stage distillation model:
Stage 1: Guidance weight distillation. The guidance weight w is passed to the UNet as an extra input, so that a single forward pass replaces the separate positive-prompt and negative-prompt passes of Classifier-Free Guidance (CFG).
Stage 2: Progressive distillation. A pipeline similar to that of the Progressive distillation work is employed to decrease the number of inference steps.
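To make the motivation for stage 1 concrete, here is a minimal sketch of the usual CFG combination that the w-conditioned student is trained to reproduce in one pass. It assumes the common convention `uncond + w * (cond - uncond)`; the function name `cfg_combine` and the toy values are illustrative only and do not come from the original work.

```python
from typing import List, Sequence


def cfg_combine(noise_uncond: Sequence[float],
                noise_cond: Sequence[float],
                w: float) -> List[float]:
    """Combine two UNet outputs with classifier-free guidance.

    noise_uncond: prediction for the negative (or empty) prompt.
    noise_cond:   prediction for the positive prompt.
    w:            guidance weight (assumed convention: w = 1 means no extra guidance).
    """
    return [u + w * (c - u) for u, c in zip(noise_uncond, noise_cond)]


# Toy values standing in for two separate UNet forward passes.
uncond = [0.0, 1.0, -2.0]
cond = [1.0, 1.0, 0.0]
guided = cfg_combine(uncond, cond, 7.5)
```

Standard CFG needs both forward passes at every sampling step. A student that receives w as an input and directly outputs the combined prediction needs only one, which is where the saving in stage 1 comes from.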
Stage 1: In direct distillation, the teacher is distilled only once (16 → 8 steps). The CFG-aware loss is balanced against the vanilla step-distillation loss to reduce the number of steps.
-
An extra multi-stage distillation pipeline leads to accumulated error: 512 → 256 → 128 → 64 → 32 → 16 → 8.
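The step chain above can be reproduced with a short sketch that halves the step count until the target is reached. The helper name `halving_schedule` is hypothetical; it only counts the student-from-teacher rounds and does not model the error itself.

```python
from typing import List


def halving_schedule(start_steps: int, target_steps: int) -> List[int]:
    """Return the sequence of step counts when each round halves the steps."""
    if target_steps < 1 or start_steps < target_steps:
        raise ValueError("need start_steps >= target_steps >= 1")
    schedule = [start_steps]
    while schedule[-1] > target_steps:
        if schedule[-1] % 2 != 0:
            raise ValueError("step count is not divisible by 2")
        schedule.append(schedule[-1] // 2)
    if schedule[-1] != target_steps:
        raise ValueError("target is not reachable by repeated halving")
    return schedule


progressive = halving_schedule(512, 8)   # multi-stage pipeline
direct = halving_schedule(16, 8)         # single distillation round
progressive_rounds = len(progressive) - 1
direct_rounds = len(direct) - 1
```

Under this reading, the multi-stage pipeline trains six students in sequence, each learning from the previous one, whereas the direct 16 → 8 distillation involves a single round.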
-
Poor scalability: both the w-conditioned model and SnapFusion assume a UNet whose output is in v-prediction form (Stable Diffusion v2.0), so they are not suitable for Stable Diffusion v1.x.
The prediction type of Stable Diffusion is one of 'epsilon', 'sample', or 'v_prediction'. More information can be found in Progressive distillation.
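As a rough illustration of how the three prediction types relate, the sketch below converts any of them into both the clean sample and the noise. It assumes the variance-preserving parameterization from Progressive distillation, where x_t = alpha * x0 + sigma * eps with alpha² + sigma² = 1 and v = alpha * eps - sigma * x0. The function name `to_sample_and_epsilon` is hypothetical, and scalars stand in for tensors.

```python
import math
from typing import Tuple


def to_sample_and_epsilon(model_output: float, x_t: float,
                          alpha: float, sigma: float,
                          prediction_type: str) -> Tuple[float, float]:
    """Recover (x0, eps) from a model output of the given prediction type."""
    if prediction_type == "epsilon":
        eps = model_output
        x0 = (x_t - sigma * eps) / alpha
    elif prediction_type == "sample":
        x0 = model_output
        eps = (x_t - alpha * x0) / sigma
    elif prediction_type == "v_prediction":
        x0 = alpha * x_t - sigma * model_output
        eps = sigma * x_t + alpha * model_output
    else:
        raise ValueError(f"unknown prediction type: {prediction_type}")
    return x0, eps


# Toy check: build x_t and v from a known (x0, eps) pair.
alpha, sigma = 0.6, 0.8            # satisfies alpha**2 + sigma**2 == 1
x0_true, eps_true = 2.0, -1.0
x_t = alpha * x0_true + sigma * eps_true
v_true = alpha * eps_true - sigma * x0_true

from_v = to_sample_and_epsilon(v_true, x_t, alpha, sigma, "v_prediction")
from_eps = to_sample_and_epsilon(eps_true, x_t, alpha, sigma, "epsilon")
from_x0 = to_sample_and_epsilon(x0_true, x_t, alpha, sigma, "sample")
```

All three branches recover the same (x0, eps) pair, so the prediction types carry the same information. A distillation loss written for v-prediction outputs nevertheless has to be adapted before it can be applied to a model that predicts epsilon.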
- A high-quality, large-scale image-text dataset is constructed from LAION-art, MSCOCO, DiffusionDB, and HPSv2. For image quality, the HPSv2 score is used to evaluate human preference; more details can be found in HPSv2. For prompt quality, MiniGPT and BLIP are used to clean the prompts, following the process of recastai/LAION-art-EN-improved-captions.
Images of size 512 × 512 are generated from the publicly released model with the DPM-Solver++ sampler, using 8, 16, and 20 sampling steps and a guidance scale of 7.5. For the MSCOCO 2017 caption dataset, the first prompt of each image is selected for zero-shot inference.
For the HPSv2 dataset, we run inference on the test benchmark, which includes four types of prompts: "Animation", "Concept-art", "Painting", and "Photo".
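The caption selection described above could look like the following sketch. It assumes annotation entries with `image_id` and `caption` fields, as in the MSCOCO caption files, and it takes "first" to mean first in file order, which is an assumption. The helper name `first_caption_per_image` and the sample records are made up for illustration.

```python
from typing import Dict, Iterable, Mapping


def first_caption_per_image(annotations: Iterable[Mapping]) -> Dict[int, str]:
    """Keep only the first caption seen for each image id."""
    first: Dict[int, str] = {}
    for ann in annotations:
        # setdefault leaves an existing entry untouched, so later captions are ignored.
        first.setdefault(ann["image_id"], ann["caption"])
    return first


# Made-up records in the assumed annotation layout.
sample_annotations = [
    {"image_id": 1, "caption": "A dog running on a beach."},
    {"image_id": 2, "caption": "Two cups on a wooden table."},
    {"image_id": 1, "caption": "A brown dog near the sea."},
]
prompts = first_caption_per_image(sample_annotations)
```

Each selected prompt would then be rendered once per sampling-step setting (8, 16, and 20) at a guidance scale of 7.5.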
https://snap-research.github.io/SnapFusion/