Why not use VQVAE like Stable Diffusion? #40
Thank you for bringing this interesting work. I'm curious: have you tried a VQVAE?
I tried it out and found that the VQVAE downsampling rate is 8, which means a sequence of length T becomes a latent feature of length T/8 after passing through the encoder. However, I encountered some issues:

[evaluation results omitted]
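For context, here is a minimal sketch of what a downsampling-rate-8 motion encoder looks like: three stride-2 1-D convolutions map a length-T sequence to T/8 latent frames. The layer widths and the 263-dim input (the HumanML3D feature size) are assumptions for illustration, not the actual T2M-GPT/MotionGPT architecture.

```python
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Toy 1-D conv encoder with overall downsampling rate 8 (2 * 2 * 2)."""
    def __init__(self, in_dim=263, hidden=512):  # 263 = HumanML3D feature dim
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=4, stride=2, padding=1),  # T   -> T/2
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=4, stride=2, padding=1),  # T/2 -> T/4
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=4, stride=2, padding=1),  # T/4 -> T/8
        )

    def forward(self, x):                # x: (B, T, in_dim), per-frame motion features
        z = self.net(x.transpose(1, 2))  # Conv1d expects (B, C, T)
        return z.transpose(1, 2)         # (B, T/8, hidden)

x = torch.randn(2, 64, 263)             # a batch of 64-frame motion clips
print(MotionEncoder()(x).shape)         # torch.Size([2, 8, 512]) -> 64/8 = 8
```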
Interesting problem. We have already implemented such a VQVAE; please refer to our MotionGPT project. In our experiments, MotionGPT achieves 0.067 FID on the motion reconstruction task, which is quite different from your evaluations.
I guess it's because I use a different backbone than T2M-GPT :(
I hope I'm not disturbing you, but is the dataloader pipeline exactly the same as T2M-GPT's? And when will the code of MotionGPT be released?
Of course, we will release MotionGPT just like this motion-latent-diffusion project; it could take a week to a month to set everything up. You are right that the VQVAE part is quite similar to T2M-GPT's.
Have you tried VQVAE + diffusion? My VQVAE performance is fine, but the VQVAE + diffusion result is very poor. Why could that be? :(
Diffusion models are originally designed for continuous data, like RGB values in images, while a VQVAE outputs discrete codebook indices. I guess you need some "discrete" diffusion model to support the idea of "VQVAE + diffusion".
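To make the "discrete diffusion" suggestion concrete, here is a toy sketch of a forward corruption step over codebook indices in the spirit of multinomial/D3PM-style diffusion: instead of adding Gaussian noise to continuous latents, each token is resampled uniformly from the codebook with some probability. The noise schedule and function names are illustrative assumptions, not code from any of the projects discussed.

```python
import torch

def discrete_forward_step(tokens, t, T, codebook_size):
    """With probability beta_t, resample each token uniformly over the codebook."""
    beta_t = (t + 1) / T                              # toy linear noise schedule
    resample = torch.rand(tokens.shape) < beta_t      # which positions to corrupt
    random_tokens = torch.randint(codebook_size, tokens.shape)
    return torch.where(resample, random_tokens, tokens)

tokens = torch.randint(512, (2, 8))                   # (batch, T/8) code indices
noisy = discrete_forward_step(tokens, t=5, T=10, codebook_size=512)
print(noisy)
```

The reverse process would then be trained to predict the original tokens from the corrupted ones, rather than to regress a continuous denoised latent.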
I am still a beginner in the field of motion generation; thank you very much for your answer.
I am very curious: what is the difference between an image and a motion sequence? Images are continuous in 2-D space, and motion sequences are continuous in the temporal dimension, so why does stable diffusion (VQVAE + diffusion) work well on images?

In fact, I found that the latent embedding produced by diffusion suffers from index collapse when it passes through the quantization layer. I think it may be because the distribution produced by the VQVAE encoder is too hard for diffusion to learn. By 'distribution' I don't mean that the VQVAE yields a continuous representation; rather, all the discrete representations taken together form a batch of data whose distribution the diffusion model has to learn.

I did not use a GAN discriminator or a perceptual loss the way VQGAN does; if I had, I guess the VQVAE encoder might have produced a distribution that is easier for diffusion to learn. Also, SD is trained on a large-scale dataset, so the distribution obtained by its VQVAE encoder is smoother, and diffusion learns it more easily. But this is all just my opinion as a beginner; I would like to ask what you think.
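One way to quantify the index collapse described above (a hypothetical diagnostic, not part of this repo) is to quantize the diffusion model's sampled latents against the frozen VQVAE codebook and measure what fraction of the codes is actually used:

```python
# Toy codebook-usage check: assign each continuous latent to its nearest
# codebook entry and count the distinct codes in use. All shapes and names
# here are illustrative assumptions.
import torch

def codebook_usage(latents, codebook):
    """latents: (N, D) continuous diffusion outputs; codebook: (K, D) VQ embeddings."""
    dists = torch.cdist(latents, codebook)    # (N, K) pairwise L2 distances
    indices = dists.argmin(dim=1)             # nearest-neighbor quantization
    used = indices.unique().numel()
    return used / codebook.shape[0]           # fraction of codes in use

codebook = torch.randn(512, 256)              # stand-in for a trained codebook
diffusion_latents = torch.randn(4096, 256)    # stand-in for sampled latents
print(f"codebook usage: {codebook_usage(diffusion_latents, codebook):.1%}")
```

With random stand-in tensors the usage looks healthy; on real sampled latents, a value near zero (a few codes absorbing almost all assignments) would be the collapse symptom described above.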
Hi @daidaiershidi, have you figured out why VQVAE + diffusion works badly on motion? Recently I have also been working on this scenario (i.e., training a VQVAE and then using it to train a latent diffusion model). Do you have any ideas about this? @ChenFengYe @daidaiershidi
Hi @MingCongSu, I recently ran into the same dilemma (frozen motions). Do you have any insights on this?