About training with cross-face #33

Open
sunjian2015 opened this issue Dec 25, 2024 · 5 comments

@sunjian2015

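I see that the paper describes this training method: “use cross-face (e.g., reference images are sourced from video frames outside the training frames) as inputs with probability β”. If the reference image comes from outside the training frames, it is no longer the same ID, so how is the loss computed?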

@SHYuanBest
Member

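Thanks for your interest. Our data pipeline uses YOLO + SAM2 to assign a unique ID label to the same person in every frame of a video. The cross-face loss only selects reference images that carry the same ID label but come from outside the training frames.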
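The selection rule described in this reply can be sketched roughly as follows. This is only an illustration, not the ConsisID data pipeline: the function name `pick_reference_frame` and its arguments are hypothetical, and it assumes the per-frame ID labels from YOLO + SAM2 have already been reduced to a list of frame indices in which the tracked person appears.

```python
import random


def pick_reference_frame(frames_with_id, training_frames, beta, rng=random):
    """Return the index of the frame to take the reference face from.

    frames_with_id:  indices of all frames where the tracked person ID appears
                     (hypothetical input, assumed to come from the ID labels)
    training_frames: indices of the frames used as the training clip
    beta:            probability of using a cross-face reference
    """
    train_set = set(training_frames)
    inside = [f for f in frames_with_id if f in train_set]
    outside = [f for f in frames_with_id if f not in train_set]
    if not inside:
        raise ValueError("the tracked ID does not appear in the training frames")
    # With probability beta, use a frame of the SAME ID from outside the
    # training frames; fall back to the training frames if none exists.
    if outside and rng.random() < beta:
        return rng.choice(outside)
    return rng.choice(inside)


# Usage: the person is visible in frames 0..99, the training clip is 20..68.
rng = random.Random(0)
ref = pick_reference_frame(list(range(100)), list(range(20, 69)), beta=0.5, rng=rng)
```

Because both candidate lists are filtered by the same ID, the reference face always belongs to the person in the training clip, so the loss is still computed against the same identity.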

@sunjian2015
Author

Understood, thanks for the reply. One more thing: in the code, the loss is computed like this:

model_pred = scheduler.get_velocity(model_output, noisy_video_latents, timesteps)
...
target = video_latents
...
loss = (weights * (model_pred - target) ** 2).reshape(batch_size, -1)

I'm a bit confused here. Shouldn't scheduler.get_velocity compute target_v from video_latents and noise? Isn't model_pred here v?

@SHYuanBest
Member

You can refer to THUDM/CogVideo#403.

  1. v = αϵ − σx0, per the original get_velocity function.
  2. ConsisID passes the model-predicted v and the current input xt to get_velocity, so tmp_out = αxt − σv.
  3. xt = αx0 + σϵ, per the DDPM noise-addition function.
  4. tmp_out = αxt − σv = α(αx0 + σϵ) − σ(αϵ − σx0) = α²x0 + ασϵ − ασϵ + σ²x0 = (α² + σ²)x0 = x0, since α² + σ² = 1.
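The four steps above can be checked numerically with a small sketch. It uses plain Python lists rather than tensors, and the local `get_velocity` below is a stand-in that only assumes the formula from step 1 (v = α · noise − σ · sample); it is not the scheduler's own code. The value of `alpha_bar` is an arbitrary choice for illustration.

```python
import math
import random


def get_velocity(sample, noise, alpha, sigma):
    # Formula from step 1: alpha * noise - sigma * sample, element-wise.
    return [alpha * n - sigma * s for s, n in zip(sample, noise)]


rng = random.Random(0)
alpha_bar = 0.3  # arbitrary cumulative alpha for one timestep (assumption)
alpha, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)

x0 = [rng.gauss(0, 1) for _ in range(8)]   # clean latents
eps = [rng.gauss(0, 1) for _ in range(8)]  # noise
xt = [alpha * a + sigma * e for a, e in zip(x0, eps)]  # step 3

# Ideal v-prediction a perfectly trained model would output (step 1).
v = get_velocity(x0, eps, alpha, sigma)

# Same argument order as in the training code: the prediction goes in the
# "sample" slot and the noisy latents in the "noise" slot, which yields
# alpha * xt - sigma * v (step 2).
x0_rec = get_velocity(v, xt, alpha, sigma)

max_err = max(abs(a - b) for a, b in zip(x0, x0_rec))
print(max_err)  # difference between x0 and the recovered value
```

Under these assumptions the recovered value matches x0 up to floating-point error, which is why the training code can use video_latents as the target.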

@sunjian2015
Author

Could the figure in the paper be drawn incorrectly? Shouldn't CLIP and FaceExtractor swap positions?
[image]

@SHYuanBest
Member

I checked, and I don't think it is wrong.
