
Question about Attention #13

Open
cuishuhao opened this issue May 12, 2023 · 12 comments

@cuishuhao

Thank you for sharing this work!
I wonder how the code achieves the transfer of attention from the source to the target. Is it done in https://github.com/TencentARC/MasaCtrl/blob/main/masactrl/masactrl.py#L35C25-L43?

ljzycmd commented May 12, 2023

Hi @cuishuhao, these lines perform the attention process. Specifically, as shown in https://github.com/TencentARC/MasaCtrl/blob/f4476b0adeb6d111a532aca5111457fc5b6e9f88/masactrl/masactrl.py#LL58C1-L59C138, the target image features serve as Q, while K and V are obtained from the source image features, so the target queries contents from the source image.
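
For intuition, the step described above can be sketched roughly as follows (a minimal illustration, not the repository's actual implementation; the function name and shapes are assumptions):

```python
import torch

def mutual_self_attention(q_target, k_source, v_source):
    # Shapes (assumed): (num_heads, seq_len, head_dim).
    scale = q_target.shape[-1] ** -0.5
    # Target queries attend to the *source* keys.
    sim = torch.einsum("hid,hjd->hij", q_target, k_source) * scale
    attn = sim.softmax(dim=-1)
    # Each target query aggregates contents from the source image.
    return torch.einsum("hij,hjd->hid", attn, v_source)
```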

@LWprogramming

Should K and V be ku and vu in the second line instead of kc and vc, then, since we want to use the K and V from the source?

ljzycmd commented Aug 16, 2023

Hi @LWprogramming, u and c denote the unconditional and conditional parts, respectively; with classifier-free guidance, both parts contribute to the synthesized result, so we query the source image in both. Note that ku and vu contain both source and target features: ku[:num_heads] and vu[:num_heads] are the features from the source image.

Hope this can help you. 😃

@LWprogramming

Oh, you're right, I misread that :) I hadn't properly understood the connection between the __call__ and forward methods before. Looking at it more closely, it looks like the notebook calls regiter_attention_editor_diffusers, which calls editor.__call__. But the relevant logic for saving K and V from the source seems to be in AttentionStore, while MutualSelfAttentionControl inherits from AttentionBase instead of AttentionStore. How does it eventually connect?

ljzycmd commented Aug 17, 2023

Hi @LWprogramming, register_attention_editor_diffusers replaces the original attention forward process with our modified one. Note that qu, ku, vu contain the query, key, and value of both the source and target images. Take qu as an example: qu[:num_heads] is the query of the source image, and qu[num_heads:] is the query of the target image. The same applies to ku, vu and to the conditional part qc, kc, vc.
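
A small layout sketch of what is described above (the shapes and the chunk-based split are illustrative assumptions, not the repository's exact code):

```python
import torch

num_heads, seq_len, head_dim = 8, 1024, 40

# With classifier-free guidance, the batch stacks the unconditional and
# conditional parts, and each part stacks [source | target] along the head axis.
q = torch.randn(4 * num_heads, seq_len, head_dim)

qu, qc = q.chunk(2)        # unconditional / conditional halves, each 2 * num_heads
q_src = qu[:num_heads]     # query of the source image (unconditional part)
q_tgt = qu[num_heads:]     # query of the target image (unconditional part)
# The same slicing applies to ku, vu and to the conditional qc, kc, vc.
```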

@LWprogramming

Ah, I see now: in the originally linked code I'd overlooked that q, k, and v for both u and c have 2 * num_heads instead of num_heads along that dimension. Thanks!

kingnobro commented Sep 14, 2023

In this link, I do not understand why the entire qu is passed. What is the intuitive explanation for passing the entire qu, i.e., using both the source and target images?
My understanding is that both the source image and the target image need to go through attention, which is why qu is used rather than qu[num_heads:]. In the attention block the two images do not interfere with each other, and in the end we only output the target image. Does that mean we could use just qu[num_heads:] as well?

ljzycmd commented Sep 14, 2023

Hi @kingnobro, I am sorry for the confusion. Note that qu[:num_heads] is the query feature of the source image and qu[num_heads:] is the query of the target image, while only the source key and value features serve as K and V in the attention process. Therefore, the source image can be reconstructed or synthesized, and the target image can query image contents from the source image. Since the two denoising processes are performed simultaneously in the current implementation, we cannot use only qu[num_heads:] to generate the target image.
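
Put differently, a hedged, self-contained sketch of that idea (illustrative names and shapes, not the repo's attn_batch implementation): the full qu attends to the source K/V in a single call, so the first half reconstructs the source and the second half produces the edited target.

```python
import torch

num_heads, seq_len, head_dim = 8, 1024, 40

# Illustrative tensors: each stacks [source | target] along the head axis.
qu = torch.randn(2 * num_heads, seq_len, head_dim)
ku = torch.randn(2 * num_heads, seq_len, head_dim)
vu = torch.randn(2 * num_heads, seq_len, head_dim)

# Both halves of qu attend to the *source* keys/values in one call
# (the source K/V are repeated so the head dimensions line up).
k_src = ku[:num_heads].repeat(2, 1, 1)
v_src = vu[:num_heads].repeat(2, 1, 1)
sim = torch.einsum("hid,hjd->hij", qu, k_src) * head_dim ** -0.5
out_u = torch.einsum("hij,hjd->hid", sim.softmax(dim=-1), v_src)

out_src = out_u[:num_heads]   # source attends to itself -> reconstruction
out_tgt = out_u[num_heads:]   # target attends to the source -> edited image
```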

@FerryHuang

Maybe it works fine without chunking into the u and c parts, too? I checked, and it gives the same values as the current algorithm.

ljzycmd commented Sep 27, 2023

Hi @FerryHuang, I'd like to further validate the results without chunking the unconditional and conditional parts during the denoising process, and the results will be updated here. In our previous experiments, performing the mutual self-attention on the two parts independently achieved better results than doing it jointly.

@TimelessXZY

Sorry, I don't really understand why the two denoising processes are performed simultaneously. In the implementation,

noise_pred = self.unet(model_inputs, t, encoder_hidden_states=text_embeddings).sample

it seems that only one denoising process is performed.


ljzycmd commented Feb 2, 2024

Hi @TimelessXZY, model_inputs consists of the noisy latents for both the source branch and the target branch, so a single UNet call denoises both. The real editing is performed inside each hacked attention class (e.g., MutualSelfAttentionControl).
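
For reference, a hedged sketch of such a batched step (the function and variable names are assumptions following the usual diffusers classifier-free-guidance pattern, not the repository's exact code):

```python
import torch

def denoising_step(unet, source_latents, target_latents, t, text_embeddings, guidance_scale=7.5):
    # Stack the source and target branches into one batch; the editing itself
    # happens inside the hacked attention layers, not in this step.
    latents = torch.cat([source_latents, target_latents], dim=0)
    # Duplicate for classifier-free guidance; text_embeddings is assumed to
    # already stack the unconditional and conditional prompts for both branches.
    model_inputs = torch.cat([latents] * 2, dim=0)
    noise_pred = unet(model_inputs, t, encoder_hidden_states=text_embeddings).sample
    noise_pred_uncond, noise_pred_cond = noise_pred.chunk(2)
    return noise_pred_uncond + guidance_scale * (noise_pred_cond - noise_pred_uncond)
```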
