Hello,

Thank you so much for your great work and codebase!
I would appreciate your clarifications on a few items.
From within TextToVideoSDPipelineCall.py, at this line, the attention maps from the temporal layers (the first set below) seem to be empty. I checked this with approximately the following code block:
for name, module in self.unet.named_modules():
    module_name = type(module).__name__
    if module_name == "Attention" and "attn2" in name:
        # --- First set: temporal cross-attention layers
        if "temp_attentions" in name:
            print(name)  # replace .0 with [0]
            extracted_attention_map = module.processor.cross_attention_map
            if extracted_attention_map is not None:
                print(extracted_attention_map.shape)
        else:
            # --- Second set: the remaining cross-attention layers
            ...
In the second set, only the .attentions layers and the transformer_in layer have cross-attention maps. If one assumes that the second set holds the spatial attention maps, this does not align with the modules listed in the supplemental document (page 1, screenshot included). In particular, transformer_in.transformer_blocks[0].attn2 has shape (64, 64, 24, 24), suggesting it is temporal (not spatial, as stated in the supplemental) with 24 frames, while mid_block.attentions[0].transformer_blocks[0].attn2 has shape (480, 8, 8, 77), suggesting it is the spatial attention map (not temporal) with 77 text tokens.
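For reference, here is a minimal sketch of the shape-based check I am describing. It reuses the cross_attention_map attribute from the code block above and simply classifies each map by its last dimension (77 text tokens suggests a spatial cross-attention map, a last dimension equal to the frame count suggests a temporal one). The NUM_FRAMES and NUM_TOKENS constants are assumptions for this particular run, not values taken from the codebase:

# Assumption: each attn2 processor exposes `cross_attention_map` as in the block above.
NUM_FRAMES = 24  # frames per video in this run (assumption)
NUM_TOKENS = 77  # CLIP text sequence length

for name, module in self.unet.named_modules():
    if type(module).__name__ == "Attention" and "attn2" in name:
        attn_map = getattr(module.processor, "cross_attention_map", None)
        if attn_map is None:
            print(f"{name}: no cross-attention map recorded")
            continue
        last_dim = attn_map.shape[-1]
        if last_dim == NUM_TOKENS:
            kind = "spatial (attends over text tokens)"
        elif last_dim == NUM_FRAMES:
            kind = "temporal (attends over frames)"
        else:
            kind = "unknown"
        print(f"{name}: {tuple(attn_map.shape)} -> {kind}")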
Your kind clarification would be very helpful. Thanks