Hi, thanks for sharing this great work!
I understand the main pipeline, i.e., encoding the speech content features, identity features, and pose features separately, and then feeding them to the generator to produce the driven results. However, I am a little confused after reading the inference code.
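To make my understanding concrete, here is roughly how I picture the inference flow (a sketch with placeholder names, not the actual functions or module names in the repo):

```python
import torch

# A rough sketch of the pipeline as I understand it (placeholder names,
# not the actual modules/attributes used in av_model.py).
def drive(mel_spectrogram, id_frame, pose_frames,
          audio_encoder, id_encoder, pose_encoder, generator):
    content_feat = audio_encoder(mel_spectrogram)  # speech-content feature from audio
    id_feat = id_encoder(id_frame)                 # identity feature from the reference frame
    pose_feat = pose_encoder(pose_frames)          # pose feature from the driving frames
    # Content and pose features are fused into one driving feature, which,
    # together with the identity feature, conditions the generator.
    driving_feat = torch.cat([content_feat, pose_feat], dim=1)
    return generator(driving_feat, id_feat)
```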
Talking-Face_PC-AVS/models/av_model.py, lines 473 to 484 at commit 23585e2:
In that snippet, the mel-spectrogram is first encoded by the audio encoder in Line 473, and the result is ready to be fused with the pose feature in Line 483. However, in the merge_mouthpose() function:

Talking-Face_PC-AVS/models/av_model.py, lines 454 to 461 at commit 23585e2:
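My rough reading of that function is something like the following (a sketch with an assumed layer type and feature dimensions, not the code from the repo; only the netE.mouth_embed name is taken from it):

```python
import torch
import torch.nn as nn

class NetESketch(nn.Module):
    """Placeholder for netE; only mouth_embed mirrors the attribute I am asking about."""
    def __init__(self, audio_dim=512, mouth_dim=512, pose_dim=12):
        super().__init__()
        # Assumed to be a learned projection; the exact layer in the repo may differ.
        self.mouth_embed = nn.Linear(audio_dim, mouth_dim)

    def merge_mouthpose(self, mouth_feat, pose_feat):
        # The audio-derived feature is embedded again here before fusion;
        # this is the step whose intuition I am asking about.
        mouth_feat = self.mouth_embed(mouth_feat)
        # Fused driving feature that is later passed to the generator.
        return torch.cat([mouth_feat, pose_feat], dim=1)
```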
I found that the audio features are further embedded here; what is the intuition behind that? In my view, netE.mouth_embed would be used to embed the mouth features for the video, but NOT for the audio. If anything is wrong, please correct me. Thanks in advance.