Hi, thanks for sharing this great work!
I understand the main pipeline, i.e., encoding the speech content features, identity features, and pose features separately, and then feeding them to the generator to produce the driven results. However, I am a little confused after reading the inference code.
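To make my understanding concrete, here is roughly how I picture the inference flow (a sketch with placeholder names, not the actual functions or module names in the repo):

```python
import torch

# A rough sketch of the pipeline as I understand it (placeholder names,
# not the actual modules/attributes used in av_model.py).
def drive(mel_spectrogram, id_frame, pose_frames,
          audio_encoder, id_encoder, pose_encoder, generator):
    content_feat = audio_encoder(mel_spectrogram)  # speech-content feature from audio
    id_feat = id_encoder(id_frame)                 # identity feature from the reference frame
    pose_feat = pose_encoder(pose_frames)          # pose feature from the driving frames
    # Content and pose features are fused into one driving feature, which,
    # together with the identity feature, conditions the generator.
    driving_feat = torch.cat([content_feat, pose_feat], dim=1)
    return generator(driving_feat, id_feat)
```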
Talking-Face_PC-AVS/models/av_model.py, lines 473 to 484 at commit 23585e2:
In that snippet, the mel-spectrogram is first encoded by the audio encoder in Line 473, and the result is ready to be fused with the pose feature in Line 483. However, in the merge_mouthpose() function:

Talking-Face_PC-AVS/models/av_model.py, lines 454 to 461 at commit 23585e2:
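My rough reading of that function is something like the following (a sketch with an assumed layer type and feature dimensions, not the code from the repo; only the netE.mouth_embed name is taken from it):

```python
import torch
import torch.nn as nn

class NetESketch(nn.Module):
    """Placeholder for netE; only mouth_embed mirrors the attribute I am asking about."""
    def __init__(self, audio_dim=512, mouth_dim=512, pose_dim=12):
        super().__init__()
        # Assumed to be a learned projection; the exact layer in the repo may differ.
        self.mouth_embed = nn.Linear(audio_dim, mouth_dim)

    def merge_mouthpose(self, mouth_feat, pose_feat):
        # The audio-derived feature is embedded again here before fusion;
        # this is the step whose intuition I am asking about.
        mouth_feat = self.mouth_embed(mouth_feat)
        # Fused driving feature that is later passed to the generator.
        return torch.cat([mouth_feat, pose_feat], dim=1)
```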
I found that the audio features are further embedded here; what is the intuition behind that? In my view, netE.mouth_embed would be used to embed the mouth features for the video, but NOT for the audio. If anything is wrong, please correct me. Thanks in advance.