Feature/support qwenvl glm4-v phi3-v(conflict resolving) #4377

Closed
wants to merge 41 commits into from

marko1616
Contributor

@marko1616 marko1616 commented Jun 19, 2024

What does this PR do?

Fixes #4375

Before submitting

@hiyouga hiyouga added the pending This problem is yet to be addressed label Jun 19, 2024
@marko1616
Contributor Author

Finally, only the image padding handling is left before training support is complete.

@marko1616
Contributor Author

@hiyouga I changed quite a lot. When you have time, could you check whether this implementation approach is workable? Thanks.

@marko1616
Contributor Author

Training ran successfully.

@marko1616 marko1616 changed the title Feature/support qwenvl glm4-v *WORKING DO NOT MERGE* Feature/support qwenvl glm4-v (tested) Jun 23, 2024
@BUAADreamer BUAADreamer self-requested a review June 28, 2024 17:05
@marko1616 marko1616 changed the title Feature/support qwenvl glm4-v (tested) Feature/support qwenvl glm4-v phi3-v(tested) Jun 30, 2024
@BUAADreamer
Collaborator

BUAADreamer commented Jul 1, 2024

Please don't introduce more models for now. Polish the three existing models and keep the diff under control 🤗 @marko1616

@marko1616
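Contributor Author

Please don't introduce more models for now. Polish the three existing models and keep the diff under control 🤗 @marko1616

Okay, I am indeed testing the other training modes right now.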

@marko1616
Contributor Author

marko1616 commented Jul 11, 2024 via email

@marko1616 marko1616 changed the title Feature/support qwenvl glm4-v phi3-v(tested) Feature/support qwenvl glm4-v phi3-v(conflict resolving) Jul 18, 2024
@marko1616 marko1616 requested a review from zjysteven July 19, 2024 20:40
@BUAADreamer BUAADreamer removed the request for review from zjysteven July 21, 2024 02:26
@chocoded

During SFT training, the eval step fails:
File ".../TuningFactory/src/llmtuner/train/sft/workflow.py", line 98, in run_sft
metrics = trainer.evaluate(metric_key_prefix="eval", **gen_kwargs)

...

File ".../glm-4v-9b/modeling_chatglm.py", line 1035, in forward
full_attention_mask = self.get_masks(inputs_embeds, past_key_values, padding_mask=attention_mask)

File ".../glm-4v-9b/modeling_chatglm.py", line 835, in get_masks
full_attention_mask = full_attention_mask * padding_mask.unsqueeze(1)
RuntimeError: The size of tensor a (1623) must match the size of tensor b (24) at non-singleton dimension 2

input_ids:
[151331, 151333, 151336, 198, 151339, 151329, 151340, 98598, 98992, 100555, 101052, 101939, 11314, 151337, 198, 111000, 127102, 98993, 114571, 98362, 104343, 1773, 151329]

inputs:
[gMASK] <|user|>
<|begin_of_image|> <|endoftext|> <|end_of_image|> 图中的狗是什么品种? <|assistant|>
图中是一只拉布拉多犬。 <|endoftext|>

Possible cause:

...
data_collator = DataCollatorForSeq2Seq(
        tokenizer=tokenizer,
        pad_to_multiple_of=8 if tokenizer.padding_side == "right" else None,  # for shift short attention
        label_pad_token_id=IGNORE_INDEX if data_args.ignore_pad_token_for_loss else tokenizer.pad_token_id,
)
...

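Line 59 of sft/workflow.py sets pad_to_multiple_of=8, so if the attention_mask length is not a multiple of 8, it is padded with zeros at the end. Line 1035 of modeling_chatglm.py then has to compute full_attention_mask. During eval, attention_mask is not recomputed in the preceding steps, so the lengths of inputs_embeds and attention_mask no longer match, which raises the error.

Possible fix:
For the glm4v model, set pad_to_multiple_of=1, i.e. no padding. This is also what I saw ms-swift do.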
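
As a rough illustration of the padding arithmetic described above, here is a minimal sketch in plain Python. The helper names `padded_length` and `choose_pad_multiple` and the `is_glm4v` flag are hypothetical and not part of this PR or of any library; the sketch only mirrors the `pad_to_multiple_of` expression from the quoted collator setup and the fix proposed in this comment.

```python
from typing import Optional


def padded_length(seq_len: int, pad_to_multiple_of: Optional[int]) -> int:
    """Return the sequence length after padding up to the next multiple."""
    if pad_to_multiple_of is None or pad_to_multiple_of <= 1:
        return seq_len  # no padding is added
    # Ceiling division using integers only.
    return -(-seq_len // pad_to_multiple_of) * pad_to_multiple_of


def choose_pad_multiple(padding_side: str, is_glm4v: bool) -> Optional[int]:
    """Hypothetical helper: pick pad_to_multiple_of for the data collator."""
    if is_glm4v:
        return 1  # the fix proposed above: effectively no padding
    # Same expression as in the quoted collator setup.
    return 8 if padding_side == "right" else None


# The failing sample above has 23 input ids.
print(padded_length(23, choose_pad_multiple("right", is_glm4v=False)))  # 24
print(padded_length(23, choose_pad_multiple("right", is_glm4v=True)))   # 23
```

With a multiple of 8, the 23-token sample is padded to 24, which matches the "tensor b (24)" in the error message. Whether removing the padding alone is enough for every batch is not verified here.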

@marko1616
Contributor Author

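Thank you very much for pointing out this problem and for the detailed analysis. I will review the corresponding implementation more carefully and push a working commit as soon as possible.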

@chocoded

chocoded commented Aug 1, 2024

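Addendum:
The glm4v model also fails in the train step in the unsupervised case. The cause is that the encode process in _encode_unsupervised_example does not guarantee that input_ids and labels have the same length, so the loss computation at line 1216 of modeling_chatglm.py raises an error.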
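
The length invariant mentioned above can be sketched as follows. This is not the fix from this PR; `build_aligned_example` is a hypothetical helper, and the value -100 for `IGNORE_INDEX` is an assumption (it is the usual PyTorch `ignore_index` default). The sketch only shows one way to build input_ids and labels so that they always have the same length.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # assumed value; the usual PyTorch ignore_index default


def build_aligned_example(
    prompt_ids: List[int], response_ids: List[int]
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: build input_ids and labels of equal length."""
    input_ids = prompt_ids + response_ids
    # Mask the prompt positions so only the response contributes to the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    assert len(input_ids) == len(labels), "input_ids and labels must be aligned"
    return input_ids, labels


input_ids, labels = build_aligned_example([1, 2, 3], [4, 5])
print(len(input_ids), len(labels))  # 5 5
```

Because both lists are derived from the same concatenation, they cannot drift apart in length before the loss is computed.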

@marko1616
Contributor Author

marko1616 commented Aug 1, 2024

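Addendum: The glm4v model also fails in the train step in the unsupervised case. The cause is that the encode process in _encode_unsupervised_example does not guarantee that input_ids and labels have the same length, so the loss computation at line 1216 of modeling_chatglm.py raises an error.

Would you mind sharing a chat contact so we can discuss this? (Thank you very much for running these tests. I will make the fix today.)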

@chocoded

chocoded commented Aug 1, 2024

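Would you mind sharing a chat contact so we can discuss this? (Thank you very much for running these tests. I will make the fix today.)

Sure, my QQ is 227154737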

@marko1616
Contributor Author

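Would you mind sharing a chat contact so we can discuss this? (Thank you very much for running these tests. I will make the fix today.)

Sure, my QQ is 227154737

OK, I've added you.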

@marko1616 marko1616 marked this pull request as draft September 3, 2024 09:53
@marko1616 marko1616 closed this Dec 31, 2024
@marko1616 marko1616 deleted the feature/Support-Qwenvl branch December 31, 2024 12:25
@hiyouga hiyouga added wontfix This will not be worked on and removed pending This problem is yet to be addressed labels Dec 31, 2024
Labels
wontfix This will not be worked on
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature request] Support Qwen-VL
5 participants