Chapter 10: changes to the transformers API used in the translation task #27

Open
PikaChyou opened this issue Sep 13, 2024 · 1 comment

@PikaChyou

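The context manager as_target_tokenizer(), which Chapter 10 (the translation task) uses to switch the tokenizer's default encoding settings, is deprecated and scheduled for removal.

By default the tokenizer encodes text using the source-language settings; to encode the target language, you need the context manager as_target_tokenizer()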

zh_sentence = train_data[0]["chinese"]
en_sentence = train_data[0]["english"]

inputs = tokenizer(zh_sentence)
with tokenizer.as_target_tokenizer():
    targets = tokenizer(en_sentence)

With the current version of transformers, as_target_tokenizer() still runs correctly, but it emits a warning that the API will be removed in the next major version:

UserWarning: `as_target_tokenizer` is deprecated and will be removed in v5 of Transformers. You can tokenize your labels by using the argument `text_target` of the regular `__call__` method (either in the same call as your input texts if you use the same keyword arguments, or in a separate call.
  warnings.warn(

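The approach Hugging Face officially recommends is to encode with the text_target parameter instead; the official documentation describes it in detail.

I therefore suggest changing the wording in the original text to:

By default the tokenizer encodes text using the source-language settings; to encode the target language, you need to pass the text_target parameter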

zh_sentence = train_data[0]["chinese"]
en_sentence = train_data[0]["english"]

inputs = tokenizer(zh_sentence)
targets = tokenizer(text_target=en_sentence)
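
The warning above also mentions that the labels can be tokenized in the same call as the input texts. A minimal sketch of that form follows; the helper name `preprocess`, the dict-of-lists batch layout, and `max_length=128` are assumptions made for illustration, not taken from the tutorial.

```python
def preprocess(batch, tokenizer, max_length=128):
    """Encode source and target sentences in a single tokenizer call.

    `batch` is assumed to be a dict of lists with "chinese" and "english"
    keys (the field names used in the snippet above). `max_length=128` is
    an illustrative value only.
    """
    # Passing text_target alongside the inputs asks the tokenizer to encode
    # the targets with the target-language settings and return them under
    # the "labels" key of the same encoding.
    return tokenizer(
        batch["chinese"],
        text_target=batch["english"],
        max_length=max_length,
        truncation=True,
    )
```

With a translation tokenizer from transformers, the returned encoding should then hold the source ids in `input_ids` and the target ids in `labels`, so the separate `targets` call is not needed.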
@jsksxs360
Owner

Thank you very much! The tutorial and code have been updated.
