
tokenizer should be replaced to processing_class in Seq2SeqTrainer? #35446

Open
zzaebok opened this issue Dec 29, 2024 · 1 comment
Labels
bug, Core: Tokenization, trainer

Comments


zzaebok commented Dec 29, 2024

System Info

  • transformers version: 4.47.1
  • Platform: Linux-5.4.0-200-generic-x86_64-with-glibc2.31
  • Python version: 3.10.16
  • Huggingface_hub version: 0.27.0
  • Safetensors version: 0.4.5
  • Accelerate version: 1.2.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.5.1+cu124 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA GeForce RTX 2070 SUPER

Who can help?

@amyeroberts @ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

In the trainer_seq2seq.py file, self.tokenizer is still being called, which produces the deprecation warning "Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.":

def _pad_tensors_to_max_len(self, tensor, max_length):
    if self.tokenizer is not None and hasattr(self.tokenizer, "pad_token_id"):
        # If PAD token is not defined at least EOS token has to be defined
        pad_token_id = (
            self.tokenizer.pad_token_id if self.tokenizer.pad_token_id is not None else self.tokenizer.eos_token_id
        )
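
For context, the warning presumably comes from a backward-compatibility shim roughly along these lines (a minimal sketch, not the actual library source; only the warning message is taken from above):

import warnings

class Trainer:
    def __init__(self, processing_class=None):
        self.processing_class = processing_class

    @property
    def tokenizer(self):
        # Deprecated alias kept for backward compatibility: every access
        # emits the deprecation warning and forwards to processing_class.
        warnings.warn(
            "Trainer.tokenizer is now deprecated. "
            "You should use Trainer.processing_class instead.",
            FutureWarning,
        )
        return self.processing_class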

Expected behavior

I believe self.tokenizer should be replaced with self.processing_class:

def _pad_tensors_to_max_len(self, tensor, max_length):
    if self.processing_class is not None and hasattr(self.processing_class, "pad_token_id"):
        # If PAD token is not defined at least EOS token has to be defined
        pad_token_id = (
            self.processing_class.pad_token_id if self.processing_class.pad_token_id is not None else self.processing_class.eos_token_id
        )
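
To make clear that the change is purely a rename and the fallback logic itself is untouched, here is a standalone sketch of the padding helper with the processor passed in explicitly (DummyProcessor and the function name are illustrative, not library code):

import torch

class DummyProcessor:
    # Stand-in for a tokenizer/processor exposing only the attributes the
    # padding helper relies on (hypothetical example).
    pad_token_id = None
    eos_token_id = 2

def pad_tensors_to_max_len(processor, tensor, max_length):
    # Same fallback as in the snippet above: prefer pad_token_id, fall back
    # to eos_token_id when no PAD token is defined.
    if processor is not None and hasattr(processor, "pad_token_id"):
        pad_token_id = (
            processor.pad_token_id if processor.pad_token_id is not None else processor.eos_token_id
        )
    else:
        raise ValueError("A pad token is required to pad generated tensors.")
    padded = pad_token_id * torch.ones(
        (tensor.shape[0], max_length), dtype=tensor.dtype, device=tensor.device
    )
    padded[:, : tensor.shape[-1]] = tensor
    return padded

print(pad_tensors_to_max_len(DummyProcessor(), torch.tensor([[5, 6, 7]]), max_length=5))
# tensor([[5, 6, 7, 2, 2]])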

Is it okay for me to make a PR for this issue? 😄

zzaebok added the bug label Dec 29, 2024
LysandreJik (Member) commented

Thanks @zzaebok! Would you like to open a PR to fix this warning?
