
Question regarding error metrics/dataset creation #15

Open
nsrishankar opened this issue Jan 24, 2022 · 1 comment

@nsrishankar

I had a few questions/clarifications regarding the hdf5 dataset that was linked in the notebook:

  1. Training from scratch with the notebook and the existing hdf5, I obtained a CER of ~0.09 using just a single model (not an ensemble).
  2. When I create the hdf5 from scratch and run the same training procedure, my CER is only similar to the best/second-best models (~0.16-0.18).
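
For reference, CER here is presumably the standard character error rate: the character-level Levenshtein (edit) distance between the predicted and reference strings, divided by the reference length. A minimal sketch (the repo may use a library implementation instead; the function names below are just for illustration):

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Character-level edit distance via the standard DP recurrence."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return levenshtein(ref, hyp) / max(len(ref), 1)
```

So a CER of ~0.09 means roughly 9 character edits per 100 reference characters.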

So, as far as I can tell, the main difference must be in the dataset generation/preprocessing steps or the tokenizer:
a. The notebook has a comment saying the pretrained models used a vocab size of 100, as opposed to 99 (95 characters + SOS/EOS/PAD/UNK tokens). Is there an additional token used here?
b. Was the generation procedure for the hdf5 that was linked on the Google Drive slightly different?
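
To make the counting in (a) concrete, here is a hypothetical reconstruction of the 99-token vocabulary described above (the repo's actual character set and special-token names may differ):

```python
import string

# Assumed special tokens from the issue text: SOS/EOS/PAD/UNK.
special = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]

# The 95 printable ASCII characters (digits, letters, punctuation, space).
chars = list(string.printable[:95])

vocab = special + chars
# 4 special tokens + 95 characters = 99; a vocab size of 100
# would imply one extra token unaccounted for here.
```
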

Thank you!

@him4318 (Owner) commented Mar 18, 2022

I don't remember the exact details of the first iteration, but I am working on a paper covering the different pre-processing experiments, which should help the community. I will update you once it is finalized.
