Auto-detect wordpiece tokenizer when model.type is missing #1151

xenova · 2025-01-16T19:57:30Z

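Some older WordPiece tokenizers do not have model.type saved in tokenizer.json. This PR adds logic to auto-detect that case: when the type is missing, a model config whose vocab is an object and which also sets continuing_subword_prefix and unk_token is treated as WordPiece. Thanks to @pcuenca for finding this! 🤗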
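For readers skimming the thread, here is a minimal sketch of the detection heuristic, assuming it only needs to look at the shape of the model section of tokenizer.json. `guessModelType` is a hypothetical helper written for illustration, not a function from the library, and returning a string label (instead of constructing a tokenizer model) is a simplification. The conditions mirror the ones visible in the reviewed diff.

```javascript
// Hypothetical helper (not part of the library's API): guesses the model kind
// from the `model` section of tokenizer.json when `type` may be missing.
function guessModelType(config) {
    // An explicit type always wins.
    if (config.type) return config.type;

    if (config.vocab) {
        if (Array.isArray(config.vocab)) {
            // vocab of type `[string, number][]` is treated as Unigram
            return 'Unigram';
        }
        if (
            typeof config.vocab === 'object' &&
            config.continuing_subword_prefix &&
            config.unk_token
        ) {
            // Object vocab plus subword prefix and unk token: assume WordPiece
            return 'WordPiece';
        }
    }
    // Not enough information to guess.
    return null;
}

// An old-style WordPiece config with no `type` field
const legacy = {
    vocab: { '[UNK]': 0, hello: 1, '##s': 2 },
    continuing_subword_prefix: '##',
    unk_token: '[UNK]',
};
console.log(guessModelType(legacy)); // prints "WordPiece"
console.log(guessModelType({ vocab: [['<unk>', 0]] })); // prints "Unigram"
```

Note that the checks are truthiness-based, so in this sketch a config with an empty-string continuing_subword_prefix would not be detected as WordPiece.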

pcuenca · 2025-01-16T19:59:14Z

src/tokenizers.js

                if (config.vocab) {
                    if (Array.isArray(config.vocab)) {
                        // config.vocab is of type `[string, number][]`
                        // @ts-ignore
                        return new Unigram(config, ...args);
+                    } else if (typeof config.vocab === 'object' && config.continuing_subword_prefix && config.unk_token) {


pcuenca · 2025-01-16T20:03:31Z

By the way, I'm @pcuenca here (but it's fine, @pcuenq is an alt account 😂)

xenova · 2025-01-16T20:07:51Z

> By the way, I'm @pcuenca here (but it's fine, @pcuenq is an alt account 😂)

lol whoops - I think I auto-completed your twitter username 😅

xenova added 2 commits January 16, 2025 19:54

Auto-detect wordpiece tokenizer when model.type is missing

5429254

Update test name

5a2fb1d

pcuenca reviewed Jan 16, 2025
