Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Vocab.oov_prob, fix lexeme_ lookups table creation #12242

Closed

Conversation

adrianeboyd
Copy link
Contributor

Description

Remove Vocab.oov_prob and Vocab.cfg (both unused) and fix how lexeme_prob and lexeme_cluster tables are added dynamically when an attribute is set on a vocab where these tables don't already exist.

Types of change

?

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

Remove `Vocab.oov_prob` and `Vocab.cfg` (both unused) and fix how
`lexeme_prob` and `lexeme_cluster` tables are added dynamically when an
attribute is set on a vocab where these tables don't already exist.
@adrianeboyd adrianeboyd added the 🔜 v4.0 Related to upcoming v4.0 label Feb 7, 2023

assert lex1 < lex2
assert lex2 > lex1
@pytest.mark.parametrize("word1,prob1,word2,prob2", [("NOUN", -1, "opera", -2)])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original test did not make sense. lex1 < lex2 is comparing ORTH and not PROB. Replaced with a test for the lookups table creation.

@adrianeboyd
Copy link
Contributor Author

This is meant to be a teeny step in the right direction in case these attributes are preserved in this format.

It may not make sense to continue to keep the Lexeme properties that have been kept for backwards compatibility with lookups tables instead of values stored on LexemeC.

They would be a good candidate for a refactored lex_attr_getters where these can be added flexibly and the data structures and defaults and backoffs would come from user code rather than being hard-coded here.

@adrianeboyd adrianeboyd closed this Feb 7, 2023
@adrianeboyd adrianeboyd reopened this Feb 7, 2023
@adrianeboyd
Copy link
Contributor Author

Actually, I'm not sure this is helpful for lex_attr_getters because those are only for the values in the struct. It is a mess at this point!

@svlandeg svlandeg added the feat / doc Feature: Doc, Span and Token objects label Jun 10, 2023
@honnibal
Copy link
Member

You'll maybe hate me for saying this because it's indeed been pretty broken / messy. But:

I don't feel great about having the OOV_PROB be a constant value. This is probably pedantic because nobody really makes these unigram frequency distributions anymore, but what we're really trying to do here is say if n=0 in some sample, what probability should we assume that the item has of occurring? The value that it's sensible to say for this will depend on how many items we've sampled. (As a probably irrelevant technical note, in the early versions of the lexeme data these probabilities were estimated with Simple Good Turing estimation, so the probability was a projection based on the probability of items with count 2 and the probability of items with count 1).

Is there a version of this that lets us keep the oov probability as a value in the vocab, but handles it correctly?

@adrianeboyd adrianeboyd marked this pull request as draft July 31, 2023 11:49
@svlandeg svlandeg deleted the branch explosion:v5 January 29, 2024 09:15
@svlandeg svlandeg closed this Jan 29, 2024
@svlandeg svlandeg added 🔜 v5.0 Related to upcoming v5.0 and removed 🔜 v4.0 Related to upcoming v4.0 labels Jan 29, 2024
@svlandeg svlandeg reopened this Jan 29, 2024
@svlandeg svlandeg changed the base branch from v4 to v5 January 29, 2024 09:25
@svlandeg
Copy link
Member

Temporarily closing this PR as we currently don't have the bandwidth to finish this.

@svlandeg svlandeg closed this Jan 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feat / doc Feature: Doc, Span and Token objects 🔜 v5.0 Related to upcoming v5.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants