Incorrect join on large tables for add_tfidf #50
Comments
@jstammers this could be caused by ibis-project/ibis#9014, but I think it might actually be due to a buggy implementation of
Do you have a repro script for this bug? That would help pin this down, and would be great to add as a test to prevent regression.
@jstammers possibly this is also caused by ibis-project/ibis#9703, though I'm not sure.
Hi @NickCrews, I've been away for a few days so haven't had time to respond to this. The only 'script' I have to reproduce this so far is the code I pasted above, but it takes around 10 mins on my machine to generate those samples.
@jstammers sorry, I don't know why I didn't see the Faker. The example above is totally reproducible enough, I'm good!
I've found that the current implementation of `add_tfidf` does not correctly join on the term frequencies for large tables.

Here's an example using `faker` that illustrates the problem.

I've been able to resolve this myself by caching the result of this line:

mismo/mismo/sets/_tfidf.py, line 252 in fc65234

My gut feeling is that it's related to the lazy evaluation of `ibis.row_number()`, which isn't being preserved when joining `term_counts` with `idf`. This feels more like an upstream issue to me, as I would expect the row_number to be consistent even if the intermediate table isn't cached.
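
To make the suspected failure mode concrete, here is a toy model in plain Python. It is not ibis or mismo code: `evaluate` and `count_mismatches` are hypothetical helpers, and the shuffle stands in for the assumption that an uncached, unordered expression may be re-executed with a different row order each time it is referenced. Under that assumption, row numbers assigned on one side of a join need not line up with those on the other side, whereas materializing the numbered table once (the effect of caching) keeps them consistent.

```python
import random


def evaluate(terms, rng):
    """Toy stand-in for executing a lazy, unordered query (an assumption).

    Every call re-runs the "query"; the row order, and therefore the
    row number attached to each term, can change between calls.
    """
    rows = list(terms)
    rng.shuffle(rows)  # models a backend that gives no ordering guarantee
    return [(row_number, term) for row_number, term in enumerate(rows)]


def count_mismatches(left, right):
    """Join two evaluations on row_number and count rows whose terms disagree."""
    right_by_id = dict(right)
    return sum(1 for row_number, term in left if right_by_id[row_number] != term)


rng = random.Random(0)
terms = [f"term_{i}" for i in range(1000)]

# Uncached: each side of the join re-evaluates the numbered table.
uncached_mismatches = count_mismatches(evaluate(terms, rng), evaluate(terms, rng))

# Cached: evaluate once, then reuse the same materialized rows on both sides.
cached = evaluate(terms, rng)
cached_mismatches = count_mismatches(cached, cached)

print(uncached_mismatches, cached_mismatches)
```

In this sketch the uncached join pairs most row numbers with the wrong term, while the cached join has no mismatches. Whether this is what actually happens inside `add_tfidf` depends on how the backend executes the expression, which is why a repro against the real tables is still needed.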