Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect join on large tables for add_tfidf #50

Open
jstammers opened this issue Jul 24, 2024 · 4 comments
Open

Incorrect join on large tables for add_tfidf #50

jstammers opened this issue Jul 24, 2024 · 4 comments

Comments

@jstammers
Copy link
Contributor

I've found that the current implementation of add_tfidf does not correctly join on the term frequencies for large tables.
Here's an example using faker that illustrates the problem

from mismo.sets import add_tfidf
import pandas as pd
import ibis
from ibis import _
from faker import Faker
import numpy as np
ibis.options.interactive = True

f = Faker()
Faker.seed(1234)

addresses = [f.address() for _ in range(2_000_000)]

table = ibis.memtable({"address":addresses}).mutate(tokens=_.address.split(" "))
table = add_tfidf(table, 'tokens')

table
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ addresstokenstokens_tfidf                                                            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ stringarray<string>map<string, float64>                                                    │
├─────────────────────────────┼───────────────────────────────────┼─────────────────────────────────────────────────────────────────────────┤
│ USS Hernandez\nFPO AP 03359 │ ['USS', 'Hernandez\nFPO', ... +2] │ {'USS': 2.357353036528199, 'AP': 1.6684075613664435, ... +2}            │
│ USNV Jenkins\nFPO AA 17631  │ ['USNV', 'Jenkins\nFPO', ... +2]  │ {'USS': 2.357353036528199, 'AP': 1.6684075613664435, ... +2}            │
│ USS Stanton\nFPO AP 43458   │ ['USS', 'Stanton\nFPO', ... +2]   │ {'AP': 1.6684075613664435, 'USNS': 2.3627657713296375, ... +2}          │
│ USNS Brown\nFPO AP 97018    │ ['USNS', 'Brown\nFPO', ... +2]    │ {'Olson\nFPO': 4.9226093222060765, 'USNV': 2.3593934563875294, ... +2}  │
│ USS Meza\nFPO AE 53363      │ ['USS', 'Meza\nFPO', ... +2]      │ {'13102': 5.521460917862246, 'Espinoza\nFPO': 5.21556014730925, ... +2} │
│ USNS Espinoza\nFPO AP 13102 │ ['USNS', 'Espinoza\nFPO', ... +2] │ {'USS': 2.357353036528199, 'Baker\nFPO': 4.511930402516782, ... +2}     │
│ USS Baker\nFPO AP 67296     │ ['USS', 'Baker\nFPO', ... +2]     │ {'USNV': 2.3593934563875294, '68208': 5.50607508852887, ... +2}         │
│ USNV Sanchez\nFPO AE 68208  │ ['USNV', 'Sanchez\nFPO', ... +2]  │ {'USNV': 2.3593934563875294, 'AA': 1.6625767137942553, ... +2}          │
│ USS Jackson\nFPO AA 20151   │ ['USS', 'Jackson\nFPO', ... +2]   │ {'USS': 2.357353036528199, '49182': 5.665301954088137, ... +2}          │
│ USS Harper\nFPO AP 49182    │ ['USS', 'Harper\nFPO', ... +2]    │ {'USNV': 2.3593934563875294, 'AA': 1.6625767137942553, ... +2}          │
│ …                           │ …                                 │ …                                                                       │
└─────────────────────────────┴───────────────────────────────────┴─────────────────────────────────────────────────────────────────────────┘

I've been able to resolve this myself by caching the result of this line

with_counts = with_counts.mutate(__row_number=ibis.row_number())

My gut feeling is that it's related to the lazy evaluation of ibis.row_number() which isn't being preserved when joining term_counts with idf. This feels more like an upstream issue to me as I would expect the row_number to be consistent even if the intermediate table isn't cached

@NickCrews
Copy link
Owner

@jstammers this could be getting caused by ibis-project/ibis#9014, but I think it might actually be due to a buggy implementation of add_array_value_counts(). If I remember right I recently found that was giving me weird results, and I don't actually have any unit tests for it yet 😬

Do you have a repro script for this bug? that would help pin this down, and would be great to add as a test to prevent regression.

@NickCrews
Copy link
Owner

@jstammers possibly this is also getting caused by ibis-project/ibis#9703, though I'm not sure

@jstammers
Copy link
Contributor Author

Hi @NickCrews I've been away for a few days so haven't had time to respond to this. The only 'script' I have to reproduce this so far is the code I pasted above, but it takes around 10 mins on my machine to generate those samples.
I'll see if I can use hypothesis to generate a smaller failing test case

@NickCrews
Copy link
Owner

@jstammers sorry, I don't know why I didn't see the Faker, the example above is totally reproducible enough, I'm good!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants