Incorrect coordinate blocking with missing values #76

Open
jstammers opened this issue Dec 5, 2024 · 1 comment

Comments

@jstammers
Contributor

I've discovered that mismo.lib.geo.CoordinateBlocker doesn't handle missing values as I'd expect.

If a record has a missing coordinate value, I would not expect it to be blocked, because the returned distance would be NaN.

The following example shows that records with a null coordinate value are in fact blocked together:

from mismo.lib.geo import CoordinateBlocker
import ibis
ibis.options.interactive = True

con = ibis.get_backend()
data = [
    {"record_id": 1, "lat": 1, "lon": 1},
    {"record_id": 2, "lat": 2, "lon": None},
    {"record_id": 3, "lat": 3, "lon": None},
]
table = con.create_table("test", ibis.memtable(data), overwrite=True)

blocker = CoordinateBlocker(lat="lat", lon="lon", distance_km=1000)

blocker(table, table)
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━┳━━━━━━━━━┓
┃ record_id_l ┃ record_id_r ┃ lat_l ┃ lat_r ┃ lon_l   ┃ lon_r   ┃
┡━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━━━╇━━━━━━━━━┩
│ int64       │ int64       │ int64 │ int64 │ float64 │ float64 │
├─────────────┼─────────────┼───────┼───────┼─────────┼─────────┤
│           2 │           3 │     2 │     3 │    NULL │    NULL │
└─────────────┴─────────────┴───────┴───────┴─────────┴─────────┘

In this case, I can see that mismo.lib.geo.distance_km evaluates to NULL for this pair.

I think this can be resolved by modifying the logic here so that it returns null if either lat or lon is null

@NickCrews
Owner

Thanks! Indeed looks like a bug. Expected behavior is that a record where either lat is null or lon is null should be blocked with no other records.

I'll play around when I'm at a computer. That fix you suggest seems promising, thanks!
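
For reference, the expected behavior stated above (a record whose lat or lon is null is blocked with no other records) can be sketched in plain Python. This is only an illustration of the intended semantics, not mismo's implementation: the names `distance_km_or_none` and `block_by_distance` are hypothetical, and the haversine formula is used as a stand-in because the thread does not say how mismo computes distance.

```python
import math
from itertools import combinations
from typing import Optional

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def distance_km_or_none(lat1, lon1, lat2, lon2) -> Optional[float]:
    """Haversine distance in km, or None if any coordinate is missing.

    Hypothetical helper, not part of mismo.
    """
    if any(v is None for v in (lat1, lon1, lat2, lon2)):
        return None
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    # Clamp to 1.0 to guard against floating-point overshoot before sqrt/asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def block_by_distance(records, max_km):
    """Return (record_id_l, record_id_r) pairs whose distance is <= max_km.

    Hypothetical helper: a missing distance is never treated as a match.
    """
    pairs = []
    for left, right in combinations(records, 2):
        d = distance_km_or_none(
            left["lat"], left["lon"], right["lat"], right["lon"]
        )
        if d is not None and d <= max_km:
            pairs.append((left["record_id"], right["record_id"]))
    return pairs


# Same records as in the report above.
data = [
    {"record_id": 1, "lat": 1, "lon": 1},
    {"record_id": 2, "lat": 2, "lon": None},
    {"record_id": 3, "lat": 3, "lon": None},
]
print(block_by_distance(data, max_km=1000))  # prints []
```

With the data from the report, records 2 and 3 each have a null lon, so every pair involving them is skipped and no pairs are returned.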
