Address parsing performance #47
Comments
Hi @lmores, thanks for looking into this - that looks pretty promising. I'm not able to replicate this using a test dataset of my own with ~130k records.

```python
parse_start = time()
t = t.mutate(
    libpostal_address=postal_parse_address(_.full_address)
).cache()
print(t)
print(time() - parse_start)
```

When I do that I see a runtime of around 32s for the current version and 28s for your version. I've also tried profiling the execution time of

```python
from postal.parser import parse_address

addresses = t.full_address.execute()
parse_start = time()
s = [parse_address(p) for p in addresses.to_list()]
print(time() - parse_start)
```

which gives me a time of ~21s. Given that we're calling this as a non-vectorized Python UDF that evaluates it row-by-row, I don't expect we would do any better than this without some parallelization.
Thanks for noting that I should call `.cache()`. I will experiment a bit more and let you know if I manage to achieve some more gains.
Ideally I would like the dataset to be ~1M records, since duckdb parallelizes in chunks of ~122k. If we use at least this much data then we will actually see a difference if we can get parallelizing to work. More than 1M records seems unneeded to check for parallelization, and I bet 1M is still going to be really slow. It might even be so slow it's unusable, IDK, if 130k takes 30 seconds...

Here are 1M addresses from the Alaska Public Offices Commission; how about we try making your script reproducible with this? These are public records from campaign contributions: political campaigns are required to disclose who they are getting their money from.

Yes, you will need to call `.cache()` just before the benchmark to ensure all the data is loaded into the database, and then `.cache()` right after to test for it persisting. To ensure we don't get the overhead from copying the 1M source rows, we should only persist the output:

```python
inp = inp.cache()
start = time.time()
res = func(inp.oneline)
persisted = res.as_table().cache()
end = time.time()
```

A good way to do this would be to try to assign blame to different parts of the UDF, by making several different versions that mock various parts out:
The deltas between these should give us a sense of how much each part is contributing to the timing. Or, more likely, the numbers won't appear to make any sense, and then we figure out what to do :)
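Something like the following hypothetical variants could work. The `@ibis.udf.scalar.python` decorator, the string return type, and the post-processing shown are my assumptions about how the UDF might be defined, not the project's actual code:

```python
# Hypothetical variants for isolating where the time goes.
import json

import ibis
from postal.parser import parse_address


@ibis.udf.scalar.python
def udf_noop(address: str) -> str:
    # Pure row-by-row UDF call overhead: no parsing at all.
    return address


@ibis.udf.scalar.python
def udf_no_parse(address: str) -> str:
    # Everything except the libpostal call: build the same output shape
    # from a dummy result, to isolate post-processing/serialization cost.
    dummy = [(address, "house")]
    return json.dumps({label: value for value, label in dummy})


@ibis.udf.scalar.python
def udf_full(address: str) -> str:
    # The real thing: libpostal parse plus the same post-processing.
    parsed = parse_address(address)
    return json.dumps({label: value for value, label in parsed})
```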
It's good that we have different machines to compare between; I have an M1 MacBook. @jstammers, if you run your benchmark on the above data, can you share your specs?
IDK how duckdb/other backends handle parallelization with UDFs; my impression is not well. My first hunch would be to pursue caching. At the Python level it would be trivial using `functools.cache()`, or, more involved, we could even do it at the duckdb level and actually reduce the number of times we call the UDF.
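A minimal sketch of the Python-level option, assuming the UDF body ultimately just wraps `parse_address()`. I've used a bounded `lru_cache` rather than an unbounded `functools.cache`, since the column may contain millions of distinct strings:

```python
from functools import lru_cache

from postal.parser import parse_address


@lru_cache(maxsize=500_000)  # bound the cache; the address strings are hashable keys
def parse_address_cached(address: str):
    # Pays the libpostal cost only once per distinct address string.
    # Returning a tuple keeps the cached value immutable.
    return tuple(parse_address(address))
```

If most addresses are distinct this buys very little, which is exactly the question raised in the next comment.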
On my personal computer with the following specs, I can parse the addresses for that dataset in 89s.

On my work VM, which I used for the previous test, with the following specs, I see a runtime of 153s.

I like the idea of caching the address parsing, particularly as there's likely to be a lot of redundancy in the addresses, although that would probably require some further digging into the inner workings. `functools.cache()` will certainly be simpler to implement, but I'm not sure there'll be significant improvements if the addresses are mostly distinct.
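One cheap way to settle the "mostly distinct" question before building anything would be to count distinct addresses first. This is a hypothetical snippet reusing the `t`/`full_address` names from the examples above:

```python
# If n_distinct is close to n_total, per-address caching won't save much work.
n_total = t.count().execute()
n_distinct = t.full_address.nunique().execute()
print(f"{n_distinct}/{n_total} distinct ({n_distinct / n_total:.1%})")
```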
Made a very rough benchmark on a dataset with `2_192_071` Italian street addresses (that unfortunately I am not allowed to share). I replaced the current implementation of `postal_parse_address()` with an alternative version.

I am starting from an ibis table `t` backed by duckdb with the following columns:

```python
['record_id', 'administrative_area', 'locality', 'postal_code', 'address_lines', 'recipients', 'region_code', 'language_code', 'sorting_code', 'sublocality', 'organization']
```

I add a new column containing the whole address.
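For context, a hypothetical way to build such a column in ibis might look like the following. This is not the snippet used for the benchmark; which columns to include, their order, and the separators are my assumptions:

```python
# Sketch: concatenate a few of the columns above into one address string.
# Null handling (e.g. coalescing missing parts to "") is omitted for brevity.
t = t.mutate(
    full_address=t.address_lines
    + ", " + t.locality
    + ", " + t.administrative_area
    + " " + t.postal_code
)
```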
Then I run the following snippet for both versions of `postal_parse_address()`:

The current implementation takes `4.291484832763672` seconds, the above one just `0.206512451171875` seconds.

My laptop specs:

@NickCrews @jstammers: