Update Fuzzy dedup params for long strings support. #77
Conversation
Mostly looks good to me, just a minor nit around max_text_bytes_per_part
LGTM, added a general question.
```python
if input_meta is not None:
    read_kwargs["prune_columns"] = True
```
Do you think we should move away from input_meta in favor of a keyword like dtype (as in Pandas' and cuDF's read_json) and have the user configure prune_columns themselves? I think this would align with #50 too.
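For illustration, a minimal sketch of what that dtype-based alternative could look like (the file path and column names here are assumptions, not the project's actual schema):

```python
import cudf

# Hypothetical dtype-based call: the caller supplies dtypes directly, as in
# pandas/cuDF read_json, and opts into column pruning explicitly instead of
# passing input_meta. Column names are assumed for illustration.
df = cudf.read_json(
    "data.jsonl",
    lines=True,
    dtype={"id": "str", "text": "str"},
    prune_columns=True,  # keep only the columns listed in dtype
)
```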
I'm generally in favor of overhauling the IO helpers in the current setup for something better when we tackle #50. I'll share more thoughts there, but moving toward encouraging users to use the read_xyz APIs is the easier path. We can then have a common helper that dispatches to the relevant read_xyz API based on the file type, rather than the other way around, where read_json goes through a common read method that handles different formats.
Regarding prune_columns specifically: this change is important in newer versions of RAPIDS because many public datasets, like rpv1, do not have consistent metadata across all their files. If we do not prune columns down to just the ID and text fields, cuDF will now fail with inconsistent metadata errors.
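A rough sketch of that dispatch shape, assuming a hypothetical helper name (this is not the project's actual code); it also mirrors the input_meta/prune_columns logic quoted above:

```python
import cudf

def read_data(path, file_type="jsonl", dtype=None, **read_kwargs):
    """Hypothetical dispatcher: route to the relevant read_xyz API by file type."""
    if file_type == "jsonl":
        # Pruning to a known schema (e.g. just the id and text columns) avoids
        # cuDF's inconsistent-metadata errors on datasets like rpv1 whose files
        # do not all share the same columns.
        return cudf.read_json(
            path,
            lines=True,
            dtype=dtype,
            prune_columns=dtype is not None,
            **read_kwargs,
        )
    if file_type == "parquet":
        return cudf.read_parquet(path, **read_kwargs)
    raise ValueError(f"Unsupported file type: {file_type}")
```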
LGTM
LGTM
* Expose more configurations to test long string support
* Export libcudf env for long string support
* Default to using larger batches
* Remove large strings env variable since it's enabled by default in 24.8 and above
* Remove debug print, filter nulls before bucketing
* Remove hardcoded id field

Signed-off-by: Ayush Dattagupta <[email protected]>
Signed-off-by: Yang Yu <[email protected]>
Fuzzy deduplication is currently accelerated via cuDF, which until release 24.04 had a limitation that a string column could not exceed int32 (2^31 - 1) characters. Consequently, several defaults and parts of the core logic in the deduplication pipeline aim to avoid errors in cases where this value might be exceeded.
Starting with 24.06, cuDF has experimental support for longer strings (int64 character counts), and this PR changes defaults and simplifies the logic around handling long strings.
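As a hedged illustration of why the old limit shaped the pipeline's defaults (the function name and batch caps below are assumptions for illustration, not the PR's actual values):

```python
INT32_MAX = 2**31 - 1  # pre-24.06 cap on total characters in one cuDF string column

def max_text_bytes_per_part(large_strings_enabled: bool) -> int:
    """Illustrative per-partition text budget for fuzzy dedup batching."""
    if large_strings_enabled:
        # cuDF 24.06+ has experimental large-string (int64) support, enabled
        # by default in 24.08+, so much larger batches become safe.
        return 8 * INT32_MAX
    # Otherwise stay well below the int32 limit, leaving headroom for
    # intermediate string operations.
    return INT32_MAX // 2
```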