Better mimic DocumentDataset's `read_` functions to Dask's `read_` functions #50

sarahyurick · 2024-05-03T18:01:37Z

Right now, DocumentDataset has three read_* functions:
(1)

def read_json(
    cls,
    input_files,
    backend="pandas",
    files_per_partition=1,
    add_filename=False,
)

(2)

def read_parquet(
    cls,
    input_files,
    backend="pandas",
    files_per_partition=1,
    add_filename=False,
)

(3)

def read_pickle(
    cls,
    input_files,
    backend="pandas",
    files_per_partition=1,
    add_filename=False,
)

It would be good if these functions could support Dask's read_json and read_parquet parameters (there is no read_pickle function in Dask but we can perhaps look to Pandas for this).

In addition to this, we can restructure our to_* functions as well.
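For the read_* side, one way to get Dask's parameters without re-declaring each of them is plain keyword forwarding. This is only a rough sketch: `read_parquet_sketch` and its `reader` argument are hypothetical names, not the current DocumentDataset API, and the only real call assumed is `dask.dataframe.read_parquet`.

```python
from typing import Any, Callable, Optional, Sequence


def read_parquet_sketch(
    input_files: Sequence[str],
    reader: Optional[Callable[..., Any]] = None,
    **kwargs: Any,
) -> Any:
    """Hypothetical wrapper: every extra keyword goes straight to the reader."""
    if reader is None:
        # Imported lazily so the sketch can be exercised without Dask installed.
        import dask.dataframe as dd

        reader = dd.read_parquet
    # e.g. columns=["id", "text"] reaches the underlying reader untouched.
    return reader(list(input_files), **kwargs)
```

With something like this, `read_parquet_sketch(files, columns=["id", "text"])` would behave like the equivalent Dask call, and parameters specific to DocumentDataset could be peeled off before forwarding.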

ayushdg · 2024-05-03T18:52:49Z

I believe the reason we have a custom read_json implementation is so that we can specify files_per_partition and combine multiple files into a single cudf read_json call, which isn't supported in Dask DataFrame. Since Parquet and a few other formats support many parameters out of the box, it makes sense to mimic Dask in the Parquet case.
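To make the files_per_partition idea concrete, here is a minimal sketch of the grouping step only. `group_files` is a hypothetical helper, not the existing implementation; each resulting group would then be handed to a single read call to produce one partition.

```python
from typing import List, Sequence


def group_files(
    input_files: Sequence[str], files_per_partition: int = 1
) -> List[List[str]]:
    """Split a file list into consecutive groups, one group per partition."""
    if files_per_partition < 1:
        raise ValueError("files_per_partition must be at least 1")
    # Sort so that the file-to-partition assignment is deterministic.
    files = sorted(input_files)
    return [
        files[start : start + files_per_partition]
        for start in range(0, len(files), files_per_partition)
    ]
```

The last group may be smaller than files_per_partition when the file count does not divide evenly.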

sarahyurick · 2024-09-09T20:18:56Z

From #46:
"
https://github.com/NVIDIA/NeMo-Curator/blob/main/examples/fuzzy_deduplication.py#L53-L60

Is there a reason you didn't use DocumentDataset.read_parquet? I would prefer to use that or expand its flexibility such that you can do what you need to do.

Yeah, the DocumentDataset.read_parquet functionality is a bit lacking in column support and a few other missing config options. I'd prefer the DocumentDataset.read_parquet method to mimic Dask's read_parquet for the time being.

I would be interested in that discussion as well. My intuition is that we should mimic the behavior of Dask as much as possible, but there might be good reasons to deviate.

Yeah, I agree that the goal should be to mimic Dask's read_* functions as best as possible, probably with kwargs.
"

sarahyurick · 2024-09-09T20:33:15Z

From #130:
"
Couple of things here:

After reading the body of this function, the num_samples parameter is misleading in its name. A sample typically refers to a single document, while in this case it appears to be referring to a file. Can this be renamed to num_files or num_shards?
I am not a fan of having another unique function for reading files or figuring out which files to read. It makes code much more confusing and harder to maintain. I want to enforce some kind of consistency. Even within semantic dedup, each of the three CLI scripts has a different way of reading in files:
- compute_embeddings.py (this script) uses read_data with the new get_input_files function.
- clustering.py uses dask_cudf.read_parquet.
- extract_dedup_data.py reads the files deep inside SemanticClusterLevelDedup.compute_semantic_match_dfs, which eventually calls cudf.read_parquet.
We already have noted that our file operations are not easy to work with, and our future plans are only going to get harder as we introduce more ways of reading in files.

The way I want users (and us) to read in files (right now) is this:
- Use DocumentDataset.read_* whenever you know the datatype at the time of writing the script.
- Use read_data whenever you don't. We should eventually make a similar function directly in DocumentDataset, but that's beside the point.
read_data should be the way to go in the CLI scripts. get_remaining_files or get_all_files_paths_under can be helpers for that function if needed (I'm not a fan of having two helpers in the first place either, but again, beside the point). I'd rather not have a new helper method like this in the mix too. In this case, perhaps we could merge this function with the get_remaining_files function. See my comment below for more on that.

Furthermore, we shouldn't need to be working around our file operations. If we feel that we need to do that, we should modify them instead to fit our use case. I know we're in a crunch right now, but anything you can do to get us closer to the ideal case I mentioned above would be great.

"

and

"
I agree with the spirit of having a consistent IO format, but we won't be able to do it until we address #50. For example:

compute_embeddings.py (this script) uses read_data with the new get_input_files function.
Agreed, merging with read_data.
clustering.py uses dask_cudf.read_parquet because we don't have block-wise support, which is important for performance. Once we fix that, I am happy to revisit this.
SemanticClusterLevelDedup.compute_semantic_match_dfs calls cudf.read_parquet. Unfortunately, there is no straightforward way around this. We should pick the tool as needed, especially for complex workflows, so I think we are stuck there.

For now, I will link #50 here and merge get_remaining_files. I hope that's a good middle path.
"

sarahyurick · 2024-09-09T20:36:09Z

From #77:
"
Do you think we should move away from input_meta in favor of a keyword like dtype (like Pandas' and cuDF's read_json) and having the user configure prune_columns themselves?

I'm generally in favor of overhauling the IO helpers in the current setup for something better when we tackle #50. I'll share more thoughts there, but encouraging users to use the read_xyz APIs is easier.
We can then have a common helper that, based on the file type, dispatches to the relevant read_xyz API, rather than the other way around, where read_json goes to a common read method that handles different formats.

Regarding prune_columns specifically: this change is important in newer versions of RAPIDS because many public datasets, such as rpv1, do not have consistent metadata across all their files. If we do not prune columns to just ID and text, cuDF will now fail with inconsistent metadata errors.
"
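The "common helper that dispatches by file type" idea from the quote above could look roughly like the sketch below. `read_by_extension` and the `readers` mapping are hypothetical; in practice the mapping would presumably point at the DocumentDataset.read_* methods, but that wiring is an assumption and is left to the caller here.

```python
from pathlib import Path
from typing import Any, Callable, Dict, Sequence


def read_by_extension(
    input_files: Sequence[str],
    readers: Dict[str, Callable[..., Any]],
    **kwargs: Any,
) -> Any:
    """Pick a reader from the shared file extension and forward kwargs to it."""
    extensions = {Path(f).suffix.lower() for f in input_files}
    if len(extensions) != 1:
        raise ValueError(f"Expected exactly one file type, got: {sorted(extensions)}")
    (extension,) = extensions
    try:
        reader = readers[extension]
    except KeyError:
        raise ValueError(f"No reader registered for {extension!r}") from None
    # The format-specific reader stays the single source of truth for its options.
    return reader(list(input_files), **kwargs)
```

The point of this direction is that format-specific options live in the format-specific reader, and the generic helper only decides which one to call.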

sarahyurick · 2024-10-21T21:28:31Z

Related PRs:

sarahyurick · 2024-10-23T19:57:41Z

Another TODO: Support for .json.gz.
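For the .json.gz case, part of the work is probably just recognizing a compression suffix stacked on top of the format suffix. A small sketch, with `split_format_and_compression` and the suffix table as hypothetical names chosen for illustration:

```python
from pathlib import Path
from typing import Optional, Tuple

# Assumed mapping from a trailing suffix to a compression name.
COMPRESSION_SUFFIXES = {".gz": "gzip", ".bz2": "bz2"}


def split_format_and_compression(path: str) -> Tuple[str, Optional[str]]:
    """Return (file_format, compression) for names like 'part.json.gz'."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    compression = None
    if suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        # Drop the compression suffix so the format suffix becomes the last one.
        compression = COMPRESSION_SUFFIXES[suffixes.pop()]
    file_format = suffixes[-1].lstrip(".") if suffixes else ""
    return file_format, compression
```

The resulting compression value could then be forwarded to a reader that accepts a `compression` argument, as pandas' read_json does.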

sarahyurick added the enhancement New feature or request label May 3, 2024

sarahyurick mentioned this issue May 3, 2024

High level fuzzy duplicates module #46

Merged

3 tasks

ryantwolf mentioned this issue Jul 3, 2024

Enable Sem-dedup #130

Merged

3 tasks

sarahyurick mentioned this issue Sep 6, 2024

Update Fuzzy dedup params for long strings support. #77

Merged

sarahyurick mentioned this issue Oct 14, 2024

[BUG] Semdedup Embedding Restart not working cleanly #211

Closed

sarahyurick self-assigned this Oct 22, 2024

ayushdg mentioned this issue Oct 22, 2024

Skip reading files with incorrect extension #318

Merged
