Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better mimic DocumentDataset's read_* functions to Dask's read_* functions #50

Open
sarahyurick opened this issue May 3, 2024 · 6 comments
Assignees
Labels
enhancement New feature or request

Comments

@sarahyurick
Copy link
Collaborator

Right now, DocumentDataset has a couple of read_* functions:
(1)

def read_json(
    cls,
    input_files,
    backend="pandas",
    files_per_partition=1,
    add_filename=False,
)

(2)

def read_parquet(
    cls,
    input_files,
    backend="pandas",
    files_per_partition=1,
    add_filename=False,
)

(3)

def read_pickle(
    cls,
    input_files,
    backend="pandas",
    files_per_partition=1,
    add_filename=False,
)

It would be good if these functions could support Dask's read_json and read_parquet parameters (there is no read_pickle function in Dask but we can perhaps look to Pandas for this).

In addition to this, we can restructure our to_* functions as well.

@sarahyurick sarahyurick added the enhancement New feature or request label May 3, 2024
@ayushdg
Copy link
Collaborator

ayushdg commented May 3, 2024

I believe the reason we have a custom read_json implementation is the ability to specify files_per_partition and combine multiple files into a single read_json call from cudf which isn't supported in dask dataframe. Since parquet and a few others have support for many params ootb, it makes sense to mimic dask in the parquet case.

@sarahyurick
Copy link
Collaborator Author

sarahyurick commented Sep 9, 2024

From #46:
"
https://github.com/NVIDIA/NeMo-Curator/blob/main/examples/fuzzy_deduplication.py#L53-L60

Is there a reason you didn't use DocumentDataset.read_parquet? I would prefer to use that or expand its flexibility such that you can do what you need to do.

Yeah, the DocumentDataset.read_parquet functionality is a bit lacking in column support and a few other missing config options. I'd prefer the DocumentDataset.read_parquet method to mimic Dask's read_parquet for the time being.

I would be interested in that discussion as well. My intuition is that we should mimic the behavior of Dask as much as possible, but there might be good reasons to deviate.

Yeah, I agree that the goal should be to mimic Dask's read_* functions as best as possible, probably with kwargs.
"

@sarahyurick
Copy link
Collaborator Author

sarahyurick commented Sep 9, 2024

From #130:
"
Couple of things here:

  • After reading the body of this function, the num_samples parameter is misleading in its name. A sample typically refers to a single document, while in this case it appears to be referring to a file. Can this be renamed to num_files or num_shards?

  • I am not a fan of having another unique function for reading files/figuring out what files to be read. It makes code much more confusing and harder to maintain. I want to enforce some kind of consistency. Even within semantic dedup, each of the three CLI scripts have a different way of reading in files:

    • compute_embeddings.py (this script) uses read_data with the new get_input_files function.
    • clustering.py uses dask_cudf.read_parquet.
    • extract_dedup_data.py reads the files in deep in SemanticClusterLevelDedup.compute_semantic_match_dfs, which eventually calls cudf.read_parquet

    We already have noted that our file operations are not easy to work with, and our future plans are only going to get harder as we introduce more ways of reading in files.

    The way I want users (and us) to read in files (right now) is this:

    • Use DocumentDataset.read_* whenever you know the datatype at the time of writing the script.
    • Use read_data whenever you don't. We should eventually make a similar function directly in DocumentDataset, but that's beside the point.

    read_data should be the way to go with in the CLI scripts. get_remaining_files or get_all_files_paths_under can be helpers for that function if needed (I'm not a fan of having two helpers in the first place either, but again, beside the point). I'd rather not have a new helper method like this in the mix too. In this case, perhaps we could merge this function with the get_remaining_files function. See my comment below for more on that.

    Furthermore, we shouldn't need to be working around our file operations. If we feel that we need to do that, we should modify them instead to fit our usecase. I know we're on a crunch right now, but anything you can do to get us closer to the ideal case I mentioned above would be great.

"

and

"
I agree with the spirit of having consistent IO format but we wont be able to do it till we address #50, like

  • compute_embeddings.py (this script) uses read_data with the new get_input_files function.
    Agreed, merging with read_data.

  • clustering.py uses dask_cudf.read_parquet because we don't have a block-wise support which is important from performance. Once we fix it, I am happy to revisit this.

  • SemanticClusterLevelDedup.compute_semantic_match_dfs calls cudf.read_parquet, Unfortunately there is no straightforward way for this. We should pick the tool as needed especially for complex workflows so I think we are stuck there.

For now, I will link #50 here and merge get_remaining_files. I hope that's a good middle path.
"

@sarahyurick
Copy link
Collaborator Author

From #77:
"
Do you think we should move away from input_meta in favor of a keyword like dtype (like Pandas' and cuDF's read_json) and having the user configure prune_columns themselves?

I'm generally in favor of overhauling the IO helpers in the current setup for something better. When we tackle #50. I'll share more thoughts there, but moving to encouraging users using the read_xyz api's is easier.
We can then have a common helper that based on the filetype directs to the relevant read_xyz api rather than the other way around where read_json goes to a common read method that handles different formats.

Regarding: prune_columns specifically: This change is important in newer versions of rapids because many public datasets like rpv1 do not have consistent metadata across all their files. If we do not prune columns to just ID & Text, cuDF will now fail with inconsistent metadata errors.
"

@sarahyurick
Copy link
Collaborator Author

sarahyurick commented Oct 21, 2024

@sarahyurick
Copy link
Collaborator Author

Another TODO: Support for .json.gz.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants