
Update dataset directory structure and download procedure #7

Merged: 24 commits from dataset-provenance into master, Nov 7, 2024

Conversation

@ntBre (Collaborator) commented Nov 4, 2024

This PR updates the directory structure in the datasets directory, simplifies the scripts for adding new datasets, and uses the new scripts on the industry dataset. The main change to the filtering step is the addition of the ConformerRMSDFilter, which reduced the final record count substantially, from ~76,000 to 64,658.
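For context, here is a minimal sketch of how a ConformerRMSDFilter is typically applied with openff-qcsubmit. The dataset name, filter chain, and parameter values below are illustrative assumptions, not necessarily what new_dataset.py does:

```python
# Sketch: download an optimization dataset from QCArchive and prune
# near-duplicate conformers. The dataset name and max_conformers value
# are illustrative, not the settings used in new_dataset.py.
from openff.qcsubmit.results import OptimizationResultCollection
from openff.qcsubmit.results.filters import ConformerRMSDFilter
from qcportal import PortalClient

client = PortalClient("https://api.qcarchive.molssi.org:443")
dataset = OptimizationResultCollection.from_server(
    client=client,
    datasets=["OpenFF Industry Benchmark Season 1 v1.1"],  # assumed name
    spec_name="default",
)
print(f"before: {dataset.n_results} records")

# Keep at most max_conformers conformers per unique molecule, dropping
# conformers whose RMSD to an already-kept conformer falls below the
# filter's tolerance.
filtered = dataset.filter(ConformerRMSDFilter(max_conformers=10))
print(f"after: {filtered.n_results} records")
```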

As the branch name indicates, another goal of this PR was to retain all of the provenance information associated with retrieving datasets. I think the combination of

  1. the retrieval Python script (new_dataset.py)
  2. the Slurm script (submit.sh)
  3. the Slurm log (which includes the output of mamba list)

should capture all of the provenance information we need.

@ntBre ntBre requested review from amcisaac and lilyminium November 4, 2024 15:58
@lilyminium left a comment

Overall looks good @ntBre! As mentioned, I mostly have docs requests --

Would it be possible to rename new_dataset.py to something more informative (e.g. download_and_filter_dataset.py)? I really like having verbs in a script name to make it obvious what it does; right now I'm not 100% sure whether it's "new" because it downloads a new dataset, or because it's the new version of an old dataset.py script.

In a similar vein, since this is somewhat public-facing, could you please add a description or help string explaining what new_dataset.py does? This could be either a top-level docstring (e.g. https://geo-python.github.io/site/notebooks/L4/writing-scripts.html#Use-a-docstring) or the description argument of ArgumentParser (or both).
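For example, a top-level docstring can double as the ArgumentParser description. A generic sketch, not the actual contents of new_dataset.py; the arguments shown are hypothetical:

```python
"""Download an optimization dataset from QCArchive, apply the standard
filters, and write the filtered result collection to a JSON file."""
import argparse


def main() -> None:
    # Reuse the module docstring as the --help description.
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("dataset_name", help="name of the dataset on QCArchive")
    parser.add_argument(
        "-o", "--output", default="dataset.json",
        help="path for the filtered output JSON",
    )
    args = parser.parse_args()
    ...  # download and filter here


if __name__ == "__main__":
    main()
```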

Following discussion in the science team meeting, would it be possible to update the README to make it clear that submit.sh is the recommended way to run the script, and that if anything about your setup differs (e.g. a different computer or account, or micromamba instead of mamba), the script should be modified to suit? It may seem obvious, but it could stop a bit of future confusion :-)

Otherwise LGTM!

(FYI for the future: we discussed a couple of provenance/versioning issues in the science team meeting, but punted them for more discussion at the next QCSubmit meeting, where they overlap somewhat with the existing agenda.)

Review threads (resolved): datasets/new_dataset.py (2), datasets/README.md, datasets/submit.sh
@amcisaac left a comment

I also think it'd be helpful to have more documentation; I generally agree with Lily's comments but left some of my own as well. In general, though, it looks good!

Review threads (resolved): datasets/README.md, datasets/new_dataset.py (2), datasets/submit.sh (2)
@ntBre (Collaborator, Author) commented Nov 5, 2024

Okay, I think I've handled all of the suggestions. I'm assuming that the dataset itself and the functionality of the Python script looked good, since there were no comments on those.

I just realized that my actual dataset run looks a bit odd now, since it uses the Config class I've since deleted, but I think it's best to leave it as-is rather than try to make it look like it was run with the updated script or, even worse, waste the compute to rerun it.

@amcisaac commented Nov 5, 2024

Looks good to me; the added docs make it a lot easier to understand as an outside user!

Yes, I think the functionality looks good. We may want to add more filters in the future and could pass them in as options or something like that, but for now it looks great!

If you're concerned about the provenance of the current dataset, you could consider leaving the old input file and Config class as one option for running the script, but not documenting it in the README so that it doesn't get used in the future. Alternatively, you could keep both as two documented ways to run the same script, so that you can add more options later as you mentioned. If you don't want to do either, you could also just note somewhere in the README that the current dataset was filtered with a slightly different script; the history is here on GitHub, so it's still findable. Not sure if @lilyminium has an opinion; I'm happy with any of those options.

@lilyminium left a comment

Overall LGTM -- thanks for adding the docs @ntBre, they're a huge improvement and very clear!

On the dataset itself: yeah, I think @amcisaac had good suggestions. Personally I'd lean towards keeping a copy of the old script and config in the directory and marking it as a legacy script (or leaving it in a separate directory called legacy). Instead of mentioning it in the top-level README, you could also add a directory-specific README noting that this is the old version of the script that was actually used to run the dataset. That saves future viewers a bit of digging around GitHub history, which is helpful, without the confusion of two very similar ways to run the same script.

I'll approve to unblock merging.

@ntBre (Collaborator, Author) commented Nov 6, 2024

Awesome, thank you both! I'll add a copy of the original script and a README to the industry dataset directory and then merge this!

@ntBre ntBre merged commit e2d4c9e into master Nov 7, 2024
@ntBre ntBre deleted the dataset-provenance branch November 7, 2024 18:28