
Update dataset directory structure and download procedure #7

Merged: 24 commits from dataset-provenance into master, Nov 7, 2024

Conversation

@ntBre (Collaborator) commented Nov 4, 2024

This PR updates the directory structure in the datasets directory, simplifies the scripts for adding new datasets, and uses the new scripts on the industry dataset. The main change to the filtering step is the addition of the ConformerRMSDFilter, which reduced the final record count substantially, from ~76,000 to 64,658.
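For context, here is a minimal sketch of how a ConformerRMSDFilter is typically applied with openff-qcsubmit. The dataset name, filter chain, and parameter values below are illustrative assumptions, not necessarily what new_dataset.py does:

```python
# Sketch: download an optimization dataset from QCArchive and prune
# near-duplicate conformers. The dataset name and max_conformers value
# are illustrative, not the settings used in new_dataset.py.
from openff.qcsubmit.results import OptimizationResultCollection
from openff.qcsubmit.results.filters import ConformerRMSDFilter
from qcportal import PortalClient

client = PortalClient("https://api.qcarchive.molssi.org:443")
dataset = OptimizationResultCollection.from_server(
    client=client,
    datasets=["OpenFF Industry Benchmark Season 1 v1.1"],  # assumed name
    spec_name="default",
)
print(f"before: {dataset.n_results} records")

# Keep at most max_conformers conformers per unique molecule, dropping
# conformers whose RMSD to an already-kept conformer falls below the
# filter's tolerance.
filtered = dataset.filter(ConformerRMSDFilter(max_conformers=10))
print(f"after: {filtered.n_results} records")
```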

As the branch name indicates, another goal of this PR was to retain all of the provenance information associated with retrieving datasets. I think the combination of

  1. the retrieval Python script (new_dataset.py)
  2. the Slurm script (submit.sh)
  3. the Slurm log (which includes the output of mamba list)

should capture all of the provenance information we need.

@ntBre ntBre requested review from amcisaac and lilyminium November 4, 2024 15:58
@lilyminium left a comment

Overall looks good @ntBre! As mentioned, I mostly have docs requests --

Would it be possible to rename new_dataset.py to something more informative (e.g. download_and_filter_dataset.py)? I really like having verbs in a script name to make it obvious what it does; right now I'm not 100% sure whether it's "new" because it downloads a new dataset, or because it's the new version of an old dataset.py script.

In a similar vein, since this is somewhat public-facing, could you please add a description or help string explaining what new_dataset.py does? This could be either a top-level docstring (e.g. https://geo-python.github.io/site/notebooks/L4/writing-scripts.html#Use-a-docstring) or the description argument of ArgumentParser (or both).
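For example, a top-level docstring can double as the ArgumentParser description. A generic sketch, not the actual contents of new_dataset.py; the arguments shown are hypothetical:

```python
"""Download an optimization dataset from QCArchive, apply the standard
filters, and write the filtered result collection to a JSON file."""
import argparse


def main() -> None:
    # Reuse the module docstring as the --help description.
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("dataset_name", help="name of the dataset on QCArchive")
    parser.add_argument(
        "-o", "--output", default="dataset.json",
        help="path for the filtered output JSON",
    )
    args = parser.parse_args()
    ...  # download and filter here


if __name__ == "__main__":
    main()
```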

Following discussion in the science team meeting, would it be possible to update the README to make it clear that submit.sh is the recommended way to run the script, and that if anything about your setup differs (e.g. a different computer or account, or micromamba instead of mamba), the script should be modified to suit? It may seem obvious, but it could stop a bit of future confusion :-)

Otherwise LGTM!

(FYI for the future: we discussed a couple of provenance/versioning issues in the science team meeting, but punted them for more discussion at the next QCSubmit meeting, where they overlap somewhat with the existing agenda.)

Review threads (resolved): datasets/new_dataset.py (2), datasets/README.md, datasets/submit.sh
@amcisaac left a comment

I also think it'd be helpful to have more documentation; I generally agree with Lily's comments but left some of my own as well. In general, though, it looks good!

Review threads (resolved): datasets/README.md, datasets/new_dataset.py (2), datasets/submit.sh (2)
@ntBre (Collaborator, Author) commented Nov 5, 2024

Okay, I think I've handled all of the suggestions. I'm assuming that the dataset itself and the functionality of the Python script looked good, since there were no comments on those.

I just realized that my actual dataset run looks a bit odd now, since it uses the Config class I've since deleted, but I think it's best to leave it as-is rather than try to make it look like it was run with the updated script or, even worse, waste the compute to rerun it.

@amcisaac commented Nov 5, 2024

Looks good to me; the added docs make it a lot easier to understand as an outside user!

Yes, I think the functionality looks good. We may want to add more filters in the future and could pass them in as options or something like that, but for now it looks great!

If you're concerned about the provenance of the current dataset, you could consider leaving the old input file and Config class as one option for running the script, but not documenting it in the README so that it doesn't get used in the future. Alternatively, you could keep both as two documented ways to run the same script, so that you can add more options later as you mentioned. If you don't want to do either, you could also just note somewhere in the README that the current dataset was filtered with a slightly different script; the history is here on GitHub, so it's still findable. Not sure if @lilyminium has an opinion; I'm happy with any of those options.

@lilyminium left a comment

Overall LGTM -- thanks for adding the docs @ntBre, they're a huge improvement and very clear!

On the dataset itself: yeah, I think @amcisaac had good suggestions. Personally I'd lean towards keeping a copy of the old script and config in the directory and marking it as a legacy script (or leaving it in a separate directory called legacy). Instead of mentioning it in the top-level README, you could also add a directory-specific README noting that this is the old version of the script that was actually used to run the dataset. That saves future viewers a bit of digging around GitHub history, which is helpful, without the confusion of two very similar ways to run the same script.

I'll approve to unblock merging.

@ntBre (Collaborator, Author) commented Nov 6, 2024

Awesome, thank you both! I'll add a copy of the original script and a README to the industry dataset directory and then merge this!

@ntBre ntBre merged commit e2d4c9e into master Nov 7, 2024
@ntBre ntBre deleted the dataset-provenance branch November 7, 2024 18:28