Update dataset directory structure and download procedure #7
Conversation
Overall looks good @ntBre! As mentioned, I mostly have docs requests --
Would it be possible to rename new_dataset.py to something more informative (e.g. download_and_filter_dataset.py)? I really like having verbs in a script name to make it obvious what it does; right now I'm not 100% sure if it's "new" because it downloads a new dataset, or if it's the new version of an old dataset.py script.
In a similar vein, since this is somewhat public-facing, could you please add a description or helpstring of what new_dataset.py does? This could be either a top-level docstring (e.g. https://geo-python.github.io/site/notebooks/L4/writing-scripts.html#Use-a-docstring) or the description argument of ArgumentParser (or both).
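For illustration, a minimal sketch of what that could look like -- the docstring wording and argument names here are hypothetical, not taken from the PR:

```python
"""Download a QCArchive dataset and apply the standard filters.

The module docstring doubles as the script's --help description.
"""
import argparse


def main():
    # Reuse the module docstring as the ArgumentParser description so
    # the two stay in sync.
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("dataset", help="name of the dataset to download")
    args = parser.parse_args()
    print(f"downloading {args.dataset} ...")  # placeholder for the real work


if __name__ == "__main__":
    main()
```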
Following discussion in the science team meeting, would it be possible to update the README to make it clear that submit.sh is the suggested/recommended way to run the script, and that if you change anything (e.g. run on a different computer or account, or run with micromamba) the script should be modified to suit? It may seem obvious, but it may also stop a bit of future confusion :-)
Otherwise LGTM!
(FYI for the future): we discussed a couple of provenance/versioning issues in the science team meeting, but punted on them for more discussion at the next QCSubmit meeting, where they overlap somewhat with the existing agenda.
I also think it'd be helpful to have more documentation; I generally agree with Lily's comments but left some of my own as well. In general though, it looks good!
Okay, I think I've handled all of the suggestions. I'm assuming that the dataset itself and the functionality of the Python script looked good, since there were no comments on that. I have just realized that my actual dataset run looks a bit weird now, since it uses the …
Looks good to me; I think the added docs make it a lot easier to understand as an outside user! Yes, I think the functionality looks good. We may want to add additional filters in the future and could pass them in, but for now it looks great! If you're concerned about the provenance of the current dataset, you could consider leaving the old input file and …
Overall LGTM -- thanks for adding the docs @ntBre, they're a huge improvement and very clear!
On the dataset itself -- yeah, I think @amcisaac had good suggestions. Personally I'd lean towards keeping a copy of the old script and config in the directory and renaming it as a legacy script (or leaving it in a separate directory called legacy). Instead of mentioning it in the upper-level README, you could also note in a directory-specific README that this is the old version of the script that was actually used to run the dataset. That saves future viewers a bit of digging around GitHub history, which is helpful, without the confusion of having two very similar ways to run the same script.
I'll approve to unblock merging.
Awesome, thank you both! I'll add a copy of the original script and a README to the industry dataset directory and then merge this!
This PR updates the directory structure in the datasets directory, simplifies the scripts for adding new datasets, and uses the new scripts on the industry dataset. The main change to the filtering step is the addition of the ConformerRMSDFilter, which reduced the final number of records fairly substantially, from ~76,000 to 64,658.

As the branch name indicates, another goal of this PR was to retain all of the provenance information associated with retrieving datasets. I think the combination of the download script (new_dataset.py), the submission script (submit.sh), and the environment listing (mamba list) should have all of this kind of information we need.
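For readers unfamiliar with the filtering step, here is a minimal sketch of how a ConformerRMSDFilter can be applied with openff-qcsubmit. The dataset name, server address, and max_conformers value are illustrative assumptions, not taken from this PR, and the exact API can vary between openff-qcsubmit/qcportal versions:

```python
from qcportal import PortalClient
from openff.qcsubmit.results import OptimizationResultCollection
from openff.qcsubmit.results.filters import ConformerRMSDFilter

# Pull the optimization records from QCArchive (dataset name is illustrative).
client = PortalClient("https://api.qcarchive.molssi.org:443")
dataset = OptimizationResultCollection.from_server(
    client=client,
    datasets=["OpenFF Industry Benchmark Season 1 v1.1"],
    spec_name="default",
)

# Prune near-duplicate conformers by pairwise RMSD; this is the kind of
# step that shrank the record count described above.
filtered = dataset.filter(ConformerRMSDFilter(max_conformers=10))

# Write the filtered collection (and its embedded provenance) to disk.
with open("filtered-dataset.json", "w") as out:
    out.write(filtered.json(indent=2))
```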