
Update dataset directory structure and download procedure #7

Merged · 24 commits · Nov 7, 2024
2 changes: 1 addition & 1 deletion .gitattributes
@@ -1 +1 @@
datasets/cache/*.json filter=lfs diff=lfs merge=lfs -text
datasets/*/*.json filter=lfs diff=lfs merge=lfs -text
3 changes: 2 additions & 1 deletion .gitignore
@@ -1 +1,2 @@
oe_license.txt
oe_license.txt
/__pycache__/
20 changes: 0 additions & 20 deletions datasets/Makefile

This file was deleted.

1,274 changes: 1,274 additions & 0 deletions datasets/OpenFF-Industry-Benchmark-Season-1-v1.1/2024-10-31.2183998.out

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions datasets/OpenFF-Industry-Benchmark-Season-1-v1.1/cache.json
Git LFS file not shown
Git LFS file not shown
@@ -0,0 +1,2 @@
ds_name: "OpenFF Industry Benchmark Season 1 v1.1"
chunksize: 32
3 changes: 3 additions & 0 deletions datasets/OpenFF-Industry-Benchmark-Season-1-v1.1/raw.json
Git LFS file not shown
145 changes: 32 additions & 113 deletions datasets/README.md
@@ -1,113 +1,32 @@
This directory contains datasets in the JSON-serialized
`OptimizationResultCollection` format from [qcsubmit][qcsubmit], as well as
scripts for retrieving and post-processing them.

| Type | File | Description |
|-----------|------------------------|--------------------------------------------------------------------------------|
| Script | download.py | Download a named dataset from [qcarchive][qcarchive] |
| | filter.py | Filter out problematic records from a dataset |
| | cache_dataset.py | Convert a `qcsubmit.ResultCollection` into a cached version[^1] |
| | submit.sh | General Slurm script for running Make commands |
| | Makefile | Makefile showing how each file is produced |
| Dataset | industry.json | OpenFF Industry Benchmark Season 1 v1.1 |
| | tm-supp0.json | OpenFF Torsion Benchmark Supplement v1.0 |
| | tm-supp.json | OpenFF Torsion Multiplicity Optimization Benchmarking Coverage Supplement v1.0 |
| | filtered-tm-supp0.json | Filtered version of tm-supp0.json |
| | filtered-tm-supp.json | Filtered version of tm-supp.json |
| | filtered-industry.json | Filtered version of industry.json |
| Directory | cache | Contains cached versions of the datasets |

## Adding a dataset
The summary of steps for adding a new dataset is below, with more detailed
instructions in the following sections.

1. Add a rule to download it to the `Makefile`
2. Run a command like `./submit.sh make cache/filtered-your-dataset.json NPROCS=16
CHUNKSIZE=32` on HPC3
3. Update the table in the README

### 1. Add a rule to the Makefile
The easiest way to do this is to copy an existing rule. For example, the
existing rule to create `industry.json` is:

``` make
industry.json: download.py
python download.py "OpenFF Industry Benchmark Season 1 v1.1" -o $@ -p
```

This says that `industry.json` depends on the `download.py` script (so it will
be remade if `download.py` changes), and that producing `industry.json` requires
running `download.py` with the dataset's name, an output path, and the
`-p/--pretty-print` flag. `$@` is a built-in Make variable set to the "target",
i.e. the thing on the left of the colon in the rule definition. After copying
this definition, replacing `industry.json` with the desired output filename and
`"OpenFF Industry Benchmark Season 1 v1.1"` with the name of your dataset, you
should be ready for step 2.

### 2. Run submit.sh
`submit.sh` is a shell script that generates a Slurm script to run on HPC3. As
its `-h` flag shows, it takes several options, summarized in the table below:

| Flag | Description | Default |
|------|--------------------------------------------------|---------|
| -h | Print usage information and exit | False |
| -d | Dry run, print Slurm input instead of submitting | False |
| -t | Set the requested number of CPU hours | 72 |
| -m | Set the requested amount of RAM, in GB | 32 |
| -n | Set the requested number of CPUs per task | 8 |

After these optional arguments, `submit.sh` takes any number of commands, which
are passed directly into the generated Slurm script. For the example invocation
above (`./submit.sh make cache/filtered-your-dataset.json NPROCS=16 CHUNKSIZE=32`), the
generated Slurm script will look like:

``` text
#!/bin/bash
#SBATCH -J filter-dataset
#SBATCH -p standard
#SBATCH -t 72:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32gb
#SBATCH --account dmobley_lab
#SBATCH --export ALL
#SBATCH --constraint=fastscratch
#SBATCH --output=logs/2024-08-14.2719087.out

date
hostname
echo $SLURM_JOB_ID

source ~/.bashrc
mamba activate yammbs-dataset-submission

echo $OE_LICENSE

make cache/filtered-your-dataset.json NPROCS=16 CHUNKSIZE=32

date
```
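As an illustration, the same make command could be submitted with non-default
resources like this (the specific values are only examples):

``` sh
# Request 24 CPU hours, 64 GB of RAM, and 16 CPUs instead of the defaults;
# -d would print the generated Slurm input instead of submitting it.
./submit.sh -t 24 -m 64 -n 16 make cache/filtered-your-dataset.json NPROCS=16 CHUNKSIZE=32
```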

The Makefile defines "pattern rules" for converting any `*.json` file into its
filtered version `filtered-*.json` (written here with the `*` shell wildcard
rather than the `%` Make pattern wildcard). It also defines a rule for creating
any `cache/*.json` from the corresponding `*.json`, so after defining only a
rule to make `your-dataset.json`, you can ask Make to build
`cache/filtered-your-dataset.json` and it will generate the original
`your-dataset.json` as well as the filtered and cached versions.
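
In other words, a single invocation like the one below (assuming a hypothetical
`your-dataset.json` rule exists) should be enough to produce all three files:

``` sh
# Make follows the chain your-dataset.json -> filtered-your-dataset.json
# -> cache/filtered-your-dataset.json, building whichever files are missing.
make cache/filtered-your-dataset.json NPROCS=16 CHUNKSIZE=32
```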

### 3. Update README
So far I have been adding both the plain dataset and its filtered version to
the README, but this is a bit redundant, especially since all of the datasets
have the same filters applied.

<!-- Refs -->
[qcsubmit]: https://github.com/openforcefield/openff-qcsubmit
[yammbs]: https://github.com/openforcefield/yammbs
[qcarchive]: https://qcarchive.molssi.org/

[^1]: The "caching" here calls `OptimizationResultCollection.to_records`, which
contacts QCArchive to retrieve the full dataset and extracts only the fields
needed by [yammbs][yammbs]. `to_records` can be quite expensive (and
network-dependent), so this saves a lot of time in repeated `yammbs` runs.
## Adding a new dataset
The general steps for adding a new dataset are:
1. Run `download_and_filter_dataset.py` (see the example invocation after this
   list), passing as arguments:
   * The dataset name on QCArchive
   * (optional) The number of CPUs to use in [multiprocessing.Pool][pool];
     defaults to 1
   * (optional) The chunk size for [multiprocessing.Pool.imap][imap]; defaults
     to 1
2. Move your input file and any log files (if run as a batch job, for example)
into the created dataset directory
3. Commit the results to the repo
4. Open a PR for review before merging
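
A hypothetical invocation of step 1 is sketched below. The positional dataset
name follows the description above, but the `--nprocs` and `--chunksize` flag
names are assumptions, so check the script's `--help` output for the actual
interface.

``` sh
# Sketch only: the dataset name is as it appears on QCArchive; --nprocs and
# --chunksize are assumed flag names for the optional CPU count and imap
# chunk size (both default to 1 if omitted).
python download_and_filter_dataset.py "OpenFF Industry Benchmark Season 1 v1.1" \
    --nprocs 16 --chunksize 32
```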

### Submission script
`submit.sh` is an example Slurm submission script for running
`download_and_filter_dataset.py` on UCI's HPC3. It may need to be modified to
work on other clusters, but please do not include such changes as part of a
dataset submission. Basic usage is just `./submit.sh "Name of QCA dataset"`, but
it also supports a few flags to control the time requested (`-t`, in hours), the
memory requested (`-m`, in GB), the number of CPUs (`-n`), and the [imap][imap]
chunk size (`-c`) described above. These flags must come before the dataset name
on the command line. There is also a "dry run" flag (`-d`) that prints the
generated `sbatch` input instead of submitting it immediately.
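
For example, a full invocation with every flag set might look like this (the
resource values are only illustrative):

``` sh
# 24 hours, 64 GB of RAM, 16 CPUs, and an imap chunk size of 32; add -d to
# print the generated sbatch input instead of submitting it.
./submit.sh -t 24 -m 64 -n 16 -c 32 "OpenFF Industry Benchmark Season 1 v1.1"
```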

#### Conda environment
The example submission script activates an environment called
`yammbs-dataset-submission`, so you'll need one with that name available. You
can create such an environment from the provided [env.yaml
file](../devtools/env.yaml).
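
A minimal sketch of setting that up from the repository root (assuming `mamba`;
`conda` works the same way) might be:

``` sh
# Create and activate the environment defined in devtools/env.yaml
mamba env create -f devtools/env.yaml
mamba activate yammbs-dataset-submission
```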

[pool]: https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool
[imap]: https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.imap
3 changes: 0 additions & 3 deletions datasets/cache/filtered-industry.json

This file was deleted.

3 changes: 0 additions & 3 deletions datasets/cache/filtered-tm-supp.json

This file was deleted.

21 changes: 0 additions & 21 deletions datasets/cache_dataset.py

This file was deleted.

21 changes: 0 additions & 21 deletions datasets/download.py

This file was deleted.
