
Update dataset directory structure and download procedure #7

Merged · 24 commits · Nov 7, 2024
2 changes: 1 addition & 1 deletion .gitattributes
@@ -1 +1 @@
datasets/cache/*.json filter=lfs diff=lfs merge=lfs -text
datasets/*/*.json filter=lfs diff=lfs merge=lfs -text
3 changes: 2 additions & 1 deletion .gitignore
@@ -1 +1,2 @@
oe_license.txt
oe_license.txt
/__pycache__/
20 changes: 0 additions & 20 deletions datasets/Makefile

This file was deleted.

1,274 changes: 1,274 additions & 0 deletions datasets/OpenFF-Industry-Benchmark-Season-1-v1.1/2024-10-31.2183998.out

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions datasets/OpenFF-Industry-Benchmark-Season-1-v1.1/cache.json
Git LFS file not shown
Git LFS file not shown
@@ -0,0 +1,2 @@
ds_name: "OpenFF Industry Benchmark Season 1 v1.1"
chunksize: 32
3 changes: 3 additions & 0 deletions datasets/OpenFF-Industry-Benchmark-Season-1-v1.1/raw.json
Git LFS file not shown
145 changes: 32 additions & 113 deletions datasets/README.md
@@ -1,113 +1,32 @@
This directory contains datasets in the JSON-serialized
`OptimizationResultCollection` format from [qcsubmit][qcsubmit], as well as
scripts for retrieving and post-processing them.

| Type | File | Description |
|-----------|------------------------|--------------------------------------------------------------------------------|
| Script | download.py | Download a named dataset from [qcarchive][qcarchive] |
| | filter.py | Filter out problematic records from a dataset |
| | cache_dataset.py | Convert a `qcsubmit.ResultCollection` into a cached version[^1] |
| | submit.sh | General Slurm script for running Make commands |
| | Makefile | Makefile showing how each file is produced |
| Dataset | industry.json | OpenFF Industry Benchmark Season 1 v1.1 |
| | tm-supp0.json | OpenFF Torsion Benchmark Supplement v1.0 |
| | tm-supp.json | OpenFF Torsion Multiplicity Optimization Benchmarking Coverage Supplement v1.0 |
| | filtered-tm-supp0.json | Filtered version of tm-supp0.json |
| | filtered-tm-supp.json | Filtered version of tm-supp.json |
| | filtered-industry.json | Filtered version of industry.json |
| Directory | cache | Contains cached versions of the datasets |

## Adding a dataset
The summary of steps for adding a new dataset is below, with more detailed
instructions in the following sections.

1. Add a rule to download it to the `Makefile`
2. Run a command like `./submit.sh make cache/filtered-your-dataset.json NPROCS=16
CHUNKSIZE=32` on HPC3
3. Update the table in the README

### 1. Add a rule to the Makefile
The easiest way to do this is to copy an existing rule. For example, the
existing rule to create `industry.json` is:

``` make
industry.json: download.py
python download.py "OpenFF Industry Benchmark Season 1 v1.1" -o $@ -p
```

This says that `industry.json` depends on the `download.py` script (so it will
be remade if `download.py` changes), and that producing `industry.json` requires
running `download.py` with the dataset's name, an output path, and the
`-p/--pretty-print` flag. `$@` is a built-in Make variable set to the "target",
i.e. the thing on the left of the colon in the rule definition. After copying
this definition, replacing `industry.json` with the desired output filename and
`"OpenFF Industry Benchmark Season 1 v1.1"` with the name of your dataset, you
should be ready for step 2.

### 2. Run submit.sh
`submit.sh` is a shell script that generates a Slurm script to run on HPC3. As
its `-h` flag shows, it takes several options, summarized in the table below:

| Flag | Description | Default |
|------|--------------------------------------------------|---------|
| -h | Print usage information and exit | False |
| -d | Dry run, print Slurm input instead of submitting | False |
| -t | Set the requested number of CPU hours | 72 |
| -m | Set the requested amount of RAM, in GB | 32 |
| -n | Set the requested number of CPUs per task | 8 |

After these optional arguments, `submit.sh` takes any number of commands, which
are passed directly into the generated Slurm script. For the example invocation
above (`./submit.sh make cache/filtered-your-dataset.json NPROCS=16 CHUNKSIZE=32`), the
generated Slurm script will look like:

``` text
#!/bin/bash
#SBATCH -J filter-dataset
#SBATCH -p standard
#SBATCH -t 72:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32gb
#SBATCH --account dmobley_lab
#SBATCH --export ALL
#SBATCH --constraint=fastscratch
#SBATCH --output=logs/2024-08-14.2719087.out

date
hostname
echo $SLURM_JOB_ID

source ~/.bashrc
mamba activate yammbs-dataset-submission

echo $OE_LICENSE

make cache/filtered-your-dataset.json NPROCS=16 CHUNKSIZE=32

date
```
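As an illustration, the same make command could be submitted with non-default
resources like this (the specific values are only examples):

``` sh
# Request 24 CPU hours, 64 GB of RAM, and 16 CPUs instead of the defaults;
# -d would print the generated Slurm input instead of submitting it.
./submit.sh -t 24 -m 64 -n 16 make cache/filtered-your-dataset.json NPROCS=16 CHUNKSIZE=32
```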

The Makefile defines "pattern rules" for converting any `*.json` file into its
filtered version `filtered-*.json` (written here with the `*` shell wildcard
rather than the `%` Make pattern wildcard). It also defines a rule for creating
any `cache/*.json` from the corresponding `*.json`, so after defining only a
rule to make `your-dataset.json`, you can ask Make to build
`cache/filtered-your-dataset.json` and it will generate the original
`your-dataset.json` as well as the filtered and cached versions.
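
In other words, a single invocation like the one below (assuming a hypothetical
`your-dataset.json` rule exists) should be enough to produce all three files:

``` sh
# Make follows the chain your-dataset.json -> filtered-your-dataset.json
# -> cache/filtered-your-dataset.json, building whichever files are missing.
make cache/filtered-your-dataset.json NPROCS=16 CHUNKSIZE=32
```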

### 3. Update README
So far I have been adding both the plain dataset and its filtered version to
the README, but this is a bit redundant, especially since all of the datasets
have the same filters applied.

<!-- Refs -->
[qcsubmit]: https://github.com/openforcefield/openff-qcsubmit
[yammbs]: https://github.com/openforcefield/yammbs
[qcarchive]: https://qcarchive.molssi.org/

[^1]: The "caching" here calls `OptimizationResultCollection.to_records`, which
contacts QCArchive to retrieve the full dataset and extracts only the fields
needed by [yammbs][yammbs]. `to_records` can be quite expensive (and
network-dependent), so this saves a lot of time in repeated `yammbs` runs.
## Adding a new dataset
The general steps for adding a new dataset are:
1. Run `download_and_filter_dataset.py` (see the example invocation after this
   list), passing as arguments:
   * The dataset name on QCArchive
   * (optional) The number of CPUs to use in [multiprocessing.Pool][pool];
     defaults to 1
   * (optional) The chunk size for [multiprocessing.Pool.imap][imap]; defaults
     to 1
2. Move your input file and any log files (if run as a batch job, for example)
into the created dataset directory
3. Commit the results to the repo
4. Open a PR for review before merging
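
A hypothetical invocation of step 1 is sketched below. The positional dataset
name follows the description above, but the `--nprocs` and `--chunksize` flag
names are assumptions, so check the script's `--help` output for the actual
interface.

``` sh
# Sketch only: the dataset name is as it appears on QCArchive; --nprocs and
# --chunksize are assumed flag names for the optional CPU count and imap
# chunk size (both default to 1 if omitted).
python download_and_filter_dataset.py "OpenFF Industry Benchmark Season 1 v1.1" \
    --nprocs 16 --chunksize 32
```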

### Submission script
`submit.sh` is an example Slurm submission script for running
`download_and_filter_dataset.py` on UCI's HPC3. It may need to be modified to
work on other clusters, but please do not include such changes as part of a
dataset submission. Basic usage is just `./submit.sh "Name of QCA dataset"`, but
it also supports a few flags to control the time requested (`-t`, in hours), the
memory requested (`-m`, in GB), the number of CPUs (`-n`), and the [imap][imap]
chunk size (`-c`) described above. These flags must come before the dataset name
on the command line. There is also a "dry run" flag (`-d`) that prints the
generated `sbatch` input instead of submitting it immediately.
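
For example, a full invocation with every flag set might look like this (the
resource values are only illustrative):

``` sh
# 24 hours, 64 GB of RAM, 16 CPUs, and an imap chunk size of 32; add -d to
# print the generated sbatch input instead of submitting it.
./submit.sh -t 24 -m 64 -n 16 -c 32 "OpenFF Industry Benchmark Season 1 v1.1"
```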

#### Conda environment
The example submission script activates an environment called
`yammbs-dataset-submission`, so you'll need one with that name available. You
can create such an environment from the provided [env.yaml
file](../devtools/env.yaml).
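
A minimal sketch of setting that up from the repository root (assuming `mamba`;
`conda` works the same way) might be:

``` sh
# Create and activate the environment defined in devtools/env.yaml
mamba env create -f devtools/env.yaml
mamba activate yammbs-dataset-submission
```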

[pool]: https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool
[imap]: https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.imap
3 changes: 0 additions & 3 deletions datasets/cache/filtered-industry.json

This file was deleted.

3 changes: 0 additions & 3 deletions datasets/cache/filtered-tm-supp.json

This file was deleted.

21 changes: 0 additions & 21 deletions datasets/cache_dataset.py

This file was deleted.

21 changes: 0 additions & 21 deletions datasets/download.py

This file was deleted.
