Not all records being found for Hessian Dataset when converting from Optimization Dataset #297

amcisaac · 2024-10-02T19:28:37Z

I have an optimization dataset, and a corresponding single point dataset with Hessians for all the final molecules. Typically we fetch the hessian records using dataset.to_basic_result_collection(driver="hessian"), but this is not fetching all the Hessian records, even though they should exist.

Reproducing example:

from qcportal import PortalClient
from openff.qcsubmit.results import OptimizationResultCollection
from openff.qcsubmit.results import BasicResultCollection

client = PortalClient(address="https://api.qcarchive.molssi.org:443/")

opt_set = OptimizationResultCollection.from_server(client = client, datasets = ['OpenFF Sulfur Optimization Training Coverage Supplement v1.0'],spec_name='default')

opt_to_hess_set = opt_set.to_basic_result_collection(driver="hessian")

print(opt_to_hess_set.n_results) # 260

print(opt_set.n_results) # 899

hess_set = BasicResultCollection.from_server(client=client, datasets=['OpenFF Sulfur Hessian Training Coverage Supplement v1.0'],spec_name='default')

print(hess_set.n_results) # 899

# Check whether all the Basic Results actually have Hessians
hess_rec = hess_set.to_records()

n_hess = 0
for record, molecule in hess_rec:
    if len(record.return_result)>1:
        n_hess += 1
    

print(n_hess) # 899

n_hess = 0
for record, molecule in hess_rec:
    if record.specification.driver == 'hessian':
        n_hess += 1
    

print(n_hess) # 899

Expected behavior:

test_opt = OptimizationResultCollection.from_server(client=client,datasets = ['OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy'],spec_name='default')

test_opt_to_hess = test_opt.to_basic_result_collection(driver="hessian")

print(test_opt.n_results) # 197

test_opt_to_hess.n_results # 197

The text was updated successfully, but these errors were encountered:

amcisaac · 2024-10-02T19:47:10Z

cc @ntBre

ntBre · 2024-10-02T20:05:19Z

I can reproduce this too. I think this is likely a problem with the dataset itself or its representation on QCArchive rather than a bug in qcsubmit. Extracting some of the code from OptimizationResultCollection.to_basic_result_collection shows that the server only has 260 Hessian records for the final molecule IDs from the optimization dataset:

from openff.qcsubmit.results import (
    BasicResultCollection,
    OptimizationResultCollection,
)
from openff.qcsubmit.utils import _CachedPortalClient, portal_client_manager

client = _CachedPortalClient("https://api.qcarchive.molssi.org:443", ".")
opt_name = "OpenFF Sulfur Optimization Training Coverage Supplement v1.0"
hess_name = "OpenFF Sulfur Hessian Training Coverage Supplement v1.0"

with portal_client_manager(lambda _: client):
    opt = OptimizationResultCollection.from_server(client, [opt_name])

    mol_ids = [rec.final_molecule_id for rec, _mol in opt.to_records()]
    print(len(mol_ids)) # => 899
    sp_records = list(client.query_singlepoints(
        molecule_id=mol_ids, driver="hessian"
    ))
    print(len(sp_records)) # => 260

The big open question to me is whether or not using OptimizationResultCollection.create_basic_dataset (which I didn't know about before today) would have fixed this issue from the initial submission, where I effectively did a manual implementation of that method.

As far as I can tell, create_basic_dataset only propagates the extras and specifications.keywords fields from the underlying optimization record, so I don't think it would have helped. In fact, diffing the dataset in qca-dataset-submission with the dataset created with this code:

    bds = opt.create_basic_dataset("tmp", "descdesc", "tagtagtag", driver="hessian")
    bds.export_dataset("/tmp/bds.json")

shows exactly that: additional keywords are present but otherwise the molecule entries are unchanged. If the different keywords caused the mismatch, I would expect none of the records here to match up instead of some fraction of them. So I think the root cause is QCArchive failing to identify these molecules as corresponding to the same parent records.

amcisaac · 2024-10-02T23:34:02Z

I've gone through and checked that each optimization record has a corresponding hessian record with matching SMILES string and RMSD (all-atom, no symmetry check) < 0.00001 A. So it shouldn't be an issue with incorrect molecules.

I did notice that not all of the records are in the same order between the two datasets, but only 79 are out of order and they are randomly spread throughout the dataset so I don't think it's like, processing the first 260 that are in order and then having issues due to an order change.

amcisaac · 2024-10-03T00:07:32Z

Update: at @lilyminium 's suggestion I looked at the differences in the absolute coordinates, it looks like requiring a 1e-9 cutoff for the difference between coordinates in the two datasets leads to 639 molecules coming up as "not the same" e.g.:

np.any(np.abs(hess_geom - opt_geom).magnitude > 0.000000001)

I don't know what QCA uses as their criterion, but I'd guess this is probably related to QCA's criterion for considering molecules "the same"

ntBre · 2024-10-03T18:51:22Z

Assuming QCA uses the qcelemental.models.Molecule class to represent molecules, their __eq__ implementation compares the hashes of the molecules, which includes a hashed version of the geometry rounded to 8 decimal places. At first I thought that didn't align with the 1e-9 number you found, but now I think I've convinced myself that rounding to 8 places is the same as comparing 9 places, at least when one of them rounds up. That could also explain why only some records are affected. The coordinates are likely close in the 9th decimal place, but sometimes rounding will cause them to differ in the 8th place. So really it would be better if QCA used your formula above instead of this hashing approach.

a = np.array([4.5e-9])
b = np.array([5.1e-9])
a - b  # => array([-6.e-10])
np.any(np.abs(a - b) > 1e-9) # => np.False_ indicating equality
np.around(a, 8) == np.around(b, 8) # => array([False]), indicating inequality

This is a slightly confusing example since both of the last two lines print false, but the first test is false for the difference being greater than 1e-9, making the arrays equal, while the second test is the one done by QCElemental showing two molecules differing by 6e-10 to be different (not equal).

amcisaac · 2024-10-04T01:34:16Z

Yeah that makes a lot of sense, if one rounds up and one rounds down, two numbers that differ in the 9th decimal place wouldn't pass that check (as you demonstrated). When I printed out the coordinate differences, most were on the order of 1e-9, so that makes sense. I think this is definitely an issue with QCA

…e molecule IDs (#303) * add failing test case based on #297 * fix test for #297 but break other create_basic_dataset test * mock get_molecules for test_optimization_create_basic_dataset * update doc string for new test * minimize example dataset from #297 * combine create_basic_dataset tests and delete unused get_molecules * fix typo * update release notes

ntBre added a commit that referenced this issue Oct 15, 2024

add failing test case based on #297

556853f

ntBre mentioned this issue Oct 15, 2024

Update OptimizationResultCollection.create_basic_dataset to preserve molecule IDs #303

Merged

2 tasks

ntBre added a commit that referenced this issue Oct 16, 2024

fix test for #297 but break other create_basic_dataset test

42a711a

ntBre added a commit that referenced this issue Oct 17, 2024

minimize example dataset from #297

bd3bf3b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Not all records being found for Hessian Dataset when converting from Optimization Dataset #297

Not all records being found for Hessian Dataset when converting from Optimization Dataset #297

amcisaac commented Oct 2, 2024 •

edited

Loading

amcisaac commented Oct 2, 2024

ntBre commented Oct 2, 2024

amcisaac commented Oct 2, 2024

amcisaac commented Oct 3, 2024 •

edited

Loading

ntBre commented Oct 3, 2024

amcisaac commented Oct 4, 2024

Not all records being found for Hessian Dataset when converting from Optimization Dataset #297

Not all records being found for Hessian Dataset when converting from Optimization Dataset #297

Comments

amcisaac commented Oct 2, 2024 • edited Loading

amcisaac commented Oct 2, 2024

ntBre commented Oct 2, 2024

amcisaac commented Oct 2, 2024

amcisaac commented Oct 3, 2024 • edited Loading

ntBre commented Oct 3, 2024

amcisaac commented Oct 4, 2024

amcisaac commented Oct 2, 2024 •

edited

Loading

amcisaac commented Oct 3, 2024 •

edited

Loading