Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Not all records being found for Hessian Dataset when converting from Optimization Dataset #297

Open
amcisaac opened this issue Oct 2, 2024 · 6 comments

Comments

@amcisaac
Copy link

amcisaac commented Oct 2, 2024

I have an optimization dataset, and a corresponding single point dataset with Hessians for all the final molecules. Typically we fetch the hessian records using dataset.to_basic_result_collection(driver="hessian"), but this is not fetching all the Hessian records, even though they should exist.

Reproducing example:

from qcportal import PortalClient
from openff.qcsubmit.results import OptimizationResultCollection
from openff.qcsubmit.results import BasicResultCollection

client = PortalClient(address="https://api.qcarchive.molssi.org:443/")

opt_set = OptimizationResultCollection.from_server(client = client, datasets = ['OpenFF Sulfur Optimization Training Coverage Supplement v1.0'],spec_name='default')

opt_to_hess_set = opt_set.to_basic_result_collection(driver="hessian")

print(opt_to_hess_set.n_results) # 260

print(opt_set.n_results) # 899

hess_set = BasicResultCollection.from_server(client=client, datasets=['OpenFF Sulfur Hessian Training Coverage Supplement v1.0'],spec_name='default')

print(hess_set.n_results) # 899

# Check whether all the Basic Results actually have Hessians
hess_rec = hess_set.to_records()

n_hess = 0
for record, molecule in hess_rec:
    if len(record.return_result)>1:
        n_hess += 1
    

print(n_hess) # 899

n_hess = 0
for record, molecule in hess_rec:
    if record.specification.driver == 'hessian':
        n_hess += 1
    

print(n_hess) # 899

Expected behavior:

test_opt = OptimizationResultCollection.from_server(client=client,datasets = ['OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy'],spec_name='default')

test_opt_to_hess = test_opt.to_basic_result_collection(driver="hessian")

print(test_opt.n_results) # 197

test_opt_to_hess.n_results # 197
@amcisaac
Copy link
Author

amcisaac commented Oct 2, 2024

cc @ntBre

@ntBre
Copy link
Contributor

ntBre commented Oct 2, 2024

I can reproduce this too. I think this is likely a problem with the dataset itself or its representation on QCArchive rather than a bug in qcsubmit. Extracting some of the code from OptimizationResultCollection.to_basic_result_collection shows that the server only has 260 Hessian records for the final molecule IDs from the optimization dataset:

from openff.qcsubmit.results import (
    BasicResultCollection,
    OptimizationResultCollection,
)
from openff.qcsubmit.utils import _CachedPortalClient, portal_client_manager

client = _CachedPortalClient("https://api.qcarchive.molssi.org:443", ".")
opt_name = "OpenFF Sulfur Optimization Training Coverage Supplement v1.0"
hess_name = "OpenFF Sulfur Hessian Training Coverage Supplement v1.0"

with portal_client_manager(lambda _: client):
    opt = OptimizationResultCollection.from_server(client, [opt_name])

    mol_ids = [rec.final_molecule_id for rec, _mol in opt.to_records()]
    print(len(mol_ids)) # => 899
    sp_records = list(client.query_singlepoints(
        molecule_id=mol_ids, driver="hessian"
    ))
    print(len(sp_records)) # => 260

The big open question to me is whether or not using OptimizationResultCollection.create_basic_dataset (which I didn't know about before today) would have fixed this issue from the initial submission, where I effectively did a manual implementation of that method.

As far as I can tell, create_basic_dataset only propagates the extras and specifications.keywords fields from the underlying optimization record, so I don't think it would have helped. In fact, diffing the dataset in qca-dataset-submission with the dataset created with this code:

    bds = opt.create_basic_dataset("tmp", "descdesc", "tagtagtag", driver="hessian")
    bds.export_dataset("/tmp/bds.json")

shows exactly that: additional keywords are present but otherwise the molecule entries are unchanged. If the different keywords caused the mismatch, I would expect none of the records here to match up instead of some fraction of them. So I think the root cause is QCArchive failing to identify these molecules as corresponding to the same parent records.

@amcisaac
Copy link
Author

amcisaac commented Oct 2, 2024

I've gone through and checked that each optimization record has a corresponding hessian record with matching SMILES string and RMSD (all-atom, no symmetry check) < 0.00001 A. So it shouldn't be an issue with incorrect molecules.

I did notice that not all of the records are in the same order between the two datasets, but only 79 are out of order and they are randomly spread throughout the dataset so I don't think it's like, processing the first 260 that are in order and then having issues due to an order change.

@amcisaac
Copy link
Author

amcisaac commented Oct 3, 2024

Update: at @lilyminium 's suggestion I looked at the differences in the absolute coordinates, it looks like requiring a 1e-9 cutoff for the difference between coordinates in the two datasets leads to 639 molecules coming up as "not the same" e.g.:

np.any(np.abs(hess_geom - opt_geom).magnitude > 0.000000001)

I don't know what QCA uses as their criterion, but I'd guess this is probably related to QCA's criterion for considering molecules "the same"

@ntBre
Copy link
Contributor

ntBre commented Oct 3, 2024

Assuming QCA uses the qcelemental.models.Molecule class to represent molecules, their __eq__ implementation compares the hashes of the molecules, which includes a hashed version of the geometry rounded to 8 decimal places. At first I thought that didn't align with the 1e-9 number you found, but now I think I've convinced myself that rounding to 8 places is the same as comparing 9 places, at least when one of them rounds up. That could also explain why only some records are affected. The coordinates are likely close in the 9th decimal place, but sometimes rounding will cause them to differ in the 8th place. So really it would be better if QCA used your formula above instead of this hashing approach.

a = np.array([4.5e-9])
b = np.array([5.1e-9])
a - b  # => array([-6.e-10])
np.any(np.abs(a - b) > 1e-9) # => np.False_ indicating equality
np.around(a, 8) == np.around(b, 8) # => array([False]), indicating inequality

This is a slightly confusing example since both of the last two lines print false, but the first test is false for the difference being greater than 1e-9, making the arrays equal, while the second test is the one done by QCElemental showing two molecules differing by 6e-10 to be different (not equal).

@amcisaac
Copy link
Author

amcisaac commented Oct 4, 2024

Yeah that makes a lot of sense, if one rounds up and one rounds down, two numbers that differ in the 9th decimal place wouldn't pass that check (as you demonstrated). When I printed out the coordinate differences, most were on the order of 1e-9, so that makes sense. I think this is definitely an issue with QCA

ntBre added a commit that referenced this issue Oct 15, 2024
ntBre added a commit that referenced this issue Oct 17, 2024
ntBre added a commit that referenced this issue Nov 6, 2024
…e molecule IDs (#303)

* add failing test case based on #297

* fix test for #297 but break other create_basic_dataset test

* mock get_molecules for test_optimization_create_basic_dataset

* update doc string for new test

* minimize example dataset from #297

* combine create_basic_dataset tests and delete unused get_molecules

* fix typo

* update release notes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants