-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Not all records being found for Hessian Dataset when converting from Optimization Dataset #297
Comments
cc @ntBre |
I can reproduce this too. I think this is likely a problem with the dataset itself or its representation on QCArchive rather than a bug in qcsubmit. Extracting some of the code from OptimizationResultCollection.to_basic_result_collection shows that the server only has 260 Hessian records for the final molecule IDs from the optimization dataset: from openff.qcsubmit.results import (
BasicResultCollection,
OptimizationResultCollection,
)
from openff.qcsubmit.utils import _CachedPortalClient, portal_client_manager
client = _CachedPortalClient("https://api.qcarchive.molssi.org:443", ".")
opt_name = "OpenFF Sulfur Optimization Training Coverage Supplement v1.0"
hess_name = "OpenFF Sulfur Hessian Training Coverage Supplement v1.0"
with portal_client_manager(lambda _: client):
opt = OptimizationResultCollection.from_server(client, [opt_name])
mol_ids = [rec.final_molecule_id for rec, _mol in opt.to_records()]
print(len(mol_ids)) # => 899
sp_records = list(client.query_singlepoints(
molecule_id=mol_ids, driver="hessian"
))
print(len(sp_records)) # => 260 The big open question to me is whether or not using OptimizationResultCollection.create_basic_dataset (which I didn't know about before today) would have fixed this issue from the initial submission, where I effectively did a manual implementation of that method. As far as I can tell, bds = opt.create_basic_dataset("tmp", "descdesc", "tagtagtag", driver="hessian")
bds.export_dataset("/tmp/bds.json") shows exactly that: additional keywords are present but otherwise the molecule entries are unchanged. If the different keywords caused the mismatch, I would expect none of the records here to match up instead of some fraction of them. So I think the root cause is QCArchive failing to identify these molecules as corresponding to the same parent records. |
I've gone through and checked that each optimization record has a corresponding hessian record with matching SMILES string and RMSD (all-atom, no symmetry check) < 0.00001 A. So it shouldn't be an issue with incorrect molecules. I did notice that not all of the records are in the same order between the two datasets, but only 79 are out of order and they are randomly spread throughout the dataset so I don't think it's like, processing the first 260 that are in order and then having issues due to an order change. |
Update: at @lilyminium 's suggestion I looked at the differences in the absolute coordinates, it looks like requiring a 1e-9 cutoff for the difference between coordinates in the two datasets leads to 639 molecules coming up as "not the same" e.g.:
I don't know what QCA uses as their criterion, but I'd guess this is probably related to QCA's criterion for considering molecules "the same" |
Assuming QCA uses the a = np.array([4.5e-9])
b = np.array([5.1e-9])
a - b # => array([-6.e-10])
np.any(np.abs(a - b) > 1e-9) # => np.False_ indicating equality
np.around(a, 8) == np.around(b, 8) # => array([False]), indicating inequality This is a slightly confusing example since both of the last two lines print false, but the first test is false for the difference being greater than 1e-9, making the arrays equal, while the second test is the one done by QCElemental showing two molecules differing by 6e-10 to be different (not equal). |
Yeah that makes a lot of sense, if one rounds up and one rounds down, two numbers that differ in the 9th decimal place wouldn't pass that check (as you demonstrated). When I printed out the coordinate differences, most were on the order of 1e-9, so that makes sense. I think this is definitely an issue with QCA |
…e molecule IDs (#303) * add failing test case based on #297 * fix test for #297 but break other create_basic_dataset test * mock get_molecules for test_optimization_create_basic_dataset * update doc string for new test * minimize example dataset from #297 * combine create_basic_dataset tests and delete unused get_molecules * fix typo * update release notes
I have an optimization dataset, and a corresponding single point dataset with Hessians for all the final molecules. Typically we fetch the hessian records using
dataset.to_basic_result_collection(driver="hessian")
, but this is not fetching all the Hessian records, even though they should exist.Reproducing example:
Expected behavior:
The text was updated successfully, but these errors were encountered: