Infer bond orders and formal charges #1828
Comments
This seems like a solid idea. My major concerns are:
Anyway, the biggest thing I'd like to see before approving this for our ecosystem is a benchmark showing that it actually works. So the first step toward either of these outcomes is running the InChI benchmark. Assorted notes:
|
That's a great idea for the test. I'll try writing something along those lines and see how MDAnalysis does. |
The following test goes through all the monomers from the DES370K dataset (just under 400 of them). For each one it performs the series of transformations SDF -> OpenFF Molecule -> PDB -> MDAnalysis Universe -> RDKit Molecule -> OpenFF Molecule. It doesn't find any errors.

    from openff.toolkit import Molecule
    import MDAnalysis as mda
    import os

    sdf_dir = '/Users/peastman/workspace/spice-dataset/des370k/SDFS'
    errors = 0
    for filename in os.listdir(sdf_dir):
        # Reference molecule, read with full bond orders and formal charges.
        mol = Molecule(os.path.join(sdf_dir, filename), allow_undefined_stereo=True)
        # Round trip through PDB, which drops bond orders and charges.
        mol.to_file('temp.pdb', 'PDB')
        u = mda.Universe('temp.pdb')
        # MDAnalysis infers bond orders and charges during the RDKit conversion.
        mol2 = Molecule(u.atoms.convert_to('RDKit', force=True), allow_undefined_stereo=True)
        if mol.to_inchi() != mol2.to_inchi():
            print(filename, mol.to_inchi(), mol2.to_inchi())
            errors += 1
    print(errors, 'errors')

That's the biggest set of SDF files I happened to have sitting around. I have SMILES strings for about 400,000 PubChem molecules that can make a much bigger test. |
Here is how the Free Software Foundation interprets it according to their FAQ. When you load a GPL library into memory, you link it to all the other code running in that process. All that code together becomes a derived product and must be licensed under the GPL. Therefore, any code that you load into the same process as a GPL library must be available under a GPL-compatible license (one that allows relicensing it as GPL). That isn't a problem for OpenFF Toolkit itself because the MIT license is GPL-compatible. But the OpenEye toolkit is not. If you load both the OpenEye toolkit and MDAnalysis into the same process, you're violating the license. |
I'm also not incredibly familiar with licenses, but very simplistically the way I understand it is code that runs |
I'm in favor of not depending on GPL code at runtime, certainly not adding new GPL dependencies now that I am aware of this argument that side-by-side imports are a license violation. (I previously thought Python's import library got around GPL via magic I didn't understand, but now I'm not sure.) The toolkit tries to load

    In [1]: from openff.toolkit import Molecule
    In [2]: import os, sys
    In [3]: os.path.isfile("/Users/mattthompson/.oe_license.txt")
    Out[3]: True
    In [4]: "openeye" in sys.modules
    Out[4]: True

so I'm already needing to go back and see if some tests I wrote in the past are infected (presumably this is why OpenFE has mixed licenses in its ecosystem?). |
This version of the test runs through the PubChem molecules.

    from openff.toolkit import Molecule
    import MDAnalysis as mda

    errors = 0
    with open('/Users/peastman/workspace/spice-dataset/pubchem/sorted.txt') as f:
        for line in f:
            mol_id, smiles = line.split()
            mol = Molecule.from_smiles(smiles, allow_undefined_stereo=True)
            # PDB output needs coordinates, so generate one conformer.
            mol.generate_conformers(n_conformers=1)
            mol.to_file('temp.pdb', 'PDB')
            u = mda.Universe('temp.pdb')
            mol2 = Molecule(u.atoms.convert_to('RDKit', force=True), allow_undefined_stereo=True)
            if mol.to_inchi() != mol2.to_inchi():
                print('Error:', smiles)
                print(mol.to_inchi())
                print(mol2.to_inchi())
                errors += 1
    print(errors, 'errors')

It does report some errors. Here are a few examples.
It looks to me like these are mostly cases where it's getting the total charge wrong. Unlike the MDAnalysis routine, the |
This version uses RDKit to read the PDB file and fill in missing information.

    from openff.toolkit import Molecule
    from rdkit import Chem

    errors = 0
    with open('/Users/peastman/workspace/spice-dataset/pubchem/sorted.txt') as f:
        for line in f:
            mol_id, smiles = line.split()
            mol = Molecule.from_smiles(smiles, allow_undefined_stereo=True)
            mol.generate_conformers(n_conformers=1)
            mol.to_file('temp.pdb', 'PDB')
            # Keep explicit hydrogens; RDKit strips them by default.
            rdmol = Chem.MolFromPDBFile('temp.pdb', removeHs=False)
            mol2 = Molecule(rdmol, allow_undefined_stereo=True)
            if mol.to_inchi() != mol2.to_inchi():
                print('Error:', smiles)
                print(mol.to_inchi())
                print(mol2.to_inchi())
                errors += 1
    print(errors, 'errors')

When it fails, it generally knows something has gone wrong and prints an error message.
|
I ran 10,000 PubChem molecules through the above code. Here's how it did. 9862 succeeded. It made it through all the transformations, and the final molecule was identical to the initial one. 122 reported errors in reading the PDB file and 16 made it through, but the final molecule was different from the initial one. Here are some of the molecules in that last category.
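To put those counts in proportion, here is a small tally of the three outcomes reported above. The category labels are my own shorthand, not names used by any of the toolkits.

```python
# Outcome counts reported for the 10,000-molecule PubChem run.
# The labels are illustrative shorthand, not toolkit terminology.
outcomes = {
    "round trip identical": 9862,
    "PDB read error": 122,
    "silent mismatch": 16,
}

total = sum(outcomes.values())
rates = {name: 100 * count / total for name, count in outcomes.items()}

for name, pct in rates.items():
    print(f"{name}: {pct:.2f}%")
```

The last category is the one that matters most for a default-on feature, since nothing signals to the user that the molecule changed.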
|
I think this would be a great feature - but it does add ambiguity when hydrogens are missing. If I pass in a PDB with graph Since PDB files in particular usually do not include hydrogens, we should be very careful doing this by default. Even a check that fails if there are no hydrogens would not be sufficient to make this safe, as PDB files commonly include non-polar hydrogens. I would be in favour of a false-by-default, well documented |
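A toy illustration of the ambiguity, assuming two bonded neutral carbons and nothing else but hydrogens: the same heavy-atom graph implies a different C-C bond order depending on how many hydrogens are present. The function name is hypothetical and the valence bookkeeping is deliberately minimal.

```python
def implied_cc_bond_order(n_hydrogens):
    """Bond order implied for two bonded neutral carbons.

    Each carbon has 4 valence slots (8 in total). Every hydrogen uses one
    slot and a C-C bond of order k uses 2 * k, so 2 * k + n_hydrogens = 8.
    """
    remaining = 8 - n_hydrogens
    if remaining % 2 or remaining // 2 not in (1, 2, 3):
        raise ValueError(f"no neutral C2 molecule with {n_hydrogens} hydrogens")
    return remaining // 2


# Same heavy-atom graph (C-C), three different molecules.
for n_h in (6, 4, 2):
    print(n_h, "hydrogens -> C-C bond order", implied_cc_bond_order(n_h))
```

If some hydrogens were silently missing from the input, the inferred bond order would change without any error being raised.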
Just chiming in on a couple of thoughts with my MDA & OpenFE hats on:
To clarify, the strategy here is that the core openfe toolkit isn't importing
Couple of thoughts here:
|
It looks like most of those errors aren't reproducible. They happen when RDKit incorrectly infers the bonds based on coordinates. Since the script calls That does suggest a workaround. It's easy to check whether it inferred the correct set of bonds. If it didn't, generate a new random conformation and try again. Of course, it would be even better if RDKit would use the actual bonds specified in the PDB file, not insist on ignoring them and selecting new bonds based on coordinates. |
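A minimal sketch of that retry workaround, written against two placeholder callables rather than real toolkit calls: `make_conformer` and `infer_bonds` are hypothetical stand-ins for "generate a conformer with this seed" and "read the bonds back from coordinates".

```python
def infer_with_retries(make_conformer, infer_bonds, expected_bonds, max_attempts=5):
    """Regenerate the conformer until the inferred bonds match the known ones.

    make_conformer(seed) -> coordinates        (hypothetical callable)
    infer_bonds(coordinates) -> [(i, j), ...]  (hypothetical callable)
    expected_bonds: the bonds we already know from the original topology.
    """
    # Compare as unordered pairs so (i, j) and (j, i) count as the same bond.
    expected = {frozenset(bond) for bond in expected_bonds}
    for attempt in range(1, max_attempts + 1):
        coords = make_conformer(attempt)
        inferred = {frozenset(bond) for bond in infer_bonds(coords)}
        if inferred == expected:
            return coords, attempt
    raise RuntimeError(f"bonds still wrong after {max_attempts} conformers")
```

This only checks connectivity, not bond orders, which matches the failure mode described above: the wrong set of bonds being selected from coordinates.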
I have a bit of an update on this. I realized the tests above were kind of cheating. I had OpenFF generate a conformer, wrote it to a PDB file, and had RDKit read the file and try to infer bond orders from it. But here's the problem: OpenFF already knew the bond orders at the start, and made use of them in generating the conformer. Without that information, it couldn't generate realistic coordinates, and without realistic coordinates, RDKit couldn't determine the bond orders. Oops!

Instead I tried to get RDKit to infer bond orders just from the topology without needing a conformation. In the process I discovered a couple of bugs. A new RDKit version with fixes for those bugs was just released today, allowing me to get back to it. Here is the new code to create an RDKit molecule from an OpenFF molecule (`mol`) and determine bond orders.

    from rdkit import Chem
    from rdkit.Chem import rdDetermineBonds

    # Build an RDKit molecule with the same atoms and connectivity as mol,
    # but with every bond set to SINGLE and no implicit hydrogens.
    rdmol = Chem.EditableMol(Chem.Mol())
    for atom in mol.atoms:
        a = Chem.Atom(atom.atomic_number)
        a.SetNoImplicit(True)
        rdmol.AddAtom(a)
    for bond in mol.bonds:
        rdmol.AddBond(bond.atom1_index, bond.atom2_index, Chem.BondType.SINGLE)
    rdmol = rdmol.GetMol()
    # Infer bond orders from the topology and the known total charge.
    rdDetermineBonds.DetermineBondOrders(rdmol, int(mol.total_charge.m), embedChiral=False)
That last case doesn't necessarily mean it was wrong. Someone more knowledgeable about this than me would have to look at them and decide. In some cases I suspect the problem may be in the original specification. Like this one: Original: I'm not a chemist, but that just looks really strange to me. Positive charges on the oxygens??? The one produced by RDKit looks a lot more plausible. |
I haven't had enough time to make a coherent writeup or mental model of this area, but I have been poking at this from a few angles. Instead of remaining silent for another few months until I have a grand plan, I'll post two notable things that I've found so far.

The MDAnalysis guesser works for a sizeable protein and correctly figures out total charge

    import MDAnalysis as mda
    from openff.toolkit import Topology, Molecule
    from rdkit.Chem import rdmolops

    u = mda.Universe('../../examples/toolkit_showcase/5tbm_prepared.pdb', guess_bonds=True)
    rdmol = u.atoms.convert_to('RDKit', force=True)
    # Split into connected fragments and keep the largest one (the protein).
    mol_frags = rdmolops.GetMolFrags(rdmol, asMols=True)
    largest_rdmol = max(mol_frags, key=lambda m: m.GetNumAtoms())
    prot_from_mda = Molecule(largest_rdmol, allow_undefined_stereo=True)

And the MDAnalysis output is "correct" (that is, it is isomorphic to the

    prot_from_off = Topology.from_pdb("../../examples/toolkit_showcase/5tbm_prepared.pdb").molecule(0)
    prot_from_mda.is_isomorphic_with(prot_from_off, aromatic_matching=False)
The RDKit guesser is too slow for proteins, and needs to know total charge

    import openmm
    import openmm.app
    from rdkit import Chem
    from rdkit.Chem import rdDetermineBonds

    pdb = openmm.app.PDBFile("../../examples/toolkit_showcase/5tbm_prepared.pdb")

    # Copy atoms and connectivity from the OpenMM topology, all bonds SINGLE.
    rdmol = Chem.EditableMol(Chem.Mol())
    for atom in pdb.topology.atoms():
        a = Chem.Atom(atom.element.atomic_number)
        a.SetNoImplicit(True)
        rdmol.AddAtom(a)
    for bond in pdb.topology.bonds():
        rdmol.AddBond(bond.atom1.index, bond.atom2.index, Chem.BondType.SINGLE)
    rdmol = rdmol.GetMol()
    # The total charge (-5 here) has to be supplied by the caller.
    rdDetermineBonds.DetermineBondOrders(rdmol, -5, embedChiral=False)
Next steps

I'm unsure about concrete future steps. The simplest and safest option is to say "it seems like maybe the MDAnalysis guesser works, the code is above in this post, use at your own risk." We can basically make that statement now, but from past experiences with users, statements about capabilities that include the word "maybe" have zero to negative value. Given the high value to users, I would like to see if this can go further though. This aligns with our ongoing "See how much of the PDB we can model" effort, so I'm asking @Yoshanuikabundi to include the MDA guesser in the pipelines he's evaluating. I'm also seeing this benefiting from my in-progress work to support loading additional substructures like nucleic acids and user-specified residues in |
Anecdotally, a user in industry reached out to me directly about the MDA guesser:
|
Take a look at the errors I listed above in #1828 (comment). Those were all cases where it got the total charge wrong. Are they cause for concern? |
Glancing through those, there are a few pathologies. Many of the molecules are weird, but not out of scope (some of these species make me think "ionic liquids", which I know some of our users are studying).

Likely cause for concern

The molecule is only made of HCNO and no Hs are added or removed, but it disagrees on the total charge. This is a silent error that will affect parameter assignment for a molecule that's well within OpenFF's current domain of applicability.

Mis-guessing valence state of hypervalent-capable atoms in weird mols

This one mis-guesses either P or N valence states
Likewise, I+7 vs I+5
Something's also wrong with valence states here but I'm not familiar enough with iodine chemistry to say which specific ones

Bad inputs?

Intuitively, this molecule doesn't seem stable at all.

Some other bug in the workflow

This one seems unrelated - Something in the process added two Hs
|
I'm going to cc @cbouy here who is still somewhat actively working on improvements (at least he had things I think I promised to review a year ago and I didn't...) to the MDA rdkit parser and might have some views on these failures. |
I tried taking the molecule with just HCNO and used OpenFF to translate it from SMILES to InChI and back to SMILES. from openff.toolkit import Molecule
smiles = '[O-][n+]1cc2c3c[n+]([O-])c4ccccc4c3[n+]([O-])nc2c2ccccc21'
mol1 = Molecule.from_smiles(smiles, allow_undefined_stereo=True)
inchi = mol1.to_inchi()
mol2 = Molecule.from_inchi(inchi)
print(smiles)
print(mol1.to_smiles())
print(inchi)
print(mol2.to_smiles()) Here is the output.
The original SMILES has six charged atoms (three N and three O), while the final molecule has only two of each. Let's compare them. Here is the original molecule. And here is the final one. You can see at a glance that something's wrong with the final molecule. There are carbons that form five bonds.

For comparison I also used PubChem Sketcher to plot the InChI string. Unlike SMILES, InChI doesn't specify formal charges or bond orders, so anything reading it necessarily has to infer them. Here's what it plotted. This also is obviously wrong. There are neutral nitrogens forming five bonds.

Here's my best attempt at making sense of this. Formal charges and bond orders are, of course, just an abstraction. The real physics involves electrons that can move around freely through the molecule. Formal charges and bond orders are an approximate way of describing where the electrons tend to spend their time. The description is only approximate, and sometimes is arbitrary. But it's still useful. In any case, it's an abstraction based on rules. If we want to use it, we need to follow those rules. There isn't always a single right description, but there definitely are wrong descriptions, and in this case we're getting wrong descriptions. The original SMILES string follows the rules. The final one doesn't.

And this is done purely with OpenFF Toolkit, just starting from a valid SMILES string and transforming between supposedly equivalent representations. If you allow creating a Molecule from an InChI description, it's impossible to avoid inferring formal charges and bond orders. Apparently there's already code in there that does it. And in this case, it does it incorrectly. |
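The "carbons that form five bonds" and "neutral nitrogens forming five bonds" problems are the kind of rule violation that can be caught mechanically. Here is a rough sketch of such a check for H, C, N and O only, with integer (Kekulé) bond orders. The valence table and both function names are my own assumptions for illustration, not anything taken from OpenFF, RDKit or MDAnalysis.

```python
# Assumed neutral valences; illustrative only, covers just H, C, N, O.
NEUTRAL_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}


def allowed_valence(element, charge):
    """Total bond order an atom should have for a given formal charge."""
    base = NEUTRAL_VALENCE[element]
    if element in ("N", "O"):
        return base + charge  # N+ forms 4 bonds, O- forms 1
    return base - abs(charge)  # e.g. a carbocation or carbanion forms 3


def valence_violations(elements, charges, bonds):
    """Return indices of atoms whose bond orders break the rules above.

    bonds is an iterable of (i, j, order) with integer bond orders.
    """
    total = [0] * len(elements)
    for i, j, order in bonds:
        total[i] += order
        total[j] += order
    return [
        idx
        for idx, (element, charge) in enumerate(zip(elements, charges))
        if total[idx] != allowed_valence(element, charge)
    ]
```

Running a check like this on the output of any inference step would turn a wrong description into a reported error instead of a silent one, at least for the simple cases this table covers.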
Hi there 👋
Haha no worries, that PR does not really have improvements to the algorithm itself, although it gives users the ability to provide their own bond-order inferring function, which might be useful given the above. Regarding failing cases, we have a benchmark of the algorithm on a subset of ChEMBL (2M mols with 2 to 50 heavy atoms) in this repo. Note that a compound containing one of these substructures may still be correctly inferred. The algorithm in the MDA inferrer depends on the order of atoms, so the benchmark tries each molecule with different atom orderings and marks a molecule as failing if any ordering failed. Also, this was generated for an older version of MDA (2.4.3), and v2.5 introduces some changes that fix some of these cases. As you've already noticed, hypervalence, especially N+ in conjugated systems, is a bit of a pain for this algorithm. I might have some ideas specifically for nitrogen that I could explore during the MDAnalysis UGM in a few weeks. |
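For readers who want to try the same idea on their own inference routine, the "fail if any ordering fails" rule can be sketched roughly as follows. This is not the code from the benchmark repo. `is_correct` is a hypothetical callable that runs an inference step on the reordered atoms and bonds and reports whether the result matched the reference.

```python
import random


def reorder(elements, bonds, perm):
    """Renumber atoms so that new index k holds old atom perm[k]."""
    new_of_old = {old: new for new, old in enumerate(perm)}
    new_elements = [elements[old] for old in perm]
    new_bonds = [(new_of_old[i], new_of_old[j]) for i, j in bonds]
    return new_elements, new_bonds


def any_ordering_fails(elements, bonds, is_correct, n_orderings=10, seed=0):
    """Return True if is_correct(...) is False for any shuffled atom order.

    is_correct(elements, bonds) -> bool is a hypothetical callable.
    """
    rng = random.Random(seed)  # fixed seed so a failure can be reproduced
    indices = list(range(len(elements)))
    for _ in range(n_orderings):
        perm = indices[:]
        rng.shuffle(perm)
        if not is_correct(*reorder(elements, bonds, perm)):
            return True
    return False
```

Sampling random orderings rather than enumerating all of them is a practical choice here, since the number of permutations grows factorially with atom count.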
Is your feature request related to a problem? Please describe.
Creating a Topology requires you to provide bond orders and formal charges. Sometimes that information is not available, like when reading a PDB file or converting an OpenMM Topology. In that case, you need to provide SMILES strings for each molecule. That itself may be hard to determine, such as if all you have is the PDB file, or if it contains a protein.
Describe the solution you'd like
As long as all hydrogens are present, you can determine the bond orders and formal charges from the elements and bonds. That would make some workflows much easier. This article describes an algorithm for doing it. This routine in MDAnalysis implements it. We can't copy the code directly since it's GPL, but it's fine to use it as a reference for how the algorithm works.
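To make the idea concrete, here is a heavily simplified sketch of inferring bond orders and formal charges from elements and connectivity alone. It is neither the MDAnalysis code nor the full algorithm from the article. It is a single greedy pass over H, C, N and O with one assumed valence per element, and the names `VALENCE` and `infer_orders_and_charges` are made up for this example.

```python
from collections import defaultdict

# Assumed neutral valences; illustrative only, covers just H, C, N, O.
VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}


def infer_orders_and_charges(elements, bonds):
    """Greedy inference from elements and connectivity (all hydrogens present).

    Returns (orders, charges): orders maps sorted (i, j) pairs to integer
    bond orders, charges lists the formal charge of each atom.
    """
    orders = {tuple(sorted(bond)): 1 for bond in bonds}
    neighbors = defaultdict(list)
    for i, j in orders:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Valence slots still unfilled when every bond is treated as single.
    missing = [VALENCE[e] - len(neighbors[i]) for i, e in enumerate(elements)]

    # Raise the order of bonds between two atoms that both have unfilled slots.
    for i in range(len(elements)):
        for j in sorted(neighbors[i]):
            if missing[i] <= 0:
                break
            if missing[j] > 0:
                step = min(missing[i], missing[j])
                orders[tuple(sorted((i, j)))] += step
                missing[i] -= step
                missing[j] -= step

    # Leftover unfilled slots become negative charge, excess bonds positive.
    charges = [-m for m in missing]
    return orders, charges
```

For acetate, for instance, this gives one C=O double bond and a charge of -1 on the other oxygen. Which oxygen ends up charged depends on atom order, and a greedy pass like this can go wrong on conjugated or hypervalent systems.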
Describe alternatives you've considered
RDKit has a routine called determineBondOrders() that does something similar. However, it starts by throwing out all the existing bonds and determining new ones based on coordinates. That isn't a reliable thing to do.