Use public datasets to test canonicalization for invariance #51
Comments
The ChEMBL db requires some preprocessing, which can be done using OpenBabel:
|
Agreed 😄 Alternatively to |
Started to build a query for PubChem today and interestingly, it contains explicit hydrogens, at least when you access it directly, not via a molfile: PubChem CID: 39967625 |
Key parts of code for
from tucan.element_properties import ELEMENT_PROPS
from tucan.canonicalization import canonicalize_molecule
from tucan.serialization import serialize_molecule
from tucan.visualization import draw_molecules
import pubchempy as pcp
import networkx as nx
import matplotlib.pyplot as plt

def pubchem_to_tucan(cid):
    c = pcp.Compound.from_cid(cid)
    atoms = c.to_dict(properties=['atoms'])
    bonds = c.to_dict(properties=['bonds'])
    # Identifiers that are printed in the summary at the end of the function
    isomeric_smiles = c.to_dict(properties=['isomeric_smiles'])
    canonical_smiles = c.to_dict(properties=['canonical_smiles'])
    inchi_string = c.to_dict(properties=['inchi'])
    m = nx.Graph()
    # Add one node per PubChem atom record
    for atom in atoms["atoms"]:
        error = False
        keys = atom.keys()
        for k in keys:
            if k == "aid":
                atom1 = atom[k]
            elif k == "number":
                atomic_number = atom[k]
            elif k == "element":
                element_symbol = atom[k]
            elif k == "x":
                xcoord = atom[k]
            elif k == "y":
                ycoord = atom[k]
            #else:
            #    error = True
        if not error:
            zcoord = 0
            an = ELEMENT_PROPS[element_symbol]["atomic_number"]  # currently unused
            m.add_node(int(atom1))
            attrs = {atom1: {"node_label": atom1, "atomic_number": atomic_number, "partition": 0, "element_symbol": element_symbol, "element_color": (208,208,224),
                             "x_coord": float(xcoord), "y_coord": float(ycoord), "z_coord": float(zcoord)
                             }
                     }
            nx.set_node_attributes(m, attrs)
        else:
            print("Error - incorrect or no atom definition")
    # Add one edge per PubChem bond record (the bond order is read but not stored)
    for bond in bonds["bonds"]:
        error = False
        keys = bond.keys()
        for k in keys:
            if k == "aid1":
                atom1 = bond[k]
            elif k == "aid2":
                atom2 = bond[k]
            elif k == "order":
                bond_order = bond[k]
            #else:
            #    error = True
        if not error:
            m.add_edge(int(atom1), int(atom2))
        else:
            print("Error - incorrect or no bond definition")
    m_canonicalized = canonicalize_molecule(m)
    tucan_string = serialize_molecule(m_canonicalized)
    print(f"PubChem CID: {cid}")
    print(c.molecular_formula)
    print(c.molecular_weight)
    print(c.iupac_name)
    print(isomeric_smiles["isomeric_smiles"])
    print(canonical_smiles["canonical_smiles"])
    print(inchi_string["inchi"])
    print(tucan_string)
    print(len(tucan_string)/len(inchi_string["inchi"]))
    #draw_molecules([m, m_canonicalized], ["before canonicalization", "after canonicalization"], highlight="atomic_number", title="PubChem CID "+str(cid))
    #draw_molecules([m, m_canonicalized], ["before canonicalization", "after canonicalization"], highlight="partition", title="PubChem CID "+str(cid))
    #plt.show()
    return m
|
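Since the goal of this issue is to test canonicalization for invariance, here is a minimal sketch of what such a check could look like once a molecule has been read: relabel the atoms at random, canonicalize again, and compare the resulting strings. Everything in it is an assumption for illustration only. The molecule is a plain dict/list pair rather than an `nx.Graph`, and `brute_force_canonical` is a hypothetical stand-in that only works for tiny molecules; it is not TUCAN's algorithm. With TUCAN, the callable would presumably wrap `serialize_molecule(canonicalize_molecule(m))`.

```python
import random
from itertools import permutations


def relabel(atoms, bonds, rng):
    """Return a copy of the molecule with randomly permuted atom labels."""
    labels = list(atoms)
    shuffled = labels[:]
    rng.shuffle(shuffled)
    mapping = dict(zip(labels, shuffled))
    new_atoms = {mapping[a]: z for a, z in atoms.items()}
    new_bonds = [(mapping[a], mapping[b]) for a, b in bonds]
    return new_atoms, new_bonds


def brute_force_canonical(atoms, bonds):
    """Hypothetical stand-in canonicalizer (NOT TUCAN's algorithm).

    Tries every labeling 0..n-1 and keeps the smallest
    (element sequence, sorted edge list) description.
    Factorial cost, so only usable for very small molecules.
    """
    labels = sorted(atoms)
    best = None
    for perm in permutations(range(len(labels))):
        mapping = dict(zip(labels, perm))
        elements = tuple(
            z for _, z in sorted((mapping[a], z) for a, z in atoms.items())
        )
        edges = tuple(
            sorted(tuple(sorted((mapping[a], mapping[b]))) for a, b in bonds)
        )
        candidate = (elements, edges)
        if best is None or candidate < best:
            best = candidate
    return str(best)


def is_invariant(atoms, bonds, canonical_string, n_trials=20, seed=0):
    """Check that canonical_string gives the same result for random relabelings."""
    rng = random.Random(seed)
    reference = canonical_string(atoms, bonds)
    return all(
        canonical_string(*relabel(atoms, bonds, rng)) == reference
        for _ in range(n_trials)
    )


# Heavy-atom skeleton of ethanol: atom label -> atomic number, plus bonds
ethanol_atoms = {1: 8, 2: 6, 3: 6}
ethanol_bonds = [(1, 2), (2, 3)]
print(is_invariant(ethanol_atoms, ethanol_bonds, brute_force_canonical))
```

Passing the canonicalizer in as a callable keeps the permutation logic independent of the data source, so the same check could be run on graphs coming from PubChem, ChEMBL or the CSD.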
Here is the link to pull request which adds |
Here is the link to pull request which adds |
Still want to add two more: graph_from_csd() |
Recap of what I did for testing against ChEMBL:
The chunk SDfiles are not deleted, so they will be reused in the next Snakemake run. Concerning PubChem: Question (1): Can we "misuse" PubChemPy to read the entries in such a XML and then proceed building a nx.Graph as demonstrated in A dump of PubChem's compounds is also available as SDfiles with embedded V2000 Molfiles, each with a maximum of 500k records. I'd prefer to split this into smaller pieces. |
You're right, both functions will not be very efficient to query and evaluate the whole
Still, for me, it was the easiest way to access both databases, and in particular I like the way
In order to process the XML data, you don't even need PubChemPy - you could generally just pull out the required tags in the
If you want to "brute-force" crunch all PubChem entries, it would possibly be easiest to just go through all the XMLs; you just need a couple of hundred GB to download and unpack them all ;-)
I would suggest having both functions: one for online queries of PubChem by CID and another one to sequentially work through the downloaded files. |
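To illustrate the idea of pulling the required tags straight out of the XML, here is a minimal sketch using only the standard library. The tag names (`PC-Compound`, `PC-Atoms_aid_E`, `PC-Element`, `PC-Bonds_aid1_E`, `PC-Bonds_aid2_E`, `PC-CompoundType_id_cid`) and the namespace reflect my reading of PubChem's compound XML layout and should be checked against a real downloaded file. The embedded sample is a hand-written, heavily trimmed record for water, not actual PubChem output.

```python
import io
import xml.etree.ElementTree as ET

# Hand-written, trimmed sample; the real files contain many more tags.
SAMPLE_XML = b"""<PC-Compounds xmlns="http://www.ncbi.nlm.nih.gov">
  <PC-Compound>
    <PC-Compound_id><PC-CompoundType><PC-CompoundType_id>
      <PC-CompoundType_id_cid>962</PC-CompoundType_id_cid>
    </PC-CompoundType_id></PC-CompoundType></PC-Compound_id>
    <PC-Compound_atoms><PC-Atoms>
      <PC-Atoms_aid>
        <PC-Atoms_aid_E>1</PC-Atoms_aid_E>
        <PC-Atoms_aid_E>2</PC-Atoms_aid_E>
        <PC-Atoms_aid_E>3</PC-Atoms_aid_E>
      </PC-Atoms_aid>
      <PC-Atoms_element>
        <PC-Element value="o">8</PC-Element>
        <PC-Element value="h">1</PC-Element>
        <PC-Element value="h">1</PC-Element>
      </PC-Atoms_element>
    </PC-Atoms></PC-Compound_atoms>
    <PC-Compound_bonds><PC-Bonds>
      <PC-Bonds_aid1>
        <PC-Bonds_aid1_E>1</PC-Bonds_aid1_E>
        <PC-Bonds_aid1_E>1</PC-Bonds_aid1_E>
      </PC-Bonds_aid1>
      <PC-Bonds_aid2>
        <PC-Bonds_aid2_E>2</PC-Bonds_aid2_E>
        <PC-Bonds_aid2_E>3</PC-Bonds_aid2_E>
      </PC-Bonds_aid2>
    </PC-Bonds></PC-Compound_bonds>
  </PC-Compound>
</PC-Compounds>"""


def local_name(tag):
    """Strip a namespace prefix such as '{http://...}' from a tag name."""
    return tag.rsplit("}", 1)[-1]


def iter_compounds(xml_source):
    """Stream (cid, {aid: atomic_number}, [(aid1, aid2), ...]) per compound."""
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if local_name(elem.tag) != "PC-Compound":
            continue
        cid = None
        aids, numbers, aid1, aid2 = [], [], [], []
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "PC-CompoundType_id_cid":
                cid = int(child.text)
            elif name == "PC-Atoms_aid_E":
                aids.append(int(child.text))
            elif name == "PC-Element":
                numbers.append(int(child.text))
            elif name == "PC-Bonds_aid1_E":
                aid1.append(int(child.text))
            elif name == "PC-Bonds_aid2_E":
                aid2.append(int(child.text))
        yield cid, dict(zip(aids, numbers)), list(zip(aid1, aid2))
        elem.clear()  # drop the processed subtree to limit memory use


for cid, atoms, bonds in iter_compounds(io.BytesIO(SAMPLE_XML)):
    print(cid, atoms, bonds)
```

The atom and bond lists produced here have the same shape as the `aid`/`number` and `aid1`/`aid2` fields used in the PubChemPy-based function, so the graph-building step could be shared between the online and the offline variant.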
Prepare test pipelines with snakemake in order to deploy test jobs (SLURM) on HPC.
ChEMBL (latest version, 30)
https://www.ebi.ac.uk/chembl/
Cambridge Structural Database (CSD)
https://www.ccdc.cam.ac.uk/solutions/csd-core/components/csd/