Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use public datasets to test canonicalization for invariance #51

Closed
JanCBrammer opened this issue Mar 24, 2022 · 12 comments
Closed

Use public datasets to test canonicalization for invariance #51

JanCBrammer opened this issue Mar 24, 2022 · 12 comments
Labels

Comments

@JanCBrammer
Copy link
Collaborator

Prepare test pipelines with snakemake in order to deploy test jobs (SLURM) on HPC.

ChEMBL (latest version, 30)

https://www.ebi.ac.uk/chembl/

Cambridge Structural Database (CSD)

https://www.ccdc.cam.ac.uk/solutions/csd-core/components/csd/

@flange-ipb
Copy link
Collaborator

The ChEMBL db requires some preprocessing, which can be done using OpenBabel:

  • V2000 Molfiles are embedded into a single sdf file, so that needs to be converted into V3000 Molfiles.
  • Most hydrogens are implicit and need to be added explicitly.

obabel -h -isdf chembl_30.sdf -omol -Ochembl_30.mol -x3 produces a single file where the Molfiles are separated by a $$$$ line (similar to the sdf format). The -m option produces a single Molfile for each molecule, but we probably don't want to cope with 2136187 files. ;)

@JanCBrammer
Copy link
Collaborator Author

JanCBrammer commented Jul 15, 2022

but we probably don't want to cope with 2136187 files. ;)

Agreed 😄

Alternatively to chembl_30.sdf I tried chembl_30_chemreps.txt during my initial research on how to access the compounds. The latter contains InChI and canonical SMILES for each of the compounds, i.e., it would allow for skipping the molfile and generating a graph directly from one of the descriptors (e.g., from SMILES using RDKit).

@schatzsc
Copy link
Collaborator

Started to build a query for PubChem today and interestingly, it contains explicit hydrogens, at least when you access it directly, not via a molfile:

pubchem

PubChem CID: 39967625
C10H9F5N2O
268.18
N-methyl-2-(2,3,4,5,6-pentafluoro-N-methylanilino)acetamide
CNC(=O)CN(C)C1=C(C(=C(C(=C1F)F)F)F)F
CNC(=O)CN(C)C1=C(C(=C(C(=C1F)F)F)F)F
InChI=1S/C10H9F5N2O/c1-16-4(18)3-17(2)10-8(14)6(12)5(11)7(13)9(10)15/h3H2,1-2H3,(H,16,18)
C10H9F5N2O/(1-10)(2-10)(3-10)(4-11)(5-11)(6-11)(7-12)(8-12)(9-20)(10-20)(11-21)(12-14)(12-21)(13-15)(13-16)(13-21)(14-20)(14-22)(15-17)(15-23)(16-18)(16-24)(17-19)(17-25)(18-19)(18-26)(19-27)

@schatzsc
Copy link
Collaborator

The query is done via PubChemPy which is also on GitHub

@schatzsc
Copy link
Collaborator

schatzsc commented Oct 15, 2022

Key parts of code for pubchem_reader.py here:

from tucan.element_properties import ELEMENT_PROPS
from tucan.canonicalization import canonicalize_molecule
from tucan.serialization import serialize_molecule
from tucan.visualization import draw_molecules

import pubchempy as pcp
import networkx as nx
import matplotlib.pyplot as plt

def pubchem_to_tucan(cid):
    c = pcp.Compound.from_cid(cid)

    atoms = c.to_dict(properties=['atoms'])
    bonds = c.to_dict(properties=['bonds'])

    m = nx.Graph()

    for atom in atoms["atoms"]:
        error = False
        keys = atom.keys()
        for k in keys:
            if(k == "aid"):
                **atom1 = atom[k]**
            elif(k == "number"):
                atomic_number = atom[k]
            elif(k == "element"):
                element_symbol = atom[k]
            elif(k == "x"):
                xcoord = atom[k]
            elif(k == "y"):
                ycoord = atom[k]
            #else:
            #    error = True
        if(error == False):
            zcoord = 0
            an = ELEMENT_PROPS[element_symbol]["atomic_number"]
            **m.add_node(int(atom1))
            attrs = {atom1: {"node_label": atom1, "atomic_number": atomic_number, "partition": 0, "element_symbol": element_symbol, "element_color": (208,208,224),
                             "x_coord": float(xcoord), "y_coord": float(ycoord), "z_coord": float(zcoord)
                            }
                    }
            nx.set_node_attributes(m, attrs)**
        else:
            print("Error - incorrect or no atom definition")

    for bond in bonds["bonds"]:
        error = False
        keys = bond.keys()
        for k in keys:
            if(k == "aid1"):
                **atom1 = bond[k]**
            elif(k == "aid2"):
                **atom2 = bond[k]**
            elif(k == "order"):
                bond_order = bond[k]
            #else:
            #    error = True
        if(error == False):
            **m.add_edge(int(atom1), int(atom2))**
        else:
            print("Error - incorrect or no bond definition")

    m_canonicalized = canonicalize_molecule(m)
    tucan_string = serialize_molecule(m_canonicalized)

    print(f"PubChem CID: {cid}")
    print(c.molecular_formula)
    print(c.molecular_weight)
    print(c.iupac_name)
    print(isomeric_smiles["isomeric_smiles"])
    print(canonical_smiles["canonical_smiles"])
    print(inchi_string["inchi"])
    print(tucan_string)

    print(len(tucan_string)/len(inchi_string["inchi"]))

    #draw_molecules([m, m_canonicalized], ["before canonicalization", "after canonicalization"], highlight="atomic_number", title="PubChem CID "+str(cid))
    #draw_molecules([m, m_canonicalized], ["before canonicalization", "after canonicalization"], highlight="partition", title="PubChem CID "+str(cid))
    #plt.show()

    return m

pubchem_reader.zip

@schatzsc
Copy link
Collaborator

Here is the link to pull request which adds graph_from_pubchem() routine:

LINK

@schatzsc
Copy link
Collaborator

Here is the link to pull request which adds graph_from_chembl() routine:

LINK

@schatzsc
Copy link
Collaborator

Still want to add two more:

graph_from_csd()
graph_from_tautomerDB()

@flange-ipb
Copy link
Collaborator

Recap of what I did for testing against ChEMBL:

  • general strategy:
    • split the database into chunks of SDfiles, each with 1000 compounds
    • process each chunk with one CPU process
  • steps:
    • download the ChEMBL database as sqlite db dump (single file); the table compound_structures has 2136187 rows with V2000 Molfiles
    • chunk SDfile: split table content by selection of a certain range of rowids (most importantly: This IO operation is more or less constant in time!); convert to V3000 Molfile, add implicit hydrogens and concat to SDF via RDKit
    • for each chunk: unpack SDfile via RDKit and proceed with TUCAN's tests

The chunk SDfiles are not deleted, so they will be reused in the next Snakemake run.

Concerning PubChem:
PubChemPy can only use PubChem's web-API and we certainly don't want to query for 112 million compounds this way. I guess the call pcp.Compound.from_cid(cid) just fetches and processes some XML. Interestingly, the dump of those XMLs can be downloaded here.

Question (1): Can we "misuse" PubChemPy to read the entries in such a XML and then proceed building a nx.Graph as demonstrated in graph_from_pubchem? That's more or less all about changing PubChemPy's data source.

A dump of PubChem's compounds is also available as SDfiles with embedded V2000 Molfiles, each with a maximum of 500k records. I'd prefer to split this into smaller pieces.

@schatzsc
Copy link
Collaborator

You're right, both functions will not be very efficient to query and evaluate the whole ChEMBL and PubChem databases. The first one seems to be still do-able on a separate computer with some time but network overhead will likely dominate the time.

Still, for me, it was the easiest way to access both databases and in particular I like the way PubChem does this (although the metal-containing structures in there are still utter nonsense).

In order to process the XML data, you don't even need PubChemPy - generally could just pull out the required tags in the <PC-Atoms_aid>, <PC-Atoms_element>, and <PC-Bonds-aid1> as well as <PC-Bonds-aid2> blocks:

xml

If you want to "brute-force" crunch the whole PubChem entries it would possibly be the easiest to just go through all the XMLs, just need a couple of hundred GB to download and unpack them all ;-)

Would like to suggest to have both functions, one for online query of PubChem by CID and another one to sequentially work the downloaded ones.

@schatzsc
Copy link
Collaborator

Interesting side-note: the XML files also contain the hydrogen atoms, so at least internally PubChem is explicit hydrogens

xml2

@schatzsc
Copy link
Collaborator

I suggest to close this general topic now and continue discussion separately for ChEMBL, PubChem and CSD in the respective threads:

ChEMBL
PubChem
CSD

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants