Use public datasets to test canonicalization for invariance #51

JanCBrammer · 2022-03-24T13:18:12Z

Prepare test pipelines with snakemake in order to deploy test jobs (SLURM) on HPC.

ChEMBL (latest version, 30)

https://www.ebi.ac.uk/chembl/

Cambridge Structural Database (CSD)

https://www.ccdc.cam.ac.uk/solutions/csd-core/components/csd/

flange-ipb · 2022-07-14T08:14:51Z

The ChEMBL db requires some preprocessing, which can be done using OpenBabel:

V2000 Molfiles are embedded into a single sdf file, so that needs to be converted into V3000 Molfiles.
Most hydrogens are implicit and need to be added explicitly.

obabel -h -isdf chembl_30.sdf -omol -Ochembl_30.mol -x3 produces a single file where the Molfiles are separated by a $$$$ line (similar to the sdf format). The -m option produces a single Molfile for each molecule, but we probably don't want to cope with 2136187 files. ;)

JanCBrammer · 2022-07-15T07:28:57Z

but we probably don't want to cope with 2136187 files. ;)

Agreed 😄

Alternatively to chembl_30.sdf I tried chembl_30_chemreps.txt during my initial research on how to access the compounds. The latter contains InChI and canonical SMILES for each of the compounds, i.e., it would allow for skipping the molfile and generating a graph directly from one of the descriptors (e.g., from SMILES using RDKit).

schatzsc · 2022-10-15T03:36:35Z

Started to build a query for PubChem today and interestingly, it contains explicit hydrogens, at least when you access it directly, not via a molfile:

PubChem CID: 39967625
C10H9F5N2O
268.18
N-methyl-2-(2,3,4,5,6-pentafluoro-N-methylanilino)acetamide
CNC(=O)CN(C)C1=C(C(=C(C(=C1F)F)F)F)F
CNC(=O)CN(C)C1=C(C(=C(C(=C1F)F)F)F)F
InChI=1S/C10H9F5N2O/c1-16-4(18)3-17(2)10-8(14)6(12)5(11)7(13)9(10)15/h3H2,1-2H3,(H,16,18)
C10H9F5N2O/(1-10)(2-10)(3-10)(4-11)(5-11)(6-11)(7-12)(8-12)(9-20)(10-20)(11-21)(12-14)(12-21)(13-15)(13-16)(13-21)(14-20)(14-22)(15-17)(15-23)(16-18)(16-24)(17-19)(17-25)(18-19)(18-26)(19-27)

schatzsc · 2022-10-15T03:39:19Z

The query is done via PubChemPy which is also on GitHub

schatzsc · 2022-10-15T03:43:59Z

Key parts of code for pubchem_reader.py here:

from tucan.element_properties import ELEMENT_PROPS
from tucan.canonicalization import canonicalize_molecule
from tucan.serialization import serialize_molecule
from tucan.visualization import draw_molecules

import pubchempy as pcp
import networkx as nx
import matplotlib.pyplot as plt

def pubchem_to_tucan(cid):
    c = pcp.Compound.from_cid(cid)

    atoms = c.to_dict(properties=['atoms'])
    bonds = c.to_dict(properties=['bonds'])

    m = nx.Graph()

    for atom in atoms["atoms"]:
        error = False
        keys = atom.keys()
        for k in keys:
            if(k == "aid"):
                **atom1 = atom[k]**
            elif(k == "number"):
                atomic_number = atom[k]
            elif(k == "element"):
                element_symbol = atom[k]
            elif(k == "x"):
                xcoord = atom[k]
            elif(k == "y"):
                ycoord = atom[k]
            #else:
            #    error = True
        if(error == False):
            zcoord = 0
            an = ELEMENT_PROPS[element_symbol]["atomic_number"]
            **m.add_node(int(atom1))
            attrs = {atom1: {"node_label": atom1, "atomic_number": atomic_number, "partition": 0, "element_symbol": element_symbol, "element_color": (208,208,224),
                             "x_coord": float(xcoord), "y_coord": float(ycoord), "z_coord": float(zcoord)
                            }
                    }
            nx.set_node_attributes(m, attrs)**
        else:
            print("Error - incorrect or no atom definition")

    for bond in bonds["bonds"]:
        error = False
        keys = bond.keys()
        for k in keys:
            if(k == "aid1"):
                **atom1 = bond[k]**
            elif(k == "aid2"):
                **atom2 = bond[k]**
            elif(k == "order"):
                bond_order = bond[k]
            #else:
            #    error = True
        if(error == False):
            **m.add_edge(int(atom1), int(atom2))**
        else:
            print("Error - incorrect or no bond definition")

    m_canonicalized = canonicalize_molecule(m)
    tucan_string = serialize_molecule(m_canonicalized)

    print(f"PubChem CID: {cid}")
    print(c.molecular_formula)
    print(c.molecular_weight)
    print(c.iupac_name)
    print(isomeric_smiles["isomeric_smiles"])
    print(canonical_smiles["canonical_smiles"])
    print(inchi_string["inchi"])
    print(tucan_string)

    print(len(tucan_string)/len(inchi_string["inchi"]))

    #draw_molecules([m, m_canonicalized], ["before canonicalization", "after canonicalization"], highlight="atomic_number", title="PubChem CID "+str(cid))
    #draw_molecules([m, m_canonicalized], ["before canonicalization", "after canonicalization"], highlight="partition", title="PubChem CID "+str(cid))
    #plt.show()

    return m

pubchem_reader.zip

schatzsc · 2022-10-15T21:22:43Z

Here is the link to pull request which adds graph_from_pubchem() routine:

LINK

schatzsc · 2022-10-16T00:37:18Z

Here is the link to pull request which adds graph_from_chembl() routine:

LINK

schatzsc · 2022-10-16T00:40:33Z

Still want to add two more:

graph_from_csd()
graph_from_tautomerDB()

flange-ipb · 2022-10-17T12:19:12Z

Recap of what I did for testing against ChEMBL:

general strategy:
- split the database into chunks of SDfiles, each with 1000 compounds
- process each chunk with one CPU process
steps:
- download the ChEMBL database as sqlite db dump (single file); the table compound_structures has 2136187 rows with V2000 Molfiles
- chunk SDfile: split table content by selection of a certain range of rowids (most importantly: This IO operation is more or less constant in time!); convert to V3000 Molfile, add implicit hydrogens and concat to SDF via RDKit
- for each chunk: unpack SDfile via RDKit and proceed with TUCAN's tests

The chunk SDfiles are not deleted, so they will be reused in the next Snakemake run.

Concerning PubChem:
PubChemPy can only use PubChem's web-API and we certainly don't want to query for 112 million compounds this way. I guess the call pcp.Compound.from_cid(cid) just fetches and processes some XML. Interestingly, the dump of those XMLs can be downloaded here.

Question (1): Can we "misuse" PubChemPy to read the entries in such a XML and then proceed building a nx.Graph as demonstrated in graph_from_pubchem? That's more or less all about changing PubChemPy's data source.

A dump of PubChem's compounds is also available as SDfiles with embedded V2000 Molfiles, each with a maximum of 500k records. I'd prefer to split this into smaller pieces.

schatzsc · 2022-10-17T13:08:39Z

You're right, both functions will not be very efficient to query and evaluate the whole ChEMBL and PubChem databases. The first one seems to be still do-able on a separate computer with some time but network overhead will likely dominate the time.

Still, for me, it was the easiest way to access both databases and in particular I like the way PubChem does this (although the metal-containing structures in there are still utter nonsense).

In order to process the XML data, you don't even need PubChemPy - generally could just pull out the required tags in the <PC-Atoms_aid>, <PC-Atoms_element>, and <PC-Bonds-aid1> as well as <PC-Bonds-aid2> blocks:

If you want to "brute-force" crunch the whole PubChem entries it would possibly be the easiest to just go through all the XMLs, just need a couple of hundred GB to download and unpack them all ;-)

Would like to suggest to have both functions, one for online query of PubChem by CID and another one to sequentially work the downloaded ones.

schatzsc · 2022-10-17T13:23:01Z

Interesting side-note: the XML files also contain the hydrogen atoms, so at least internally PubChem is explicit hydrogens

schatzsc · 2023-01-12T12:26:36Z

I suggest to close this general topic now and continue discussion separately for ChEMBL, PubChem and CSD in the respective threads:

ChEMBL
PubChem
CSD

flange-ipb mentioned this issue Aug 4, 2022

Invariance tests with ChEMBL #70

Merged

schatzsc added the dataset label Oct 17, 2022

flange-ipb mentioned this issue Nov 18, 2022

Improve workflow for ChEMBL database test #98

Open

11 tasks

schatzsc closed this as completed Jan 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use public datasets to test canonicalization for invariance #51

Use public datasets to test canonicalization for invariance #51

JanCBrammer commented Mar 24, 2022

flange-ipb commented Jul 14, 2022

JanCBrammer commented Jul 15, 2022 •

edited

Loading

schatzsc commented Oct 15, 2022

schatzsc commented Oct 15, 2022

schatzsc commented Oct 15, 2022 •

edited by JanCBrammer

Loading

schatzsc commented Oct 15, 2022

schatzsc commented Oct 16, 2022

schatzsc commented Oct 16, 2022

flange-ipb commented Oct 17, 2022

schatzsc commented Oct 17, 2022

schatzsc commented Oct 17, 2022

schatzsc commented Jan 12, 2023

Use public datasets to test canonicalization for invariance #51

Use public datasets to test canonicalization for invariance #51

Comments

JanCBrammer commented Mar 24, 2022

ChEMBL (latest version, 30)

Cambridge Structural Database (CSD)

flange-ipb commented Jul 14, 2022

JanCBrammer commented Jul 15, 2022 • edited Loading

schatzsc commented Oct 15, 2022

schatzsc commented Oct 15, 2022

schatzsc commented Oct 15, 2022 • edited by JanCBrammer Loading

schatzsc commented Oct 15, 2022

schatzsc commented Oct 16, 2022

schatzsc commented Oct 16, 2022

flange-ipb commented Oct 17, 2022

schatzsc commented Oct 17, 2022

schatzsc commented Oct 17, 2022

schatzsc commented Jan 12, 2023

JanCBrammer commented Jul 15, 2022 •

edited

Loading

schatzsc commented Oct 15, 2022 •

edited by JanCBrammer

Loading