SUMMARY OF MAILING LIST KEY POINTS
TODO:
Tim bullet 4
--------------------------------------------------------------------------------
Tim Vandermeersch:
Is "[0S]" the same as "[S]"?
CC=(CC)CC
Examples in aromaticity section.
--------------------------------------------------------------------------------
Tim Vandermeersch:
Two comments about the specification:
* section 2.2 Grammar: the chiral definition for trigonal bipyramidal has
too many classes (currently TB1, ..., TB30, should be TB1, ..., TB20).
* Grammar: '|' missing between SPACE and TAB
* section 3.9.6: Second paragraph: "The shorthand notations '@' and
'@@' correspond to clockwise and anti-clockwise tetrahedral chirality,
and are the same a '@TH1' and '@TH2', respectively. Likewise, in an
allenal configuration, the shorthand notations '@' and '@@' correspond
to '@AL1' and '@AL2', respectively." The words "clockwise" and
"anti-clockwise" should be exchanged for clarity I think.
* OpenSMILES Section 3.9.2 (Tetrahedral Centers):
"A chiral center can be written starting from any of its neighbor
atoms, and the choice of whether to list the remaining neighbor in
clockwise or anticlockwise order is also arbitrary."
I think the first part should be changed. A chiral atom may also be
specified as the first atom. This is even required when writing a
canonical smiles where the canonicalization algorithm places a chiral
atom to be the one with the lowest rank.
The SMILES [C@H]1=[C@@H][C@@H]=[C@@H][C@@H]=[C@@H][C@@H]=[C@@H]1 in
section 6.3 also allows this. Most toolkits support this. See also:
http://baoilleach.blogspot.be/2009/03/clockwisdom-of-smiles-part-ii.html
Also, I think the phrasing of "The the remaining three neighbor atoms
are written by listing them in anticlockwise order" could be clarified
with:
"... by listing the bonds connecting them to the chiral center in
anticlockwise order in the SMILES string ..."
It's actually the bonds' order in the SMILES string that should be
considered. This differs from the atom order when one or more
ring bonds originate from the chiral atom. (see link to Noel's blog)
Noel O'Boyle:
I noted some of these in the past too, but didn't have the chance to
change them.
--------------------------------------------------------------------------------
Andrew Dalke:
* The OpenSMILES spec does not place a bound on the acceptable range of charges.
My view is that we say OpenSMILES parsers must support charges from -15 to +15
(that's 5 bits). Support for a wider range is implementation specific.
This would go into the "Programming Practices" section.
There was general agreement that this is acceptable. See the thread
"spec clarification on charge limits" for a full discussion, including
my observation that existing PubChem records have charges from -4 to +9.
* support the empty string as valid SMILES for an empty molecule
* support for [b] (omission in the spec)
* only support for @TB1 ... @TB20 (.. @TB30 is an error in the spec)
* support for '0' as the first digit of two digit chiral indicators,
that is, I support both @OH01 and @OH1.
* The OpenSMILES list of element symbols ends with:
Hs'|'Mt'|'Ds'|'Rg'
There have been three IUPAC elements since then: copernicium (Cn),
flerovium (Fl) and livermorium (Lv).
These likely should be added to OpenSMILES. However, this is a nearly
theoretical point as only a few score atoms of each have been detected
and it won't be in any cheminformatics databases.
There is a slight problem with adding them. Currently, in SMARTS, [Cn]
matches an atom that is both an aliphatic carbon and an aromatic nitrogen,
which of course means it matches nothing. This is a round-about way
to write "don't match".
With the above change, Cn would match copernicium atoms, so it's not
completely backwards compatible.
We could also say that elements after Rg must be written in '#' form,
as '[#112]' for Cn.
Or we could do nothing. Which annoys the OCD part of me. ;) I would
prefer adding Cn, Fl, and Lv as possible symbols.
--------------------------------------------------------------------------------
OLDER ARCHIVE
Updated through 10/05/2007 10:40 PM.
--------------------------------------------------------------------------------
Greg Landrum: I would like to see something ... that flags the format type
... [and] version number in the format code. ...I'm not sure what form this
format/version coding would take, but we could borrow from InChI and do
something like: "SMILESv1=c1ccccc1".
Others suggested that this was a good idea but was outside the SMILES spec
itself.
--------------------------------------------------------------------------------
Polymers and Crystals: The proposed extension to allow polymers and
crystals using (for example) C&1&1&1 for graphite was vetoed by one
participant (Andrew?), who points out that it makes things like
substructure searching, aromaticity detection, canonicalization, and many
other things, very difficult.
--------------------------------------------------------------------------------
Andrew:
Get rid of [Mg++] notation, require [Mg+2] form.
ORGANIC_SYMBOL should include 's' and 'p'
Is "[0S]" the same as "[S]"?
uses "atomic mass" instead of "atomic weight."
What is the minimum hcount that a library must support? 9 hydrogens?
What's the minimum valence that a library must support? 15?
Agrees that '$' for quadruple bond should be allowed.
Disagrees that '.' is a valid bond.
See Andrew: 09/30/07 11:10 for better wording on hydrogen and charge.
Move "Isotopes" section in with atoms instead of separate.
--------------------------------------------------------------------------------
Change:
> If no isotope is specified, an atom is assumed to have either the
> naturally occuring isotopes for that element, or the isotope of the
> element is unknown, or the isotope is unspecified. SMILES has no
> way to distinguish these three cases.
Rewrite to say it's the naturally-occurring isotopic mixture.
--------------------------------------------------------------------------------
Missing ring-closure bonds - Allow all three forms
C=1CCCCC1 recommended form
C1CCCCC=1
C=1CCCCC=1
are all valid. Conflicts such as C-1CCCCC=1 are illegal.
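A minimal sketch of this rule, for illustration only. The function name
ring_closure_bonds is hypothetical, and the scanner is deliberately
simplified: it assumes no bracket atoms and no %nn closures, so every digit
it sees is a ring-closure digit.

```python
BOND_SYMBOLS = set("-=#$:/\\")


def ring_closure_bonds(smiles):
    """Return {ring digit: bond symbol} for single-digit ring closures.

    The symbol may be written at either end of the closure, or at both
    ends if they agree; '' means no symbol was written at either end.
    Assumes no bracket atoms and no %nn closures (illustrative only).
    """
    open_bonds = {}   # digit -> symbol written where the ring was opened
    resolved = {}
    prev = ""
    for ch in smiles:
        if ch.isdigit():
            # A bond symbol applies only to the digit right after it.
            symbol = prev if prev in BOND_SYMBOLS else ""
            if ch in open_bonds:
                other = open_bonds.pop(ch)
                if symbol and other and symbol != other:
                    raise ValueError(
                        f"conflicting bonds for ring {ch}: {other!r} vs {symbol!r}")
                resolved[ch] = symbol or other
            else:
                open_bonds[ch] = symbol
        prev = ch
    return resolved
```

With this sketch, C=1CCCCC1, C1CCCCC=1 and C=1CCCCC=1 all resolve ring 1 to
'=', while C-1CCCCC=1 raises ValueError.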
--------------------------------------------------------------------------------
Missing items in spec (Craig -- not posted to list)
* Atom can't be bonded to itself
* Only one bond can join two atoms,
Thus the following are illegal:
C12CCCCC12
C1C1
C11
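The two missing rules can be checked while ring closures are resolved. The
sketch below is an assumption-laden illustration, not parser code from any
toolkit: check_ring_closures is a hypothetical name, and it only handles
unbranched strings of single-letter atoms with single-digit ring closures.

```python
def check_ring_closures(smiles):
    """Return the set of bonds (i, j) with i < j, or raise ValueError.

    Illustrative only: accepts unbranched SMILES made of single-letter
    atoms and single-digit ring closures.
    """
    bonds = set()
    open_rings = {}    # ring digit -> index of the atom that opened it
    current = -1       # index of the most recent atom
    for ch in smiles:
        if ch.isalpha():
            current += 1
            if current > 0:
                bonds.add((current - 1, current))   # chain bond
        elif ch.isdigit():
            if current < 0:
                raise ValueError("ring digit before any atom")
            if ch not in open_rings:
                open_rings[ch] = current
                continue
            start = open_rings.pop(ch)
            if start == current:
                raise ValueError("atom bonded to itself")
            if (start, current) in bonds:
                raise ValueError("more than one bond between two atoms")
            bonds.add((start, current))
        else:
            raise ValueError(f"unsupported character {ch!r}")
    return bonds
```

Under these assumptions C1CCCCC1 is accepted, while C12CCCCC12, C1C1 and C11
each raise ValueError.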
--------------------------------------------------------------------------------
Last branch can be in parentheses, e.g.
C(C)(C)(C)
should be legal.
--------------------------------------------------------------------------------
Make any whitespace (space, tab, CR, LF) the official end of the SMILES string
by rule. Simplifies parsing.
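As a sketch of how this simplifies parsing (split_smiles_line is a
hypothetical helper, and treating the remainder of the line as a title is
an assumption, not part of the proposal):

```python
import re

# The four whitespace characters named in the proposal.
_SMILES_END = re.compile(r"[ \t\r\n]")


def split_smiles_line(line):
    """Split a line into (smiles, remainder) at the first whitespace."""
    match = _SMILES_END.search(line)
    if match is None:
        return line, ""
    return line[:match.start()], line[match.end():].strip()
```

For example, split_smiles_line("CCO\tethanol 64-17-5") gives
("CCO", "ethanol 64-17-5"). A line that starts with whitespace yields an
empty SMILES string.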
--------------------------------------------------------------------------------
Document structure
Separate the introductory material:
* Motivation
* Background: valence model
* History
Move SMILES file discussion etc. to separate doc, or to "best practices"
--------------------------------------------------------------------------------
Section 2.3, "see below" regarding aromaticity needs to be a hyperlink
--------------------------------------------------------------------------------
Minor point: say "is not present" - missing implies that it
should be there, while the preferred form is without the digit
if there is only one hydrogen.
--------------------------------------------------------------------------------
Suggested H-count description:
- Hydrogens are specified as "H" or "Hn" where "n" is a digit
from 0 to 9. Examples are "H", "H3" and "H1" meaning
that there is 1, 3 and 1 attached hydrogen, respectively.
Suggested charge description:
- Positive charge is specified by "+", "++", or "+n" where
n is a digit from 0 to 9. Examples are "+", "++", "+2", "+4"
and "+0" meaning charges of +1, +2, +2, +4 and 0, respectively.
A charge of +0 means that the atom is uncharged.
- Negative charge is specified by "-", "--", or "-n" where
n is a digit from 0 to 9. Examples are "-", "--", "-2", "-4"
and "-0" meaning charges of -1, -2, -2, -4 and 0, respectively.
A charge of -0 means that the atom is uncharged.
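The suggested wording can be read as two small parsing rules. A minimal
sketch, following the text literally (a single digit only); parse_hcount
and parse_charge are hypothetical names:

```python
import re

_HCOUNT = re.compile(r"H([0-9])?")
_CHARGE = re.compile(r"(\+\+|--)|([+-])([0-9])?")


def parse_hcount(text):
    """'H' -> 1, 'Hn' -> n for a single digit n."""
    match = _HCOUNT.fullmatch(text)
    if match is None:
        raise ValueError(f"bad hydrogen count: {text!r}")
    digit = match.group(1)
    return 1 if digit is None else int(digit)


def parse_charge(text):
    """'+' -> 1, '++' -> 2, '+n' -> n, and likewise for '-'."""
    match = _CHARGE.fullmatch(text)
    if match is None:
        raise ValueError(f"bad charge: {text!r}")
    if match.group(1) is not None:          # the doubled forms '++' and '--'
        return 2 if text == "++" else -2
    sign = 1 if match.group(2) == "+" else -1
    digit = match.group(3)
    return sign * (1 if digit is None else int(digit))
```

With this sketch "H", "H3" and "H1" give 1, 3 and 1, and "+0" and "-0" both
give a charge of 0.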
--------------------------------------------------------------------------------
Suggested text for "organic subset":
An element in the "organic subset" of B, C, N, O, P, S, F, Cl, Br, I,
and * (the "wildcard" atom) can be written using only its atomic
symbol (that is, without the square brackets, H-count, etc.). When
written in this form the atom:
- has no formal charge
- has undefined atomic mass
- does not specify any chirality
The hydrogen count is implied from the possible valences for the element
(given XXX) and the sum of the orders of the bond types. If the sum is
equal to a possible valence or greater than the highest possible valence
then the hydrogen count is 0. Otherwise the implied hydrogen count is
the difference between the sum and the next highest possible valence.
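A minimal sketch of the implied hydrogen count described above. The valence
table is an assumption standing in for the unspecified "XXX"; it uses the
usual Daylight-style values, and implicit_hcount is a hypothetical name.

```python
# Assumed table of possible valences, lowest first (not given in the text).
VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,),
    "P": (3, 5), "S": (2, 4, 6),
    "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
    "*": (),   # wildcard: no valences, so never any implied hydrogens
}


def implicit_hcount(symbol, bond_order_sum):
    """Implied hydrogens for an organic-subset atom written without brackets."""
    for valence in VALENCES[symbol]:
        if bond_order_sum <= valence:
            # Equal to a possible valence gives 0; otherwise fill up to
            # the next highest possible valence.
            return valence - bond_order_sum
    # The sum exceeds the highest possible valence.
    return 0
```

For example, implicit_hcount("N", 3) is 0, implicit_hcount("N", 4) is 1, and
implicit_hcount("S", 7) is 0.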
--------------------------------------------------------------------------------
> An atom with three or more bonds is called a branched atom
Is that the usual terminology? I'm not used to it. I also
see that the term "branched" does not occur elsewhere in the doc.
--------------------------------------------------------------------------------
Controversial subject: Is C.1CCCCC.1 legal? I say yes, everybody else says no.
--------------------------------------------------------------------------------
Need clarification of term "ring bond": it's usually, but not always, a
ring bond.
--------------------------------------------------------------------------------
Andrew argues for allowing different models of chemistry. Craig strongly
disagrees with this idea, because it means a given SMILES can have
different interpretations.
--------------------------------------------------------------------------------
Normalized SMILES: Several people argued for reusing ring digits to avoid
the %nn form as much as possible.
--------------------------------------------------------------------------------
Several people suggested adopting OpenEye's atom-class feature,
e.g. [CH2:1]
--------------------------------------------------------------------------------
For completeness we should do all forms of chirality.
Craig's comment: Contrary to several people's assertion that the other
forms of chirality are rare, they're not. All of the forms of chiral
centers are fairly common in large databases. The only reason they're not
seen much is because most file formats and toolkits can't handle them.
--------------------------------------------------------------------------------
Several people suggested adding R groups: "to support existing Cactus and
JChem/Marvin usage, OEChem allows "[R2]" to be used as short-hand for [*:2]"
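The short-hand amounts to a textual rewrite into the atom-class form. A
sketch, assuming the R-group number is one or more digits and that the
rewrite is done on the SMILES string before parsing (expand_r_groups is a
hypothetical name, not an OEChem function):

```python
import re

_R_GROUP = re.compile(r"\[R([0-9]+)\]")


def expand_r_groups(smiles):
    """Rewrite each [Rn] as the wildcard atom with atom class n, [*:n]."""
    return _R_GROUP.sub(r"[*:\1]", smiles)
```

For example, expand_r_groups("[R2]c1ccccc1") gives "[*:2]c1ccccc1". Element
symbols such as [Ru] are left untouched because no digits follow the R.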
--------------------------------------------------------------------------------
Andrew suggested a "Dumb SMILES" normalization, where single (non aromatic)
bonds between aromatic atoms would always be written as '-'. In this form, a very
simple parser that doesn't have the Hueckel's rule code to detect
aromaticity can still figure out every bond's type.
Craig's comment: This would be a nice feature for all "normalized" SMILES.
Single bonds between aromatic atoms are just a small percentage of all
bonds, so it wouldn't add much length to most SMILES.
--------------------------------------------------------------------------------
Missing a rule that says your normalized SMILES must use the minimum number
of ring closures. Otherwise, C1.C12.C23.C34.C4 is considered "normalized".
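The minimum number of ring closures for a molecular graph is its cyclomatic
number: bonds - atoms + connected components. A sketch of that count, using
a small union-find; minimum_ring_closures and its argument layout are
assumptions made for illustration.

```python
def minimum_ring_closures(num_atoms, bonds):
    """Return bonds - atoms + components for bonds given as (i, j) pairs."""
    parent = list(range(num_atoms))

    def find(i):
        # Find the component root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    components = num_atoms
    num_bonds = 0
    for i, j in bonds:
        num_bonds += 1
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            components -= 1
    return num_bonds - num_atoms + components
```

C1.C12.C23.C34.C4 describes five atoms joined in a chain by four bonds, so
the minimum is 4 - 5 + 1 = 0 ring closures, whereas the string as written
uses four.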
--------------------------------------------------------------------------------
For [atomic weights], the Blue Obelisk Data Repository should be used [1]
--------------------------------------------------------------------------------
Question: To what extent should we reject completely absurd structures
such as C(C)(C)(C)(C)(C)(C)(C)(C)(C)(C)(C)(C)(C)(C)C? Daylight generates
a warning when non-standard valences are encountered.
--------------------------------------------------------------------------------
AROMATICITY
Are boron and silicon aromatic?
What about selenium and arsenic?
Rich A's assertion: Lowercase (aromatic) is nothing more than: "if you see
a lower case atom symbol, subtract one from the implicit hydrogen count you
would otherwise assign."
Craig's question: Once you do this, can you deduce aromaticity
using Hueckel's rule, for atoms AND bonds? See the example above:
c12coccc1ccc2
Rich later says:
These two simple definitions encompass all uses of the lower case SMILES
atom notation I'm aware of. And they bypass the sticky problem of trying to
work out what aromaticity really means. (We don't really want to go there,
do we?)
... these would all be legal under this simplified definition:
Ccc - propylene
c1ccc1 - cyclobutadiene
c1ccC#Cc1 - benzyne
co - formaldehyde
cccc - butadiene
And Chris M. replies:
Of course, this reduced hydrogen count interpretation ties in nicely
with the use of "c" as a radical center.
Cc - ethyl radical
ccc - allyl radical
But maybe you want to represent furan as c1cocc1 where the o is not
H-deficient (Daylight Depict accepts this, as well as c1cOcc1). But then
you might write pyrrole as c1cncc1 (which it does not; it accepts
c1c[nH]cc1 and c1c[NH]cc1). I think all these forms should be
acceptable, since they are unambiguous
Rich replies:
Using the simplified rule for lower-case atom labels eliminates the need to
write the brackets around pyrrole/indole nitrogen (c1cncc1 is
acceptable). I like the idea of c1cocc1 and c1cOcc1 both being acceptable
for furan as well (the underlying assumption being that reduction in
implicit hydrogen count can never lead to a negative number).
Greg replies:
This new rule simplifies pyrrole, so you get to write c1cncc1 instead
of c1c[nH]cc1, but it is really confusing (to me at least) when you
compare pyrrole with pyridine, c1ccncc1. Sometimes an aromatic N with
two explicit neighbors has an implicit H, other times it doesn't.
Egon's suggestion:
Let's give that some more detail:
"a lowercase element symbol indicates an organic subset element
participating in a delocalized bonding system which is part of a ring,
such as aromatic and anti-aromatic rings, and labeled as 'aromatic'.".
"Bonds between 'aromatic' atoms are only labeled aromatic if they
participate in the delocalized bonding system which is part of a ring,
and the same bonding system as defined for the aromatic atoms between
the bond is defined."
"a ':' bond indicates that a bond participates in a delocalized bonding
system which is part of a ring, and is labeled as 'aromatic'"
More from Rich:
Hmm. Under the simplified reduced-implicit-hydrogen interpretation of lower
case atom symbols c1cncc1 is actually the pyrrole radical, if I'm not
mistaken. (Greg later agrees with this)
So under this system you'd have:
furan:
c1cocc1
c1cOcc1
pyrrole:
c1cNcc1
c1c[nH]cc1 (but not c1cncc1)
Greg provided some informal rules:
here are my "rules" for lower case atom symbols:
1) They're only allowed in rings.
2) They are only allowed for a subset of atoms (to be agreed upon later).
3) Aromaticity is *not* defined solely using "primitive rings" i.e. it
is possible to have an aromatic fused ring system where one or all of
the subrings is/are not aromatic. An example of this is: c12coccc1ccc2
in which neither of the subrings have 4N+2 electrons, but the fused
system does.
4) A bond of unspecified type between two aromatic atoms is either
aromatic or single. Examples of aromatic-aromatic single bonds are the
bond connecting the rings in biphenyl c1ccccc1c1ccccc1 and the fusing
bond in the previous example.
Craig: Has anyone come up with an example where a single bond or aromatic
bond *must* be written to correctly specify an aromatic system using
Daylight's rules? Or can single/aromatic bonds always be distinguished?
A good example to include: the molecule c12coccc1ccc2 (C12=COC=CC1=CC=C2 if
you'd prefer it kekulized) has a fused 6-ring and 5-ring; neither ring is
itself aromatic, but the fused system is (10 pi electrons). The ring
fusion bond is a ring bond, it's between two aromatic systems, but it's not
an aromatic bond (again, it's not part of the conjugation path).
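The electron counting behind this example can be sketched as follows. The
contributions are a simplifying assumption made here (each aromatic carbon
contributes one pi electron, the oxygen contributes two from a lone pair),
and the function names are hypothetical.

```python
# Assumed pi-electron contributions, enough for this example only.
PI_ELECTRONS = {"c": 1, "o": 2}


def pi_electron_count(atoms):
    return sum(PI_ELECTRONS[atom] for atom in atoms)


def satisfies_4n_plus_2(atoms):
    """True if the pi-electron count has the form 4N+2 with N >= 0."""
    count = pi_electron_count(atoms)
    return count >= 2 and (count - 2) % 4 == 0


# Atoms of c12coccc1ccc2, grouped by ring and by the whole fused system.
six_ring = ["c", "c", "o", "c", "c", "c"]
five_ring = ["c", "c", "c", "c", "c"]
fused_system = ["c"] * 8 + ["o"]

for name, atoms in [("6-ring", six_ring), ("5-ring", five_ring),
                    ("fused system", fused_system)]:
    print(name, pi_electron_count(atoms), satisfies_4n_plus_2(atoms))
```

Under these assumptions the 6-ring has 7 pi electrons and the 5-ring has 5,
so neither passes, while the fused system has 10 and does.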
--------------------------------------------------------------------------------
Andrew provided a new grammar. See 10/02/2007 2:47
--------------------------------------------------------------------------------
Andrew rewrote the whole document. ;-) See 10/03/2007 7:36
Andrew's version is more formal.
--------------------------------------------------------------------------------
In section 3.2 of the draft spec, it would be useful to state explicitly
what happens when one or more of the bonds to the tetrahedral chiral
center is a ring closure. It's needed. I have just corrected a bug in
the OpenBabel parser where this had not been properly handled.
I think this means, "what does the ring digit represent?"
--------------------------------------------------------------------------------