Daily Python: Andrew Dalke: Molfile "S SKP"

In the last couple of essays I described some of the parts of a SDF record then pointed out some of the ways to break simple SDF record tokenizers. In this essay I'll point out an documentation curiosity which makes it even harder to parse a molfile with simple tools, though until I wrote this essay I had never seen it in actual use.

The S SKP property block line

Years ago I decided to make a low-level grammar for everything mentioned in the CTFile Formats specification. I got to the following in the section titled The Properties Block:

Most lines in the properties block are identified by a prefix of the form M  XXX where two spaces separate the M and XXX. Exceptions are:

A aaa, V aaa vvvvvv, and G aaappp, which indicate the following ISIS/Desktop properties: atom alias, atom value, and group abbreviation (called residue in ISIS), respectively.

S  SKPnnn which causes the next nnn lines to be ignored.

The prefix: M  END terminates the properties block.

Example SKPs

That means the following (modified from ChEMBL43280) should be a valid SDF record:

methylamine
     RDKit          2D

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.4125   -0.0179    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -0.4125   -0.0179    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0
S  SKP  1
Hello!
M  END
$$$$

Toolkit support, part 1

Do different chemistry toolkits support it? I'll try with RDKit, OEChem, and Open Babel:

>>> print(open("methylamine.sdf").read())
methylamine
     RDKit          2D

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.4125   -0.0179    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -0.4125   -0.0179    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0
S  SKP  1
Hello!
M  END
$$$$

>>> from rdkit import Chem
>>> for mol in Chem.ForwardSDMolSupplier("methylamine.sdf"):
...   print(mol.GetProp("_Name"))
...
methylamine
>>> from openeye.oechem import *
>>> for mol in oemolistream("methylamine.sdf").GetOEGraphMols():
...   print(mol.GetTitle())
...
methylamine
>>> from openbabel import pybel
>>> for mol in pybel.readfile("sdf", "methylamine.sdf"):
...   print(mol.title)
...
methylamine
>>>

Looks can be deceiving

That suggests that all three toolkits handle that line. However, the following text occurs slightly further in the Properties Block section of the CTFile documentation:

All lines that are not understood by the program are ignored.

Hmm. So it could be that the three toolkits ignore the SKP line and the Hello! line.

What about something more perverse - instead of the Hello! line in the skip section, use a line formatted like a CHG property, and see if the toolkit parses interprets it as a charge. To help confirm that the CHG line is formatted correctly I'll have two records in the SD file where the only difference is the SKP line:

methylamine1
     RDKit          2D

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.4125   -0.0179    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -0.4125   -0.0179    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0
M  CHG  1   1   1
M  END
$$$$
methylamine2
     RDKit          2D

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.4125   -0.0179    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -0.4125   -0.0179    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  0
S  SKP  1
M  CHG  1   1   1
M  END
$$$$

I'll use the following short program to have the three toolkits parse the file:

import sys
if len(sys.argv) != 2:
    raise SystemExit(f"{sys.argv[0]} filename.sdf -- show how toolkits parse the SD file")
filename = sys.argv[1]

print("# RDKit")
from rdkit import Chem
for mol in Chem.ForwardSDMolSupplier(filename):
  print(mol.GetProp("_Name"), Chem.MolToSmiles(mol))

print("\n# OEChem")
from openeye.oechem import *
for mol in oemolistream(filename).GetOEGraphMols():
  print(OEMolToSmiles(mol), mol.GetTitle())

print("\n# Open Babel")
from openbabel import pybel
for mol in pybel.readfile("sdf", filename):
  print(mol.write("smi"), end="")

The output is:

# RDKit
methylamine1 C[NH3+]
methylamine2 CN

# OEChem
C[NH3+] methylamine1
C[NH3+] methylamine2

# Open Babel
[NH3+]C methylamine1
[NH3+]C methylamine2

Of the three toolkits, it looks like only RDKit supports the SKP line. And the reason RDKit supports it is because I pointed it out to Greg Landrum, the main RDKit developer.

Other comments

As a historical note, the BUGS FIXED section of the Daylight v4.95 release notes from August 1, 2011 includes Handling of S SKP lines in molfiles (#1042).

I also know, from talking to Wolf-Dietrich Ihlenfeld that CACTVS supports SKP lines, and has a way to extract the contents of those blocks. (See below.)

By the way, if you want something even more perverse, try using M END followed by $$$$ in your skip section!

Is it important?

Almost certainly not.

In 2009 I pointed this oddity out to someone with experience and access to the Symyx tools. (Symyx acquired MDL in 2007. In 2010 the Symyx tools became part of Accelrys, which then became BIOVIA in 2014.) He tried it out and found that some Symyx tools - which should support the format the best - handled the SKP line, while some others did not. (In this case some means at least one for both cases.)

As general principle, if a widely-used file format has a feature which isn't used, then that feature becomes unusable. If the format isn't designed for graceful forwards- and backwards-compatibility then it cannot be reintroduced.

Why? Most projects accumulate code over time, with many different components. The SDF format (or at least a subset of it) is so widely supported that those components will likely have several if not many different SDF parsers - most of which don't handle SKP lines. And why don't they handle the SKP lines? Because code that that isn't written does not require development time, debugging time, testing time, nor documention time. Plus, most people write parsers based on what they need, not off-hand, one-line mentions in the format specification.

FASTA format

The FASTA format in bioinformatics is another example of this principle. Here's an example bacteriorhodopsin protein sequence, P02945:

>sp|P02945|BACR_HALSA Bacteriorhodopsin OS=Halobacterium salinarum (strain ATCC 700922 / JCM 11081 / NRC-1) OX=64091 GN=bop PE=1 SV=2
MLELLPTAVEGVSQAQITGRPEWIWLALGTALMGLGTLYFLVKGMGVSDPDAKKFYAITT
LVPAIAFTMYLSMLLGYGLTMVPFGGEQNPIYWARYADWLFTTPLLLLDLALLVDADQGT
ILALVGADGIMIGTGLVGALTKVYSYRFVWWAISTAAMLYILYVLFFGFTSKAESMRPEV
ASTFKVLRNVTVVLWSAYPVVWLIGSEGAGIVPLNIETLLFMVLDVSAKVGFGLILLRSR
AIFGEAEAPEPSAGDGAAATSD

There first line of the record starts with a >, followed by a description. The remaining lines contain sequence data.

However, this is more correctly called the NCBI FASTA format because the original Pearson FASTA program also accepted lines starting with a ;. Quoting fasta.1:

The fasta programs use a standard text format sequence file. Lines beginning with ‘>’ or ‘;’ are considered comments and ignored; sequences can be upper or lower case, blanks, tabs and unrecognizable characters are ignored.

though internally getseq.c treates the first > as the sequence title, and not simply an ignorable comment. On the other hand, format.doc says the standard protein library format had a title line starting with >, and doesn't mention other forms of comments.

Rather ambiguous, yes?

When I worked in bioinformatics, back in the late 1990s and early 2000s, I never came across a FASTA file using a ; comment, nor one using more than a single < line. On the other hand, I reviewed over a dozen FASTA parsers and found that most would break or give wrong interpretations if those Pearson FASTA features were used.

The consequence is that it's effectively impossible to distribute a FASTA file now with semicolon comments in them, as those files won't be readable by the vast majority of tools which claim to read FASTA files.

Just like using SKP lines in an SDF record, even though allowed by the specification, will break most tools which claim to read SD files.

(BTW, I'm annoyed by the Wikipedia entry for the FASTA format, as you can infer from its Talk section.)

Who ordered that?

I. I. Rabi famously said of the muon - who ordered that? I felt the same thing when I came across the SKP line. Why is it there?

I still don't know why it's there, but I do know that it's used.

CACTVS and SKP

Some years back I talked with Wolf-Dietrich Ihlenfeld, the CACTVS developer. He said that one of his Japanese customers use it, and CACTVS supports being able to extract data from that field.

Checking now, Xemistry's documentation page Cactvs Toolkit Standard Property Definition Summary says that E_JSTDATA configures JST extended molfile data in skip section.

Elsewhere in that page it's clear that JST means the Japan Science and Technology Agency.

JICST extended MOLfile

Some judicious web searches later and I found Conversion of the data of JICST Chemical Dictionary Database into MOLfile, January 1997, Journal of Information Processing and Management 39(12):939-949 by Chikako MAEDA, Toru SOMA, and Takashi HORIE. In Japanese. Which I can't read.

However, the PDF is available, and it contains an abstract in English:

JICST Chemical Dictionary Database is a chemical substance registry file which JICST has originally constructed mainly for organic compounds. Registration of data started in 1986 and by the end of 1996, about 750 thousand compounds were registered in its own format. Nowadays, the progress in computer network and computing performance has changed the information environment of a researcher, and it is essential to convert the data of the Chemical Dictionary Database into “standard” format, which enables the data to be exchanged with many kinds of chemical software. Therefore, we converted the present format of the Chemical Dictionary Database into MOLfile developed by MDL Information Systems, Inc, and then loaded it on workstation to construct a client/server type retrieval system.

as well as an example, for trans-cinnamic acid!

example JST molfile

I've transcribed it by hand.

00002024I
JEctconv1EMMDDYYHHmm2D
JICST extension molfile 1.0
 13 13  0  0  0  0  0  0  0  0  6 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  4  0  0  0  0  0  0  r  0  0
    0.0000    0.0000    0.0000 C   0  0  0  1  0  4  0  0  0  0  0  0  c  5  0
    0.0000    0.0000    0.0000 C   0  0  0  1  0  4  0  0  0  0  0  0  r  0  0
    0.0000    0.0000    0.0000 C   0  0  0  1  0  4  0  0  0  0  0  0  r  0  0
    0.0000    0.0000    0.0000 C   0  0  0  1  0  4  0  0  0  0  0  0  c  5  0
    0.0000    0.0000    0.0000 C   0  0  0  1  0  4  0  0  0  0  0  0  r  0  0
    0.0000    0.0000    0.0000 C   0  0  0  1  0  4  0  0  0  0  0  0  r  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  4  0  0  0  0  0  0  c  0  0
    0.0000    0.0000    0.0000 C   0  0  0  1  0  4  0  0  0  0  0  0  r  0  0
    0.0000    0.0000    0.0000 O   0  0  0  0  0  2  0  0  0  0  0  0  c  0  0
    0.0000    0.0000    0.0000 O   0  0  0  0  0  2  0  0  0  0  0  0  c  0  0
    0.0000    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0  c  0  0
    0.0000    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0  c  0  0
  1  4  4  0  0  0  0
  1  3  4  0  0  0  0
  1  2  1  0  0  0  0
  2  5  2  0  0  0  0
  2 12  1  0  0  0  0
  3  6  4  0  0  0  0
  4  7  4  0  0  0  0
  5 13  1  0  0  0  0
  5  8  1  0  0  0  0
  6  9  4  0  0  0  0
  7  9  4  0  0  0  0
  8 10  4  0  0  0  0
  8 11  4  0  0  0  0
S  SKP  4
DISO 0
DION 0
RISO 0
OPTR 0
M  END

While I think I typed in it correctly, RDKit refuses to parse it, saying non-ring atom 7 marked aromatic. I haven't been able to diagnose it. I will observe that bond type 4 is documented as for SSS queries only, and let you figure out the rest of what's going on.

A close examination shows 1) a non-empty comment line (the third line), 2), the atom block has three additional fields compared to a normal record, and 3) the SKP section with four lines in it. Further, the data in the skip block cannot be confused with other lines in the property definition block, so all toolkits should simply ignore them.

I'm pretty sure that the CACTVS property A_JSTDATA (Extended fields in JST-type molfile) refers to these three additional atom fields.

In English?

I was able to use ConvertIO's online Japanese OCR tool to convert the PDF into a MS Word-formatted file, then Google Translate to convert the result to English. The resulting document was suprisingly readable - great job you programmers!

Here's the main reason for extending the molfile:

Therefore, in the conversion from to MOLfile, some items were added to MOLfile so that all necessary items could be migrated, and JICST extended MOLfile was defined. This JICST extended MOLfile is expected to be used as a mass file for identifying chemical structures in data registration systems that will be reconstructed in the future.

…The last three items in the Comment Block and Atom Block are the "extended" parts added to the original MOLfile.

The third-to-last column in the atom block appears to be r if the atom is in a ring, or c for chain. I haven't figured out the other two.

And I haven't figured out what the DISO, DION, RISO, and OPTR lines are supposed to mean.

There's also a few mentions of the use of CACTVS, including:

For coordinate information, we requested the same university to create a tool for batch processing based on the function of CACTVS csed4) created by the University of Erlangen in Germany, and used this tool to generate coordinates for most structures.

I think that's pretty solid evidence tying the aforementioned CACTVS functionality to the needs of JST.

Do you know what SKP is for?

My main guess is that the SKP block was added in order to support backwards-compatible extensions. It gives the option to add significantly incomatible structure in a future version, so long as that data can be safely ignored other programs.

However, that is just a guess. If you know something more definite, email me!

from Planet Python
via read more

Daily Python

Monday, September 21, 2020

Andrew Dalke: Molfile "S SKP"