The mcule structure registration system

The mcule structure registration system is primarily designed to handle chemical structures coming from different data sources, mainly from chemical suppliers, and load the structures into the mcule database. This is a non-trivial task which requires a careful structure check and preparation procedure. To reach a high curation level, the registration system should ensure database quality in terms of structure correctness, uniqueness and reliability as well as maintain a high level of data standardization.

Key features: high level data curation, stereochemical standardization, robust novelty check and isomer detection, handling salts & organometallics

Registration challenges

Standardization

The primary data sources of the mcule database are chemical supplier databases. Compounds from different supplier catalogs are often drawn according to different structure drawing standards. Some of these non-standard representations (e.g. of salts, organometallic complexes and functional groups) can cause difficulties, for example during the structure novelty check. Correct interpretation of suppliers’ stereochemical notations is also crucial. This is probably the most problematic area, since the IUPAC stereo recommendations have been shown to be very difficult to implement in a cheminformatics system.

To interpret stereochemistry correctly, one should be aware of both the IUPAC convention on stereo drawing and the stereo specification of the SD file (the file format used by most chemical suppliers to store chemical structures). Moreover, there are cases where both of these conventions are violated, and the registration system needs to handle such cases as well. Therefore, chemical data from different data sources should be analyzed very carefully, and stereo configurations need to be cleaned up to prevent the use of unreliable information and the misinterpretation of non-standard notations.
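As an illustration of what reading the SD file stereo specification involves, the sketch below parses bond lines from a V2000 molfile bond block and flags single bonds whose stereo code is "either" or unrecognized. The function names and the choice of what to flag are assumptions made for this example; they are not taken from the mcule system.

```python
from typing import List, Tuple

# Stereo codes for single bonds in the V2000 bond block:
# 0 = not stereo, 1 = up (wedge), 4 = either (wavy), 6 = down (hash).
SINGLE_BOND_STEREO = {0: "none", 1: "up", 4: "either", 6: "down"}


def parse_bond_line(line: str) -> Tuple[int, int, int, int]:
    """Read the first four 3-character fields of a V2000 bond line."""
    first_atom = int(line[0:3])
    second_atom = int(line[3:6])
    bond_type = int(line[6:9])
    stereo_field = line[9:12].strip()
    stereo = int(stereo_field) if stereo_field else 0
    return first_atom, second_atom, bond_type, stereo


def unreliable_stereo_bonds(bond_lines: List[str]) -> List[Tuple[int, int]]:
    """Return atom pairs of single bonds with ambiguous stereo codes.

    Treating "either" and unknown codes as unreliable is an assumption
    made for this sketch, not a documented mcule rule.
    """
    flagged = []
    for line in bond_lines:
        a, b, bond_type, stereo = parse_bond_line(line)
        label = SINGLE_BOND_STEREO.get(stereo, "unknown")
        if bond_type == 1 and label in ("either", "unknown"):
            flagged.append((a, b))
    return flagged


bond_block = [
    "  1  2  1  1",  # single bond, wedge up
    "  2  3  1  4",  # single bond, either (wavy)
    "  2  4  2  0",  # double bond, no stereo code
]
flagged = unreliable_stereo_bonds(bond_block)
```

Here only the bond between atoms 2 and 3 is flagged. A real system would also need the atom coordinates and the direction of each wedge to derive configurations.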

Need for data curation

Input structures might contain many kinds of problems. If these problems are not analyzed and errors are not corrected, misdrawn or insufficiently defined structures with e.g. incorrect valence states could enter into the final database. It is important to mention that not only the obviously wrong structures should be eliminated, but also the unreliable, or just potentially misdrawn ones. Considering the diversity of suppliers’ libraries and the number of potential problems that can arise, a carefully designed registration system is necessary to cope with as many error types as possible.
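A valence check is one of the simplest examples of such a structure check. The following is a minimal sketch, assuming a molecule is given as a list of element symbols plus a list of bonds; the `MAX_VALENCE` table and the `valence_errors` function are hypothetical and far simpler than what a production system would need (charges, radicals and hypervalent atoms are ignored).

```python
from typing import Dict, List, Tuple

# Simplified maximum valences for a few neutral atoms (assumption for
# this sketch; charged and hypervalent atoms are not handled).
MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1}


def valence_errors(
    atoms: List[str], bonds: List[Tuple[int, int, int]]
) -> List[str]:
    """Return one message per atom whose bond order sum is too high.

    atoms: element symbols, indexed by position
    bonds: (atom index, atom index, bond order) triples
    """
    totals: Dict[int, int] = {i: 0 for i in range(len(atoms))}
    for a, b, order in bonds:
        totals[a] += order
        totals[b] += order

    errors = []
    for i, symbol in enumerate(atoms):
        limit = MAX_VALENCE.get(symbol)
        if limit is None:
            # Unknown elements are not rejected outright but reported.
            errors.append(f"atom {i} ({symbol}): unknown element")
        elif totals[i] > limit:
            errors.append(
                f"atom {i} ({symbol}): valence {totals[i]} exceeds {limit}"
            )
    return errors


# Heavy atoms of formaldehyde (C=O) pass the check.
ok = valence_errors(["C", "O"], [(0, 1, 2)])

# A carbon with five single bonds is flagged.
bad = valence_errors(
    ["C", "F", "F", "F", "F", "F"],
    [(0, 1, 1), (0, 2, 1), (0, 3, 1), (0, 4, 1), (0, 5, 1)],
)
```

The check only tests an upper bound, so structures with implicit hydrogens pass as long as the explicit bonds do not exceed the limit.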

Checking structure novelty

This is a very challenging task that cannot be solved perfectly in automated systems. Key problems are tautomerism and stereoisomerism.

Common forms of tautomerism can be detected with a rule-based system, but perceiving less common tautomeric forms remains a problem even for experts. The stability of potential tautomeric forms cannot be estimated well, and the lack of appropriate computational methods can only be compensated for by experience.

Correct handling of stereoisomerism depends on correct tautomer detection. Without the correct identification of mobile hydrogens, the symmetry of a structure can be underestimated. In addition, as the complexity of the stereo representation increases, detecting identical isomers becomes more difficult. This is essentially a normalization problem.

Registration step types

The structure registration process involves many structure check and preparation steps, many novelty check algorithms, and a component separation algorithm, all applied in a fixed sequential order (not necessarily the order in which they are listed in this documentation). The system comprises more than 80 distinct registration steps.

Steps are primarily classified by their function. Structure check steps do not modify the structures, while preparation steps do. Preparation steps can be further grouped into (i) standardization, (ii) normalization and (iii) structure correction steps. Standardization steps modify the notation used to represent a given structure, while normalization steps keep the notation intact but transform between equivalent structures. Both standardization and normalization make structures uniform and prepare them for the novelty check.

Step	Function
Structure check	Prevent registration of incorrect, misdrawn or undesirable structures
Structure standardization	Change the notation system; structure representation is changed, structures are not
Structure normalization	Select a representative structure from equivalent structures; structure is changed, notation system is kept intact
Structure correction	Fix certain errors; remove problematic structural parts
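The difference between check steps and preparation steps can be sketched in Python as follows. This is only an illustration: `StepKind`, `Step` and `run_steps` are hypothetical names, structures are represented as plain SMILES strings, and the string replacement used for the nitro group is a stand-in for a real graph-based transformation. Which nitro notation is treated as the standard one is an arbitrary choice here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, List, Optional, Tuple


class StepKind(Enum):
    CHECK = "structure check"            # never modifies the structure
    STANDARDIZATION = "standardization"  # changes the notation only
    NORMALIZATION = "normalization"      # picks a representative structure
    CORRECTION = "correction"            # fixes or removes problematic parts


@dataclass
class Step:
    name: str
    kind: StepKind
    # A check returns True/False; a preparation returns the new structure.
    func: Callable[[str], Any]


def run_steps(
    structure: str, steps: List[Step]
) -> Tuple[Optional[str], List[str]]:
    """Apply steps in a fixed order; return (structure or None, log)."""
    log: List[str] = []
    for step in steps:
        if step.kind is StepKind.CHECK:
            if not step.func(structure):
                log.append(f"rejected by {step.name}")
                return None, log
        else:
            new_structure = step.func(structure)
            if new_structure != structure:
                log.append(f"{step.name}: {structure} -> {new_structure}")
            structure = new_structure
    return structure, log


# Two toy steps (illustrative assumptions, not actual mcule steps).
steps = [
    Step("non-empty check", StepKind.CHECK, lambda s: bool(s.strip())),
    Step(
        "nitro group notation",
        StepKind.STANDARDIZATION,
        lambda s: s.replace("N(=O)=O", "[N+](=O)[O-]"),
    ),
]

result, log = run_steps("CN(=O)=O", steps)
```

Keeping the step kind explicit makes it possible to verify that check steps never alter a structure and to log every change made by a preparation step.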

By design, the registration system not only filters out clearly wrong structures but also tries to detect potentially incorrect ones that cannot be handled automatically. The latter are not registered and await further manual correction and validation. Analyzing these cases helps to continuously improve the registration system. Adding new rules increases the level of automation and decreases the need for manual curation.

Process outline

The whole registration process can be divided into seven stages. It begins with the revision of stereo configurations and the structure check/preparation steps (stages A and B), followed by component separation (stage C). Thereafter, component uniqueness is checked and mcule IDs are assigned (stages D and E). This is performed both with and without considering tautomerism and protonation, resulting in the assignment of tautomer and protonation state independent compound identifiers (stage D) as well as tautomer and protonation state dependent structure identifiers (stage E). Finally, based on component identity, multicomponent entries are also registered at both the tautomer and protonation state independent (stage F) and dependent (stage G) levels.

Stage A	Enforcing standard stereo representation; non-standard stereo notations are changed, unreliable parts of stereo configurations are removed (after consulting the chemical supplier)
Stage B	Product integrity and structure checks, functional group standardization, enforcing proper organometallic & salt representation, removing undesirable structures
Stage C	Common counterions are disconnected and disconnected components are separated
Stage D & E	Components are normalized, unique components are registered and new mcule IDs assigned at both compound (D) and structure (E) levels
Stage F & G	Full multicomponent structures are normalized, unique structures are registered with new mcule IDs assigned at both compound (F) and structure (G) levels

As a result, input entries as well as their components are registered at two levels: a tautomer and protonation state independent compound level (with tautomer detection) and a tautomer and protonation state dependent structure level (without tautomer detection).
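The two registration levels can be illustrated with a small sketch covering component separation and component registration (roughly stages C to E). Everything here is an assumption made for the example: the `Registry` class, the `C-`/`S-` identifier format, and the `REPRESENTATIVE` lookup table, which stands in for real tautomer and protonation state normalization. Input strings are assumed to be canonical SMILES already.

```python
from itertools import count
from typing import Dict, List, Tuple

# Hypothetical lookup replacing real tautomer/protonation normalization.
# 2-hydroxypyridine and 2-pyridone are tautomers of each other.
REPRESENTATIVE = {"Oc1ccccn1": "O=c1cccc[nH]1"}


def compound_key(component: str) -> str:
    """Tautomer and protonation state independent key (toy version)."""
    return REPRESENTATIVE.get(component, component)


def structure_key(component: str) -> str:
    """Tautomer and protonation state dependent key."""
    return component


class Registry:
    """Returns the existing ID for a known key, otherwise assigns a new one."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        self.ids: Dict[str, str] = {}
        self.counter = count(1)

    def register(self, key: str) -> Tuple[str, bool]:
        if key in self.ids:
            return self.ids[key], False
        new_id = f"{self.prefix}-{next(self.counter)}"
        self.ids[key] = new_id
        return new_id, True


def split_components(entry: str) -> List[str]:
    # Assumes "." only separates disconnected components, i.e. no ring
    # closure digits are shared across the dot.
    return sorted(part for part in entry.split(".") if part)


compounds = Registry("C")   # compound level, stage D analogue
structures = Registry("S")  # structure level, stage E analogue


def register_entry(entry: str) -> List[Tuple[str, str, str]]:
    """Return (component, compound ID, structure ID) for each component."""
    rows = []
    for comp in split_components(entry):
        cid, _ = compounds.register(compound_key(comp))
        sid, _ = structures.register(structure_key(comp))
        rows.append((comp, cid, sid))
    return rows


first = register_entry("Oc1ccccn1.Cl")
second = register_entry("O=c1cccc[nH]1")
```

In this sketch the two tautomers receive the same compound-level identifier but different structure-level identifiers. Multicomponent registration (stages F and G) could be built on top by deriving a key from the sorted component identifiers.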
