regsys
Differences
This shows you the differences between two versions of the page.
regsys [2012/10/15 10:43] – sanmark | regsys [2013/02/27 08:10] – rkiss | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== | + | ====== |
- | The mcule structure | + | The mcule database is curated by **MAC (Mcule Advanced Curation)** that involves a rigorous molecule |
- | **Key features:** high level data curation, stereochemical standardization, | + | ==== Quality is important ==== |
- | ===Registration challenges=== | + | The design of screening libraries and the development of predictive drug discovery models **all start with a high quality database**. Chemical correctness is crucial because mis-drawn and imperfectly defined structures result in incorrect models, misleading predictions and inconsistent hits. Problematic structures should therefore be eliminated at the earliest possible stage from a drug discovery pipeline. |
+ | |||
+ | The mcule structure registration system is primarily designed to correctly handle chemical structures coming from different data sources, mainly from chemical suppliers, and load the structures into the mcule database. This is a non-trivial task which requires a careful structure check and preparation procedure. To reach a high curation level, the registration system should ensure database quality in terms of structure correctness, | ||
+ | |||
+ | **All molecules with an MCULE ID have been processed by MAC**. User-uploaded molecules are not processed by MAC by default. We plan to enable this option in the future. | ||
+ | |||
+ | **Key features:** high level data curation, stereochemical standardization, | ||
+ | |||
+ | ==== Registration challenges | ||
Line 12: | Line 20: | ||
Primary data sources of the mcule database are chemical supplier databases. Compounds from different supplier catalogs are often represented by different structure drawing standards. Some of these non-standard representations (e.g. salts, organometallic complexes and functional groups) can lead to difficulties e.g. during structure novelty check. Correct interpretation of suppliers’ stereochemical notations is also crucial. This is probably the most problematic area since the [[http:// | Primary data sources of the mcule database are chemical supplier databases. Compounds from different supplier catalogs are often represented by different structure drawing standards. Some of these non-standard representations (e.g. salts, organometallic complexes and functional groups) can lead to difficulties e.g. during structure novelty check. Correct interpretation of suppliers’ stereochemical notations is also crucial. This is probably the most problematic area since the [[http:// | ||
- | To interpret stereochemistry correctly, one should be aware of both the [[http:// | + | To interpret stereochemistry correctly, one should be aware of both the [[http:// |
==Need for data curation== | ==Need for data curation== | ||
- | Input structures might contain many kinds of problems. If these problems are not analyzed and errors are not corrected, misdrawn or insufficiently defined structures with e.g. incorrect valence states could enter into the final database. It is important to mention | + | Input structures might contain many kinds of problems. If these problems are not analyzed and errors are not corrected, misdrawn or insufficiently defined structures with e.g. incorrect valence states could enter the final database. It is important to note that not only should |
==Checking structure novelty== | ==Checking structure novelty== | ||
Line 24: | Line 32: | ||
Common forms of tautomerism can be detected with a rule-based system, but perceiving less common tautomeric forms remains a problem even for experts. The stability of potential tautomeric forms cannot be estimated well, and the lack of appropriate computational methods can only be compensated for by experience. | Common forms of tautomerism can be detected with a rule-based system, but perceiving less common tautomeric forms remains a problem even for experts. The stability of potential tautomeric forms cannot be estimated well, and the lack of appropriate computational methods can only be compensated for by experience. | ||
- | Correct handling of stereoisomerism is associated with correct tautomer detection. Without the correct identification of mobile hydrogens, the symmetries of the structure can be underestimated. In addition, as the complexity of the stereo representation increases, the detection of identical isomers becomes more difficult. Here you are facing | + | Correct handling of stereoisomerism is associated with correct tautomer detection. Without the correct identification of mobile hydrogens, the symmetries of the structure can be underestimated. In addition, as the complexity of the stereo representation increases, the detection of identical isomers becomes more difficult. Here one is faced with a normalization problem. | ||
===== Registration step types ===== | ===== Registration step types ===== | ||
Line 30: | Line 38: | ||
The structure registration process involves many structure check & preparation steps, many novelty check algorithms, and a component separation algorithm in a fixed sequential order (not exactly in the order listed in this documentation). The number of distinct registration steps in the system is more than 80. | The structure registration process involves many structure check & preparation steps, many novelty check algorithms, and a component separation algorithm in a fixed sequential order (not exactly in the order listed in this documentation). The number of distinct registration steps in the system is more than 80. | ||
- | Steps are primarily classified by their function. Structure check steps do not modify the structures, while preparation steps do. Preparations can be further grouped | + | Steps are primarily classified by their function. Structure check steps do not modify the structures, while preparation steps do. Preparations can be further grouped |
^Step ^Function ^ | ^Step ^Function ^ | ||
Line 38: | Line 46: | ||
|Structure correction |Fixing some errors, remove problematic structural parts | | |Structure correction |Fixing some errors, remove problematic structural parts | | ||
- | By design, the registration system not only filters out clearly wrong structures but also tries to detect potentially incorrect ones that cannot be handled automatically. The latter are not registered and await further manual correction and validation. Analyzing these registration cases can help to continuously | + | By design, the registration system not only filters out clearly wrong structures but also tries to detect potentially incorrect ones that cannot be handled automatically. The latter are not registered and await further manual correction and validation. Analyzing these registration cases can help in continuously | ||
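The distinction between check steps (which never modify a structure) and preparation steps (which do), run in a fixed order, can be sketched as follows. This is only an illustration under stated assumptions: the `Structure` record, the three example steps and the `run_pipeline` function are hypothetical and are not the actual mcule steps.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Structure:
    # Hypothetical stand-in for a registered structure record.
    name: str
    net_charge: int = 0
    has_radical: bool = False


def check_no_radical(s: Structure) -> Optional[str]:
    # Check step: inspects the structure, never modifies it.
    return "free radical" if s.has_radical else None


def check_neutral(s: Structure) -> Optional[str]:
    # Check step: reports a problem for charged entries.
    return "non-zero net charge" if s.net_charge != 0 else None


def prep_strip_name(s: Structure) -> Structure:
    # Preparation step: returns a modified copy of the structure.
    return replace(s, name=s.name.strip())


def run_pipeline(
    s: Structure, steps: List[Tuple[str, Callable]]
) -> Tuple[Structure, List[str]]:
    """Run ("check" | "prep", function) steps in their fixed order.

    Returns the prepared structure and the list of detected problems.
    An empty problem list means the structure could be registered; otherwise
    it would be held back for manual correction and validation.
    """
    problems: List[str] = []
    for kind, step in steps:
        if kind == "check":
            problem = step(s)
            if problem is not None:
                problems.append(problem)
        else:
            s = step(s)
    return s, problems
```

Keeping check steps free of side effects means they can be reordered or added without changing what gets registered; only the preparation steps affect the stored structure.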
===== Process outline ===== | ===== Process outline ===== | ||
- | The whole registration process can be divided into seven different stages. It begins with the revision of stereo configurations, | + | The whole registration process can be divided into seven different stages. It begins with the revision of stereo configurations, |
|Stage A |Enforcing [[stereonotations|standard stereo representation]]; | |Stage A |Enforcing [[stereonotations|standard stereo representation]]; | ||
Line 55: | Line 63: | ||
==== Stage A. Stereo clean up ==== | ==== Stage A. Stereo clean up ==== | ||
- | //Summary: enforcing standard stereo representation; | + | //Summary: enforcing standard stereo representation; |
The registration system can be flexibly configured to interpret input stereo configurations according to the available information received from the chemical supplier. In the absence of such information we keep only the reliable parts of the configuration and remove all the others. | The registration system can be flexibly configured to interpret input stereo configurations according to the available information received from the chemical supplier. In the absence of such information we keep only the reliable parts of the configuration and remove all the others. | ||
Line 61: | Line 69: | ||
During the registration of supplier catalogs or external libraries we always use the following procedure: | During the registration of supplier catalogs or external libraries we always use the following procedure: | ||
- | * we contact the compound supplier and ask questions | + | * we contact the compound supplier and inquire |
- | * based on the answers we configure the registration system to use a proper stereo clean-up schema | + | * based on the answers we configure the registration system to use the proper stereo clean-up schema |
* we try to store as many details about stereochemistry as possible, but keep reliable information only | * we try to store as many details about stereochemistry as possible, but keep reliable information only | ||
* the applied stereo interpretation rules are always confirmed by the chemical supplier | * the applied stereo interpretation rules are always confirmed by the chemical supplier | ||
Line 76: | Line 84: | ||
//Summary: product integrity and structure checks, functional group standardization, | //Summary: product integrity and structure checks, functional group standardization, | ||
- | This registration stage aims at the elimination/ | + | This registration stage aims at the elimination/ | ||
In this section, we list some of the most important check and preparation steps grouped by their purpose and complemented with some examples. | In this section, we list some of the most important check and preparation steps grouped by their purpose and complemented with some examples. | ||
Line 82: | Line 90: | ||
==Constitution check== | ==Constitution check== | ||
- | Formal charges and valence states are checked, and cases where the placement of hydrogens is ambiguous are detected. Where the valence state cannot be determined automatically, our system requires explicit hydrogens. This mainly | + | Formal charges and valence states are checked, and cases where the placement of hydrogens is ambiguous are detected. Where the valence state cannot be determined automatically, our system requires explicit hydrogens. This happens | ||
+ | |||
+ | {{ : | ||
==Configuration check== | ==Configuration check== | ||
- | The stereo configuration in the input structures can be incorrect or ambiguous. In the case of tetrahedral configurations, wedge bonds denote the configuration around stereocenters. Wedge bonds can be problematic in the following cases | + | The stereo configuration in the input structures can be incorrect or ambiguous. In the case of tetrahedral configurations, wedge bonds denote the configuration around stereocenters. Wedge bonds can be problematic in the following cases: | ||
- | * Unordered List Itemthey | + | * they are drawn to atoms that are not stereocenters |
* they have wrong direction (wide end points to stereocenter) | * they have wrong direction (wide end points to stereocenter) | ||
* they are drawn between stereocenters indicating perspective drawing | * they are drawn between stereocenters indicating perspective drawing | ||
Line 94: | Line 104: | ||
Similar problems can arise in case of cis/trans configurations that are also detected by the system. You can get further information about stereo drawing rules including the proper geometry of wedge bonds in the IUPAC documentation of [[http:// | Similar problems can arise in case of cis/trans configurations that are also detected by the system. You can get further information about stereo drawing rules including the proper geometry of wedge bonds in the IUPAC documentation of [[http:// | ||
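The wedge bond problems listed above (a wedge drawn to non-stereocenters, a wedge with the wide end at the stereocenter, a wedge drawn between two stereocenters) can be sketched as a simple check. The `WedgeBond` record and the `wedge_problems` function are hypothetical names used for illustration; the set of stereocenters is assumed to be known from an earlier perception step.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class WedgeBond:
    narrow_atom: int  # atom index at the narrow end of the wedge
    wide_atom: int    # atom index at the wide end of the wedge


def wedge_problems(bond: WedgeBond, stereocenters: Set[int]) -> List[str]:
    """Return the problems found for one wedge bond (empty list if none)."""
    narrow_is_center = bond.narrow_atom in stereocenters
    wide_is_center = bond.wide_atom in stereocenters

    if not narrow_is_center and not wide_is_center:
        return ["wedge drawn to atoms that are not stereocenters"]
    if wide_is_center and not narrow_is_center:
        return ["wrong direction: wide end points to the stereocenter"]
    if narrow_is_center and wide_is_center:
        return ["wedge drawn between two stereocenters"]
    # Narrow end at a stereocenter, wide end at an ordinary atom: acceptable.
    return []
```

A check like this only reports problems; deciding whether a flagged wedge can be repaired automatically is a separate preparation step.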
+ | |||
+ | {{ : | ||
==Product integrity check== | ==Product integrity check== | ||
Line 102: | Line 114: | ||
Even if the SDF is correct and all information is represented within the chemical structure, there is still a possibility that the structure is insufficiently specified or misdrawn. In certain cases we can detect such structures. For example, missing or extra hydrogens can be detected for special structural patterns. Moreover, purchasable product entries should have a net zero charge. A charged product usually indicates a missing or an extra counterion. | Even if the SDF is correct and all information is represented within the chemical structure, there is still a possibility that the structure is insufficiently specified or misdrawn. In certain cases we can detect such structures. For example, missing or extra hydrogens can be detected for special structural patterns. Moreover, purchasable product entries should have a net zero charge. A charged product usually indicates a missing or an extra counterion. | ||
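The net-charge rule can be illustrated with a minimal sketch. It assumes a product is given as a list of (formal charge, multiplicity) pairs, one per component; this representation and the function names are illustrative assumptions, not the mcule data model.

```python
from typing import Iterable, Optional, Tuple


def net_charge(components: Iterable[Tuple[int, int]]) -> int:
    """Sum charge * multiplicity over all components of a product."""
    return sum(charge * count for charge, count in components)


def charge_problem(components: Iterable[Tuple[int, int]]) -> Optional[str]:
    """Return a problem description for charged products, else None."""
    total = net_charge(components)
    if total == 0:
        return None
    return f"net charge {total:+d}: possibly a missing or extra counterion"
```

For example, a dication with two monoanions, `[(+2, 1), (-1, 2)]`, passes the check, while a lone cation `[(+1, 1)]` is flagged.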
+ | |||
+ | {{ : | ||
+ | {{ : | ||
==Functional group check & standardization== | ==Functional group check & standardization== | ||
In these steps common functional groups such as nitro and azide groups are transformed to their neutral form. This standardization is necessary to get all relevant results from a [[substructuresearch|substructure search]]. Besides standardization, | In these steps common functional groups such as nitro and azide groups are transformed to their neutral form. This standardization is necessary to get all relevant results from a [[substructuresearch|substructure search]]. Besides standardization, | ||
+ | |||
+ | {{ : | ||
==Enforce standard salt & organometallic compound representation== | ==Enforce standard salt & organometallic compound representation== | ||
In the mcule database salts and organometallic complexes should be represented as disconnected and connected, respectively. Component separation is performed in the next registration stage, where typical counterions are separated automatically. Cases where salts cannot be distinguished from organometallic complexes cannot be processed automatically and are marked as problematic. Disconnected organometallic complexes where the reconnection of metals cannot be performed automatically are also marked as problematic. | In the mcule database salts and organometallic complexes should be represented as disconnected and connected, respectively. Component separation is performed in the next registration stage, where typical counterions are separated automatically. Cases where salts cannot be distinguished from organometallic complexes cannot be processed automatically and are marked as problematic. Disconnected organometallic complexes where the reconnection of metals cannot be performed automatically are also marked as problematic. | ||
+ | |||
+ | {{ : | ||
==Elimination of undesirable structures== | ==Elimination of undesirable structures== | ||
In the mcule database free radicals and isotopes are currently not supported. These structures have less relevance in drug discovery and are undesirable in virtual screening. Moreover, they are not well supported by some of the cheminformatics tools that are implemented in the mcule system. In this step we detect such structures and prevent their registration. | In the mcule database free radicals and isotopes are currently not supported. These structures have less relevance in drug discovery and are undesirable in virtual screening. Moreover, they are not well supported by some of the cheminformatics tools that are implemented in the mcule system. In this step we detect such structures and prevent their registration. | ||
+ | |||
+ | {{ : | ||
==SDF checks== | ==SDF checks== | ||
Line 124: | Line 145: | ||
In this stage we separate components of the incoming structure. In common salts counterions can be disconnected and separated from the main component automatically. Bonds to the main component are deleted and proper charges are placed on both components. | In this stage we separate components of the incoming structure. In common salts counterions can be disconnected and separated from the main component automatically. Bonds to the main component are deleted and proper charges are placed on both components. | ||
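A minimal sketch of this separation on a toy molecular graph is shown below. It assumes that only single bonds to alkali metals are treated as salt bonds; the `ALKALI_METALS` set, the atom/bond representation and the function name are hypothetical simplifications, and a real system would need a far richer rule set.

```python
from typing import List, Sequence, Tuple

# Assumption for this sketch: bonds to these metals are drawn-in salt bonds.
ALKALI_METALS = {"Li", "Na", "K"}


def separate_components(
    symbols: Sequence[str],
    charges: Sequence[int],
    bonds: Sequence[Tuple[int, int]],
) -> Tuple[List[List[int]], List[int]]:
    """Delete bonds to counterion metals, fix charges, split into components.

    Returns (components as sorted atom-index lists, updated formal charges).
    """
    new_charges = list(charges)
    kept_bonds = []
    for a, b in bonds:
        if symbols[a] in ALKALI_METALS:
            metal, partner = a, b
        elif symbols[b] in ALKALI_METALS:
            metal, partner = b, a
        else:
            kept_bonds.append((a, b))
            continue
        # Bond is deleted; the metal becomes a cation, its partner an anion.
        new_charges[metal] += 1
        new_charges[partner] -= 1

    # Connected components of the remaining bond graph (depth-first search).
    neighbors = {i: set() for i in range(len(symbols))}
    for a, b in kept_bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    components = []
    for start in range(len(symbols)):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            atom = stack.pop()
            component.append(atom)
            for nxt in neighbors[atom]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(sorted(component))
    return components, new_charges
```

For sodium acetate drawn with a covalent Na–O bond (heavy atoms `C, C, O, O, Na` with bonds `(0,1), (1,2), (1,3), (3,4)`), the sketch yields the acetate and sodium components with charges of -1 on the oxygen and +1 on the sodium.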
+ | {{ : | ||
==== Stage D & E. Component registration ==== | ==== Stage D & E. Component registration ==== | ||
//Summary: individual components’ structures are normalized, unique components are registered with new mcule IDs assigned at the tautomer and protonation state independent (D) and dependent (E) levels (steps in the D & E stages are very similar except for novelty check)// | //Summary: individual components’ structures are normalized, unique components are registered with new mcule IDs assigned at the tautomer and protonation state independent (D) and dependent (E) levels (steps in the D & E stages are very similar except for novelty check)// | ||
Line 132: | Line 154: | ||
In the mcule system there are [[stereonotations|four stereo configuration types]]: absolute, relative, racemic and unknown (the “unknown” type is used to denote uncertain configurations, | In the mcule system there are [[stereonotations|four stereo configuration types]]: absolute, relative, racemic and unknown (the “unknown” type is used to denote uncertain configurations, | ||
+ | |||
+ | {{ : | ||
Normalization is needed because certain configurations can be represented with multiple structures and/or [[stereonotations|stereo configuration types]]: replacing configurations around atoms and/or the configuration type can result in stereochemically equivalent structures. This can primarily happen when the configuration is only partially specified, containing atoms with both unknown/ | Normalization is needed because certain configurations can be represented with multiple structures and/or [[stereonotations|stereo configuration types]]: replacing configurations around atoms and/or the configuration type can result in stereochemically equivalent structures. This can primarily happen when the configuration is only partially specified, containing atoms with both unknown/ | ||
- | ==Novelty | + | ==Component novelty |
- | The main novelty check step is performed in stage D, aiming | + | The main novelty check step is performed in stage D, focusing on the identification of different tautomer forms and protonation states of the same compound. In stage E different tautomers and protonation states are treated and registered as different structures. | ||
- | In the mcule registration system the novelty check of the individual components is based on non-standard IUPAC InChI identifiers[*]. The InChI software performs a lot of normalization steps and can detect common forms of tautomerism. It can also perceive protonation states of the same compounds in most cases. InChI strings therefore serve as a good starting point of novelty check. Different structures with identical InChIs can be considered as different representations of the same compound. | + | In the mcule registration system the novelty check of the individual components is based on non-standard IUPAC InChI identifiers. The InChI software performs a lot of normalization steps and can detect common forms of tautomerism. It can also perceive protonation states of the same compounds in most cases. InChI strings therefore serve as a good starting point of novelty check. Different structures with identical InChIs can be considered as different representations of the same compound. |
In stage D we use a novelty check algorithm that is based on the InChI strings but can detect an even broader set of potential tautomers than a simple InChI comparison. The system is capable of fully preventing the registration of duplicates as long as they are prototopic tautomers. | In stage D we use a novelty check algorithm that is based on the InChI strings but can detect an even broader set of potential tautomers than a simple InChI comparison. The system is capable of fully preventing the registration of duplicates as long as they are prototopic tautomers. | ||
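The basic idea of an identifier-based novelty check, before any extended tautomer detection, can be sketched as a lookup keyed by the identifier string. The `ComponentRegistry` class and the `COMP-n` ID format are hypothetical, and the sketch treats the identifier as an opaque string: generating it (here, with non-standard InChI options) is assumed to happen elsewhere.

```python
from typing import Dict, Tuple


class ComponentRegistry:
    """Toy registry: one ID per distinct identifier string."""

    def __init__(self) -> None:
        self._id_by_key: Dict[str, str] = {}

    def register(self, identifier: str) -> Tuple[str, bool]:
        """Return (component ID, is_new) for the given identifier.

        Structures that produce the same identifier are treated as
        different representations of the same compound and share an ID.
        """
        if identifier in self._id_by_key:
            return self._id_by_key[identifier], False
        # Hypothetical ID format, used only for this sketch.
        new_id = f"COMP-{len(self._id_by_key) + 1}"
        self._id_by_key[identifier] = new_id
        return new_id, True
```

Registering `"InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"` (the standard InChI of ethanol, used here only as sample input) twice returns the same ID, with `is_new` set to `False` the second time.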
+ | {{ : | ||
+ | ==== Stage F & G. Multicomponent structure registration ==== | ||
+ | //Summary: additional checks are performed, component types are assigned, and unique structures are registered with new mcule IDs assigned at the tautomer and protonation state independent (F) and dependent (G) levels (steps in the F & G stages are very similar)// | ||
+ | |||
+ | ==Novelty check== | ||
+ | |||
+ | Novelty check of multicomponent entries is based on component identity and component multiplicity. Two structures are treated as identical when they contain identical components with the same multiplicities. | ||
+ | |||
+ | This novelty check method needs some checks and preparations. The number of identical components (multiplicity) should be reduced to the lowest possible value when the repeated copies carry no additional information. The presence of multiple components can indicate relative / racemic stereochemistry according to the IUPAC recommendations. These cases should be identified. | ||
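Comparing multicomponent entries by component identity and multiplicity can be sketched with a multiset. The reduction rule used here (dividing all multiplicities by their greatest common divisor) is an assumption chosen for illustration, and it deliberately ignores the stereochemistry-related cases mentioned above, where repeated components do carry information. Component IDs are treated as opaque strings.

```python
from collections import Counter
from functools import reduce
from math import gcd
from typing import Dict, Iterable


def normalize_multiplicities(component_ids: Iterable[str]) -> Dict[str, int]:
    """Count components and reduce the counts to their lowest common ratio."""
    counts = Counter(component_ids)
    # gcd(0, n) == n, so an initial value of 0 also covers the empty case.
    divisor = reduce(gcd, counts.values(), 0)
    return {cid: n // divisor for cid, n in counts.items()}


def same_structure(a: Iterable[str], b: Iterable[str]) -> bool:
    """Two entries match if they have the same components in the same ratio."""
    return normalize_multiplicities(a) == normalize_multiplicities(b)
```

With this rule an entry drawn as two cations and two anions matches one drawn as a single ion pair, while a 2:1 salt remains distinct from a 1:1 salt.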
+ | |||
+ | It is also important to identify contaminants. They should be eliminated and stored as a property of the product. Products with different contaminants are still related to the same compound. | ||
+ | |||
+ | ==Component type assignment== | ||
+ | |||
+ | After the novelty check, components are analyzed. The system can identify counterions and potential solvents. In the latter case registration is problematic. For example, water can be crystal water or can denote the solvent. Depending on its role it should be removed from or retained in the structure. Also, the deprotonated form of water (hydroxide) can serve as a counterion. | ||
+ | |||
+ | In most cases the system is also capable of identifying the main components, which can serve as the input set for virtual screens. | ||
+ | |||
+ | You can see below the index page of compound [[https:// | ||
(uncertainty is marked by a crossed double bond). Counterions are marked, and component multiplicities are assigned correctly by the system. | ||
+ | |||
+ | {{: |
regsys.txt · Last modified: 2013/10/19 11:36 by rkiss