regsys

Differences

This shows you the differences between two versions of the page.

regsys [2012/10/15 10:45] – [Stage D & E. Component registration] sanmark
regsys [2013/10/10 16:03] flack
Line 1: Line 1:
-====== The mcule structure registration system ======
+====== Mcule Advanced Curation (MAC) ======
  
-The mcule structure registration system is primarily designed to handle chemical structures coming from different data sources, mainly from chemical suppliers, and load the structures into the mcule database. This is a non-trivial task which requires a careful structure check and preparation procedure. To reach a high curation level, the registration system should ensure database quality in terms of structure correctness, uniqueness and reliability as well as maintain a high level of data standardization.
+The mcule database is curated by **MAC (Mcule Advanced Curation)** that involves a rigorous molecule registration system based on more than 80 structural checks, standardization, preparation and correction steps. MAC guarantees high quality search results and avoids common errors arising from mis-drawn and incorrect structures that can critically affect the quality of computational calculations and the efficiency of experimental results.
  
-**Key features:** high level data curation, stereochemical standardization, robust novelty check and isomer detection, handling salts & organometallics
+**Key features of MAC:** high level data curation, stereochemical standardization, robust novelty check and isomer detection, correct handling of salts & organometallics
  
-===Registration challenges===
+Continue reading for more information about MAC, or check our presentations from the 244th National Meeting of American Chemical Society:
 + 
 +[[http://mcule-blog.s3.amazonaws.com/acs12/mcule_ACS12_Phi_libraries.pdf|Evaluation of data quality in currently available compound libraries (slides)]] 
 + 
 +[[http://mcule-blog.s3.amazonaws.com/acs12/mcule_ACS12_libraries.jpg|Evaluation of data quality in currently available compound libraries (poster)]] 
 + 
 + 
 +==== Quality is important ==== 
 + 
 +The design of screening libraries and the development of predictive drug discovery models **all start with a high quality database**. Chemical correctness is crucial because mis-drawn and imperfectly defined structures result in incorrect models, misleading predictions and inconsistent hits. Problematic structures should therefore be eliminated at the earliest possible stage from a drug discovery pipeline. 
 + 
 +The mcule structure registration system is primarily designed to correctly handle chemical structures coming from different data sources, mainly from chemical suppliers, and load the structures into the mcule database. This is a non-trivial task which requires a careful structure check and preparation procedure. To reach a high curation level, the registration system should ensure database quality in terms of structure correctness, uniqueness and reliability as well as maintain a high level of data standardization. 
 + 
 +**All molecules with an MCULE ID have been processed by MAC**. User uploaded molecules are not processed by MAC by default. We plan to enable this option in future. 
 + 
 +==== Registration challenges ====
  
  
Line 12: Line 27:
 Primary data sources of the mcule database are chemical supplier databases. Compounds from different supplier catalogs are often represented by different structure drawing standards. Some of these non-standard representations (e.g. salts, organometallic complexes and functional groups) can lead to difficulties e.g. during structure novelty check. Correct interpretation of suppliers’ stereochemical notations is also crucial. This is probably the most problematic area since the [[http://pac.iupac.org/publications/pac/pdf/2006/pdf/7810x1897.pdf|IUPAC stereo recommendations]] have been shown to be very difficult to implement in a cheminformatic system.
 
-To interpret stereochemistry correctly, one should be aware of both the [[http://pac.iupac.org/publications/pac/pdf/2006/pdf/7810x1897.pdf|IUPAC convenction on stereo drawing]] and the stereo specification of [[http://accelrys.com/products/informatics/cheminformatics/ctfile-formats/no-fee.php|SD file]] (file format used to store chemical structures by most chemical suppliers). Moreover, there are cases, where both of these convenctions are violated and the registration system need to handle such cases as well. Therefore, chemical data from different data sources should be very carefully analyzed, and stereo configurations need to be cleaned up to prevent the use of unreliable information and the misinterpretation of non-standard notations.
+To interpret stereochemistry correctly, one should be aware of both the [[http://pac.iupac.org/publications/pac/pdf/2006/pdf/7810x1897.pdf|IUPAC convenction on stereo drawing]] and the stereo specification of [[http://accelrys.com/products/informatics/cheminformatics/ctfile-formats/no-fee.php|SD file]] format used to store chemical structures by most chemical suppliers. Moreover, there are cases, where both of these conventions are violated, which the registration system also needs to be prepared to handle. Therefore, chemical data from different data sources should be very carefully analyzed, and stereo configurations need to be cleaned up to prevent the use of unreliable information and the misinterpretation of non-standard notations.
 
 ==Need for data curation==
 
-Input structures might contain many kinds of problems. If these problems are not analyzed and errors are not corrected, misdrawn or insufficiently defined structures with e.g. incorrect valence states could enter into the final database. It is important to mention that not only the obviously wrong structures should be eliminated, but also the unreliable, or just potentially misdrawn ones. Considering the diversity of suppliers’ libraries and the number of potential problems that can arise, a carefully designed registration system is necessary to cope with as many error types as possible.
+Input structures might contain many kinds of problems. If these problems are not analyzed and errors are not corrected, misdrawn or insufficiently defined structures with e.g. incorrect valence states could enter the final database. It is important to note that not only should the obviously wrong structures be eliminated, but also the unreliable, or even potentially misdrawn ones. Considering the diversity of suppliers’ libraries and the number of potential problems that can arise, a carefully designed registration system is necessary to cope with as many error types as possible.
 
 ==Checking structure novelty==
Line 24: Line 39:
 Common forms of tautomerism can be detected with a rule-based system. But perceiving less common tautomer forms remains a problem even for experts. Stability of potential tautomeric forms cannot be well estimated, and the lack of appropriate computational methods can be only replaced with experience.
 
-Correct handling of stereoisomerism is associated with correct tautomer detection. Without the correct identification of mobile hydrogens the symmetries of the structure can be underestimated. In addition, as the complexity of stereo representation increases the detection of identical isomers is getting more difficult. Here you are facing a normalization problem.
+Correct handling of stereoisomerism is associated with correct tautomer detection. Without the correct identification of mobile hydrogens the symmetries of the structure can be underestimated. In addition, as the complexity of stereo representation increases the detection of identical isomers is getting more difficult. Here one is faced with a normalization problem.
 
 ===== Registration step types =====
Line 30: Line 45:
 The structure registration process involves many structure check & preparation steps, many novelty check algorithms, and a component separation algorithm in a fixed sequential order (not exactly in the order as listed in this documentation). The number of distinct registration steps in the system is more than 80.
 
-Steps are primarily classified by their function. Structure check steps do not modify the structures, while preparation steps do. Preparations can be further grouped to (i) standardization, (ii) normalization and (iii) structure correction steps. Standardization steps modify the notations used to represent a given structure, while normalization steps keep the notations intact but do transformations between equivalent structures. These latter two preparation steps both uniformize structures and prepare them for the novelty check.
+Steps are primarily classified by their function. Structure check steps do not modify the structures, while preparation steps do. Preparations can be further grouped into (i) standardization, (ii) normalization and (iii) structure correction steps. Standardization steps modify the notations used to represent a given structure, while normalization steps keep the notations intact but do transformations between equivalent structures. These latter two preparation steps both uniformize structures and prepare them for the novelty check.
 
 ^Step ^Function ^
Line 38: Line 53:
 |Structure correction |Fixing some errors, remove problematic structural parts |
  
-By design, the registration system not only filters out clearly wrong structures but also tries to detect potentially incorrect ones that cannot be handled automatically. The latter ones are not registered, and are awaiting for further manual correction and validation. Analyzing these registration cases can help to continuously improve our registration system. Adding new rules increases the level of automatization and decreases the need for manual curation.
+By design, the registration system not only filters out clearly wrong structures but also tries to detect potentially incorrect ones that cannot be handled automatically. The latter ones are not registered, and are awaiting for further manual correction and validation. Analyzing these registration cases can help in continuously improving our registration system. Adding new rules increases the level of automatization and decreases the need for manual curation.
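
The split between check steps (which leave a structure untouched) and preparation steps (which may rewrite it) can be pictured with a small sketch. This is only an illustration under assumptions: the `Record` class, the `run_pipeline` and `route` functions, and the string-based structure encoding are hypothetical and are not part of the mcule system.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple, Union


@dataclass
class Record:
    """One incoming entry; the structure is a plain string in this sketch."""
    structure: str
    problems: List[str] = field(default_factory=list)


# A check step inspects a record and returns a problem description or None.
CheckStep = Callable[[Record], Optional[str]]
# A preparation step returns a (possibly rewritten) structure string.
PrepStep = Callable[[str], str]
Step = Tuple[str, Union[CheckStep, PrepStep]]


def run_pipeline(record: Record, steps: List[Step]) -> Record:
    """Apply the steps in their fixed order.

    "check" steps never modify the structure, they only collect problems;
    "prep" steps replace the structure with their return value.
    """
    for kind, step in steps:
        if kind == "check":
            problem = step(record)
            if problem is not None:
                record.problems.append(problem)
        else:
            record.structure = step(record.structure)
    return record


def route(record: Record) -> str:
    """Entries with any detected problem wait for manual curation."""
    return "manual_curation" if record.problems else "register"


# Hypothetical steps, chosen only to make the sketch runnable.
STEPS: List[Step] = [
    ("prep", str.strip),
    ("check", lambda r: None if r.structure else "empty structure"),
]
```

Keeping the steps in one ordered list mirrors the fixed sequential order described above, and a new rule is added by appending one more entry.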
  
 ===== Process outline =====
 
-The whole registration process can be divided into seven different stages. It begins with the revision of stereo configurations, structure check/preparation steps (stage A, B) followed by component separation (stage C). Thereafter component uniqueness is checked and mcule IDs are assigned (stage D, E). This is performed with or without considering tautomerism and protonation, resulting the assignation of tautomer and protonation state independent [[mculeid|compound identifiers]] (stage D) as well as tautomer and protonation state dependent [[mculeid|structure identifiers]] (stage E). Finally, based on component identity, multicomponent entries are also registered at both the tautomer and protonation state independent (stage F) and dependent levels (stage G).
+The whole registration process can be divided into seven different stages. It begins with the revision of stereo configurations, structure check/preparation steps (stage A, B) followed by component separation (stage C). Thereafter component uniqueness is checked and mcule IDs are assigned (stage D, E). This is performed with or without considering tautomerism and protonation, resulting the assignment of tautomer and protonation state independent [[mculeid|compound identifiers]] (stage D) as well as tautomer and protonation state dependent [[mculeid|structure identifiers]] (stage E). Finally, based on component identity, multicomponent entries are also registered at both the tautomer and protonation state independent (stage F) and dependent levels (stage G).
 
 |Stage A |Enforcing [[stereonotations|standard stereo representation]]; non-standard stereo notations are changed, unreliable part of stereo configurations is removed (after consulting with chemical supplier) |
Line 55: Line 70:
  
 ==== Stage A. Stereo clean up ====
-//Summary: enforcing standard stereo representation; non-standard stereo notations[*] are corrected, unreliable part of the stereo configuration is removed//
+//Summary: enforcing standard stereo representation; non-standard stereo notations are corrected, unreliable part of the stereo configuration is removed//
 
 The registration system can be flexibly configured to interpret input stereo configurations according to the available information received from the chemical supplier. In the lack of information we keep only the reliable parts of the configuration and remove all the others.
Line 61: Line 76:
 During the registration of supplier catalogs or external libraries we always use the following procedure:
 
-  * we contact the compound supplier and ask questions about the used stereochemical representation
-  * based on the answers we configure the registration system to use proper stereo clean-up schema
+  * we contact the compound supplier and inquire about the used stereochemical representation
+  * based on the answers we configure the registration system to use the proper stereo clean-up schema
   * we try to store as many details about stereochemistry as possible, but keep reliable information only
   * the applied stereo interpretation rules are always confirmed by the chemical supplier
Line 76: Line 91:
 //Summary: product integrity and structure checks, functional group standardization, enforce proper organometallic & salt representation//
 
-This registration stage aims the elimination/correction of problematic structures and the preparation of structures for subsequent steps, especially for component separation. Structures are considered to be problematic if they are chemically incorrect, uncertain/ambiguous, misdrawn or can have missing components.
+This registration stage aims the elimination/correction of problematic structures and the preparation of structures for subsequent steps, especially for component separation. Structures are considered to be problematic if they are chemically incorrect, uncertain/ambiguous, misdrawn or have missing components.
 
 In this section, we list some of the most important check and preparation steps grouped by their purpose and complemented with some examples.
Line 82: Line 97:
 ==Constitution check==
 
-Formal charges and valence states are checked, cases where the placement of hydrogens is ambiguous are detected. In some cases our system requires explicit hydrogens, where the valence state cannot be determined automatically. This mainly happens in case of inorganic atoms.
+Formal charges and valence states are checked, cases where the placement of hydrogens is ambiguous are detected. In some cases our system requires explicit hydrogens, where the valence state cannot be determined automatically. This happens mainly with inorganic atoms.
 + 
 +{{ :regsys:reg_sys_1.png |}}
  
 ==Configuration check==
 
-The stereo configuration in the input structures can be incorrect or ambiguous. In case of tetrahedral configuration wedge bonds denote the configuration around stereocenters. Wedge bonds can be problematic in the following cases
+The stereo configuration in the input structures can be incorrect or ambiguous. In case of tetrahedral configuration wedge bonds denote the configuration around stereocenters. Wedge bonds can be problematic in the following cases:
 
-  * Unordered List Itemthey are drawn to wrong atoms that are not stereocenters
+  * they are drawn to atoms that are not stereocenters
   * they have wrong direction (wide end points to stereocenter)
   * they are drawn between stereocenters indicating perspective drawing
Line 94: Line 111:
  
 Similar problems can arise in case of cis/trans configurations that are also detected by the system. You can get further information about stereo drawing rules including the proper geometry of wedge bonds in the IUPAC documentation of [[http://pac.iupac.org/publications/pac/pdf/2006/pdf/7810x1897.pdf|Graphical Representation of Stereochemical Configuration]].
 +
 +{{ :regsys:reg_sys_2.png |}}
  
 ==Product integrity check==
Line 102: Line 121:
  
 Even if the SDF is correct and all information is represented within the chemical structure, there is still a possibility that the structure is insufficiently specified or misdrawn. In certain cases we can detect such structures. For example, missing or extra hydrogens can be detected for special structural patterns. Moreover, purchasable product entries should have a net zero charge. Charged products usually indicates a missing or an extra counterion.
 +
 +{{ :regsys:reg_sys_3.png |}}
 +{{ :regsys:reg_sys_4.png |}}
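
The net-charge rule from the product integrity check lends itself to a short sketch. It assumes, purely for illustration, that each component of an entry is given as a list of per-atom formal charges; the function names and this input format are hypothetical and do not describe the actual mcule implementation.

```python
from typing import List, Optional


def net_charge(components: List[List[int]]) -> int:
    """Sum the formal charges of every atom in every component."""
    return sum(sum(charges) for charges in components)


def check_net_charge(components: List[List[int]]) -> Optional[str]:
    """Return None for a neutral entry, otherwise a problem description.

    A non-zero total usually points to a missing or an extra counterion.
    """
    total = net_charge(components)
    if total == 0:
        return None
    return "net charge %+d: possible missing or extra counterion" % total


# Sodium acetate as two components: acetate (one O carries -1) and Na+.
sodium_acetate = [[0, 0, 0, -1], [1]]
# The same entry with the sodium cation left out by mistake.
acetate_only = [[0, 0, 0, -1]]
```

Being a check step, the function only reports the problem and leaves the decision about correction to later steps or to manual curation.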
  
 ==Functional group check & standardization==
 
 In these steps common functional groups such as nitro and azide groups are transformed to their neutral form. This standardization is necessary to get all relevant results from a [[substructuresearch|substructure search]]. Besides standardization, several misdrawn forms of these functional groups are detected.
 +
 +{{ :regsys:reg_sys_5.png |}}
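
As a rough illustration of functional group standardization, the sketch below rewrites charge-separated nitro groups into the neutral (pentavalent) drawing. It is a deliberately naive, text-level example: SMILES notation, the rewrite table and the function name are assumptions made here, and a real registration system would transform the molecular graph rather than a string.

```python
# Hypothetical rewrite table: charge-separated nitro spellings in SMILES
# mapped to the neutral drawing. Only these three spellings are covered.
NITRO_REWRITES = [
    ("[N+](=O)[O-]", "N(=O)=O"),
    ("[N+]([O-])=O", "N(=O)=O"),
    ("[O-][N+](=O)", "O=N(=O)"),
]


def standardize_nitro(smiles: str) -> str:
    """Replace known charge-separated nitro spellings with the neutral form.

    Text substitution is used only to keep the sketch dependency-free;
    it does not parse the SMILES string.
    """
    for charged, neutral in NITRO_REWRITES:
        smiles = smiles.replace(charged, neutral)
    return smiles
```

Once both drawings collapse to one form, a substructure query written for that form can match entries regardless of how the supplier drew the group.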
  
 ==Enforce standard salt & organometallic compound representation==
 
 In the mcule database salts and organometallic complexes should be represented as disconnected and connected, respectively. Component separation is performed in the next registration stage where typical counterions are separated automatically. Cases, where salts cannot be distinguished from organometallic complexes cannot be processed automatically and are marked as problematic. Disconnected organometallic complexes where the reconnection of metals cannot be performed automatically are also marked as problematic.
 +
 +{{ :regsys:reg_sys_7.png |}}
  
 ==Elimination of undesirable structures==
 
 In the mcule database free radicals and isotopes are currently not supported. These structures have less relevance in drug discovery and are undesirable in virtual screening. Moreover, they are not supported well by some of cheminformatic tools that are implemented in the mcule system. In this step we detect such structures and prevent their registration.
 +
 +{{ :regsys:reg_sys_8.png |}}
  
 ==SDF checks==
Line 124: Line 152:
 In this stage we separate components of the incoming structure. In common salts counterions can be disconnected and separated from the main component automatically. Bonds to the main component are deleted and proper charges are placed on both components.
  
 +{{ :regsys:reg_sys_9.png |}}
 ==== Stage D & E. Component registration ====
 //Summary: individual components’ structures are normalized, unique components are registered with new mcule IDs assigned at the tautomer and protonation state independent (D) and dependent (E) levels (steps in the D & E stages are very similar except for novelty check)//
Line 132: Line 161:
  
 In the mcule system there are [[stereonotations|four stereo configuration types]]: absolute, relative, racemic and unknown (the “unknown” type is used to denote uncertain configurations, where compound provider could not confirm that the configuration type is really absolute). They are assigned in the stereo clean-up stage, and these initially assigned types are inherited by the separated components. In these steps these assigned stereo configuration types as well as the stereo configurations are further processed: for those components having no stereocenters, stereo configuration types are removed, while the stereo configuration of components with stereocenters are normalized together with their stereo configuration types.
 +
 +{{ :regsys:reg_sys_10.png |}}
  
 Normalization is needed because certain configurations can be represented with multiple structures and/or [[stereonotations|stereo configuration types]]: replacing configurations around atoms and/or the configuration type can result in stereochemically equivalent structures. This can primarily happen when the configuration is only partially specified, containing atoms with both unknown/undefined and well-defined configurations. As a preparation step for the novelty check the same representative structures are selected from the set of structures with equivalent configurations.
Line 137: Line 168:
 ==Component novelty check==
 
-Main novelty check step is performed in stage D, aiming the identification of different tautomer forms and protonation states of the same compound. In stage E different tautomers and protonation states are treated and registered as different structures.
+Main novelty check step is performed in stage D, focusing on the identification of different tautomer forms and protonation states of the same compound. In stage E different tautomers and protonation states are treated and registered as different structures.
 
-In the mcule registration system the novelty check of the individual components is based on non-standard IUPAC InChI identifiers[*]. The InChI software performs a lot of normalization steps and can detect common forms of tautomerism. It can also perceive protonation states of the same compounds in most cases. InChI strings therefore serve as a good starting point of novelty check. Different structures with identical InChIs can be considered as different representations of the same compound.
+In the mcule registration system the novelty check of the individual components is based on non-standard IUPAC InChI identifiers. The InChI software performs a lot of normalization steps and can detect common forms of tautomerism. It can also perceive protonation states of the same compounds in most cases. InChI strings therefore serve as a good starting point of novelty check. Different structures with identical InChIs can be considered as different representations of the same compound.
 
 In stage D we use a novelty check algorithm that is based on the InChI strings but can detect an even broader set of potential tautomers than a simple InChI comparison. The system is capable to fully prevent the registration of duplicates as long as they are prototopic tautomers.
  
 +{{ :regsys:reg_sys_11.png |}}
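
The two registration levels of stages D and E can be sketched as two lookup tables keyed by identifier strings. Everything named here is hypothetical: the `ComponentRegistry` class, the integer IDs and the placeholder keys stand in for the InChI-based keys and mcule IDs, whose real construction is not shown on this page.

```python
from typing import Dict, Tuple


class ComponentRegistry:
    """Minimal two-level novelty check (an illustrative sketch only)."""

    def __init__(self) -> None:
        # Stage D: tautomer and protonation state independent key -> compound ID.
        self.compounds: Dict[str, int] = {}
        # Stage E: tautomer and protonation state dependent key -> structure ID.
        self.structures: Dict[str, int] = {}

    def register(self, independent_key: str, dependent_key: str) -> Tuple[int, int]:
        """Return (compound ID, structure ID), creating new IDs only for new keys."""
        compound_id = self.compounds.setdefault(
            independent_key, len(self.compounds) + 1
        )
        structure_id = self.structures.setdefault(
            dependent_key, len(self.structures) + 1
        )
        return compound_id, structure_id


# Placeholder keys: two tautomers of one compound share the independent key.
registry = ComponentRegistry()
first = registry.register("compound-K1", "compound-K1/tautomer-a")
second = registry.register("compound-K1", "compound-K1/tautomer-b")
repeat = registry.register("compound-K1", "compound-K1/tautomer-a")
other = registry.register("compound-K2", "compound-K2/tautomer-a")
```

In this sketch two tautomers receive the same compound ID but different structure IDs, while a repeated submission of the same form receives no new ID at either level.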
 ==== Stage E & F. Multicomponent structure registration ====
 //Summary: additional checks are performed, component types are assigned, and unique structures are registered with new mcule IDs assigned at the tautomer and protonation state independent (F) and dependent (G) levels (steps in the F & G stages are very similar)//
Line 156: Line 188:
 ==Component type assignment==
 
-After novelty check components are analyzed. The system can identify counterions and potential solvents. In the latter case registration is problematic. For example, water can be a crystal water or can denote the solvent. Depending on its role it should be removed from or retained in the structure. Also, deprotonated form of water (hydroxide) can serve as a counterion.
+After novelty check components are analyzed. The system can identify counterions and potential solvents. In the latter case registration is problematic. For example, water can be a crystal water or can denote the solvent. Depending on its role it should be removed from or retained in the structure. Also, the deprotonated form of water (hydroxide) can serve as a counterion.
 
 In most cases the system is also capable of identifying the main components, which can serve as the input set for virtual screens.
 +
 +You can see below the index page of compound [[https://mcule.com/MCULE-3198812899/|MCULE-3198812899]]. This is a maleic and/or fumaric acid salt
 +(uncertainty is marked by crossed double bond). Counter ions are marked, and component multiplicities are assigned correctly by the system.
 +
 +{{:regsys:reg_sys_12.png|}}
regsys.txt · Last modified: 2013/10/19 11:36 by rkiss