Differences

This shows you the differences between two versions of the page.

--- subsets [2016/12/27 20:44] – rkiss
+++ subsets [2016/12/27 21:23] (current) – rkiss
@@ Line 13: / Line 13: @@
 ==== Property based filtering ====
-For the drug-like and fragment subsets the [[http://www.sciencedirect.com/science/article/pii/S0169409X00001290|rule-of-5]] and [[http://www.sciencedirect.com/science/article/pii/S1359644603028319|rule-of-3]] physicochemical property filters are applied allowing max 1 violation. Additionally, we used a few more rules:
+For the drug-like and fragment subsets, the [[http://www.sciencedirect.com/science/article/pii/S0169409X00001290|rule-of-5]] and [[http://www.sciencedirect.com/science/article/pii/S1359644603028319|rule-of-3]] physicochemical property filters are applied, allowing at most 1 violation. Additionally, we applied the following filtering criteria to exclude some rather "strange" compounds:
   * number of components < = 1
   * MW > = 100
@@ Line 20: / Line 21: @@
   * number of halogens < = 7
   * number of inorganic atoms = 0
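
The filtering described above could be sketched roughly as follows. This is only an illustration, not the code used to build the subsets: the descriptor names (`mw`, `logp`, `hbd`, `hba`, `n_components`, `n_halogens`, `n_inorganic_atoms`) and the dictionary layout are assumptions, the rule-of-5 limits are the commonly cited ones, and only the additional criteria visible in this diff are included.

```python
# Illustrative sketch: descriptor names and the dict layout are assumptions.
# Commonly cited rule-of-5 upper limits; exceeding one counts as a violation.
RULE_OF_5 = {"mw": 500.0, "logp": 5.0, "hbd": 5, "hba": 10}


def count_violations(desc, limits):
    """Count how many upper limits the descriptor values exceed."""
    return sum(1 for key, limit in limits.items() if desc[key] > limit)


def passes_filters(desc, limits=RULE_OF_5, max_violations=1):
    """Apply a rule-based filter plus the extra criteria listed above."""
    # Tolerate at most `max_violations` violations of the property rule.
    if count_violations(desc, limits) > max_violations:
        return False
    # Extra criteria shown in the list (criteria not shown there are omitted).
    return (
        desc["n_components"] <= 1
        and desc["mw"] >= 100
        and desc["n_halogens"] <= 7
        and desc["n_inorganic_atoms"] == 0
    )
```

A different limits dictionary (for example one holding rule-of-3 thresholds) can be passed as `limits` without changing the rest of the function.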
-We used these rules to leave out more "strange compounds" which may be included more likely in the diverse sets as they differ from the regular compounds.
 ==== Diversity selection ====
-The Mcule database contains ~5.7M stock compounds and ~30.3M virtual compounds. Diversity selection was carried out in a way to prefer the stock compounds over the virtual ones. As a result, the chemical space is represented by stock compounds where possible. Moreover, the downloadable files contain the compounds in diversity order, i.e. the first N compounds represent the most dissimilar N compounds, so you can choose yourself the number of compounds to represent the whole Ro3 and Ro5 subsets.
+Diversity selection was set up to prefer in-stock compounds over virtual ones. As a result, the chemical space is represented by in-stock compounds where possible.
-Structural similarity was measured by Tanimoto coefficient (TC) between FP2 linear fingerprints generated by OpenBabel. The combinations of the following algorithms were applied to extract the most dissimilar subsets:
-  * we used sphere exclusion to eliminate highly similar compounds to reduce the input size where needed
-  * then [[diversitysel|stepwise elimination]] was applied to obtain the most dissimilar compounds
-In sphere exclusion we used the stock compounds first as "centers" for the elimination of redundant compounds, and we retained the stock compounds during the stepwise elimination.
-===== Subsets =====
+The [[https://mcule.com/database/|downloadable files]] list the compounds in diversity order, i.e. the first N compounds are the N most dissimilar compounds. This means that if you want to reduce the number of compounds further, you can keep the first N compounds of a file and they will be the most dissimilar ones.
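
Because the files are already in diversity order, truncating one is enough to get a smaller diverse set. A minimal sketch, assuming a file format with one compound record per line (the actual file layout is not stated here, and the function name is hypothetical):

```python
from itertools import islice


def first_n_records(lines, n):
    """Return the first n non-empty records, preserving the given order."""
    # Skip blank lines so that they do not count towards n.
    records = (line.rstrip("\n") for line in lines if line.strip())
    return list(islice(records, n))
```

An open file handle can be passed directly as `lines`, since iterating over it yields one line at a time.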
-To speed up the selection, we used sphere exclusion in case of the Ro5 subsets with TC=0.8 to pass at most 3M compounds for stepwise elimination. Then, the following subsets were saved:
+Structural similarity was measured by the Tanimoto coefficient (TC) between FP2 linear fingerprints generated by [[http://jcheminf.springeropen.com/articles/10.1186/1758-2946-3-33|OpenBabel]]. A combination of the following algorithms was applied to extract the most dissimilar compounds:
+  * sphere exclusion: quickly eliminates highly similar compounds to reduce the input collection to a manageable size for the subsequent [[diversitysel|stepwise elimination]] algorithm
+  * [[diversitysel|stepwise elimination]]: a more thorough algorithm that repeatedly eliminates one molecule from the most similar molecule pair
-^Subset name			^Input	^Property filter	^Diversity	^Subset size ^
+In sphere exclusion we used the in-stock compounds first as "centers" and eliminated their most similar analogs, while during [[diversitysel|stepwise elimination]] we retained the in-stock compounds from the most similar molecule pairs.
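
The two stages and the preference for in-stock compounds can be illustrated with a small sketch. It is not the production implementation: fingerprints are modelled as plain sets of on-bit indices rather than OpenBabel FP2 objects, the `(id, fingerprint, in_stock)` tuple layout and function names are assumptions, the tie-breaking is arbitrary, and the naive pair search is far too slow for millions of compounds.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sphere_exclusion(mols, max_tc):
    """Keep a molecule only if its TC to every kept center is <= max_tc.

    mols: list of (id, fingerprint_set, in_stock) tuples (assumed layout).
    In-stock molecules are visited first, so they become the centers.
    """
    ordered = sorted(mols, key=lambda m: not m[2])  # stable: in-stock first
    centers = []
    for mol in ordered:
        if all(tanimoto(mol[1], c[1]) <= max_tc for c in centers):
            centers.append(mol)
    return centers


def stepwise_elimination(mols, target_size):
    """Repeatedly drop one member of the most similar pair."""
    kept = list(mols)
    while len(kept) > target_size and len(kept) >= 2:
        i, j = max(
            combinations(range(len(kept)), 2),
            key=lambda p: tanimoto(kept[p[0]][1], kept[p[1]][1]),
        )
        # Retain the in-stock member when exactly one of the pair is in stock;
        # otherwise drop the second member (arbitrary choice in this sketch).
        drop = i if (kept[j][2] and not kept[i][2]) else j
        del kept[drop]
    return kept
```

Running `sphere_exclusion` first and feeding its output to `stepwise_elimination` mirrors the order described above: a cheap pass to shrink the input, then the more thorough pairwise pass.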
-|Mcule Purchasable (In Stock Ro5 Diverse 1M) |Stock compounds|rule-of-5, max 1 violation|top diverse 1M, max TC: 0.8|1,000,000|
-|Mcule Purchasable (In Stock Ro5 Diverse 350K)|Stock compounds|rule-of-5 (max 1 violation)|Top diverse 350K, max TC:0.7|350,000|
-|Mcule Purchasable (In Stock Ro3)|Stock compounds|rule-of-3 (max 1 violation)|-|154,238|
-|Mcule Purchasable (In Stock Ro3 Diverse 50K)|Stock compounds|rule-of-3 (max 1 violation)|Top diverse 50K, max TC: 0.8|50,000|
-|Mcule Purchasable (In Stock & Virtual Ro3)|Stock compounds + virtual compounds|rule-of-3 (max 1 violation)|-|789,907|
-|Mcule Purchasable (In Stock & Virtual Ro3 Diverse 70K)|Stock compounds + virtual compounds|rule-of-3 (max 1 violation)|Top diverse 70K, max TC: 0.8|70,000|
+For the rule-of-5 subsets, sphere exclusion was applied with a maximum TC of 0.8, so that at most 3M compounds were passed on to stepwise elimination.