subsets

Differences

This shows you the differences between two versions of the page.

subsets [2016/12/27 20:44] rkisssubsets [2016/12/27 21:23] (current) rkiss
Line 13: Line 13:
 ==== Property based filtering ==== ==== Property based filtering ====
  
-For the drug-like and fragment subsets the [[http://www.sciencedirect.com/science/article/pii/S0169409X00001290|rule-of-5]] and [[http://www.sciencedirect.com/science/article/pii/S1359644603028319|rule-of-3]] physicochemical property filters are applied allowing max 1 violation. Additionally, we used a few more rules:
+For the drug-like and fragment subsets the [[http://www.sciencedirect.com/science/article/pii/S0169409X00001290|rule-of-5]] and [[http://www.sciencedirect.com/science/article/pii/S1359644603028319|rule-of-3]] physicochemical property filters are applied allowing max 1 violation. Additionally, we applied the following filtering criteria to skip some rather "strange" compounds:
   * number of components < = 1   * number of components < = 1
   * MW > = 100   * MW > = 100
Line 20: Line 21:
   * number of halogens < = 7   * number of halogens < = 7
   * number of inorganic atoms = 0   * number of inorganic atoms = 0
- 
-We used these rules to leave out more "strange compounds" which may be included more likely in the diverse sets as they differ from the regular compounds. 
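The property filter described above can be sketched as follows. This is only an illustration, not Mcule's actual implementation: the `Props` record and the function names are hypothetical, the properties are assumed to be precomputed by some cheminformatics toolkit, the thresholds are the commonly cited rule-of-5 and rule-of-3 limits, and only the additional rules visible in this comparison are included.

```python
from dataclasses import dataclass


@dataclass
class Props:
    """Precomputed properties of one compound (hypothetical record)."""
    mw: float             # molecular weight
    logp: float           # calculated logP
    hbd: int              # hydrogen-bond donors
    hba: int              # hydrogen-bond acceptors
    components: int       # number of disconnected components
    halogens: int         # number of halogen atoms
    inorganic_atoms: int  # number of inorganic atoms


# Commonly cited upper limits; assumed here, not taken from the page.
RO5 = {"mw": 500, "logp": 5, "hbd": 5, "hba": 10}
RO3 = {"mw": 300, "logp": 3, "hbd": 3, "hba": 3}


def violations(p: Props, limits: dict) -> int:
    """Count how many upper limits the compound exceeds."""
    return sum(getattr(p, name) > limit for name, limit in limits.items())


def passes(p: Props, limits: dict, max_violations: int = 1) -> bool:
    """Apply the extra rules strictly, then allow at most one rule violation."""
    extra_ok = (
        p.components <= 1
        and p.mw >= 100
        and p.halogens <= 7
        and p.inorganic_atoms == 0
    )
    return extra_ok and violations(p, limits) <= max_violations
```

With these assumed limits, a compound with four acceptors still passes the rule-of-3 filter (one violation), while a compound that breaks two rule-of-5 limits is rejected.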
  
 ==== Diversity selection ==== ==== Diversity selection ====
  
-The Mcule database contains ~5.7M stock compounds and ~30.3M virtual compounds. Diversity selection was carried out in a way to prefer the stock compounds over the virtual ones. As a result, the chemical space is represented by stock compounds where possible. More over, the downloadable files contains the compounds in diversity order i.e. the first N compounds represent the most dissimilar N compounds, so you can choose yourself the number of compounds to represent the whole Ro3 and Ro5 subsets.
+Diversity selection was set up to prefer in-stock compounds over virtual ones. As a result, the chemical space is represented by in-stock compounds where possible.
- +
-Structural similarity was measured by Tanimoto coefficient (TC) between FP2 linear fingerprints generated by OpenBabel. The combinations of the following algorithms were applied to extract the most dissimilar subsets: +
-  * we used sphere exclusion to eliminate highly similar compounds to reduce the input size where needed +
-  * then [[diversitysel|stepwise elimination]] was applied to obtain the most dissimilar compounds +
- +
-In sphere exclusion we used the stock compounds first as "centers" for the elimination of redundant compounds, and we retained the stock compounds during the stepwise elimination.+
  
-===== Subsets =====
+The [[https://mcule.com/database/|downloadable files]] contain the compounds in diversity order i.e. the first N compounds represent the most dissimilar N compounds. This means that if you want to further narrow down the number of compounds you can keep the first X compounds of the files and they will be the most dissimilar ones.
  
-To speed up the selection, we used sphere exclusion in case of the Ro5 subsets with TC=0.to pass at most 3M compounds for stepwise elminiation. Then, the following subsets were saved:
+Structural similarity was measured by Tanimoto coefficient (TC) between FP2 linear fingerprints generated by [[http://jcheminf.springeropen.com/articles/10.1186/1758-2946-3-33|OpenBabel]]. The combinations of the following algorithms were applied to extract the most dissimilar compounds:
 +  * sphere exclusion: to quickly eliminate highly similar compounds to reduce the input collection to a manageable size for the subsequent [[diversitysel|stepwise elimination]] algorithm 
 +  * [[diversitysel|stepwise elimination]]: a more thorough algorithm that eliminates one molecule of the most similar molecule pairs
  
-^Subset name ^Input ^Property filter ^Diversity ^Subset size ^
+In sphere exclusion we used the in-stock compounds first as "centers" and eliminated their most similar analogs, while during [[diversitysel|stepwise elimination]] we retained the in-stock compounds from the most similar molecule pairs.
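The Tanimoto coefficient and the stepwise elimination step can be illustrated with a small sketch. Several things here are assumptions rather than details from the page: fingerprints are represented as Python sets of on-bit indices (FP2 generation itself is not shown), the function names are hypothetical, ties are broken arbitrarily, and the "diversity order" is assumed to be the reverse of the elimination order. The naive pair search is only practical for tiny inputs.

```python
from itertools import combinations


def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def diversity_order(fps: dict, in_stock: set) -> list:
    """Naive stepwise elimination.

    Repeatedly find the most similar pair and drop one member, keeping the
    in-stock compound when exactly one of the two is in stock. Returns the
    compound ids so that the last survivors come first (assumed ordering).
    """
    alive = sorted(fps)
    removed = []
    while len(alive) > 1:
        a, b = max(
            combinations(alive, 2),
            key=lambda pair: tanimoto(fps[pair[0]], fps[pair[1]]),
        )
        # Drop the virtual compound if its partner is in stock;
        # otherwise drop the second member (arbitrary tie-break).
        victim = a if (a not in in_stock and b in in_stock) else b
        alive.remove(victim)
        removed.append(victim)
    return alive + removed[::-1]
```

Under that assumed ordering, keeping the first X entries of the returned list corresponds to stopping the elimination once X compounds remain.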
-|Mcule Purchasable (In Stock Ro5 Diverse 1M) |Stock compounds|rule-of-5, max 1 violation|top diverse 1M, max TC: 0.8|1,000,000| +
-|Mcule Purchasable (In Stock Ro5 Diverse 350K)|Stock compounds|rule-of-5 (max 1 violation)|Top diverse 350K, max TC:0.7|350,000| +
-|Mcule Purchasable (In Stock Ro3)|Stock compounds|rule-of-3 (max 1 violation)|-|154,238| +
-|Mcule Purchasable (In Stock Ro3 Diverse 50K)|Stock compounds|rule-of-3 (max 1 violation)|Top diverse 50K, max TC: 0.8|50,000| +
-|Mcule Purchasable (In Stock & Virtual Ro3)|Stock compounds + virtual compounds|rule-of-3 (max 1 violation)|-|789,907| +
-|Mcule Purchasable (In Stock & Virtual Ro3 Diverse 70K)|Stock compounds + virtual compounds|rule-of-3 (max 1 violation)|Top diverse 70K, max TC: 0.8|70,000|+
  
 +Sphere exclusion diversity selection was applied in case of the rule-of-5 subsets with maximum TC=0.8 and a maximum of 3M compounds that were subjected to stepwise elimination.
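A minimal sketch of sphere exclusion with in-stock compounds visited first is given below. It makes several assumptions not stated on the page: fingerprints are sets of on-bit indices, a compound is excluded when its TC to any existing center is greater than or equal to the threshold, and the function name is hypothetical. The 3M cap on the output is not modelled.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sphere_exclusion(fps: dict, in_stock: set, max_tc: float = 0.8) -> list:
    """Greedy sphere exclusion.

    Compounds are visited with in-stock ones first, so they become centers
    before any virtual compound. A compound is kept as a new center only if
    it is less similar than max_tc to every center chosen so far.
    """
    order = sorted(fps, key=lambda cid: (cid not in in_stock, cid))
    centers = []
    for cid in order:
        if all(tanimoto(fps[cid], fps[c]) < max_tc for c in centers):
            centers.append(cid)
    return centers
```

Because the in-stock compounds are processed first, a virtual compound is only retained when no in-stock center already covers its neighbourhood.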
subsets.txt · Last modified: 2016/12/27 21:23 by rkiss