subsets

Differences

This shows you the differences between two versions of the page.

subsets [2016/12/22 18:27] – created sanmark
subsets [2016/12/27 21:23] (current) rkiss
Line 1: Line 1:
-====== Preselected subsets of the Mcule database ======
+====== Prefiltered subsets of the Mcule database ======
    
-We provide you with Ro5 and Ro3 subsets that can serve as a starting point of your virtual screening projects if you don't want to screen the full Mcule database (36M compounds currently). Structurally diverse subsets of the drug like and fragment like parts were generated to represent the same chemical space with a smaller number of compounds.
+If you cannot search or screen the full Mcule database, consider using smaller, representative subsets prefiltered by physicochemical properties and diversity. Structurally diverse subsets representing the drug-like (rule-of-5) and fragment (rule-of-3) chemical space can be accessed as described below.
  
 ===== Availability =====
  
 The subsets can be
-  * freely downloaded in SMILES and SDF file formats on our [[https://mcule.com/database/|download page]]
+  * freely downloaded in SMILES and SDF file formats from our [[https://mcule.com/database/|download page]]
-  * or can be selected as the input collection for [[screen|online screening]] on mcule.com if you have [[https://mcule.com/accounts/signup/|free account]]
+  * or can be selected as the input collection for [[screen|online screening]] in Mcule if you have an [[https://mcule.com/accounts/signup/|Mcule account]]
  
-===== Diversity selection =====
+===== Methods =====
  
-The Mcule database contains ~5.7M stock compounds and ~30.3M virtual compounds. Diversity selection was carried out in a way to prefer the stock compounds over the virtual ones. The aim is to represent only those part of the chemical space by virtual compounds space by virtual compopunds
+==== Property based filtering ====
  
-We've developed a method for large scale diversity selectionThe selection is carried out diverse subsets can be extracted while we 
+For the drug-like and fragment subsets, the [[http://www.sciencedirect.com/science/article/pii/S0169409X00001290|rule-of-5]] and [[http://www.sciencedirect.com/science/article/pii/S1359644603028319|rule-of-3]] physicochemical property filters are applied, respectively, allowing at most 1 violation. Additionally, we applied the following filtering criteria to exclude some rather unusual compounds:
 + 
 +  * number of components ≤ 1
 +  * MW ≥ 100
 +  * number of N+O atoms ≥ 1
 +  * number of rings ≥ 1
 +  * number of halogens ≤ 7
 +  * number of inorganic atoms = 0
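
The filtering criteria above could be sketched as a predicate over precomputed descriptors. This is a minimal sketch, not Mcule's implementation: the dictionary keys are hypothetical names, the rule-of-5 thresholds are the commonly cited Lipinski limits (MW ≤ 500, logP ≤ 5, H-bond donors ≤ 5, H-bond acceptors ≤ 10), and the descriptor calculation itself is left out.

```python
from typing import Mapping


def ro5_violations(d: Mapping[str, float]) -> int:
    """Count rule-of-5 violations using the commonly cited Lipinski limits."""
    checks = [
        d["mw"] > 500,    # molecular weight
        d["logp"] > 5,    # calculated logP
        d["hbd"] > 5,     # hydrogen-bond donors
        d["hba"] > 10,    # hydrogen-bond acceptors
    ]
    return sum(checks)


def passes_extra_criteria(d: Mapping[str, float]) -> bool:
    """Apply the additional criteria listed on this page."""
    return (
        d["components"] <= 1
        and d["mw"] >= 100
        and d["n_plus_o"] >= 1
        and d["rings"] >= 1
        and d["halogens"] <= 7
        and d["inorganic_atoms"] == 0
    )


def is_drug_like(d: Mapping[str, float], max_violations: int = 1) -> bool:
    """Rule-of-5 with at most `max_violations` violations, plus the extra criteria."""
    return ro5_violations(d) <= max_violations and passes_extra_criteria(d)


# Hypothetical descriptor values for a small aromatic acid.
example = {
    "mw": 180.2, "logp": 1.2, "hbd": 1, "hba": 4,
    "components": 1, "n_plus_o": 4, "rings": 1,
    "halogens": 0, "inorganic_atoms": 0,
}
print(is_drug_like(example))
```

A rule-of-3 variant would follow the same pattern with its own thresholds.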
 + 
 +==== Diversity selection ==== 
 + 
 +Diversity selection was set up to prefer in-stock compounds over virtual ones. As a result, the chemical space is represented by in-stock compounds where possible. 
 + 
 +The [[https://mcule.com/database/|downloadable files]] contain the compounds in diversity order, i.e. the first N compounds are the N most dissimilar compounds. This means that if you want to reduce the number of compounds further, you can keep the first N compounds of a file and they will be the most dissimilar ones.
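
Because the files are in diversity order, narrowing a subset amounts to keeping the head of the file. The sketch below assumes a SMILES file with one compound per line and no header row; the function name and the file name in the usage comment are hypothetical.

```python
from itertools import islice
from typing import Iterable, List


def first_n_compounds(lines: Iterable[str], n: int) -> List[str]:
    """Return the first n non-empty records of a diversity-ordered file."""
    records = (line.strip() for line in lines)
    non_empty = (record for record in records if record)
    return list(islice(non_empty, n))


# Usage with a downloaded file (hypothetical file name):
# with open("subset.smi") as handle:
#     top = first_n_compounds(handle, 100_000)
```

Reading lazily with `islice` avoids loading the whole file when only a small prefix is needed.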
 + 
 +Structural similarity was measured by the Tanimoto coefficient (TC) between FP2 linear fingerprints generated by [[http://jcheminf.springeropen.com/articles/10.1186/1758-2946-3-33|OpenBabel]]. A combination of the following algorithms was applied to extract the most dissimilar compounds:
 +  * sphere exclusion: quickly eliminates highly similar compounds, reducing the input collection to a manageable size for the subsequent [[diversitysel|stepwise elimination]] algorithm
 +  * [[diversitysel|stepwise elimination]]: a more thorough algorithm that eliminates one molecule from each of the most similar molecule pairs
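
For reference, the Tanimoto coefficient of two binary fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below assumes each fingerprint is already available as a set of on-bit indices; generating FP2 fingerprints is not shown, and returning 0.0 for two empty fingerprints is a convention chosen here, not something the page specifies.

```python
from typing import AbstractSet


def tanimoto(a: AbstractSet[int], b: AbstractSet[int]) -> float:
    """Tanimoto coefficient |A ∩ B| / |A ∪ B| of two sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # assumed convention for two empty fingerprints
    return len(a & b) / union


print(tanimoto({1, 2, 3}, {2, 3, 4}))  # 2 shared bits / 4 bits in total = 0.5
```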
 + 
 +In sphere exclusion we used the in-stock compounds first as "centers" and eliminated their most similar analogs, while during [[diversitysel|stepwise elimination]] we retained the in-stock compounds from the most similar molecule pairs. 
 + 
 +For the rule-of-5 subsets, sphere exclusion was applied with a maximum TC of 0.8, and at most 3M compounds were then subjected to stepwise elimination.
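
A naive sphere exclusion pass that prefers in-stock compounds could look like the sketch below. It is an illustration under stated assumptions rather than the method used to build the subsets: fingerprints are sets of on-bit indices, the `Compound` type and function names are hypothetical, and a candidate is treated as excluded when its TC to any retained center is above the threshold.

```python
from typing import FrozenSet, Iterable, List, NamedTuple


class Compound(NamedTuple):
    cid: str               # hypothetical compound identifier
    bits: FrozenSet[int]   # on-bit indices of the fingerprint
    in_stock: bool


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient of two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sphere_exclusion(compounds: Iterable[Compound], max_tc: float = 0.8) -> List[Compound]:
    """Keep a compound only if its TC to every retained center is <= max_tc."""
    # Stable sort: in-stock compounds are considered first, so they become
    # the centers and their virtual analogs are the ones eliminated.
    ordered = sorted(compounds, key=lambda c: not c.in_stock)
    centers: List[Compound] = []
    for candidate in ordered:
        if all(tanimoto(candidate.bits, c.bits) <= max_tc for c in centers):
            centers.append(candidate)
    return centers


library = [
    Compound("b", frozenset({1, 2, 3, 4, 5, 6}), in_stock=False),
    Compound("a", frozenset({1, 2, 3, 4, 5}), in_stock=True),
    Compound("c", frozenset({10, 11}), in_stock=False),
]
print([c.cid for c in sphere_exclusion(library)])
```

Here the virtual compound `b` has a TC of 5/6 to the in-stock compound `a`, so `b` is dropped. This version compares every candidate against every center, so it would need a faster similarity search to handle millions of compounds.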
subsets.1482431235.txt.gz · Last modified: 2016/12/22 18:27 by sanmark