-====== ​Preselected ​subsets of the Mcule database ======+====== ​Prefiltered ​subsets of the Mcule database ======
-We provide ​you with Ro5 and Ro3 subsets that can serve as a starting point of your virtual screening projects if you don't want to screen the full Mcule database ​(36M compounds currently). Structurally diverse subsets ​of the drug like and fragment ​like parts were generated to represent the same chemical space with a smaller number of compounds.+In case you cannot search / screen the full Mcule database, you may consider using some smaller, representative subsets thereof prefiltered by physicochemical properties and diversity. Structurally diverse subsets ​representing ​the drug-like (rule-of-5) ​and fragment ​(rule-of-3) ​chemical space can be accessed as described below.
 ===== Availability ===== ===== Availability =====
 The subsets can be  The subsets can be 
-  * freely downloaded in SMILES and SDF file formats ​on our [[https://​​database/​|download page]]  +  * freely downloaded in SMILES and SDF file formats ​from our [[https://​​database/​|download page]]  
-  * or can be selected as the input collection for [[screen|online screening]] ​on ​if you have [[https://​​accounts/​signup/​|free account]]+  * or can be selected as the input collection for [[screen|online screening]] ​in Mcule if you have an [[https://​​accounts/​signup/​|Mcule account]]
-===== Diversity selection ​=====+===== Methods ​=====
-The Mcule database contains ~5.7M stock compounds and ~30.3M virtual compounds. Diversity selection was carried out in a way to prefer the stock compounds over the virtual ones. The aim is to represent only those parts of the chemical space by virtual compounds+==== Property based filtering ====
-We've developed a method for large scale diversity selectionThe selection is carried out diverse subsets ​can be extracted ​while we +For the drug-like and fragment subsets the [[http://​​science/​article/​pii/​S0169409X00001290|rule-of-5]] and [[http://​​science/​article/​pii/​S1359644603028319|rule-of-3]] physicochemical property filters are applied allowing max 1 violation. Additionally,​ we applied the following filtering criteria to skip some rather "​strange"​ compounds:​ 
 +  * number of components <= 1
 +  * MW >= 100
 +  * number of N+O atoms >= 1
 +  * number of rings >= 1
 +  * number of halogens <= 7
 +  * number of inorganic atoms = 0
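The filtering criteria listed above can be expressed as a simple predicate. The following is a minimal sketch, not Mcule's actual implementation: the `Descriptors` record and its field names are hypothetical, and the descriptor values are assumed to have been computed beforehand with a cheminformatics toolkit. The rule-of-5 thresholds in `ro5_violations` are the commonly cited ones and are an assumption here, since this page does not list them.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed per-compound values (hypothetical record layout)."""
    components: int        # number of disconnected components
    mw: float              # molecular weight
    n_plus_o: int          # number of N and O atoms
    rings: int             # number of rings
    halogens: int          # number of halogen atoms
    inorganic_atoms: int   # number of inorganic atoms
    logp: float = 0.0      # used only by the rule-of-5 check
    hbd: int = 0           # hydrogen-bond donors
    hba: int = 0           # hydrogen-bond acceptors


def passes_extra_filters(d: Descriptors) -> bool:
    """Apply the additional criteria from the list above."""
    return (
        d.components <= 1
        and d.mw >= 100
        and d.n_plus_o >= 1
        and d.rings >= 1
        and d.halogens <= 7
        and d.inorganic_atoms == 0
    )


def ro5_violations(d: Descriptors) -> int:
    """Count rule-of-5 violations (thresholds are an assumption)."""
    checks = [d.mw > 500, d.logp > 5, d.hbd > 5, d.hba > 10]
    return sum(checks)


def is_drug_like(d: Descriptors, max_violations: int = 1) -> bool:
    """Rule-of-5 with at most one violation, plus the extra criteria."""
    return ro5_violations(d) <= max_violations and passes_extra_filters(d)
```

A compound with one ring, two heteroatoms and a molecular weight of 250 passes, while a salt with two components is rejected regardless of its other properties.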
 +==== Diversity ​selection ​==== 
 +Diversity selection was set up to prefer in-stock compounds over virtual ones. As a result, the chemical space is represented by in-stock compounds where possible. 
 +The [[https://database/|downloadable files]] contain the compounds in diversity order, i.e. the first N compounds are the N most dissimilar ones. To narrow a subset down further, keep only the first N compounds of a file; they will be the most dissimilar ones.
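Because the files are in diversity order, truncating a subset amounts to taking a prefix. A minimal sketch, assuming a SMILES file with one compound per line and no header line (the file name in the usage note is hypothetical):

```python
from itertools import islice
from typing import Iterable, List


def first_n(records: Iterable[str], n: int) -> List[str]:
    """Return the first n records without reading the rest of the input."""
    return list(islice(records, n))


# Small in-memory demonstration with made-up SMILES lines.
demo = ["CCO mol-1\n", "c1ccccc1 mol-2\n", "CC(=O)O mol-3\n"]
top_two = first_n(demo, 2)
```

For a real file the same function can be applied to the open file handle, e.g. `first_n(open("subset.smi"), 100000)`. SDF files would need record-aware splitting instead, since each compound spans several lines.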
 +Structural similarity was measured by the Tanimoto coefficient (TC) between FP2 linear fingerprints generated by [[http://articles/10.1186/1758-2946-3-33|OpenBabel]]. A combination of the following algorithms was applied to extract the most dissimilar compounds:
 +  * sphere exclusion: to quickly eliminate highly similar compounds to reduce the input collection to a manageable size for the subsequent [[diversitysel|stepwise elimination]] algorithm 
 +  * [[diversitysel|stepwise elimination]]:​ a more thorough algorithm that eliminates one molecule of the most similar molecule pairs 
 +In sphere exclusion we used the in-stock compounds first as "​centers"​ and eliminated their most similar analogs, ​while during [[diversitysel|stepwise elimination]] ​we retained the in-stock compounds from the most similar molecule pairs. 
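The sphere exclusion step with in-stock compounds as preferred centers could look roughly like the sketch below. It is an illustration under stated assumptions, not the code used to build the subsets: fingerprints are modelled as sets of on-bit indices rather than real FP2 fingerprints, the `Compound` record is hypothetical, and a candidate is assumed to be eliminated when its TC to any existing center exceeds the threshold.

```python
from typing import FrozenSet, List, NamedTuple


class Compound(NamedTuple):
    name: str
    bits: FrozenSet[int]  # indices of the fingerprint bits that are set
    in_stock: bool


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient of two bit sets: |A and B| / |A or B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sphere_exclusion(compounds: List[Compound], max_tc: float = 0.8) -> List[Compound]:
    """Keep a compound only if it is not too similar to any kept center."""
    # Stable sort: in-stock compounds are visited first, so they become centers
    # and their virtual analogs are the ones that get eliminated.
    ordered = sorted(compounds, key=lambda c: not c.in_stock)
    centers: List[Compound] = []
    for cand in ordered:
        if all(tanimoto(cand.bits, c.bits) <= max_tc for c in centers):
            centers.append(cand)
    return centers
```

This naive version compares every candidate with every center, so it is only meant to show the logic; a production run over millions of compounds would need a faster similarity search.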
 +For the rule-of-5 subsets, sphere exclusion was applied with a maximum TC of 0.8, and at most 3M compounds were passed on to stepwise elimination.
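Stepwise elimination, as described above, repeatedly removes one member of the most similar pair while retaining in-stock compounds. A naive sketch follows; the stopping criterion (a target size), the function name and the set-based fingerprints are assumptions made for illustration, and the all-pairs search shown here would be far too slow for the real data set.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto coefficient of two bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def stepwise_elimination(
    fps: Dict[str, FrozenSet[int]],
    in_stock: Set[str],
    target_size: int,
) -> List[str]:
    """Drop one member of the most similar pair until target_size remain."""
    kept = dict(fps)
    while len(kept) > target_size and len(kept) > 1:
        # Find the currently most similar pair among the remaining compounds.
        x, y = max(
            combinations(kept, 2),
            key=lambda pair: tanimoto(kept[pair[0]], kept[pair[1]]),
        )
        # Retain the in-stock member if exactly one of the two is in stock;
        # otherwise drop the second member of the pair.
        drop = x if (y in in_stock and x not in in_stock) else y
        del kept[drop]
    return list(kept)
```

With three compounds where two are close analogs, the virtual analog is removed and the in-stock one survives, regardless of the order in which they appear in the input.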
