Preselected subsets of the Mcule database
We provide rule-of-5 (Ro5) and rule-of-3 (Ro3) subsets that can serve as a starting point for your virtual screening projects if you don't want to screen the full Mcule database (currently 36M compounds). Structurally diverse subsets of the drug-like and fragment-like parts were generated to represent the same chemical space with a smaller number of compounds.
Availability
The subsets can be
- freely downloaded in SMILES and SDF file formats from our download page
- selected as the input collection for online screening on mcule.com if you have a free account
Methods
Property-based filtering
Diversity selection
The Mcule database contains ~5.7M stock compounds and ~30.3M virtual compounds. Diversity selection was carried out so that stock compounds are preferred over virtual ones. As a result, the chemical space is represented by stock compounds wherever possible. Moreover, the downloadable files list the compounds in diversity order, i.e. the first N compounds are the N most dissimilar compounds, so you can choose for yourself how many compounds should represent the whole Ro3 or Ro5 subset.
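Because the files are in diversity order, drawing a smaller representative set amounts to reading the first N records. The following is a minimal sketch for the SMILES format, assuming one compound per line; the function name `first_n_compounds` and the file name in the comment are hypothetical and not part of any Mcule tooling.

```python
from itertools import islice


def first_n_compounds(path, n):
    """Return the first n non-empty lines of a diversity-ordered SMILES file."""
    with open(path, encoding="utf-8") as handle:
        # Skip blank lines and keep each record exactly as written (SMILES plus any ID column).
        records = (line.rstrip("\n") for line in handle if line.strip())
        return list(islice(records, n))


# Hypothetical usage: keep the 100,000 most diverse compounds of a downloaded subset.
# top = first_n_compounds("downloaded_subset.smi", 100_000)
```

Only the first N lines are read, so the cost stays proportional to N even for the largest subsets.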
Structural similarity was measured by the Tanimoto coefficient (TC) between FP2 linear fingerprints generated by OpenBabel. A combination of the following algorithms was applied to extract the most dissimilar subsets:
- sphere exclusion was used first to eliminate highly similar compounds and reduce the input size
- stepwise elimination was then applied to obtain the most dissimilar compounds
In sphere exclusion, the stock compounds were used first as “centers” for eliminating redundant compounds, and the stock compounds were retained during the stepwise elimination.
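The two steps above can be illustrated with a small, brute-force sketch. It is not the code used to build the subsets. Fingerprints are represented as plain sets of on-bit indices standing in for FP2 fingerprints, all function names are hypothetical, and the exact elimination rule (drop one member of the most similar remaining pair, never a stock compound) is an assumption about how stepwise elimination might work.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def sphere_exclusion(fingerprints, is_stock, threshold=0.8):
    """Keep a compound only if its TC to every already kept compound is below threshold."""
    # Stable sort puts stock compounds first, so they become the exclusion centers.
    order = sorted(range(len(fingerprints)), key=lambda i: not is_stock[i])
    kept = []
    for i in order:
        if all(tanimoto(fingerprints[i], fingerprints[j]) < threshold for j in kept):
            kept.append(i)
    return kept


def stepwise_elimination(fingerprints, candidates, protected, target_size):
    """Repeatedly drop one member of the most similar pair; protected indices are never dropped."""
    remaining = list(candidates)
    while len(remaining) > target_size:
        best = None
        for x in range(len(remaining)):
            for y in range(x + 1, len(remaining)):
                i, j = remaining[x], remaining[y]
                if i in protected and j in protected:
                    continue  # neither member of this pair may be removed
                tc = tanimoto(fingerprints[i], fingerprints[j])
                if best is None or tc > best[0]:
                    best = (tc, i, j)
        if best is None:
            break  # nothing removable is left
        _, i, j = best
        remaining.remove(j if j not in protected else i)
    return remaining


# Toy data: five fingerprints, two of them flagged as stock compounds.
fps = [{1, 2, 3, 4, 5}, {1, 2, 3, 4, 5, 6}, {10, 11, 12}, {10, 11, 13}, {20, 21}]
is_stock = [False, True, True, False, False]

kept = sphere_exclusion(fps, is_stock, threshold=0.8)
print(kept)  # [1, 2, 3, 4]
stock = {i for i in kept if is_stock[i]}
print(stepwise_elimination(fps, kept, stock, target_size=3))  # [1, 2, 4]
```

In the toy data, compound 0 is dropped in the first step because its TC to stock compound 1 is 5/6, and compound 3 is dropped in the second step because it forms the most similar remaining pair (TC 0.5) with stock compound 2. Both loops are quadratic or worse, so this only shows the logic; a selection over millions of compounds would need a far more efficient implementation.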
Subsets
To speed up the selection, we used sphere exclusion with TC=0.8 to pass at most 3M compounds to stepwise elimination where possible. Then the following subsets were saved:
Subset name | Input | Property filter | Diversity | Subset size |
---|---|---|---|---|
Mcule Purchasable (In Stock Ro5 Diverse 1M) | Stock compounds | rule-of-5 (max 1 violation) | Top diverse 1M, max TC: | 1,000,000 |
Mcule Purchasable (In Stock Ro5 Diverse 350K) | Stock compounds | rule-of-5 (max 1 violation) | Top diverse 350K, max TC: | 350,000 |
Mcule Purchasable (In Stock Ro3) | Stock compounds | rule-of-3 (max 1 violation) | - | 154,238 |
Mcule Purchasable (In Stock Ro3 Diverse 50K) | Stock compounds | rule-of-3 (max 1 violation) | Top diverse 50K, max TC: | 50,000 |
Mcule Purchasable (In Stock & Virtual Ro3) | Stock compounds + virtual compounds | rule-of-3 (max 1 violation) | - | 789,907 |
Mcule Purchasable (In Stock & Virtual Ro3 Diverse 70K) | Stock compounds + virtual compounds | rule-of-3 (max 1 violation) | Top diverse 70K, max TC: | 70,000 |
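
The "max 1 violation" property filters in the table can be sketched as a simple violation count. The cutoffs below are the commonly cited rule-of-5 and rule-of-3 values and are an assumption, since the exact criteria applied to the Mcule database are not given here. The property names, the function names and the example compound are hypothetical, and the descriptor values are assumed to have been computed beforehand.

```python
# Assumed upper limits; a compound violates a rule when a property exceeds its limit.
RO5_LIMITS = {"mol_weight": 500.0, "logp": 5.0, "h_donors": 5, "h_acceptors": 10}
RO3_LIMITS = {"mol_weight": 300.0, "logp": 3.0, "h_donors": 3, "h_acceptors": 3}


def count_violations(props, limits):
    """Count how many properties exceed their upper limit."""
    return sum(1 for name, limit in limits.items() if props[name] > limit)


def passes_rule(props, limits, max_violations=1):
    """Accept a compound that breaks at most max_violations of the limits."""
    return count_violations(props, limits) <= max_violations


# Hypothetical compound with precomputed descriptors.
compound = {"mol_weight": 520.0, "logp": 4.2, "h_donors": 2, "h_acceptors": 7}
print(passes_rule(compound, RO5_LIMITS))  # True: only the molecular weight limit is exceeded
print(passes_rule(compound, RO3_LIMITS))  # False: weight, logP and acceptor limits are exceeded
```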