Prefiltered subsets of the Mcule database

If you cannot search or screen the full Mcule database, consider using smaller, representative subsets that are prefiltered by physicochemical properties and diversity. Structurally diverse subsets representing the drug-like (rule-of-5) and fragment-like (rule-of-3) chemical space can be accessed as described below.

Availability

The subsets can be

downloaded freely in SMILES and SDF file formats from our download page
selected as the input collection for online screening at mcule.com, if you have an Mcule account

Methods

Property based filtering

For the drug-like and fragment-like subsets, the rule of 5 and the rule of 3 were applied, respectively, allowing one violation. In addition, the following rules were used:

number of components <= 1
MW >= 100
number of N+O atoms >= 1
number of rings >= 1
number of halogens <= 7
number of inorganic atoms = 0

These rules leave out more of the “strange” compounds, which would otherwise be more likely to end up in the diverse sets precisely because they differ from regular compounds.
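The filters above can be sketched in a few lines of Python. This is only an illustration, not the code used to build the subsets. It assumes the descriptors have already been computed by a cheminformatics toolkit and stored in a hypothetical `Props` container. The rule-of-5 and rule-of-3 limits are the commonly cited ones and are an assumption here, since this page does not list them.

```python
from dataclasses import dataclass


@dataclass
class Props:
    """Precomputed descriptors for one compound (hypothetical container)."""
    mw: float
    logp: float
    hbd: int
    hba: int
    components: int
    n_plus_o: int
    rings: int
    halogens: int
    inorganic_atoms: int


# Assumed upper limits (commonly cited values, not taken from this page).
RO5 = {"mw": 500, "logp": 5, "hbd": 5, "hba": 10}
RO3 = {"mw": 300, "logp": 3, "hbd": 3, "hba": 3}


def violations(p, limits):
    """Count how many descriptors exceed their upper limit."""
    return sum(getattr(p, name) > limit for name, limit in limits.items())


def passes_extra_rules(p):
    """Additional rules listed above; all of them must hold."""
    return (
        p.components <= 1
        and p.mw >= 100
        and p.n_plus_o >= 1
        and p.rings >= 1
        and p.halogens <= 7
        and p.inorganic_atoms == 0
    )


def passes_filter(p, limits, max_violations=1):
    """Rule-of-5 or rule-of-3 with one violation allowed, plus the extra rules."""
    return passes_extra_rules(p) and violations(p, limits) <= max_violations
```

For example, `passes_filter(props, RO3)` would accept a fragment that breaks exactly one rule-of-3 limit, provided it satisfies every additional rule.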

Diversity selection

The Mcule database contains ~5.7M stock compounds and ~30.3M virtual compounds. Diversity selection was carried out so that stock compounds are preferred over virtual ones. As a result, the chemical space is represented by stock compounds where possible. Moreover, the downloadable files list the compounds in diversity order, i.e. the first N compounds are the N most dissimilar compounds, so you can choose for yourself how many compounds to use to represent the whole Ro3 or Ro5 subset.
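Because the files are in diversity order, a smaller diverse set is simply the head of the file. A minimal sketch, assuming a SMILES file with one record per line and no header line (the exact file layout is not described on this page, and `take_first_n` is a hypothetical helper):

```python
from itertools import islice


def take_first_n(lines, n):
    """Return the first n non-empty records from an iterable of text lines."""
    records = (line.strip() for line in lines if line.strip())
    return list(islice(records, n))
```

With a downloaded file this could be used as `with open(path) as handle: top = take_first_n(handle, 10000)`, where `path` points to the subset file.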

Structural similarity was measured by the Tanimoto coefficient (TC) between FP2 linear fingerprints generated by OpenBabel. A combination of the following algorithms was applied to extract the most dissimilar subsets:

sphere exclusion was used first to eliminate highly similar compounds and reduce the input size where needed
stepwise elimination was then applied to obtain the most dissimilar compounds

In sphere exclusion, the stock compounds were used first as “centers” for the elimination of redundant compounds, and the stock compounds were also retained in preference to virtual ones during the stepwise elimination.
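The two steps can be illustrated with a toy Python sketch. It is not the production implementation. Fingerprints are represented as plain sets of on-bit indices standing in for FP2 bits. Treating a compound as redundant when its TC to a center exceeds the threshold, and removing one member of the most similar remaining pair at each step, are assumptions about how the two algorithms were configured. The quadratic pair search is only suitable for small examples.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sphere_exclusion(compounds, threshold):
    """Keep a compound only if its TC to every kept center is <= threshold.

    compounds: list of (id, fingerprint, is_stock) tuples.
    Stock compounds are visited first, so they become centers preferentially.
    """
    ordered = sorted(compounds, key=lambda c: not c[2])  # stock first, stable
    centers = []
    for comp in ordered:
        if all(tanimoto(comp[1], center[1]) <= threshold for center in centers):
            centers.append(comp)
    return centers


def stepwise_elimination(compounds):
    """Repeatedly remove one member of the most similar remaining pair.

    Returns all compounds in diversity order (last eliminated comes first),
    so the first N entries are the N compounds that survived longest.
    """
    remaining = list(compounds)
    removed = []
    while len(remaining) > 1:
        # Most similar pair among the remaining compounds.
        i, j = max(
            ((i, j) for i in range(len(remaining))
             for j in range(i + 1, len(remaining))),
            key=lambda p: tanimoto(remaining[p[0]][1], remaining[p[1]][1]),
        )
        # Drop the virtual compound if exactly one of the pair is stock.
        drop = i if (remaining[j][2] and not remaining[i][2]) else j
        removed.append(remaining.pop(drop))
    return remaining + removed[::-1]
```

Running sphere exclusion first shrinks the input, so the more expensive stepwise elimination only has to deal with compounds that are not near-duplicates of each other.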

Subsets

To speed up the selection, sphere exclusion with TC=0.8 was used for the Ro5 subsets so that at most 3M compounds were passed on to stepwise elimination. The following subsets were then saved:

Subset name	Input	Property filter	Diversity	Subset size
Mcule Purchasable (In Stock Ro5 Diverse 1M)	Stock compounds	rule-of-5 (max 1 violation)	Top diverse 1M, max TC: 0.8	1,000,000
Mcule Purchasable (In Stock Ro5 Diverse 350K)	Stock compounds	rule-of-5 (max 1 violation)	Top diverse 350K, max TC: 0.7	350,000
Mcule Purchasable (In Stock Ro3)	Stock compounds	rule-of-3 (max 1 violation)	-	154,238
Mcule Purchasable (In Stock Ro3 Diverse 50K)	Stock compounds	rule-of-3 (max 1 violation)	Top diverse 50K, max TC: 0.8	50,000
Mcule Purchasable (In Stock & Virtual Ro3)	Stock compounds + virtual compounds	rule-of-3 (max 1 violation)	-	789,907
Mcule Purchasable (In Stock & Virtual Ro3 Diverse 70K)	Stock compounds + virtual compounds	rule-of-3 (max 1 violation)	Top diverse 70K, max TC: 0.8	70,000
