If you cannot search or screen the full Mcule database, consider using a smaller, representative subset prefiltered by physicochemical properties and diversity. Structurally diverse subsets representing the drug-like (rule-of-5) and fragment (rule-of-3) chemical space can be accessed as described below.
The subsets can be
For the drug-like and fragment subsets, the rule-of-5 and rule-of-3 physicochemical property filters are applied, respectively, allowing at most one violation. Additionally, we applied the following filtering criteria to exclude some rather “strange” compounds:
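As a rough illustration of the property filters mentioned above (not of the additional criteria), the following sketch counts rule violations and accepts a compound with at most one. The thresholds are the commonly cited rule-of-5 and rule-of-3 limits and are an assumption here; the exact cut-offs and property calculators used for the Mcule subsets are not stated in this description. The names `Props`, `count_violations` and `passes` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Props:
    """Precomputed properties of one compound (or the limits of a rule)."""
    mol_weight: float
    logp: float
    h_donors: int
    h_acceptors: int


# Commonly cited thresholds; assumed for illustration, not taken from Mcule.
RULE_OF_5 = Props(mol_weight=500.0, logp=5.0, h_donors=5, h_acceptors=10)
RULE_OF_3 = Props(mol_weight=300.0, logp=3.0, h_donors=3, h_acceptors=3)


def count_violations(p: Props, limits: Props) -> int:
    """Number of properties that exceed the corresponding limit."""
    return sum([
        p.mol_weight > limits.mol_weight,
        p.logp > limits.logp,
        p.h_donors > limits.h_donors,
        p.h_acceptors > limits.h_acceptors,
    ])


def passes(p: Props, limits: Props, max_violations: int = 1) -> bool:
    """True if the compound violates at most max_violations limits."""
    return count_violations(p, limits) <= max_violations
```

For example, a compound with a molecular weight of 510 that is otherwise within the limits has one violation and is kept, while one that also exceeds the logP limit is rejected.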
Diversity selection was set up to prefer in-stock compounds over virtual ones. As a result, the chemical space is represented by in-stock compounds where possible.
The downloadable files contain the compounds in diversity order, i.e. the first N compounds are the N most dissimilar compounds. To reduce the number of compounds further, keep only the first N compounds of a file; they will be the most dissimilar ones.
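Because the files are in diversity order, truncating one is enough to get a smaller diverse set. A minimal sketch, assuming a text format with one compound per line (for example a SMILES file); a multi-line format such as SDF would need a record-aware reader instead. The function name `take_first` and the file name in the usage note are hypothetical.

```python
from itertools import islice


def take_first(lines, n):
    """Return the first n non-empty records from an iterable of lines."""
    records = (line.rstrip("\n") for line in lines if line.strip())
    return list(islice(records, n))
```

With a downloaded file this could be used as `with open("subset.smi") as fh: top = take_first(fh, 1000)`, which reads only as much of the file as is needed.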
Structural similarity was measured by the Tanimoto coefficient (TC) between FP2 linear fingerprints generated by OpenBabel. Combinations of the following algorithms were applied to extract the most dissimilar compounds:
In sphere exclusion we used the in-stock compounds first as “centers” and eliminated their most similar analogs, while during stepwise elimination we retained the in-stock compounds from the most similar molecule pairs.
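The sphere exclusion step described above can be sketched as follows. This is an illustration under assumptions, not the actual Mcule implementation: fingerprints are represented as plain Python sets of on-bit positions rather than real FP2 fingerprints, the default threshold of 0.8 and the treatment of the boundary value (a TC exactly equal to the threshold is allowed) are assumed, and the function names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of on-bit positions."""
    union = len(a | b)
    # Convention assumed here: two empty fingerprints have similarity 0.
    return len(a & b) / union if union else 0.0


def sphere_exclusion(compounds, max_tc=0.8):
    """Pick diverse centers; compounds is a list of (id, fp_set, in_stock).

    In-stock compounds are visited first, so they become centers and
    their close virtual analogs are the ones that get excluded.
    """
    # sorted() is stable: in-stock entries (key False) come first,
    # and the original order is kept within each group.
    ordered = sorted(compounds, key=lambda c: not c[2])
    centers = []
    for cid, fp, in_stock in ordered:
        # Keep the compound only if it is not too similar to any center.
        if all(tanimoto(fp, cfp) <= max_tc for _, cfp, _ in centers):
            centers.append((cid, fp, in_stock))
    return [cid for cid, _, _ in centers]
```

In this sketch a virtual compound that shares five of six bits with an in-stock compound (TC about 0.83) is dropped, because the in-stock compound was already chosen as a center.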
For the rule-of-5 subsets, sphere exclusion diversity selection was applied first with a maximum TC of 0.8, and at most 3 million of the resulting compounds were then subjected to stepwise elimination.
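One possible reading of the stepwise elimination described above is: repeatedly find the most similar remaining pair and drop one member, keeping the in-stock compound, until the target size is reached. The sketch below follows that interpretation and is an assumption, not the documented procedure. It uses a naive all-pairs search that is only practical for small examples, fingerprints are plain sets of on-bit positions, and the names are hypothetical.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient between two sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def stepwise_elimination(compounds, target_size):
    """Shrink compounds (dict: id -> (fp_set, in_stock)) to target_size ids."""
    remaining = dict(compounds)
    while len(remaining) > target_size and len(remaining) > 1:
        # Find the currently most similar pair (naive all-pairs scan).
        a, b = max(
            combinations(remaining, 2),
            key=lambda pair: tanimoto(remaining[pair[0]][0],
                                      remaining[pair[1]][0]),
        )
        # Drop the virtual member if exactly one of the two is in stock;
        # otherwise drop the second member of the pair.
        drop = a if (remaining[b][1] and not remaining[a][1]) else b
        del remaining[drop]
    return list(remaining)
```

A production version would need a much faster nearest-pair search; the quadratic scan per step shown here is only meant to make the selection rule explicit.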