This is an old revision of the document!


Prefiltered subsets of the Mcule database

If you cannot search or screen the full Mcule database, consider using one of its smaller, representative subsets, which are prefiltered by physicochemical properties and diversity. Structurally diverse subsets representing the drug-like (rule-of-5) and fragment (rule-of-3) chemical space can be accessed as described below.

Availability

The subsets can be

Methods

Property-based filtering

For the drug-like and fragment subsets, the rule-of-5 and rule-of-3 physicochemical property filters were applied, respectively, allowing at most 1 violation. In addition, the following criteria were applied to exclude some rather “strange” compounds:

  • number of components <= 1
  • MW >= 100
  • number of N+O atoms >= 1
  • number of rings >= 1
  • number of halogens <= 7
  • number of inorganic atoms = 0
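
The filter described above can be sketched in Python as follows. This is only an illustration, not the code used to build the subsets: the `Descriptors` container and all function names are hypothetical, the descriptor values are assumed to be precomputed by a cheminformatics toolkit, and the rule-of-5 / rule-of-3 thresholds are the commonly cited ones (only MW, logP, H-bond donors and H-bond acceptors), since this page does not list them.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed properties of one compound (hypothetical container)."""
    mw: float
    logp: float
    hbd: int              # hydrogen-bond donors
    hba: int              # hydrogen-bond acceptors
    components: int
    n_plus_o: int         # number of N + O atoms
    rings: int
    halogens: int
    inorganic_atoms: int


# Assumed upper limits; the page itself does not state the thresholds.
RO5 = {"mw": 500, "logp": 5, "hbd": 5, "hba": 10}
RO3 = {"mw": 300, "logp": 3, "hbd": 3, "hba": 3}


def rule_violations(d, limits):
    """Count how many properties exceed their upper limit."""
    return sum(getattr(d, name) > limit for name, limit in limits.items())


def passes_extra_criteria(d):
    """The six additional criteria listed above."""
    return (
        d.components <= 1
        and d.mw >= 100
        and d.n_plus_o >= 1
        and d.rings >= 1
        and d.halogens <= 7
        and d.inorganic_atoms == 0
    )


def passes_filter(d, limits, max_violations=1):
    """Rule filter with at most `max_violations`, plus the extra criteria."""
    return passes_extra_criteria(d) and rule_violations(d, limits) <= max_violations
```

For example, a compound with MW 520 and otherwise compliant properties has one rule-of-5 violation and would still pass `passes_filter(d, RO5)`.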

Diversity selection

Diversity selection was set up to prefer in-stock compounds over virtual ones. As a result, the chemical space is represented by in-stock compounds where possible.

The downloadable files list the compounds in diversity order, i.e. the first N compounds of a file are its N most dissimilar compounds. To narrow a subset down further, keep only the first N compounds of the file; they will be the most dissimilar ones.
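
Because of this ordering, truncating a file is enough to obtain a smaller diverse set. A minimal sketch, assuming the file stores one compound per line (the file layout and the file name in the comment are assumptions, not taken from this page):

```python
from itertools import islice


def most_diverse(records, n):
    """Return the first n records of an iterable that is already in diversity order."""
    return list(islice(records, n))


# Hypothetical usage with a downloaded file, one compound per line:
# with open("subset.smi") as handle:
#     top_10k = most_diverse(handle, 10_000)
```

`islice` stops reading after n records, so the rest of a large file is never loaded into memory.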

Structural similarity was measured by the Tanimoto coefficient (TC) between FP2 linear fingerprints generated by Open Babel. A combination of the following algorithms was applied to extract the most dissimilar compounds:
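
For reference, the Tanimoto coefficient of two bit fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below represents a fingerprint as a Python set of on-bit indices, which is a stand-in for real FP2 fingerprints; returning 0.0 for two empty fingerprints is a convention assumed here.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # assumed convention for two empty fingerprints
    return len(a & b) / union
```

Two fingerprints with on-bits {1, 2, 3} and {2, 3, 4} share 2 bits out of 4 distinct bits, giving TC = 0.5.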

  • sphere exclusion: quickly eliminates highly similar compounds, reducing the input collection to a manageable size for the subsequent stepwise elimination algorithm
  • stepwise elimination: a more thorough algorithm that repeatedly eliminates one molecule from the most similar molecule pair
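
Sphere exclusion can be sketched as a single greedy pass: each compound becomes a new “center” unless it is too similar to a center that was already kept. The code below is an illustration under several assumptions, not the implementation used here: fingerprints are sets of on-bit indices, compounds with TC at or above the threshold to a center are excluded, and the preference for in-stock compounds is modeled by visiting them first.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sphere_exclusion(fps, in_stock, threshold=0.8):
    """Return indices of the retained centers.

    fps       -- list of fingerprints (sets of on-bit indices)
    in_stock  -- list of booleans, True for in-stock compounds
    threshold -- compounds with TC >= threshold to a center are excluded
    """
    # Stable sort: in-stock compounds (key False) are visited before virtual ones.
    order = sorted(range(len(fps)), key=lambda i: not in_stock[i])
    centers = []
    for i in order:
        if all(tanimoto(fps[i], fps[c]) < threshold for c in centers):
            centers.append(i)
    return centers
```

Each compound is compared only against the centers kept so far, which is what makes this pass cheap enough to use as a pre-reduction step.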

In sphere exclusion, we used the in-stock compounds first as “centers” and eliminated their most similar analogs, while during stepwise elimination we retained the in-stock compound of each most similar molecule pair.
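
A naive sketch of stepwise elimination with this in-stock preference is shown below. It is illustrative only: it recomputes all pairwise similarities in every round, which is far too slow for millions of compounds, and the tie-break used when both or neither compound of a pair is in stock is an arbitrary assumption. Fingerprints are again modeled as sets of on-bit indices.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def stepwise_elimination(fps, in_stock, target_size):
    """Drop one member of the most similar pair until target_size compounds remain."""
    alive = set(range(len(fps)))
    # At least two compounds are needed to form a pair.
    while len(alive) > max(target_size, 1):
        i, j = max(
            combinations(sorted(alive), 2),
            key=lambda pair: tanimoto(fps[pair[0]], fps[pair[1]]),
        )
        # Keep the in-stock compound when exactly one of the pair is in stock;
        # otherwise drop the second one (arbitrary tie-break, an assumption).
        drop = i if (in_stock[j] and not in_stock[i]) else j
        alive.remove(drop)
    return sorted(alive)
```

A practical implementation would cache similarities or use a nearest-neighbor index instead of rescanning all pairs.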

Subsets

To speed up the selection for the Ro5 subsets, we first used sphere exclusion with TC = 0.8 to pass at most 3M compounds on to stepwise elimination. The following subsets were then saved:

Subset name | Input | Property filter | Diversity | Subset size
Mcule Purchasable (In Stock Ro5 Diverse 1M) | Stock compounds | rule-of-5 (max 1 violation) | Top diverse 1M, max TC: 0.8 | 1,000,000
Mcule Purchasable (In Stock Ro5 Diverse 350K) | Stock compounds | rule-of-5 (max 1 violation) | Top diverse 350K, max TC: 0.7 | 350,000
Mcule Purchasable (In Stock Ro3) | Stock compounds | rule-of-3 (max 1 violation) | - | 154,238
Mcule Purchasable (In Stock Ro3 Diverse 50K) | Stock compounds | rule-of-3 (max 1 violation) | Top diverse 50K, max TC: 0.8 | 50,000
Mcule Purchasable (In Stock & Virtual Ro3) | Stock compounds + virtual compounds | rule-of-3 (max 1 violation) | - | 789,907
Mcule Purchasable (In Stock & Virtual Ro3 Diverse 70K) | Stock compounds + virtual compounds | rule-of-3 (max 1 violation) | Top diverse 70K, max TC: 0.8 | 70,000
subsets.1482873549.txt.gz · Last modified: 2016/12/27 22:19 by rkiss