====== Prefiltered subsets of the Mcule database ======
 
If searching or screening the full Mcule database is not feasible, you can use smaller, representative subsets that have been prefiltered by physicochemical properties and diversity. Structurally diverse subsets representing the drug-like (rule-of-5) and fragment (rule-of-3) chemical space can be accessed as described below.

===== Availability =====

The subsets can be
  * downloaded free of charge in SMILES and SDF file formats from our [[https://mcule.com/database/|download page]], or
  * selected as the input collection for [[screen|online screening]] in Mcule if you have an [[https://mcule.com/accounts/signup/|Mcule account]].

===== Methods =====

==== Property based filtering ====

For the drug-like and fragment subsets, the [[http://www.sciencedirect.com/science/article/pii/S0169409X00001290|rule-of-5]] and [[http://www.sciencedirect.com/science/article/pii/S1359644603028319|rule-of-3]] physicochemical property filters were applied, respectively, allowing at most one violation. In addition, the following criteria were applied to exclude unusual compounds:

  * number of components ≤ 1
  * MW ≥ 100
  * number of N+O atoms ≥ 1
  * number of rings ≥ 1
  * number of halogens ≤ 7
  * number of inorganic atoms = 0
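
The filtering logic above can be sketched as follows. This is only an illustration, not the code used to build the subsets. It assumes the descriptors have already been computed by a cheminformatics toolkit, and the rule-of-5 and rule-of-3 thresholds are the commonly cited ones, which this page does not state explicitly. The names ''Descriptors'', ''RULE_OF_5'', ''RULE_OF_3'' and ''passes_filter'' are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed properties of one compound (hypothetical container)."""
    mw: float          # molecular weight
    logp: float        # calculated logP
    hbd: int           # hydrogen bond donors
    hba: int           # hydrogen bond acceptors
    n_components: int  # disconnected components
    n_n_o: int         # number of N + O atoms
    n_rings: int
    n_halogens: int
    n_inorganic: int   # inorganic atoms


# ASSUMPTION: commonly cited upper limits; the exact values used for the
# Mcule subsets are not given on this page.
RULE_OF_5 = {"mw": 500, "logp": 5, "hbd": 5, "hba": 10}
RULE_OF_3 = {"mw": 300, "logp": 3, "hbd": 3, "hba": 3}


def count_violations(d, rule):
    """Count how many properties exceed their upper limit."""
    return sum(getattr(d, name) > limit for name, limit in rule.items())


def passes_extra_criteria(d):
    """Additional criteria listed above."""
    return (
        d.n_components <= 1
        and d.mw >= 100
        and d.n_n_o >= 1
        and d.n_rings >= 1
        and d.n_halogens <= 7
        and d.n_inorganic == 0
    )


def passes_filter(d, rule, max_violations=1):
    """Apply a property rule (at most max_violations) plus the extra criteria."""
    return count_violations(d, rule) <= max_violations and passes_extra_criteria(d)


# Made-up descriptor values for illustration only.
example = Descriptors(mw=180.2, logp=1.2, hbd=1, hba=4, n_components=1,
                      n_n_o=4, n_rings=1, n_halogens=0, n_inorganic=0)
print(passes_filter(example, RULE_OF_5), passes_filter(example, RULE_OF_3))
```

With these made-up values the compound has no rule-of-5 violations and one rule-of-3 violation (acceptor count), so it passes both filters when one violation is allowed.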

==== Diversity selection ====

Diversity selection was set up to prefer in-stock compounds over virtual ones. As a result, the chemical space is represented by in-stock compounds where possible.

The [[https://mcule.com/database/|downloadable files]] list the compounds in diversity order, i.e. the first N compounds are the N most dissimilar compounds. To narrow the set further, keep only the first N compounds of a file; they will be the most dissimilar ones.
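
Because of the diversity ordering, truncating a file is enough to obtain a smaller diverse subset. A minimal sketch, assuming a SMILES file with one compound per line and no header line (the exact file layout is not described on this page); ''first_n_records'' and the file name in the comment are hypothetical:

```python
from itertools import islice


def first_n_records(lines, n):
    """Return the first n non-empty lines, with trailing newlines removed."""
    non_empty = (line.rstrip("\n") for line in lines if line.strip())
    return list(islice(non_empty, n))


# Typical use with a downloaded file (hypothetical file name):
#
#     with open("subset.smi") as handle:
#         top = first_n_records(handle, 10000)
#
# In-memory demonstration:
demo = ["CCO m1\n", "\n", "c1ccccc1 m2\n", "CCN m3\n"]
print(first_n_records(demo, 2))
```

Reading lazily with ''islice'' avoids loading the whole file when only the first few thousand compounds are needed.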

Structural similarity was measured by the Tanimoto coefficient (TC) between FP2 linear fingerprints generated by [[http://jcheminf.springeropen.com/articles/10.1186/1758-2946-3-33|Open Babel]]. A combination of the following algorithms was applied to extract the most dissimilar compounds:
  * sphere exclusion: quickly eliminates highly similar compounds, reducing the input collection to a manageable size for the subsequent [[diversitysel|stepwise elimination]] algorithm
  * [[diversitysel|stepwise elimination]]: a more thorough algorithm that removes one molecule from each of the most similar molecule pairs
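
For reference, the Tanimoto coefficient of two binary fingerprints is the number of bits set in both divided by the number of bits set in either. A minimal sketch, representing each fingerprint as the set of its "on" bit positions (an assumption made for readability; real FP2 fingerprints are bit vectors). The function name ''tanimoto'' is hypothetical:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        # ASSUMPTION: two empty fingerprints are treated as having similarity 0.
        return 0.0
    return len(a & b) / union


print(tanimoto({1, 2, 3}, {2, 3, 4}))  # 2 shared bits / 4 distinct bits = 0.5
```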

In sphere exclusion we used the in-stock compounds first as "centers" and eliminated their most similar analogs, while during [[diversitysel|stepwise elimination]] we retained the in-stock compounds from the most similar molecule pairs.
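
The two steps and their preference for in-stock compounds can be illustrated with the naive sketch below. It is not the production implementation: the tuple layout, function names, tie-breaking and the exact threshold comparison are assumptions, and the brute-force pair search is only practical for small inputs.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two sets of on-bit indices (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def sphere_exclusion(compounds, max_tc):
    """Keep a compound only if its TC to every kept center is at most max_tc.

    compounds: list of (id, fingerprint_set, in_stock) tuples.
    In-stock compounds are visited first, so they become centers first.
    """
    ordered = sorted(compounds, key=lambda c: not c[2])  # stable: in-stock first
    centers = []
    for cand in ordered:
        if all(tanimoto(cand[1], c[1]) <= max_tc for c in centers):
            centers.append(cand)
    return centers


def stepwise_elimination(compounds, target_size):
    """Repeatedly drop one member of the most similar pair until target_size is reached."""
    kept = list(compounds)
    while len(kept) > target_size and len(kept) > 1:
        i, j = max(
            combinations(range(len(kept)), 2),
            key=lambda p: tanimoto(kept[p[0]][1], kept[p[1]][1]),
        )
        # Retain the in-stock member of the pair; otherwise drop the second
        # one (ASSUMPTION: arbitrary tie-break).
        drop = i if (kept[j][2] and not kept[i][2]) else j
        del kept[drop]
    return kept


# Toy data: (id, on-bit set, in_stock)
toy = [
    ("A", {1, 2, 3, 4}, False),
    ("B", {1, 2, 3, 4, 5}, True),
    ("C", {10, 11}, False),
    ("D", {10, 11, 12, 13, 14, 15}, True),
]
print([c[0] for c in sphere_exclusion(toy, max_tc=0.7)])
print([c[0] for c in stepwise_elimination(toy, target_size=2)])
```

In the toy data, the virtual compound A is a close analog of the in-stock compound B, so both steps discard A and keep B.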

For the rule-of-5 subsets, sphere exclusion was applied with a maximum TC of 0.8, and at most 3M compounds were then subjected to stepwise elimination.