Diversity selection

This filter selects the most diverse compounds from large compound collections by eliminating the most similar structures. The size of the input collection is reduced while as much as possible of the chemical space it covers is retained.

When to use

If you have limited experimental or computational resources, diversity selection is an unbiased way to limit the number of compounds to handle. Collecting compounds from different regions of the chemical space is an efficient strategy to maximize the diversity of the identified active scaffolds.

Using this filter you can either reduce the size of large (virtual) screening libraries, or select a diverse, representative set of your virtual hits.

How to use

It is recommended to apply structural or physicochemical property filters before diversity selection to eliminate unwanted structures. This matters because the algorithm typically identifies exotic structures as very diverse, so they would otherwise be favored in the selection.

Basic options

The following options can be used to control the diversity of the resulting collection.

Similarity threshold: the maximum similarity S allowed in the diverse set. It is guaranteed that no two of the resulting molecules are more similar to each other than S.
Number of most diverse molecules: the maximum number N of diverse molecules to be selected. The diversity selection algorithm will select the N most diverse molecules, unless the maximum allowed similarity is reached first.

If you do not limit the selection, the full collection will be returned ordered by diversity. This means that the top N molecules in the resulting collection will be the N most diverse molecules. The maximum similarity reported for the Nth molecule in the results gives the minimum diversity of the first N molecules (no two of them are more similar than that value).
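The way the two basic options act on a diversity-ordered result can be illustrated with a short sketch. This is not the filter's actual code: the function name `truncate_selection`, its parameters and the example values are hypothetical, and treating a value exactly equal to the threshold as allowed is an assumption.

```python
from typing import Optional, Sequence


def truncate_selection(
    order: Sequence[str],
    max_similarities: Sequence[float],
    similarity_threshold: Optional[float] = None,
    max_count: Optional[int] = None,
) -> list:
    """Cut a diversity-ordered list at whichever limit is reached first.

    `order` lists molecules from most to least diverse, and
    `max_similarities[k]` is the maximum similarity found among the
    first k + 1 molecules (hypothetical input layout).
    """
    selected = []
    for molecule, similarity in zip(order, max_similarities):
        if max_count is not None and len(selected) >= max_count:
            break  # N molecules have already been selected
        if similarity_threshold is not None and similarity > similarity_threshold:
            break  # adding this molecule would exceed the allowed similarity S
        selected.append(molecule)
    return selected


# Hypothetical diversity-ordered result with four molecules
order = ["mol_c", "mol_a", "mol_d", "mol_b"]
max_similarities = [0.0, 0.3, 0.6, 0.9]

print(truncate_selection(order, max_similarities, similarity_threshold=0.7))
print(truncate_selection(order, max_similarities, max_count=2))
print(truncate_selection(order, max_similarities))  # no limit: full collection
```

With both limits set, the loop stops at whichever is reached first, which matches the behavior described for the two options.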

Advanced options

Under Advanced options, you can adjust the definition of similarity/dissimilarity of molecules. You can select the descriptor used for calculating the similarity scores. Currently two fingerprints (OpenBabel Linear Fingerprint and Indigo Similarity Fingerprint) are available.

The similarity metric will also be selectable. Currently, only the Tanimoto coefficient (Jaccard index)¹⁾ is implemented as the measure of similarity.
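For two binary fingerprints, the Tanimoto coefficient is the number of bits set in both divided by the number of bits set in either. A minimal sketch follows, assuming fingerprints are represented as Python sets of on-bit indices. The function name and the convention of returning 0.0 for two empty fingerprints are choices made for this example, not taken from the filter.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprints.

    Each fingerprint is given as the set of its on-bit indices.
    """
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # assumption: two empty fingerprints are treated as dissimilar
    return len(fp_a & fp_b) / union


# Two toy fingerprints: 2 shared bits, 5 distinct bits in total
fp_1 = {1, 4, 7, 9}
fp_2 = {1, 4, 8}
print(tanimoto(fp_1, fp_2))  # 2 / 5 = 0.4
```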

We plan to introduce more descriptors and more similarity measure types in the future.

Molecular descriptor: the molecular descriptor used to represent chemical structures during the calculation.

Default options

The default descriptor used is the linear fingerprint implemented in OpenBabel ²⁾, which is similar to Daylight’s fingerprint and ChemAxon’s ³⁾ linear fingerprint, and the Tanimoto coefficient is calculated as the similarity of fingerprints.

Algorithm

We use an optimized implementation of the stepwise elimination algorithm⁴⁾, which can be described as follows:

Calculate the similarity matrix of the molecules in the input collection
Process the matrix elements as follows:
1. Select the largest off-diagonal element in the similarity matrix
2. Randomly eliminate one molecule of the corresponding (most similar) pair
3. Go to step 1 if any off-diagonal elements remain
Sort the list of eliminated molecules in increasing order of the similarity values associated with their elimination steps

During this process, the size of the collection is reduced while the diversity of the collection is increased. Each elimination step filters out one molecule that has close analogues in the remaining set. As a result, the remaining molecules will have a decreased similarity (increased diversity).

After the algorithm finishes, structures are sorted by similarity values and are placed in the result collection. The first molecules in the resulting collection are the most dissimilar (most diverse) ones. The length of the result list is determined by the input parameters: the maximum number of compounds and the similarity threshold.
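The procedure described above can be sketched in a few lines of Python. This is a naive illustration, not the optimized implementation used by the filter: it rescans the matrix at every step, the function name and the `seed` parameter are hypothetical, and placing the last surviving molecule first with a similarity of 0.0 is an assumption made so that every molecule appears in the output.

```python
import random


def stepwise_elimination(sim, seed=0):
    """Order molecules from most to least diverse by stepwise elimination.

    `sim` is a symmetric similarity matrix (list of lists).
    Returns (order, values): molecule indices, and for each position the
    similarity associated with that molecule's elimination step.
    """
    rng = random.Random(seed)
    remaining = set(range(len(sim)))
    eliminated = []  # (similarity at elimination, molecule index)

    while len(remaining) > 1:
        # Step 1: largest off-diagonal element among the remaining molecules
        s, i, j = max(
            (sim[i][j], i, j) for i in remaining for j in remaining if i < j
        )
        # Step 2: randomly eliminate one molecule of the most similar pair
        victim = rng.choice((i, j))
        remaining.remove(victim)
        eliminated.append((s, victim))
        # Step 3: the loop repeats while off-diagonal elements remain

    # Sort eliminated molecules by increasing elimination similarity
    eliminated.sort(key=lambda pair: pair[0])
    # Assumption: the survivor (if any) is listed first with similarity 0.0
    ranked = [(0.0, idx) for idx in remaining] + eliminated
    order = [idx for _, idx in ranked]
    values = [s for s, _ in ranked]
    return order, values


# Toy similarity matrix for four molecules
sim = [
    [1.00, 0.90, 0.20, 0.30],
    [0.90, 1.00, 0.25, 0.35],
    [0.20, 0.25, 1.00, 0.60],
    [0.30, 0.35, 0.60, 1.00],
]
order, values = stepwise_elimination(sim, seed=1)
print(order, values)
```

Because the largest remaining similarity can only decrease as molecules are removed, the value stored at position N is an upper bound on every pairwise similarity among the first N molecules of the returned order.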

Limitations

The Diversity selection filter is freely available to registered mcule users. The monthly limit is 10,000 input molecules. Your usage limits, including the number of remaining molecules, can be tracked under user profile / limits. To open your user profile, click on your user name in the upper right corner of the mcule.com website. If you would like to run the Diversity selection filter on larger collections, please contact us. Our technology allows efficient processing of very large collections (~10M molecules).

The average run time for 10,000 input molecules is about a minute.

Changelog

—

¹⁾

http://en.wikipedia.org/wiki/Jaccard_index

²⁾

Open Babel v2.3.90 http://openbabel.sourceforge.net/

³⁾

http://www.chemaxon.com/jchem/doc/user/fingerprint.html

⁴⁾

R. J. Taylor, J. Chem. Inf. Comput. Sci., 1995, 35, 59-67.
