Differences

This shows you the differences between two versions of the page.

--- diversity_selection [2012/06/28 10:25] – [Limitations] gazs
+++ diversity_selection [2012/07/02 20:50] – [Algorithm] rkiss
@@ Line 1: / Line 1: @@
 ====== Diversity selection ======
-This filter selects the most diverse compounds from large compound collections through the subsequent elimination of the most redundant structures. While the number of compounds in the input collection will be decreased, the maximum possible coverage of the represented chemical space will be retained.
+This filter selects the most diverse compounds from large compound collections by eliminating the most similar structures. The size of the input collection is decreased, while the maximum possible coverage of its represented chemical space is retained.
 ===== When to use =====
-If you have limited experimental or computational resources, diversity selection is an unbiased way to limit the number of compounds to handle. Assuming that every region of the chemical space has a rather desirable or undesirable (active or inactive) character, collecting compounds from different regions can be increase the chance to find desired compounds (actives).
+If you have limited experimental or computational resources, diversity selection is an unbiased way to limit the number of compounds to handle. Collecting compounds from different regions of the chemical space is an efficient strategy to maximize the diversity of the identified active scaffolds.
-Using this filter you can select a diverse, representative set of your virtual hits, or effectively reduce your screening/virtual library enriching diverse scaffolds. Compound selection can be controlled simply both in terms of diversity and the number of diverse compounds.
+Using this filter you can either reduce the size of large (virtual) screening libraries, or select a diverse, representative set of your virtual hits.
 ===== How to use =====
-It is recommended that you eliminate unwanted structures before a diversity selection, placing the filter after structural or phys-chem filters. This way you can avoid exotic structures or structures with exotic substituents remaining in the results, which would likely happen with this filter.
+It is recommended to apply structural or phys-chem property filters prior to diversity selection to eliminate unwanted structures. This is important to avoid exotic structures that are typically identified as very diverse by the algorithm.
 ==== Basic options ====
@@ Line 20: / Line 18: @@
   * **Number of most diverse molecules**: the maximum number (N) of diverse molecules to be selected. The diversity selection algorithm will select the N most diverse molecules, unless the maximum allowed similarity is reached first.
-If you don’t limit the selection, the full collection will be returned in a diversity order. This means that the top N molecules in the resulting collection will be the most diverse N molecules. The similarity threshold found at the Nth molecule refer to the diversity of the first N molecules: none of them are more similar than this value.
+If you do not limit the selection, the full collection will be returned ordered by diversity. This means that the top N molecules in the resulting collection will be the most diverse N molecules. The //maximum similarity// found at the Nth molecule in the results will refer to the minimum diversity of the first N molecules (none of them are more similar than that value).
 ==== Advanced options ====
-You can adjust the meaning of similarity and dissimilarity of molecules here, selecting the descriptor on which similarity scores are calculated. We use the Tanimoto coefficient (Jackard index)((http://en.wikipedia.org/wiki/Jaccard_index)) as the measure of similarity now, but you can chose between different chemical fingerprints as descriptors. We plan to introduce more descriptor types and more similarity measure types in the future.
+Under Advanced options, you can adjust the definition of similarity/dissimilarity of molecules. You can select the descriptor used for calculating the similarity scores. Currently two fingerprints (OpenBabel Linear Fingerprint and Indigo Similarity Fingerprint) are available.
-  * **Molecular descriptor**: the molecular descriptor used to represent chemical structures during the calculation
+You will be able to set different similarity metrics as the measure of similarity. Currently, only the Tanimoto coefficient (Jaccard index)((http://en.wikipedia.org/wiki/Jaccard_index)) is implemented as the measure of similarity.
-==== Default options ====
+We plan to introduce more descriptors and more similarity measure types in the future.
-The default descriptor used is the linear fingerprint implemented in Open Babel ((Open Babel v2.0.0 http://openbabel.sourceforge.net/)), which is similar to Daylight’s fingerprint and Chemaxon’s linear fingerprint, and the Tanimoto coefficient is calculated as the similarity of fingerprints.
+  * **Molecular descriptor**: the molecular descriptor applied for representing chemical structures during the calculation
-If you have no suggestions to use another setup, you can rely on our choices. After implementation and evaluation of new fingerprints and metrics, the default setup can be changed. This can be tracked at the end of this document, in the Changelog section.
+==== Default options ====
+The default descriptor used is the linear fingerprint implemented in OpenBabel ((Open Babel v2.3.90 http://openbabel.sourceforge.net/)), which is similar to Daylight’s fingerprint and ChemAxon’s ((http://www.chemaxon.com/jchem/doc/user/fingerprint.html)) linear fingerprint, and the Tanimoto coefficient is calculated as the similarity of fingerprints.
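As a rough illustration of the similarity measure named above, the Tanimoto coefficient of two fingerprints is the number of shared on-bits divided by the number of bits set in either one. The sketch below is not the filter's own code: it assumes fingerprints are represented as Python sets of on-bit indices, the function name ''tanimoto'' is hypothetical, and returning 0.0 for two empty fingerprints is an arbitrary choice made here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two fingerprints.

    Assumption: each fingerprint is a set of on-bit indices.
    """
    shared = len(fp_a & fp_b)                # bits set in both fingerprints
    union = len(fp_a) + len(fp_b) - shared   # bits set in at least one
    if union == 0:
        return 0.0  # assumption: two empty fingerprints count as dissimilar
    return shared / union


# Two fingerprints sharing 2 of 4 distinct on-bits
print(tanimoto({1, 2, 3}, {2, 3, 4}))
```

A value of 1.0 means identical fingerprints and 0.0 means no bits in common, so lower values correspond to higher diversity.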
 ==== Algorithm ====
-We use an optimized implementation of the stepwise elimination algorithm((R. J. Taylor, J. Chem. Inf. Comput. Sci., 1995, 35, 59 67.)), which can be described as follows:
+We use an optimized implementation of the stepwise elimination algorithm((R. J. Taylor, J. Chem. Inf. Comput. Sci., 1995, 35, 59-67.)), which can be described as follows:
   - calculate the similarity matrix of the molecules in the input collection
@@ Line 47: / Line 46: @@
 During this process, the size of the collection is reduced and diversity increases. Each elimination step throws out a compound that has close analogues in the remaining set. As a result, we get a single compound, and a list of compounds with decreasing similarity values, which can be interpreted as the increasing diversity of the remaining set.
-After the algorithm finishes, structures are sorted by similarity values and are placed in the result collection. The first molecules in the resulted collection are the most dissimilar (most diverse) ones, and the lenght of the result list is determined using the diversity options: the maximum number of compounds to be selected and the maximum similarity values allowed between diverse compounds.
+After the algorithm finishes, structures are sorted by similarity values and are placed in the result collection. The first molecules in the resulting collection are the most dissimilar (most diverse) ones, and the length of the result list is determined using the diversity options: the maximum number of compounds to be selected and the maximum similarity value allowed between diverse compounds.
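The process described above can be sketched naively as follows. This is a minimal illustration, not the optimized implementation mentioned in the text, and the intermediate steps hidden by this diff are not reproduced. It assumes a common reading of stepwise elimination: repeatedly find the most similar remaining pair and drop one of its members. Which member is dropped (here, the one with the higher index) is an assumption, as are the set-based fingerprints and the names ''diversity_order'' and ''select''.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    shared = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - shared
    return shared / union if union else 0.0


def diversity_order(fingerprints):
    """Return [(index, similarity)] with the most diverse molecules first."""
    n = len(fingerprints)
    # Pairwise similarity matrix, stored for index pairs (i, j) with i < j
    sim = {(i, j): tanimoto(fingerprints[i], fingerprints[j])
           for i, j in combinations(range(n), 2)}
    remaining = set(range(n))
    eliminated = []
    while len(remaining) > 1:
        # Most similar pair among the molecules that are still present
        i, j = max(combinations(sorted(remaining), 2), key=lambda p: sim[p])
        remaining.remove(j)  # assumption: drop the higher-index member
        eliminated.append((j, sim[(i, j)]))
    if remaining:
        # The last survivor has no partner left; 0.0 is a placeholder value
        eliminated.append((remaining.pop(), 0.0))
    eliminated.reverse()  # last eliminated = most diverse, so it comes first
    return eliminated


def select(ordered, max_count=None, max_similarity=None):
    """Cut the ordered list by count and/or by maximum allowed similarity."""
    picked = []
    for index, similarity in ordered:
        if max_count is not None and len(picked) >= max_count:
            break
        if max_similarity is not None and similarity > max_similarity:
            break
        picked.append(index)
    return picked


fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11}, {10, 11, 12}, {20}]
order = diversity_order(fps)
print(order)
print(select(order, max_count=2))
print(select(order, max_similarity=0.5))
```

Because removing a molecule can never raise the highest pairwise similarity of the remaining set, the recorded values decrease as elimination proceeds, so the reversed list is ordered from most to least diverse. This naive version rescans all pairs at every step and is only practical for small collections.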
 ===== Limitations =====
-The diversity selection is freely accessible for every mcule user with a monthly limit of 10000 input compounds. The average run time for 10000 compouds is X minutes. The usage of your diversity filter can be tracked on the user profile / limits. Our technologies allow effective processing of very large collections (~10M). If you want to exceed your limits, please contact us.
+The diversity selection is freely accessible for every mcule user with a monthly limit of 10000 input compounds. The average run time for 10000 compounds is about 5 minutes. The usage of your diversity filter can be tracked on the user profile / limits. Our technologies allow effective processing of very large collections (~10M). If you want to exceed your limits, please contact us.
 ===== Changelog =====