Differences

This shows you the differences between two versions of the page.

--- diversity_selection [2012/07/02 20:46] – [Default options] rkiss
+++ diversity_selection [2012/07/02 20:58] – [Algorithm] rkiss
@@ Line 33: / Line 33: @@
 The default descriptor is the linear fingerprint implemented in OpenBabel ((Open Babel v2.3.90 http://openbabel.sourceforge.net/)), which is similar to Daylight’s fingerprint and ChemAxon’s ((http://www.chemaxon.com/jchem/doc/user/fingerprint.html)) linear fingerprint. The similarity of two fingerprints is calculated as their Tanimoto coefficient.
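To make the similarity measure concrete, here is a minimal sketch of the Tanimoto coefficient, assuming each fingerprint is represented as the set of its on-bit positions. The function name `tanimoto` and the convention of returning 0.0 for two empty fingerprints are illustrative assumptions, not part of the tool described here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        # Assumed convention: two empty fingerprints are treated as dissimilar.
        return 0.0
    # Shared on-bits divided by all on-bits present in either fingerprint.
    return len(a & b) / union
```

For example, fingerprints with on-bits `{1, 2, 3}` and `{2, 3, 4}` share two of four distinct bits, giving a coefficient of 0.5.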
-If you have no preference, you can use the default settings. After implementation and evaluation of new fingerprints and metrics, the default setup can be changed. This can be tracked at the end of this document, in the Changelog section.
 ==== Algorithm ====
-We use an optimized implementation of the stepwise elimination algorithm((R. J. Taylor, J. Chem. Inf. Comput. Sci., 1995, 35, 59 67.)), which can be described as follows:
+We use an optimized implementation of the stepwise elimination algorithm((R. J. Taylor, J. Chem. Inf. Comput. Sci., 1995, 35, 59-67.)), which can be described as follows:
-  - calculate the similarity matrix of the molecules in the input collection
-  - process the matrix elements as follows:
-    - select the largest off-diagonal element in the similarity matrix
-    - eliminate one molecule of the most similar molecule pair randomly
-    - go to step I. if off-diagonal elements remained
-  - sort the list of eliminated molecules by similarity values associated to the elimination steps in increasing order
-During this process, the size of the collection is reduced and diversity increases. Each elimination step throws out a compound that has close analogues in the remaining set. In result, we get a single compound, and a list of compounds with decreasing similarity values, which can be interpreted as the increasing diversity of the remaining set.
+  - Calculate the similarity matrix of the molecules in the input collection
+  - Process the matrix elements as follows:
+    - Select the largest off-diagonal element in the similarity matrix
+    - Randomly eliminate one molecule of the most similar pair
+    - Go back to the first sub-step (selecting the largest off-diagonal element) if any off-diagonal elements remain
+  - Sort the list of eliminated molecules in increasing order of the similarity values associated with their elimination steps
-After the algorithm finishes, structures are sorted by similarity values and are placed in the result collection. The first molecules in the resulted collection are the most dissimilar (most diverse) ones, and the length of the result list is determined using the diversity options: the maximum number of compounds to be selected and the maximum similarity values allowed between diverse compounds.
+During this process, the collection shrinks while its diversity increases. Each elimination step removes one molecule that has a close analogue in the remaining set. As a result, the remaining molecules are less similar to one another (more diverse).
+After the algorithm finishes, structures are sorted by similarity values and are placed in the result collection. The first molecules in the resulting collection are the most dissimilar (most diverse) ones. The length of the result list is determined by two input parameters: the maximum number of compounds and the similarity threshold.
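The steps above can be sketched in Python as follows. This is a naive illustration, not the optimized implementation mentioned in the text: it rescans all remaining pairs at every step. The function names, the nested-list similarity matrix, and the way the two input parameters (`max_count`, `max_similarity`) truncate the result are assumptions made for the example.

```python
import random


def stepwise_elimination(sim, rng=None):
    """Return (survivor, ranked) for a symmetric similarity matrix `sim`.

    `ranked` lists (molecule index, similarity at elimination) pairs,
    sorted by increasing similarity, so the most diverse come first.
    """
    if not sim:
        raise ValueError("need at least one molecule")
    rng = rng or random.Random()
    remaining = set(range(len(sim)))
    eliminated = []
    while len(remaining) > 1:
        # Largest off-diagonal element among the molecules still present.
        best, i, j = max(
            (sim[a][b], a, b) for a in remaining for b in remaining if a < b
        )
        # Randomly drop one member of the most similar pair.
        victim = rng.choice((i, j))
        remaining.remove(victim)
        eliminated.append((victim, best))
    survivor = remaining.pop()
    eliminated.sort(key=lambda pair: pair[1])
    return survivor, eliminated


def select_diverse(sim, max_count, max_similarity, rng=None):
    """Assumed truncation rule: stop at max_count molecules or at the first
    molecule whose elimination similarity exceeds max_similarity."""
    survivor, ranked = stepwise_elimination(sim, rng)
    picked = [survivor]
    for index, similarity in ranked:
        if len(picked) >= max_count or similarity > max_similarity:
            break
        picked.append(index)
    return picked
```

Because each molecule is eliminated at the largest similarity present at that moment, every molecule picked under the threshold has a pairwise similarity of at most `max_similarity` to all the others picked, under the truncation rule assumed here.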
 ===== Limitations =====