diversity_selection

Differences

This shows you the differences between two versions of the page.

diversity_selection [2012/07/02 20:42] – [Advanced options] rkiss
diversity_selection [2012/07/03 17:42] sanmark
Line 16:

   * **Similarity threshold**: the maximum S similarity allowed in the diverse set. It is guaranteed that none of the resulting molecules are more similar than S.
-  * **Number of most diverse molecules**: the maximum N number of diverse molecules to be selected. The diversity selection algorithm will select the most diverse N molecules, unless the maximum allowed similarity is reached first.
+  * **Max number of most diverse molecules**: the maximum N number of diverse molecules to be selected. The diversity selection algorithm will select the most diverse N molecules, unless the maximum allowed similarity is reached first.
  
 If you do not limit the selection, the full collection will be returned ordered by diversity. This means that the top N molecules in the resulting collection will be the most diverse N molecules. The //maximum similarity// found at the Nth molecule in the results will refer to the minimum diversity of the first N molecules (none of them are more similar than that value).
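
The interaction of the two limits described above can be sketched in a few lines of Python. This is only an illustration, not mcule's implementation: the function name `limit_selection`, its parameters, and the `(molecule, max_similarity)` pair layout are assumptions made for the example. The input is taken to be the full collection already ordered by diversity, each entry carrying the //maximum similarity// value reported at that position.

```python
def limit_selection(ranked, max_count=None, max_similarity=None):
    """Cut a diversity-ordered list at whichever limit is reached first.

    ranked: list of (molecule, max_similarity) pairs, most diverse first
            (hypothetical layout, assumed for this sketch).
    max_count: N, the maximum number of molecules to keep (None = no limit).
    max_similarity: S, the similarity threshold (None = no limit).
    """
    selected = []
    for molecule, similarity in ranked:
        # Stop once N molecules have been collected.
        if max_count is not None and len(selected) >= max_count:
            break
        # Stop once adding this molecule would exceed the threshold S.
        if max_similarity is not None and similarity > max_similarity:
            break
        selected.append(molecule)
    return selected


ranked = [("m1", 0.0), ("m2", 0.2), ("m3", 0.5), ("m4", 0.9)]
print(limit_selection(ranked, max_count=3, max_similarity=0.4))
```

With both limits set, the threshold is reached before the count here, so only the first two molecules are returned; with no limits, the whole ordered list comes back.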
Line 28:
 We plan to introduce more descriptors and more similarity measure types in the future.
  
-  * **Molecular descriptor**: select the molecular descriptor used to represent chemical structures during the calculation
+  * **Molecular descriptor**: the molecular descriptor applied for representing chemical structures during the calculation
  
 ==== Default options ====
  
-The default descriptor used is the linear fingerprint implemented in Open Babel ((Open Babel v2.3.90 http://openbabel.sourceforge.net/)), which is similar to Daylight’s fingerprint and Chemaxon’s linear fingerprint, and the Tanimoto coefficient is calculated as the similarity of fingerprints.
+The default descriptor used is the linear fingerprint implemented in OpenBabel ((Open Babel v2.3.90 http://openbabel.sourceforge.net/)), which is similar to Daylight’s fingerprint and ChemAxon’s ((http://www.chemaxon.com/jchem/doc/user/fingerprint.html)) linear fingerprint, and the Tanimoto coefficient is calculated as the similarity of fingerprints.
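
For readers unfamiliar with the Tanimoto coefficient mentioned above, here is a minimal Python sketch. It does not call Open Babel; fingerprints are represented as plain sets of "on" bit positions, which is an assumption made for illustration, as is the function name `tanimoto` and the choice to return 0.0 for two empty fingerprints.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    union = len(fp_a | fp_b)
    if union == 0:
        # Both fingerprints are empty; returning 0.0 is a convention assumed here.
        return 0.0
    # Shared on-bits divided by all on-bits present in either fingerprint.
    return len(fp_a & fp_b) / union


# Two toy fingerprints sharing 2 of 4 distinct on-bits.
print(tanimoto({1, 2, 3}, {2, 3, 4}))  # 0.5
```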
+==== Algorithm ====
  
-If you have no suggestions to use another setup, you can rely on our choices. After implementation and evaluation of new fingerprints and metrics, the default setup can be changed. This can be tracked at the end of this document, in the Changelog section.
+We use an optimized implementation of the stepwise elimination algorithm((R. J. Taylor, J. Chem. Inf. Comput. Sci., 1995, 35, 59-67.)), which can be described as follows:
  
-==== Algorithm ====
+  - Calculate the similarity matrix of the molecules in the input collection
+  - Process the matrix elements as follows:
+    - Select the largest off-diagonal element in the similarity matrix
+    - Eliminate one molecule of the most similar molecule pair randomly
+    - Go to step I. if off-diagonal elements remained
+  - Sort the list of eliminated molecules by similarity values associated to the elimination steps in increasing order
  
-We use an optimized implementation of the stepwise elimination algorithm((R. J. Taylor, J. Chem. Inf. Comput. Sci., 1995, 35, 59-67.)), which can be described as follows:
+During this process, the size of the collection is reduced while the diversity of the collection is increased. Each elimination step filters out one molecule that has close analogues in the remaining set. As a result, the remaining molecules will have a decreased similarity (increased diversity).
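
The stepwise elimination steps listed in this revision can be sketched in Python as follows. This is a naive illustration, not the optimized implementation the page refers to: the function name `stepwise_elimination`, the list-of-lists matrix layout, and the 0.0 placeholder assigned to the last surviving molecule are all assumptions made for the example.

```python
import random


def stepwise_elimination(sim, rng=None):
    """Order molecules by diversity using naive stepwise elimination.

    sim: symmetric n x n similarity matrix as a list of lists (assumed layout).
    rng: optional random.Random used to pick which molecule of a pair to drop.
    Returns (index, similarity) pairs, most diverse first.
    """
    if not sim:
        return []
    rng = rng or random.Random()
    alive = set(range(len(sim)))
    eliminated = []
    while len(alive) > 1:
        # Select the largest off-diagonal element among the remaining molecules.
        value, i, j = max((sim[i][j], i, j) for i in alive for j in alive if i < j)
        # Eliminate one molecule of the most similar pair at random.
        victim = rng.choice((i, j))
        alive.remove(victim)
        eliminated.append((victim, value))
    survivor = next(iter(alive))
    # Sort eliminated molecules by elimination similarity, increasing.
    # Reversing first keeps later eliminations ahead of earlier ones on ties.
    ordered = sorted(reversed(eliminated), key=lambda pair: pair[1])
    # The survivor has no elimination value; 0.0 is a placeholder assumed here.
    return [(survivor, 0.0)] + ordered


sim = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
for index, similarity in stepwise_elimination(sim, random.Random(0)):
    print(index, similarity)
```

In this toy matrix, molecules 0 and 1 form the most similar pair (0.9), so one of them is always eliminated first and ends up last in the output. The sketch rescans all remaining pairs at every step, which is acceptable for a handful of molecules but would be far too slow for collections of the size mentioned on this page.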
  
-  - calculate the similarity matrix of the molecules in the input collection
-  - process the matrix elements as follows:
-    - select the largest off-diagonal element in the similarity matrix
-    - eliminate one molecule of the most similar molecule pair randomly
-    - go to step I. if off-diagonal elements remained
-  - sort the list of eliminated molecules by similarity values associated to the elimination steps in increasing order
+After the algorithm finishes, structures are sorted by similarity values and are placed in the result collection. The first molecules in the resulted collection are the most dissimilar (most diverse) ones. The length of the result list is determined by input parameters: maximum number of compounds and similarity threshold.
+===== Limitations =====
  
-During this process, the size of the collection is reduced and diversity increases. Each elimination step throws out a compound that has close analogues in the remaining set. In result, we get a single compound, and a list of compounds with decreasing similarity values, which can be interpreted as the increasing diversity of the remaining set.
+Diversity selection filter is freely accessible for registered mcule users. Monthly limit is set to 10,000 input molecules. Your usage limits including the number of remaining molecules can be tracked under user profile / limits. To check your user profile click on your user name in the upper right corner on the mcule.com website. If you would like to run the Diversity selection filter for larger collections, please [[support@mcule.com|contact us]]. Our technology allows effective processing of very large collections (~10M).
  
-After the algorithm finishes, structures are sorted by similarity values and are placed in the result collection. The first molecules in the resulted collection are the most dissimilar (most diverse) ones, and the length of the result list is determined using the diversity options: the maximum number of compounds to be selected and the maximum similarity values allowed between diverse compounds.
-
-===== Limitations =====
+The average run time for 10,000 input molecules is about a minute.
  
-The diversity selection is freely accessible for every mcule user with a monthly limit of 10000 input compounds. The average run time for 10000 compounds is about 5 minutes. The usage of your diversity filter can be tracked on the user profile / limits. Our technologies allow effective processing of very large collections (~10M). If you want to exceed your limits, please contact us.
  
 ===== Changelog =====