Volume 22, Issue 9-10 pp. 1006-1026
Review

Approaches to Measure Chemical Similarity – a Review

Nina Nikolova

Procter and Gamble, Eurocor, Central Product Safety, 100 Temselaan, B-1853 Strombeek-Bever, Belgium, Fax 32 2 5683098, Tel 32 2 456 2076, Tel 32 2 456 801

Joanna Jaworska

Procter and Gamble, Eurocor, Central Product Safety, 100 Temselaan, B-1853 Strombeek-Bever, Belgium, Fax 32 2 5683098, Tel 32 2 456 2076, Tel 32 2 456 801

First published: 23 January 2004
Citations: 324

Abstract

Although the concept of similarity is convenient for humans, a formal definition of similarity between chemical compounds is needed to enable automatic decision-making. The objective of similarity measures in toxicology and drug design is to allow assessment of chemical activities. The ideal similarity measure should be relevant to the activity of interest. The relevance could be established by exploiting knowledge of the fundamental chemical and biological processes responsible for the activity. Unfortunately, this knowledge is rarely available, and therefore different approximations have been developed based on similarity between structures or descriptor values. Various methods are reviewed, ranging from two-dimensional, three-dimensional and field approaches to recent methods based on “Atoms in Molecules” theory. All these methods attempt to describe chemical compounds by a set of numerical values and define some means of comparing them. The review analyzes potential pitfalls of this methodology: the loss of information in the representations of molecular structures, and the relevance of a particular representation and chosen similarity measure to the activity. A brief review of known methods for descriptor selection is also provided. The popular “neighborhood behavior” principle is criticized, since proximity with respect to descriptors does not necessarily mean proximity with respect to activity. Structural similarity should also be used with care, since, as examples show, it does not always imply similar activity. We remind the reader that similarity measures and classification techniques based on distances rely on certain data distribution assumptions. If these assumptions are not satisfied for a given dataset, the results could be misleading. A discussion of similarity in descriptor space in the context of applicability domain assessment of QSAR models is also provided.
Finally, it is shown that descriptor-based similarity analysis is prone to errors if the relationship between the activity and the descriptors has not been previously established. A justification for the use of a particular similarity measure should be provided for every specific activity, either by expert knowledge or by data modeling techniques.
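
To make concrete the idea of describing compounds by numerical values and then comparing them, the following minimal sketch computes two common measures: a Tanimoto coefficient on binary structural fingerprints and a Euclidean distance on descriptor vectors. Neither measure is named in the abstract; they are chosen here only as familiar illustrations. The function names, fingerprint bits, and descriptor values are all hypothetical and do not come from the article.

```python
import math


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two sets of 'on' fingerprint bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Assumed convention: two empty fingerprints are treated as identical.
        return 1.0
    return len(a & b) / union


def euclidean(x, y):
    """Euclidean distance between two descriptor vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("descriptor vectors must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Hypothetical compounds: fingerprint bits and three made-up descriptor values.
fp_a, fp_b = {1, 4, 7, 9}, {1, 4, 8, 9}
desc_a, desc_b = [1.2, 0.5, 3.0], [1.0, 0.9, 3.0]

structural_similarity = tanimoto(fp_a, fp_b)      # 3 shared bits out of 5 distinct bits
descriptor_distance = euclidean(desc_a, desc_b)
print(structural_similarity, descriptor_distance)
```

Both numbers quantify closeness in a chosen representation only. As the abstract stresses, neither says anything about closeness in activity unless the link between that representation and the activity has been established.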