Data Science

Implementing the five most popular Similarity Measures in Python

The buzz term similarity distance measures has a variety of definitions among math and data mining practitioners. As a result, concepts involving the term and it’s usage can go right over the heads of beginners. So today, I write this post to give simplified and intuitive definitions of similarity measures, as well as to dive into the implementation of five of the most popular similarity measures.


Similarity is the basic building block for techniques such as Recommendation engines, clustering, classification and anomaly detection.

Similarity is the measure of how alike two data objects are. Similarity in the data mining context is usually described as a distance with dimensions representing features of the objects. If this distance is small, there will be high degree of similarity; if this distance is large, there will be low degree of similarity. Similarity is subjective and is highly dependent on the domain and application. For example, two fruits can be similar because of their color, size or taste. Care should be taken when calculating distance across dimensions/features that are unrelated. The relative values of each feature must be normalized, or one feature could end up dominating the distance calculation. Similarity is measured in the range 0 to 1 [0,1].

Two main considerations about similarity are as follows:

  • Similarity = 1 if X = Y         (Where X, Y are two objects)
  • Similarity = 0 if X ≠ Y

That’s all about similarity. So let’s dive into implementing five most popular similarity distance measures.
Eucledian Distance:
eucledian distance
Euclidean distance is the most commonly-used of our distance measures. Because of this, Eucledian distance is often just referred to as “distance”. When data is dense or continuous, this is the best proximity measure. The Euclidean distance between two points is the length of the path connecting them. This distance between two points is given by the Pythagorean theorem.
Euclidean distance implementation in Python:
python code and similarities
euclidean distance in python

Manhattan Distance

manhattan distance in python
Manhattan distance is a metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates. In simple language, it is the absolute sum of the difference between the x-coordinates  and y-coordinates of each of the points. Suppose we have a point A and a point B. To find the manhattan distance between them, we just have to sum up the absolute variation along the x and y axes. We find manhattan distance between two points by measuring along axis at right angles.
In a plane with p1 at (x1, y1) and p2 at (x2, y2).
Manhattan distance = |x1 – x2| + |y1 – y2|
This Manhattan distance metric is also known as Manhattan length, rectilinear distance, L1 distance, L1 norm ,city block distance, Minkowski’s L1 distance, taxi cab metric, or city block distance.

Manhattan distance implementation in python:

manhattan distance in python
manhattan distance in python

Minkowski distance:

minkowski distance in python
The Minkowski distance is a generalized metric form of Euclidean distance and Manhattan distance. It looks like this:
minkowski distance in python
In the equation d^MKD is the Minkowski distance between the data record i and j, k the index of a variable, n the total number of variables y and λ the order of the Minkowski metric. Although it is defined for any λ > 0, it is rarely used for values other than 1, 2 and ∞.
Different names for the Minkowski difference arise from the synonyms of other measures:

  • λ = 1 is the Manhattan distance. Synonyms are L1-Norm, Taxicab or City-Block distance. For two vectors of ranked ordinal variables the Manhattan distance is sometimes called Foot-ruler distance.
  • λ = 2 is the Euclidean distance. Synonyms are L2-Norm and Ruler distance. For two vectors of ranked ordinal variables the Euclidean distance is sometimes called Spear-man distance.
  • λ = ∞ is the Chebyshev distance. Synonyms are Lmax-Norm or Chessboard distance.

 Minkowski distance implementation in python:

minkowski distance in python
script 3

Cosine Similarity

cosine similarity in python
Cosine similarity metric finds the normalized dot product of the two attributes. By determining the cosine similarity, we will effectively try to find the cosine of the angle between the two objects. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0,1]. One of the reasons for the popularity of cosine similarity is that it is very efficient to evaluate, especially for sparse vectors.

Cosine implementation in Python 

cosine similarity implementation in python
script 4

Jaccard Similarity:

jaccard similarity in python
So far, we’ve discussed some metrics to find the similarity between objects. where the objects are points or vectors. We use jaccard similarity to find similarity between sets. So first, let’s learn the very basics of sets.
A set is (unordered) collection of objects {a,b,c}. We use the notation as elements separated by commas inside curly brackets { }. They are unordered so {a,b} = { b,a }.
Cardinality of A (denoted by |A|) counts how many elements are in A.
Intersection between two sets A and B is denoted A ∩ B and reveals all items which are in both sets A,B.
Union between two sets A and B is denoted A B and reveals all items which are in either set.
jaccard similarity in python
Now going back to Jaccard similarity. The Jaccard similarity measures similarity between finite sample sets, and is defined as the cardinality of the intersection of sets divided by the cardinality of the union of the sample sets. Suppose you want to find jaccard similarity between two sets A and B, it is the ration of cardinality of A ∩ B and A ∪ B.
jaccard similarity implementation
jaccard implementation

Implementation of all 5 similarity measure into one Similarity class:

similarity measures in one similarity class
similarity measurement all in one
all in one similarity class
The code from this post can also be found on Github, and on the Dataaspirant blog.
Photo credit: t3rmin4t0r / Foter / CC BY

  1. Hi there to every one, the contents present at this website are really amazing for people experience, well, keep up
    the good work fellows.

  2. It’s actually a great and useful piece of info.
    I’m happy that you just shared this helpful information with
    us. Please stay us up to date like this. Thank you for sharing.

  3. natalielise 1 year ago

    Hey there, You have done a fantastic job.
    I’ll certainly digg it and personally suggest to my friends.
    I’m confident they’ll be benefited from this web site.
    pof natalielise

  4. was coconut oil 11 months ago

    It’s remarkable for me to have a site, which is helpful designed for my
    knowledge. thanks admin

  5. 11 months ago

    It’s really very complex in this full of activity life to listen news on Television, therefore I just use web for that
    reason, and obtain the most recent news.

  6. 11 months ago

    Pretty! This was an incredibly wonderful article. Thanks for providing this info.

  7. quest bars cheap 11 months ago

    Ahaa, its fastidious conversation regarding this piece of writing here at
    this webpage, I have read all that, so at this time me also commenting at
    this place.

  8. ps4 games 11 months ago

    I do not know whether it’s just me or if everyone else encountering problems with your blog.
    It appears like some of the written text in your posts are running off the
    screen. Can someone else please provide feedback and let me
    know if this is happening to them as well? This may be a problem with my internet browser
    because I’ve had this happen before. Kudos

  9. ps4 games 10 months ago

    It is actually a nice and helpful piece of information. I am happy that you shared this useful information with us.
    Please keep us up to date like this. Thank
    you for sharing.

Leave a Comment

Your email address will not be published.

You may also like

Pin It on Pinterest