Simple Similarity Coefficient


This is one of a series of articles on Genealogical Methods, prepared in association with The Tapestry. See Index for a list of related articles.
__________________________


Background

Similarity coefficients are used to compare two samples and measure how similar they are to each other. Such coefficients could be used, for example, in YDNA testing to see how similar two YDNA test kits are. There are a number of different types of similarity coefficients, each designed to assess a different type of information recorded in the samples. The YDNA data used by genealogists to determine whether two people share a relatively recent male ancestor is of a type known as "Ranked data". For such data the "Simple Similarity Index" is the appropriate measure.

Calculation

The similarity coefficient (Sf) for comparing two YDNA test kits is calculated as:

Sf=(N-D)/N

where

N = the number of markers examined (must be the same markers in each kit)
D = the total difference between the two kits, that is, the sum over all markers of the absolute value of the difference between the marker values in the two samples

A physical interpretation of such a similarity coefficient is that it shows the fraction of the markers for which two kits share common values (exactly so when every mismatched marker differs by a single step). When expressed as a percentage, two kits that shared exactly the same values for all markers tested would have a similarity of 100%. If they differed by one marker out of 67, their similarity would be 98.5%=(67-1)/67. If they differed on each and every marker their similarity would be 0%=(67-67)/67.
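The calculation can be sketched in a few lines of Python. This is only an illustration: the function name `similarity_coefficient` and the two 67-marker kits are made up for the example, and D is taken to be the summed absolute difference in marker values.

```python
def similarity_coefficient(kit_a, kit_b):
    """Return Sf = (N - D) / N for two kits given as equal-length
    sequences of marker values, listed in the same marker order."""
    if len(kit_a) != len(kit_b):
        raise ValueError("both kits must report the same markers")
    n = len(kit_a)
    if n == 0:
        raise ValueError("at least one marker is required")
    # D: absolute difference in marker values, summed over all markers
    d = sum(abs(a - b) for a, b in zip(kit_a, kit_b))
    return (n - d) / n


# Two hypothetical 67-marker kits that differ by one step on one marker
kit_a = [12] * 67
kit_b = [12] * 66 + [13]
print(f"{similarity_coefficient(kit_a, kit_b):.1%}")  # 98.5%
```

Under this reading of D, a marker that differs by several steps contributes more than 1, so a very distant pair of kits could produce a value below zero.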


In the following example YDNA markers for Kit A are compared to those of Kit B in a 12 marker test:

Marker  Kit A  Kit B  Single Step Difference
1 13 13 0
2 25 24 1
3 14 14 0
4 10 10 0
5 11 11 0
6 13 13 0
7 12 12 0
8 12 12 0
9 12 12 0
10 13 13 0
11 14 14 0
12 29 29 0
Total Difference  1

In this case the similarity coefficient is given as

Sf=(12-1)/12=11/12=0.917

What that tells us is that the marker values for the two kits are 91.7% similar to each other.

It is sometimes useful to express this as a Dissimilarity Coefficient:

Df=1-Sf=1.0-0.917=0.083

In this case, this tells us that the two kits differ from each other by 8.3%.

Ultimately, the difference between similarity and dissimilarity coefficients is that the one tells us how similar two kits are to each other, and the other tells us how dissimilar they are.
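The 12 marker example can be reproduced with a short Python sketch. The marker values are those listed for Kit A and Kit B in the table; the variable names are chosen only for illustration.

```python
# Marker values for markers 1 to 12, as listed in the example table
kit_a = [13, 25, 14, 10, 11, 13, 12, 12, 12, 13, 14, 29]
kit_b = [13, 24, 14, 10, 11, 13, 12, 12, 12, 13, 14, 29]

n = len(kit_a)                                      # N, markers examined
steps = [abs(a - b) for a, b in zip(kit_a, kit_b)]  # per-marker difference
d = sum(steps)                                      # D, total difference
sf = (n - d) / n                                    # similarity coefficient
df = 1 - sf                                         # dissimilarity coefficient

print(f"N={n} D={d} Sf={sf:.3f} Df={df:.3f}")
# N=12 D=1 Sf=0.917 Df=0.083
```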

Criteria

Analysis of YDNA results assumes that, on average, the larger the difference between two kits, the longer it has been since they shared a common ancestor. To make use of YDNA results in furthering their understanding of their family history, genealogists seek to identify persons (perhaps persons with a better paper trail to their ancestor) who are closely related to themselves in terms of their YDNA. Typically, the closer the YDNA data matches, the more recently their common ancestor was living.[1] The number of mutations that have occurred in a person's YDNA over a given period of time has a strong probabilistic component. On average, any given marker can be expected to have a mutation at a rate of roughly 0.02% per generation.


The time since two kits shared a common ancestor is commonly measured in generations, though it can be converted into a "Time to Most Recent Common Ancestor" by assuming an average number of years per generation. An average of 25 years per generation is commonly used, though averages of as little as 20 years are sometimes used.
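The conversion from generations to years is a simple multiplication, sketched below. The function name and the figure of 10 generations are hypothetical, chosen only to show how much the assumed generation length changes the answer.

```python
def generations_to_years(generations, years_per_generation=25):
    """Convert a count of generations into years, given an assumed
    average generation length (25 years unless stated otherwise)."""
    return generations * years_per_generation


# A hypothetical estimate of 10 generations under both common assumptions
for years_per_generation in (25, 20):
    years = generations_to_years(10, years_per_generation)
    print(f"{years_per_generation} years per generation: {years} years")
```

For the same 10 generations, the two assumptions give 250 and 200 years respectively.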

Non-Surname Matches

The table below gives values of the Dissimilarity Coefficient in terms of the number of markers examined (N) and the number of markers-off (D). It makes no requirement for the kits to share a common surname.

Markers  Number of mismatched markers or "markers-off" (D)
N 0 1 2 3 4 5 6 7 8 9 10 11 12
12 0.000 0.083 0.167 0.250 0.333 0.417 0.500 0.583 0.667 0.750 0.833 0.917 1.000
25 0.000 0.040 0.080 0.120 0.160 0.200 0.240 0.280 0.320 0.360 0.400 0.440 0.480
37 0.000 0.027 0.054 0.081 0.108 0.135 0.162 0.189 0.216 0.243 0.270 0.297 0.324
67 0.000 0.015 0.030 0.045 0.060 0.075 0.090 0.104 0.119 0.134 0.149 0.164 0.179
111 0.000 0.009 0.018 0.027 0.036 0.045 0.054 0.063 0.072 0.081 0.090 0.099 0.108
Color Key:
Very Tightly Related
Tightly Related
Related
Probably Related
Only Possibly Related
Not Related
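Each value in a table of this kind is simply D/N rounded to three decimal places, so the grid can be regenerated or extended to other marker counts. The sketch below assumes that reading; the names `MARKER_COUNTS`, `MAX_MARKERS_OFF`, and `dissimilarity_table` are illustrative, and the relatedness labels are not reproduced.

```python
MARKER_COUNTS = [12, 25, 37, 67, 111]  # values of N used as table rows
MAX_MARKERS_OFF = 12                   # largest D used as a table column


def dissimilarity_table(marker_counts=MARKER_COUNTS, max_off=MAX_MARKERS_OFF):
    """Map each N to the list of D/N values for D = 0..max_off."""
    return {
        n: [round(d / n, 3) for d in range(max_off + 1)]
        for n in marker_counts
    }


table = dissimilarity_table()
print("N   " + " ".join(f"{d:>5}" for d in range(MAX_MARKERS_OFF + 1)))
for n, row in table.items():
    print(f"{n:<4}" + " ".join(f"{value:.3f}" for value in row))
```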

Surnames in Common

See: Ancestry FAQ's. Based on FTDNA characterizations of the strength of the relationships for matches in which both descendants share the same surname.

Markers  Number of mismatched markers or "markers-off" (D)
N 0 1 2 3 4 5 6 7 8 9 10 11 12
12 0.000 0.083 0.167 0.250 0.333 0.417 0.500 0.583 0.667 0.750 0.833 0.917 1.000
25 0.000 0.040 0.080 0.120 0.160 0.200 0.240 0.280 0.320 0.360 0.400 0.440 0.480
37 0.000 0.027 0.054 0.081 0.108 0.135 0.162 0.189 0.216 0.243 0.270 0.297 0.324
67 0.000 0.015 0.030 0.045 0.060 0.075 0.090 0.104 0.119 0.134 0.149 0.164 0.179
111 0.000 0.009 0.018 0.027 0.036 0.045 0.054 0.063 0.072 0.081 0.090 0.099 0.108
Very Tightly Related
Tightly Related
Related
Probably Related
Possibly Related
Probably Not Related
Not Related

Based on the above, but adjusted for use on WeRelate[2]:

Markers  Number of mismatched markers or "markers-off" (D)
N 0 1 2 3 4 5 6 7 8 9 10 11 12
12 0.000 0.083 0.167 0.250 0.333 0.417 0.500 0.583 0.667 0.750 0.833 0.917 1.000
25 0.000 0.040 0.080 0.120 0.160 0.200 0.240 0.280 0.320 0.360 0.400 0.440 0.480
37 0.000 0.027 0.054 0.081 0.108 0.135 0.162 0.189 0.216 0.243 0.270 0.297 0.324
67 0.000 0.015 0.030 0.045 0.060 0.075 0.090 0.104 0.119 0.134 0.149 0.164 0.179
111 0.000 0.009 0.018 0.027 0.036 0.045 0.054 0.063 0.072 0.081 0.090 0.099 0.108
Very Tightly Related
Tightly Related
Related
Probably Related
Possibly Related
Probably Not Related
Not Related

Advantage

The advantage of using similarity coefficients to compare YDNA results is that they automatically adjust for differences in the number of markers being tested. In effect, these coefficients normalize the data so that 12, 25, 37, 67, and 111 marker kits can be readily compared. This is of particular value when comparing more than two kits at once, especially where a variety of kits with different N values are concerned. There are two specific situations where the use of similarity coefficients is useful in the context of evaluating YDNA results:

1. When trying to identify other kits that share a similar YDNA signature with one specific kit
2. When trying to place numerous YDNA kits into groups in which each kit shares a relatively recent common ancestor.
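Both situations can be sketched in Python. Everything here is an assumption made for illustration: the kits and marker labels are invented, kits are compared only on the markers they both report, the grouping rule is a simple greedy one, and the threshold of 0.25 is arbitrary rather than a recommended cut-off.

```python
def dissimilarity(kit_a, kit_b):
    """Df = D / N over the markers both kits report.
    Kits are dicts mapping a marker label to its value."""
    shared = sorted(set(kit_a) & set(kit_b))
    if not shared:
        raise ValueError("kits share no markers")
    d = sum(abs(kit_a[m] - kit_b[m]) for m in shared)
    return d / len(shared)


def rank_matches(target, others):
    """Situation 1: list other kits from most to least similar to target."""
    scores = {name: dissimilarity(target, kit) for name, kit in others.items()}
    return sorted(scores.items(), key=lambda item: item[1])


def group_kits(kits, threshold):
    """Situation 2: add each kit to the first group in which it is within
    `threshold` of every member; otherwise start a new group."""
    groups = []
    for name, kit in kits.items():
        for group in groups:
            if all(dissimilarity(kit, kits[m]) <= threshold for m in group):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Hypothetical kits: A reports 4 markers, the others report 6
labels = [f"marker_{i}" for i in range(1, 7)]
kits = {
    "A": dict(zip(labels, [13, 25, 14, 10])),
    "B": dict(zip(labels, [13, 24, 14, 10, 11, 13])),
    "C": dict(zip(labels, [13, 25, 14, 10, 11, 13])),
    "D": dict(zip(labels, [14, 26, 15, 11, 11, 13])),
}

others = {name: kit for name, kit in kits.items() if name != "A"}
print(rank_matches(kits["A"], others))  # [('C', 0.0), ('B', 0.25), ('D', 1.0)]
print(group_kits(kits, threshold=0.25))  # [['A', 'B', 'C'], ['D']]
```

Because each score is divided by the number of markers actually compared, the 4 marker kit and the 6 marker kits end up on the same scale.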



Most treatments of YDNA data focus on the number of markers by which two kits differ from each other. Similarity and Dissimilarity coefficients can be readily converted to provide a representation of the difference between two kits, in terms of "Markers-off". The following table shows the relationships involved.
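Since Df=D/N, the conversion back to markers-off is D=Df×N, or equivalently D=(1-Sf)×N. A minimal sketch follows; the function name `markers_off` and its `is_similarity` flag are hypothetical, and the rounding step is there to absorb coefficients that were quoted to three decimal places.

```python
def markers_off(coefficient, n, is_similarity=False):
    """Recover D (markers-off) from a coefficient and the marker count N.
    Pass is_similarity=True when the coefficient is Sf rather than Df."""
    df = 1 - coefficient if is_similarity else coefficient
    # Round to the nearest whole marker to undo three-decimal rounding
    return round(df * n)


print(markers_off(0.083, 12))                      # 1
print(markers_off(0.917, 12, is_similarity=True))  # 1
print(markers_off(0.104, 67))                      # 7
```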


Footnotes

  1. This is a broad generalization. YDNA results are driven by random mutations. As a result, the YDNA of some individuals descended from a specific ancestor may have accumulated more or fewer mutations than that of another individual descended from the same ancestor. Consequently, using the number of mutations that have occurred to predict either the number of generations or the time to the most recent common ancestor (TMRCA) is not precise.
  2. Adjustments made such that similarity values are consistent with labels across different values of N.