Interpreting a Similarity Matrix


This is one of a series of articles on Genealogical Methods, prepared in association with The Tapestry. See Index for a list of related articles.
__________________________


Similarity Matrices

Similarity matrices are used in a variety of fields of study to compare "populations". The objective is to see how similar or how different two populations are. There are a number of different similarity coefficients, used for slightly different purposes under slightly different circumstances. [1] As an example, population biologists might use similarity techniques to see how different two communities of plants are. In practice, they would identify the species present in each community, and then use a similarity index to see how many species the two communities have in common, and how many are unique to one or the other. Usually, these coefficients have values ranging from 0.0 (sometimes expressed as 0%) to 1.0 (expressed as 100%), where 1.0 means the two communities are identical, and 0.0 means they are not at all alike. A value lying somewhere in between means the two communities share some species, but there are some unique species in one community or the other, and it is those unique species that set them apart.

As an example, if you had a total of 20 species in two separate communities, with seven species unique to each community and 6 species common to both, then, depending on the similarity coefficient used, you'd get a similarity coefficient of about 30% (6 shared species out of 20 in total).
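
The arithmetic above can be sketched for the two coefficients named in footnote 1. This is only an illustration: the species labels are hypothetical placeholders, and only the counts (seven unique species in each community, six shared) come from the example.

```python
# Hypothetical species labels; only the counts come from the example above.
shared = {f"shared_{i}" for i in range(6)}
community_a = shared | {f"a_only_{i}" for i in range(7)}
community_b = shared | {f"b_only_{i}" for i in range(7)}


def jaccard(a, b):
    """Shared species divided by all species found in either community."""
    return len(a & b) / len(a | b)


def sorensen_dice(a, b):
    """Twice the shared species divided by the sum of the two species counts."""
    return 2 * len(a & b) / (len(a) + len(b))


print(f"Jaccard:       {jaccard(community_a, community_b):.3f}")
print(f"Sorensen-Dice: {sorensen_dice(community_a, community_b):.3f}")
```

The Jaccard index reproduces the 6/20 figure, while the Sørensen–Dice coefficient gives a somewhat higher value (12/26) for the same data, which is why the choice of coefficient matters.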

That's pretty straightforward, though it can get a bit more complex. It might turn out that the species common to both communities were very rare, and that the unique species were commonplace within each community. Under those circumstances you might want to discount the common species, perhaps by using their respective population sizes to "weight" their presence in the community. That would have the effect in this case of driving the similarity coefficient toward 0, indicating that the communities were very dissimilar to each other. Of course, in such an instance the answer to the question "are there two different communities here?" would be so obvious that you wouldn't bother with the calculations! After all, if the two communities are a grassland lying adjacent to a woodland, it's pretty obvious that they are different communities. But it's not always so obvious, which is where similarity coefficients really come into play.
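
One possible way to weight by population size is the weighted Jaccard (Ružička) similarity, sketched below. The text does not say which weighting a biologist would actually choose, so treat this as one option among several. The abundances are invented for illustration: each shared species is represented by a single individual, and each unique species by 100.

```python
def weighted_jaccard(abund_a, abund_b):
    """Sum of per-species minimum abundances over the sum of maximum abundances."""
    species = set(abund_a) | set(abund_b)
    overlap = sum(min(abund_a.get(s, 0), abund_b.get(s, 0)) for s in species)
    total = sum(max(abund_a.get(s, 0), abund_b.get(s, 0)) for s in species)
    if total == 0:
        raise ValueError("both communities are empty")
    return overlap / total


# Hypothetical abundances: shared species are rare, unique species are common.
rare_shared = {f"shared_{i}": 1 for i in range(6)}
abund_a = {**rare_shared, **{f"a_only_{i}": 100 for i in range(7)}}
abund_b = {**rare_shared, **{f"b_only_{i}": 100 for i in range(7)}}

print(f"Weighted similarity: {weighted_jaccard(abund_a, abund_b):.4f}")
```

With these made-up numbers the similarity drops from 0.30 (unweighted) to well under 0.01, which is the "driving toward 0" effect described above.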

Genetic Genealogy

Similarity coefficients can be used for other kinds of populations than those involved in community ecology. In the present instance, they can be applied to genetic genealogy. In particular, we can use them to evaluate whether the YDNA of two people indicates that they have a common ancestor in a genealogically meaningful timeframe (GMT). Typically, when we look at YDNA we take a test that tells us the value of certain genetic markers. That value changes through time as mutations accumulate from generation to generation. When we compare the results of two people's YDNA tests we simply count the number of differences among their various markers. The more differences there are, the more mutations have occurred; the more mutations have occurred, the further back it is until the two people shared a common ancestor. As with the plant community example above, we can compare the two "populations" using similarity coefficients. As an example, if two people both take a 111 marker test [2] and differ from each other on 4 of the 111 markers, a simple dissimilarity coefficient of 4/111 = 0.036, or 3.6%, would result.

A test result of 3.6% dissimilarity would indicate that the two test results (aka "kits") were very similar to each other, and that, given the typical rate of mutation, not many generations had elapsed since the two individuals shared a common ancestor. The larger the dissimilarity, the more generations have passed since that common ancestor. How many generations have passed? That's not so clear. The mutation rates on which this is based are only averages of a series of random events. Given enough events they work out to a reasonably constant rate, but in specific cases they can be quite variable. Thus we may be able to say that two kits differ by 4 markers out of 111, but for any two family lines the amount of time that has elapsed since the common ancestor lived can only be predicted approximately. FTDNA has provided tables that translate various "markers off" counts for the different marker tests into a gauge of whether two people share a relatively recent common ancestor. Here's a summary:
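
As a minimal sketch of that count, assume each kit is simply a list of marker values given in the same marker order. The values below are placeholders, not real test results, and each differing marker is counted once regardless of how far apart the two values are.

```python
def dissimilarity(kit_a, kit_b):
    """Return (markers off, fraction of markers that differ) for two kits."""
    if len(kit_a) != len(kit_b):
        raise ValueError("both kits must list the same markers")
    markers_off = sum(1 for a, b in zip(kit_a, kit_b) if a != b)
    return markers_off, markers_off / len(kit_a)


# Placeholder data: two 111-marker kits that differ on four markers.
kit_a = [13] * 111
kit_b = list(kit_a)
for position in (3, 40, 77, 102):  # arbitrary positions chosen for illustration
    kit_b[position] += 1

off, fraction = dissimilarity(kit_a, kit_b)
print(f"{off} markers off out of {len(kit_a)}: {fraction:.1%}")
```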

The following table is based on FTDNA's FAQ:919.


| Characterization | Y-DNA12 | Y-DNA25 | Y-DNA37 | Y-DNA67 | Y-DNA111 | FTDNA's Suggested Interpretation |
| --- | --- | --- | --- | --- | --- | --- |
| Very Tightly Related | N/A | N/A | 0 | 0 | 0 | Your exact match means your relatedness is extremely close. Few people achieve this close level of a match. All confidence levels are well within the time frame that surnames were adopted in Western Europe. |
| Tightly Related | N/A | N/A | 1 | 1-2 | 1-2 | Few people achieve this close level of a match. All confidence levels are well within the time frame that surnames were adopted in Western Europe. |
| Related | 0 | 0-1 | 2-3 | 3-4 | 3-5 | Your degree of matching is within the range of most well-established surname lineages in Western Europe. If you have tested with the Y-DNA12 or Y-DNA25 test, you should consider upgrading to additional STR markers. Doing so will improve your time to common ancestor calculations. |
| Probably Related | 1 | 2 | 4 | 5-6 | 6-7 | Without additional evidence, it is unlikely that you share a common ancestor in recent genealogical times (1 to 6 generations). You may have a connection in more distant genealogical times (less than 15 generations). If you have traditional genealogy records that indicate a relationship, then by testing additional individuals you will either prove or disprove the connection. |
| Only Possibly Related | 2 | 3 | 5 | 7 | 8-10 | It is unlikely that you share a common ancestor in genealogical times (1 to 15 generations). Should you have traditional genealogy records that indicate a relationship, then by testing additional individuals you will either prove or disprove the connection. A careful review of your genealogical records is also recommended. |
| Not Related | 3 | 4 | 6 | >7 | >10 | You are not related on your Y-chromosome lineage within recent or distant genealogical times (1 to 15 generations). |
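
The table can be encoded as a small lookup, sketched below. The limits are copied from the table; the function name and data layout are illustrative only. One assumption is made: any count at or above the "Not Related" entry for a test (for example 3 or more at 12 markers) is treated as "Not Related".

```python
CATEGORIES = [
    "Very Tightly Related",
    "Tightly Related",
    "Related",
    "Probably Related",
    "Only Possibly Related",
]

# Largest "markers off" count allowed in each category, per test size.
# None marks the N/A cells of the table.
MAX_OFF = {
    12: [None, None, 0, 1, 2],
    25: [None, None, 1, 2, 3],
    37: [0, 1, 3, 4, 5],
    67: [0, 2, 4, 6, 7],
    111: [0, 2, 5, 7, 10],
}


def characterize(markers_tested, markers_off):
    """Look up the table's characterization for one pair of kits."""
    if markers_off < 0:
        raise ValueError("markers_off cannot be negative")
    for category, limit in zip(CATEGORIES, MAX_OFF[markers_tested]):
        if limit is not None and markers_off <= limit:
            return category
    # Assumption: everything beyond the last limit counts as "Not Related".
    return "Not Related"


print(characterize(111, 4))
print(characterize(12, 1))
```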


Comparison of Kits

People can, and do, make use of "rules of thumb", such as those embedded in the above FTDNA table, to gauge how closely two kits might be related. It's not a precision approach, but the nature of the data does not really lend itself to precision, at least at the marker levels currently available to test takers at FTDNA. [3] Interpreting results with relationships such as those shown above is fairly straightforward. The bottom line is that they allow you to judge whether two test takers are "very tightly related" or "not related at all", with varying degrees of "tightness" in between. There is, however, a problem with using such rules of thumb. In particular, it's difficult to directly compare pairs of kits that have tested at different marker levels. [4] If you want to compare multiple kits at once, what do you do when some kits test at 12 markers and others at 111 markers? It makes quite a bit of difference whether a pair of kits differs by 1 marker in a 12 marker test or by 1 marker in a 111 marker test. If you're only comparing a few kits, you can accommodate those distinctions as you go along, but if you're trying to compare dozens of kits (say, in order to find your best matches, or to examine fine distinctions in lineages), this gets to be fairly cumbersome.

One way to deal with that problem is to make use of FTDNA's TMRCA (time to most recent common ancestor) and Genetic Distance estimates using their TIP calculator. This approach is a form of "normalization", which allows you to compare results of kits of different sizes directly, without keeping in your head the distinction that "this kit is 12 markers, and this is 37 markers, and this is 111 markers". Unfortunately, those calculations require that you assume an average generation period. [5] Since generation time can vary considerably, using an average value can distort the results. In addition, TMRCA and Genetic Distance estimates are problematic for doing genealogy. In the case of TMRCA, you usually end up with a fairly broad period in which your ancestor might have been born (plus or minus 200 years is not uncommon); knowing that your ancestor was probably born between 1700 and 1900 is not exactly useful information.

One way to get around this kind of problem is to use a different form of normalization, one which avoids adding an additional variable to what is already pretty complex data. There are different approaches to that, but the approach currently being used to compare any two kits is simply to divide the number of markers on which the two kits differ by the number of markers that both kits have in common (that is, the markers tested in both kits). That value is the dissimilarity coefficient referred to above.

Terms like "dissimilarity coefficient" can be a bit off-putting, but all this really amounts to is measuring the percentage of the markers that differ between any two kits. It's the "percent difference". When it's expressed as a percentage, it doesn't matter (so much) how many markers were being compared. Two kits differing by about 13% on 37 markers are about the same as two other kits differing by about 13% on 67 markers. They have a different number of markers by which they differ (5/37 on the one, and 9/67 on the other), but the percentage difference is about the same.

So what does a percent difference actually tell us? FTDNA's categorical scale, mentioned above, allows us to interpret percent differences fairly easily. One might argue over the fine distinctions that they make between terms like "possibly related" and "probably not related", but in general, they give us a fairly straightforward approach to interpreting differences in the number of markers by which various kits differ.

The following is based on their FAQ 919. [6]
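
A minimal sketch of this normalization follows, assuming each kit is stored as a mapping from marker name to value so that kits of different sizes can be compared on the markers they share. The marker names M1 to M6, the values, and the kit labels are all hypothetical.

```python
from itertools import combinations


def dissimilarity(kit_a, kit_b):
    """Markers that differ divided by the markers tested in both kits."""
    common = kit_a.keys() & kit_b.keys()
    if not common:
        raise ValueError("the kits have no markers in common")
    markers_off = sum(1 for marker in common if kit_a[marker] != kit_b[marker])
    return markers_off / len(common)


# Hypothetical kits tested at different marker levels.
kits = {
    "kit_1": {"M1": 13, "M2": 24, "M3": 14, "M4": 11},
    "kit_2": {"M1": 13, "M2": 25, "M3": 14, "M4": 11, "M5": 12, "M6": 29},
    "kit_3": {"M1": 13, "M2": 24, "M3": 15},
}

# One entry per pair of kits: the upper triangle of the matrix.
matrix = {
    (a, b): dissimilarity(kits[a], kits[b])
    for a, b in combinations(sorted(kits), 2)
}

for (a, b), value in matrix.items():
    print(f"{a} vs {b}: {value:.1%}")
```

Because every pair is reduced to a single fraction, dozens of kits can be laid out in one matrix without tracking which test each kit took.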

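Dissimilarity (D/N) by number of markers tested (N, rows) and number of mismatched markers or "markers-off" (D, columns)

| N | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12 | * | 0.083 | 0.167 | 0.250 | 0.333 | 0.417 | 0.500 | 0.583 | 0.667 | 0.750 | 0.833 | 0.917 | 1.000 |
| 25 | * | 0.040 | 0.080 | 0.120 | 0.160 | 0.200 | 0.240 | 0.280 | 0.320 | 0.360 | 0.400 | 0.440 | 0.480 |
| 37 | 0.000 | 0.027 | 0.054 | 0.081 | 0.108 | 0.135 | 0.162 | 0.189 | 0.216 | 0.243 | 0.270 | 0.297 | 0.324 |
| 67 | 0.000 | 0.015 | 0.030 | 0.045 | 0.060 | 0.075 | 0.090 | 0.104 | 0.119 | 0.134 | 0.149 | 0.164 | 0.179 |
| 111 | 0.000 | 0.009 | 0.018 | 0.027 | 0.036 | 0.045 | 0.054 | 0.063 | 0.072 | 0.081 | 0.090 | 0.099 | 0.108 |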

*FTDNA considers two kits that differ by zero markers at 12 or 25 markers to be "related". Kit comparisons at 0 markers difference for 37+ markers are scored as "very tightly related".
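
The values in the table are just D divided by N, rounded to three places, so the table can be regenerated with a few lines. This sketch prints 0.000 in the two starred cells; the special handling of zero differences at 12 and 25 markers is a matter of interpretation, not arithmetic.

```python
TEST_SIZES = (12, 25, 37, 67, 111)  # marker levels listed in footnote 2
MAX_OFF = 12                        # largest "markers off" column in the table


def fraction_table():
    """Map each test size N to the list of D/N values for D = 0..MAX_OFF."""
    return {
        n: [round(d / n, 3) for d in range(MAX_OFF + 1)]
        for n in TEST_SIZES
    }


table = fraction_table()
print("N    " + " ".join(f"{d:>5}" for d in range(MAX_OFF + 1)))
for n, row in table.items():
    print(f"{n:<4} " + " ".join(f"{value:5.3f}" for value in row))
```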


| Categorical Scale | Criteria |
| --- | --- |
| Very Tightly Related | 0.0%-<1.0% |
| Tightly Related | 1.0%-<3.0% |
| Related | 3.0%-<9.0% |
| Probably Related | 9.0%-<12.0% |
| Probably Not Related | 12.0%-<13.5% |
| Not Related | > 13.5% |
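
The scale can be applied mechanically, as in the sketch below. Two boundary choices are assumptions rather than part of the scale: a value of exactly 12.0% is placed in "Probably Not Related", and a value of exactly 13.5% is treated as "Not Related". The function does not apply the special case for zero differences at 12 or 25 markers noted under the previous table.

```python
# Upper bound (exclusive, in percent) for each category, in order.
SCALE = [
    (1.0, "Very Tightly Related"),
    (3.0, "Tightly Related"),
    (9.0, "Related"),
    (12.0, "Probably Related"),
    (13.5, "Probably Not Related"),
]


def categorize(markers_off, markers_compared):
    """Return the scale category for a pair of kits."""
    percent = 100 * markers_off / markers_compared
    for upper_bound, label in SCALE:
        if percent < upper_bound:
            return label
    # Assumption: exactly 13.5% and anything larger is "Not Related".
    return "Not Related"


print(categorize(4, 111))   # the 3.6% example used earlier
print(categorize(8, 67))
```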

Footnotes

  1. e.g., wikipedia:Jaccard Index, wikipedia:Sørensen–Dice coefficient
  2. FTDNA currently offers 12, 25, 37, 67, and 111 marker tests.
  3. Recently some organizations have begun to offer tests for very large numbers of STR markers (e.g., ~450). As one might expect, this can be very costly, and it remains to be seen whether clients will be willing to pay for such tests. If, however, such tests become commonplace, a considerable improvement in test precision would be expected.
  4. There's also a problem with exactly what "very tightly related" means. Does it signify that you are brothers or first cousins, or does it signal that you might be three or four generations away from the common ancestor? That, however, is not the problem being spoken to here.
  5. The time between when a person was born and when their father was born. Typically that's about 25 years, but this can vary considerably from lineage to lineage. It's one thing if you're in a continuous line of eldest children, each born when their father was 20, and quite another if you are in a continuous line of youngest children, each born when their father was 50.
  6. With some adjustments to make their terminology consistent with itself.