Dissimilarity Indexes for Interpreting YDNA results


This is one of a series of articles on Genealogical Methods, prepared in association with The Tapestry. See Index for a list of related articles.
__________________________

Related

Analysis:Descendants of David Cowan of Sevier County, TN, YDNA Comparison

Background

YDNA analysis provides an outstanding tool for sorting out possible relationships among people sharing the same surname. Comparisons of YDNA test results (such as those provided by Ancestry or Family Tree DNA), in conjunction with a knowledge of the test subjects' lineages, can yield considerable insight into whether or not two people share a common ancestor within a genealogically meaningful timeframe (GMT). Many YDNA projects involve hundreds, and in some cases thousands, of kits. Comparing results for such projects can be tedious. One approach is to use computerized methods to work with a data set containing kit results and generate a table of "similarities" between each and every kit in a project. A proprietary program, yALL, is being used in the Cowan YDNA project to do exactly that. The output of yALL is a matrix of similarity coefficients that allows the user to compare each kit in the project with any of the other kits in the project.

YDNA Differences

In comparing YDNA results for two people sharing the same surname, we look first to see on how many "markers" they differ from each other. The greater the number of differences, the more mutations have occurred in their lineages. The more mutations that have occurred, the longer it has been since they shared a "Most Recent Common Ancestor" (MRCA). Here's an arbitrary example:

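| Person* | Earliest known ancestor** | 393 | 390 | 19 | 391 | 385a | 385b | 426 | 383 | 439 | 389-1 | 392 | 389-2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Person A | John Smith, b. 1693, Gloucester England | 13 | 25 | 14 | 11 | 11 | 13 | 12 | 12 | 11 | 13 | 14 | 28 |
| Person B | Ephraim Smith, b. 1802, Rockbridge, VA | 13 | 25 | 13 | 11 | 11 | 13 | 12 | 12 | 11 | 13 | 14 | 30 |
| Step Differences | | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| Markers Different | | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |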

*Person taking the test. **The stated earliest known ancestor of the person taking the test. Note that these data do not represent real persons, but have been developed for the purpose of this explanation. If there's a "real" Ephraim Smith born in Rockbridge County in 1802, this is simply a coincidence.

The results of this particular (made-up) test show that the descendants of John and Ephraim differ by a total of 3 steps, on a total of 2 markers.
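The counting described above can be sketched in a few lines of Python. This is only an illustration using the made-up values from the example table; the function name `count_differences` is hypothetical and is not part of yALL or of any testing company's tools. It assumes both kits list the same markers in the same order.

```python
# Marker values copied from the made-up example (Person A and Person B).
PERSON_A = [13, 25, 14, 11, 11, 13, 12, 12, 11, 13, 14, 28]
PERSON_B = [13, 25, 13, 11, 11, 13, 12, 12, 11, 13, 14, 30]


def count_differences(kit_a, kit_b):
    """Return (total steps, markers changed) for two equal-length kits."""
    if len(kit_a) != len(kit_b):
        raise ValueError("kits must list the same markers in the same order")
    # Step difference on each marker is the absolute difference in repeat counts.
    step_diffs = [abs(a - b) for a, b in zip(kit_a, kit_b)]
    steps = sum(step_diffs)
    # A marker counts as "different" once, however many steps it differs by.
    markers_changed = sum(1 for d in step_diffs if d > 0)
    return steps, markers_changed


steps, markers_changed = count_differences(PERSON_A, PERSON_B)
print(steps, markers_changed)  # 3 2
```

Marker 389-2 contributes 2 steps but only 1 changed marker, which is why the two totals differ.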

While a 12-marker test is not ideal for YDNA analysis, some information can be obtained. The fact that they differ on 2 markers out of 12 (a 16.7% difference) tells us that they are probably not closely related. Had they shown a difference of 1 step on 1 marker (an 8.3% difference), we might not be so sure. The general rule of thumb here is that differences of less than 10% (or thereabouts) probably indicate relatively recent common ancestry. But when we have only 12 markers to go on, we have to keep in mind that this result could simply be a matter of chance. As the saying goes, 12 markers are good for "ruling folks out", but they are of limited value for "ruling them in". Adding more markers improves certainty, whether for ruling folks in or out. Usually the 37-marker test seems to give the most "bang for the buck", but some folks use the more intensive 67-marker test, or even higher. Since additional markers cost additional dollars, adding more markers is something that needs to be carefully weighed and evaluated. At some point adding more markers may not give you appreciably better results.

Indices

The above example is fairly straightforward, and we can easily reach a conclusion about the meaning of the data by simple inspection. Usually matters are more complex, and this is where the use of dissimilarity indices comes into play. In our example above we might add a few calculations to help reveal patterns. (They are probably not needed in this particular case, but the methodology is easier to see in this example than in a more complex problem.) In this example the results for John Smith are being used to see if Ephraim Smith is related to him through a relatively recent common ancestor.

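| Person* | Earliest known ancestor** | Markers Tested (N) | Steps (S) | Markers Changed (M) | Step Index S/N | Marker Index M/N | 393 | 390 | 19 | 391 | 385a | 385b | 426 | 383 | 439 | 389-1 | 392 | 389-2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Person A | John Smith, b. 1693, Gloucester England | 12 | 0 | 0 | 0 | 0 | 13 | 25 | 14 | 11 | 11 | 13 | 12 | 12 | 11 | 13 | 14 | 28 |
| Person B | Ephraim Smith, b. 1802, Rockbridge, VA | 12 | 3 | 2 | 25.0% | 16.7% | 13 | 25 | 13 | 11 | 11 | 13 | 12 | 12 | 11 | 13 | 14 | 30 |
| Step Differences | | | | | | | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| Markers Different | | | | | | | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |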

For practical purposes we don't really need to display the results for the individual markers. In fact, when dealing with a 67-marker test, displaying those individual marker results makes the resulting table very unwieldy, at least for purposes of display on WeRelate. A simplified form of the table is usually sufficient for most purposes.

| Person* | Earliest known ancestor** | Markers Tested (N) | Steps (S) | Markers Changed (M) | Step Index S/N | Marker Index M/N |
|---|---|---|---|---|---|---|
| Person A | John Smith, b. 1693, Gloucester England | 12 | 0 | 0 | 0 | 0 |
| Person B | Ephraim Smith, b. 1802, Rockbridge, VA | 12 | 3 | 2 | 25.0% | 16.7% |

What the above table tells us is that by either the Step Index or the Marker Index, John Smith has (as would be expected) 0% dissimilarity to himself. [1] Ephraim Smith, on the other hand, shows a Step Index dissimilarity of 25% and a Marker Index dissimilarity of 16.7% compared to John. That indicates that John and Ephraim are not related to each other within anything like a reasonable genealogical time period. This is the same result as we got before, but in this simplified table it's easier to focus on the important things, and let the underlying data about each person's marker values lie invisible behind the scenes.
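The two indices are simple ratios, S/N and M/N. A minimal sketch of the arithmetic follows, using the counts for Ephraim Smith from the table; the function name `dissimilarity_indices` is an illustrative choice, not something taken from yALL.

```python
def dissimilarity_indices(steps, markers_changed, markers_tested):
    """Return (step index S/N, marker index M/N) as fractions."""
    if markers_tested <= 0:
        raise ValueError("markers_tested must be positive")
    return steps / markers_tested, markers_changed / markers_tested


# Person B compared with Person A: S = 3, M = 2, N = 12.
step_index, marker_index = dissimilarity_indices(3, 2, 12)
print(f"{step_index:.1%} {marker_index:.1%}")  # 25.0% 16.7%
```

Comparing a kit with itself gives S = 0 and M = 0, so both indices are 0 whatever the value of N.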

Uneven Testing

One of the commonalities of YDNA testing is that some folks test only a few markers, and some folks go hog wild. There is, after all, a real cost implication to increasing the number of markers being tested. [2] A consequence of this is that you end up with a wide range of test results, at least in terms of the number of markers tested. Drawing a comparison between such uneven results can be difficult. For example, suppose Persons A and B took the 12-marker test with the results shown above, but then Person A upgraded to 67 markers, and a new test subject, Person C, had a 67-marker test. Their new test results might come out like this:

| Person* | Earliest known ancestor** | Markers Tested (N) | Steps (S) | Markers Changed (M) | Step Index S/N | Marker Index M/N |
|---|---|---|---|---|---|---|
| Person A | John Smith, b. 1693, Gloucester England | 67 | 0 | 0 | 0 | 0 |
| Person B | Ephraim Smith, b. 1802, Rockbridge, VA | 12 | 3 | 2 | 25.0% | 16.7% |
| Person C | Carlton Smith, b. 1800, Augusta, VA | 67 | 3 | 2 | 4.5% | 3.0% |


Because Person B has not upgraded to 67 markers, we are still only able to compare his 12 markers with the other test subjects. Measured against our original Person A, we get exactly the same results, because the underlying test data remain the same 12 markers. Person C, however, with 67 markers, has exactly the same number of mismatches as Person B (3 steps, 2 markers). Even though this person differs from Person A by exactly the same number of mutations, the dissimilarity indices drop dramatically, to under 5%. It's fairly clear that Person C does in fact share a relatively recent common ancestor with Person A.
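One way to handle kits of unequal size is to compare only the markers both kits contain, so that N is the number of shared markers. The sketch below assumes each kit is stored as a dictionary keyed by marker name; that layout, the function name `compare_on_shared_markers`, and the two placeholder markers `extra-1` and `extra-2` (standing in for Person A's upgrade) are all hypothetical. The 12 marker values come from the made-up example.

```python
NAMES = ["393", "390", "19", "391", "385a", "385b",
         "426", "383", "439", "389-1", "392", "389-2"]

person_a = dict(zip(NAMES, [13, 25, 14, 11, 11, 13, 12, 12, 11, 13, 14, 28]))
# Hypothetical placeholders for markers added by Person A's upgrade.
person_a.update({"extra-1": 17, "extra-2": 9})
person_b = dict(zip(NAMES, [13, 25, 13, 11, 11, 13, 12, 12, 11, 13, 14, 30]))


def compare_on_shared_markers(kit_ref, kit_other):
    """Return (N, S, M) counted only over markers present in both kits."""
    shared = [name for name in kit_ref if name in kit_other]
    if not shared:
        raise ValueError("the kits have no markers in common")
    diffs = [abs(kit_ref[name] - kit_other[name]) for name in shared]
    return len(shared), sum(diffs), sum(1 for d in diffs if d > 0)


n, s, m = compare_on_shared_markers(person_a, person_b)
print(n, s, m)  # 12 3 2

# The same S = 3 and M = 2 spread over N = 67 markers (Person C):
print(f"{3 / 67:.1%} {2 / 67:.1%}")  # 4.5% 3.0%
```

Person A's extra markers are ignored when comparing with Person B, so the result is the same as the original 12-marker comparison.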

Overall

When this process is repeated at a larger scale it soon becomes apparent that we can at least gauge which of perhaps a hundred or more Cowans taking the test are similar to a particular test result. Analysis:Descendants of David Cowan of Sevier County, TN, YDNA Comparison gives a moderately large example, comparing a descendant of David Cowan of Sevier County, TN, with 100+ test results for other persons in the Cowan YDNA project. Without repeating the full data set, there are several "takeaways" that can be pointed to, arising from inspection of this table:

1. As you read down the table under the Index columns (either Step or Marker indices), dissimilarity generally increases. The further away you get from the test result on the first line, the less likely it is that the persons being compared are closely related.

2. Generally speaking, an index value of more than about 10% probably tells you that the most recent common ancestor of the two people being compared lies deeper than is genealogically meaningful. That's not an absolute criterion, but the odds are against a close match once the indices get greater than about 10%.

3. A test involving 12 markers is a much less reliable guide than one involving 67 markers.

4. If the person you are interested in comparing is in the "grey area" ....
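The kind of table these takeaways refer to can be sketched as a sort on the indices plus a flag at the roughly 10% rule of thumb. This is an illustration only, using the three made-up Smith rows from earlier; the `Comparison` type, `rank_by_dissimilarity`, and the `THRESHOLD` constant are hypothetical names, and the sketch is not a description of how the Cowan project table was actually produced.

```python
from typing import NamedTuple

THRESHOLD = 0.10  # the "about 10%" rule of thumb, not an absolute criterion


class Comparison(NamedTuple):
    person: str
    markers_tested: int   # N
    steps: int            # S
    markers_changed: int  # M

    @property
    def step_index(self):
        return self.steps / self.markers_tested

    @property
    def marker_index(self):
        return self.markers_changed / self.markers_tested


def rank_by_dissimilarity(comparisons):
    """Sort from most similar to least similar to the reference kit."""
    return sorted(comparisons, key=lambda c: (c.step_index, c.marker_index))


rows = [
    Comparison("Person A", 67, 0, 0),
    Comparison("Person B", 12, 3, 2),
    Comparison("Person C", 67, 3, 2),
]

for c in rank_by_dissimilarity(rows):
    flag = "beyond 10%" if c.step_index > THRESHOLD else "within 10%"
    print(f"{c.person}: N={c.markers_tested}, "
          f"step {c.step_index:.1%}, marker {c.marker_index:.1%}, {flag}")
```

With these rows the order is Person A, Person C, Person B, and only Person B falls beyond the threshold. Printing N alongside each index keeps the third takeaway in view: an index computed from 12 markers deserves less weight than one computed from 67.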

Footnotes

  1. That will always be the case, no matter how many markers are tested, or who is being tested. This is sort of an artifact of the analysis, a necessary formality.
  2. If Bill Gates were interested in this kind of thing, he'd probably opt for the most complete testing possible; most of the rest of us can't be quite so extravagant.