Using a Similarity Index to Group YDNA Project Results



This is one of a series of articles on Genealogical Methods, prepared in association with The Tapestry. See Index for a list of related articles.

by William M. Willis©


This page is currently under development.


Data Source



A man's patrilineal ancestry, or male-line ancestry, can be traced using the DNA on his Y chromosome (Y-DNA) through Y-STR testing. This is useful because the Y chromosome passes almost unchanged from father to son: its non-recombining and sex-determining regions are not reshuffled from one generation to the next. A man's test results are compared with another man's results to estimate the time frame in which the two individuals shared a most recent common ancestor (MRCA). If their test results match perfectly, or nearly so, the two men are likely to be related within a genealogical time frame.

The key to this idea is that while the Y chromosome does not change by very much, it does change slightly, through mutation, from generation to generation. The changes are relatively uncommon (an estimated 0.02% chance of a change on each genetic marker in each generation), but they do occur, and they accumulate through time. When the YDNA of two men is compared, the larger the number of accumulated changes, the longer it has probably been since the two men shared a common male ancestor.
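The arithmetic behind this accumulation can be sketched in a few lines. This is only an illustration under simplifying assumptions: every marker mutates independently at the same rate (the 0.02% figure quoted above), and two men whose common ancestor lived g generations ago are separated by 2g father-to-son transmissions. The function names are hypothetical, and the first function counts mutation events, which can slightly overstate the observed differences if a marker changes more than once.

```python
def expected_changes(markers: int, generations: int, rate: float = 0.0002) -> float:
    """Expected number of mutation events separating two men.

    Assumes a common ancestor `generations` back, so the two lines of
    descent together contain 2 * generations transmissions, and assumes
    the same per-marker, per-generation mutation rate for every marker.
    """
    return 2 * generations * rate * markers


def prob_exact_match(markers: int, generations: int, rate: float = 0.0002) -> float:
    """Probability that no marker changed in any of the transmissions."""
    return (1 - rate) ** (2 * generations * markers)


# Compare a 37-marker test at several hypothetical depths of common ancestry.
for g in (5, 20, 100):
    print(g, round(expected_changes(37, g), 3), round(prob_exact_match(37, g), 3))
```

The expected number of changes grows in proportion to both the number of generations and the number of markers tested, which is why the same count of differences means something different at 12 markers than at 111.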

Family Tree DNA (FTDNA) provides a YDNA testing service for clients interested in using YDNA results to further their understanding of their family genealogy and history. For each surname project, FTDNA provides a "project page" that displays the test results of the kits belonging to that project. Several display formats are available to the user, but the following (with a hypothetical example) is typical:

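Kit Number | Paternal Ancestor Name/data       | Haplogroup | DYS1 | DYS2 | DYS3 | DYS4 | DYS5 | DYS6 | DYS7 | DYS8 | DYS9 | DYS10 | DYS11 | DYS12
Descendants of David Smith of Derry, Ireland
H99990     | David Smith 1802-1878             | R1b1       | 13   | 25   | 14   | 11   | 11   | 13   | 12   | 12   | 11   | 13    | 14    | 29
H19922     | Peter Smith 1824-1898             | R1b1       | 13   | 25   | 14   | 11   | 11   | 13   | 12   | 12   | 11   | 13    | 14    | 29
H19922     | Paul Smith 1828-1888              | R1b1       | 13   | 25   | 14   | 11   | 11   | 13   | 12   | 12   | 11   | 13    | 14    | 29
Descendants of Phillip Smith of Cornwall, England
H19922     | John Smith b. 1754 d. Boston Mass | I2         | 14   | 22   | 14   | 10   | 13   | 14   | 11   | 14   | 11   | 12    | 11    | 28
H12032     | John Smith d1815 Ohio             | I2         | 14   | 22   | 14   | 10   | 13   | 14   | 11   | 14   | 11   | 12    | 11    | 28
H7887      | Paul Smith b1852 Iowa             | I2         | 14   | 22   | 14   | 10   | 13   | 14   | 11   | 14   | 11   | 12    | 11    | 28
H25431     | Benjamin Smith                    | I2         | 14   | 22   | 14   | 10   | 13   | 14   | 11   | 14   | 11   | 12    | 11    | 28
H212121    | John Smith                        | I2         | 14   | 22   | 14   | 10   | 13   | 14   | 11   | 14   | 11   | 12    | 11    | 28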

In some versions of the table the data is presented "as is"; in others, minimum, modal, and maximum values for a group are presented. The modal value for a group of kits is sometimes referred to as the group's "signature". In some versions of the tables, marker values are color coded to indicate whether the marker value for any given kit differs from the group mode. This helps highlight the differences between each kit in a group and the group's YDNA signature. [See: YDNA. Examples of YDNA Data Tables for further explanation].
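A group signature of this kind can be computed by taking the most common value at each marker position. The sketch below is illustrative only: the kit names and marker values are hypothetical, and the functions are not part of any FTDNA tool. It also shows how the markers that would be color coded (those differing from the mode) can be found.

```python
from statistics import mode

# Hypothetical 12-marker kits for one group; values are illustrative only.
group = {
    "kit_a": [13, 25, 14, 11, 11, 13, 12, 12, 11, 13, 14, 29],
    "kit_b": [13, 25, 14, 11, 11, 13, 12, 12, 11, 13, 14, 29],
    "kit_c": [13, 24, 14, 11, 11, 13, 12, 12, 11, 13, 14, 30],
}


def modal_signature(kits):
    """Most common value at each marker position across the group's kits.

    On a tie, statistics.mode (Python 3.8+) returns the first value seen.
    """
    return [mode(column) for column in zip(*kits.values())]


def off_modal(kit, signature):
    """0-based positions of the markers where a kit differs from the signature."""
    return [i for i, (value, modal) in enumerate(zip(kit, signature)) if value != modal]


signature = modal_signature(group)
for name, values in group.items():
    print(name, off_modal(values, signature))
```

Here the third kit differs from the signature at two positions, so those two cells are the ones a color-coded table would highlight.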

The total number of differences between two kits is then used to evaluate how closely the YDNA of one kit owner matches that of another. In general, the fewer the differences between any two kits, the closer the genealogical relationship. While there are a number of different approaches to grouping kits, in most YDNA projects a group is defined largely by a maximum number of mismatches among its kits, which are taken to share a relatively recent common ancestor (RCA). Kits with fewer than "X" mismatches out of "N" markers are considered to share an RCA and are grouped together. The value of X used as the criterion for an RCA depends in part on the number of markers (N) tested. A single mismatch between 12-marker kits is usually sufficient to rule out an RCA, while a single mismatch between 111-marker kits is commonly accepted as indicating a relatively recent common ancestor.
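The "fewer than X mismatches out of N markers" rule can be sketched as follows. This is a minimal illustration, not FTDNA's own procedure: it uses a plain count of differing markers, the function names are hypothetical, and the cutoff X is left as a parameter because each project chooses its own value for each N.

```python
def mismatches(a, b):
    """Count the marker positions at which two kits differ.

    Both kits must list the same N markers in the same order.
    """
    if len(a) != len(b):
        raise ValueError("kits must be tested on the same markers")
    return sum(x != y for x, y in zip(a, b))


def shares_recent_ancestor(a, b, x):
    """Apply the grouping rule: fewer than x mismatches out of N markers."""
    return mismatches(a, b) < x


# Two hypothetical 12-marker kits.
kit_1 = [13, 25, 14, 11, 11, 13, 12, 12, 11, 13, 14, 29]
kit_2 = [13, 25, 14, 11, 11, 13, 12, 12, 11, 13, 14, 30]

# With x = 1, a single mismatch at 12 markers rules out a recent ancestor.
print(mismatches(kit_1, kit_2), shares_recent_ancestor(kit_1, kit_2, x=1))
```

Keeping X as an argument makes the dependence on N explicit: the same pair of kits can pass at one cutoff and fail at another.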


Data for a surname YDNA project is presented on FTDNA Surname Project sites in tabular form as described above. Projects range in size from small (a half dozen or so kits) to very large (1500 or more kits, such as the Clan Frasier Project). Marker data for each kit is entered horizontally in the FTDNA tables. The table must be wide enough to display results for the maximum number of markers currently tested (111).[2] In practice, it is not possible on many computer monitors to view such a large table, or even all of the markers for a few kits, in a single display. Examining the table typically requires the user to scroll to the right to reveal the values for each marker tested.

There's a lot of data in even the simplest of tables, and it's something of a challenge to evaluate the content of these tables visually and to calculate the number of "mismatches/markers tested" for each pair of kits. When a project starts accumulating a significant number of kits, the problem can get out of hand very quickly. In very large projects (hundreds of kits), a pairwise comparison of each and every kit becomes a real challenge. One result is that projects with large numbers of kits sometimes resort to simply grouping their kits by haplogroup. That approach at least reduces the task to a somewhat manageable size. Even so, it is not uncommon for a "group" matched by haplogroup to contain no kits that share a relatively recent ancestor. This is commonly the case with especially large projects, in which there may be hundreds of kits belonging to the same haplogroup. In such instances it can be challenging for the group administrator to identify legitimate subgroups within the haplogroup.
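The scale of the problem follows from simple counting: a project with n kits has n(n-1)/2 distinct pairs. The sketch below shows that count and a straightforward way to tabulate every pairwise mismatch. The function names are hypothetical, and the comparison is positional over the markers both kits have, so kits are assumed to list their markers in the same order.

```python
from itertools import combinations
from math import comb


def pair_count(n_kits):
    """Number of distinct kit pairs in a project of n_kits kits."""
    return comb(n_kits, 2)


def pairwise_mismatches(kits):
    """Return {(name_a, name_b): mismatch count} for every pair of kits.

    zip() stops at the shorter kit, so only markers present in both
    kits (by position) are compared.
    """
    return {
        (a, b): sum(x != y for x, y in zip(kits[a], kits[b]))
        for a, b in combinations(sorted(kits), 2)
    }


# A half dozen kits is easy to check by eye; 1500 kits is not.
for n in (6, 100, 1500):
    print(n, pair_count(n))
```

A program has no trouble with a table of this size, but a person scrolling through a wide web page clearly does, which is the difficulty described above.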




  1. The number of markers (N) tested for a kit varies from user to user. Typically, 12, 37, 67, or 111 markers are tested. Ideally, the more markers tested the better the results, but since the cost of the test depends on the number of markers tested, the choice of N is often limited by the kit owner's budget. In any case, an exact match between two kits tested at, say, 67 markers (i.e., a 67/67 result) is usually taken to mean that the two kit owners share a relatively recent common ancestor. A less exact match suggests that their common ancestor lies somewhat further back in time. An exact match at 12 markers may or may not indicate a relatively recent common ancestor. FTDNA guidance on interpreting "Genetic Distance" can be found at FAQ 919.
  2. Since some of the markers are multicopy, the table is currently designed to accommodate a larger number of markers. Currently, some kits that are listed as having been tested for 111 markers actually show 113 or more markers, the two "extra markers" being the result of a relatively recent addition to the number of copies potentially present in one of the multicopy markers.
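
Both notes imply that two kits cannot always be compared column by column: they may have been tested at different values of N, and extra copies of a multicopy marker can shift the columns. One way around this, sketched below under the assumption that each kit is stored as a mapping from marker name to value, is to compare only the markers the two kits share and to report the result in the "mismatches/markers tested" form. The marker names and the function names are hypothetical.

```python
def compare_shared(a, b):
    """Return (mismatches, markers compared) over the markers both kits have."""
    shared = sorted(set(a) & set(b))
    diffs = sum(a[marker] != b[marker] for marker in shared)
    return diffs, len(shared)


def as_ratio(a, b):
    """Format a comparison as 'mismatches/markers tested'."""
    diffs, tested = compare_shared(a, b)
    return f"{diffs}/{tested}"


# Hypothetical kits: the second was tested on fewer markers than the first.
kit_long = {"DYS1": 13, "DYS2": 25, "DYS3": 14, "DYS4": 11}
kit_short = {"DYS1": 13, "DYS2": 24, "DYS3": 14}

print(as_ratio(kit_long, kit_short))
```

Reporting the denominator alongside the count keeps the comparison honest, since the same number of mismatches carries a different meaning at different values of N.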