Purpose of this project
The purpose of this project is to create a comprehensive database of name variants that should be searched whenever a particular name is searched.
Currently, providers of genealogical records rely on algorithms like Soundex, or on home-grown solutions, to return records whose names are spelled similarly to the searched-for name and are likely matches. A large part of genealogical expertise involves learning alternate spellings for the surnames in your tree. Why not share this knowledge with others? When you add your alternate spellings to the database, searches on WeRelate, and on any other website that uses the database, will include them automatically.
This database is being made freely available to any website that wants to use it. The goal is to create a free resource that all genealogy websites use, so that genealogy searches are consistent across the Web.
How you can help
It's now up to you. Computers are smart, but not that smart. We need your help to review the variants in the database: remove any that don't make sense, and add the ones you know are missing.
When you review the name variants, you'll see a list of names with checkboxes. The checked names are included in searches. (In addition, rare names with the same Soundex code are also included.) Unchecked names aren't included. Please review the checked names, uncheck those that shouldn't be included in searches, and add any variants that are missing. To add a name that already appears in the list, check the box to its left; otherwise, enter it in the text box at the bottom of the screen.
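The search rule described above (checked variants, plus rare names that share a Soundex code) can be sketched roughly as follows. This is only an illustration: the function names, the dictionary layout, and the sample names are assumptions made for the example and are not taken from the project's code. The Soundex routine follows the standard American Soundex rules.

```python
# Letter-to-digit table for standard American Soundex.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the four-character American Soundex code for a name."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code counts again
    return (result + "000")[:4]


def expand_search(name, checked_variants, rare_names):
    """Hypothetical helper: names to search for when `name` is searched.

    checked_variants maps a lowercase name to its set of checked variants;
    rare_names is a set of lowercase names too rare to have a variant list.
    """
    key = name.lower()
    names = {key}
    names.update(checked_variants.get(key, ()))
    code = soundex(key)
    names.update(r for r in rare_names if soundex(r) == code)
    return names


# Made-up sample data, for illustration only.
checked = {"smith": {"smyth", "smythe"}}
rare = {"smitt", "jonez"}
print(sorted(expand_search("Smith", checked, rare)))
```

In this sketch, unchecking a variant simply means removing it from the set stored under the searched name.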
The database may not help you that much, since you already know the alternative spellings to search when researching your tree, but it will definitely help those who come after you. Contributing to it is a way to leave an enduring legacy that will benefit genealogists of future generations. Won't you please help?
In addition, we need people to review the changes that others have made to the database, to make sure that we have multiple pairs of eyes reviewing the names that are being added and removed.
How it works
Creating a comprehensive index is a big task, but we have a head start. Ancestry.com and WeRelate worked together to create an advanced algorithm for determining the level of similarity between two names, and that algorithm produced the starting point for this database. It was used to find similarly-spelled names for the 200,000 most-frequent surnames and 70,000 most-frequent given names in Ancestry's database. This covers every name that appears more than once in every 5,000,000 names in Ancestry's database. On average, 26 variants were found for each surname, and 32 variants were found for each given name. Rare names, those appearing less often than once in every 5,000,000 names, are handled by the Soundex algorithm.
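To make the frequency cutoff concrete, here is a minimal sketch of how names could be split into those that get a variant list and those left to Soundex. The function names and the counts are hypothetical; only the one-in-5,000,000 cutoff comes from the description above.

```python
# A name is "frequent" if it appears more than once in every 5,000,000 names.
THRESHOLD_DENOMINATOR = 5_000_000


def is_frequent(count, total):
    """True if count / total > 1 / 5,000,000 (integer math avoids rounding)."""
    return count * THRESHOLD_DENOMINATOR > total


def split_by_frequency(counts):
    """Split a {name: occurrences} mapping into (frequent, rare) name sets."""
    total = sum(counts.values())
    frequent = {name for name, c in counts.items() if is_frequent(c, total)}
    rare = set(counts) - frequent
    return frequent, rare


# Made-up counts that add up to 10,000,000 occurrences.
counts = {"smith": 9_999_996, "zyskowski": 3, "qwertz": 1}
frequent, rare = split_by_frequency(counts)
print(sorted(frequent), sorted(rare))
```

With 10,000,000 occurrences in total, a name needs at least three occurrences to pass the cutoff, because exactly two is once in every 5,000,000 and not more.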
In addition, BehindTheName.com has donated their excellent list of given name variants. If you haven't had a chance to check out BehindTheName.com yet, you really should. To the BehindTheName and Ancestry lists of names, we added variants from the WeRelate community, The New American Dictionary of Baby Names, and A Dictionary of Surnames.
As a result, this database currently reduces the number of names missed by Soundex and by other approaches used by large genealogy organizations by over 25 percent, without introducing as many dissimilar names as Soundex can. This result is based on a sample of 100,000 pairs of names identified by Ancestry.
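One way to read a figure like "reduces the number of names missed by over 25 percent" is as a relative reduction in missed pairs over a labeled sample. The sketch below shows that arithmetic under this assumption. The pairs and both matchers are made up for the example; they are not the Ancestry sample or the project's measurement code.

```python
def count_missed(true_pairs, matches):
    """Count known-matching pairs that a matcher fails to connect."""
    return sum(1 for a, b in true_pairs if not matches(a, b))


def relative_reduction(missed_baseline, missed_new):
    """Fraction by which the new approach cuts the baseline's misses."""
    if missed_baseline <= 0:
        raise ValueError("baseline must miss at least one pair")
    return (missed_baseline - missed_new) / missed_baseline


# Toy sample of pairs that a human judged to be the same name.
true_pairs = [("smith", "smyth"), ("smith", "schmidt"),
              ("meyer", "maier"), ("meyer", "mayr")]

# Toy matchers: each one "finds" only the pairs listed in its set.
baseline_hits = {("smith", "smyth")}
improved_hits = {("smith", "smyth"), ("meyer", "maier")}

missed_baseline = count_missed(true_pairs, lambda a, b: (a, b) in baseline_hits)
missed_improved = count_missed(true_pairs, lambda a, b: (a, b) in improved_hits)
print(missed_baseline, missed_improved,
      relative_reduction(missed_baseline, missed_improved))
```

Here the baseline misses three pairs and the improved matcher misses two, so the relative reduction is one third.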
If you want to use this database for your own website
Feel free to download the database, along with the code to access it, from GitHub. In addition to the database of name variants, the code includes a function that returns a similarity score between two names, which has been found useful in duplicate detection.
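To illustrate how a similarity score can support duplicate detection, here is a small sketch. It does not use the project's scoring function: as a stand-in it uses `difflib.SequenceMatcher` from the Python standard library, and the function names and the 0.75 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in score between 0.0 and 1.0 (not the project's algorithm)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def likely_duplicates(names, threshold=0.75):
    """Return pairs of names whose stand-in score meets the threshold."""
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if similarity(a, b) >= threshold:
                pairs.append((a, b))
    return pairs


print(likely_duplicates(["Smith", "Smyth", "Jones"]))
```

A real deployment would swap `similarity` for the scoring function shipped with the database and tune the threshold against records already known to be duplicates.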