WeRelate:Functional Specification for Data Consistency Verification


Genealogy Data Correctness

Genealogy systems cannot do much to fundamentally validate the data they manage. By that standard, about the only thing that can be checked directly is the syntactic correctness of a date string.

Genealogy systems typically provide a lot more than that, however, because they do not try to find explicit individual errors. Instead, data is checked for consistency, to see whether some combination of two or more facts is mutually exclusive: for example, a date of death that precedes a date of birth. Knowing which fact is correct is impossible, but it is easy to determine that something is wrong.

This document is an attempt to keep track of possible consistency tests that could be applied within the framework of WeRelate, as well as ideas for how to make the results available for display.

Ordinary Consistency Checks

For a single person page

  • Birth must come before any other person-specific event
  • Baptism and/or christening must occur after birth
  • A will cannot be made after death, nor before a basic age of majority (16 or so)
  • Multiple birth, death, or burial dates are automatically suspect
  • Burial may not precede events other than probate or obituaries
  • Probate events may not precede death
  • Length of life may not exceed a reasonably long life span (100 years or so)
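
The person-level rules above could be expressed as a single function that takes a person's events and returns a list of warnings. The following is only a minimal sketch: the `Event` class, the event kind labels, and the two numeric limits are assumptions made for illustration and are not part of WeRelate.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List

# Assumed limits; the list above only says "16 or so" and "100 or so".
MIN_WILL_AGE_YEARS = 16
MAX_LIFE_SPAN_YEARS = 100


@dataclass
class Event:
    kind: str   # hypothetical labels: "birth", "baptism", "will", "death", ...
    when: date


def years_between(start: date, end: date) -> float:
    return (end - start).days / 365.25


def check_person(events: List[Event]) -> List[str]:
    """Return human-readable consistency warnings for one person."""
    warnings: List[str] = []
    dates: Dict[str, List[date]] = {}
    for event in events:
        dates.setdefault(event.kind, []).append(event.when)

    # Multiple birth, death, or burial dates are automatically suspect.
    for kind in ("birth", "death", "burial"):
        if len(dates.get(kind, [])) > 1:
            warnings.append(f"multiple {kind} dates")

    # If a kind has several dates, the earliest is used for the other checks.
    birth = min(dates["birth"]) if "birth" in dates else None
    death = min(dates["death"]) if "death" in dates else None
    burial = min(dates["burial"]) if "burial" in dates else None

    for event in events:
        if birth is not None and event.kind != "birth" and event.when < birth:
            warnings.append(f"{event.kind} before birth")
        if event.kind == "will":
            if death is not None and event.when > death:
                warnings.append("will after death")
            if birth is not None:
                age = years_between(birth, event.when)
                if 0 <= age < MIN_WILL_AGE_YEARS:
                    warnings.append(f"will made before age {MIN_WILL_AGE_YEARS}")
        if event.kind == "probate" and death is not None and event.when < death:
            warnings.append("probate before death")
        # Burial may precede only probate or obituary events.
        if (burial is not None and event.when > burial
                and event.kind not in ("burial", "probate", "obituary")):
            warnings.append(f"{event.kind} after burial")

    if birth is not None and death is not None:
        if years_between(birth, death) > MAX_LIFE_SPAN_YEARS:
            warnings.append(f"life span over {MAX_LIFE_SPAN_YEARS} years")
    return warnings
```

Real records often carry partial dates (a year only, or "about 1750"), which this sketch does not attempt to handle.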

For a family page with associated parent and child pages

  • Parents are of appropriate gender for their assignment as husband or wife
  • Children are not born before or after the mother's child-bearing age
  • Children are not born before a father could be expected to have a child
  • Birth dates of children not far enough apart
  • Parents are not too far apart in age
  • If the mother's surname is known, but is not supported by parents of that name, then it should differ from the father's.
  • Places not recognized in the place database
  • Dates not of a recognized form
  • Unrealistically long-lived people
  • Date of birth after date of death
  • Date of birth or date of death unreasonable given a known date of marriage
  • Child surnames are consistent with the father's surname if he is known, or with the mother's if he is not.
  • If a person is a spouse in more than one family, do the child-bearing periods of any of the families overlap?
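
A few of the family-level checks above (parental age at each birth, spacing between siblings, and the age gap between parents) might look like the sketch below. All of the numeric limits and the function name are assumptions chosen for illustration; the list above does not fix any values.

```python
from datetime import date
from typing import List, Optional

# Hypothetical limits, for illustration only.
MOTHER_MIN_AGE = 15
MOTHER_MAX_AGE = 50
FATHER_MIN_AGE = 15
MIN_SIBLING_GAP_DAYS = 270
MAX_PARENT_AGE_GAP_YEARS = 30


def age_at(birth: date, when: date) -> float:
    return (when - birth).days / 365.25


def check_family(mother_birth: Optional[date],
                 father_birth: Optional[date],
                 child_births: List[date]) -> List[str]:
    """Return consistency problems found among a family's birth dates."""
    problems: List[str] = []
    for child_birth in child_births:
        if mother_birth is not None:
            age = age_at(mother_birth, child_birth)
            if age < MOTHER_MIN_AGE or age > MOTHER_MAX_AGE:
                problems.append(
                    f"child born {child_birth.year} outside mother's child-bearing age")
        if father_birth is not None:
            if age_at(father_birth, child_birth) < FATHER_MIN_AGE:
                problems.append(
                    f"child born {child_birth.year} before father could have a child")

    # Compare each child with the next one in birth order.
    ordered = sorted(child_births)
    for earlier, later in zip(ordered, ordered[1:]):
        gap_days = (later - earlier).days
        # Assumption: births at most one day apart are treated as twins.
        if 1 < gap_days < MIN_SIBLING_GAP_DAYS:
            problems.append(
                f"children born only {gap_days} days apart ({earlier} and {later})")

    if mother_birth is not None and father_birth is not None:
        gap_years = abs(age_at(mother_birth, father_birth))
        if gap_years > MAX_PARENT_AGE_GAP_YEARS:
            problems.append("parents too far apart in age")
    return problems
```

The gender, surname, and place checks from the list need page data other than dates and are left out of this sketch.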

Advanced Considerations

Separate Warning and Error Threshold Levels

A simple example of this situation is the case of a birth to a mother aged 48. It is not impossible, but it is unlikely enough to warrant a warning. A birth to a mother aged 70, on the other hand, is plainly an error.
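
One way to picture the two levels is a pair of limits per check, with the result graded rather than simply true or false. This is a sketch only; the `Severity` names and the limits of 45 and 55 are assumptions, chosen so that the two examples above fall on either side.

```python
from enum import Enum


class Severity(Enum):
    OK = 0
    WARNING = 1
    ERROR = 2


# Hypothetical limits for the mother's age at a child's birth.
MOTHER_AGE_WARN_ABOVE = 45
MOTHER_AGE_ERROR_ABOVE = 55


def classify_upper(value: float, warn_above: float, error_above: float) -> Severity:
    """Grade a value against an upper warning limit and an upper error limit."""
    if value > error_above:
        return Severity.ERROR
    if value > warn_above:
        return Severity.WARNING
    return Severity.OK
```

With these assumed limits, an age of 48 grades as a warning and an age of 70 as an error.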

Threshold Levels Adjusted for Era

Marriage and birth to a mother aged 16 would have been more typical in the 1600s than in the 1900s. It may be useful if warning and error threshold levels allow for a refinement based on the century in question.
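
A per-century refinement could be as simple as a lookup table with a default row. The sketch below assumes lower limits on the mother's age; every number in the table is invented for illustration and would need to be set by people who know the period.

```python
from typing import Dict, NamedTuple


class LowerLimits(NamedTuple):
    warn_below: int
    error_below: int


# Hypothetical values; used when no century-specific row exists.
DEFAULT_LIMITS = LowerLimits(warn_below=18, error_below=12)

# Hypothetical per-century overrides, keyed by the first year of the century.
LIMITS_BY_CENTURY: Dict[int, LowerLimits] = {
    1600: LowerLimits(warn_below=16, error_below=12),
    1700: LowerLimits(warn_below=16, error_below=12),
}


def limits_for_year(year: int) -> LowerLimits:
    century_start = (year // 100) * 100
    return LIMITS_BY_CENTURY.get(century_start, DEFAULT_LIMITS)


def classify_mother_age(age: int, event_year: int) -> str:
    """Grade the mother's age at a birth using the limits for that era."""
    limits = limits_for_year(event_year)
    if age < limits.error_below:
        return "error"
    if age < limits.warn_below:
        return "warning"
    return "ok"
```

Under these assumed values, a mother aged 16 passes silently for a birth in 1650 but draws a warning for a birth in 1950.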

Detection of Cycles

A child cannot be their own parent or grandparent, but confusion in early records and incomplete information in the hands of particular researchers can easily create this type of defect. In principle, such errors could be searched out over the entire space of WeRelate genealogy, but it would be more realistic, and probably just as useful, if cycle detection were implemented only across the members of a given family tree.
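
Restricted to one tree, cycle detection is an ordinary depth-first search over child-to-parent links. The sketch below assumes the tree has already been reduced to a dictionary that maps each person to a list of their parents; that representation and the function name are illustrative, not something WeRelate provides.

```python
from typing import Dict, List, Optional

UNVISITED, IN_PROGRESS, DONE = 0, 1, 2


def find_ancestry_cycle(parents: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a list of people (first == last), or None."""
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(person: str) -> Optional[List[str]]:
        state[person] = IN_PROGRESS
        path.append(person)
        for parent in parents.get(person, []):
            parent_state = state.get(parent, UNVISITED)
            if parent_state == IN_PROGRESS:
                # The parent is already on the current line of ancestry.
                return path[path.index(parent):] + [parent]
            if parent_state == UNVISITED:
                cycle = visit(parent)
                if cycle is not None:
                    return cycle
        path.pop()
        state[person] = DONE
        return None

    for person in parents:
        if state.get(person, UNVISITED) == UNVISITED:
            cycle = visit(person)
            if cycle is not None:
                return cycle
    return None
```

The recursion depth equals the number of generations followed, so a very deep tree would call for an iterative version.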

Orphaned Tree Fragments

A WeRelate tree is a bit of a misnomer. It is really only a page that references a collection of other pages of various types. Under normal circumstances, the person and family pages of such a tree will represent a completely connected tree graph. Sometimes, however, individual person or family pages can become detached from the tree at large, even though their names are retained in the WeRelate "tree". Detecting such discrepancies can be useful.
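
Detached pages can be found by computing connected components over the pages a tree lists and reporting everything outside the largest component. The sketch below assumes the links between person and family pages are available as a dictionary; the page titles, the data layout, and the choice of "largest component is the main tree" are all assumptions.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def find_fragments(tree_pages: Iterable[str],
                   links: Dict[str, Iterable[str]]) -> List[Set[str]]:
    """Return every connected group of pages except the largest one."""
    pages = set(tree_pages)
    neighbours: Dict[str, Set[str]] = {page: set() for page in pages}
    for source, targets in links.items():
        for target in targets:
            # Links leading outside the tree are ignored; links are two-way.
            if source in pages and target in pages:
                neighbours[source].add(target)
                neighbours[target].add(source)

    seen: Set[str] = set()
    components: List[Set[str]] = []
    for start in sorted(pages):
        if start in seen:
            continue
        component = {start}
        seen.add(start)
        queue = deque([start])
        while queue:                      # breadth-first walk of one component
            current = queue.popleft()
            for neighbour in neighbours[current]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)

    components.sort(key=len, reverse=True)
    return components[1:]
```

A tree with a single connected component yields an empty list.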

Presentation of Warning Information

In thinking about these areas, I realized that all represent warning conditions, and that a common warning detection and reporting strategy should serve them. Some ideas for this follow.

Warnings Page Companion for each Person and Family Page

Create a companion "warnings" page for every person or family page. When a person or family page is seen to be newer than its associated warnings page, the warning/duplicate logic could be triggered. Since it is common for a page to be edited several times before being left alone, and associated pages that would affect a warning might be similarly in transition, actual warnings page regeneration would need to be delayed for a period after the last edit of a person or family page. When the warnings page is regenerated, the logic would check whether the warning content had changed. If so, a new version would be checked in. If not, the old version would be quietly updated to be newer than its associated person/family page. In the event of a warnings page change, the system would also add a trivial entry to the associated "talk" page. In so doing, the user community for the page would receive notice that there are new warnings for a particular page.
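
The regeneration rule described here comes down to two decisions: whether to regenerate at all, and what to do with the result. The following sketch assumes a one-hour quiet period and looks only at the page itself, not at associated pages that might also be in transition; the function names and return labels are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Optional

# Assumed delay after the last edit; the text does not give a value.
QUIET_PERIOD = timedelta(hours=1)


def needs_regeneration(page_edited: datetime,
                       warnings_edited: Optional[datetime],
                       now: datetime,
                       quiet_period: timedelta = QUIET_PERIOD) -> bool:
    """True when the warnings page is stale and the page has stopped changing."""
    is_stale = warnings_edited is None or page_edited > warnings_edited
    has_settled = now - page_edited >= quiet_period
    return is_stale and has_settled


def decide_update(old_warnings: List[str], new_warnings: List[str]) -> str:
    """Choose what to do with a freshly generated warning list."""
    if new_warnings != old_warnings:
        # Check in a new version and add a trivial entry to the talk page.
        return "new_version_and_talk_notice"
    # Content unchanged: only refresh the timestamp, without notifying anyone.
    return "touch"
```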

Warnings Report for a Tree

While it may be useful to have a copy of the warnings for any given person or family page that a user is working on, a user will probably want to see warnings at a higher level. A report of all the warnings associated with a particular tree seems the most likely candidate. Working from such a list, the user should be able to gradually and systematically improve the quality of their tree's data until very few warnings remain. Expecting users to manually walk the pages of their tree to see whether individual pages have warnings (or have acquired them, due to other uploads and research) is utterly unrealistic.
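
If each page already has a stored list of warnings, a tree-level report is little more than a merge of those lists. A minimal sketch, assuming the warnings are available as a dictionary keyed by page title (a hypothetical layout):

```python
from typing import Dict, Iterable, List, Tuple


def tree_report(tree_pages: Iterable[str],
                warnings_by_page: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (page, warning) pairs for every page in the tree, sorted by page."""
    report: List[Tuple[str, str]] = []
    for page in sorted(tree_pages):
        for warning in warnings_by_page.get(page, []):
            report.append((page, warning))
    return report
```

Pages without warnings contribute nothing, so the report shrinks as the user works through it.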