WeRelate:Suggestions/Tool for checking consistency of data

Status

Two features have been added:

  • A Data Quality Issues list identifying inconsistent (and some incomplete) data. The list covers the entire database and can be filtered to a user's watchlist or MyTree (plus a couple of other filters). The list is updated daily and items fixed since the last update are marked as such.
  • Automatically identified data inconsistencies are displayed on Person and Family pages. The issues are determined in real time.

In both cases, issues are classified as errors (to fix if possible) and anomalies (to check and either fix or confirm that the data is correct). The latter type can be marked as verified, in which case they will no longer be reported as issues (except on request).
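The error/anomaly split and the "verified" flag described above can be sketched roughly as follows. This is only an illustration: the `Issue` structure, the rule names, and the 110-year threshold are hypothetical and are not taken from the WeRelate code.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

ERROR = "error"      # clearly impossible data: fix if possible
ANOMALY = "anomaly"  # implausible data: check, then fix or mark as verified


@dataclass(frozen=True)
class Issue:
    page: str
    rule: str
    severity: str


def check_person(page: str, birth_year: Optional[int],
                 death_year: Optional[int]) -> List[Issue]:
    """Apply two illustrative rules (the 110-year threshold is an assumption)."""
    issues = []
    if birth_year is not None and death_year is not None:
        if death_year < birth_year:
            issues.append(Issue(page, "death_before_birth", ERROR))
        elif death_year - birth_year > 110:
            issues.append(Issue(page, "very_long_life", ANOMALY))
    return issues


def report(issues: List[Issue], verified: Set[Tuple[str, str]],
           include_verified: bool = False) -> List[Issue]:
    """Hide anomalies marked as verified, unless they are explicitly requested."""
    return [
        i for i in issues
        if include_verified
        or i.severity == ERROR
        or (i.page, i.rule) not in verified
    ]
```

In this sketch only anomalies can be suppressed by verification; errors are always reported, which mirrors the distinction drawn above.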

The original request to report on data inconsistencies garnered additional suggestions for data quality reporting (such as missing sources, broken links, and duplicate persons). While these are all valid suggestions, they each need their own suggestion page for proper prioritization. Therefore, I am closing this suggestion as completed.--DataAnalyst 22:53, 21 May 2023 (UTC)

Original request and discussion

During gedcom import, the data are checked for errors (warnings), places (already documented in the database), sources, and families to be merged.

It would be very helpful to be able to run such a data-consistency test on trees independently of the gedcom upload. Not the family-merging part, of course, but the possible warnings, missing places, and sources (although I am not sure what to check for sources). All the red places would be documented, as would the possibly incorrect data arising from manual additions.

The test will produce a printable warnings-list and a list of 'floating' places.--Klaas (Ekjansen) 07:27, 28 May 2011 (EDT)

I like this idea. If we could check the data that already exists on WR for errors, in much the same way we check gedcoms, we would be able to find impossible families like this: Family:Richard Markley and Catherine Unknown (1). --Jennifer (JBS66) 11:19, 3 November 2011 (EDT)
Are you thinking of a global list of all data inconsistencies, giving people a list of inconsistencies in their own trees, or both?--Dallan 23:45, 4 November 2011 (EDT)
Klaas and I conversed about this today. A global list for these types of errors would be overwhelming to deal with. What about something similar to the duplicates list? Each user could have a page that lists the errors in their trees (the same type of error checking that gedcoms go through). We could also have a global list - but that could just list the username, number of errors, and a link to their list. One potential problem may be items that are on the list due to an event that is actually true. --Jennifer (JBS66) 12:28, 5 December 2011 (EST)

I have two concerns: 1) I think we should avoid using the word Error unless something is clearly impossible. There are many strange situations that I encounter and some that seem implausible actually appear to be correct. A word like Suspect or Implausible might be better than Error. 2) Tying this report to Trees is problematic. I have many pages that I am watching that are not in fact in any Tree - certainly not any Tree of mine and nobody else is watching them. Might it make more sense to tie this report to pages a user is watching? --Jhamstra 18:17, 5 December 2011 (EST)

We currently use three labels (severity levels) in the gedcom uploader: alert, warning, and error. I think it would be helpful to use the same labels, with the same reported problems, for the post-upload report. I don't mind changing the labels or which label to assign to which problem; but I think it would be helpful to be consistent.
I agree that the report ought to be tied to watched pages rather than individual trees.--Dallan 22:39, 5 December 2011 (EST)
I would also prefer to keep the same severity labels as are used in the gedcom upload. I further agree to make the watched pages the selection criterion. --Klaas (Ekjansen) 23:07, 5 December 2011 (EST)
Alert/Warning/Error is fine with me - I just didn't want to see everything flagged as an Error. --Jhamstra 23:24, 5 December 2011 (EST)

Will there be a way to mark something as alright so it doesn't keep showing up on the report? -- Amy (Ajcrow) 14:50, 6 December 2011 (EST)

Yes, we need this. Maybe a template on the talk page with the text or an id of the warning on it.--Dallan 00:09, 11 December 2011 (EST)

  • We want to add a check for junior (jr) and senior (sr) in the surname field.
  • We want to break this suggestion into two parts: (a) run the checks on every person/family edit, and (b) run periodic (daily/weekly) checks against the entire database to create warnings lists for everyone, pointing to warnings for their watched pages.
  • We'd like a way for people to view the warnings lists of users who have lots of warnings, so they can help reduce them.
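As a minimal sketch of the first two points above, the following Python shows one way a jr/sr check on the surname field could work, and how a periodic pass could group the resulting warnings by watching user. The regular expression, the function names, and the dictionary-shaped inputs are all assumptions made for illustration; they do not reflect how WeRelate stores pages or watchlists.

```python
import re
from collections import defaultdict

# Hypothetical rule: a generational suffix at the end of the surname field,
# preceded by at least one other word ("Smith Jr.", "Smith, Sr").
# A surname that consists only of "Senior" or "Junior" is not flagged.
SUFFIX_IN_SURNAME = re.compile(
    r"\w[\s,]+(jr|sr|junior|senior)\.?\s*$", re.IGNORECASE
)


def surname_has_suffix(surname: str) -> bool:
    """Return True if the surname field appears to end in jr/sr."""
    return SUFFIX_IN_SURNAME.search(surname) is not None


def warnings_by_user(surnames_by_page, watchers_by_page):
    """Periodic pass: map each user to the watched pages that have a warning."""
    lists = defaultdict(list)
    for page, surname in surnames_by_page.items():
        if surname_has_suffix(surname):
            for user in watchers_by_page.get(page, []):
                lists[user].append(page)
    return dict(lists)
```

The same `surname_has_suffix` function could be called on a single page at edit time for part (a), while `warnings_by_user` corresponds to the periodic run in part (b).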

Off and on, I worked to create a specification on this sort of thing - see Functional Specification for Data Consistency --jrm03063 12:11, 30 September 2012 (EDT)


I'm wondering if checking for possible duplicate Person pages is part of the planned functionality of this suggestion. We currently have a check and report system in place for duplicate Family pages, but not Person pages. --Jennifer (JBS66) 10:43, 17 October 2012 (EDT)

I don't have a great explanation for that - other than to say that there's something a little magic about the amount of information present when you have two people in a marriage, versus a single person standing alone. It just seems that with two fully known people in a marriage, there's enough information to rise above the noise and detect matches with good certainty. --jrm03063 12:35, 17 October 2012 (EDT)
Right -- detecting duplicate families is a much easier problem. Detecting duplicate people would likely report a bunch of possible duplicates that aren't really duplicates.--Dallan 06:57, 11 November 2012 (EST)
I fully understand the difficulty in matching individuals, but I wonder if we might at least have reporting on people with same first and last name and exact same birth date. Every time I upload a GEDCOM, I search for duplicates for every person at the borders of my scope - e.g., a person married to a sibling of an ancestor, to see if he/she has been added as a child of his/her parents, without any marriages. I usually find anywhere from one or two to several matches. I assume most people don't think to do this, and I bet we have a lot of duplicates that would allow us to link families together.
Also, there is a whole intellectual domain on person-matching algorithms, which are built into software products used by companies to consolidate customer information. There's bound to be some stuff in the public domain that might prove useful - either to help build a matching algorithm or to determine that we don't capture enough info. Worth some research at some time. I might look into this eventually if no one else does.--DataAnalyst 23:08, 30 November 2013 (UTC)
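The simple report suggested above (same first name, same last name, and exact same birth date) could be sketched as below. This is not a real person-matching algorithm, just exact-key grouping; the record fields `title`, `given`, `surname`, and `birth` are hypothetical, and dates are compared as literal strings, so real data would first need date normalization.

```python
from collections import defaultdict


def exact_match_groups(people):
    """Group page titles that share given name, surname, and birth date.

    Records missing any of the three fields are skipped, because a match on
    partial information would produce too many false positives.
    """
    groups = defaultdict(list)
    for p in people:
        if not (p.get("given") and p.get("surname") and p.get("birth")):
            continue
        key = (
            p["given"].strip().lower(),
            p["surname"].strip().lower(),
            p["birth"].strip().lower(),
        )
        groups[key].append(p["title"])
    # Only keys shared by two or more pages are possible duplicates.
    return [titles for titles in groups.values() if len(titles) > 1]
```

Each returned group is only a candidate for review by a person; as noted in the discussion, matching individuals reliably is a much harder problem than matching families.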

Until such time as it might be feasible to implement such a tool for a user's watchlist, this can be accomplished for a tree by downloading the tree to a GEDCOM and uploading it again. There is no need to process the GEDCOM - it can simply be removed after the messages are reviewed and addressed. The messages can be printed to be addressed later. There is, however, the 5000-person limitation on uploading a GEDCOM.--DataAnalyst 02:58, 4 January 2017 (UTC)



A tool of this sort doesn't have to live on the site proper. Even without the API we would get with a new version of MediaWiki, I've been able to get pretty far screen-scraping my way around. Of course, it would probably be a little less hideous if there were an API to go through. In any case - I think code to do consistency checking need not be part of the WR site code proper. So - if we're thinking about how to allocate a limited development resource - I would think going to a new version of MediaWiki would take priority over this... --jrm03063 23:59, 29 June 2017 (UTC)