Talk:Data Quality Improvement


Related Suggestions [4 July 2016]

I just added 2 suggestions to assist with finding Data Quality opportunities and monitoring overall Data Quality.

  • The first is to improve the Birth Century facet to make it easier to find pages with missing birth date (currently pages with christening date as proxy for birth date, which is a common genealogy practice when the birth date is not recorded, show up as unknown birth century). Right now, a quarter of our Person pages show up as Unknown Birth Century, which is discouraging until you realize this includes everyone with a christening date as proxy for birth date.
  • The second is to support the more accurate collection of statistics. I find that monitoring statistics over time, especially the percentage of pages with, say, missing birth date or Unknown first name, helps motivate people to continue contributing to the improvement of the site.

Please vote for these suggestions by watching them, and if anyone has pull to get these implemented quickly, please do so. Thanks.--DataAnalyst 15:53, 3 July 2016 (UTC)


Working with the Sources Needed list [Jul 2016]

If there was some way for the persons listed on the Sources Needed list to have appended even one location from the page, folks could work on geographical locations they are familiar with. I checked a couple of names but they were from England and I have NO idea of English sources. A little more geographical info would certainly help us in selecting who we might be able to help. --janiejac 16:45, 3 July 2016 (UTC)

Continuing with the idea of working with a specific location, I went to Place:Wood, West Virginia, United States and clicked on what links here. The results list all the templates, then all the sources, then all the places and FINALLY the persons, which is what I wanted to see. That page cannot be edited, but if I could have, I would have placed a template on the page so that one could select either family, person, place or alpha. I guess the template normally used for this would have to be edited to include templates. But some way to sort the results on a 'what links here' page would be very helpful! I'll go add this to the suggestion page. (And yes, I voted for Data's suggestions above!) --janiejac 17:22, 3 July 2016 (UTC)

I agree location information in the Sources Needed list would help match the pages needing work to people with some knowledge of the area. But you can add "Sources Needed 1" Massachusetts (with quote marks as indicated) to the keyword field and do a search to find people in Massachusetts (substitute your locality of interest).

But I also burned out on that project because it all seemed to be from one GEDCOM upload and I know there are many, many others. And such work is still coming in from various current users, I might add: pages with no dates, no sources, no locations, just sparse pages that in quality and appearance look just like the old GEDCOM dumps, but are recent manual edits. All you have to do is browse the current activity every once in a while.

The bad pages are there. There's lots more than are on the Sources Needed list. It would be great if a bot could flag them all instead of just one GEDCOM's pages. And then perhaps assign sub-categories based on location if the page has enough information on it, etc. It would also be great if the software could catch future bad pages, and put up popups that warn when there are no dates, no sources, or no locations - at least slow down the data entry so that perhaps those doing it figure either that it's not worth the time to wait through the popup, or that they should invest more effort to bring their work up to some acceptable standard. The popup could allow them to continue anyway, perhaps with a small wait time, and flag the page right at the time of saving as needing attention. Perhaps it could build lists like the duplicates report of all flagged pages each user is watching. Since they would appear to have some interest in the page, as indicated by watching it, it should then be likely that they can help clean it up. --Jrich 18:15, 3 July 2016 (UTC)
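A minimal sketch of the kind of check such a bot might run, flagging pages with no dates, no sources, or no locations. This is not how WeRelate works today; the PersonPage structure and the function names are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PersonPage:
    """Hypothetical stand-in for a Person page; not a real WeRelate structure."""
    title: str
    dates: List[str] = field(default_factory=list)
    sources: List[str] = field(default_factory=list)
    places: List[str] = field(default_factory=list)


def quality_flags(page: PersonPage) -> List[str]:
    """Return the reasons a page would be flagged as needing attention."""
    flags = []
    if not page.dates:
        flags.append("no dates")
    if not page.sources:
        flags.append("no sources")
    if not page.places:
        flags.append("no locations")
    return flags


def pages_needing_attention(pages: List[PersonPage]) -> Dict[str, List[str]]:
    """Map each flagged page title to its list of flags; clean pages are omitted."""
    report = {}
    for page in pages:
        flags = quality_flags(page)
        if flags:
            report[page.title] = flags
    return report
```

The same check could run at save time to drive the warning popup, or in a batch to build a per-user list of flagged pages that each user is watching.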


I'm pretty much lacking in tech knowledge, but if there was a way to automate adding geographic location to pages in the Sources needed category, it would help. Spent considerable time working with the Institutional version of Ancestry and trying to remember high school Spanish the other day trying to find a Mexican marriage, for example. (I try to take one page per day, and right now they are just alphabetical.) --GayelKnott 18:06, 4 July 2016 (UTC)

I would suggest you go to the Old GEDCOMs page. Some of these trees are fine, but most need to be reviewed to see if:
  • any individuals might be living (check anyone without a birth date - easy to do with the new Birth Century facet)
  • sources are needed
  • there are impossible families (e.g., children born before their parents)
Click "View" for any GEDCOM and then select the Person namespace. This will give you the count of countries mentioned in the tree. You can then decide whether or not you want to work on that tree. Note that some of the trees have been previously reviewed and no egregious errors noted - but they have many pages without birthdates and they may also be sadly lacking in sources. If you decide to work on a tree, I would suggest sorting by page title - makes it easier to keep track of where you are. (I am currently working through In Genealogist - a multi-month project. Would love to have others tackle other Old GEDCOMs.)--DataAnalyst 01:19, 5 July 2016 (UTC)
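For anyone who wants to script part of such a review, here is a rough sketch of two of the checks above: people without a birth date, and children born in or before a parent's birth year. It assumes the tree has already been exported into simple records; the Person and Family classes and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Person:
    """Hypothetical record for one person in a tree."""
    name: str
    birth_year: Optional[int] = None


@dataclass
class Family:
    """Hypothetical record for one family in a tree."""
    parents: List[Person]
    children: List[Person]


def missing_birth_dates(people: List[Person]) -> List[str]:
    """Names with no birth year; these need a check for possibly living people."""
    return [p.name for p in people if p.birth_year is None]


def impossible_children(family: Family) -> List[str]:
    """Names of children born in or before the birth year of any parent."""
    parent_years = [p.birth_year for p in family.parents if p.birth_year is not None]
    if not parent_years:
        return []  # nothing to compare against
    latest_parent_birth = max(parent_years)
    return [
        c.name
        for c in family.children
        if c.birth_year is not None and c.birth_year <= latest_parent_birth
    ]
```

A stricter rule (for example, requiring a minimum gap between parent and child) would catch more errors, but the gap to use is a judgment call and is left out here.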

Comments on Currently Featured Task for Nov 2016


Sources vs. Repositories [13 November 2016]

[BEGIN comments moved from main article page]

I would like to propose that the pages showing the collections mentioned above as "mixed reliability" be changed to "Repositories" rather than "Sources." Individual data elements, facts, records, and pages within those repositories are (or could be considered) sources of varying reliability, but I feel each of those sites is in fact a Repository (as meeting the definition shown at the Repository Portal here at WeRelate), as opposed to a Source. I lost that argument 7 years ago, but hopefully I can garner support for that move now.
Since I don't think we can just use the Rename function, each of the pages would have to be deleted and then reentered correctly. I would be willing to do so, but only with the support of the community and the direction of the WROC. Please provide your feedback. --BobC 19:00, 12 November 2016 (UTC)
Personally I don't see how Torrey qualifies as a Repository. It is a book. When you go to the library, you find one card in the catalog for the whole book. The library is the repository. IGI is a collection of facts in a database. Surely we don't want a source page for each of millions of birth records, death records, etc. If the entire repository namespace went away I think people would stop trying to make it into something useful. --Jrich 21:30, 12 November 2016 (UTC)
Agree with Jrich. According to the dictionary definition, a repository is "a receptacle or place where things are deposited, stored, or offered for sale". I don't really see any of the above as repositories, unless you want to get into hair-splitting arguments over the difference between a publisher and a repository (or a database). (I don't.) As for Torrey, agreed that he has problems, but he is still a source (where someone got their information). Better, on a case by case basis, to show why it's wrong -- where it is known to be wrong. --GayelKnott 21:51, 12 November 2016 (UTC)
To be more specific, I was referring to the "mixed reliability" references: i.e. Ancestry.com, LDS records (International Genealogical Index, Ancestral File, Pedigree Resource File), RootsWeb's WorldConnect Project, and others such as Find A Grave, BillionGraves, etc.
Tell you what, I'll once again withdraw my recommendation since it seems to draw such misunderstanding and contention. Ain't no big deal and really isn't worth my time building my case again. --BobC 00:17, 13 November 2016 (UTC)
I appreciate the withdrawal, but I personally would like to try and find out more because, if possible, I would like to head off later discussions.
Some specificity might be useful. Personally I think the definition of repositories referred to is (like many other WeRelate pages) typically vague. In other words, the idea of a physical library is intuitively obvious, but translating that into the virtual space is a little harder, and the page gives no examples or guidelines specifically applicable to virtual repositories. I can justify any source-repository division: from the only repository is the Internet and any website on the Internet is merely one source in the "Internet" repository; to the base URL is a repository and all subpages within the same base URL (being managed presumably by one entity) are all sources in a single repository; to collections (i.e., subsets that make sense to the website manager) are individual sources within a repository (because they have related data and are managed with a coherent philosophy). I think the current examples are merely choosing a different division, but the rationale for your choices isn't clear to me.
I can't justify making a record in IGI into a source, so I don't understand how IGI could be a repository, because the one attribute that I believe is true is that a repository is a container of sources. I can't justify making AFNs a repository because they are of one collection managed with one philosophy. One AFN is pretty much like any other AFN and they are like pages in a book. I believe everybody would equate a book with a source, so that means I think people should equate AFN numbers with page numbers, i.e., location information within a single source. Further I think AFNs have an encompassing repository, namely familysearch.org. I think one wants familysearch.org to be the repository so there is a place (the Repository page) where one could discuss familysearch.org as a whole, such as their policies about getting a login, their relation to Family History Centers, etc.
In the old days, tracking down a source cited on a website often required one to find a library that held that source, and might mean traveling to one or two locations in the country to access the source. In those days, citing a repository seemed like merely a courtesy to help people find the source, since if one already knew the location of the source, one could pass it on and save people a little trouble. It was not significant in identifying what source one was using. One might have cited IGI and people would have to figure out the location where they could access IGI; in other words, they had to figure out the Repository.
Currently, many sources are available in many places. For example, the vital records of Billerica, Massachusetts, may be found in at least 4 places on the Internet (for free) and maybe more. So in the virtual space, the main point of a repository page is to tell people what they need to do to get to the source held by that Repository, like creating a login, paying a fee, etc. --Jrich 04:03, 13 November 2016 (UTC)

[END comments moved from main article page]


I would welcome opening up a discussion on the Repository namespace again - what the original design was intended to provide in the 2006 environment vs. what we should provide in today's 2016 environment and looking forward.

I respectfully suggest that such a discussion be held on a more related talk page such as Help talk:Repository pages or perhaps even the main talk page for the Source patrol, WeRelate talk:Source patrol, since there is no "Repository patrol". --cos1776 14:02, 13 November 2016 (UTC)

As I referred to the previous discussion of seven years ago, y'all can review it at the Repository Portal Talk Page, which includes Dallan's comments from that time as well. I'm not sure I can restate my opinion any better or any differently now than what was written there, so please take a few moments to review it.
I'll briefly encapsulate my viewpoint as follows:
  • A Repository is where you went to find the source (which in my opinion includes both physical locations and virtual domains).
  • A Source is what you found that contains the facts or events you are substantiating.
  • A Fact or Event is the specific genealogical information or content the source contains.
I think we can all agree that The New York Public Library is a Repository, and that the book in the library that contains a listing of passengers and descendants of the Mayflower voyage is a Source, and that the data elements relating to the passenger named Moses Fletcher are the Facts or Events we are citing and using in our person page here at WeRelate. So, if the library closed its doors next year and maintained an on-line presence only, but it scanned and digitized images of all of its holdings including the Mayflower book, then we should accept that the library's website domain address of nypl.org is also a Repository, and the specific name of the online Mayflower book (as well as the full URL address of the on-line book) is the Source, and the same data elements are the facts and events cited.
So therefore, a website without a physical location such as the Ancestry.com domain should be considered a Repository (many might say "dumping-ground"), not a Source. Each of the census records within Ancestry.com's domain is a Source, and your family's specific genealogical record and the data elements found within that census record are the facts and events cited in your person or family page.
Does that make sense? Anyway, that's my point of view. --BobC 18:53, 13 November 2016 (UTC)
I would also call Ancestry.com a repository (Ancestry as an organization is who you have to go through to access the sources within, just as you have to go to the New York Public Library to access one of its holdings), but I would call its various collections, like Source:Ancestry.com Public Member Trees or Source:Edmund West. Family Data Collection - Individual Records, sources. I think using Ancestry.com as a source is sloppy/lazy citing by anybody that does it.
But this discussion started lumping several mixed reliability sources into one group, all of which, it was maintained, should be called repositories, and of that group I think Ancestry.com is the only one that looks definitely like a repository to me. The others all appear to be single collections that can exist within one, or possibly, multiple repositories.
I tried reviewing the comments as suggested but they are too spread out and unfocused to get a clear takeaway of how they apply. But some of the ideas you express there are, I believe, incorrect. "The Source is what you found." I disagree. The Source is the vessel containing the information you found. The Source is not data, it contains data. Repository is about how to access that container of data. In the physical world, where to go and what the hours were. On the Internet, what organization was providing the access, and what are the terms.
It appears that Find A Grave was proposed as a repository because it is just one location where the inscription may be found. The inscription is not the source, it is the data. The gravestone is one source and if I took the inscription from the gravestone I should identify the gravestone as my source. If I took my inscription from a book I identify the book as my source, and if I took my inscription from Find A Grave that should be identified as the source. These multiple sources are not interchangeable, and thus it is conceivable that different data may be taken from each of them. Find A Grave, for example, may have a picture that is out of focus or badly lit, or missing, any recorded inscription may represent a misreading of the gravestone, and a gravestone may no longer be readable in person but presumably was when captured 100 years ago in a book.
Clearly how much sources are conglomerated is a matter of practical choice. In the case of Find A Grave, it is certainly possible to make each memorial page a separate source page, but that is not analogous to how books are used, nor does it seem to provide much benefit, as if a single gravestone wants to be discussed, it probably wants to be on the Person's page, and not off on some Source page. In comparison, when an inscription is pulled from a book, we indicate the whole book as the source. The page number where the inscription is found is given as details within the citation. Thus, Find A Grave makes sense as a single source, identifying the memorial page within Find A Grave being done within the citation. --Jrich 21:18, 13 November 2016 (UTC)

Pick a Source [24 December 2016]

As a person who has been, and still is, working on a project of this type (for over a year), I believe this approach of working through sources of thousands of records and posting them to pages is much harder than is being portrayed here. Individual records provide a single snapshot without much context, and it can be very easy to fall into the mistake of name matching. The rush to apply a record and move on can fool one into making intuitive mistakes that look reasonable - but might have been avoided if multiple sources were compared. It requires a lot of experience to do this properly, and the willingness to always branch out to other sources and do in-depth research anytime there is the slightest hint of conflict or confusion. This is an attitude not encouraged by an approach where the goal is to get through a massive list of records. --Jrich 16:47, 24 December 2016 (UTC)


Sources needed [9 January 2023]

I notice that Category:Sources Needed is listed under both "Permanent Opportunities" and "Tasks that Require a Moderate Amount of Research" 25 May 2016. The "Moderate 25 May 2016" paragraph also mentions two more categories. Perhaps Template:cn and Template:Unsourced should also be mentioned somewhere on the page?--fbax.ca 02:48, 8 January 2023 (UTC)

I added Template:cn. Template:Unsourced is not an official WeRelate template and only lists a handful of pages.--DataAnalyst 11:58, 9 January 2023 (UTC)