WeRelate:Suggestions/Source matching during GEDCOM upload

During a GEDCOM upload I have a source named '1850 U.S. Census'. It does seem like the system should be able to automatically suggest "United States. 1850 U.S. Census Population Schedule" as a match!! Instead it takes me 3 clicks for every year a census record is used to select the matches. That is nine census years times 3 clicks each - which should be entirely unnecessary. Can you fix this some way?? --Janiejac 13:45, 8 May 2012 (EDT)

The following comments (down to my comment on Nov 11) were copied from the "Smarter source matches" page, which I will redirect to this page. Both pages are about the same issue, but I like the title of this one better, as it comes up when I search for GEDCOM suggestions. --DataAnalyst 08:59, 11 November 2012 (EST)

If I have a source that says 1850 U.S. Census why can't the system figure out that this is the United States Population Schedule for 1850??? Is there any other 1850 census record it could mean? Changing all these census sources with every GEDCOM upload is a pain. I think this is surely one that needs to be fixed before the site comes out of beta!!

I think I've done my last upload until matching gets smarter. Even the upload I'm currently reviewing, I'm leaving a lot of places in red because the system can't handle 'prob' in front of a place name. I don't want to assert that the birth location is a proven item if it is only an assumption. But I'd rather include my best assumption as a probable than leave the area blank. --Janiejac 22:32, 2 June 2012 (EDT)

Possibly, not the best example? I believe there was a separate slave schedule in 1850 that collected information for a slave owner, who could also be listed as a head of household in the population schedule. One of many things that makes easy uploading of GEDCOM a mirage. You title the sources as makes sense on your personal computer, WeRelate titles them as makes sense from something approaching a library perspective (I guess, I don't pretend to understand it). --Jrich 00:38, 3 June 2012 (EDT)
Jane, could you give a link to a person like that so I can see what you mean? Thanks. AndrewRT 14:14, 4 June 2012 (EDT)
I have found source matching to be the most frustrating part of GEDCOM upload. I understand the hesitancy of automated matching, given the number of similar sources, but I would make 2 suggestions:
First, WeRelate should keep track of how I have matched a named source in each GEDCOM upload, and then match to the same WeRelate source the next time I upload a GEDCOM file. I have been keeping track myself, but it would be so much nicer if WeRelate would at least deal with matching sources I have already matched. I reuse sources a lot, and I assume others do as well.
Second, cut down the number of steps to manually match a WeRelate source. For example, if I am matching sources in a GEDCOM file, it is unlikely that I will want to use a recently selected source (although I might want to use a recently created one). Could we skip the recently used sources screen when matching GEDCOM sources (but keep it for all other situations)? And still give the user the option to choose to look at recently created sources (i.e., a button to see recently created sources). Also, if I remember correctly, there is a button I have to select in order to initiate source matching once I put my cursor on my GEDCOM source - can that button also be skipped? Anything to make source matching faster.--DataAnalyst 08:45, 11 November 2012 (EST)
Unfortunately people do not practice consistent citation of sources. Your citations and somebody else's are not alike. So "smarter" matching is a difficult problem. What's smart for your source citations may not be smart for those of somebody else. Some ideas that are simple to put into words are not so easy to implement in a computer (recognize USA as United States: or is that United States of America, or both and US and U.S. and U.S.A. as well, or is it only when accompanied by the word Census in the title field only, etc.)
At some point there is a basic issue of whether there is more value in being easy than in having consistently formatted data. For example, Wikipedia's ref tag enforces no consistent form on source citations. But then it is unknown if source citations refer to a common work and there is no page for discussing the source.
Undoubtedly there are some mechanical issues that can be tweaked, such as how to select recently used sources, or whether to match the title of the source in a GEDCOM to the title field, or the Source page title? (What GEDCOM field matches Place where the census's County is stored?) But source matching is going to require work no matter what. Jillaine suggested why bother: leave them as MySources, it still gives a useable citation. But why have multiple MySource pages, maybe hundreds, describing what is already described by a Source page? If a person doesn't care to take the time to match to a source, then make it a Citation Only (basically a freeform citation). Nobody cleans up their MySource pages, so why not accept it, and just capture the citation as text (just like those people who list their sources in the narrative instead of letting their genealogy software manage their sources). --Jrich 11:06, 11 November 2012 (EST)
This is an interesting idea. I changed the GEDCOM upload process earlier this year to create a MySource only when there was additional information in the GEDCOM source that would be lost if a citation-only source was created. The goal was to preserve round-trip-ability so that you could later export the GEDCOM file and preserve all of the source information. But perhaps creating a lot of MySources that no one will maintain is a worse solution than turning all unmatched MySources into citation-only sources with the additional GEDCOM source information added to the citation text. How would people feel about that?--Dallan 13:04, 20 November 2012 (EST)
I have recently reused MySource pages that were originally created in a GEDCOM upload. And I can follow the links to see all my citations from that source (or citations added by other family members, if and when that happens, for that matter). I would not wish these to have been turned into citations. Maybe you can add a checkbox to allow users to decide whether or not they want a source to be a MySource - default could be citation only, and users have to check the box to make it a MySource. I might have to think about some before I could decide which I preferred, but at least I would have some control over it.--DataAnalyst 21:48, 20 November 2012 (EST)
I really hate the conversion you already did to citation-only, and don't want to see all of them go that way. It breaks links to existing MySources, it's ugly, and it's much harder to track and fix than MySources. Just because some people fail to use functionality doesn't mean it should be made *less* useful. Also, is it still true you can't redirect a MySource to a Source? If it's not possible, it should be, and if it is or will be possible, that's a huge reason not to do this, since it vastly simplifies cleanup.--Amelia 23:02, 20 November 2012 (EST)
You can't redirect MySources to Sources right now; it doesn't even appear to be on the suggestion list. I'll add it.--Dallan 23:38, 20 November 2012 (EST)
In response to JRich's comment (11 Nov 2012) - I wonder if my suggestion was properly understood, so let me try again. If I have manually matched my source A to WeRelate source B in one GEDCOM upload, then I want my next GEDCOM upload to automatically match my source A to WeRelate source B again. This requires tracking how each user matches their sources - so that if someone else also has a source A it is not affected by how I matched my source A.
I sometimes match 50-100 sources on a GEDCOM upload, which can take a couple of tedious hours. Up to half are sources I have previously matched. Automatic matching of a source I have previously matched (when my GEDCOM presents it exactly the same as before) would save me quite a bit of time. Alternately, if I have to search for sources, returning the ones I am watching first would help, as I watch all sources I have matched (to make it easier to pick the consistent one in the future).--DataAnalyst 12:58, 8 December 2012 (EST)
I'm not sure I agree with separate source matching for different users - this means that new people coming to the site for the first time would not get the best experience possible. I would rather design a system that works best for everyone. One simple way is to create redirect pages for the source pages that are regularly matched to another page - e.g. Source:1850 U.S. Census should redirect to Source:United States. 1850 U.S. Census Population Schedule. The matching process during GEDCOM upload could be changed to create these redirect pages. AndrewRT 16:05, 16 December 2012 (EST)
Similar examples have been raised before, and I have questions about them. As I recall, the guidelines say to match to a source page for the census at the county level. Should Source:United States. 1850 U.S. Census Population Schedule ever be linked to? I don't see why if people are following the guidelines. So do we want GEDCOM uploads to be doing something regular data entry doesn't?
I think this attention on censuses is focusing the discussion on too narrow a basis, anyway. Trying to predict what each incoming person wants is impossible. For example, the guidelines for articles (somewhat vaguely) suggest most of the time source pages for individual articles aren't needed. If my GEDCOM cites author: Moriarty, G. Andrews and title: "Parentage of George Gardiner of Newport, R. I", is that an article, and if so, should it link to a magazine source page, or is that a book that should have or match its own source page? And once I figure that out, how can I make it do it again?
What needs to happen is that people need to be educated in the mysteries of WeRelate before they upload a GEDCOM, and probably forced to start with very small GEDCOMs, and then the size limit increased as they gain experience, so they can develop systematic approaches like Mike talks about. It does make sense to allow users to develop personal aliases or have personal favorites (that affect nobody else's GEDCOM uploads) so subsequent uploads go smoother by capturing what they learned in earlier trials about mapping their personal preferences to WeRelate forms. --Jrich 17:36, 16 December 2012 (EST)
The vast majority of GEDCOMs are designed for purposes other than WeRelate. I would rather we sought to make our system as flexible as possible rather than force other people down the route of having to modify their GEDCOMs just to fit in with the particular (in most cases rather arbitrary) naming convention we have chosen here. That doesn't seem to be a route for making WeRelate successful. AndrewRT 17:47, 16 December 2012 (EST)
Thank you Andrew! I'll agree with you! And I'm the one who started this conversation by asking why my census can't be automatically matched with an existing census source (see first paragraph above). I do not use the county first guidelines and I doubt if many folks who are not professional genealogists do either. We already have a source that begins with the year and that's what I've been matching to but it is not an automatic match - it takes several clicks to get there with each census record matched and this is more frustrating than most general users are going to bother with!!
We already have this Source:United States. 1850 U.S. Census Population Schedule so why can't my source "1850 U.S. Census" be an automatic link to it? --janiejac 18:50, 16 December 2012 (EST)
In response to janiejac: I believe DataAnalyst's suggestion would, after you connect your "1850 U.S. Census" to "Source:United States. 1850 U.S. Census Population Schedule" once, result in that pairing being retained for you so that next GEDCOM it would be at the top of the list. Specifically, why it can't happen now, the answers could be one of several, but the most obvious is that there is a "Source:United States. 1850 U.S. Census Population Schedule" and a "Source:United States. 1850 U.S. Census Slave Schedule" which both contain all the words in your title, so it is not possible, even without considering states and counties, to match one entry unambiguously. I suspect the idea of naming census pages at the county level was made long before GEDCOM uploading was a possibility, so revisiting the reasons for that guideline and seeing if it is still thought necessary to clutter up the matching with so many county level pages might be a viable question? Your question, in turn, begs the question, why have county level source pages if all the GEDCOMs are going to be matched at the nation level?
In response to AndrewRT: you have exactly identified the biggest problem with GEDCOM uploading: the GEDCOMs are rarely prepared for, or suitable for sharing in a collaborative environment, especially one that has standard forms for naming sources and places. GEDCOM upload is inherently hard, and it is a false expectation to believe otherwise. Even if sources and places are not a problem, you still have to compare your data with data already there and make decisions about what stays and what goes. There is a lot of human decision making involved. Matching sources and places is probably the easiest part, if one has taken the time to learn how those systems work. I disagree that making GEDCOM upload easy is a significant factor in the success of WeRelate. If anything, volume of contributors has been a characteristic of unreliable, nearly worthless, sites like AFNs, WFTs, AFTs, etc., while useful sites are typically controlled by small groups that can enforce standards, like GMB. Small tweaks like those DataAnalyst suggests, which allow each user to have a history of matches that show up at the top of the list, or modifying the tabs to be more single-path, may make some small differences. But GEDCOM upload will remain difficult. Fortunately, once good data is entered on a Person page, that data should seldom change, so the difficulty of entering data gradually becomes irrelevant. Ease for ease's sake is not the long-term answer. By empowering people that don't want to follow Mike's example (of learning how to work with the system), we are effectively encouraging people to load in data that overwhelms our guidelines with non-conforming data, to the point that the guidelines may as well not exist. By making things easier, you are only going to get more people that dump their data and disappear. People that value all the potential of a collaborative wiki will realize that it will take work to merge their data with the data of a diverse set of fellow researchers.
--Jrich 00:01, 17 December 2012 (EST)
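The ambiguity Jrich describes for "1850 U.S. Census" can be illustrated with a minimal sketch. It assumes a naive matcher that accepts any Source title containing every word of the GEDCOM title; this is not a description of how WeRelate actually searches. The function names and the third title in the list are hypothetical and exist only for the illustration.

```python
import re

# Illustrative Source page titles. The first two are named in the
# discussion above; the third is a hypothetical extra entry.
SOURCE_TITLES = [
    "United States. 1850 U.S. Census Population Schedule",
    "United States. 1850 U.S. Census Slave Schedule",
    "United States. 1860 U.S. Census Population Schedule",
]


def words(title):
    """Lowercase a title and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def candidates(gedcom_title, source_titles=SOURCE_TITLES):
    """Return every Source title that contains all words of the GEDCOM title."""
    wanted = words(gedcom_title)
    return [t for t in source_titles if wanted <= words(t)]


if __name__ == "__main__":
    for title in candidates("1850 U.S. Census"):
        print(title)
```

Under this assumption both 1850 schedules qualify, so a word-containment rule alone cannot pick a single page without some further tie-breaker.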

Another observation about the use of MySources during GEDCOM upload: when a person picks up their football and goes home, all those shared Person and Family pages that the person contributed to, but which didn't get deleted because others are watching, end up with a bunch of red MySource citations. If the citation is an article, with all the magazine information on the now deleted MySource page, it can be difficult to determine what it was meant to refer to. Of course, you can delete the source citation, but there is a probability some of the data on the page came from that source citation. --Jrich 12:27, 12 February 2013 (EST)

jrich, pls clarify. Why would a MySource be deleted if/when someone leaves WR? Their data is still there, yes? Jillaine 06:10, 25 March 2013 (EDT)
I assume they are because I find red MySource entries a lot and I assume this is the result of somebody deleting their tree, and MySources are presumably only watched by the creator? In particular, I found it difficult when it was an article source and now the magazine information is gone. At least I assume it was an article because I know of such an article, but maybe it was a reprint they were citing? In any event, I can't find out because the MySource page is gone. (Additionally, the handling of articles is a topic that needs reinvestigation and reorganization. We have citations of articles done using the record field while naming a magazine source, and also on a different page, as naming an article source. So no consistency. The rules indicate they should only be used when needed, whatever that means, and some users create them all the time. FWIW, I think articles should be subpages of the magazine source. But all that is a different discussion.) --Jrich 09:35, 25 March 2013 (EDT)
See [1]. I recognize "Early Records of Boston" as an article in NEHGR. Undoubtedly "Edward Garfield (1575-1672), of Watertown, Mass." is an article too, but what magazine? After some searching, I found it in NGSQ. --Jrich 09:50, 25 March 2013 (EDT)

Forgive me for asking a silly question, but why do we have a facility where people can delete pages if they are the only person watching them? I don't understand the rationale if we are asking people to freely license their contributions and then saying they can take them with them if they leave? AndrewRT 14:58, 26 March 2013 (EDT)

Two reasons I'm aware of: 1) we don't have a facility to update one's own GEDCOM, so allowing users to delete their own pages was a way of allowing people to delete their tree and re-upload. Outcry from those watching pages that were deleted for this reason led to the restriction that pages are only deleted if no one else is watching the page (or, I think, affiliated pages like family and children -- why this doesn't apply to MySources, I don't know). I would agree with you and be happy to get rid of this rule, except, reason 2) It lets people clean up their own mess. If they enter a bunch of living people or people that don't exist, or even just realize later they made a mistake, they can fix it instead of using Speedy Delete and waiting for an admin to find it.--Amelia 18:04, 26 March 2013 (EDT)
I think the left hand menu item "more" contains a delete that works if you are the only contributor (i.e., speedy delete is only necessary when there is more than one watcher). --Jrich 22:32, 26 March 2013 (EDT)

This suggestion has been implemented. When you upload a GEDCOM, sources in the GEDCOM are automatically matched to Source pages if:

  • you previously matched a source with that title/author, or
  • at least two people have previously matched sources with that title/author (ignoring minor punctuation differences, etc.)

In addition, the source-matching process has been streamlined. An unnecessary screen and a click have been removed. --Dallan 21:23, 4 November 2015 (UTC)
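
The two automatic-matching rules listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual WeRelate code: the `MatchHistory` class, the `normalize` function, and the page name `Source:X` are all hypothetical; "ignoring minor punctuation differences" is approximated by dropping case and punctuation; and the second rule is assumed to require that two different users matched the same Source page.

```python
import re
from collections import defaultdict


def _clean(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", "", text.lower()).split())


def normalize(title, author=""):
    """Hypothetical comparison key for a GEDCOM source's title/author."""
    return (_clean(title), _clean(author))


class MatchHistory:
    """Hypothetical store of which user matched which source to which page."""

    def __init__(self):
        # normalized key -> Source page -> set of users who made that match
        self._matches = defaultdict(lambda: defaultdict(set))

    def record(self, user, title, author, source_page):
        self._matches[normalize(title, author)][source_page].add(user)

    def auto_match(self, user, title, author=""):
        by_page = self._matches.get(normalize(title, author), {})
        # Rule 1: this user previously matched a source with that title/author.
        for page, users in by_page.items():
            if user in users:
                return page
        # Rule 2 (assumed reading): at least two people matched it to one page.
        for page, users in by_page.items():
            if len(users) >= 2:
                return page
        return None  # no automatic match; the user matches manually
```

For example, once one user has recorded a match for "1850 U.S. Census", that user's later uploads with the same title would match automatically, while other users would only get the automatic match after a second person has made the same choice. How the real system resolves a title that has been matched to several different pages is not stated above; this sketch simply returns the first page recorded.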