I believe the ability to merge duplicate pages will make WeRelate a better resource than it was before. Yes, WeRelate is a wiki, but it is a different kind of wiki than the very successful "Wikipedia." In Wikipedia, an author has to work to manage a page. WeRelate allows pages to be authored automatically simply by submitting a GEDCOM file. The typical barriers to wiki authorship are greatly reduced in the WeRelate wiki.
As with any wiki, the more time put in by dedicated contributors the more valuable the content of the page. Automated authoring potentially dilutes that value. At the moment, this is not a big problem on the site. It's still a good way to collaborate on raw data and the merge tool is helping us sort through it all and improve it.
Obviously, as duplicates are correctly merged, the merged data will gain value. On the other hand, being inundated in the future with automatically authored pages becomes more worrisome. Of course, we will ask people making submissions to be careful about what they submit and we'll try to introduce them to our tools, but unfortunately most new people are unskilled (by definition) and naive about the proper way to do things. There is also the problem of abandoned submissions. Inexperienced users performing lots of merges introduces another potential danger. Relying on inexperienced people to properly manage the wiki is a poor strategy.
So what do we do?
I think we have already discussed a version of this issue in the problem of Junk Genealogy. In that case there are old, well-researched genealogies that shouldn't be available for just anyone to access. I propose that, as researchers improve a page and its value increases, any valuable page should be protected like those discussed in the "Junk Genealogy" section.
How would this work?
- Define the concept of Managed pages.
- A Managed page has a Manager. The Manager of the page grants access. Only researchers who have been granted access can edit the page.
- A researcher must specifically request Manager duty on a page-by-page basis.
- A researcher must likewise request access from the Manager, also on a page-by-page basis.
Why this would work
- Managed pages are protected from automated edits
- Managed pages are protected from unsure new hands
- The need to perform a page-related act to request Management or access will naturally limit the number of Managed pages. WeRelate has worked pretty well so far, and this restriction protects that behavior.
- Sure, there will be problems with people becoming Manager and not doing their job correctly, or denying access rights when they shouldn't. However, these problems should be less difficult to deal with than issues concerning unknowing newbies waxing pages that have had a lot of effort put into them.
I think this proposal would suffer from an inability to scale. The audience for WeRelate is potentially hundreds of millions of people generating billions of pages. In ten years' time, there will be millions and millions of mature pages. It will be almost impossible to find enough people to do all the needed work.
Further, I think this creates an elite class of members who, regardless of how correctly they do their job, could become objects of resentment among other people, because things may occasionally appear arbitrary. It would be worse if they happen to be incompetent or to have an agenda.
The proposal needs more work describing how somebody who thinks a change is needed gets the attention of the Manager; the mechanism for giving them access versus everybody else; and how to appeal decisions that don't seem fair. Who chooses the Managers?
--Jrich 09:08, 27 October 2008 (EDT)
- I can believe a structure such as this might be needed eventually, but why not simply take advantage of wikipedia for this sort of thing? If a page has such a level of scholarship and quality that it warrants protection from less than careful modification, it's probably worthy of being a wikipedia page (if it isn't already). Maybe all we need to do is create some sort of unbreakable association between a werelate person page and a backing wikipedia page? --jrm03063 10:59 AM 27 Oct 2008
- I am not sure what you mean by a backing wikipedia page. I assume you don't mean a page on Wikipedia, since most of the people are far too obscure for that. Admitting that caveat, and hoping you will make allowances if I say something patently stupid, my initial response is that genealogy is never finished. So even on the best researched and documented page, there is always a potential for new discoveries. George Bowman, the editor of Mayflower Descendant, was incredibly priggish about acknowledging Mayflower Descent, but even after 40 years of being immersed in Mayflower lore, he apparently never realized that Experience Mitchell had multiple wives and a fairly mature genealogy had to be redone. We can never know the truth back more than a few generations, we can only interpret evidence. At any time new evidence may arise, or inspiration can lead somebody to a better interpretation. It is a mistake to close the process to further change. --Jrich 11:51, 27 October 2008 (EDT)
- I do indeed mean a wikipedia page. Of course, some person pages aren't going to be of sufficient notoriety to justify a wikipedia page, but we're only talking about those that have a non-trivial level of interest. In many cases, that means you're talking about someone who is worthy of a wikipedia page due to significance in some town's early history, or perhaps simply because they have particularly numerous or famous descendants. I'm suggesting that a 90% solution to this issue exists, without us doing anything, if we properly work alongside wikipedia. It's also an issue that I'm already sort of tangling with in the process of merging our medieval royalty pages and other "celebrities" (for example, US Presidents, and their immediate ancestors, etc.), where a wikipedia page already exists. --jrm03063 12:05 PM 27 Oct 2008
- There are lots of people with no "notoriety" that deserve to be protected, partly because they are subject of "disputed lineages", and newbies uploading GEDCOMs will continue to import the same old garbage, running roughshod over mature, collaborative data. I don't think it will be all that useful, either, if the data items in WeRelate are wrong, but don't worry, the sucked in Wikipedia text is correct. --Jrich 12:14, 27 October 2008 (EDT)
- Jrm, I don't think transporting even well-conceived, well-written lineages from WeRelate to Wikipedia would fly, because I don't think the folks (or the management) at Wikipedia would buy it. There's already a well established policy there that "Wikipedia is not a collection of genealogy." Lineages of royal persons notwithstanding (because that's arguably general history, not personal family research), you won't find (say) detailed genealogies of six generations of descent from U.S. presidents there. That's not what a general encyclopedia (like Wikipedia) does. --mksmith 18:08, 4 April 2009 (EDT)
My Proposal: Elections
I would propose (knowing it is significant additional development, so it couldn't possibly be implemented soon) that no direct changes to existing data be allowed. Rather, all changes, whether input manually or through some automated process, get added to a "Proposed Changes" page that each base page has attached to it (i.e., like the Talk page).
Either at intervals (say, every 6 months) or on nomination and seconding by an interested party, an election is held. The base page is temporarily frozen, email sent out to all people on the watchlist, and at the end of a week or some other appropriate period of time, the input votes are counted and used to identify the consensus of what is closest to the truth, as supported by the evidence. An automated process would update the main page accordingly and unfreeze it, allowing a new round of changes to start accumulating. (Freezing a page for a short period would still allow viewing, just no inputting of additional proposed changes until the election is over.)
People that put themselves on the watchlist are signing up to be the judging committee for that page. Presumably they have an interest in discovering the truth about that person. So they would read and compare sources and discussions on the Talk page, and then would check those proposed items they find most credible among all those offered (or abstain if they don't feel qualified). This would still provide a "democratic" collaboration. Anybody could input data, you just have to support it with evidence sufficient to generate a consensus among the self-designated interested parties. I believe most people would do a reasonable job of reading through arguments and deciding which makes most sense. Some people may be blind to alternative facts, but they would be outvoted by a majority if the evidence is strong enough. "Losing" an election would just tell you to go find more evidence and try again.
Rejected proposals and discussions would be archived so that careful researchers could read them and not propose changes that have already been rejected, unless they truly have new and convincing evidence.
There are probably many details that would need to be worked out. For example, I assume creation of a new page would not require elections, only the changing of that page. Perhaps a page watched by only one person could be changed by that person without election, too. I am pretty sure the developers would have a better feel than I do on how to tweak the details so as to make the process easiest for everybody.
--Jrich 14:10, 26 October 2008 (EDT)
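The vote-counting step of the election proposal above can be sketched in a few lines. This is purely an editorial illustration, not anything implemented at WeRelate: the function name, data shapes, and the strict-majority rule are all assumptions layered on the description above.

```python
# Hypothetical sketch of the proposed election: watchers on a page act as the
# judging committee, each submitting the set of proposed changes they find
# credible, and only proposals backed by a strict majority of ballots cast
# are applied to the base page. All names and rules here are illustrative.

def tally_election(proposals, ballots):
    """proposals: list of proposed-change ids, in the order offered.
    ballots: one set of approved ids per voting watcher (abstainers omitted).
    Returns the proposals approved by a strict majority of ballots cast."""
    if not ballots:
        return []
    needed = len(ballots) // 2 + 1  # strict majority of votes actually cast
    return [p for p in proposals
            if sum(1 for b in ballots if p in b) >= needed]
```

Under such a rule, "losing" an election just means a proposal fell short of the `needed` count; resubmitting with stronger evidence, as suggested above, is simply another entry in a later tally.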
- I brought this over from the Duplicate Review page, where I guess I mistakenly posted it. Compared to the above proposal, this is an attempt to design an approach that is self-administering (people involve themselves by joining the watchlist) rather than requiring some committee to select a thousand or more managers. It is almost entirely implemented by software, so it doesn't take vacations or disappear without notice. Decisions are made by a group, so it is much harder to claim somebody is biased, discriminating against somebody, etc. --Jrich 10:48, 27 October 2008 (EDT)
--Srblac 20:00, 27 October 2008 (EDT) Monday October 27
Inability to Scale - I was concerned with that issue. That was key to my proposal to limit Managed pages to only those that people volunteered for. I want it to be a small set on purpose. I like the way WeRelate works now.
Elite Class of members - Guilty, sort of. Elite is a loaded word, but wikipedia has a small subset of members who do the majority of the editing. They are the self-selected wikipedia elite. WeRelate already has a self-selected set of frequent contributors. I would allow anyone to manage pages, but since you have to take an explicit action there is a barrier (on purpose!) to this. Therefore management will be limited to those who are motivated.
Mechanism - I can see I was too sparse on that. I meant for "managing a page" to be something anyone can do. Managers are self-selecting. However, the management mechanism should be hard to activate on purpose. There should not be a choice to "manage all pages in my tree". It should be hard, so that only pages that people really care about get managed. Something on the order of a "Manage this page" selection for each page. You would have to specifically navigate to the page and specifically choose to manage it, thereby limiting the number of managed pages.
That doesn't mean that just the manager can edit. Another choice would be to request access. That would ask the manager to grant normal WeRelate editing privileges. The manager and anyone granted access can change the page. I realize that there will be problems with capricious, lazy, or absent managers not granting access. That is going to have to be appealed to the site administrators. I believe this will be less work for them in the long run than fixing broken unmanaged common pages or resolving disputes on those pages. Management creates an interested self-selected subcommittee for these key pages.
How to appeal decisions - This proposal just reduces the set of people allowed to edit a page. Everything else remains the same. The existing WeRelate rules for conflict resolution still apply.
Elections - I think the act of granting access achieves the same thing with less work. Any trusted editor granted access can make changes at any time using normal editing behaviors. I think the simpler the better.
Using the wikipedia for key pages - Here's an example I have done using this technique (Lucy Walter). I think this is the right approach for historical figures like Napoleon or George Washington. However, there are various high-velocity people in genealogies who aren't that famous. Any person who was living in the Americas in the 1600s will have a lot of descendants and a lot of potential traffic. They have a small set of serious researchers and a large number of folks inheriting work that touches on the data. Take Mayflower descendants for example. Yes, actual Mayflower folk could fit the Wikipedia rule, but for every actual Mayflower person there are hundreds more who aren't famous enough for Wikipedia. I subscribe to MyHeritage and I get SmartMatches reports on the same set of 50 people all the time. These are the folks who need to have their pages Managed by serious researchers who will protect them from damage from well-meaning descendants posting yet another copy of the data. I am agreeing with Jrich here, putting the argument in slightly different terms.
WeRelate is not a normal wiki - The ability to auto-edit large numbers of pages (by posting a GEDCOM) changes the game. The fact that edits are anchored by people accelerates the problem, because it means a lot of edits are going to randomly hit the same page without the authors possibly knowing in advance. As more manual effort is expended to add value to the site, it behooves the community to find a way to protect that added value from inadvertent damage, however well-meaning. Yet the site has a wonderful power to it that we must not hobble too much. A good balance is tricky.
--Srblac 20:00, 27 October 2008 (EDT)
- Instead of elections, how about levels of membership based on contributions, achievements and time? I am a scuba diver, and for each level or title there are certain requirements, based upon education, experience and proficiency, that I must reach to earn that title and the benefits that come with it. For instance, the WeRelate newby would be limited to uploading three files, limited by file size or person count. After a certain level of participation, education and experience, say 100 contributions within 1 month and taking a certain number of tutorials, they would attain the next level, where they would be able to upload more information and have additional limited access privileges. Then after another period of participation and experience, say 300-500 contributions within three months, they would attain another membership level and further privileges such as downloading, unlimited merging, etc. And so on and so on till they attain the manager's level. This would be an automated process with minimal human interaction -- or just someone validating the data and taking into account other non-automated criteria. I think this would be an adequate low-maintenance internal control based upon incentives and rewards rather than anointing a WeRelate elite class or enacting a politically motivated election process which would be much too cumbersome to manage over the long run.
- Just a thought... --BobC 21:24, 4 April 2009 (EDT)
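BobC's tiered scheme lends itself to a simple automated check. The sketch below is hypothetical: the level names, tutorial counts, and the top-tier threshold are invented for illustration; only the 100- and 300-500-contribution figures come from the comment above.

```python
# Illustrative promotion ladder: each tier requires a contribution count and a
# number of completed tutorials. Only the 100 and 300-500 figures come from
# the suggestion above; every other value is an assumption.

LEVELS = [
    ("newby", 0, 0),
    ("contributor", 100, 2),   # ~100 contributions in a month, some tutorials
    ("editor", 300, 4),        # ~300-500 contributions in three months
    ("manager", 1000, 6),      # invented threshold for the top tier
]

def membership_level(contributions, tutorials_done):
    """Return the highest level whose requirements are both met."""
    earned = LEVELS[0][0]
    for name, min_contribs, min_tutorials in LEVELS:
        if contributions >= min_contribs and tutorials_done >= min_tutorials:
            earned = name
    return earned
```

Because both requirements must be met, a prolific uploader who skips the tutorials stays at the bottom tier, which is exactly the "education" gate the scuba analogy implies.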
I like this idea. We'd probably want to start by having admins manually promote someone from one level to the next based upon their contribution history, but eventually hopefully this could be automated.--Dallan 11:00, 6 April 2009 (EDT)
Some statistics and background information
Wikipedia has several levels of article protection:
- Full - only administrators can edit (just over 1000 articles are currently in this state)
- Semi - only users with confirmed email addresses can edit (just over 3000 articles are currently in this state)
- Creation - prevents a previously-deleted page from being created (I didn't count these)
- Move - prevents a page from being moved (I didn't count these either)
Any user can request that a page be protected by listing the article on this page.
In the 1980's the LDS Church asked people to submit their trees to a database called Ancestral File. They automatically merged people's trees with some degree of success. The database eventually contained 40 million people. They told me that roughly 80% of the people in the database had just one contributor, 10% had two contributors, and 10% had several contributors. The ones with several contributors often had many contributors, not just 3 or 4. Many of the contributors to that database share a common heritage so their percentages may be high compared to ours, but based upon the duplicate counts that I've seen so far, I estimate that 80-90% of our pages will have a single contributor, 5-10% will have two contributors, and 5-10% will have several contributors. We currently have nearly 1.6M person pages, so 80-160K of them will probably have several contributors when the merging is generally completed. This number will grow.
Most person pages either have no sources, or they just source other GEDCOM files.
Most people when notified that a page they are watching (i.e., a page that came from their GEDCOM) has been edited do not respond.
Types of edits
One difference between WeRelate and Wikipedia is that in WeRelate it is possible to edit multiple pages at once by merging. In the future, I want to make merging part of the GEDCOM upload process - the uploader will be presented with possible duplicates before wiki pages have been created from their GEDCOM, and they'll be asked to review and merge into the existing pages in order to complete the upload process. This means that they could be presented with dozens (or possibly hundreds) of duplicate pages to review before their wiki pages get created. I'll give uploaders an option to just link to the existing pages, leaving the existing pages as they are, rather than merging with them. If they link to an existing page, then they also link to the ancestors of that page, so linking to an existing page means they can avoid merging with that page or any of its ancestors. This can reduce the number of merges they have to perform, which encourages them to link rather than to merge. I believe that most people don't care that much about many of the pages they upload, especially the ones that are in common with many other contributors, so I think that people will often choose to link rather than to merge. Having said that, however, allowing people to merge during the upload process is "relying upon inexperienced people to properly manage the wiki," which I agree is a poor strategy.
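The link-versus-merge trade-off described above can be made concrete with a small sketch. The data shapes and names are invented; the point is only that linking to one existing page implicitly covers its whole ancestor line, so those pages drop out of the merge queue.

```python
# Sketch: linking a GEDCOM person to an existing page also links that page's
# ancestors, so none of them need individual merge review during upload.

def pages_still_needing_merge(duplicates, linked, parents):
    """duplicates: set of page ids with a possible duplicate in the upload.
    linked: set of page ids the uploader chose to link to rather than merge.
    parents: page id -> list of parent page ids (the ancestor graph)."""
    covered, stack = set(), list(linked)
    while stack:  # walk up every linked page's ancestor chain
        page = stack.pop()
        if page not in covered:
            covered.add(page)
            stack.extend(parents.get(page, []))
    return duplicates - covered
```

One link near the bottom of a deep, shared line removes the entire chain above it from review, which is why linking is the path of least resistance for uploaders.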
I think we can say that there are three types of edits at WeRelate, ordered according to the amount of attention the editor is likely to give to the edit:
- Merging during the GEDCOM upload process
- Merging as a result of "Show duplicates" (once the initial batch of merges is complete and merge-during-upload is implemented, this show-duplicates-merge should be much less frequent)
- Editing a page
The danger of careless edits is higher for merges than for regular edits. Wikipedia has the third type of edit only.
A couple of protection mechanisms have been proposed. I'll summarize them and propose a few more.
- Managed pages - only certain people can edit pages, similar to the authority model at PlanetMath
- Elections - people vote whether to change a page on a periodic basis
- Full protection - as in Wikipedia
- Semi-protection - since all pages at WeRelate require a confirmed email address to edit, we could modify the definition of semi-protection to require that the editor has made at least a certain number of (non-GEDCOM-upload) edits in order to edit the page
- Noedit template - similar to the nomerge template, the presence of a noedit template on the talk page keeps a page from being edited. This can be circumvented easily by anyone who wants to take the time to first remove the noedit template from the talk page, but it does make editing the page more difficult.
The above protection mechanisms are all manual -- they require someone to take action to protect the page. The following protection mechanisms are automated:
- Don't allow editing if more than N people are watching the page
- You can add information but not remove information if more than N people are watching a page
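The two automated mechanisms just listed could be as simple as the following sketch, with N as a tunable threshold (a value of 5 is floated later in this discussion). The function names are illustrative, not a proposed API.

```python
N = 5  # watcher threshold; 5 is one value suggested in this discussion

def freeze_rule(watchers):
    """First mechanism: no editing at all once more than N people watch."""
    return watchers <= N  # True means the edit may proceed

def add_only_rule(watchers, removes_information):
    """Second mechanism: past N watchers, edits may add but not remove."""
    return watchers <= N or not removes_information
```

The second rule is the gentler of the two: heavily watched pages keep accumulating new information while existing, collaboratively vetted data is shielded from deletion.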
Choosing protection mechanisms based upon type of edit
I'd like to propose that different types of edits use possibly different protection mechanisms. For example, I don't believe that any of the manual protection mechanisms are sufficient to protect the 80-160K multiple-contributor pages that we're likely to have very soon -- there are simply too many pages to take a manual action on each of them. So I think that the protection mechanisms for merge-during-upload must include at least one automated mechanism. People could still merge their GEDCOM people into these protected pages, but if they wanted to change or remove any of the information on such a page, they would have to edit that page after their upload is complete. For example, we could say that pages with at least 5 people already watching them or that have a "noedit" template on their talk page cannot have their information changed during a GEDCOM-upload merge operation. (In this case we'd probably want to rename the "noedit" template to connote something specific to editing during merges.) Or we could say that pages with at least 5 people watching them cannot be changed during the GEDCOM upload process unless you have made at least N non-GEDCOM-upload contributions. I'm not trying to propose a specific set of protection mechanisms right now, just that we need to think about both automated and manual mechanisms for this type of edit.
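The example policy in the paragraph above — protect a page from upload-merge changes if it has 5 or more watchers or a noedit-style template, unless the uploader has made at least N ordinary contributions — reduces to a short check. The value of 50 for N is an invented placeholder.

```python
WATCHER_THRESHOLD = 5      # from the example above
REQUIRED_EDITS = 50        # invented stand-in for "N" non-GEDCOM-upload edits

def upload_merge_may_change(watchers, has_noedit_template, uploader_edits):
    """May a GEDCOM-upload merge change or remove this page's information?"""
    protected = watchers >= WATCHER_THRESHOLD or has_noedit_template
    # Experienced contributors may still change protected pages
    return (not protected) or uploader_edits >= REQUIRED_EDITS
```

Note how this combines one automated trigger (watcher count) with one manual trigger (the template), which is exactly the mix of mechanisms argued for above.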
For normal edits, Wikipedia seems to get by fairly well with just full protection and semi-protection. Elections have conceptual appeal, but they would require a fair amount of work to implement and I'd prefer to not implement something big like this unless experience shows that we really need it. Managed pages also have conceptual appeal, but if the manager becomes inactive or uncooperative you have to have a procedure for putting the page up for adoption. This requires some work by the administrators, so again it's not something I'd like to implement unless experience shows that we really need it.
For the show-duplicates-merge type of edit, unless we feel the need for a separate set of protection mechanisms for this type of edit, to keep things simple I propose that we adopt either the merge-during-upload protection mechanisms or the normal-edit protection mechanisms for this type of edit.--Dallan 00:50, 31 October 2008 (EDT)
- I like a lot of what has been discussed here. I agree that a procedure requiring regular group user intervention isn't going to fly. Even if you can get enough involvement to have elections and what not, you lose the rapidity of change that allows a wiki to quickly converge on very good information.
- Here's a more concrete proposal:
- People who have a wikipedia biography. We use the werelate page to add references for the person, add content that may not fit well in the wikipedia page (for whatever reason), and to create a typical genealogical fact list. Researchers doing serious work with such folks should be encouraged to put their work on wikipedia and reference it from werelate. The body of the werelate page is made up directly of regularly refreshed information from wikipedia. The page is also marked "not for automatic edit" (more on this shortly).
- People of genealogical notoriety. These are the folks that some of you claim don't rate a wikipedia page but are, nevertheless, significant as a consequence of a large descendancy. I don't think there are a lot of such folks, but I'll allow that there are some. These people are worked on as ordinary werelate people, but they are also marked as "not for automatic edit".
- Everybody else. As today, merged on the fly.
- What to do when someone uploads to a page that is marked "not for automatic edit" by some mechanism is sort of interesting. There are a number of possibilities here:
- Drop it on the floor. Tough luck Charley.
- Don't add it to the tree, but put the information as a block on the talk page of the user who uploaded it, annotated with links to the pages that the information relates to. If they care to go through that stuff and add it manually where it matters, great.
- Add the information block to the talk page behind the locked/not for auto-edit person page. Maybe the uploading person gets a list of such instances on their personal talk page, but not the specific data.
- Go ahead and create a secondary person page but add to the new page a link to the old page it probably duplicates.
- My personal favorites are to make use of the talk page behind such manually edited person pages and/or the talk page of the user performing the upload.
I like the idea conceptually of managed pages, but I think in order for it to scale, it would have to be too easy, and thus not accomplish what we want. I really do not like the idea of elections. Way too slow and complicated to implement across the board. Which leaves me in favor of fairly automated protections for certain people. I like a single "no automatic edits" theory, at least to start, to be implemented after this first mad rush of merging.
This breaks down for me between people for whom automatic edits are almost certain to be unhelpful and others. The first category includes the "famous" whose birth, death, family and accomplishments are documented elsewhere extensively; the well-researched, on whom exist considerable information because there are so many descendants (a category of which we probably have more right now than the first one, but will hopefully change); and pages where one person has done a really good job of covering the available information. Which leads me to the questions that are closely related: 1) how do we identify the pages that get protection; and 2) where is the line between the second two categories?
I just wrote and deleted a paragraph where I went back and forth about famous people that are currently barebones and unwatched, and not especially interesting pages watched by ten, and really well-done pages that are watched by three. I got to the point of thinking that instituting protection should, for now, be manual by request. We probably do not need to protect all 80-160k pages watched by two or more people (I wonder what fraction of those are watched by the same four merging people...), so hopefully this will not be especially burdensome. I'd be in favor of a rule something like: pages are eligible for protection if they have been edited by three or more people, have at least one reasonably reliable source, and have at least some substantive notes (maybe just for person pages). This leaves out the one-person efforts, but I think that's okay because it's just not likely to be a problem. But it does protect the case where the efforts of a few people have been carefully edited for presentation.
Three additional thoughts --
- I think we don't need the full protection mode, at least at this point, because I don't think we have such controversial pages as to merit it (the classic Wikipedia no-edit is I think George W. Bush's page, which gets understandable mischief).
- How would this work with families? I'd love to prevent edits to the couple in my line that has their marriage in a parish register and well-documented children, and has been uploaded something like 15 times now. Does protecting the family page keep out new kids? Or not?
- And what will happen on upload if a page matches a no-auto-edit page? I'm imagining a user being asked to do their merges on upload, and then any pages that are no-auto-edit are not uploaded. But does that incent savvy people to just say that people should not be merged and thereby create their "own" versions of the pages?
I need to stop now. This is giving me a headache.--Amelia 18:19, 31 October 2008 (EDT)
Based upon this discussion, let's start out with something simple.
Pages can be semi-protected from editing during merges (both merges during upload and merges as a result of "show-duplicates"). Pages become semi-protected by leaving a request on WeRelate:Requests for page protection. Admins monitor this page and semi-protect the pages that are justified. People with a wikipedia biography or of genealogical notoriety would generally qualify. Families that are semi-protected can't have additional children added. During an upload, if a page in your GEDCOM matches a semi-protected page, you'll have to link to that page instead of merging, and we'll list all of the pages that were linked to on your talk page.
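The simple scheme above — link instead of merge whenever an uploaded person matches a semi-protected page, and report the forced links on the uploader's talk page — might look roughly like the sketch below. All identifiers are hypothetical.

```python
def process_upload_matches(matches, semi_protected):
    """matches: list of (gedcom_id, wiki_page) duplicate candidates.
    Returns the pages to merge normally and the forced links to report
    on the uploader's talk page."""
    to_merge, forced_links = [], []
    for gedcom_id, wiki_page in matches:
        if wiki_page in semi_protected:
            forced_links.append((gedcom_id, wiki_page))  # link, don't merge
        else:
            to_merge.append((gedcom_id, wiki_page))
    return to_merge, forced_links
```

The uploader never edits the protected page at all; their tree simply points at it, and the talk-page listing tells them which of their people were handled that way.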
This may incent some people to create their own pages, but we'll have to have checks on uploaded GEDCOM's anyway (e.g., for obvious errors like people dying before their children are born, or new medieval people being uploaded), so we could add "decided to create separate pages for matching pages" as another check. I'm thinking that if a GEDCOM has more than a few potential problems, we would have an administrator review the GEDCOM before new pages were created for it.
If this approach doesn't work, it should be possible to extend it as we understand better where it fails.
How does this sound to everyone?--Dallan 11:51, 7 November 2008 (EST)
I'm all for simplicity.
--Jrm03063 12:55, 7 November 2008 (EST)
That approach has roughly the same effect. It is an indirect approach which makes it more subtle than having a tool directly on the page that can be used to protect the page. That means it will get used less, and then only by folks who are "in the know". But that's OK. Protection shouldn't be a tool used on lots of pages anyway. The approach can grow if the need grows.
--Srblac 21:21, 7 November 2008 (EST)
If you think it will work, give it a try. But I think everybody is underestimating how much this will get used. If I know from googling a person that half the webpages out there (or less, but some number greater than zero) have it wrong in the face of good evidence to the contrary, I am going to think protecting a page might be the right thing to do. And it sounds like it involves some manual procedure, so if there are thousands of people like me (impossible, I know, but pretend it is so) will it scale?
So somebody posts a request. What are the criteria that cause that request to be accepted? Or does posting the request make it so?
--Jrich 22:30, 7 November 2008 (EST)
- What happened to the proposal for dealing with discredited data that would put the common but wrong information on the page as alt (or something), and thus prevent new edits asserting that the common information is right? That would seem to reduce the need to protect a page for the reasons Jrich suggests.--Amelia 22:49, 7 November 2008 (EST)
- I suspect that would require specific data items be set up on the page to hold discredited information, i.e., disputed birth, disputed parents, disputed death, etc., and then the merge would need to compare to those fields, and if a match is found, that would signal not to write the data into the regular birth, death, parents, spouse, etc.
- Somewhere it was suggested that if more than N people are watching a page (or it has more than N tags under the new plan), then freeze it. That sounds very simple to me.
- In the current environment, I think 5 would be a reasonable number? After 5 people have looked at that page, one would suspect most of the data has been entered? Probably no single number is the proper value for all situations but what is a good compromise that won't cause too many problems? Anything over zero and under 10 would be fine with me. --Jrich 09:33, 8 November 2008 (EST)
I think if we need to, we could add a "Request protection" link to the "More" menu to make it more visible. I'm thinking that we would grant a request if the page had been degraded while it was being merged say 2 or 3 times. Perhaps this indicates that the "disputed/discredited" concept will be pretty useful since it could stop a lot of degradation before it happened.
I also agree that we'll need to have an automated protection mechanism, if not right away, then eventually. Five watchers seems like as good an automated rule as any I can think of.--Dallan 21:06, 29 November 2008 (EST)
Missing from all this discussion is sources. After all, the quality of data is only as good as its source. What if the sources in our database were ranked in terms of credibility? We already have a selection list to choose the quality of the source. Then, say, any page that contains a ranked source could not be overridden by data from an inferior source. In that case, the management could be reduced to management of sources instead of pages. Of course, the source citation should include volume, page, and enough text to prove it is indeed applicable. MySources would need to be judged individually to determine if they are of sufficient quality to override a ranked source. Sources not in the database would be submitted for inclusion and ranking, and be strictly managed. Only exceptions to the process would need to be subjected to appeal involving watchers, editors and administrators.--Scot 17:21, 30 November 2008 (EST)
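Scot's override rule could be sketched in a few lines: incoming data may replace an existing field only when it is backed by a strictly better-ranked source. Everything here is hypothetical, including the rank scale (higher = more credible) and the dict-based data shape:

```python
# Hypothetical sketch of source-ranked overrides; the rank scale and
# record shapes are invented for illustration only.

def best_rank(sources):
    """Highest credibility rank among a field's cited sources (0 if none)."""
    return max((s.get("rank", 0) for s in sources), default=0)

def may_override(existing_sources, incoming_sources):
    """Incoming data may replace existing data only if it is backed by a
    strictly better-ranked source than anything already cited."""
    return best_rank(incoming_sources) > best_rank(existing_sources)
```

Note the strict inequality: under this sketch, equally ranked sources don't overwrite each other, which matches the spirit of "could not be overridden by data from an inferior source."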
The problem I have with ranking sources is that we're a long way from being able to do this in an automated way. The source database is probably nowhere near complete, it doesn't have any ranking information, and we don't match GEDCOM sources to Source pages. I think this kind of analysis will have to be done manually.--Dallan 02:07, 28 December 2008 (EST)
Root Cause Analysis?
After a few reads of this page, I keep coming back to the same thought: is this a proposed solution to a problem that is not really addressing the root cause that brought us to this place? Maybe I've got it wrong but it seems like we got here because the ease of gedcom uploading directly into the wiki resulted in a lot of junk and we don't want the otherwise good stuff to be polluted by the junk.
Is there an entirely different solution that might be more appropriate?
No I am not advocating the abandonment of gedcom uploads. But perhaps there is an interim step before the gedcom is integrated into the wiki?
We've talked elsewhere about "before you upload" instructions. In addition, perhaps there needs to be some holding space for uploaded gedcoms that get some sort of human review (vol. Comm.?) to assess their quality?
However, much of this could be done by technology (some sort of scan of sources, for example) that might provide a rating that triggers some sort of human-review flag?
This may not be the right answer; my main point is that I wonder if we're designing a solution for the wrong problem. -- jillaine 18:44, 18 December 2008 (EST)
- I agree that the more serious long-term quality issue isn't with multiple human users actively and manually editing; it's with GEDCOMs. I think that as long as we permit new users, with no demonstrated commitment to the site, to upload large amalgams of ged-screment, the situation will be hopeless. There are approaches, such as those discussed on this page, to help protect carefully created and managed pages. That's good and probably necessary. Still, none of that will help in the case of incorrect pages and incorrect family connections, since those things will be regularly wiped away and then - via GEDCOM upload - regularly recreated. Last night, for example, I was tagging for deletion pages that claimed a lineage (presumably a 19th century hoax or similar) to ancient Troy and then to Abraham. Oh Please!
- I think that there are a lot of folks out there that are (my term) "GEDCOM collectors". In the same way that there are folks out there that collect dogs and cats, etc., and think that they're helping or saving them. At some initial stage, maybe they were, but eventually the scale of what they are doing grows beyond their ability to do it reasonably well. The situation then acquires a rather horrifying aspect. If such GEDCOMs withered away on people's home machines, it wouldn't be a problem - but they get regularly contributed to sites such as ours.
- My preferred approach on this is GEDCOM upload size limits for newbies. Until someone has a history of a few hundred or so manual edits, they should only be able to upload a small GEDCOM fragment. Only after they have done enough work with the site to have a remote clue of what it means to upload and maintain material, should they be then permitted to upload more stuff.
- See, I can't get behind that idea. While I appreciate the intent behind it (discouraging gedscrement), tying gedcom uploads to ongoing editing risks losing great data.
- Let's say I was an 80-year-old genealogist who's been doing genealogy for thirty years, did all the proofs to get my kids registered in all the venerable societies and family associations. In recent years even went so far as to enter all my great work into a family tree program given to me one Christmas by my grand-daughter, who subsequently told me about werelate.org. I now see an opportunity to contribute my hard work to a community which if interested, will further maintain it. Unfortunately, for whatever reasons, I won't be able to do further work on it. If your rules were in place, that gedcom couldn't be added. jillaine 08:01, 22 December 2008 (EST)
- My network is rather large due to my work on reducing duplication. I share over 1000 pages with each of 15 people. Of those, perhaps two are still active, and less than five ever contributed anything beyond the initial GEDCOM load. It has taken me a dedicated year, of over 100 hours a month to get this far, but it probably took those folks ten minutes each to designate and upload their GEDCOMs.
- But to get back to your initial point, I agree wholeheartedly, that the scale of the problem presented by GEDCOM pollution is several orders of magnitude larger than the problem presented by individuals simultaneously interested in the same page. Jrm03063 8:34, 19 December 2008 (EST)
- Well, you and I may be approaching agreement, but I'm not sure about others. Where I don't agree with you, however, is the need for some sort of controlled access to good pages. To me, that goes against the nature of a wiki. And I think it might cause more harm than good. What you DO want to protect those pages from is merging with crap data (as you call it: GEDscrement-- great new word; thanks). To me that calls for a review committee that scans uploaded GEDCOMs before they can be integrated/merged into the wiki. jillaine 12:17, 21 December 2008 (EST)
- I generally agree with you, but the author of this idea seems convinced otherwise, and it's hard to claim that there would never be a page where management wasn't needed. I would think it would be extremely rare - that mostly such pages would be associated with people who have a presence on wikipedia, and we can somehow push the issue off that way - we don't want pages that separately reproduce wikipedia. Instead, such pages should be regularly and mechanically updated with the current wikipedia content. Jrm03063 12:52, 21 December 2008 (EST)
- Generally I agree with both of you. Manual input is slow and encourages one to review what one is inputting, etc., and while it is still possible to input bad stuff, it requires much more insensitivity. I think it is more likely that people will see all the existing data and sources and say "I can't compete with that". But any automated process loses that. Even merging is automated enough that it is easy to miss a Talk page or other pertinent data before proceeding.
- Again, while I understand the intent of your idea, I don't think it's feasible. Using the example of my 80-year-old alternate persona, there is no way in hades that I'm going to manually input 30 years of data into werelate. And frankly, my 40-something self ain't gonna do it either. In fact, it's why I left that other genealogy wikia because they didn't have a gedcom upload feature. I started manually entering my information and it was taking so dang long, I threw up my hands and left.
- If this is about enabling or disabling GEDcom uploads only, then my only concern is the sooner they are turned off the better. I think a GEDcom upload should never overwrite existing data.
- And here's why I'm proposing an alternative: there is a GEDCOM review committee that approves uploaded GEDCOMs before they are incorporated into the system. I wouldn't suggest a solution without being willing to be part of it, so I'm willing to serve on such a committee. This is also done in conjunction with a pre-upload set of steps and recommendations and warnings that includes among other things that GEDCOMs with a certain set of characteristics will not be incorporated into werelate.org. I started drafting the first part of this elsewhere, but not the criteria-for-exclusion. jillaine 09:01, 22 December 2008 (EST)
- I don't think this will work. It will cause too much delay since it adds a human intervention, and it will be marginally able to do its job even if not swamped by uploads. Say your hypothetical grandmother uploads 5000 individuals, generally sourced from reasonable sources and 4998 are good, but 2 overlay existing entries incorrectly. Is your committee going to even bother to figure the exact impact of the upload, is it going to block the whole upload because it ruins 2 pages, is it going to edit the upload to remove the 2 bad ones, or is it going to just allow everything to proceed because the good stuff is valuable and outweighs the bad stuff?
- If there is just a list of criteria that will be applied to a GEDcom to get an overall impression of its quality, one would think this should be done by software. There is a quality field in the source entry, so perhaps a software check to ensure that each birth, death, and marriage is annotated with a source of some desired quality rating or above? Or at least one such source per individual? Of course, just because there is a source, that doesn't guarantee that the source hasn't been applied to the wrong individual, or that the source was correct in the first place, or that it was transcribed correctly, etc.
- Yes I agree no uploads looks like a show stopper, now. I am not sure that would always be true. Unfortunately, I think this was an issue that almost had to be designed into WeRelate from the very beginning to be handled correctly instead of figured out after the fact. So any answer will probably result in a few situations where the process is inefficient and difficult. But over time, as more and more people are input, uploads will be a higher percentage of duplicates and less new material. This will actually become a bigger problem so it is important to do something.
- The simple solution proposed above, that an upload can't overwrite existing data on a page if more than X people are watching it seems quick to implement and surprisingly effective. Then the new data your 80-year old grandmother is adding will get added, but stuff that is already input will be protected. Assuming there is a report telling what is allowed and what was blocked, the grandmother can get her young grandchild to help her manually fix the blocked entries where appropriate, thus engaging a new generation in this hobby of hers. If she can't, we only lose her input on people that others are studying, so hopefully none of her truly valuable contributions. (Perhaps the report about blocked uploads could be added to the Talk page, allowing one of the watchers to see it, and possibly act on it in those cases where the grandmother cannot.)
- Even here, there is some ambiguity, about blank fields on a page. I believe if there are X people watching, the whole page should be off-limits, even blank fields. Say the subject has no death date, and there is a death date provided in the upload. This may indicate that the upload has one of the disputed lineages, and would be inputting the death date of a different person, mixing the valid birth date with the death date of a disputed person. This creates a mess. So, if X people are watching the page, and none of them have input the death date, I believe that ratifies that the date is not known. Thus we no longer consider the field empty, per se, just not known, and not overwriteable by an upload. --Jrich 10:20, 22 December 2008 (EST)
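Jrich's rule above has a subtle twist worth pinning down: on a watched page, even a blank field is treated as "ratified unknown" rather than empty, so an upload may not fill it. A sketch under stated assumptions; the data model, field names, and watch limit are all invented here:

```python
# Sketch of the upload-blocking rule discussed above: once enough people
# watch a page, an upload may not change it at all -- blank fields count
# as "not known", not empty. Every name here is hypothetical.

WATCH_LIMIT = 5

def apply_upload_field(page, field, value, watcher_count, limit=WATCH_LIMIT):
    """Return (applied, reason). Blocks every automated write, including
    writes to empty fields, when the page has `limit` or more watchers."""
    if watcher_count >= limit:
        return False, "watched page: blank fields are 'not known', not empty"
    if page.get(field):
        return False, "field already has data; needs a manual merge"
    page[field] = value
    return True, "applied"
```

The second branch encodes the simpler proposal also made above, that an upload should never overwrite existing data even on an unwatched page; the blocked writes would be reported (e.g., on the Talk page) for a human to act on.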
User:Npowell and I are currently implementing a new GEDCOM upload process that will have three parts:
1. Someone uploads a GEDCOM. The system determines:
- which families in the GEDCOM are likely to match existing Family pages (possible duplicates),
- which people have data problems (marriage before birth, etc.), and
- which people are living.
2. The user reviews their GEDCOM before the wiki pages are created.
- They can see how every person, family, and source from their GEDCOM will look when it is imported into WeRelate.
- For every possible duplicate, they're asked to compare the families in question and decide whether or not to merge into the existing pages.
- If they elect to merge with an existing page, they can either
- go through the compare screen and merge screen just like in ShowDuplicates to edit the existing pages with their information (if the existing page hasn't been semi-protected), or
- simply check a box to say that they want to merge with the existing page, in which case the existing page stays as-is. If you check the box you're then asked if you want to adopt the ancestors of the existing family as well rather than merging your ancestors. This is much faster than using the merge screen to edit the existing pages, so I'm hoping to encourage people to check the merge box if they don't care that much about the people in question because they didn't research them personally. If the existing page is semi-protected, your only option is to check the "merge" box, and to edit the page later if it's important to you.
- We also ask them to review the data problems found.
- They can also review the Place links and fix them if their places aren't linked to the correct Place pages.
- They can also review their GEDCOM sources and try to match them to existing Source pages, or create new Source pages for them, or keep them as MySources, or omit them from the upload.
- They can also review the people in their GEDCOM that WeRelate thinks are living and are to be excluded from the upload and either unmark them or mark new people to be excluded.
Once they've finished reviewing their GEDCOM, they press a "Create wiki pages" button to go on to the next step.
3. The system evaluates the GEDCOM to decide whether it needs human review. I'm not sure what the requirements for human review are, but I'm not against saying that every GEDCOM requires a human review. Other possibilities:
- Every GEDCOM where the uploader didn't decide whether or not to merge one or more of the possible duplicates shown.
- Every GEDCOM with medieval data requires review.
- Every GEDCOM where the user chose not to merge one of the potential duplicate families that we presented requires review.
- Every GEDCOM with more than 5 data problems requires review.
- Every GEDCOM with more than 1000 people requires review.
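The candidate triggers listed above could be combined into a single flagging check. A hedged sketch only; the summary-dict shape and every key name are invented, and the thresholds are the ones proposed in the list:

```python
# Hypothetical sketch combining the human-review triggers listed above.
# `gedcom` is assumed to be a dict summarizing the upload; all keys are
# invented for illustration.

def needs_human_review(gedcom):
    """True if any proposed criterion fires."""
    return (
        gedcom.get("undecided_duplicates", 0) > 0   # duplicates left unresolved
        or gedcom.get("has_medieval_data", False)   # medieval lineages present
        or gedcom.get("declined_merges", 0) > 0     # chose not to merge a match
        or gedcom.get("data_problems", 0) > 5       # more than 5 data problems
        or gedcom.get("people", 0) > 1000           # more than 1000 people
    )
```

Structuring it as one boolean makes the policy easy to tune later, including the "every GEDCOM requires review" option, which is just `return True`.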
Once the human review is complete, the wiki pages are generated.
Re-uploading your GEDCOM will go through these same steps, with the variation that pages in your GEDCOM that you haven't changed since your previous GEDCOM upload are automatically marked as merging with the existing pages. Pages that you've changed in the current upload must be merged into the existing pages using the merge screen. So if you've added information to 20 of the people from your previous upload, you've now got to do 20 merges, but hopefully that's not too difficult, and it keeps you from blindly updating a bunch of pages that might have been updated by others since your last upload.
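The re-upload behavior just described amounts to classifying each record against the previous upload. A minimal sketch, assuming records can be keyed by a stable GEDCOM id and compared for equality; the function and data shapes are illustrative, not the actual implementation:

```python
# Sketch of the re-upload rule described above: records unchanged since the
# last upload are auto-marked as merging; changed records go to the merge
# screen; unseen ids become new pages. Shapes and names are assumptions.

def classify_reupload(previous, current):
    """previous/current: dicts mapping a stable GEDCOM id to its record."""
    auto_merge, manual_merge, new_pages = [], [], []
    for gid, record in current.items():
        if gid not in previous:
            new_pages.append(gid)           # not in last upload: create page
        elif record == previous[gid]:
            auto_merge.append(gid)          # unchanged: merge silently
        else:
            manual_merge.append(gid)        # changed: user merges by hand
    return auto_merge, manual_merge, new_pages
```

So in the example from the text, changing 20 people between uploads would land exactly those 20 ids in `manual_merge`, and nothing else would touch pages others may have edited.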
I agree that GEDCOM uploads are going to be the main source of problem, not manual editing. But I don't believe we can do away with them altogether - they're just too convenient for most people. I think the above approach makes GEDCOM uploads more time-consuming for the uploader, but I'm hoping that that encourages uploaders to think more when they merge into existing pages.
We currently get only a few GEDCOM uploads a day. I hope this number increases once we start promoting the site more next year, but it may not be too onerous to have a human review of every GEDCOM if we decided it was necessary, or perhaps every GEDCOM flagged with even minor potential issues. We couldn't review every page in every GEDCOM, but we could review the pages that looked problematic -- pages that were highlighted in a "pages with possible problems" list.
I think that semi-protection should be used to keep people from changing the pages when merging into them during a GEDCOM upload, but not editing pages directly.
One thing about this approach is that in step 2 the uploader is editing existing pages that are possible duplicates, adding information from their pages. These edits happen right away. So if during a review we decide to throw away the GEDCOM, the uploader's edits to existing pages remain. If this becomes a problem (because the uploader put a lot of bad information on the existing pages), it should be fairly easy to create a tool to "undo" all of someone's edits that were made during a given time period.
I don't know if this approach addresses all of the issues above; I hope so.--Dallan 02:07, 28 December 2008 (EST)
Wikipedia - WeRelate Relationship
- Sorry, couldn't help myself; the above "Root Cause Analysis" discussion was diverging into two different arguments; I pulled the wikipedia pieces out and moved them here. My apologies in advance if the cut and paste resulted in a bit of rockiness in both places, but to me, they are separate issues. jillaine 09:15, 22 December 2008 (EST)
... It's hard to claim that there would never be a page where management wasn't needed. I would think it would be extremely rare - that mostly such pages would be associated with people who have a presence on wikipedia, and we can somehow push the issue off that way - we don't want pages that separately reproduce wikipedia. Instead, such pages should be regularly and mechanically updated with the current wikipedia content. Jrm03063 12:52, 21 December 2008 (EST)
- However, I do not share in your acceptance of a wikipedia role. First, every change to the orthodoxy must start with a single lone voice, and no matter what that orthodoxy is, you don't want to stop listening for that new voice. Second, I also don't see why WeRelate should become subservient to wikipedia. Wikipedia is more of an encyclopedia, but WeRelate specializes in genealogy and will most likely be interested in some facts and sources that some Wikipedia author may not deem to be of general interest. So despite a wikipedia article on some individual, WeRelate may wish to still allow further input because there are additional facts of genealogical interest, such as the will, or such stuff. --Jrich 13:56, 21 December 2008 (EST)
- Let me deal with this in pieces.
- "First, every change to the orthodoxy must start with a single lone voice, and no matter what that orthodoxy is, you don't want to stop listening for that new voice."
- Creation of a duplicate mechanism for page management, and indeed a secondary wiki in werelate, in no way addresses such a problem, if indeed it exists. To my knowledge, wikipedia and werelate are both accepting of differing interpretations and perspectives, provided that they are presented respectfully and in recognition of other reasonable alternatives. If a good or useful idea is being stifled on wikipedia, there's no special reason to believe it won't suffer the same fate on werelate.
- "Second, I also don't see why WeRelate should become subservient to wikipedia."
- This isn't about subservience. This is about engaging the greater community of people studying particular areas of genealogy.
- "Wikipedia is more of an encyclopedia, but WeRelate specializes in genealogy and will most likely be interested in some facts and sources that some Wikipedia author may not deem to be of general interest. So despite a wikipedia article on some individual, WeRelate may wish to still allow further input because there are additional facts of genealogical interest, such as the will, or such stuff."
- But now you're talking about a combined probability - how often is werelate different than wikipedia AND how often do we need page management to buffer page changes? Taken together, I think the number of pages is closer to 0 than 1 - it's just not worth worrying about. Jrm03063 18:44, 21 December 2008 (EST)
- But what I am responding to is what seems to be an excessive reliance on wikipedia as an absolute authority.
- You have been working lately on a lot of European nobility and wikipedia may play an important role in that arena. I have about 7000 people in my genealogy database and I would be surprised if 10 of them are in wikipedia. So proposing wikipedia inclusion as any kind of criteria seems to me to be excessively limited and not all that useful for most of the WeRelate pages.
- I use wikipedia a lot. It may be right about most things, and is a useful source. Right now, I agree, wikipedia is probably more likely to be correct compared to WeRelate. But at some point in the future, I believe that this could change. They are based on the same model of correctness: mass review. But over time, as usage of WeRelate increases, given its focus on genealogy and the possible attraction it has for experienced genealogists, it may actually become more reliable about issues of concern to the genealogical community. --Jrich 23:51, 21 December 2008 (EST)
A few random thoughts:
- Having a wikipedia template on a page only impacts part of the page: generally the first part of the text, although the wikipedia template could appear anywhere in the text area - even after WeRelate-specific text. We can use the other parts of the page for WeRelate-specific information. Place pages for example, like Place:Illinois, United States, use wikipedia templates to provide general and historical text, but we add genealogy-relevant information like timelines and population histories to this.
- Using the presence of a wikipedia template on a page as a semi-protection mechanism seems like a good idea to me because people on wikipedia tend to be the kind of people that "collectors" add to their GEDCOMs, where first-hand research is not likely to have been performed. This keeps these pages from being edited by GEDCOM upload merges, but wouldn't keep the pages from being edited normally. If someone has additional genealogy-relevant information to add to a person with a wikipedia template, I hope that they would edit the page and add it, just like people are adding genealogy-relevant information to the Place pages.
- If we used the presence of a wikipedia template on a page as a semi-protection mechanism, we would need to have other semi-protection mechanisms in place as well, like the 5-watchers rule, or a manual request-for-semi-protection.
- One downside of Wikipedia text inclusion is that the text included from Wikipedia isn't tied to the events and relationships we list down the right. So it's possible for the included Wikipedia text to disagree with the events we list. I don't know how much of an issue this is. I would expect most of the information on these people to be fairly stable.
- If we get to the point where we disagreed with Wikipedia and we felt that we were more accurate, we could either stop including the Wikipedia text for that person, include a statement above it that it was wrong in that regard, or try to change the article on Wikipedia. Regardless of which approach we used, we'd probably still want the person to be semi-protected.
--Dallan 02:07, 28 December 2008 (EST)