WeRelate talk:History of Parliament Inclusion Project

Discussion Transferred from talk page of AndrewRT


History of Parliament [6 April 2013]

More interesting than I first suspected...

I've already written a program that harvests member page URL information from the 26 * 10+/- different index pages on the History of Parliament site. I've been able to rip that information apart in some useful ways, so that I can identify prefix, first and last name, and the URL for the person's page. I should be able to torture out a little bit more too.
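A minimal sketch of that kind of name extraction, in Python, assuming member URLs end in a surname-forenames-birth-death slug like the one in /volume/1509-1558/member/cromwell-gregory-1516-51. This is not the program described above: the function name, the regular expression and the rule for expanding a two-digit death year are illustrative assumptions, prefixes (titles) are not handled, and any slug that doesn't fit the assumed shape comes back as None so it can be looked at by hand.

```python
import re
from urllib.parse import urlparse

# Assumed slug shape: surname-forenames-birthyear-deathyear, where the death
# year may be abbreviated to two digits (e.g. "1516-51"). Hyphenated surnames
# would be split wrongly by this pattern.
SLUG_RE = re.compile(
    r"^(?P<surname>[a-z']+)-(?P<forenames>[a-z'-]+?)"
    r"-(?P<birth>\d{4})-(?P<death>\d{4}|\d{2})$"
)


def parse_member_url(url):
    """Return name and year fields from a member URL, or None if it doesn't fit."""
    # Take the last path component, whether the URL is absolute or relative.
    slug = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    match = SLUG_RE.match(slug)
    if match is None:
        return None
    birth = int(match.group("birth"))
    death_text = match.group("death")
    if len(death_text) == 2:
        # Assume the death year shares the birth century, or the next one
        # if that would put the death before the birth.
        death = (birth // 100) * 100 + int(death_text)
        if death < birth:
            death += 100
    else:
        death = int(death_text)
    return {
        "surname": match.group("surname").title(),
        "forenames": match.group("forenames").replace("-", " ").title(),
        "birth": birth,
        "death": death,
    }
```

Returning None rather than guessing keeps the odd cases visible instead of quietly producing wrong names or dates.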

My original idea had been to recreate the member indexes from the History of Parliament site, and use that as a guide by which we could go out to create each of the 20K members, starting from a "Person:First Last, MP (1)" red link. It would be one at a time, but it would be systematic. But now I've got a different idea...

Instead, I'm going to try directly creating simple GEDCOMs from the index information. While the GEDCOMs would be somewhat degenerate - containing nothing but a lot of detached individuals - we would be starting from Person pages that have roughly correct names, DOB, DOD, and an appropriate History of Parliament source backing the dates. So we'd be starting from basic pages on individuals, which we would refine and try to extend/expand until they made contact with existing parts of the WeRelate database. Obviously, a lot of them would match from the start. It's also common for Parliamentary membership to run in families, so connecting up members loaded individually could be done quickly.
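To make the idea concrete, here is a hedged sketch in Python of what such a degenerate GEDCOM might look like: detached INDI records with a name, year-only birth and death events, and one shared source record cited from each event. The tags are standard GEDCOM 5.5.1, but the function name, the input dictionary keys, the HOP_SKETCH system id and the submitter name are made up for illustration. Whether the WeRelate importer accepts exactly this, or attaches the citation to an existing Source page, is untested here.

```python
def make_gedcom(people, source_title="History of Parliament"):
    """Build a GEDCOM string of detached individuals sharing one source.

    `people` is a list of dicts with hypothetical keys: "forenames",
    "surname", and optional "birth" / "death" years.
    """
    lines = [
        "0 HEAD",
        "1 SOUR HOP_SKETCH",      # made-up id for the generating program
        "1 SUBM @U1@",
        "1 GEDC",
        "2 VERS 5.5.1",
        "2 FORM LINEAGE-LINKED",
        "1 CHAR UTF-8",
        "0 @U1@ SUBM",
        "1 NAME Example Submitter",  # placeholder submitter
    ]
    for number, person in enumerate(people, start=1):
        lines.append(f"0 @I{number}@ INDI")
        # GEDCOM marks the surname by wrapping it in slashes.
        lines.append(f"1 NAME {person['forenames']} /{person['surname']}/")
        for tag, key in (("BIRT", "birth"), ("DEAT", "death")):
            year = person.get(key)
            if year is not None:
                lines.append(f"1 {tag}")
                lines.append(f"2 DATE {year}")
                lines.append("2 SOUR @S1@")  # both events cite the same source
    lines += ["0 @S1@ SOUR", f"1 TITL {source_title}", "0 TRLR"]
    return "\n".join(lines) + "\n"
```

Citing a single source record from both events is deliberate: it is the arrangement most likely to come back as one source rather than two after import.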

Right now, I'm doing some simple experiments uploading and downloading trivial GEDCOMs with a couple of people. When I get basic information sort of right for a few people, and start creating some small GEDCOMs, I'll show you the results. When we're satisfied that we're getting a good start on the information, I can start to generate some whoppers - perhaps 2000 at a whack.

Ambitious enough?

--jrm03063 15:54, 5 April 2013 (EDT)

Yes that does indeed sound interesting - and ambitious!! Just a few thoughts:

GEDCOMs are typically used to upload people who are linked to each other as members of the same family. This is an interesting use of this facility, but would other WeRelate users be comfortable with this? In particular, they may see a risk that you end up with lots and lots of detached people who never get linked up?

Hard to say. I don't see a technical or user-experience complaint being justified - though I've learned that trying to do anything substantive and unorthodox is hazardous in the presence of "genealogists". I think the real risks for the site lie in doing what it's been doing - and hovering around 2.4M person pages with very little upward movement. I also think there are risks in trying to pre-qualify what we load. We'll miss things and we won't be able to make a nice, simple, broad statement - such as - "Find your parliamentarian ancestor at WeRelate" - or some such.
Maybe the better argument to offer folks is that there really isn't a downside risk - because anything that gets loaded by a large bulk process - can - if a problem develops - be efficiently and selectively deleted. The pages that get created will obviously be part of known trees and categories - so getting back to them would be a very straight-forward affair - were it to be necessary.

I have an idea for how to mitigate this: can you search the "Family and Education" section of each biography and see if there are any linked people? For instance, http://www.historyofparliamentonline.org/volume/1509-1558/member/cromwell-thomas-1485-1540 has in this section a tag "href="/volume/1509-1558/member/cromwell-gregory-1516-51" title="Gregory "". Could you do a routine that counts the number of such tags from all the 20,000 entries and then we could start with the ones that had the most links?
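A rough sketch of such a counting routine, in Python, assuming the "Family and Education" section of each biography has already been isolated as an HTML fragment (how to isolate it depends on page structure not described here). The function names and the shape of the input dictionary are hypothetical; only the href pattern is taken from the Cromwell example.

```python
import re

# Matches links to other member biographies, e.g.
# href="/volume/1509-1558/member/cromwell-gregory-1516-51"
MEMBER_HREF_RE = re.compile(r'href="(/volume/[^"/]+/member/[^"/]+)"')


def linked_members(fragment):
    """Return the set of distinct member paths linked from an HTML fragment."""
    return set(MEMBER_HREF_RE.findall(fragment))


def rank_by_links(fragments):
    """Rank entries by how many distinct member links they contain.

    `fragments` maps a member path to the HTML of its family section.
    Returns (path, link_count) pairs, most-linked first; ties are broken
    by path so the order is repeatable.
    """
    counts = {path: len(linked_members(html)) for path, html in fragments.items()}
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Counting distinct paths rather than raw tags avoids inflating the score of a biography that links to the same relative more than once.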

I can think of heuristics of a few different sorts too - but maybe the better way to go is to just pick the 2,000 or so that are in the most recent time period first - and to track, over some period of time - how many pick up connections and/or edits? I certainly expect to work with a lot of these pages.

By the way, I'd love to see the code that you've used to create these GEDCOMs - it's something I've been wanting to learn more about but haven't done so yet. AndrewRT 17:36, 5 April 2013 (EDT)

I haven't got anything yet, but I don't expect to do anything very sophisticated. I'm planning to reverse-engineer some tiny GEDCOMs that I export - and then see if I can produce some equally tiny original versions that will be acceptable for import. --jrm03063 22:09, 5 April 2013 (EDT)
Actually, there are technical disadvantages to uploading GEDCOMs containing mostly disconnected people. You say "Obviously, a lot of them would match from the start." This is not possible. The GEDCOM uploader does not check for duplicate individuals, it only checks for duplicate families. The same is true for WR's daily duplicate report. Disconnected Person pages (orphans) will not be connected into existing family groups during GEDCOM upload. Also, if other users upload a GEDCOM containing some of the same people, they will not be matched. I've seen this problem on a small-scale with the Dutch files when there are only 1 or 2 orphan pages.
I'm well aware of the way that automatic matches are done - and had a tiny part in the technical discussion when Dallan developed the process. I did some simplistic mechanical things to find such matches before the software supporting it here existed.
By "from the start", I mean matches that would be readily detectable when each person page is visited and investigated BY A HUMAN, at even a cursory level. I've added over 400 cites to this source by hand, and know of many that I just haven't been able to get to yet. So while I'm looking to significantly streamline and/or reduce the amount of "by hand" work, and make the process more methodical, I do not think that element can be eliminated.
I do have experience with gathering large amounts of data from a source, turning it into a GEDCOM, and uploading it to WR. However, in that project, I had connections to either their Parent unit or their Spouse unit. In that case, there was something for the uploader to match to. --Jennifer (JBS66) 07:01, 6 April 2013 (EDT)
The only thing that's novel here is my claim that we need not be restricted to the two extremes - either doing everything by laborious data entry, or insisting that a more complete set of connections be present so that we're working from full trees and tree fragments loaded by a conventional-looking GEDCOM. I want to take what I can get from one, doing only what I must with the other. The balance between what you get from each can be different than what we have previously contemplated.
However this plays out - even if it eventually does all 20,000 pages - that won't be arrived at overnight. The process will be experimented with on a tiny scale (as it is presently), then larger sets created as we get experience with the process and see that it's working. I'm excited about it because I'm pretty certain about it, so I do anticipate getting to those large sizes. But maybe others aren't so sure - fair enough. Even if I'm completely wrong - and I created some sort of large mess - it wouldn't be a disaster - because the whole thing is subject to automatic and reliable removal. There's a lot of potential here - and nearly zero risk. --jrm03063 10:15, 6 April 2013 (EDT)

Results of a tiny export->import [6 April 2013]

Pages Person:Charles Abbot (6) and Person:John Abercromby (2) represent the export and re-import versions of Person:Charles Abbot (5) and Person:John Abercromby (1) (an exported GEDCOM, consisting of two detached Person pages, re-imported without any changes). Some initial observations:

  • Obviously the pages we generate won't include the "WeRelate" source.
  • Too bad that the "Record" part isn't correctly placed back in the "Record" field on re-import. The GEDCOM looks right as far as this information goes; I think that the GEDCOM import simply isn't making the right choice. At least the template text wasn't messed with.
  • I did expect that the source for History of Parliament would be turned into a MySource, but the information present in the GEDCOM really should have been enough to avoid re-creating that. Too bad.
  • The category was correctly established, so that's a win.
  • I'm glad to see that two different facts (birth and death) attached to a single source, came back and were still a single source (even if that source became a MySource).
  • While not important for this effort, the unchanged Person pages could have been recognized on re-import. The exported GEDCOM has a bunch of information that identifies the page as originating on WeRelate, so a tiny trial "export" of potentially unchanged re-imports could be done and compared against the GEDCOM content. Maybe a note to add regarding software defects/enhancements somewhere?

So my first thought - it's not awful. I never expected that the pages would arrive in a state that they didn't need individual attention. Not being able to maintain input wiki syntax like the Category membership and template for the source record cite would have been much more of a hardship. I think I'll next try to write an actual tiny GEDCOM.

Thoughts? Remember - anything that we can do semi-automatically and consistently in software - won't have to be done by hand later - and anything starts looking like a drag if you have to do it 20,000 times! :) ! --jrm03063 17:04, 6 April 2013 (EDT)

Looks good. I'm intrigued to know how you generated the dates of birth and death (particularly considering the DOD for Abbot was in the main text rather than in the "Family" section). Was this done with an algorithm? How would you propose keeping track of which uploaded people had been checked, particularly if we were sharing this work? AndrewRT 17:23, 6 April 2013 (EDT)
I haven't generated any GEDCOM yet - I just wanted to start with WeRelate pages that were about what I wanted, export that and re-import it, to see what survived. I don't expect to do better than a year for DOB/DOD in the typical case. --jrm03063 17:53, 6 April 2013 (EDT)