User:Jrm03063/Source-side reference creation

Overview

WeRelate (WR), and its wiki underpinning, help automate many steps in the process of creating good genealogy.

  • When a user connects a Person page to a Family page, the system automatically generates two connections: 1) from the Person to the Family and 2) from the Family to the Person.
  • When a user cites a Source on a Person or Family page, formal information on the source is obtained without the user needing to reproduce it for each citation.
  • If a user changes the formal information on a Source page, all the Person and Family pages associated with that source benefit immediately.

While the community very much desires that source standards be improved, both in extent and in consistency of application, WR offers no support for citations beyond a source's title and publication details. Users are very much on their own.

Source citations therefore involve many mechanical steps for a user, and the chances of doing things consistently are low. Different users are apt to apply different standards, and even a single user may not recall, from day to day, what their practice was for a given source. Further, even if the community manages to agree on conventions for citing a particular source, such agreement will doubtless come after many individual citations have been created. Indeed, a proper consensus might change over time. However desirable such changes might be, the community will be loath to take a step that involves thousands, or even tens of thousands, of changes to Person and Family pages.

Fortunately, many important genealogy sources are out of copyright and available in transcript form. Indeed, creating such transcriptions was one of the earliest shared on-line genealogy activities. Such transcriptions can readily be added as pages to WR. With appropriate organization, markup, and supporting programs, it should be possible to generate many thousands of individual source citations consistently. If the desired conventions for the appearance of those citations should change, the programs should be able to perform the needed updates quickly and consistently.

Instead of a user working slowly through individual Person pages, laboriously creating citations, source-side specialists can swiftly and systematically edit the markup of transcript pages. When the appropriate programs are run, the transcript markup is analyzed and citation information is fed back to the appropriate Person pages.

The Genealogical Dictionary of the First Settlers of New England

Savage's Dictionary is a well-known secondary reference for early New England genealogy. It is long out of copyright and was previously transcribed by Dr. Robert Kraft. Here at WR, we have our own copy of that transcript. Over the last several years, this copy has been subjected to numerous edits: linking to WR Person pages where known, improving the cosmetic appearance of the content, and fixing character recognition errors latent in Kraft's original text. The same markup also gives software a way to analyze the material, recognizing sections and sketches. Further, analysis of ordinary wiki annotation syntax allows a program to determine which WR Person pages are the subjects of which Savage sketches (name, section, volume and page).
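The analysis described above can be sketched roughly as follows. This is only an illustration, not the program actually used: it assumes that Person links appear in the ordinary wiki form `[[Person:Title|label]]`, and that sketches have already been split out into records with the hypothetical keys `name`, `section`, `volume`, `page` and `text`. The sample record is invented for the example.

```python
import re
from collections import defaultdict

# Ordinary wiki link to a Person page: [[Person:Title|label]] or [[Person:Title]].
# The exact link form used in the transcript is an assumption.
PERSON_LINK = re.compile(r"\[\[Person:([^\]|]+)(?:\|([^\]]*))?\]\]")


def index_sketches(sketches):
    """Map each Person page title to the sketches that link to it.

    `sketches` is an iterable of dicts with the hypothetical keys
    'name', 'section', 'volume', 'page' and 'text'.
    """
    index = defaultdict(list)
    for sketch in sketches:
        seen = set()  # record each Person at most once per sketch
        for match in PERSON_LINK.finditer(sketch["text"]):
            title = match.group(1).strip()
            if title in seen:
                continue
            seen.add(title)
            index[title].append(
                (sketch["name"], sketch["section"], sketch["volume"], sketch["page"])
            )
    return dict(index)


# Invented sample data, for illustration only.
sample = [
    {
        "name": "EXAMPLE, JOHN",
        "section": "E",
        "volume": 1,
        "page": 10,
        "text": "[[Person:John Example (1)|JOHN]], m. "
                "[[Person:Jane Example (2)|Jane]]; "
                "[[Person:John Example (1)|he]] d. 1700.",
    },
]

index = index_sketches(sample)
```

Each entry in the resulting index is a candidate citation: a Person page title paired with the sketch name, section, volume and page where that person is linked.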

A sample extraction (Python source) from Savage is presented on this page. The complete extract is over 80K lines, showing 33839 possible citations for 23199 WR Person pages. The extracted quotation uses bold to mark where the associated Person is referenced in the sketch. Content where Savage is believed to have been in error is marked with strike-through. The rules for how much content to include in a quote are modifiable program constants.
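A minimal sketch of how such a quotation might be produced is given here. It is not the actual extraction program. It assumes Person links of the form `[[Person:Title|label]]`, uses wiki bold (`'''...'''`) for the subject, and invents two constants, `CONTEXT_BEFORE` and `CONTEXT_AFTER`, to stand in for the modifiable length rules. Strike-through markup is left untouched.

```python
import re

PERSON_LINK = re.compile(r"\[\[Person:([^\]|]+)(?:\|([^\]]*))?\]\]")

# Hypothetical tunable constants: characters of context to keep
# before and after the first reference to the subject.
CONTEXT_BEFORE = 40
CONTEXT_AFTER = 80

# Private markers, so trimming can never cut a bold marker in half.
_OPEN, _CLOSE = "\x00", "\x01"


def extract_quote(text, person_title):
    """Return a trimmed quote with the subject's references in bold,
    or None if the sketch never links to `person_title`."""

    def render(match):
        title = match.group(1).strip()
        label = match.group(2) or title
        if title == person_title:
            return _OPEN + label + _CLOSE  # subject: mark for bold
        return label                       # other persons: plain text

    flat = PERSON_LINK.sub(render, text)
    first = flat.find(_OPEN)
    if first == -1:
        return None

    start = max(0, first - CONTEXT_BEFORE)
    end = min(len(flat), first + CONTEXT_AFTER)
    snippet = flat[start:end]
    # If the cut landed inside a bold span, close it.
    if snippet.count(_OPEN) > snippet.count(_CLOSE):
        snippet += _CLOSE

    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(flat) else ""
    snippet = snippet.replace(_OPEN, "'''").replace(_CLOSE, "'''")
    return prefix + snippet + suffix


quote = extract_quote(
    "He m. [[Person:Jane Example (2)|Jane]], and "
    "[[Person:John Example (1)|John]] d. 1700.",
    "John Example (1)",
)
# quote == "He m. Jane, and '''John''' d. 1700."
```

Changing the two constants and re-running the extraction is all that would be needed to lengthen or shorten every generated quote.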

Demonstration Extract to Person Pages

A demonstration (using slightly different parameters than those shown here) has been done for a handful of Person pages. The current implementation adds a generic source entry for Savage, with the extracted information placed in the narrative body for the source.

Automatically extracted sources are made recognizably different from those created by users by bracketing the extracted content with uniquely named empty templates (SavageDictionaryIndexExtractBegin at the beginning and SavageDictionaryIndexExtractEnd at the end). The bracketing templates have the side benefit of making it easy to find the Person pages in the demonstration set (visit the template, then see "what links here").
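The bracketing templates also make it straightforward for a program to replace an earlier extract without disturbing anything a user wrote. The following is a sketch only, operating on a plain string of wikitext. The function name `upsert_extract` is hypothetical, and where the block sits within a real Person page is not addressed.

```python
import re

BEGIN = "{{SavageDictionaryIndexExtractBegin}}"
END = "{{SavageDictionaryIndexExtractEnd}}"

# Non-greedy match of everything between the two bracketing templates.
BLOCK = re.compile(re.escape(BEGIN) + r".*?" + re.escape(END), re.DOTALL)


def upsert_extract(page_text, extract):
    """Replace the existing bracketed extract, or append one if absent.

    Text outside the brackets is never modified.
    """
    block = BEGIN + extract + END
    if BLOCK.search(page_text):
        # A function is used as the replacement so that backslashes
        # in the extract are not treated as regex escapes.
        return BLOCK.sub(lambda _match: block, page_text, count=1)
    separator = "" if not page_text or page_text.endswith("\n") else "\n"
    return page_text + separator + block


page = "Notes written by a user.\n"
first_run = upsert_extract(page, "old quote")
second_run = upsert_extract(first_run, "new quote")
```

Running the program a second time with a revised extract leaves exactly one bracketed block on the page, which is what would allow citation conventions to be changed after the fact.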