Bot ideas [19 August 2012]
I'm interested. I don't know the specific nuts and bolts of bots - but regularly run code that grovels over pages both on WeRelate and elsewhere. --jrm03063 10:45, 13 August 2012 (EDT)
- Thank you! I don't have anything specific right away, but will probably have something in September. Do you happen to know Java? I've written a "bot" framework in Java. You could certainly write in other languages too if you don't know Java.--Dallan 08:34, 14 August 2012 (EDT)
- I wrote a lot in Java ten years or so back, but I'm much happier in Python these days (a much nicer fit with C++). There are some things that I have in mind.
- I was thinking about defining a template for "MySource" pages, where you could define the preferred "Source" in one place. The bot could look for instances of the template on MySource pages, then, for any it found, see if they are linked by person or family pages. On those, change instances of the MySource to the appropriate Source. Smarter transforms could be written if you needed to do more than just change a MySource to a Source (changing other fields or moving things around).
- I'm also thinking about something to work backwards from Savage - creating sources on Person/Family pages indicating where the person is mentioned (and some part of the context). This one is apt to be controversial - so I'll probably have to create an invisible template on Person pages - that marks them as "auto-savage-cite".
- A wikipedia scanner that would look for ahnentafel templates that contain references to other wikipedia pages. Compare the findings with what's in WeRelate, to both confirm existing references and to discover opportunities for adding more pages from the WP biography set.
- --jrm03063 12:05, 14 August 2012 (EDT)
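A minimal sketch of the first idea (swapping MySource references for a preferred Source), assuming the references appear as ordinary wiki links of the form `[[MySource:Title|label]]` in the raw page text. The function name `replace_mysource_links`, the `preferred` mapping, and the sample titles are all hypothetical. A real bot would build the mapping from the proposed template and might also need to touch citation fields stored elsewhere on the page.

```python
import re

# Matches the opening of a wiki link to a MySource page and captures its title.
MYSOURCE_LINK = re.compile(r"\[\[(MySource:[^\]|]+)")


def replace_mysource_links(page_text, preferred):
    """Rewrite [[MySource:...]] links whose title has a preferred Source.

    `preferred` maps a full MySource title to a full Source title
    (hypothetical structure). Links with no mapping are left untouched.
    """
    def swap(match):
        title = match.group(1).strip()
        target = preferred.get(title)
        return "[[" + target if target else match.group(0)

    return MYSOURCE_LINK.sub(swap, page_text)


# Illustrative data only; these titles are made up.
PREFERRED = {"MySource:Jsmith/Family Bible": "Source:Smith Family Bible"}
SAMPLE_TEXT = "See [[MySource:Jsmith/Family Bible|the Bible]] and [[MySource:Other/Notes]]."
REWRITTEN = replace_mysource_links(SAMPLE_TEXT, PREFERRED)
```

Only the link target is changed, so any label after the `|` is preserved. The "smarter transforms" mentioned above would need more than a link rewrite.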
- All of these ideas sound good. You really only need two things to get started, both of which don't sound like they would be that difficult to implement using Python:
- XML parser to process the pages.xml file. Pages.xml contains one XML Element per page. The file is fairly large, so you either need a machine with lots of memory or a streaming XML parser. Each page contains the raw page content, often with an embedded XML island. You can see what the raw page content looks like by appending "?action=raw" to the url of any page: here's an example.
- HTTP client to simulate a user interacting with WeRelate: sign in, handle the sign-in cookie, fill out the forms to edit pages.
- Please let me know if you have any questions. A python-based framework for parsing the pages.xml file and interacting with WeRelate edit-page forms might be a good start.--Dallan 17:10, 19 August 2012 (EDT)
- I've done a fair amount of reading with the action set to raw, and I also have XML parsing working. No editing yet, though. --jrm03063 20:29, 19 August 2012 (EDT)
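A minimal sketch of the streaming-parser half of that starting point, using `xml.etree.ElementTree.iterparse` from the Python standard library so the whole file never has to sit in memory at once. The element names (`page`, `title`, `text`) are assumptions borrowed from the usual MediaWiki export layout, not confirmed for pages.xml, and `iter_pages` is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any XML namespace prefix such as '{http://...}page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source, page_tag="page"):
    """Yield (title, text) for each page element in a large XML file.

    `source` is a filename or a binary file object. The tag names are
    assumptions about the dump layout; adjust them to match the real file.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != page_tag:
            continue
        title = text = None
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title" and title is None:
                title = child.text
            elif name == "text" and text is None:
                text = child.text or ""
        yield title, text
        # Discard the page's content once the caller has handled it.
        elem.clear()


# Tiny stand-in for pages.xml, for illustration only.
SAMPLE_XML = b"""<pages>
  <page><title>Person:William Penn (1)</title><text>raw content</text></page>
  <page><title>Place:Wales</title><text>more content</text></page>
</pages>"""
PAGES = list(iter_pages(io.BytesIO(SAMPLE_XML)))
```

The embedded XML island inside each page's raw content would need a second parsing step, which this sketch does not attempt.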
Helping Out [16 August 2012]
I would also be interested in helping out. I am just starting my Master's program this coming week, so I will be busy, but I'd love to stay in the loop.
Python is also my language of choice. Jrm, if you had a framework working in Python, then I would probably have time to help out with some basic stuff. - Jdfoote1 21:17, 15 August 2012 (EDT)
Explicit Biographical Page Correspondence [14 September 2012]
I've been very interested in web-based sources for which there is a unique correspondence between a WeRelate Person page and a page on that source. The most obvious is of course Wikipedia. For example, the person page for William Penn, founder of Pennsylvania, has a one-to-one correspondence with this Wikipedia biography. Wikipedia has no other biography of William - nor do we have another person page. As I understand it, we are presently able to infer this correspondence because our page sources [[Template:Wp-William Penn]]. In the event that a WeRelate person page does not source the WP template, the correspondence can still be inferred if the [[Template:Moreinfo wikipedia|page_name]] appears.
Explicit correspondence isn't hugely important for human researchers - but it could make for some interesting bots. For example, as noted in a previous section, we can look at WeRelate pages that are associated with a WP page that uses an ahnentafel template. By examining the contents of the WP template, we can check to see if WeRelate is missing references to other Wikipedia biographies - or perhaps discover situations where there is an inconsistency. A difference between the two does not say which one is wrong, but it does show that either WeRelate or Wikipedia may be in error.
Wikipedia isn't the only source of this type. I like to establish such correspondence with other biographical collections. Take Chaucer, for example: we have the corresponding pages/locations in Find A Grave, The History of Parliament and Lundy's site. For these sources, however, there is no way to designate correspondence. It is easy to imagine use of the History of Parliament biography page for other people or families. So how to designate when a source corresponds absolutely?
One possibility is that we don't rely on the Person page source entry at all. Instead, simply create a template that prints nothing - but designates the correspondence. Perhaps a template called "corresponding_person" - that takes a parameter indicating the source and a second parameter indicating the URL on that source. Another approach would be to leverage ordinary source entries. Create a template that prints the input parameter, named "corresponding_page". Yet another would be to create a way to designate source-specific conventions on the source page. For example, placement of a url in the record field of a source cite for the History of Parliament, would implicitly designate a correspondence.
I don't think I presently have a preference about how to do any of this - but I do think that such correspondence has a lot of potential value.
--jrm03063 12:39, 27 August 2012 (EDT)
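To make the first possibility concrete, here is a rough sketch of how a bot might pick up such a template from raw page text. It assumes the hypothetical "corresponding_person" template takes exactly two positional parameters (source, then URL) and is written on the page as `{{corresponding_person|source|url}}`. The function name and the example.org URL are illustrative only.

```python
import re

# Matches {{corresponding_person|<source>|<url>}} with two positional parameters.
CORRESPONDING = re.compile(
    r"\{\{\s*corresponding_person\s*\|([^{}|]*)\|([^{}|]*)\}\}",
    re.IGNORECASE,
)


def find_correspondences(page_text):
    """Return a list of (source, url) pairs declared on a page."""
    return [
        (match.group(1).strip(), match.group(2).strip())
        for match in CORRESPONDING.finditer(page_text)
    ]


# Illustrative page text; the URL is a placeholder, not a real biography page.
SAMPLE_PAGE = (
    "Some narrative text.\n"
    "{{corresponding_person|History of Parliament|http://example.org/bio/1}}\n"
)
FOUND = find_correspondences(SAMPLE_PAGE)
```

A regular expression is enough for a flat template like this one. Nested templates or named parameters would call for a real wikitext parser.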
- This is interesting. We do something along these lines for places, where we use source-fhlc and source-getty templates to identify correspondences in getty and the fhlc. See Place:Wales for example.--Dallan 10:45, 29 August 2012 (EDT)
- Interesting. It would be nice to solve this problem in a general way - instead of having to roll an approach on a per-source basis. I didn't even know that the "source-fhlc" and "source-getty" templates were serving that purpose. I created a template "fhlc-item" (only used once) because I wanted to be able to have the fhlc information appear in a more useful way (see St. Olave Hart Street, London, England). --jrm03063 13:48, 29 August 2012 (EDT)
- Would you be interested in writing a bot to move the source-fhlc and source-getty templates from the top of the page to the bottom of the page, under a "Resources" section?--Dallan 09:35, 13 September 2012 (EDT)
- Sure! It would be good to get an opportunity to work on something where I can get some guidance on the process. I actually downloaded the Java yesterday, not because I want to use it directly, but because I wanted an example of what is done and how. I also want to use terminology that you will hopefully find to be meaningful. So - perhaps this weekend - I'll start work on a python equivalent of at least some of your Java code. I'm sure I'll have questions, but I think I have enough to get started. --jrm03063 11:19, 13 September 2012 (EDT)
- Great! Let me know if you have any questions along the way.--Dallan 19:19, 14 September 2012 (EDT)
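The text transformation at the heart of such a bot could be sketched roughly as follows. This is only a guess at the page layout: it assumes each source-fhlc or source-getty template sits on its own line, that the section heading is written `==Resources==`, and that an existing Resources section, if any, is the last one on the page. The function name and the template parameter values in the sample are made up.

```python
import re

# A line consisting solely of a source-fhlc or source-getty template call.
TEMPLATE_LINE = re.compile(
    r"^\s*\{\{\s*source-(?:fhlc|getty)\b[^{}]*\}\}\s*$", re.IGNORECASE
)


def move_to_resources(page_text, heading="==Resources=="):
    """Move source-fhlc/source-getty template lines to the end of the page.

    The heading is appended only if the page does not already contain it.
    Pages without either template are returned unchanged.
    """
    lines = page_text.split("\n")
    moved = [line.strip() for line in lines if TEMPLATE_LINE.match(line)]
    if not moved:
        return page_text
    kept = [line for line in lines if not TEMPLATE_LINE.match(line)]
    body = "\n".join(kept).strip("\n")
    parts = [body] if body else []
    if not any(line.strip() == heading for line in kept):
        parts.append(heading)
    parts.extend(moved)
    return "\n".join(parts) + "\n"


# Illustrative page text; the template parameters are placeholders.
SAMPLE_PLACE = "{{source-getty|7008653}}\n{{source-fhlc|12345}}\nWales is a country.\n"
MOVED = move_to_resources(SAMPLE_PLACE)
```

Running the function a second time on its own output leaves the page unchanged, which is a useful property if a bot may visit the same page more than once.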