Society for American Baseball Research Projects

Projects
This section includes news on the status of SABR Projects. You can select an individual project from the "Projects" menu above to see only status updates for that project, or view all updates on this page. All of the pages have RSS feeds available.

Improving person disambiguation and "smarter" searching
Written by Ted Turocy   
Friday, 22 May 2009 11:53

One of the challenges we face in the Encyclopedia project is that we have a universe of what will be over 200,000 "notable" people. That universe grows every day as players make their professional debuts, and our knowledge about past players, managers, executives, umpires, and so forth continues to evolve. We chose a wiki-based Encyclopedia in part because it gives us the flexibility to deal with these ever-changing data, and allows for collaboration in improving the data.

Still, it can be hard to keep this all organized, and to find what you want. The disambiguation pages listing people with the same or similar names are a good example. When we created the initial set of person pages, we created simple disambiguation pages, with names and active dates. However, these are entirely static, in that they do not reflect the content on the pages they link to. For example, if I update a player's page with his 2009 teams, it does not update a corresponding disambiguation page to indicate he was active through 2009. If I add another person with the same name, I must also manually edit the disambiguation page to add him.

This is all well and good as we start on the project, but it's clear that we need a more robust solution that will serve us well over many years. On selected disambiguation pages, I am rolling out a first cut at what I believe may be part of the solution. Have a look at the disambiguation page [[John Smith]] in the Encyclopedia. At the top is the original, manually organized list of all the John Smiths. Below that is an experimental query on the wiki which generates the same list. As most readers of this know, we are making heavy use of Semantic MediaWiki (http://www.semantic-mediawiki.org) in the development of the project. This extension allows us to associate properties with each page, and to run queries on those properties. You've seen this in action already on team pages, where the rosters, managers, and ballparks are automatically generated using such queries. The experimental disambiguation page is similar.

I think the potential power of this approach is clear. The list of teams each person played for is generated from the individual person pages, so there is no issue with keeping multiple pages synchronized. If we add a new person with the same name, the disambiguation page will automatically be updated. (Note: For performance, queries like this are cached and only refreshed periodically. You may need to click the "refresh" tab at the top of the screen to see very recent edits reflected on these pages.) 
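As a rough illustration of the idea, here is a minimal Python sketch of a disambiguation list that is computed from the person records rather than maintained by hand. It is not Semantic MediaWiki query syntax, and the Person class, its fields, and the disambiguation_entries function are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Person:
    """Hypothetical stand-in for the properties stored on one person page."""
    page_title: str
    name: str
    # Each playing engagement is a (year, team) pair.
    engagements: list = field(default_factory=list)


def disambiguation_entries(people, name):
    """Build the disambiguation list for `name` from the person records."""
    matches = [p for p in people if p.name == name]
    # Order by first active year; people with no engagements sort last.
    matches.sort(key=lambda p: min((y for y, _ in p.engagements), default=9999))
    lines = []
    for p in matches:
        if p.engagements:
            years = [y for y, _ in p.engagements]
            teams = sorted({t for _, t in p.engagements})
            lines.append(
                f"{p.page_title} ({min(years)}-{max(years)}): {', '.join(teams)}"
            )
        else:
            lines.append(f"{p.page_title} (no playing record)")
    return lines
```

Because the list is derived each time it is requested, adding a 2009 engagement to one person, or adding a new person with the same name, changes the output without anyone editing the list itself.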

There is still much to do on these disambiguation pages. The query does not currently list non-playing engagements, so managing, umpiring, and similar roles don't show up. It would also be helpful to be able to list dates of birth and other biographical information. It will take some trial and error to come up with a visually appealing solution that communicates the information a user is looking for, to help find the right guy out of a list of 40+ people with the same name. The good news is that there are no technical barriers to doing so; with time and experience, these queries will improve.

In parallel, Peter has been looking into improving the search feature. The default MediaWiki search engine we are using is not very reliable; it often fails to report "near misses." If you search on "Joseph Shlabotnik" and it turns out we have him listed as "Joe Shlabotnik," you may get zero hits. There are better search engines out there, and we'll be deploying one sooner rather than later. In addition, we are looking into ways to exploit the semantic contents of our pages to help make search smarter, so that using Joe vs. Joseph, Mike vs. Michael, and so on won't be an obstacle to helping you find the person you're looking for.
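To make the Joe vs. Joseph problem concrete, here is a small Python sketch of one possible approach: reduce each given name to a canonical form before comparing. This is only an assumption about how such matching could work, not a description of the search engine we will deploy; the NICKNAMES table and both function names are hypothetical.

```python
# Hypothetical table mapping common short forms to a canonical given name.
NICKNAMES = {
    "joe": "joseph",
    "mike": "michael",
}


def canonical_name(full_name):
    """Lower-case a name and replace a known short given name with its long form."""
    parts = full_name.lower().split()
    if parts:
        parts[0] = NICKNAMES.get(parts[0], parts[0])
    return " ".join(parts)


def search(query, page_titles):
    """Return the titles whose canonical form equals the query's canonical form."""
    target = canonical_name(query)
    return [title for title in page_titles if canonical_name(title) == target]
```

With this normalization, a search on "Joseph Shlabotnik" finds a page titled "Joe Shlabotnik", and the reverse also holds. A real deployment would need a much larger nickname table and some care with names where the short form is the person's actual given name.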

 

 
Flagging articles with factual errors: possible-error-flag
Written by Ted Turocy   
Wednesday, 15 April 2009 08:41

An edit last night brought to my attention a hole in the documentation on how to edit the career section of pages on people. What happens if you find a page with an engagement record dated after the person's death? It's logically absurd, and we certainly don't want situations like that to persist. However, if you encounter the situation while browsing around, or while looking for some other page, you might not know how to resolve it correctly. Simply deleting the engagement record for the person is not satisfactory. If it is a minor league playing record, for instance, deleting it will cause a problem in the Minor Leagues Database, since there will no longer be any person attached to that playing record.

To address this situation, which will not be rare but will also, I hope, not be too common, I created a new template, possible-error-flag. This should be placed at the top of a page on which a factual error or conflict has been identified. Use the discussion page to record the nature of the conflict, including any information you might have towards resolving the conflict.

We will be able to generate a list of all pages tagged with possible-error-flag, which will give a rich hunting ground to our puzzle-solving colleagues who live to track down the solutions to such conundra.
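The engagement-after-death case is the kind of conflict a simple automated check could surface as a candidate for flagging. The Python sketch below is only an illustration under assumed data: the PersonRecord class, its fields, and the year-level comparison are hypothetical and do not describe any existing tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PersonRecord:
    """Hypothetical summary of one person page."""
    page_title: str
    death_year: Optional[int]    # None if no death is recorded
    engagement_years: List[int]  # years with a playing or other engagement


def needs_error_flag(person):
    """True if any engagement falls in a year after the recorded death year."""
    if person.death_year is None:
        return False
    # An engagement in the year of death is plausible, so only later years count.
    return any(year > person.death_year for year in person.engagement_years)


def pages_to_flag(people):
    """List the page titles that are candidates for possible-error-flag."""
    return [p.page_title for p in people if needs_error_flag(p)]
```

A check like this only identifies candidates. Deciding whether the death date or the engagement record is the one in error still takes a person, which is why the flag and the discussion page remain the place to work it out.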

 

 
SABR Encyclopedia: First Update
Written by Ted Turocy   
Monday, 13 April 2009 12:49

This is the first in a series of occasional updates on the development of the SABR Encyclopedia wiki.

It's been about six weeks since the Board approved the concept of the Encyclopedia and authorized us to begin work. After a month of planning, on April 1, the first automated "bot" went into action, creating a page for each person listed in the Minor Leagues Database, which is the single largest dataset anywhere of people involved with professional baseball. About five days later, the upload process was completed, and a few intrepid souls, Jack Morris, Cliff Blau, Joel Dinda, and John Zajc among them, have begun the task of organizing and expanding biographical knowledge about this set of people. In the meantime, pages have automatically been built out for (most) professional leagues and teams. A page has also been created for each ballpark that has hosted at least one Major League game.

A major focus of development in the coming weeks will be organizing these pages and "stubbing" out pages for other persons, leagues, and concepts. This breadth-first approach is motivated by the belief that most potential contributors will be more comfortable expanding existing pages rather than creating new ones from scratch. Organization and navigation, through categories, navboxes, and the like, will make it possible for contributors to find the best pages on which to make their contributions.

We have begun making use of the Semantic MediaWiki extension within the wiki. We are very excited about the possibilities this extension offers for autogenerating information within the wiki. We currently generate roster tables for each club using this extension, and have just implemented a similar feature to autopopulate executive roles for leagues. Similar features for club managers and general managers, and umpires for leagues, are intended. We are also using this feature to create an automatically updated necrology for 2009, which we hope will help Rod Nelson and the Emerald Guide crew get a head start on next year's edition. (Even a few days later, it still affects me when I see Nick Adenhart's name at the top of that page.)

A key design feature is the use of templates to record information systematically about entities in the wiki for easy extraction down the road. Some of these templates wrap Semantic MediaWiki (SMW) properties, so contributors don't need to learn how SMW works; the creation of properties happens automatically behind the scenes. Even where templates do not wrap SMW properties, they are easy enough to parse that tools will be able to spider the wiki to extract and cross-check information.
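To show why template-based pages are easy to parse, here is a minimal Python sketch that pulls named parameters out of simple template calls in page text. The template name and field names in the usage note are made up for illustration, and the regular expression is an assumption that handles only flat templates with at least one parameter: nested templates, or links containing a pipe character, would need a real wikitext parser.

```python
import re

# Matches a flat template call such as {{Name|key=value|key=value}}.
# Templates with no parameters, or with nested braces, are not matched.
TEMPLATE_RE = re.compile(r"\{\{\s*(?P<name>[^|{}]+?)\s*\|(?P<body>[^{}]*)\}\}")


def parse_templates(wikitext):
    """Return a list of (template_name, {parameter: value}) pairs."""
    results = []
    for match in TEMPLATE_RE.finditer(wikitext):
        fields = {}
        for part in match.group("body").split("|"):
            # Keep only named parameters; positional ones are ignored.
            if "=" in part:
                key, value = part.split("=", 1)
                fields[key.strip()] = value.strip()
        results.append((match.group("name"), fields))
    return results
```

For example, calling parse_templates on the text "{{Person|name=John Smith|born=1880-01-02}}" yields one pair: the template name "Person" and a dictionary with the keys "name" and "born". A spider could collect such dictionaries across pages and compare them against another data source.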

One such spider now in development will extract the basic biographical data and update the Persons table in the Minor Leagues Database. The Encyclopedia wiki is now the primary place to update biographical data, both the basic demographics (name, height, weight, date of birth, and so on) and the assignment of playing, managing, and other records to each person. We are hopeful that this will ease the task of processing this information in a timely fashion, as well as minimize the chance of errors. Early experience indicates that this will be a viable solution, if managed properly.

We will continue expanding the breadth of the Encyclopedia in the coming weeks. One of the next datasets to come will be minor league ballparks, based on Gord Brown's register. A few states' worth have been wikified already, and we will soon be seeking volunteers to carry out the rest. Also on the shortlist of major tasks are work on Gary Benner's collegiate summaries, and updating the wiki with major league managers and umpires, whose records are currently largely missing.

For the sake of posterity, as I write this, the front page of the wiki reports 217,779 pages, including 169,789 people, 4,933 league-seasons, 31,003 team-seasons, and 301 ballparks. There have been 250,509 page edits, and 245,292 pages, which means there have been at least around 5,300 non-bot edits. I don't put too much stock in raw edit statistics; after all, mechanically adding navboxes to all the seasons of a league creates a lot of mindless edits that don't directly do very much yet. Even at that, given the small number of us who are active right now, that's a sizeable number, and I take it as an indication that we're off to a good start.

 

Last Updated on Monday, 13 April 2009 19:20