Just yesterday I blogged about LinkedIn and PersonRank. Today I want to talk about a different type of problem: a challenge posed by a startup called Spock. Their challenge, worth $50,000, is at http://challenge.spock.com/ .
The problem, in a nutshell: page 1 is about a person named XYZ and page 2 is also about a person named XYZ, but in real life the two are different people. How do you tell which page is about which XYZ?
I am sure a lot of set processing, artificial intelligence, neural networks, and Bayesian probability will show up in the various solutions. But is all of that really necessary? Well, let me share some of my thoughts.
Many websites are about a particular topic. Of course, there are generic websites too. But pages about people often come in the form of news of one kind or another, and even generic news sites are organized by topic: CNN, for example, uses a subdomain or a subdirectory for each of its news sections. So, if one XYZ is a singer whose news appears in the Entertainment section and the other, a football player, appears in the Sports section, that’s a good indication of which page is talking about which guy. So, instead of trying to derive the information purely from within the document, using attributes of the document such as the host and URI will probably give more accurate results. However, going by their data set description, only the web pages are provided and not the URLs of these pages.
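To make the URL idea concrete, here is a minimal sketch in Python. It assumes the page URL is available, which the challenge data apparently does not give you. The token-to-topic table and the function name `section_from_url` are made up for illustration; a real table would have to be built site by site.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from URL tokens to broad topics (an assumption,
# not something taken from the challenge or from any real site).
SECTION_HINTS = {
    "sports": "sports",
    "sport": "sports",
    "entertainment": "entertainment",
    "showbiz": "entertainment",
    "tech": "technology",
    "technology": "technology",
}


def section_from_url(url):
    """Guess a topic from the subdomain or path of a URL, or return None."""
    parts = urlsplit(url)
    # Subdomain labels, e.g. "sports.example.com" -> ["sports", "example", "com"]
    host_tokens = (parts.hostname or "").lower().split(".")
    # Path segments, e.g. "/entertainment/music/x.html" -> ["entertainment", ...]
    path_tokens = [t for t in parts.path.lower().split("/") if t]
    for token in host_tokens + path_tokens:
        if token in SECTION_HINTS:
            return SECTION_HINTS[token]
    return None
```

With this, a page at `http://sports.example.com/2007/xyz.html` would be tagged as sports and one at `http://www.example.com/entertainment/music/xyz.html` as entertainment, before even reading the page body.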
Secondly, a few keywords provide a clear classification. That is, whenever you see the keyword “football”, chances are that the article is about sports. Of course, some 6th grade kid could win a science fair with a project involving a football, and that would make the news without really being about sports. But in general it’s a safe bet to assume that any article containing “football” is about sports (the noise can be filtered out with thresholds, correlated words, etc.). So, say there is a set of root keywords that broadly identify the various professions. Professions help separate two people with the same name. Of course, if two guys have the same first and last name and both are in sports, or both in software development, then it gets difficult. Then, perhaps, age, job titles, professional affiliations, etc. will come into the picture.
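The root-keyword idea can be sketched roughly as follows. This is only an illustration under my own assumptions: the keyword sets, the `min_hits` threshold, and the function name `guess_profession` are all invented here, and a serious attempt would need much larger word lists and better weighting.

```python
import re
from collections import Counter

# Hypothetical root keywords per profession, padded with a few
# correlated words so that one stray keyword is not enough.
ROOT_KEYWORDS = {
    "sports": {"football", "quarterback", "touchdown", "league", "coach"},
    "music": {"singer", "album", "concert", "band", "song"},
    "software": {"software", "developer", "programmer", "code", "compiler"},
}


def guess_profession(text, min_hits=2):
    """Return the profession whose keywords occur most often, or None."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    # Score each profession by the total occurrences of its keywords.
    scores = {
        label: sum(counts[w] for w in keywords)
        for label, keywords in ROOT_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # The threshold filters out noise such as a single "football"
    # in a science fair story.
    return best if scores[best] >= min_hits else None
```

The threshold is what handles the science fair case: a page that mentions “football” once and nothing else sports-related stays unclassified, while a page that also mentions a coach and a touchdown is tagged as sports.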
Overall it’s a good and tough problem, but attempting to run any set-oriented algorithm on the entire web is not going to be easy. The key is probably to rely more on the correlation of words within the pages and on URL patterns than on trying to derive patterns across pages. Let’s see who wins the contest and how they solve it. Hopefully, the solutions will be made available for the public to look at.