Monthly Archives: April 2007

Script.aculo.us Effects on a Tag/Keyword Cloud

Script.aculo.us is one of the popular Web 2.0 JavaScript libraries, and it goes with the theme of “it’s about the user interface, baby!”

And the tag cloud is also a Web 2.0 concept.

So, what if we combine the two? You get Script.aculo.us effects on a tag cloud. That’s exactly what the ToCloud Keyword Cloud Generator has done. Here are a few examples.

Pulsate Effect on My Blog

Grow Effect on Amazon Homepage

Shake Effect on MySpace.com

BlindDown Effect on Yahoo!
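For anyone curious how such an effect gets wired to a cloud, here is a minimal sketch using the script.aculo.us Effect API. The `a.tag` selector and the mouseover trigger are my own assumptions; ToCloud may hook things up differently.

```javascript
// Minimal sketch: pulsate each tag in a cloud on mouseover.
// Assumes prototype.js and scriptaculous.js are loaded, and that the cloud's
// tags are rendered as <a class="tag"> elements (the class name is an assumption).
Event.observe(window, 'load', function() {
  $$('a.tag').each(function(tag) {
    Event.observe(tag, 'mouseover', function() {
      new Effect.Pulsate(tag, { pulses: 3, duration: 1.0 });
    });
  });
});
```

Swapping Effect.Pulsate for Effect.Grow, Effect.Shake or Effect.BlindDown gives the other variations listed above.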

Leave a comment

Filed under DHTML, javascript, keyword cloud, script.aculo.us, tag cloud, Web 2.0, word cloud

Beware of privacy when contributing to Wikipedia

For good or bad, Wikipedia keeps track of each edit and the associated IP address. It also provides the ability to search for all the edits coming from a given IP address. This is useful for the people who keep Wikipedia as clean as possible, free of spam and other junk. I think they also maintain blacklists of IP addresses.

I see one big problem with this IP address tracking. Whenever a person visits a website, that website also knows the person’s IP address. Using that IP address to query Wikipedia gives the website access to information about the person! Depending on which articles a person has edited, it becomes possible to infer what the person knows, likes and so on. For example, say you are interested in Yoga and edit a bunch of Yoga articles on Wikipedia. The moment you go to Amazon, if Amazon customized its site for you by showing a lot of Yoga books and related items, would that be good? Or spooky!
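To make the point concrete, here is a tiny sketch of how little it takes. The visitor’s IP would come from the web server’s request (e.g. REMOTE_ADDR), and Special:Contributions is the public Wikipedia page that lists all edits from an IP; everything else here is illustrative.

```javascript
// Builds the public Wikipedia URL listing every edit made from a given IP.
// The IP itself would come from the server's request (e.g. REMOTE_ADDR).
function wikipediaContributionsUrl(visitorIp) {
  return 'http://en.wikipedia.org/wiki/Special:Contributions/' + visitorIp;
}

// wikipediaContributionsUrl('203.0.113.42')
//   -> "http://en.wikipedia.org/wiki/Special:Contributions/203.0.113.42"
// Scanning the article titles on that page (say, a run of Yoga articles)
// hints at the visitor's interests.
```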

Leave a comment

Filed under online privacy, wikipedia

State, Navigation & Performance

Web application development is all about managing state and navigation. And for those few lucky sites that have high traffic (like wordpress.com), performance also matters. Two days back I happened to be looking at a portal application that was trying to display reports in real time. Since a portal typically has multiple portlets, each showing a real-time report, that particular portal application was designed to initially show a loading icon until each portlet’s content loaded, and then display it. One good thing with this approach, as opposed to sequentially building the entire page within a single request, is that the user can start looking at the content as soon as it is queried and sent to the client. The downside is that there are multiple client requests to serve each page, one per portlet, which can be expensive.
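Roughly, the per-portlet loading looks like the sketch below. The portlet ids, URLs and the use of Prototype’s Ajax.Updater are my own illustration, not the actual portal’s code.

```javascript
// Each portlet div shows a loading icon and is then filled in by its own
// request as soon as the server responds -- one request per portlet.
['sales_portlet', 'inventory_portlet', 'traffic_portlet'].each(function(id) {
  $(id).innerHTML = '<img src="loading.gif" alt="Loading..."/>';
  new Ajax.Updater(id, '/portlets/' + id, { method: 'get' });
});
```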

Now, this particular portal page has drill-down pages for each report, and a group of people is constantly looking at the reports and the drill-downs. Each drill-down is a separate page, so there is a lot of navigation back and forth. And every time a user navigates from a drill-down page back to the main portal page, the portal page re-renders all the portlets in real time. This definitely didn’t look promising: the page was taking a long time to render and consuming a lot of the entire group’s time.

Now, I have recently been experimenting with GreyBox. It allows you to open a detail page in an embedded window without leaving the current page. When the user clicks a link, it greys out the original page and puts up a new box containing the target of the link; hence the name GreyBox, I would assume. Since one can look at the details without leaving the current page, this approach would be extremely useful in a scenario like a portal.
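The underlying idea is simple enough to sketch in plain DHTML. This is not GreyBox’s own code, just a minimal illustration of the grey-out-and-embed pattern it implements.

```javascript
// A bare-bones version of the pattern: darken the page with a semi-transparent
// overlay and show the drill-down in an embedded iframe.
function openDetailBox(url) {
  // Semi-transparent overlay that greys out the current page.
  var overlay = document.createElement('div');
  overlay.style.cssText = 'position:fixed;top:0;left:0;width:100%;height:100%;' +
                          'background:#000;opacity:0.6;filter:alpha(opacity=60);z-index:100;';
  // Embedded window (an iframe) holding the drill-down page.
  var box = document.createElement('iframe');
  box.src = url;
  box.style.cssText = 'position:fixed;top:10%;left:15%;width:70%;height:70%;' +
                      'background:#fff;border:2px solid #ccc;z-index:101;';
  // Clicking the overlay closes the box; the portal page underneath was never
  // reloaded, so no portlet has to re-render.
  overlay.onclick = function() {
    document.body.removeChild(box);
    document.body.removeChild(overlay);
  };
  document.body.appendChild(overlay);
  document.body.appendChild(box);
}
```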

I think, overall, with AJAX, DHTML and JavaScript, a lot of server-side load can be reduced by carefully designing the navigation around widgets like GreyBox.

Leave a comment

Filed under AJAX, DHTML, GreyBox, performance, portal, web development

$50,000 Spock Challenge

Just yesterday I blogged about LinkedIn and PersonRank. Today, I want to talk about a different type of problem, a challenge posed by a startup called Spock. Their challenge, worth $50,000, is at http://challenge.spock.com/ .

The problem, in a nutshell: how do you tell that page 1 is about a person named XYZ and page 2 is also about a person named XYZ, but that the two XYZs are, in real life, two different people?

I am sure a lot of set processing, artificial intelligence, neural networks and Bayesian probability are going to show up in the various solutions. But is all that really necessary? Let me put down some of my thoughts.

Many websites are about a particular topic. Of course, there are generic websites too, but pages about people often come in the form of some kind of news, and even then, CNN, for example, uses a subdomain or a subdirectory for each of its news sections. So, if one XYZ is a singer whose news appears in the Entertainment section and the other, a football player, appears in the Sports section, that’s a good indication of which page is talking about which person. Instead of trying to derive the information purely from within the document, using attributes of the document such as the host and URI will probably give more accurate results. (From their data set description, though, only the web pages are provided, not their URLs.)

Secondly, a few keywords clearly provide classification. Whenever you see the keyword “football”, chances are the article is about sports. Of course, some 6th-grade kid could win a science fair with a project involving a football, that could make the news, and that news would not really be about sports. But in general it’s a safe bet that an article containing “football” is mostly about sports (the noise can be filtered out with thresholds, correlated words, etc.). So, say there is a set of root keywords that broadly classify the various professions. Professions help separate two people with the same name. Of course, if two people share the same first and last name and are both in sports, or both in software development, then it gets difficult; then, perhaps, age, job titles, professional affiliations and so on come into the picture.
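Here is a toy sketch of the two heuristics above, URL sections and profession keywords. Every keyword list, URL pattern and threshold is made up for illustration.

```javascript
// Classify a page by its URL section and by profession keywords, then use
// (name + profession) as the disambiguation key.
var PROFESSION_KEYWORDS = {
  sports:        ['football', 'quarterback', 'touchdown', 'league'],
  entertainment: ['singer', 'album', 'concert', 'billboard']
};

function classifyPage(url, text) {
  // Heuristic 1: trust the host/subdirectory, e.g. cnn.com/SPORT/ or sports.cnn.com.
  if (/\/sport|sports\./i.test(url)) return 'sports';
  if (/entertainment/i.test(url)) return 'entertainment';

  // Heuristic 2: count profession keywords and pick the category over a threshold.
  var best = null, bestCount = 0;
  for (var category in PROFESSION_KEYWORDS) {
    var count = 0, words = PROFESSION_KEYWORDS[category];
    for (var i = 0; i < words.length; i++) {
      if (text.toLowerCase().indexOf(words[i]) != -1) count++;
    }
    if (count > bestCount) { best = category; bestCount = count; }
  }
  return bestCount >= 2 ? best : 'unknown';  // threshold filters out noise
}

// Two pages about different people named "XYZ" end up with different keys:
// 'XYZ|sports' vs 'XYZ|entertainment'.
function personKey(name, url, text) {
  return name + '|' + classifyPage(url, text);
}
```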

Overall it’s a good and tough problem, but running set-oriented algorithms over the entire web is not going to be easy. The key is probably to rely more on the correlation of words within pages and on URL patterns than on trying to derive patterns across pages. Let’s see who wins the contest and how they solve it. Hopefully the solutions will be available for the public to look at.

Leave a comment

Filed under algorithms, people search

LinkedIn, PersonRank & Google’s PageRank

I just accepted a LinkedIn invitation. At the bottom of the invitation email, there is a fact:

“Fact: People with 20+ connections appear in LinkedIn search results 14.6x more often”

That got me thinking, and I ended up writing this article. As more and more people start using LinkedIn, and many just keep “networking” without really knowing the people they connect to at level 1, the whole LinkedIn system is going to break down at some point.

If Google hadn’t invented PageRank, search results would still be bad (like those of Yahoo! and MSN, which is now Live). So I was thinking along the lines of a PageRank scheme for LinkedIn: a system in which people linked from popular people themselves inherit some of that popularity. Just as Google’s homepage gets a perfect 10 out of 10 as its PageRank, say the most popular people on LinkedIn get a LinkedIn Rank of 10. I will get to what I mean by “most popular people” in a minute, and how that is measured, since this is critical for the scheme to work. Any person connected to one of these most popular people gets his or her rank increased a bit. So a person connected to more and more popular people ends up with a better rank.

Now, just going by how many connections a person has doesn’t really give a good picture, mainly because there are so many recruiters on LinkedIn with more than 500 connections. Does that mean these recruiters should get a better LinkedIn Rank than others? What about those who keep accumulating LinkedIn connections just for the heck of it? One guy, for example, from my alumni network and some ten years older than me, connected with me, and I later realized he did it just to promote his published books! He didn’t even bother to respond to the personal email I sent as part of accepting his invitation. I wish LinkedIn had a way to withdraw a previously accepted connection.

Anyway, back to the topic at hand. I think popularity should be determined by profile searches and visits: when people are looking for a particular profile, that person is likely to be more popular. In this case, how many connections a person has doesn’t matter directly (of course, connections matter indirectly in calculating rank through the scheme above of connections from popular people, but that wouldn’t be linear and cumulative). In fact, popularity can be a combination of page visits plus links to popular people. In other words, it’s a hybrid of Alexa’s Traffic Rank and Google’s PageRank.

Let’s call it PersonRank.
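As a back-of-the-envelope sketch, PersonRank could be computed roughly like this. All the people, numbers, weights and the damping choice below are invented purely for illustration.

```javascript
// A toy PersonRank calculation: mix profile visits (the Alexa-like part) with
// rank inherited from popular connections (the PageRank-like part).
var people = {
  alice:     { visits: 900, connections: ['bob', 'carol', 'recruiter'] },
  bob:       { visits: 50,  connections: ['alice', 'recruiter'] },
  carol:     { visits: 200, connections: ['alice', 'recruiter'] },
  recruiter: { visits: 10,  connections: ['alice', 'bob', 'carol'] }
};

function personRank(people, iterations) {
  var rank = {}, name;
  for (name in people) rank[name] = 1.0;   // start everyone equal
  for (var i = 0; i < iterations; i++) {
    var next = {};
    for (name in people) {
      var p = people[name];
      // Inherited popularity: a share of each connection's current rank,
      // diluted by how many connections that person spreads it over, so
      // piling up hundreds of connections is not linear and cumulative.
      var inherited = 0;
      for (var j = 0; j < p.connections.length; j++) {
        var c = p.connections[j];
        inherited += rank[c] / (people[c].connections.length || 1);
      }
      // Hybrid score: a traffic-like component plus a link-like component.
      next[name] = 0.5 * Math.log(1 + p.visits) + 0.5 * inherited;
    }
    rank = next;
  }
  return rank;
}

// personRank(people, 10) ranks alice well above the recruiter, even though
// the recruiter has the most connections.
```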

1 Comment

Filed under linkedin, social networking

Scaling by Table Partitioning

Recently I was talking to a friend who said that big internet companies like Yahoo! have proprietary databases in order to scale to a huge customer base. Obviously, with more than 50% of Yahoo!’s traffic coming from its email application (refer to http://www.alexa.com/data/details/traffic_details?url=yahoo.com), it needs to support a huge customer base.

While there are benefits to having proprietary data formats, I personally feel going for such schemes is not really a good idea, mainly because many of these schemes, which sacrifice the ACID properties of an RDBMS to achieve their additional speed, fall apart badly when it comes to generating aggregate reports. Unless you are a Yahoo! or a Google, the cost of writing and maintaining such code is usually not justified.

There are ways to use traditional databases and still achieve good performance. One of the techniques is data partitioning (table partitioning), a feature where the data in a table is split into multiple segments, each of which can potentially reside on a separate disk, thereby giving better IO throughput.

There are three types of table partitioning: hash partitioning, range partitioning and list partitioning. Below I go into detail about each of these and which one is best used for what purpose.

List Partitioning: If the set of values of a column is fixed, then list partitioning is useful. For example, one can assume that customer names can only start with the letters A to Z and hence have 26 different partitions. Think of this when you can write your column values as (a, b, c, d … fixed-list) and there are several records for each of these values.

Range Partitioning: If the set of values is large and the queries deal with a subset of these values, then range partitioning is usually the right choice. For example, order date is a good range partitioning candidate, since usually one is interested in all the orders placed in the last month, quarter or year. So, depending on the volume of orders, the partitions can be by month, quarter or year. Also, range partitioning is the only option that lets you keep adding partitions as needed. Think of this when you can write your column values as (a-b, c-d, e-f …) and there are several records within each of these ranges.

Hash Partitioning: Finally, hash partitioning is useful when a column has too many values but ranges are not really the right choice. For example, an account number, though it has ranges, is accessed randomly and so is a good candidate for hash partitioning. The idea is that a hashing function maps each value to one of the hash buckets.
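As a toy illustration of that last idea (real databases do this internally; the bucket count and the modulo hash are just for the sketch):

```javascript
// Map each account number to one of N buckets (partitions) with a hash function.
var NUM_PARTITIONS = 8;

function partitionFor(accountNumber) {
  // Simple modulo hash; randomly accessed accounts spread evenly over buckets.
  return accountNumber % NUM_PARTITIONS;
}

// partitionFor(1000234567) -> 7, partitionFor(1000234568) -> 0, and so on.
// Each bucket can live on its own disk, so a lookup touches only one
// partition and concurrent IO across partitions is possible.
```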

MySQL defines another type called key partitioning, but it is very similar to hash partitioning.

Using the right partitioning scheme and the right where clauses in queries can cut down the amount of IO quite a bit. And given that each partition can reside on a separate disk, it also provides scalability, since concurrent IO is possible.

Another benefit of partitioning is that the data gets clustered into the appropriate bucket. In one scenario with a four-value list partitioning scheme, I was able to take advantage of this clustering factor to optimize IO even further.

1 Comment

Filed under performance tuning, VLDB

Commercial Interest & Open Source Interest Inversely Proportional?

Compiere, supposedly one of the most popular open source ERP and CRM applications, announced venture funding in 2006/06, as per

http://www.compiere.com/news/index.html

Now, looking at the statistics for the Compiere project on sourceforge.net at

http://sourceforge.net/project/stats/?group_id=29057&ugn=compiere&type=&mode=year

the web traffic has been declining gradually since that same time frame. Before that, the graph showed an upward trend. Does this tell us anything?

Leave a comment

Filed under compiere, open source