Monthly Archives: March 2007

CGI HTML Compression

These days, with broadband connections, the size of the HTML being served shouldn’t really matter, right? Wrong! Trust me, when you have a 200 KB file to serve, compressing it will definitely help your audience. And perhaps even you, if you are constrained by the bandwidth limits of your hosting plan. Some hosting plans provide compression by default. If yours does not, you may have to do it on your own, at the application layer rather than the web server layer.

Here is some Perl code that shows how this can be achieved.


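use strict;
use warnings;
use Compress::Zlib;

# Does the client advertise gzip support in its Accept-Encoding header?
my $compress = (($ENV{HTTP_ACCEPT_ENCODING} || '') =~ /\bgzip\b/);
my ($gz, $printf);

# Unbuffer STDOUT so the headers go out before any compressed data.
$| = 1;

if ($compress) {
  print "Content-Type: text/html\n";
  print "Content-Encoding: gzip\n\n";
  binmode STDOUT;
  $gz = gzopen(\*STDOUT, "wb") or die "Cannot open gzip stream on STDOUT";
  # gzwrite takes a single buffer, so join the arguments the way print would.
  $printf = sub { $gz->gzwrite(join '', @_); };
}
else {
  print "Content-Type: text/html\n\n";
  $printf = sub { print @_; };
}

# At the very end of the script, flush the compressed stream:
#   $gz->gzclose() if $gz;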


That's it. Wherever the rest of your code has something like

 print "Hello world";

change it to 

 &$printf("Hello world");

and you are done!

So, what does all the code above mean? First and foremost, not all browsers may support compression. Hence, we need to make sure our strategy works for both kinds of user agents: those that recognize gzip and those that don’t. This information is in the Accept-Encoding request header, which CGI exposes as the HTTP_ACCEPT_ENCODING environment variable. If it contains gzip, we know the client can accept a compressed response. Based on that, the response headers tell the client whether we are serving the content in compressed mode or plain text mode.

Next, based on that, the function stored in the variable $printf either prints plain text or streams compressed output using Compress::Zlib. That’s pretty much all there is to it.
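For readers who don't use Perl, here is a minimal Python sketch of the same negotiation. It is not a translation of the script above, just the same idea under stated assumptions: make_writer is a hypothetical helper name, and a BytesIO buffer stands in for the real output stream so the result can be inspected.

```python
import gzip
import io


def make_writer(accept_encoding, out):
    """Pick a writer based on the client's Accept-Encoding value.

    Returns (headers, write, finish). `out` is any binary file-like object.
    make_writer is a hypothetical helper, not part of any library.
    """
    if "gzip" in (accept_encoding or "").lower():
        gz = gzip.GzipFile(fileobj=out, mode="wb")
        headers = ["Content-Type: text/html", "Content-Encoding: gzip"]
        return headers, lambda text: gz.write(text.encode("utf-8")), gz.close
    headers = ["Content-Type: text/html"]
    return headers, lambda text: out.write(text.encode("utf-8")), lambda: None


# Simulate a gzip-capable client. In a CGI script the value would come from
# os.environ.get("HTTP_ACCEPT_ENCODING") and `out` would be sys.stdout.buffer.
body = io.BytesIO()
headers, write, finish = make_writer("gzip, deflate", body)
write("Hello world")
finish()  # flushes the gzip trailer; skipping this leaves a truncated stream
```

As in the Perl version, the rest of the program only ever calls write, so it does not need to know which mode was chosen.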

Now, I believe that with advanced Perl features like tie, it’s possible to get rid of the ugly &$printf statements and keep the original print statements as is. I need to learn some more Perl to get there. But since I needed to do this for only one existing Perl script, I just converted all the print statements to &$printf(); statements. If I invest more time in learning tie and have some concrete code, I will post it some day.

So, finally, for my specific use case, files of up to 180k are reduced to about 15k. That’s roughly a 12-fold reduction! I could see a noticeable difference in page rendering. One note, though: compression puts additional load on the server while it eases some of the network load.

Note: Window-based compression like LZ77, which gzip’s DEFLATE format combines with Huffman coding, is what really made this efficiently possible, since it can compress content as it is being written to the wire. Imagine if we only had Huffman encoding.
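To illustrate compressing content as it is written, here is a small Python sketch using zlib's streaming interface. The page fragments are made up for the example, and a bytearray stands in for the network connection.

```python
import zlib

# wbits=31 asks zlib for a gzip container around the DEFLATE stream.
compressor = zlib.compressobj(wbits=31)

# Made-up page fragments, produced one at a time as a script would print them.
chunks = ["<html><body>"] + ["<p>Hello world</p>"] * 1000 + ["</body></html>"]

sent = bytearray()  # stands in for the bytes written to the wire
for chunk in chunks:
    # compress() may return an empty bytes object while zlib buffers input.
    sent += compressor.compress(chunk.encode("utf-8"))
sent += compressor.flush()  # emit whatever is still buffered, plus the trailer

original = "".join(chunks).encode("utf-8")
restored = zlib.decompress(bytes(sent), wbits=31)
```

The compressor never needs the whole page at once, which is what lets a CGI script start sending compressed output before it has finished generating the page.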

Filed under cgi

MySQL: COUNT DISTINCT vs DISTINCT and COUNT

Today I had the pleasure (or the pain?) of tuning a bunch of SQL statements written for MySQL (5.0.26). The one that bothered me most took about 210 seconds, and it appeared to be a very innocent query except that its where clause was useless and it ended up doing a full table scan. It was of the form

select a,b,count(distinct c),count(distinct d) from a-bunch-of-tables-and-where group by a,b;

To rule out the full table scan as the cause, I tried a simple query without the distinct in the count and, to my surprise, it returned in under 3 seconds. Not bad for more than half a million rows!

Now, this difference of 210 versus 3 seconds really worried me. After a bit of searching, I came to realize that this is currently a limitation of MySQL, as mentioned at http://forge.mysql.com/worklog/task.php?id=3220

So, I changed the query to the form

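select a,b,count(distinct c),count(d) from (select distinct a,b,c,d from ... where ...) t group by a,b;

and the query started returning results in 12 seconds. (MySQL requires an alias, here t, on the derived table. Also note that count(d) on the de-duplicated rows equals count(distinct d) only when each d value appears with a single c value within a group.) Of course, 12 seconds is no good, but as the query is for an aggregate report, I am fine with it.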
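When rewriting a query this way, it is worth checking that the two shapes return the same counts on your data. The sketch below does that on a tiny made-up table. It uses SQLite through Python's sqlite3 module purely as an assumption for convenience, so it says something about the semantics of the rewrite and nothing about MySQL's performance; the table name t and alias s are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a, b, c, d)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        ("x", 1, "c1", "d1"),
        ("x", 1, "c1", "d1"),  # exact duplicate row
        ("x", 1, "c2", "d2"),
        ("y", 2, "c3", "d3"),
        ("y", 2, "c4", "d3"),  # same d paired with two different c values
    ],
)

# Original shape: two COUNT(DISTINCT ...) aggregates over the raw rows.
original = conn.execute(
    "SELECT a, b, COUNT(DISTINCT c), COUNT(DISTINCT d) "
    "FROM t GROUP BY a, b ORDER BY a, b"
).fetchall()

# Rewritten shape: de-duplicate in a derived table, then aggregate.
rewritten = conn.execute(
    "SELECT a, b, COUNT(DISTINCT c), COUNT(d) "
    "FROM (SELECT DISTINCT a, b, c, d FROM t) AS s "
    "GROUP BY a, b ORDER BY a, b"
).fetchall()

for before, after in zip(original, rewritten):
    print(before, after, "same" if before == after else "DIFFERENT")
```

On this data the group ("x", 1) agrees in both forms, while ("y", 2) does not, because d3 occurs with two different c values. If your data can look like that, keeping count(distinct d) in the outer query preserves the original meaning.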

In general, all the SQL statements I tuned today required using sub-queries. Having come from the Oracle database world, I found that things I took for granted weren’t working the same way with MySQL. And my reading on MySQL tuning makes me conclude that MySQL is way behind Oracle in terms of optimizing queries. While the simple queries required for most B2C applications may work well on MySQL, most of the aggregate reporting queries needed for Intelligence Reporting seem to require a fair bit of planning and re-organizing to guide MySQL to execute them faster. With the Oracle CBO, that’s usually not the case. Things are far more intuitive and easy in the Oracle world.

If anyone has any other stories of their performance tuning experience with MySQL, feel free to comment on them.

High Performance MySQL is the latest book on MySQL Performance tuning.

11 Comments

Filed under MySQL, performance tuning

News Cloud

How about converting news feeds into a keyword cloud? I did some research, and this has already been done for quite some time. The notable entries are

fserb.com.br/newscloud/ which creates a news cloud out of Google News.

newzingo.com is another site that does news clouds, not only for Google News but also for other sites like Slashdot.

tocloud.com also uses Google News to create the news cloud. However, only phrases are presented. The advantage of a cloud made of phrases over one made of keywords is that phrases give more context. This is especially important for a text source like news, which changes continuously and where a keyword’s importance is very temporal.

Filed under news cloud, Web 2.0

Buy vs Build, Is Open Source The Third Alternative?

I recently saw an advertisement when doing some research on open source software. The ad said something like

IT Solutions Using Open Source Software
– Learn about IT solutions assembled with open source
– a third alternative to “build versus buy.”

These days everyone is using the term open source, and it seems to be losing its original spirit. For large and small companies, startups and venture capitalists, open source is now nothing but a marketing tactic.

Is open source really a third alternative? What is the real cost of implementing a software solution? When most of the large software companies make more of their revenue from support than from license fees, what does that tell us? Any smart CIO should know that the real cost of a software solution is not merely the cost of building or buying it but the cost of supporting it.

In fact, it’s this realization that is partly driving many smaller companies, who are otherwise at a disadvantage when competing with larger players, to offer their software as open source, under a license that prevents others from modifying and distributing it unless they give their changes back. The hope is that people will download the solution because it’s free, try it out, perhaps even start using it, and then start paying for support.

Since IT organizations don’t need to bother about distributing the software, they can perhaps make any modifications they like and use it in house. So the GPL-like clause is perhaps not a cause for concern for IT departments. But if someone in the IT department is making changes to the open source software, that’s no different from “building” it. Maybe not from scratch. But in that case, is there really a need for a third-party consulting company to provide those open source IT solutions? One interesting question with the GPL-like clause is what happens if someone develops extensions but doesn’t distribute them freely, providing them only to organizations that want to buy them. In that case, the IT organizations are still “buying” it.

With venture capital pouring into open source software, highly political corporate veterans, who never knew what GPL means without consulting the corporate lawyers, are also jumping on the open source bandwagon. One of the strengths of open source is that people like to contribute because it’s open and anyone can study the code. But most successful open source software is really what can be termed systems software, such as an operating system, a web server or a programming language. These days, though, the open source mantra is extending to enterprise applications. Which college kid would want to write a manufacturing application or a CRM application? Especially when they know that behind those so-called open source enterprise applications there are a bunch of people backed by venture funding who are the ones really making a living out of those apps.

Also, the complexity of enterprise applications doesn’t come from just the technology but also from the functionality. Start a small application with 10 tables, and one person can handle it. Say it grows to 100 tables; soon adding any new feature, however small, starts taking much, much more time. Enterprise software is fundamentally complex, not necessarily because of technology or the lack of it, but because there are several ways of doing business, writing code, and modeling the data and the flows. And as one tries to make more and more of the various enterprise roles and departments work seamlessly together, it gets tougher and tougher to manage. It doesn’t scale linearly. Think of it as solving a 20-piece jigsaw puzzle versus a 5000-piece one. They don’t scale linearly in either resources or time and effort. The same goes for building enterprise software that works harmoniously. People often blame the architecture. But good and flexible architectures can only help so much.

Of course, using the new web 2.0 tools, like Wikipedia-style documentation, to collaborate and capture the vast amounts of knowledge (and wisdom) in the brains of all the great people who originally contributed to the software does help to some extent. But that’s something every vendor will do, whether it offers closed source or open source.

There is no free lunch. This fundamental economic principle holds in the new economy just as in the regular economy. What open source really has done, is doing and will do is keep software prices more transparent, or provide a reality check. In the absence of an open source database, a closed source vendor can charge whatever it wants, and the same goes for a middle-tier web server or an end-user application. In the world of open source, there is no place for politically motivated or manipulative management.

2 Comments

Filed under open source

Keyword/Tag Cloud Suggest – A mashup

What happens if a keyword cloud or a tag cloud is mixed with Google Suggest? You get a Cloud Suggest. The motivation is this: say you have a tag cloud for your blog. The cloud gives you and your readers a quick idea of the topics you mostly cover. But what if you or your readers want to know the popular searches related to those tags? That’s where you can make use of Google Suggest. ToCloud.com seems to be the first keyword cloud generator with this Cloud Suggest idea. Once a cloud is generated, clicking on any word or phrase opens a popup that fetches suggestions for it from Google Suggest.

Filed under keyword cloud, mashup, tag cloud, Web 2.0

Advanced Keyword Cloud Features

Creating a keyword cloud from a page shouldn’t be that hard as it just involves breaking up the text into words, then counting the frequency of the words and then finally displaying them as a cloud. That’s it. Right? Wrong!
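The naive version described above (split, count, display) can be sketched in a few lines of Python. The function name keyword_cloud, the font-size range and the sample sentence are all made up for illustration; this is not how any particular cloud generator works.

```python
import re
from collections import Counter


def keyword_cloud(text, min_size=10, max_size=32):
    """Return {word: font_size}, scaling size linearly with frequency."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid dividing by zero when all counts are equal
    return {
        word: min_size + (max_size - min_size) * (count - lo) // span
        for word, count in counts.items()
    }


cloud = keyword_cloud("the cloud shows the words the page uses")
```

Even this tiny example hints at why the naive approach falls short: the largest word in the result is "the", and everything was lowercased along the way.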

A keyword cloud can be made more sophisticated. Some of the features to keep in mind are

1. preserving case for abbreviations. So, for example, if there is a web page about SEO (Search Engine Optimization), a cloud of that page should display it as SEO and not as seo. This is very important, as people are used to seeing abbreviations in upper case, not lower case.

2. displaying the keywords in the order of their occurrence in the page. Wondering why this may be useful? Say you have a blog that shows the most recent posts at the top of the page. You may then want a cloud that lists the keywords of your recent articles first, followed by the rest.

3. extracting phrases, which is one of the most difficult parts of keyword cloud generation. ToCloud.com now has the capability to extract meaningful keyword phrases from a page, and so my blog’s keyword cloud now shows keyword phrases (click the My Blog To Cloud link to see this in action).
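The three features above can be roughed out in Python as follows. This is only a sketch under simple assumptions, not ToCloud.com's method: cloud_entries and bigram_counts are hypothetical names, an abbreviation is taken to be any word the page spells in all capitals at least once, and "phrases" are just adjacent word pairs, which ignores sentence boundaries and stop words.

```python
import re
from collections import Counter


def cloud_entries(text, top=5):
    """Return (display_form, count) pairs in order of first occurrence."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    counts = Counter()
    forms = {}       # lowercase key -> Counter of the spellings seen
    first_seen = {}  # lowercase key -> position of first occurrence
    for position, token in enumerate(tokens):
        key = token.lower()
        counts[key] += 1
        forms.setdefault(key, Counter())[token] += 1
        first_seen.setdefault(key, position)

    def display(key):
        # Prefer an all-uppercase spelling if the page ever used one.
        for spelling in forms[key]:
            if spelling.isupper() and len(spelling) > 1:
                return spelling
        return key

    top_keys = [key for key, _ in counts.most_common(top)]
    top_keys.sort(key=first_seen.get)  # page order rather than frequency order
    return [(display(key), counts[key]) for key in top_keys]


def bigram_counts(text):
    """Naive phrase candidates: counts of adjacent word pairs."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(zip(words, words[1:]))


sample = "Tag cloud tips: SEO basics. A tag cloud helps seo, and Seo helps a tag cloud."
entries = cloud_entries(sample, top=3)
phrases = bigram_counts(sample)
```

Counting is still done case-insensitively, so SEO, seo and Seo add up to one entry; only the display form keeps the capitals.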

1 Comment

Filed under keyword cloud, Web 2.0

GAFYD can cut the IT staff

GAFYD, which stands for Google Apps For Your Domain, is pretty cool. It provides the ability to run a bunch of common applications off your own domain. The apps available are email, calendar, chat and webpages. These apps are good enough for most mom-and-pop shops and even small businesses.

GAFYD started in 2006. Initially, there was no option to register your own domain as part of signing up for GAFYD. But now they have that option, and they offer it by partnering with eNom and GoDaddy. The good thing about getting your own domain as part of GAFYD is that they charge only $10/yr, and that includes making your information anonymous in the whois so that you don’t get spam. In addition, without registering a domain, you get only up to 25 email accounts (each with 2 GB of space, similar to Gmail). But with domain registration, you get up to 200.

Currently, one missing app in GAFYD is the ability to blog and link it to your own domain. Hopefully this will soon be included.

With a reliable email service that has Gmail’s excellent user interface, and similarly reliable web space, there is no need to maintain these two servers in house. No need to maintain backups either. So that essentially reduces some of the in-house IT operations. For $10 a year, the standard service is definitely worth it.

Oh, forgot to mention, they also made their Spreadsheet & Docs available as part of GAFYD. Check out the list of apps and more info about them at

http://www.google.com/a/help/intl/en/users/user_features.html

Filed under Enterprise Applications, IT