Category Archives: Tech – Tips

UTF-16 to UTF-8

I recently downloaded a CSV report from a report-generating system and tried opening it in vi, and all I saw was garbage, as if it were a binary file. In Notepad it looked fine, as a plain text file should. I thought there was a problem with the FTP transfer, so I tried things like zipping the file before transferring it, but nothing worked. Opening it with emacs didn’t work either. Then I tried a Notepad-like application on Linux, and the first thing it did was show an error saying that it didn’t understand the encoding and asking me to pick one. I picked UTF-16 and the content showed up correctly.

Well, now that I knew what the problem was, how do I convert it to UTF-8? I needed UTF-8 because the version of perl I was using didn’t support various encodings (research on the web suggested that perl needs to be compiled with the perlio option or something like that, which wasn’t the case for me). So I used Java instead. It’s really very simple. Here is what it looks like.

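import java.io.*;

public class UTF16toUTF8 {
  public static void main(String[] args) throws IOException {
     // Decode the input file as UTF-16
     BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(args[0]), "UTF-16"));
     // Encode the output explicitly as UTF-8 instead of relying on the platform default
     PrintStream out = new PrintStream(System.out, true, "UTF-8");
     String line;
     while ((line = br.readLine()) != null)
        out.println(line);
     br.close();
  }
}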

That’s it. This simple piece of code was a real time saver for me.


Filed under Tech - Tips, UTF

HTTP Response Status Code 304

If you have a dynamic website that is search engine friendly, chances are that your entire database of some entity, such as a list of products, is exposed in a way that bots can easily reach by crawling. This comes at a cost. The bots keep crawling regularly, even if at a slow rate, and you have to render those dynamic pages one after the other, costing you both bandwidth and CPU cycles. To avoid this, you can write your dynamic pages so that they are “last modified” aware. Say you have a product page that lists the product details as well as any comments by users. While most of the product details are relatively static, content such as comments may change every few days, especially for a product in the first few months after launch. Of course, if you do dynamic pricing, that’s different. But even then, people seldom search for an exact price string (people want to know if a product is available below a certain price, not at an exact price), so don’t worry that your latest price is not indexed. Besides, between the time it’s indexed and the time it shows up in the search results, your price may have changed yet again.

So, how do you make your dynamic page “last modified” aware? Bots use a conditional HTTP GET request that passes an If-Modified-Since header with a specific date. The bot is essentially asking you to respond with the full content only if the content has been modified since that date. Otherwise, you can just respond with a status code of 304 (Not Modified) and no body, which tells the bot that there is no change. Search engine crawlers like Google’s, which keep track of the last time they crawled each page, use these conditional requests so that they can save the same bandwidth and CPU cycles as you.
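As a rough illustration of the decision described above, here is a minimal JavaScript sketch. The function name `conditionalGet` and the shape of its return value are made up for this example; how you obtain the header value and the last-modified date depends on your own server and data model.

```javascript
// Decide how to answer a (possibly conditional) GET request.
// ifModifiedSince: raw value of the If-Modified-Since header, or undefined.
// lastModified: Date at which the page content last changed (an assumption:
// your application has to be able to supply this cheaply).
function conditionalGet(ifModifiedSince, lastModified) {
  // HTTP dates have one-second resolution, so compare whole seconds.
  const modifiedSec = Math.floor(lastModified.getTime() / 1000);
  const headers = { "Last-Modified": lastModified.toUTCString() };

  if (ifModifiedSince) {
    const since = Date.parse(ifModifiedSince);
    if (!Number.isNaN(since) && modifiedSec <= Math.floor(since / 1000)) {
      // Nothing changed since the bot's last visit: no need to render the page.
      return { status: 304, headers: headers, renderBody: false };
    }
  }
  // No condition, an unparsable date, or newer content: send the full page.
  return { status: 200, headers: headers, renderBody: true };
}
```

The saving comes from checking the date before doing any of the expensive page rendering, so the last-modified lookup itself should be cheap.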

How do you know if you are making use of this functionality? It’s easy. Check your log files and see if you have any 304 response codes for requests from Googlebot or other search engine bots. If you always see 200 and never 304, then you are not using this feature.
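The log check can be sketched in a few lines of JavaScript. This assumes access log lines in the common/combined log format, where the status code follows the quoted request line, and it identifies the bot by a simple user-agent pattern; both the function name `countBotStatuses` and the default pattern are assumptions for this example.

```javascript
// Count response status codes for log lines that match a bot pattern.
// logLines: array of access log lines (combined log format assumed).
// botPattern: regular expression identifying the bot in the line.
function countBotStatuses(logLines, botPattern = /Googlebot/i) {
  const counts = {};
  for (const line of logLines) {
    if (!botPattern.test(line)) continue; // not a request from the bot
    // The status code is the 3-digit number right after the quoted request.
    const m = line.match(/" (\d{3}) /);
    if (m) counts[m[1]] = (counts[m[1]] || 0) + 1;
  }
  return counts;
}
```

If the result never contains a `304` key for a bot that visits regularly, the pages are probably not answering conditional requests.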


Filed under Search Indexing, Tech - Tips

User Input Sanitizing Before Using In Regular Expression

Say you want to take input from a user, filter a set of text (titles/descriptions) and show the reduced list to the user. Assuming the users are not sophisticated enough to write regular expressions, the requirement is simply to ensure that whatever the user typed appears somewhere within the text.

A simple way to do this is just look for the exact substring and see if the index is greater than or equal to 0 (something like str.indexOf(input) >= 0).

Say you want the match to be case-insensitive. Then the plain substring approach will not work as is (unless you convert both strings to the same case first). One option is to switch to regular expressions. In JavaScript, this would become,

var re = new RegExp(input,"i"); // the flag i indicates case insensitive
if(re.test(str)) { /* do-some-thing-here */ }

So far, so good. Now, what happens if the user types in some special characters that are typically used in regular expressions as some special control characters? For example,

‘(‘ and ‘)’ are used for grouping and capturing
‘[‘ and ‘]’ are used for character sets
‘{‘ and ‘}’ are used to indicate the cardinality
‘\’ is the escape character
‘*’ is used to indicate 0 or more
‘+’ is used to indicate 1 or more
‘?’ is used to indicate 0 or 1
‘.’ matches any single character
‘|’ separates alternatives
‘^’ and ‘$’ anchor the match to the start and the end

In this scenario, if someone types in a string with any of the above characters, then the above JavaScript will either throw an error (for example, on an unbalanced ‘(’) or match something other than what the user typed. To fix this, the special characters in the user input should first be escaped so that they are matched literally. This can be done using

var rere = /[.*+?^${}()|[\]\\]/g; /* all the special characters; inside the character class only ']' and '\' themselves need a '\' before them */
var reinput = input.replace(rere,"\\$&"); /* $& stands for the matched character, so each special character gets a \ before it */
var re = new RegExp(reinput,"i");
if(re.test(str)) { /* do-some-thing-here */ }
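Put together as reusable helpers, the idea might look like the sketch below. The names `escapeRegExp` and `filterByInput` are made up for this example, and the sample titles are illustrative only.

```javascript
// Put a backslash before every character that is special in a regular
// expression, so that the text is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Return the items that contain the user's input, ignoring case.
function filterByInput(items, input) {
  const re = new RegExp(escapeRegExp(input), "i");
  return items.filter(function (item) { return re.test(item); });
}

// Example: a search for "(40" no longer breaks on the unbalanced parenthesis.
const titles = ["Panasonic TH-42PZ700U", "Sony (40 inch)", "Samsung"];
const matches = filterByInput(titles, "(40");
```

Without the escaping step, the input `(40` would make `new RegExp` throw because of the unterminated group.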

Now, if you are generating the above JavaScript code in perl, it gets a bit more complicated. Why? Because ‘\’ has escape semantics in perl strings as well, and the $ symbol has a special meaning in perl too. So each ‘\’ above would have to be doubled, and each $ would have to be escaped as well.

Yes, it gets confusing, but it’s doable. You can see this in action at Flat Panel Plasma/LCD HDTVs. It contains a search box on top of the image cloud that lets the user enter a string such as Panasonic, Samsung, Sony etc., or even product numbers with special characters like TH-42PZ700U, to find the corresponding product in the image.


Filed under javascript, perl, regular expressions, Tech - Tips

LWP::UserAgent And Fetching UTF8 encoded XML (RSS/Atom Feeds)

LWP::UserAgent is supposed to be smart enough to parse the headers (by default) and figure out the encoding. However, some of the XML responses out there don’t send the right header information. Instead, they rely on the encoding attribute of the XML declaration. For example


<?xml version="1.0" encoding="utf-8"?>

It could be utf-8 or some other encoding. So, one way to deal with this is using the following code


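my $content = $response->content;
my $encoding = 'utf8'; # assume this is the default
# look for the encoding only inside the XML declaration; allow single or double quotes
if ($content =~ /<\?xml[^>]*\bencoding\s*=\s*["']([^"']+)["']/) {
    $encoding = $1;
}
$content = $response->decoded_content(charset => $encoding);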

That’s pretty much it.
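The same sniffing idea, sketched in JavaScript for comparison. This is not part of LWP; the function names `sniffXmlEncoding` and `decodeXml` are made up for this example, and it assumes an ASCII-compatible encoding (so the declaration itself can be read byte by byte) and a runtime that provides `TextDecoder` for the declared encoding.

```javascript
// Read the encoding attribute from the XML declaration at the start of a
// byte array, falling back to a default when there is none.
function sniffXmlEncoding(bytes, fallback = "utf-8") {
  // Only the first bytes matter; read them one character per byte.
  let head = "";
  const limit = Math.min(bytes.length, 200);
  for (let i = 0; i < limit; i++) head += String.fromCharCode(bytes[i]);
  const m = head.match(/<\?xml[^>]*\bencoding\s*=\s*["']([^"']+)["']/i);
  return m ? m[1].toLowerCase() : fallback;
}

// Decode the whole document using the sniffed encoding.
function decodeXml(bytes) {
  return new TextDecoder(sniffXmlEncoding(bytes)).decode(bytes);
}
```

A feed that declares no encoding at all is decoded as UTF-8 here, mirroring the default assumed in the Perl snippet.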


Filed under Tech - Tips

List of Available Perl Modules

I use a free hosting website for a domain of mine. I couldn’t find a webpage with all the available perl modules supported by the hosting company. So, I used the following script to figure out what modules are available.


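#!/usr/bin/perl
use strict;
use warnings;

$| = 1; # unbuffered output, so our lines stay in order with the output of find

print "Content-Type: text/plain\n\n";

foreach my $dir (@INC) {
    next unless -d $dir; # @INC can contain entries that are not existing directories
    print "$dir\n";
    # the list form of system avoids passing the path through a shell
    system('find', $dir, '-name', '*.pm');
}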

So, by accessing this page from the website, I could figure out all the modules available.


Filed under Tech - Tips, web hosting

Tag Cloud Algorithm/Logic/Formula

I wanted to implement a very efficient tag cloud generator. Initially I thought it’s a simple task, but realized making it efficient is a bit challenging. I came up with a bunch of ideas on how to do that and then searched on the web to find if there are any articles related to it. I noticed that most of them talk about how to divide the data into buckets, using some sort of a formula including logarithms etc. There are bits and pieces of code here and there, but somehow nothing excited me. So, let me put together some of my thoughts on this.

A tag cloud requires a tag and a number associated with that tag. That number is usually a metric. What’s so special about a tag cloud? Typically, information in business applications is presented as a table which can then be sorted. So, at any time, the user can sort by the name of the entity in the report or by the metric of that entity, for example by customer name or by the dollar amount spent by the customer. What a tag cloud offers is the ability to see the ordering of both the entity and the metric in a single visual representation. This is done by laying out the data in the order of the entity but changing the size/color intensity of each entity based on its metric value. As a result, while the user scans top to bottom (and left to right) for the alphabetical ordering of the entities, the user can also scan for font size/color intensity at the same time. So, an extra sort to see each ordering is avoided. Of course, for precise details, one has to sort explicitly by either the entity or the metric.

Now, the next question is how to vary this size/intensity with the metric. Is linear interpolation sufficient? Does it have to be logarithmic? This depends to a large extent on the data distribution. If the difference between the highest and the lowest value of the metric spans several orders of magnitude (a factor of 10^n), then logarithmic interpolation may help. However, it may not be worth showing every entity in the tag cloud; just the top N entities may be good enough. If we go with the top N approach, then the max and the min of the top N entities may not be that far apart, and in this case linear interpolation should suffice.

One reason I would caution against logarithmic interpolation is that it’s more expensive to compute, and if you are doing it in real time at huge volume, that’s going to be CPU intensive. So, try using the top N and linear interpolation.
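To make the difference between the two interpolations concrete, here is a small JavaScript sketch. The function names and the default 75% to 300% range are assumptions for this example; the logarithmic variant also assumes that all metric values are positive.

```javascript
// Map a metric m in [min, max] linearly onto a font size in [lo, hi] percent.
function linearSize(m, min, max, lo = 75, hi = 300) {
  if (max === min) return hi; // all tags have the same weight
  return lo + (hi - lo) * (m - min) / (max - min);
}

// Same mapping, but on the logarithm of the metric (requires m, min, max > 0).
function logSize(m, min, max, lo = 75, hi = 300) {
  return linearSize(Math.log(m), Math.log(min), Math.log(max), lo, hi);
}
```

With metrics ranging from 10 to 1,000,000, a tag with a metric of 1,000 stays at almost the minimum size under the linear mapping (about 75.2%), while the logarithmic mapping puts it at about 165%. When the top N values are close together, the two mappings give similar results, which is why the linear one is usually enough there.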

Next, in the linear interpolation, how do we set the min and max boundaries for the font size/color intensity? I notice that Amazon.com, for example, ranges its font sizes between 80% and 280%. So, the lowest tag in the cloud would get a font size of 80% and the highest tag 280%. I have decided to go with the following formula

150*(1.0+(1.5*m-maxm/2)/maxm)

This nicely gives a font size from 75% to 300% as the metric changes from a potential 0 to maxm. Check Tag Cloud Generator for this formula in action.
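As a quick check of the formula, here it is as a small JavaScript function (the name `fontSizePercent` is made up for this example). Algebraically it is the same as 75 + 225*m/maxm.

```javascript
// Font size in percent for a metric m, given the largest metric maxm (> 0).
function fontSizePercent(m, maxm) {
  return 150 * (1.0 + (1.5 * m - maxm / 2) / maxm);
}

const smallest = fontSizePercent(0, 200);   // 75
const middle = fontSizePercent(100, 200);   // 187.5
const largest = fontSizePercent(200, 200);  // 300
```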

Ok, if we go with this topN approach, then the next question is how do we get this top N? For this, one has to invariably write a SQL statement. Something like

“select entity,metric from fact order by metric desc” which gives all the entities.

One can refine this to restrict only to the topN by doing the following

“select entity,metric from fact order by metric desc limit 0,<n>” where you can plug in a particular number suitable for your application.

Now, with the above SQL, we obtained the top N entities. However, we want them in alphabetical order, as that’s how we want to display the cloud. How do we do this? One approach is to fetch them all first and then do a sort in the middle tier. Depending on the size of N and the number of middle tiers you have, you have to choose between doing this in the middle tier and doing it in the database. Assuming you have a single middle-tier server, doing it in the database (also a single server) may not be bad. So, the above SQL becomes

“select * from (select entity,metric from fact order by metric desc limit 0,<n> ) as topn order by entity” (MySQL requires an alias, here topn, for the inner query)

In the above configuration of a single middle-tier and a single database server, choosing to do this in the database gives the advantage of not having to create an array of records in the middle tier for the sort, as the sort is done in the database itself (which I am assuming has better sorting strategies). So, one can just loop through the result set and output the entities.

However, there is one small problem with this. By sorting the top N alphabetically in the database itself, we no longer get the max metric value as the first row. Without the max metric value, how do we calculate the size/intensity? Does it mean I have to get the result set into an array first and then scan through it to get the max? That would defeat the purpose of doing both sorts in the database as mentioned above.

With Oracle, it’s possible to use analytic functions and get the max of the entire set as a column in the query. But hey, most people out there are using MySQL for their web apps, aren’t they? So, what next?

That’s when I thought of using JavaScript to do the font size calculation on the client side! The idea is to loop through the result set and generate the HTML code, and in the process keep track of the max value and output it as a JavaScript variable that will be used in the client-side computation. When the tags are generated as links, use each link’s title attribute to carry the metric value. For example, the title may read “some description: <metric value>”.
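The generation step could be sketched as follows, written in JavaScript for illustration even though the middle tier may well be in another language. The names `renderCloud` and `escapeHtml`, the row shape `{ entity, metric }`, and the `href="#"` placeholder are all assumptions for this example; the rows are expected to arrive already sorted alphabetically by the query.

```javascript
// Escape text for safe use in HTML content and attribute values.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Emit one link per row in a single pass, tracking the max metric on the way,
// then emit the script call that hands the max to the client-side code.
function renderCloud(rows, id) {
  let max = 0;
  const links = [];
  for (const row of rows) {
    if (row.metric > max) max = row.metric;
    const name = escapeHtml(row.entity);
    links.push('<a href="#" title="' + name + ": " + row.metric + '">' + name + "</a>");
  }
  return '<div id="' + id + '">' + links.join(" ") + "</div>\n" +
    '<script>processCloud("' + id + '", ' + max + ");</" + "script>";
}
```

No array of records is needed for the sizing: each row is written out as it is read, and the max is only known, and only needed, once the last row has been emitted.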

Now, in the JavaScript, you can loop through each of the links, compute the font size, and set it on the link. A snippet of that function would look like

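function processCloud(id,max) {
  var cloud = getElement(id);
  if(!cloud || !(max > 0)) return; // nothing to scale without a cloud and a positive max
  var tags = cloud.getElementsByTagName("a");
  for(var i=0;i<tags.length;i++) {
    var tag = tags[i];
    var title = tag.getAttribute("title") || "";
    // the metric is the number after the ":" in the title
    var f = parseFloat(title.substring(title.indexOf(":")+1));
    if(isNaN(f)) continue; // skip links that carry no metric
    var fontSize = (150.0*(1.0+(1.5*f-max/2)/max))+"%";
    tag.style.fontSize = fontSize;
  }
}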

Here, getElement is a utility function that gets the element from the document based on a given id. So, your tag cloud can be placed in a div element with an id and that’s the id you pass to the processCloud function along with the max value that is computed as part of generating the html.

That’s it. This essentially does the following optimizations

1. Since we first sort by metric and limit only the top N elements, there is no need to bring in all the elements into the middle tier.
2. Since we then sort the data by name, there is no need to create an array in the middle tier and do the sort.
3. Finally, since the fontsize/intensity calculation is pushed to the client side, there is no need to create an array in the middle tier.

That’s all there is. Hope this helps in your application!


Filed under keyword cloud, tag cloud, tags, Tech - Tips, Web 2.0, word cloud

Borderless IFRAME

Google’s AdSense code is a bunch of JavaScript which actually generates an iframe. However, the AdSense region doesn’t appear to be a separate iframe; it looks like part of the overall page, well integrated into it. I was trying to get a similar effect. It worked in Firefox, but IE always rendered the iframe with a bevelled border. After a bit of searching, I found that the trick is to add style="border:0" to the body element of the page that the iframe loads (not the body of the page containing the iframe). I did this and it worked like a charm.


Filed under Tech - Tips