Monthly Archives: May 2008

Why I (rarely) Hate Java?

I have spent too much time learning Java’s many libraries to actually hate it. Likewise, I know Perl well enough that I am reluctant to learn other scripting languages like Python. If I can write CGI scripts in Perl, why learn PHP or Ruby (on Rails or otherwise)?

I had a strange out-of-memory (heap) error, and reviewing the Java code turned up nothing obvious that hinted at a memory leak. It turned out that my regular expression matching was causing the memory problem, which in turn was caused by String.substring.

If you look at the Java source code, in java/lang/String.java, you will notice some comments on the

public String(String string)

implementation. First, why would anyone want to create a String from another String? Here is the reason: when you take a substring of a string in Java, it doesn’t actually create a separate array to store that substring. Instead, the original string’s array is shared, and an offset and a length track the substring. This implementation is possible in Java because strings are immutable.
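To make the sharing concrete, here is a simplified model of that layout. It is only a sketch: SharedString, compactCopy and retainedChars are names I made up for illustration, not the real java.lang.String internals. As far as I know, JDKs from 7u6 onward copy the characters in substring, so this models the older behaviour described above.

```java
import java.util.Arrays;

/** Hypothetical model of a string whose substrings share one backing array. */
public class SharedString {
    private final char[] value; // backing array, possibly shared with other instances
    private final int offset;   // index in value where this string starts
    private final int count;    // number of characters in this string

    public SharedString(String s) {
        this(s.toCharArray(), 0, s.length());
    }

    private SharedString(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    /** Returns a view on the same array; no characters are copied. */
    public SharedString substring(int begin, int end) {
        if (begin < 0 || end > count || begin > end) {
            throw new StringIndexOutOfBoundsException();
        }
        return new SharedString(value, offset + begin, end - begin);
    }

    /** Returns an equal string backed by a fresh array of exactly count chars. */
    public SharedString compactCopy() {
        return new SharedString(Arrays.copyOfRange(value, offset, offset + count), 0, count);
    }

    /** Size of the array this instance keeps reachable. */
    public int retainedChars() {
        return value.length;
    }

    @Override
    public String toString() {
        return new String(value, offset, count);
    }

    public static void main(String[] args) {
        // Build a large fake page with a small token near the end.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100000; i++) {
            sb.append('x');
        }
        sb.append("token");
        SharedString page = new SharedString(sb.toString());

        SharedString token = page.substring(100000, 100005);
        System.out.println(token + " retains " + token.retainedChars() + " chars");

        SharedString copy = token.compactCopy();
        System.out.println(copy + " retains " + copy.retainedChars() + " chars");
    }
}
```

The point of the model is that a five-character view still keeps the whole page array reachable, while the compact copy keeps only its own five characters.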

In my case, I had been fetching a bunch of large HTML pages, doing some pattern matching, extracting some tokens, and keeping them in an array. The tokens themselves are small, which is why my initial code review assumed I was consuming very little memory. But because those tokens were substrings of the entire HTML page, each one kept the whole page’s array alive, and memory consumption turned out to be very high. The tokens were actually the return values of java.util.regex.Matcher.group(1). So, instead of adding the return value directly to the array, I created a string

String token = new String(matcher.group(1));

and then added it. This solved the memory problem.
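A fuller sketch of that fix might look like the following. The class name TokenExtractor, the method extractTokens, the sample page and the pattern are all assumptions for illustration, not my original code; only the new String(matcher.group(1)) call is the actual fix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that collects group(1) of every match as an independent String. */
public class TokenExtractor {

    /**
     * Finds every match of regex in page and returns a copy of capture group 1.
     * The regex must contain at least one capturing group.
     */
    static List<String> extractTokens(String page, String regex) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(page);
        while (matcher.find()) {
            // Copy the token so it does not share the page's character array.
            // On JVMs where substring already copies, this is redundant but harmless.
            tokens.add(new String(matcher.group(1)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String page = "<html><title>First</title> lots of markup <title>Second</title></html>";
        for (String token : extractTokens(page, "<title>([^<]*)</title>")) {
            System.out.println(token);
        }
    }
}
```

Once the tokens are copies, the large page string can be garbage collected as soon as nothing else references it.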

Granted, you usually don’t need to know about memory management when using Java, as the garbage collector takes care of things for you. But now and then you run into this type of issue, which requires a little more digging (not the social type).


Filed under Java

Why I like Googlebot and not Yahoo! Slurp

If you are a webmaster who monitors your website statistics periodically, you know that a bunch of crawlers, mostly from Google and Yahoo!, visit your website. One thing I noticed is that Googlebot typically visits from a single IP address on any given day (frequency and IP variation may depend on the popularity of the website), while Yahoo! Slurp visits the site from multiple IP addresses. I think Yahoo! does this to parallelize its crawling. However, between parallelizing the crawl within one site and parallelizing it across multiple sites, the latter is probably preferable for a few reasons. One is that the website being crawled needs to expend fewer resources (think of keepalive and no concurrent crawler connections). Another is that, if you use ordinary web statistics software that cannot filter out crawler visits, the visitor count will be inflated when the site is crawled from multiple IPs. Also, the latest visits report in my website’s cPanel groups visits by IP address, so there are too many entries for Yahoo! while Googlebot gets a single consolidated entry. I wonder if there is a way to specify the maximum number of crawlers per bot.


Filed under bots, Googlebot, Yahoo! slurp

Can we hibernate, seriously?

I come from what some may consider the old school: a development environment where a lot of SQL is hand-coded and well tuned. So, to me, using a generic framework that understands the database schema from some metadata and automatically translates a generic syntax into the target database’s SQL is a bit concerning. I have done performance tuning on Oracle, MySQL, SQLite and Derby. In my experience, abstracting SQL generation so that the same definition runs on all of these databases is probably not that difficult for simple cases. For anything more serious, such as complex reporting SQL, not all databases behave the same way with the same form of SQL, even after adjusting for their syntactic differences. For example, see my articles MySQL COUNT DISTINCT vs DISTINCT and COUNT, SQLite Join and Group By vs Group By and Join, and Experience with Derby. Each of these databases required restructuring the SQL statements so fundamentally that a generic ORM (object-relational mapping) framework such as Hibernate will not be sufficient for complex SQL.

Depending on the application, I would say 60 to 80% of the SQL statements could be very simple, mostly providing basic CRUD services, and can be handled by generic frameworks. The remaining 20 to 40% is where it may be desirable to hire database experts to hand-tune the SQL.


Filed under Hibernate, MySQL, Oracle, ORM, performance tuning, SQL Tuning, SQLITE