Monthly Archives: December 2007

MySQL: SELECT RANDOM ROW, Very Efficient

When living in a world of chaos, it shouldn’t be a surprise on the requirement to select a random row from a table. Randomness helps to make an otherwise static web page a bit more dynamic. Or it helps to rotate a banner or an ad or a forum post and so on.

I looked at the various methods of selecting random rows from a table and wanted to write down what I have done for the system I am working on.

I have a table where there is an ID field that is auto incremented. In addition, I already have a need to fetch the latest ID. This is mostly the MAX(id). However, I also have a status field that prevents rows with a certain status to not show up.

Here is how I ended up with fetching a random id.

1) First get the max id of the table. Something like

select id,... from tablex where status = 1 order by creation_date desc limit 0,1

Here, I have an index on creation_date. So, the cost of essentially executing the above query is traversing down a b-tree index from the maximum value side and resolving the rows in the table to filter by status and get the first row. Assuming most of the recent rows have a status of 1, the number of rows resolved using an index should be 1 or just a few.

2) Then generate a random value with the above fetched id as the max. Keep doing this till the value is != 0. This is because, the ID starts from 1 in my case. Also, it’s possible to have other variations such as the ID being more than x% of the max ID. This typically helps in fetching a more recent random row if that’s what is desired.

3) Now, just do
select id,* from tablex where status = 1 and ID = ? and bind the ID with the random ID generated in the previous step. There are two reasons why this
query may not result in a row. 1) the status of this row is not appropriate 2) the id
perhaps doesn’t exist in the table (may be it’s deleted). Assuming the chances of
these is almost negligible, it’s possible to find a row almost always immediately. The
cost of the above SQL is nothing but fetching a row using a unique index which is
quite efficient. Just to keep it generic, loop through steps 2 and 3, till a row is identified. In large data sets, it’s likely to always find a row eventually, but just in case, have contingency for not finding a row after, say N (perhaps 5) iterations.

That’s pretty much it. So, given that I already have a need to fetch the latest row
in the table, the cost of step 1 is not a factor for my random row selection. This just
left with issuing just another unique index based SQL. For a homepage that has to
display both the latest row(s) and a random row, this technique is quite efficient.

6 Comments

Filed under MySQL, performance tuning

The Prime Ministers: A Mom And Son, A Dad And Daughter

Wikipedia comes handy to do research when major events like yesterday’s (PST time, not WordPress’s GMT) Benazir Bhutto’s assasination happen. With this recent sad event comes a commonality between India and Pakistan.

India’s Prime Minister Indira Gandhi and her son Rajiv Gandhi who was also India’s Prime Minister were both assassinated.

And now,

Pakistan’s Prime Minister Zulfikar Ali Bhutto and his daughter Benazir Bhutto who was also Pakistan’s Prime Minister were executed (through a controversial trial) and assassinated respectively.

Some key facts, thanks to the collaborative, community driven and copyleft license from Wikipedia

Fact The Gandhis (India) The Bhuttos (Pakistan)
Indira Gandhi Rajiv Gandhi Zulfikar Ali Bhutto Benazir Bhutto
Picture
Birth Date Nov 19th, 1917 Aug 20th, 1944 Jan 5th, 1928 Jun 21st, 1953
Died Oct 31, 1984 66yrs May 21st, 1991 46yrs Apr 4th, 1979 51yrs Dec 27th, 2007 54yrs
Cause of Death Assassinated Assassinated Executed Assassinated
Prime Minister# 3rd and 6th 7th 10th ???
PM during 66/01/19-77/03/24
80/01/15-84/10/31
84/10/31-89/12/02 73/08/14-77/07/05 88/12/02-90/08/06
93/10/19-96/11/05
Age at First PM term 48yrs 40yrs 45yrs 35yrs
Higher Education   Imperial College London, part of the University of London, and at Trinity College, University of Cambridge   Lady Margaret Hall, Oxford, Radcliffe College, Harvard University

It took me a while to assemble this information inspite it’s readily available on Wikipedia.
Now if only all the Web 2.0 startups in the people “data mining” area such as Spock.com could do this automatically? Or may be the Freebase.com from Metaweb.com, the guys that are trying to create a semantic web.

Leave a comment

Filed under Benazir Bhutto, Indira Gandhi, Prime Minister, Rajiv Gandhi, Zulfikar Ali Bhutto

URL ReWrite & Relative URIs

Have you noticed how the permalink urls of wordpress contain the date as yyyy/mm/dd/post-title? And if you just type https://poeticcode.wordpress.com/ you get all the posts in that year and if you use // then all the posts within that month? This can be done using what’s called URL Rewriting. So, a URL such as

https://poeticcode.wordpress.com/2007/12/ gets rewritten to something like

https://poeticcode.wordpress.com/posts.do?year=2007&month=12 (this is not the exact link that wordpress uses).

This is usually a good approach, especially since it’s considered as a SEO friendly format. However, with this approach, one issue is that the generated HTML assumes the base of the document to be all the way up to /2007/12, when in reality, your base is actually at /. So, this can be easily fixed by using the BASE tag of the HTML.

If you are like me, who maintains a separate test site and a production site, then it gets a bit tricky. This is because of two reasons. One is that the name of the test and production servers is different. Second, the test server may not be setup to be starting from the root directory (/). For the first problem, it may be tempted to write the base tag with href pointing to just ‘/’. However, as per the BASE tag reference at w3.org, “This attribute specifies an absolute URI that acts as the base URI for resolving relative URIs.” That is, the base tag’s href has to be absolute.

So, the following perl code help to fix up this base tag.

$server = $ENV{SERVER_NAME};
$script = $ENV{SCRIPT_NAME};
$script =~ s/[^\/]+$//; # knock off the name and retain the directory
$base = "http://$server$script";

then, you can out put

Here, there is no need to use the $script if the test setup is also at the root directory level. Then, it would just be

$base = “http://$server/”;

I am sure this technique can be extended to take care of the http scheme (http vs https) and the non-default port.

Leave a comment

Filed under URL Rewrite

TEXTAREA HTML Editor, Open Source

I needed a textarea html editor so that I could copy and paste some HTML content from one page into the system. After a bit of a research, I first tried out Yahoo! User Interface Library (YUI). But then I had to give up that as it was both slow and also got confused when there are multiple textareas in the same page.

After a bit more research, I came across TinyMCE which is fast, worked with multiple textareas, easy to integrate and best of all, it’s licensed under LGPL.

BTW, I use Firefox and so one of the criteria is to use a solution that works in Firefox. There was another application I tried out before TinyMCE but abandoned it as it works only with IE. Also, my specific application is used only by me (the admin) and so I don’t care for cross-browser compatibility. Just that it works in IE. So, I came across Mozilla’s own rich text editing solution but gave up when I realized in that solution that it works with iframes. My criteria is to be able to easily integrate into existing html form that’s already using the textarea elements.

Leave a comment

Filed under rich text editor, textarea html editor

Where is Google In WordPress’s Referrers Report?

WordPress gives Referrer details so that you know where your visitors are coming from. I have seen answers.com, dogpile.com and a plethora of other websites but I haven’t seen Google, Yahoo!, MSN/Live. I know I do have traffic from these websites. When you get more than 200 hits to a single page on a Christmas day, you know there got to be a few of them coming from these search engines. So, that’s when I looked at the past history and noticed that these top search engines are never shown in referrer links. Ofcourse, WordPress shows the “Search Engine Terms” statistics. But I would like to know how many of those are from Google vs Yahoo! vs MSN/Live vs AOL vs Ask.com.

1 Comment

Filed under Wordpress

Childzilla

Most people know Mozilla and most geeks may know Bugzilla. But what about Childzilla? Yes, the there is a website called Childzilla that tracks the chores and reward points of kids. Is the name of this domain chosen by the owners to indicate that it’s an application appropriate for kids who are well, as scary as the Mozilla logo or it’s a tracking system similar to Bugzilla? Besides the name, I think it’s a cool online system and best of all it’s for free. So, why not all you parents out there give it a try?

Leave a comment

Filed under Childzilla

Shared WebHosting, Database Security

It is important to realize that even a small website, even it is not profiting, should give importance to the security. Many users use the same password or a little variation of it when registering at many websites. So, a compromise to the password at your small website can lead to a bigger problem at a more serious website for your user.

Here are a few tips on how to ensure there is a decent level of security on your site

1. Always ensure to one-way encrypt user passwords. You only compare the encrypted password with the encrypted user-input at the login time to see if they match. With this model, there is no password retrieval mechanism. User has to request for resetting the password and you just send an email with a url containing a large key that’s hard to guess and when the user clicks the url, you can let the user reset the password.

2. A lot of code on the web shows hard-coded password when connecting to the database. For example, in perl it would be something like

$con = DBI->connect(‘dbi:mysql:your-database:localhost’,’your-db-name’,’plain-text-password’);

As there is no connection pooling or an application level connection management with simple cgi-scripts, the tendency is to have the above piece of code in each cgi-script. This essentially means that the database password has been written in plain text in multiple files. One choice is to put the connect statement in a file and include it into all the scripts that require database connection. With this, it’s easy to change the database details as necessary in a central place. In addition, ensuring proper read file permissions to this single file is sufficient to a large extent.

1 Comment

Filed under database security

Selling A House On Amazon!

Check out this Home Listing On Amazon!

Call it marketing technique or desperation. The description of the listing goes

Product Description
SHORT SALE!! Owe bank $292,994.. Hardship, pre-foreclosure. Resort Living with Pools, Spa, and Tennis located in Mountain View Gated Community”. The best house available. Buyer to pay closing cost. Patio Cover in back yard. Large Corner Lot. Panoramic View of Mountains. Newer home. Comparables from $380 – $412 K. Next to Golf Course. North of Mission Lakes Blvd. 9 Miles to Palm Springs. Upgraded galore Beds: 3 Baths: 2 Sq. Ft.: 1,500 $/Sq. Ft.: $195 Lot Size: 7,900 sq. ft. Property Type: Single Family Residential Detached Year Built: 2006 Stories: One Level View: Mountain Area: Desert Hot Springs Subdivision: Mountain View Country Estates County: Riverside MLS#: 41294751 Additional Rooms * Master Suite * Living Room * Breakfast Counter/Bar * Breakfast Nook Flooring Information * Ceramic Tile * Wall-to-Wall Carpet Water & Sewer Information * District Water Source * Front Sprinklers Only * Auto Timer for Sprinklers * Sewer Connected & Paid Pool & Spa Information * Has Pool * Community Pool * Has Spa * Community Spa Interior Features * Window Blinds Property Information * Has Yard * CC&R’s * Gated Community * Land: Fee School Information * School District: Palm Springs Unified Fireplace Information * Has Fireplace * In Living Room Appliances Included * Dishwasher * Garbage Disposal * Microwave * Built-In Gas Building Information * Direct Garage Access * Main Floor Master Bedroom * Concrete Tile Roof * Square Footage Source: Seller Patio Information * Has Patio * Concrete Slab Patio Lot Information * APN: 661470021 * Lot #: 310

It can’t be a joke since Amazon actually assigned an ASIN (Amazon Store Item Number?) for this house and it’s B00106CN44.

I can already imagine mashups between Amazon.com and Zillow.com with all the MLS listings!

Now how do you like creating a Wishlist for houses, My Favorites for Houses, a Listmania, A deals widget for Houses, an Omakase (contextual) links and may be even recommendations based on your past purchase history (of houses or other things on Amazon)! Or may be even a donations widget asking people to fund your favorite home.

Right now it seems to be a 3rd party seller’s listing. Now, imagine if Amazon were to enter into a real estate brokering business. Using it’s dynamic price optimization, you would see one price and someone else a different price, give or take a few thousands of dollars!

And all of a sudden you get New Homes, Used Homes and Like-New homes. If a roof is newly done or a kitchen is remodeled, then it could even go as a “Refurbished Home”! Of course, there is no packaging with UPC code that has to be sent to get a manufacturer’s rebate. Practically, everything that Amazon does with it’s products, I am imagining how it goes with a house listing!

Here is the best thing, some kid puts a home in the shopping cart and a day later the rich dad with no credit limit tries to buy a book using One-Click and guess what, “Honey, the monthly credit-bill this month came to a million plus, what did you buy?” Just kidding!

Leave a comment

Filed under Amazon.com, Home Selling, Realestate, Zillow

JavaScript Performance Tuning

Most server-side programming languages have tools to do performance tuning on applications written in them. With the popularity of AJAX based applications and other rich-client HTML applications on the rise, more and more code is being written in JavaScript. So, how the heck do you go about troubleshooting performance issues for these type of client-side heavy applications written in JavaScript?

I faced this issue recently in two different applications where the Javascript library was not performing fast enough causing tardy user experience. I figured out that Firebug is an extension for Firefox that can be used to do the performance tuning. The way this works is,

a) install Firebug (and restart the Firefox browser)
b) enable Firebug if not currently enabled
c) load the page that needs performance tuning
d) open up Firebug and in the console, click on Profile
e) start using the UI controls in the page that have performance problem
f) Click on the profile and see a table of all the function calls ordered by the amount of time spent

This gives you an idea of the functions that are slow. The granularity of this report is only by function (thought it would have been good if there is a way to have line level granularity for some hot-spot functions to pin-point the line number). From this, it’s possible to figure out what’s going on and go about optimizing. In my specific cases, the two things I had to do one in each app are

1) Remove regular expression object creation and comparing from an inner loop
2) Reduce the number of Math.round calls

The performance gain in case of 1) was significantly high.

Leave a comment

Filed under Firebug, javascript, JavaScript Performance, performance tuning

Amazon SimpleDB, Text Indexing, BitMap Indexes

Today I just came to know about Amazon SimpleDB webservices from slashdot. Read the documentation which didn’t really give details about the internals of this technology. So, using my knowledge of information retrieval and databases, I am going to do a few comparisons. First some terminology,

Amazon SimpleDB => Databases
Domains => Tables
Items => Records
Attributes => Columns

A few key differences are

1) there is no need to define the data type of the attribute. This is because, the storage mechanism is mostly inverse indexing treating each value as a token.
2) values of an attribute can be both a string and a number. The token logic goes for this as well.
3) an attribute can have multiple values. Some databases like Oracle support user defined data types and varrays making it possible to capture multiple values.
4) different items can have different attributes. However, the doc indicates “256 total attribute name-value pairs per item”. It wasn’t clear if the union of all attributes across the items in a domain is 256 or for each item individually it is 256 (but much more across items). If it’s across the items, then this is no different than having a static table with 256 character columns and using a bit-map index on all of them.

Bitmap indexes are extremely fast. They work best when the ratio of the distinct set of values to the total records is low (fewer distinct values). Due to their vector storage nature, the query criteria are resolved individually (unlike in case of other types of indexes where either nested-loops, hash-join or merge-join is used), which is nothing but just fetching the index which is a vector and set-operations are performed on the bit vectors. This works well for “=” operator.

For example, there are a million records of people and Gender is indexed as a bit-map index, then getting the list of all the Males in this set of population is no more than loading the bitmap vector and picking all the records whose bits are set to 1. The “!=” operator is also not hard to implement it.

Dealing with numeric types and the related operators such as >= and and < is a bit expensive. The way it works is, the distinct values possible between the specified range and available among the distinct indexed values is identified and for each of these values, the index is loaded and a set union operation is performed. Though, a CBO database may be smart enough to go directly for a FTS (Full Table Scan) than loading all the indexes and then resolving the table.

From the SimpleDB documentation which mentions

“# Positive integers should be zero-padded to match the largest number of digits in your data set. For example, if the largest number you are planning to use in a range is 1,000,000, every number that you store in SimpleDB should be zero-padded to at least 7 digits. You would store 25 as 0000025, 4597 as 0004597, and so on.

# Negative integers should be offset and turned into positive numbers and zero-padded. For example, if the smallest negative integer in your data set is -500, your application should add at least 500 to every number that you store. This ensures that every number is now positive and enables you to use the zero-padding technique.”

it appears that the SimpleDB has been optimized to deal with constant-length positive numbers represented as string tokens. This is perhaps understandable since this technology, most likely developed to support Amazon’s highly scalable product catalog has most numeric attributes that are only positive (I mean, when was the last time you got an item at a -ve price? I once did after the manufacturer rebate and the retailer rebate both got cleared for an item).

Leave a comment

Filed under Amazon SimpleDB, BitMap Indexes, SimpleDB, Text Indexing