When living in a world of chaos, it shouldn’t be a surprise on the requirement to select a random row from a table. Randomness helps to make an otherwise static web page a bit more dynamic. Or it helps to rotate a banner or an ad or a forum post and so on.
I looked at the various methods of selecting random rows from a table and wanted to write down what I have done for the system I am working on.
I have a table where there is an ID field that is auto incremented. In addition, I already have a need to fetch the latest ID. This is mostly the MAX(id). However, I also have a status field that prevents rows with a certain status to not show up.
Here is how I ended up with fetching a random id.
1) First get the max id of the table. Something like
select id,... from tablex where status = 1 order by creation_date desc limit 0,1
Here, I have an index on creation_date. So, the cost of essentially executing the above query is traversing down a b-tree index from the maximum value side and resolving the rows in the table to filter by status and get the first row. Assuming most of the recent rows have a status of 1, the number of rows resolved using an index should be 1 or just a few.
2) Then generate a random value with the above fetched id as the max. Keep doing this till the value is != 0. This is because, the ID starts from 1 in my case. Also, it’s possible to have other variations such as the ID being more than x% of the max ID. This typically helps in fetching a more recent random row if that’s what is desired.
3) Now, just do
select id,* from tablex where status = 1 and ID = ? and bind the ID with the random ID generated in the previous step. There are two reasons why this
query may not result in a row. 1) the status of this row is not appropriate 2) the id
perhaps doesn’t exist in the table (may be it’s deleted). Assuming the chances of
these is almost negligible, it’s possible to find a row almost always immediately. The
cost of the above SQL is nothing but fetching a row using a unique index which is
quite efficient. Just to keep it generic, loop through steps 2 and 3, till a row is identified. In large data sets, it’s likely to always find a row eventually, but just in case, have contingency for not finding a row after, say N (perhaps 5) iterations.
That’s pretty much it. So, given that I already have a need to fetch the latest row
in the table, the cost of step 1 is not a factor for my random row selection. This just
left with issuing just another unique index based SQL. For a homepage that has to
display both the latest row(s) and a random row, this technique is quite efficient.