Category Archives: MySQL

Can we hibernate, seriously?

I come from what some may consider an old-school development environment, where a lot of SQL is hand-coded and well tuned. So, to me, using a generic framework that understands the database schema from some metadata and automatically translates a generic syntax into the target database's SQL is a bit concerning. I have done performance tuning on Oracle, MySQL, SQLite and Derby, and my experience has been that abstracting SQL generation so that the same definition runs on all of these databases is probably not that difficult for simple statements. For anything more serious, such as a complex reporting query, the databases do not behave the same way given the same form of SQL, even after adjusting for their syntactic differences. For example, see my articles MySQL COUNT DISTINCT vs DISTINCT and COUNT, SQLite Join and Group By vs Group By and Join, and Experience with Derby: each of these databases required restructuring the SQL so fundamentally that a generic ORM (object-relational mapping) framework such as Hibernate will not be sufficient for complex SQL.

Depending on the application, I would say 60 to 80% of the SQL statements are very simple, mostly providing basic CRUD services, and can be handled by generic frameworks. The remaining 20 to 40% is where it may be desirable to hire database experts to hand-tune the SQL.

Filed under Hibernate, MySQL, Oracle, ORM, performance tuning, SQL Tuning, SQLITE

MySQL: SELECT RANDOM ROW, Very Efficient

When living in a world of chaos, a requirement to select a random row from a table shouldn’t come as a surprise. Randomness helps make an otherwise static web page a bit more dynamic, or helps rotate a banner, an ad, a forum post and so on.

I looked at the various methods of selecting random rows from a table and wanted to write down what I have done for the system I am working on.

I have a table with an auto-incremented ID field. In addition, I already need to fetch the latest ID, which is mostly the MAX(id). However, I also have a status field, and rows with certain statuses must not show up.

Here is how I ended up fetching a random id.

1) First get the max id of the table. Something like

select id,... from tablex where status = 1 order by creation_date desc limit 0,1

Here, I have an index on creation_date. So, the cost of executing the above query is essentially that of traversing down a b-tree index from the maximum value side and resolving the rows in the table to filter by status and get the first row. Assuming most of the recent rows have a status of 1, the number of rows resolved through the index should be 1 or just a few.

2) Then generate a random value with the id fetched above as the max. Keep doing this till the value is != 0, because the ID starts from 1 in my case. Other variations are possible, such as requiring the ID to be more than x% of the max ID. This typically helps in fetching a more recent random row, if that’s what is desired.

3) Now, just do

select id,... from tablex where status = 1 and ID = ?

and bind the ID to the random ID generated in the previous step. There are two reasons why this query may not return a row: (a) the status of the row is not appropriate, or (b) the id doesn’t exist in the table (maybe it was deleted). Assuming the chances of either are almost negligible, a row is found almost always immediately. The cost of the above SQL is nothing but fetching a row using a unique index, which is quite efficient. Just to keep it generic, loop through steps 2 and 3 till a row is identified. In large data sets a row is likely to be found eventually, but just in case, have a contingency for not finding a row after, say, N (perhaps 5) iterations.

That’s pretty much it. Given that I already need to fetch the latest row in the table, the cost of step 1 is not a factor in my random row selection. That leaves just one more SQL based on a unique index. For a homepage that has to display both the latest row(s) and a random row, this technique is quite efficient.
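The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: Python's built-in sqlite3 module stands in for MySQL, the table and column names follow the post, and the sample data, the title column, the function names and the max_attempts default are all hypothetical.

```python
import random
import sqlite3

# SQLite stands in for MySQL here; tablex, status and creation_date follow the
# post, everything else (sample data, title column, function names) is made up.


def latest_id(conn):
    """Step 1: id of the most recent visible row, or None if there is none."""
    row = conn.execute(
        "SELECT id FROM tablex WHERE status = 1 "
        "ORDER BY creation_date DESC LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def random_row(conn, max_id, max_attempts=5, rng=random):
    """Steps 2 and 3: probe random ids until a visible row turns up."""
    for _ in range(max_attempts):
        candidate = rng.randint(1, max_id)  # ids start at 1, so 0 is never drawn
        row = conn.execute(
            "SELECT id, title FROM tablex WHERE status = 1 AND id = ?",
            (candidate,),
        ).fetchone()
        if row is not None:
            return row
    return None  # contingency: the caller decides what to show instead


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tablex (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "title TEXT, status INTEGER, creation_date TEXT)"
)
# 30 sample rows; every tenth row is hidden (status 0).
conn.executemany(
    "INSERT INTO tablex (title, status, creation_date) VALUES (?, ?, ?)",
    [(f"post {i}", 0 if i % 10 == 0 else 1, f"2008-01-{i:02d}") for i in range(1, 31)],
)
conn.execute("CREATE INDEX idx_creation ON tablex (creation_date)")

max_id = latest_id(conn)
picked = random_row(conn, max_id)
print(max_id, picked)
```

Returning None after a fixed number of attempts keeps the page from looping forever on a sparse table; falling back to the latest row would be one reasonable choice for the caller.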

Filed under MySQL, performance tuning

MySQL: String concat multiple rows

If you are implementing a tagging system like the one in WordPress, you may need to fetch all the tags of each article when displaying the blog. As each article has many tags, one option is to first fetch all the articles and then, for each article, fetch its list of tags. However, there is a better way to do this in a single SQL statement when using MySQL.

MySQL has the function group_concat that allows one to condense a set of rows into one single row by concatenating the list of values.

The SQL will be something like

select article_id,article_title,article_text,(select group_concat(tag) from tags t where t.article_id = a.id) tags from articles a …;

Here, the subselect used for the tags column is what I am talking about. It uses group_concat to concatenate all the tags together. By default it uses ‘,’ as the separator, and that can be changed with the SEPARATOR keyword inside the call, as in group_concat(tag separator ' ').
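To see the shape of the result, here is a small sketch that runs the same kind of query. It assumes Python's built-in sqlite3 module as a stand-in for MySQL (SQLite also has group_concat, although it takes the separator as a second argument rather than through a SEPARATOR keyword); the sample articles and tags are made up.

```python
import sqlite3

# SQLite stands in for MySQL; table and column names follow the post,
# the sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (id INTEGER PRIMARY KEY, article_title TEXT);
    CREATE TABLE tags (article_id INTEGER, tag TEXT);
    INSERT INTO articles VALUES (1, 'Random rows'), (2, 'Untagged post');
    INSERT INTO tags VALUES (1, 'mysql'), (1, 'performance'), (1, 'sql');
""")

# One row per article, with all of its tags folded into a single column.
rows = conn.execute("""
    SELECT a.id, a.article_title,
           (SELECT group_concat(tag) FROM tags t WHERE t.article_id = a.id) AS tags
    FROM articles a
    ORDER BY a.id
""").fetchall()

for article_id, title, tags in rows:
    # An article without tags yields NULL (None in Python), not an empty string.
    tag_list = sorted(tags.split(",")) if tags else []
    print(article_id, title, tag_list)
```

Splitting on the separator in application code only works if no tag contains the separator itself, which is worth keeping in mind when choosing one.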

Filed under MySQL, tagging

MySQL Regular Expressions

I have a table that stores a bunch of keywords. The keywords can be either with or without quotes (“). That is, the keywords could be PageRank, “Web 2.0” and startups. My requirement was to sort these keywords while ignoring the quotes. That is, instead of displaying them as “Web 2.0”, PageRank, startups, I wanted them to show up as PageRank, startups, “Web 2.0”.

That’s when I started looking at MySQL regular expressions. A few interesting things. First and foremost, it looked odd to me that the syntax is something like

select stringcolumn regexp 'some-pattern'

instead of

select regex(stringcolumn,'some-pattern')

The second thing is in how I had to write my sql to achieve the above requirement. I ended up doing something like

select …, case when keyword regexp '^".*"$' = 1 then substr(keyword,2,length(keyword)-2) else keyword end okw from … where … order by okw;

This felt quite ugly, but perhaps a better hypothetical syntax could have been like

regex(keyword,'"(.*)"',"nvl($1,$0)")

where the 3rd argument is an SQL expression evaluated with special bind variables captured from the regular expression. (Ignore the usage of nvl, which is an Oracle function, but you get the point.)
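The case/regexp sort can be tried out with a small sketch like the one below. It assumes Python's built-in sqlite3 module as a stand-in for MySQL: SQLite has no built-in REGEXP implementation, so a hypothetical one is registered from Python, and COLLATE NOCASE is used to approximate the case-insensitive ordering of MySQL's default collations. The keywords table name is made up.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")

# SQLite translates "X REGEXP Y" into a call to regexp(Y, X), so a
# two-argument function is registered here. MySQL ships REGEXP natively.
conn.create_function(
    "regexp", 2, lambda pattern, value: 1 if re.search(pattern, value) else 0
)

conn.execute("CREATE TABLE keywords (keyword TEXT)")
conn.executemany(
    "INSERT INTO keywords VALUES (?)",
    [('"Web 2.0"',), ("PageRank",), ("startups",)],
)

# Strip the surrounding quotes only for sorting; the displayed keyword keeps them.
rows = conn.execute("""
    SELECT keyword,
           CASE WHEN keyword REGEXP '^".*"$'
                THEN substr(keyword, 2, length(keyword) - 2)
                ELSE keyword
           END AS okw
    FROM keywords
    ORDER BY okw COLLATE NOCASE
""").fetchall()

ordered = [keyword for keyword, _ in rows]
print(ordered)
```

If quotes only ever appear at the two ends of a keyword, MySQL's TRIM(BOTH '"' FROM keyword) in the order by clause might be a shorter way to get the same ordering without a regular expression.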

Filed under MySQL

MySQL: Updating sets of columns

I was looking to update a bunch of aggregate values from a child table into a master table. So, my requirement was something like

update parent_table set (minx,avgx,maxx) = (select min(x),avg(x),max(x) from child_table where parent_id = parent_table.id) where id = :parentid

I.e., the ability to update multiple columns from the same subquery. However, MySQL currently doesn’t support this. I checked Oracle’s syntax and found that it does support it. In MySQL, this has to be done with 3 separate subselects, which will be roughly 3 times more expensive.

If you know of a way to make this work optimally in MySQL, let me know. Of course, if the outer clause is for only a single id, like the :parentid bind I had, then I could execute the subquery separately, pick the min, avg and max values and update them directly. However, in my actual requirement, the parentid is not just a single id; it itself comes from a subquery.
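One portable workaround is to compute all three aggregates in a single grouped pass and then apply them as keyed updates. The sketch below is only an illustration of that idea, not a MySQL-specific answer: it assumes Python's built-in sqlite3 module as a stand-in driver, the table and column names follow the post, and the sample data and the IN (...) subquery choosing the parents are hypothetical.

```python
import sqlite3

# SQLite stands in for MySQL; parent_table, child_table, minx/avgx/maxx and x
# follow the post, the sample rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent_table (id INTEGER PRIMARY KEY, minx REAL, avgx REAL, maxx REAL);
    CREATE TABLE child_table (parent_id INTEGER, x REAL);
    INSERT INTO parent_table (id) VALUES (1), (2), (3);
    INSERT INTO child_table VALUES (1, 2.0), (1, 4.0), (1, 9.0), (2, 5.0);
""")

# One grouped pass over child_table yields all three aggregates per parent.
# The IN (...) subquery stands in for whatever query selects the parents.
aggregates = conn.execute("""
    SELECT MIN(x), AVG(x), MAX(x), parent_id
    FROM child_table
    WHERE parent_id IN (SELECT id FROM parent_table)
    GROUP BY parent_id
""").fetchall()

# Apply the precomputed values; each update is a primary-key lookup.
conn.executemany(
    "UPDATE parent_table SET minx = ?, avgx = ?, maxx = ? WHERE id = ?",
    aggregates,
)
conn.commit()

result = conn.execute(
    "SELECT id, minx, avgx, maxx FROM parent_table ORDER BY id"
).fetchall()
print(result)
```

This trades one statement for one grouped select plus one small update per parent. MySQL's multi-table update, which joins parent_table to a grouped derived table over child_table and sets the three columns from it, may achieve the same in a single statement; that form is not exercised here.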

Filed under MySQL

MySQL: COUNT DISTINCT vs DISTINCT and COUNT

Today I had the pleasure (or the pain?) of tuning a bunch of SQL statements written for MySQL (5.0.26). The one that bothered me most took about 210 seconds, yet it appeared to be a very innocent query, except that its where clause did little filtering and it ended up in a full table scan. It was of the form

select a,b,count(distinct c),count(distinct d) from a-bunch-of-tables-and-where group by a,b;

To rule out the full table scan as the problem, I tried a simple query without the distinct in the count and, to my surprise, it returned in under 3 seconds. Not bad for more than half a million rows!

Now, this difference of 210 to 3 seconds really worried me. After a bit of searching, I came to realize that this is currently a limitation of MySQL, as mentioned at http://forge.mysql.com/worklog/task.php?id=3220

So, I changed the query to the form

select a,b,count(distinct c),count(d) from (select distinct a,b,c,d from ... where ...) t group by a,b;

and the query started returning results in 12 seconds. Of course, 12 seconds is no good, but as the query is for an aggregate report, I am fine with it.
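A rewrite like this is worth sanity-checking on a small data set before trusting the faster numbers. After the inner select distinct, count(d) counts the distinct (c, d) combinations in each group, which matches count(distinct d) only when every d value occurs with a single c value within the group. The sketch below assumes Python's built-in sqlite3 module as a stand-in for MySQL; the facts table and its rows are made up to show a case where the two forms differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (a INTEGER, b INTEGER, c TEXT, d INTEGER)")
conn.executemany(
    "INSERT INTO facts VALUES (?, ?, ?, ?)",
    [
        (1, 1, "x", 10),
        (1, 1, "x", 10),  # exact duplicate, removed by SELECT DISTINCT
        (1, 1, "y", 10),  # d = 10 again, but paired with a different c
        (1, 1, "y", 20),
    ],
)

# Original form: two COUNT(DISTINCT ...) aggregates in one grouped query.
original = conn.execute(
    "SELECT a, b, COUNT(DISTINCT c), COUNT(DISTINCT d) FROM facts GROUP BY a, b"
).fetchall()

# Rewritten form: de-duplicate first, then aggregate. MySQL requires the
# derived table to carry an alias (t here).
rewritten = conn.execute(
    "SELECT a, b, COUNT(DISTINCT c), COUNT(d) "
    "FROM (SELECT DISTINCT a, b, c, d FROM facts) t GROUP BY a, b"
).fetchall()

print(original)
print(rewritten)
```

Comparing the two result sets on a sample like this shows whether the data actually satisfies the condition the rewrite relies on.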

In general, all the SQL statements I tuned today required using sub-queries. Coming from the Oracle database world, I found that things I took for granted weren’t working the same way with MySQL. And my reading on MySQL tuning makes me conclude that MySQL is way behind Oracle in terms of optimizing queries. While the simple queries required for most B2C applications may work well on MySQL, most of the aggregate reporting queries needed for Intelligence Reporting seem to require a fair bit of planning and reorganizing to guide MySQL to execute them faster. With Oracle’s CBO (cost-based optimizer), that’s rarely the case. Things are far more intuitive and easy in the Oracle world.

If anyone has any other stories of their performance tuning experience with MySQL, feel free to comment on them.

Filed under MySQL, performance tuning

JDBC parameter for null timestamps in MySQL

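I just came to know that mysql-connector-java does not, by default, handle MySQL timestamps holding “null” (all-zero, 0000-00-00 00:00:00) values over JDBC! It throws an exception. However, setting the zeroDateTimeBehavior parameter to convertToNull in the JDBC URL does the trick, for example jdbc:mysql://localhost/mydb?zeroDateTimeBehavior=convertToNull (the host and database name here are placeholders).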

Filed under Database, JDBC, MySQL