MySQL: COUNT DISTINCT vs DISTINCT and COUNT

Today I had the pleasure (or the pain?) of tuning a bunch of SQLs written for MySQL (5.0.26). One that bothered me most was a sql that took about 210 seconds and it appeared to be a very innocent SQL except for the fact that the where clause is useless and ending up in a full table scan. It was of the form

select a,b,count(distinct c),count(distinct d) from a-bunch-of-tables-and-where group by a,b;

To rule out the possibility of problem due to full table scan, I tried to do a simple query with out the distinct in the count and to my surprise, it returned back in under 3 seconds. Not bad for more than half-a-million rows!

Now, this difference of 210 to 3 seconds really worried me. After a bit of searching, came to realize that this is currently a limitation with MySQL as mentioned at http://forge.mysql.com/worklog/task.php?id=3220

So, I changed the query to the form

select a,b,count(distinct c),count(d) from (select distinct a,b,c,d from ... where ...) group by a,b;

and the query started working returning results in 12 seconds. Ofcourse, 12 seconds is no good, but as the query is for an aggregate report, I am fine with it.

In general, all the SQLs I tuned today required using sub-queries. Having come from Oracle database world, things I took for granted weren’t working the same with MySQL. And my reading on MySQL tuning makes me conclude that MySQL is way behind Oracle in terms of optimizing queries. While the simple queries required for most B2C applications may work well for MySQL, most of the aggregate reporting type of queries needed for Intelligence Reporting seems to require a fair bit of planning and re-organizing the SQL queries to guide MySQL to execute them faster. With Oracle CBO, that’s usually never the case. Things are far more intuitive and easy in Oracle world.

If anyone has any other stories of their performance tuning experience with MySQL, feel free to comment on them.

High Performance MySQL is the latest book on MySQL Performance tuning.

Advertisements

11 Comments

Filed under MySQL, performance tuning

11 responses to “MySQL: COUNT DISTINCT vs DISTINCT and COUNT

  1. Nick

    I’m working on a database table with about 20 million records.

    Strangely , ‘select count(distinct X) from Y where Z’ is much slower than ‘select distinct X from Y where Z’ (where table Y is indexed on X,Z).
    The first query takes around 5 minutes while the second takes .03 seconds. I can only guess that for some reason the first query doesnt use the index.

    If I’m wanting to use the count function as part of an aggregated query, is there some way around this problem?

  2. S

    Try

    select count(1) from (select distinct X from Y where Z)

  3. Pingback: SQLITE: Join And Group By vs Group By And Join « poeticcode

  4. Pingback: Can we hibernate, seriously? « poeticcode

  5. smart

    I have a same problem. I was tested it on Mysql 5 and Postgresql 8.4, but got the same result.
    Is such that query:
    select distinct(count a) from x;
    mutch slower than
    select count(1) from (select distinct a from x) as foo;
    I can not suggest a reason, but I see that the first query does not use indexes.

  6. Em

    Thanks! That was useful.

  7. Pingback: J2EE 在路上 » 常用MySQL query总结

  8. Was looking for exactly this info, and the results have been a massive improvement. Totally counter-intuitive on the timing front, though.

  9. Will Faulkner

    Thanks for this – I was having massive issues trying to use count with distinct trying to get a handle on candidates (my client is a recruiter) who were on the database with previous experiences recorded which references an unknown or a null employer.

    Initially this was timing out at 10 minutes:

    select count(distinct can.candidate_id) from candidate can inner join candidate_permissions cp
    on can.candidate_id = cp.candidateID
    inner join candidate_experience ce
    on can.candidate_id = ce.CandidateID
    where cp.roleid = 16843326
    and (isnull(ce.PreviousEmployer) or ce.PreviousEmployer = “unknown”
    or ce.PreviousEmployer = “”
    or ce.PreviousEmployer = ” “);

    The following (very similar) query completed in under 2 seconds!

    select count(*) from (select distinct can.candidate_id from candidate can inner join candidate_permissions cp
    on can.candidate_id = cp.candidateID
    inner join candidate_experience ce
    on can.candidate_id = ce.CandidateID
    where cp.roleid = 16843326
    and (isnull(ce.PreviousEmployer)
    or ce.PreviousEmployer = “unknown”
    or ce.PreviousEmployer = “”
    or ce.PreviousEmployer = ” “)) as counttable;

    I agree with Imogen though, totally counter-intuitive but I’ll not argue!

  10. Thank you for this post. It really helped me.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s