MySQL performance: INNER JOIN vs. sub-select
Posted by
Brad Wood
Aug 01, 2008 04:51:00 UTC
I ran into an interesting behavior with MySQL this week. I was helping someone speed up a slow page and a sizable increase in performance was achieved by simply re-arranging the SQL statement. The page was calling a SQL statement inside of a loop-- probably around 150 times on a page load. My initial idea (and still the best long-term one I think) was to gather all the information from the database in a single cfquery and not hit the database over and over. That approach would have required quite a lot more refactoring of code, so first we tried to squeeze some better performance out of the SQL already being called.This site used MySQL which I am not as familiar with compared to MSSQL, but I started by looking at an EXPLAIN statement. In MySQL the placing the keyword EXPLAIN before your select will cause the engine to output a result set which describes how the select was performed.
Version 2 (faster):
I'm not going to get too deep into what all that means. If you want a full explanation, check out the MySQL reference here: http://dev.mysql.com/doc/refman/5.0/en/using-explain.html There's quite a bit of information there.
The main difference I noticed from the plan was that the first method figured it would go through the entire table1 (about 250,000 records). The second select only expected to hit about 50 records in table1.
What's a bit more interesting is that I have tried this setup both ways on MSSQL and there never seemed to be much of a difference between the two. It seems that Microsoft's query optimizer works a little differently. I don't have access to MSSQL right now or I would compare the plan there as well.
Overall, simply by re-arranging that query a bit we were able to get a 20 second page load down to well under a second. I love improving performance. A fast web site is a happy web site.
[code]EXPLAIN SELECT * FROM my_table WHERE column = 'foo'[/code]I will start by saying the amount of information MySQL (community edition) gives you when compared to a much more robust (and expensive) DBMS is much smaller. Perhaps the Enterprise edition of MySQL is better, but there's a REASON you pay for the Enterprise stuff. I LOVE how MS Enterprise Manager graphically maps out the entire query. The select in question was basically this:
[code]SELECT count(id) AS totalcount FROM table1 WHERE fkid IN (SELECT id FROM table2 WHERE fkid2 = 100)[/code]You can see it used a sub-select on table2 in the where clause to limit the records from table1. Table1 had a few hundred thousand records and table2 around twenty thousand. It began performing about 30 times faster when re-wrote it like so:
[code]SELECT count(table1.id) AS totalcount FROM table1 INNER JOIN table2 e ON table1.fkid = table2.id AND table2.fkid2 = 100;[/code]All I did was move the sub-select into an inner join. Let's take a quick look at the execution plans: Version 1 (slower):
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
1 | PRIMARY | table1 | index | {null} | fkid | 4 | {null} | 249844 | Using where; Using index |
2 | DEPENDENT SUBQUERY | table2 | unique_subquery | PRIMARY,fkid2 | PRIMARY | 4 | func | 1 | Using where |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
1 | SIMPLE | table2 | ref | PRIMARY,fk1d2 | fk1d2 | 4 | const | 50 | Using index |
1 | SIMPLE | table1 | ref | fkid1 | fkid1 | 4 | table2.id | 4 | Using index |
Tags: Performance, SQL
Barney
Correlated subselects on MySQL are almost always slower than joins, though the difference in a lot of cases (small row sets, for example) it's completely irrelevant. The readability of subselects is often enormously better, however. As such, I usually default to subqueries whenever possible, and go back and refactor to joins only if needed.
And your completely right about MSSQL, it's query optimizer is enormously better that MySQL's, at least when it comes to subselects.
jeff
Yeah i figured this out a couple of weeks ago from a bug post on mySQL's website.
http://www.digitalrewind.com/blog/2008/06/11/MYSQL-Slow-Subquery-using-IN
It's amazing how much it speeds up MYSQL.
Brad Wood
@Jeff: Thanks for the link.
@Barney: Thanks for the insight. I've always found an inner join to be easier to read, but it guess it's all a matter of personal preference.
Brian Kotek
Another option is to use EXISTS where possible, as it will also usually be much faster than a correlated subquery.
Another option is using can be using an inline view, meaning a select statement that is aliased and used in joins within the FROM clause (Google "inline view" for much more). All major DBs support this and they can be very useful for performance as well as getting data in one query that would normally require multiple queries.
Brian Kotek
Whoops sorry for the bad grammar there. :-/
Barney
Inline views actually don't seem to be as performant as they should on MySQL. We were doing some largish (couple million rows) reporting queries and using inline views and they were slow. Switching to explicitly temporary tables made a HUGE difference. Unfortunately MySQL doesn't support the WITH clause like MSSQL and Oracle do, which is basically equivalent.
Brad Wood
@Barney: When you say the "with clause" are you talking abut Common Table Expressions? I don't know if I care for them from a readability stand point always, but they can make for some really useful code. Recursion is one really nice use of CTE's.
Heck, I'd be happy if MySQL supported table variables though.
@Brian: Check out this MySQL bug report: http://bugs.mysql.com/bug.php?id=4040 The comments down near the bottom explain that MySQL 4.1 "rewrites IN via EXISTS like subquery " and in MySQL 5 they "planned make hash/temporary table optimization for such queries"
Raymond Camden
I hope Brad doesn't mind, but I was the person he helped. The site was coldfusionbloggers.org and the feeds tab. The improvement was incredible.
ike
That's a question I'd wondered about for a long time, but never got around to actually testing... I'd always suspected that a subquery was likely to be slower than an equivalent join clause, though I expected that to be true irrespective of the database platform. But I guess it really does depend on the engine used to create the execution plan.
Andrew
Ok, so I've been wondering, what if I have my sub-select on the select part of the query? That would be something like this:
SELECT name, (SELECT something FROM aTable WHERE person.id = aTable.personID) as Something FROM person WHERE person.id = @theID
wich would perform better? or is it a bad practice to do what I'm doing? if so, then the 'proper' way would be to use a Join instead?
Brad Wood
@Andrew: I've always assumed it would not perform well to have a sub select in the main select clause, but I've never actually tested it.
I have often been suprised by MS SQL's ability to come up with a good execution plan even with weird SQL.
That being said, I have rarely had the need to do that. Usually a join will do the trick nicely.
MR
Thanks very much!
p
Thanks a lot for posting this. I had a simple web page that took up to 1,5 seconds to query a MySQL database of just a few hundred rows, using two subselects. Implementing the change you suggest (which I didn't previously realize was possible) made an enormous difference: it now always loads in less than 0,1 seconds.