A while back I wrote about SaaS BI and why I thought offering BI as a SaaS may not be the best strategy. A few commenters didn't agree with me.
Guess what? Today at Open World, Larry Ellison, a guy who understands where databases are headed over the next decade better than most of us, unveiled one of the coolest products of the decade in the BI space. No, it's not yet another cool reporting tool, a sexy graph renderer, user-friendly pivoting, or some metadata layer that lets analysts create ad-hoc queries without learning about outer joins and SQL aggregations. The innovation is at the very core: the hardware layer.
When you hear Larry talk about TPC-H and the challenges that large companies face today with terabytes of data, mind you, in spite of a big database grid, you wonder how the SaaS BI guys can solve the problem of extracting those terabytes of data into their own hosted environments.
Exadata Storage Server is essentially specialized storage hardware with database-level knowledge (think of it as a hard disk with part of the database logic built in). When a query is executed, instead of returning blocks of data from the storage server, only the result set is returned. This is a big, big deal. If you have a large table and your query requires a full-table scan on it, the traditional approach has to return those thousands of data blocks (even with optimizations such as multi-block fetch for FTS). But with a specialized storage solution that has built-in database logic, the storage just needs to return the summary data. That's a lot of bandwidth saved between the storage grid and the database grid. Apparently, the pipes between the Exadata storage server and the database grid now have such high bandwidth that they are no longer the bottleneck; it seems the hard drives are the bottleneck instead! If SSDs (solid-state disks) mature and become more cost-effective, perhaps that equation will shift. But at least for now, Oracle's Exadata server blows the competition away by a huge margin. Forget about SaaS, we are talking about raw processing power here!
The benchmarks shown in the presentation indicate that this solution performed far better than solutions like Teradata and Netezza.
I come from what some may consider the old school: a development environment where a lot of SQL is hand-coded and well tuned. So to me, using a generic framework that understands the database schema from metadata and automatically translates a generic syntax into the target database's SQL is a bit concerning. I have done performance tuning on Oracle, MySQL, SQLite, and Derby, and my experience has been that while abstracting SQL generation so that the same SQL definition can run on all of these databases is probably not that difficult, for anything more serious, such as a complex reporting SQL, not all databases behave the same way given the same form of SQL adjusted only for syntactic differences. For example, see my articles MySQL COUNT DISTINCT vs DISTINCT and COUNT, SQLite Join and Group By vs Group By and Join, and Experience with Derby for how each of these databases required restructuring the SQL statements so fundamentally differently that a generic ORM (object-relational mapping) such as Hibernate will not be sufficient for complex SQLs.
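To give a flavor of the kind of restructuring I mean, here is a minimal sketch (the orders table and customer_id column are made up for illustration). The two queries are logically equivalent, but depending on the engine and version, one form can perform dramatically differently from the other:

```sql
-- Form 1: COUNT(DISTINCT ...) computed in a single pass
SELECT COUNT(DISTINCT customer_id) FROM orders;

-- Form 2: DISTINCT in an inline view, then a plain COUNT over it;
-- some engines choose a very different (sometimes better) plan for this
SELECT COUNT(*) FROM (SELECT DISTINCT customer_id FROM orders) t;
```

A framework that blindly emits one form for every database misses exactly this class of tuning opportunity.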
Depending on the application, I would say 60 to 80% of the SQLs are very simple, mostly providing basic CRUD services, and can be handled by generic frameworks. But the remaining 20 to 40% is where it may be desirable to hire database experts to hand-tune the SQLs.
I recently found that CONNECT BY clause performance is much better in Oracle 11g compared to Oracle 10g. In 10g, the explain plan showed either a FULL TABLE SCAN or an INDEX FULL SCAN as the last step. In 11g, this is no longer the case, and the number of consistent gets is much lower.
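For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of the sort of hierarchical query involved; the employees table and its columns are illustrative, not the actual schema I tested. Running it under SQL*Plus AUTOTRACE on each version shows both the plan and the consistent gets:

```sql
-- In SQL*Plus: display the execution plan and statistics (consistent gets)
SET AUTOTRACE TRACEONLY EXPLAIN STATISTICS

-- Walk a hierarchy from the roots down (illustrative table/columns)
SELECT employee_id, manager_id, LEVEL
FROM   employees
START WITH manager_id IS NULL
CONNECT BY PRIOR employee_id = manager_id;
```

Comparing the statistics output from a 10g and an 11g instance for the same query and data is the quickest way to see the difference.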
Today, I was trying to help a friend migrate data from one data model to another. The first data model had all the levels of the dimension in the same table, while the second had them in different tables. So, instead of writing some complex procedural logic, I tried using Oracle's Multi-Table Insert, which essentially has the following syntax:
insert all
  when <condition> then into <table> values (…)
  when <condition> then …
select … from <table> where … connect by … start with … order by level;
The data models in this specific architecture are auto-generated and have the foreign key constraints created automatically. So I tried to order the data by the level of the dimension hierarchy, so that a parent level is always inserted before its child level. However, the above SQL still gave a constraint violation error.
After a bit of research, it turns out that Oracle doesn't guarantee the order of the inserts in spite of the explicit ordering in the select clause. Well, I fixed the problem in a quick and easy manner using a PL/SQL block that loops on the level from 1..n.
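Here is a minimal sketch of that workaround; all the table and column names (flat_dim, region_dim, and so on) are made up for illustration, since the real models were auto-generated. Running the multi-table insert once per level, from the top of the hierarchy down, guarantees that parent rows land before child rows:

```sql
-- Workaround sketch: one multi-table insert per hierarchy level,
-- level 1 first, so parents are always inserted before children.
-- All table/column names here are illustrative.
BEGIN
  FOR lvl IN 1 .. 3 LOOP   -- 3 = depth of this hypothetical hierarchy
    INSERT FIRST
      WHEN lev = 1 THEN INTO region_dim  (id, name)             VALUES (id, name)
      WHEN lev = 2 THEN INTO country_dim (id, region_id, name)  VALUES (id, parent_id, name)
      WHEN lev = 3 THEN INTO city_dim    (id, country_id, name) VALUES (id, parent_id, name)
    SELECT id, parent_id, name, LEVEL AS lev
    FROM   flat_dim
    WHERE  LEVEL = lvl          -- restrict each pass to a single level
    START WITH parent_id IS NULL
    CONNECT BY PRIOR id = parent_id;
  END LOOP;
  COMMIT;
END;
/
```

It runs the hierarchical query once per level, which is not ideal for very deep hierarchies, but for a typical dimension of a handful of levels it is a perfectly quick fix.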
Anyway, I think the Multi-Table Insert is a cool feature, but given that it was designed specifically with ETL in mind, it would be great if it could be enhanced to honor the ordering of the data.