As analytical and BI applications became more prevalent in the industry, it quickly became obvious that the "Big Three" had significant scalability and cost issues when employed in data warehousing scenarios. For one thing, these legacy products were designed for transactional processing, not analytical work, two very different use cases. For another, data volumes were growing exponentially, well beyond the software's ability to keep up. (Keep in mind there hadn't been any significant breakthroughs in relational database technology in over 40 years.) Worse still, these legacy systems continued to cost big bucks while failing to solve new pain points quickly enough.
To address this, over recent years, dozens of new companies started offering dedicated analytical database engines. In other words, some very bright people started rethinking and reengineering databases specifically for handling analytics at massive scale. And some very desperate customers started paying serious attention to these new offerings, using them, and spreading the word. Some of the better-known players include Teradata, Sybase (their IQ product), Netezza, and Vertica. Today virtually all large enterprises rely on a proprietary analytical database solution to manage their business both strategically and tactically.
If you really want to learn from the best on the topic, suffice it to say that Curt Monash is probably the ultimate authority in this business. His recently posted PowerPoint deck on how to select a data warehouse DBMS (http://www.monash.com/uploads/How-to-buy-data-warehouse-DBMS-February-2009.ppt) is about as clear and informative as anything else out there. It also has an extensive list of "new-breed" players in the market. And anyone seriously following this game should be reading Curt's blog daily at www.dbms2.com.
That being said, let me try and simplify even more.
In one way or another, these "new-breed" players all boast some version of a "secret sauce" that makes them uniquely suited to handling terabytes of information, giving them "simpler, faster, cheaper" advantages. At the end of the day, and minus all the marketing hoopla and chest-beating frenzy, these folks more or less make the following arguments:
- Information stores are growing so fast that conventional OLTP databases cannot keep up with current and projected data volumes.
- Conventional engines cannot serve analytical needs because they were designed for high-volume operational transactions, where row-at-a-time processing is efficient for write-many/read-few usage patterns.
- In data warehouse situations, reads are frequent, while writes are not and most reads pull fairly small chunks of data.
- This being the case, you can't build row-centric databases to address transactional issues and expect them to also double as scalable, performant and affordable analytical engines.
At this point, pretty much every company has a proprietary "secret sauce" to address these issues, usually comprising one or more of the following ingredients:
Replace row-based with column-based designs at the logical and storage levels
This approach lets the software deal with entire columns of data in one fell swoop instead of managing rows. So if you have a 20-column table and a query hitting only 3 of those columns, you can bring back just those 3 columns in one shot, instead of reading back every row and then filtering out the unwanted fields. This is a grossly oversimplified explanation, but that's the general approach taken by the many "columnar" players in this field, including Vertica, InfoBright, and Sybase, which was actually first to market with this approach in Sybase IQ.
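To make the row-versus-column distinction concrete, here's a toy Python sketch (the table, column names, and values are all hypothetical, and real engines of course work at the storage-page level, not with Python lists):

```python
# A hypothetical 20-column table, stored two ways.

# Row-oriented: each record carries all 20 fields; touching any field
# means reading the whole row.
rows = [
    {"id": i, "region": "EMEA" if i % 2 else "APAC", "sales": i * 10.0,
     **{f"col{j}": None for j in range(17)}}   # 17 more unused columns
    for i in range(1000)
]

# Column-oriented: one contiguous list per column; a query touching
# 3 columns reads exactly 3 lists and ignores the other 17.
columns = {
    "id":     [r["id"] for r in rows],
    "region": [r["region"] for r in rows],
    "sales":  [r["sales"] for r in rows],
    # ...the remaining 17 columns would be stored the same way...
}

# "SELECT SUM(sales) WHERE region = 'EMEA'" against the column store
# never looks at the 17 irrelevant columns:
total = sum(s for s, g in zip(columns["sales"], columns["region"])
            if g == "EMEA")
```

The point of the sketch is simply that the column layout makes the 3-of-20 access pattern cheap by construction.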
In-memory caching (DIM)
You look at the queries coming in and determine which areas of your data can be cached in memory for better performance. Almost all players do something in this area. It's simply sound design, given that most analytical queries (OLAP especially) tend to hit a small proportion of the entire information store once you get into the tens of terabytes.
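One common way to keep hot data in RAM is a least-recently-used cache keyed by the incoming query. This is a minimal sketch of that idea (the `QueryCache` class and the SQL strings are hypothetical; vendors' actual caching layers are far more sophisticated):

```python
from collections import OrderedDict

class QueryCache:
    """Tiny LRU cache: hot result sets stay in memory, cold ones are
    evicted once capacity is exceeded."""
    def __init__(self, capacity=2):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, sql):
        if sql in self._store:
            self._store.move_to_end(sql)   # mark as recently used
            return self._store[sql]
        return None                        # miss: caller runs the query

    def put(self, sql, result):
        self._store[sql] = result
        self._store.move_to_end(sql)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used

cache = QueryCache(capacity=2)
cache.put("SELECT ... A", [1, 2])
cache.put("SELECT ... B", [3])
cache.get("SELECT ... A")          # touch A, so B becomes least recent
cache.put("SELECT ... C", [4])     # capacity exceeded: B is evicted
```

The eviction policy is what lets a cache a fraction of the store's size still absorb most of the query traffic.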
Move engine functions judiciously "closer to the data" for pre-processing queries
This is one way to try and improve disk I/O, since one of the major bottlenecks in this space is moving bytes around. The CPU-to-disk speed ratio has remained around 1000:1 for decades now, and even with newer SSD technology the discrepancy is still significant. As the fastest way to do something is to not do it at all, minimizing byte transport yields crucial performance advantages. If you can calculate and generate preliminary results very close to the storage layer and only move those at the last minute, as necessary, you can relieve the I/O bottleneck. Oracle's Exadata (its analytical offering) apparently makes extensive use of this feature.
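Here's a toy sketch of that "push the work to the data" idea (the shard layout, node count, and `scan_and_aggregate` helper are all hypothetical, and real systems do this at the storage-cell or node level over a network):

```python
# Each "storage node" holds a shard of the table and can run the filter
# and a partial aggregate itself; only one number per node crosses the
# wire, instead of every matching row.

shards = [
    [("EMEA", 100.0), ("APAC", 50.0)] * 500,   # node 0: 1000 rows
    [("EMEA", 10.0)] * 1000,                   # node 1: 1000 rows
]

def scan_and_aggregate(shard, region):
    """Runs on the storage node: filter close to the data, ship back
    a single partial sum rather than the raw rows."""
    return sum(amount for r, amount in shard if r == region)

# The coordinator only merges tiny partial results.
partials = [scan_and_aggregate(s, "EMEA") for s in shards]
total = sum(partials)
```

In this sketch, 2,000 rows are scanned where they live, but only two floats travel to the coordinator.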
Maximizing storage I/O
Some companies implement their own disk caching and I/O operations at the O/S level using streaming or block transfer approaches.
Automatic indexing algorithms, either at load time, processing time, or both
Most new-breed players do some sort of internal indexing, either during the initial data load or as queries come in. Indexes and relationships can be inferred from the questions being asked and subsequent results optimized accordingly. Bitmap indexing on low-cardinality fields is also a well-established technique that can yield significant performance advantages, especially in columnar architectures. In large data warehouses, data integrity is typically enforced during the ETL phase. Warehouse schemas are also typically not very normalized and consist mainly of surrogate keys, so there are no complex relationships and no real-time external data insertions to check against them.
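The bitmap-index trick is easy to show in miniature. In this sketch (columns and values are hypothetical; real engines store compressed bitmaps, not Python ints), each distinct value of a low-cardinality column gets one bit per row, so an AND of two predicates collapses into a single bitwise operation:

```python
# Bitmap index over low-cardinality columns: one bitmap per distinct
# value, one bit per row. Python's arbitrary-precision ints stand in
# for the bitsets here.

regions  = ["EMEA", "APAC", "EMEA",   "EMEA", "APAC"]
statuses = ["open", "open", "closed", "open", "closed"]

def build_bitmap(column, value):
    bm = 0
    for row, v in enumerate(column):
        if v == value:
            bm |= 1 << row       # set this row's bit
    return bm

emea  = build_bitmap(regions, "EMEA")     # bits set for rows 0, 2, 3
open_ = build_bitmap(statuses, "open")    # bits set for rows 0, 1, 3

# WHERE region = 'EMEA' AND status = 'open' is one bitwise AND:
hits = emea & open_
matching_rows = [r for r in range(len(regions)) if hits >> r & 1]
```

The lower the cardinality, the fewer bitmaps are needed, which is why the technique pays off on fields like region or status rather than on unique keys.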
Proprietary optimization of SQL queries
SQL queries for business intelligence can be aggressively optimized, even in an ad-hoc environment. BI queries are typically easy to categorize, analyze, and consequently optimize.
Custom hardware solutions
SMP, grid, or MPP based; shared-nothing or shared-everything implementations. At odds here are two different "schools" of technical thought: those who feel it is more efficient to distribute processing power among independent networked "nodes" (the MPP shared-nothing approach) and those who believe it is better to share massive processing power and possibly storage (the SMP shared-everything approach). There are technical and cost benefits and drawbacks to both approaches, as is usually the case in the engineering realm. For a great article on this topic, go to http://ianalyze.net/2009/02/shared-nothing-shared-something-shared.html
Compression
Compression can offer significant storage and load-time benefits. The sweet spot is being able to query compressed streams directly, without incurring the CPU overhead of decompression.
Which "secret sauce", or mix thereof, will work best? The answer is always a resounding "it depends". It depends on the business case, on the data, on the economics, and on the time frame. It depends on a lot of things, and there are few "canned" answers in this business. In my next blog post, I will examine how XSPRADA fits into this story and what they do a little differently in the world of analytical engines.