Tuesday, July 28, 2009

The Canary in the Gold Mine?

I’ve been claiming for a while that data mining and predictive analytics (PA) were the new hills to conquer in BI, and this morning the news came out that IBM had plopped down big money for SPSS. IBM is also investing R&D dollars in ways to manipulate data directly while it’s encrypted and/or compressed. This research fascinates me because I believe it will be key to SaaS acceptance, where security is still a significant push-back for obvious reasons. It means analytics might actually have a future on the cloud. And that’s important, IMHO, because the cloud allows for significant progress in the UX systems required to drive mining engines efficiently: the kind of improvements that cannot be generated and deployed quickly enough with fat-client implementations. I’m thinking of really interesting things like www.spezify.com, for example.
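To make the compressed-data idea concrete, here’s a toy sketch (my own illustration, not IBM’s actual research) of computing an aggregate directly on run-length-encoded data, without ever decompressing it:

```python
# Toy illustration: run an aggregate directly on run-length-encoded
# (compressed) data instead of decompressing it first.
def rle_encode(values):
    """Compress a list into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def rle_sum(runs):
    """Sum the original data by operating on the runs themselves."""
    return sum(v * n for v, n in runs)

data = [3, 3, 3, 7, 7, 1, 1, 1, 1]
runs = rle_encode(data)           # [(3, 3), (7, 2), (1, 4)]
assert rle_sum(runs) == sum(data) # same answer, no decompression
```

Obviously real research in this space (especially on encrypted data) is far more sophisticated, but the principle is the same: the engine never has to materialize the raw bytes to answer the question.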

Another interesting trend is pushing analytical capabilities deep into the database engine either via stored procedures or user-defined functions in one or more programming languages (much like .NET inside SQL Server, for example). All this leads me to believe that insightful BI players have been turning their guns on solving the next big pain point of BI which is, IMHO, data mining and predictive analytics. This embedded capability relates to the deep kind of analytics I once blogged about in the context of Greenplum’s MAD paper.
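As a rough sketch of what “analytics inside the engine” looks like in practice, here’s a custom aggregate registered directly into SQLite from Python (a stand-in for the .NET-in-SQL-Server scenario; the table and function names are mine):

```python
import math
import sqlite3

# Register a custom aggregate inside the engine so the analytic runs
# next to the data, instead of shipping rows out to a client.
class StdDev:
    def __init__(self):
        self.vals = []
    def step(self, x):
        if x is not None:
            self.vals.append(x)
    def finalize(self):
        n = len(self.vals)
        if n < 2:
            return None
        m = sum(self.vals) / n
        return math.sqrt(sum((v - m) ** 2 for v in self.vals) / (n - 1))

con = sqlite3.connect(":memory:")
con.create_aggregate("stdev", 1, StdDev)
con.execute("CREATE TABLE sales (amount REAL)")
con.executemany("INSERT INTO sales VALUES (?)", [(10,), (12,), (14,)])
(result,) = con.execute("SELECT stdev(amount) FROM sales").fetchone()
# result == 2.0 (sample standard deviation of 10, 12, 14)
```

The point is that the analytic becomes just another function the query planner can invoke where the data lives.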

So does this mean we’re all done with OLAP? Not likely, but I think a certain peak has been reached where OLAP has become “bearable”. I don’t really have a 3-5 year “future outlook” on OLAP at this point. Is it still hard to cube and do MDX? Yes. Is it still a pain in the behind to setup large SSAS analytics? You bet. Is setting up a production version of Pentaho’s Mondrian ROLAP for the faint of heart? Not exactly. But there are now multiple alternatives out there in both hardware (faster COTS components, FPGAs, GPUs, MPP) and software (columnar, ALGEBRAIX) realms.

Our own ADBMS at XSPRADA is designed and tuned specifically for OLAP workloads in its present form. Products such as ours have helped “commoditize” OLAP work by shifting design and pre-structuring efforts (cubing, slicing and dicing) from the user (DBA) to the software itself. This is done automatically, based on incoming queries. There is no need to configure cubes, mixed workloads are supported, and all the user really has to do is ask questions. It’s that simple really. Let the software worry about the darn cubes!

So I guess my point is, if there are people still struggling (read: losing time and money) with OLAP in the enterprise, I have to say it’s because they’re either poorly advised or simply not opening their eyes to the new tools and techniques currently available. At this point OLAP pain is no longer a necessity. It’s an uneducated choice. From a technical standpoint, it has been addressed. Let’s move on to the next problem please. This is why I think the industry is poised to tackle another challenge now, namely data mining and predictive analytics. Even Curt Monash, in a recent blog post about the SPSS acquisition, writes:

So far business intelligence/predictive analytics integration has been pretty minor, because nobody’s figured out how to do it right, but some day that will change. Hmm — I feel another “Future of … ” post coming on.

Sorry Curt, I beat you to it :)

Mining is a totally different segment of the business intelligence endeavor. When you do OLAP, you’re asking “tell me what happened and why”. When you do mining, you have no clue what happened, much less why. In mining you’re asking “tell me what I should be looking at” or “tell me what’s interesting in this data”. And predictively, you’re asking “tell me what’s likely to happen” – as in, show me the crystal ball. Mining is not a pre-structured, pre-indexed kind of “cubing” world. It’s an ad-hoc discovery process. It’s iterative, much like the way a human brain functions when discovering information and trying to make sense of it. This “human-like” behavior is actually one of QlikView’s usability pitches. In mining, the relational model is a hindrance, not an asset, because relationships are not necessarily canned or static. Predictive analytics is more of an art than a science as well. These concepts don’t fit nicely into pre-structured, tabulated formats.

Additionally, mining and PA are creative endeavors (whereas OLAP is not). This is why it’s important to let users define their own “stuff” so they can trial-and-error through the problem. Conventional database engines don’t support this type of workload elegantly. It’s simply not “structured” nicely like OLTP or OLAP. You can’t easily (or cost-effectively) try, erase and re-start with conventional engines. They're not forgiving.
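To illustrate what I mean by iterative discovery, here’s a minimal 1-D k-means sketch (purely illustrative, nothing to do with any particular vendor’s engine): instead of looking up an answer in a pre-built structure, the algorithm refines its answer pass after pass until the clusters stabilize:

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Minimal 1-D k-means: iteratively refine cluster centers with no
    structure imposed on the data up front -- discovery, not lookup."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest current center...
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # ...then move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 9.9, 10.1, 10.0]
centers = kmeans_1d(data, 2)   # converges to roughly [1.0, 10.0]
```

Nobody told the algorithm “there are two groups around 1 and 10”; it found that structure by trial and refinement, which is exactly the workload shape conventional engines don’t support elegantly.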

So what’s needed are systems that can first intelligently process data upstream in ELT mode, because acquiring statistics on incoming data (at varying rates) is an important step for analytics. XSPRADA’s engine starts analyzing data statistically upon initial presentation. More importantly, it keeps doing so automatically, in real time, and continuously via comprehensive optimization. This is a unique feature that causes the system to continuously re-evaluate system resources against queries and data to seek out additional or more effective optimizations.
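XSPRADA’s internals are proprietary, but as a generic sketch of the idea of keeping statistics current as data streams in, Welford’s online algorithm updates count, mean and variance one value at a time, with no second pass over the data:

```python
class OnlineStats:
    """Welford's online algorithm: running count/mean/variance updated
    incrementally, so statistics stay current as data arrives."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of everything seen so far."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = OnlineStats()
for x in [2.0, 4.0, 6.0]:   # values arriving one at a time
    stats.update(x)
# stats.mean == 4.0, stats.variance == 4.0
```

The appeal for analytics is that the engine’s picture of the data is never stale, no matter how fast rows arrive.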

Next, you need systems that can tell you where NOT to look. Because in this type of work, pertinent data is often clustered in very specific areas (as in 5% of 100TB, perhaps), and user questions tend to hit within small percentages of those clusters. Yes, there are always exceptions, but generally speaking, that’s what happens. So what you DON’T want are systems that spend a lot of time scanning boatloads of data (needle in the haystack). What you need is intelligent software that can quickly eliminate vast areas of informational “no-man’s land” based on incoming queries. In such a problem space, throwing additional monies at ever more powerful metal is a self-defeating approach. It’s the software, stupid! :)
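One well-known way to tell an engine where not to look is a min/max “zone map”: record the value range of each block of data, then skip any block whose range can’t possibly satisfy the query. Here’s a hypothetical sketch (not XSPRADA’s actual mechanism):

```python
def build_zone_map(blocks):
    """Record (min, max) per block so whole blocks can be skipped later."""
    return [(min(b), max(b)) for b in blocks]

def scan_with_pruning(blocks, zone_map, lo, hi):
    """Scan only blocks whose [min, max] range overlaps [lo, hi]."""
    hits, scanned = [], 0
    for block, (bmin, bmax) in zip(blocks, zone_map):
        if bmax < lo or bmin > hi:
            continue                 # entire block eliminated, never read
        scanned += 1
        hits.extend(v for v in block if lo <= v <= hi)
    return hits, scanned

blocks = [[1, 2, 3], [50, 55, 60], [7, 8, 9]]
zm = build_zone_map(blocks)
hits, scanned = scan_with_pruning(blocks, zm, 50, 60)
# hits == [50, 55, 60]; only 1 of the 3 blocks was actually scanned
```

Notice that the win comes entirely from the software’s bookkeeping, not from faster hardware, which is exactly the point.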

As it turns out, XSPRADA’s ALGEBRAIX technology is very good at eliminating "useless" (read: at a given time) data spaces. Not only that, but it also shines at inferring subtle relationships between different entities. The kind of relationships a human wouldn’t even think of asking about on her own. It’s also very good at recognizing patterns (both in queries and targeted result sets).

In a way, you would expect that a system built on a pure mathematical foundation would be particularly well suited to data mining workloads. And it sure is. This is the beauty of having a “wide” and rich enough technology that is as easily and readily applicable to a multitude of different BI problems. It means you don’t need to re-invent the wheel or re-architect your system every time a new problem space opens up. And that, in the business intelligence technology world, is a rare find indeed.


  1. Hi Jerome-

    Good assessment of the challenge and the opportunity. I have been trying to review these blogs and distill them into the true market opportunity. Any ideas on industries?

  2. Thanks for reading Steve. I'm sorry I'm not sure I understand your question though.