Tuesday, July 28, 2009

How I Learned to Love Mondrian (confessions of a WISA guy)

I’ve been playing with Pentaho’s Mondrian on and off for almost a year now. I have to say, every time I mess with that stack I am more and more impressed by its richness and capabilities. And twelve months ago, when I started learning it, I was what you could call “severely LAMP-challenged”. I’ve sure made a lot of progress since then and wanted to talk about it, figuring it might help other Microsofties out there needing (or wanting) to put a toe in these mysterious LAMP/OSS waters.

The first thing I ever did with Mondrian was figure out how to install it on a Windows platform. The reasons were threefold. First, we didn’t have appropriate Linux hardware/software in house at the time. Second, I have way more experience on Windows so it’s a lot easier for me. And third, I wanted to do it locally and avoid dealing with cross-platform bridging for the moment (our ODBC drivers are Windows-only as well). Path of least resistance is an engineering mantra in my book.

Lucky for me I had worked with Java and Apache Tomcat in the dot-com days, so I had no trouble pulling and installing the JRE/JDK and the web server itself (which comes as a Windows service). Next, I dropped the Mondrian WAR file into the Tomcat webapps folder, which caused it to be automatically deployed as a web application. Way easier than deploying ASP.NET applications (but you didn’t hear this from me).

Then, I fired up the Mondrian landing page, clicked on the JPivot link and, of course, kaboom. Yes, without a JDBC driver, Mondrian is not a happy camper. It took me a little longer to figure out the Sun JDBC-ODBC bridge and how to plug the corresponding connection string into the numerous Mondrian files to replace the default connections there (which are all for MySQL, if I recall).
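For the record, the connect string I ended up plugging in looked roughly like this (a sketch from memory; the exact files and property names depend on your Mondrian version, and the DSN name is whatever you defined in the Windows ODBC applet):

```properties
# Mondrian connect string pointed at the Sun JDBC-ODBC bridge
# instead of the default MySQL connection (DSN name is my own;
# normally this is all one semicolon-separated line).
Provider=mondrian;
Jdbc=jdbc:odbc:MondrianFoodMart;
JdbcDrivers=sun.jdbc.odbc.JdbcOdbcDriver;
Catalog=/WEB-INF/queries/FoodMart.xml
```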

The Mondrian documentation isn’t great, but if you Google long enough you can usually find some other poor slob with a similar problem and, with luck, published solutions online. [Side note: I once had the nerve to email Julian Hyde, their Chief Architect, about some technical question. He abruptly suggested I not bother him and use the “community” forums instead.] Unfortunately those forums are often useless for non-enterprise (read: non-paying) users. There are flurries of unanswered questions and problems up there. This is a generic OSS problem, I suppose. You get what you pay for :)

So finally I had my bridge set up, along with a DSN called MondrianFoodMart (the default) pointing at the default Access database (distributed with Mondrian). And now I was able to fire up Mondrian on top of the database, do a couple of drills, and run a couple of MDX queries. Bliss.

Next, I took that Access database and exported it to CSV format with corresponding DDL. This is the way to feed the XSPRADA engine. I fired up our RDM/x server and ran the DDL scripts, then re-defined the MondrianFoodMart DSN to point at us via our ODBC driver (32-bit only; 64-bit won’t fly with the bridge, painful lessons learned…). Reloaded the Mondrian page, and voilà! Mondrian was now talking to RDM/x and displaying the Sales cube.
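To give an idea of what the engine ingests, the DDL side is plain vanilla CREATE TABLE statements matching the CSV files (a hand-trimmed sketch here, not the actual FoodMart export verbatim):

```sql
-- Trimmed sketch of the kind of DDL generated from the Access export.
CREATE TABLE store (
    store_id    INTEGER,
    store_city  VARCHAR(30),
    store_state VARCHAR(30)
);

CREATE TABLE sales_fact (
    product_id  INTEGER,
    store_id    INTEGER,
    store_sales DECIMAL(10,4)
);
```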

One point of the exercise was being able to show Mondrian OLAP on top of our database. Another was being able to show our database’s behavior in time as more and more queries come in (hint: it gets faster). Now, with Mondrian, this is a little tricky because the platform is heavily cache-based. Mondrian shoots initial queries at a relational system and proceeds to cache heavily as it aggregates results. So the more you use it, the more it caches. Obviously the reason for this is that Mondrian is designed to run on top of relational databases, and not “OLAP-intelligent” engines such as ours. It has to translate the MDX into straight SQL queries every time, as it fills its caches initially. Nevertheless it does re-hit the database as needed when you start slicing and dicing on new dimensions or facts, as one would expect. So you can actually see our engine’s “dinosaur tail” behavior as I once described in a previous post.

Now, the MondrianFoodMart database is fine for setting up the stack, but it’s not particularly interesting as far as data volume goes if you’re in the VLDB space. More recently, I attempted to set up the TPC-H/SSB sample data under Mondrian, meaning I tried to manually create the XML defining an SSB cube. Mondrian uses a whole, fairly complex XML language to define and connect fact tables (measures) with associated dimensions. They have a GUI tool called Workbench, but I could never get it to work on my system (and didn’t have enough time to keep messing with it). With online help and the FoodMart.xml sample file, I was able to get a basic cube up in about a day. Nothing fancy, but now I can OLAP into arbitrarily large data sets, and that’s a good thing.
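For the curious, the schema XML looks something like this (a stripped-down sketch modeled on the FoodMart.xml sample; table and column names here are illustrative, and a real SSB cube would carry more dimensions and measures):

```xml
<Schema name="SSB">
  <Cube name="Sales">
    <Table name="sales_fact"/>
    <Dimension name="Store" foreignKey="store_id">
      <Hierarchy hasAll="true" primaryKey="store_id">
        <Table name="store"/>
        <Level name="Store City" column="store_city" uniqueMembers="false"/>
      </Hierarchy>
    </Dimension>
    <Measure name="Store Sales" column="store_sales" aggregator="sum"
             formatString="#,###.00"/>
  </Cube>
</Schema>
```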

As cool as it is seeing our stuff run under Mondrian, I always dreamed of doing the same thing under Excel (as in 75% market share; yeah, I want to support that, please). Until recently, I thought this wouldn’t be possible before we implemented MDX in the engine, but then I saw the light. It is called the SimbaO2X connector, and it rocks!

[Start Commercial] Did I mention how much I love this company Simba? They pretty much wrote the book on data connectivity. Within a day they got me a 30-day trial version of their O2X offering, no questions asked. And follow-up to boot. Their stuff works, and they know how to take care of people. What a concept! [End Commercial]

This SimbaO2X puppy lets ODBO clients (like Excel) talk to XML/A OLAP servers (like Mondrian). Note that there is a similar offering from Pentaho called Pentaho Spreadsheet Services. It carries a small yearly license fee, from what I understand. Supposedly you can email Pentaho sales for additional information and a local contact. I’m still waiting for their reply. Hey, it’s OSS… Did I mention you get what you pay for?

Either way, the relevant fact is that the SimbaO2X connector works without a hitch. I am finally able to create and manage pivot tables from Excel, talking to Mondrian (via XML/A), talking to RDM/x (via ODBC)! This is the bomb! I really need to get a deeper understanding of MDX capabilities now. But the more I learn about it the more impressed I get, and the better demos I can do.
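To give a flavor of what flows from the pivot table through SimbaO2X into Mondrian, here is a hand-written MDX query against the stock FoodMart Sales cube (sliced by store city; nothing fancy):

```mdx
// Unit and store sales by city, for Q1 1997 (FoodMart Sales cube)
SELECT
  {[Measures].[Unit Sales], [Measures].[Store Sales]} ON COLUMNS,
  {[Store].[Store City].Members} ON ROWS
FROM [Sales]
WHERE ([Time].[1997].[Q1])
```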

The Canary in the Gold Mine?

I’ve been claiming for a while that data mining and predictive analytics (PA) were the new hills to conquer in BI, and this morning the news came out that IBM had plopped down big money for SPSS. IBM is also investing R&D dollars in ways to manipulate data directly while encrypted and/or compressed. This particular research fascinates me because I believe it will be key to SaaS acceptance, where security is still a significant push-back for obvious reasons. It means analytics might actually have a future in the cloud. And that is important IMHO because it allows for significant progress in the UX systems required to use (drive) mining engines efficiently: the kind of improvements that cannot be generated and deployed quickly enough with fat-client implementations. I’m thinking of really interesting things like www.spezify.com, for example.

Another interesting trend is pushing analytical capabilities deep into the database engine, either via stored procedures or user-defined functions in one or more programming languages (much like .NET inside SQL Server, for example). All this leads me to believe that insightful BI players have been turning their guns on solving the next big pain point of BI, which is, IMHO, data mining and predictive analytics. This embedded capability relates to the deep kind of analytics I once blogged about in the context of Greenplum’s MAD paper.

So does this mean we’re all done with OLAP? Not likely, but I think a certain peak has been reached where OLAP has become “bearable”. I don’t really have a 3-5 year “future outlook” on OLAP at this point. Is it still hard to cube and do MDX? Yes. Is it still a pain in the behind to setup large SSAS analytics? You bet. Is setting up a production version of Pentaho’s Mondrian ROLAP for the faint of heart? Not exactly. But there are now multiple alternatives out there in both hardware (faster COTS components, FPGAs, GPUs, MPP) and software (columnar, ALGEBRAIX) realms.

Our own ADBMS at XSPRADA is designed and tuned specifically for OLAP workloads in its present form. Products such as ours have helped “commoditize” OLAP work by shifting design and pre-structuring efforts (cubing, slicing and dicing) from the user (DBA) to the software itself. This is done automatically, based on queries coming in. There is no need to configure cubes, mixed workloads are supported, and all the user really has to do is ask questions. It’s that simple, really. Let the software worry about the darn cubes!

So I guess my point is, if there are people still struggling (read: losing time and money) with OLAP in the enterprise, I have to say it’s because they’re either poorly advised or simply not opening their eyes to new tools and techniques currently available. At this point OLAP pain is no longer a necessity. It’s an uneducated choice. From a technical standpoint, it has been addressed. Let’s move on to the next problem please. This is why I think the industry is poised to tackle another challenge now, namely data mining and predictive analytics. Even Curt Monash in a recent blog about the SPSS acquisition writes:

“So far business intelligence/predictive analytics integration has been pretty minor, because nobody’s figured out how to do it right, but some day that will change. Hmm — I feel another ‘Future of …’ post coming on.”

Sorry Curt, I beat you to it :)

Mining is a totally different segment of the business intelligence endeavor. When you do OLAP, you’re asking “tell me what happened and why”. When you do mining, you have no clue what happened and much less why. In mining you’re asking “tell me what I should be looking at” or “tell me what’s interesting in this data?” And predictively, you’re asking “tell me what’s likely to happen” – as in, show me the crystal ball. Mining is not a pre-structured, pre-indexed kind of “cubing” world. It’s an ad-hoc discovery process. It’s iterative. Much like the way a human brain functions when discovering information, and trying to make sense of it. This “human-like” behavior is actually one of QlikView’s usability pitches. In mining, the relational model is a hindrance, not an asset, because relationships are not necessarily canned or static. Predictive analytics are more of an art than a science as well. These concepts don’t fit nicely in pre-structured, tabulated formats.

Additionally, mining and PA are creative endeavors (whereas OLAP is not). This is why it’s important to let users define their own “stuff” so they can trial-and-error through the problem. Conventional database engines don’t support this type of workload elegantly. It’s simply not “structured” nicely like OLTP or OLAP. You can’t easily (or cost-effectively) try, erase and re-start with conventional engines. They're not forgiving.

So what’s needed, first, are systems that can intelligently process data upstream in ELT mode, because acquiring statistics on incoming data (at varying rates) is an important step for analytics. XSPRADA’s engine starts analyzing data statistically upon initial presentation. More importantly, it keeps doing so automatically, in real time and continuously, via comprehensive optimization. This is a unique feature that causes the system to continuously re-evaluate system resources against queries and data, seeking out additional or more effective optimizations.

Next, you need systems that can tell you where NOT to look. Because in this type of work, pertinent data is often clustered in very specific areas (as in 5% of 100TB, perhaps). And user questions tend to hit within small percentages of those clusters. Yes, there are always exceptions, but generally speaking, that’s what happens. So what you DON’T want are systems that spend a lot of time scanning boatloads of data (needle in the haystack). What you need is intelligent software that can quickly eliminate vast areas of informational “no-man’s land” based on incoming queries. In such a problem space, throwing additional monies at ever more powerful metal is a self-defeating approach. It’s the software, stupid! :)

As it turns out, XSPRADA’s ALGEBRAIX technology is very good at eliminating “useless” (read: at a given time) data spaces. Not only that, but it also shines at inferring subtle relationships between different entities: the kind of relationships a human wouldn’t even think to ask about on her own. It’s also very good at recognizing patterns (both in queries and targeted result sets).

In a way, you would expect that a system built on a pure mathematical foundation would be particularly well suited to data mining workloads. And it sure is. This is the beauty of having a “wide” and rich enough technology that is easily and readily applicable to a multitude of different BI problems. It means you don’t need to re-invent the wheel or re-architect your system every time a new problem space opens up. And that, in the business intelligence technology world, is a rare find indeed.

Monday, July 27, 2009

Who leaves a country packed with ponies to come to a non-pony country?

Usually I think of myself as a “hot-shot” when it comes to XSPRADA technology and its applications. This is because I’ve been involved with it for ten years, and that kind of history builds bonds. In a word, having lived and breathed it for so long, I’m severely biased, but at least I’m aware of it. It’s a completely different story when you talk to a user who also happens to be biased from experience running the stuff to solve real problems. That’s when you hear praise that makes you step back and go “wow, we really do shine here above and beyond”. There is nothing sweeter than an adamant customer evangelist.

One such person we shall call Tim (why not, since it's his real name). He works for a major DOD contractor out here in California. Tim asked me to withhold his company’s name for obvious reasons. He’s been running POC projects using XSPRADA technology for years. As a matter of fact, Tim once ran a real-time CEP version of our engine (which can handle both real time and historical input) for a demo bid project he needed to put together where no other vendor could come close.

Tim is the ultimate engineer’s engineer and one of the smartest folks in the “information” field I’ve ever met. He’s got experience galore and has been around the block a few times. Currently there are other groups in Tim’s shop running the XSPRADA engine for other purposes, and he keeps abreast of those POCs as well. I could tell you what they entail, but then I’d have to kill you. Suffice it to say that the engineering being done there would blow most people’s minds (as in, holy cow, we're actually doing this?!?).

It turns out Tim is biased as well, but he’s biased from a user perspective. This is “been-there-done-that” advocacy. And that, in my book, is far more compelling than any argument coming from an insider like myself. Although I don’t know too many people who can explain and position the technology as well as I can (I’m so modest, no pictures please!), Tim offered the following analysis recently, and I thought it was so “perfect” I had to reproduce it here (bolding my own):

“One of the distinctions I’ve been using lately to explain the difference between set based data processing and most everything else (row based, column based, partition based,…) is that most other DBs are based on defining somewhat arbitrary bins of data of fixed size or dimension (tables with fixed columns in RDBMs, column collections for things like Vertica, et al., and chunks in Google’s BigTable designs, to name a few). Then there is significant overhead to partition the incoming/outgoing data to fit into these fixed containers. Inevitably, any operation on these artificial partitions will include wasted processing or I/O on irrelevant data that just “happens” to live in the affected partitions. This is a huge waste of time and resources. In addition, these bins are continually reused by means of destructive updates which require them to be locked during transactions to avoid data corruptions. This is the other main source of waste in that significant delays are now imposed not only on the relevant data involved in the operation, but also on collateral data that might be holding up other operations unnecessarily. These two effects are mutually opposed: larger partitions would help the I/O problem, but at the expense of exacerbating the locking problem. And vice-versa. By contrast set based data systems, like XSP, use completely variable sized containers (the sets) dynamically partitioned based on operational relevance and not on any predetermined partition sizes. In this way, the amount of irrelevant data moved across the I/O boundary for any data operation is significantly reduced. And because these sets are immutable, there is no locking interference with other concurrent data operations.”

To put it in Southern California speak, I was like, wow, this dude really gets it! And there isn’t much I can add to Tim’s conclusions. I bow to the completeness of his understanding; his analysis stands on its own.

The context of our exchange was about pre-structured or “canned” mechanisms used by other database engines versus the flexibility of our approach. It’s what I naively call “bucketizing”. And this topic is of course related to the ADR functionality I was discussing in my last post. But it also pertains to the “schema agnosticism”, parallelism, and ACID aspects of the database I’ve mentioned in the past.

The XSPRADA engine dynamically adapts to queries and data. It avoids the inherent rigidity prevalent in all other technologies. Yes it’s compelling to handle analytics via columns, but that’s only one of many ways you can address the problem. And if your technology is “columnar” in nature, it’s the ONLY way you can address the problem. You’ve put all your eggs in one basket. At the end of the day, you’re still looking at the world in a tabular format (which happens to be vertical). So it’s what I call the one-trick pony approach.

There’s nothing wrong with one-trick ponies if you need to solve a very specific business problem quickly and efficiently. But from a holistic business perspective, it’s a scary proposition. If you’re running an enterprise, you want flexibility. You want the ability to address problems as they come up using all available means at your disposal. You’re looking for a wide array of tools and methods to win battles, not a single weapon system. And this is what XSPRADA technology offers: “the ability to apply the right technique for any question for any data at any time”.

If Tim is concerned with waste and inefficiency, it’s because his shop deals with terabyte and petabyte volumes of data with limited shelf-life. In other words, waste is not an option. This isn’t about airline transactions messing up your trip or bank accounts being debited incorrectly, by the way. This is about national security and people living or dying in real-life tactical situations.

In applications like this, one-trick ponies don’t cut the mustard. And this is why Tim and several other groups in his company have been looking at XSPRADA technology for years. There simply isn’t anything out there that can meet their requirements, and believe me they’ve tried all the usual suspects. To Tim and his colleagues, the unfair competitive advantage XSPRADA can deliver to their company (and clients) is well worth the risk of evaluating technology that’s a little out of the ordinary.

Friday, July 17, 2009

ADR and how I got kicked out by Kickfire

I had the opportunity to brief Merv Adrian (major BI industry analyst) recently about XSPRADA. In discussing competitive technological differentiation, I highlighted the three major points that make XSPRADA uniquely stand out in a sea of “new-breed” ADBMS vendors, namely our algebraic engine (aka ALGEBRAIX), adaptive data restructuring (ADR), and temporal invariance. Those of you who have honored me with your readership in the past will no doubt be familiar with those terms and concepts. (Side note: Merv is the one who coined the ADBMS term that I now shamelessly borrow constantly -- thank you Merv!)

I want to focus a little bit on ADR in this post because in the midst of our discussion, Merv asked me a really good question about it. He said “if you’re busy doing ADR, and more and more queries come in and more and more users get on board, what will the impact be on performance?”. Excellent point! Quite honestly, no one had ever asked me this before so after our call, I did a little more research and came up with a few more questions on my own. All of which I’d like to discuss here.

First of all, to recap: ADR is the process of adaptively restructuring data both logically (say, the structures in RAM) and on storage (disk), based on the nature of the queries coming in. What does this mean? Well, for example, if a query is clearly pulling only certain columns from a table (as is typically the case in OLAP), ADR will pull those columns out and lay them out optimally on disk for more efficient access. ADR could also include indexing (as in bitmap indexes, in low-cardinality cases) and any other means at its disposal to optimize questions pertaining to these columns. Sound familiar? It should, as this is basically the principle behind columnar systems. However, ADR doesn’t stop there.

It may, for example, decide that sharding row blocks is a more efficient strategy given a particular query pattern and set out to do just that as well. It may decide that duplicating certain pieces of information on given disks is more I/O efficient. In memory, it may decide to implement different indexing schemes depending on the nature of the queries. In short, ADR has absolute “carte blanche” to take every means at its disposal to optimize the system in real time.

Unlike most other ADBMSs out there, the XSPRADA engine is a living, breathing entity constantly striving for optimal performance. And it has more than a few tools at its disposal (in other words, not a one-trick pony, compliments of the underlying mathematics). So the question is indeed legitimate: how does this impact performance, if at all?

The answer, unsurprisingly, is: it depends. First, it’s important to understand the design principle behind ADR, which is essentially “crowdsourcing”. The philosophy is that the more people hit the database from all angles, the better the chance of being able to optimize it. From a technical perspective, overload is always a possibility, as with any other system. If too much is submitted at a given time, the system could theoretically run out of resources and impact performance. But when properly configured (balanced) with proper amounts of storage, the system should not degrade as additional users and queries come in.

Note that concurrent queries are also beneficial to the system. This is because data streams can be shared between query processes in parallel, thereby reducing disk I/O. Additionally, ADR is not user- or query-specific, so there is no risk of one ADR action resulting from Query1 “clobbering” another ADR action from Query2. When sharing is possible, it is implemented and maintained in the mathematical model (algebraic integrity). But all queries from all users enter the “algebraic space” together in one big pool. This means everyone’s actions benefit everyone else. The XSPRADA engine is a very populist one :)

On a completely different topic, I wanted to relate a funny incident that happened to me on the way to the Forum recently. Well, okay, it wasn’t exactly a forum per se, but rather the newly minted Kickfire on-demand trial process. They call it the Cloud-based Trial.

I enthusiastically signed up for the trial and got an assigned time-slice for today. Within minutes, I get the following email from Kickfire’s Karl Van den Bergh, who is Kickfire’s Vice President of Marketing and Business Development (phew! Talk about long names to match long titles).

“Thanks for your interest in Kickfire. Our trialing system is reserved for prospects and partners. As our companies are somewhat competitive we are not able to give access to XSPRADA at this time.”

Wow. Cold man.

But next morning, I get this email from the company’s trial support team:

“Congratulations! This email confirms that your Kickfire On-Demand Trial will begin at 10:00am (PST) on 07/17/2009.”

Then thirty minutes later I get this:

“jerome, Your reservation has been deleted. Reservation #sc14a5eb4e82d0dc.”

By that time I’m thinking okay, somebody at Kickfire Trial Support finally got their behinds kicked (pun intended) for daring to confirm my trial session. Apologizing for this confusion is Karl again:

"Apologies for the registration mix up this morning - we have a number of system administrators who weren't in synch."

Yeah, I’d say. But wait, there’s more if you order now! Early this morning, after my session was supposed to end, I get a phone call from Kickfire asking me how my trial went! I couldn’t help but blurt out “you guys are really confused”. Then I explained that Karl had branded me persona-non-grata at which point I was promptly dropped like a bad case of H1N1.

Now, don’t get me wrong. I love this Kickfire appliance concept. And I have a positive informative relationship with technical folks there who are nothing if not extremely competent (and nice people to boot). And although I cannot play with the Kickfire appliance either locally or remotely, from what I’ve heard and read, these guys have a top notch team, great technology, a seasoned CEO and attractive market positioning.

In thinking about this, my initial reaction was that I had been stupidly naïve. Honestly, it never occurred to me that Kickfire would ban any “competitor” (or anyone else, for that matter) from remotely playing with their software. The reason is that we don’t do this at XSPRADA. As a matter of fact, anyone can pull our bits for free from our website, including Kickfire (which already has). So to me this has a strange, secretive, “we have something to hide” feel which, given what I know about Kickfire, really took me by surprise.

But it’s true I have this weird concept of openness that not many other vendors seem to share, and I’m happy to put our stuff in competitive hands (after all, the customers certainly will!) anytime and get feedback from the experience, negative or not. In my experience, few competitors will go out and trash another vendor’s offering, much less try to reverse-engineer it (at least not in this industry). Maybe I’m foolish.

However, it’s true I don’t have a “marketing” bone in my body. As an engineer and evangelist, I’ve always been adamant about transparency, peer review and feedback. So maybe it’s a good thing I don’t handle XSPRADA marketing, or Kickfire’s for that matter! :)

Thursday, July 2, 2009

Of Views, Cubes, Patents and Haystacks

In keeping with a past commitment to do so, I plan on staying purely technical in this posting. The first topic I want to bring up relates to database views. The second one deals with OLAP “cubing”. I want to discuss both in the context of the XSPRADA analytical engine RDM/x.

Many conventional database engines support views and materialized views. I believe Oracle actually came up with the materialized-view concept (I am not 100% sure of that), but numerous other vendors support them, including Postgres, MySQL, DB2, SQL Server, etc. In any case, the idea is simple. A view is basically a logical rendition of a SQL query. It looks and behaves like a table. Views can be logical or materialized. A logical view is really just a “pointer” to the actual data. A materialized view actually contains data and is implemented as a full-fledged database object. There are several reasons to use views.

One is performance enhancement and convenience (caching a nasty JOIN, or OLAP summaries maintained dynamically, meaning changes to the original data trigger recalculations). Another is abstraction. A third is security and access control (you can grant specific privileges on views while keeping the underlying data protected from users). The important thing to remember is that views were conceived in response to a problem, namely that database performance can really suck without them (especially in warehousing applications).
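For those keeping score at home, the distinction looks like this in a conventional engine (standard SQL for the logical view; the materialized-view syntax is Oracle-flavored here and varies by vendor):

```sql
-- Logical view: just a stored query, re-executed on every access.
CREATE VIEW city_sales AS
SELECT s.store_city, SUM(f.store_sales) AS total_sales
FROM sales_fact f
JOIN store s ON s.store_id = f.store_id
GROUP BY s.store_city;

-- Materialized view (Oracle-style): results physically stored,
-- refreshed when the underlying data changes.
CREATE MATERIALIZED VIEW city_sales_mv
REFRESH ON COMMIT AS
SELECT s.store_city, SUM(f.store_sales) AS total_sales
FROM sales_fact f
JOIN store s ON s.store_id = f.store_id
GROUP BY s.store_city;
```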

If you didn’t have inherent performance issues in conventional database engines, you wouldn’t need views to begin with. And herein lies the reason why the XSPRADA engine doesn’t support classical views: it simply doesn’t need them. The engine caches and materializes query results dynamically on the fly. As such, “materialized views” are generated on an as-needed basis. But this is true for everything else going on inside the XSPRADA engine because the adaptive data restructuring (ADR) feature dictates that all forms of optimizations be materialized for optimal logical and I/O performance on the fly as queries come in.

Who determines when this happens? The optimizer and the algebraic engine (now officially known as XSPRADA ALGEBRAIX under US Patent 11/383,477). If and when the system detects more than one query involving a particular set of joins, it materializes (and likely caches) the results for subsequent use, assuming that a request pattern has emerged. Similarly, it is always possible to “store” a complex query into a named result set using the INTO extension. For example, you can write something like:

Select p.product_name, s.store_city, sum(f.store_sales) tots from sales_facts f
Join product p on p.product_id = f.product_id
Join store s on s.store_id = f.store_id
Group by s.store_city, p.product_name
Order by tots desc INTO ds;

At that point, the ‘ds’ table contains the results and can be used independently and updated at will. Currently, the XSPRADA engine will not automatically update ‘ds’ should any change occur in the underlying tables. Regardless, ‘ds’ now becomes a regular RDM/x table like any other. There is no distinction between views and tables in RDM/x. And even though views can be defined in RDM/x, this is for sheer syntactic convenience, and using them yields no technical advantage.

On the OLAP front, the system behaves in a similar way. This is why I always say, with RDM/x you don’t need to pre-structure or “cube” your data. The engine does that automatically internally for the user based on his/her query patterns!

When you feed RDM/x a query like this:

SELECT s.store_city, sum(f.total_sales) from store s, sales_fact f
Where s.store_id = f.store_id
GROUP BY s.store_city

RDM/x automatically starts building a “cube” for it. What does that mean? The system creates a set of mathematical expressions relating the detail data to the aggregates you’ve just requested. In the process, instances of the data in intermediate and final form are realized. This city-based slice is “remembered” by the system, and any subsequent query against that slice or a variation thereof benefits from immense optimization (and a performance increase). This is how subsequent queries just get faster and faster in the XSPRADA system. And if you slice along another dimension next, say products, the same optimization and ADR are performed for that slice. If you ask enough questions involving enough dimensions, you get closer and closer to instantaneous response time. Adding a dimension may cause a slight performance dip on the first shot, but the penalty is nowhere near as painful as having to add another slice or recompute aggregates (cells) in a conventional “manual cubing” system (like Mondrian or SSAS, to name a couple).

People have asked (namely Chris Webb, an MDX guru who runs an excellent blog) if there is some sort of internal structure that makes the query run faster over time. The answer is that the re-structuring of the data, combined with the algebraic system and the methods for accessing the data internally (in memory and storage) are what makes these queries get faster with time. Another excellent question was “can you edit these structures?” And the answer is no, as the system is entirely automated and requires no user intervention. Importantly, DML operations (INSERT, UPDATE, DELETE) on the original detail data cause automatic updates of the aggregated data, as long as the aggregates continue to be expressed in terms of the original data. Simply put, the system maintains internal integrity automatically unless, of course, you change the original query.
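Concretely, something like this just works (a sketch reusing the query from above; the refresh of the internal aggregates is entirely the engine’s business):

```sql
-- New detail row lands in the fact table...
INSERT INTO sales_fact (product_id, store_id, total_sales)
VALUES (42, 7, 19.99);

-- ...and re-running the original aggregate query returns updated
-- totals, still benefiting from the previously built "cube".
SELECT s.store_city, SUM(f.total_sales)
FROM store s, sales_fact f
WHERE s.store_id = f.store_id
GROUP BY s.store_city;
```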

MDX and dynamic “view” updates are features likely to be implemented in RDM/x in the near future. Without MDX, you lack the convenience of being able to talk to RDM/x directly from clients such as Excel (or SAP for that matter, which has apparently adopted MDX now), although you can certainly query RDM/x from Excel using OLAP-style queries. MDX is obviously more powerful and convenient. But this doesn’t take away from the internal “magic” occurring inside RDM/x when handling OLAP workloads.

And I have just a few more lines to make one last point. Technically speaking, RDM/x is currently heavily biased in favor of OLAP workloads. This is not a consequence of the technology being used per se, but rather reflects marketing priorities. I sometimes meet people who try out our product to do search. For example, they’ll try to grab a few select rows from 100,000,000 rows of data. This is not where RDM/x currently shines! I always tell them: if you want to do this, you’re better off using a conventional OLTP system like SQL Server or Oracle. RDM/x is not about finding the proverbial needle in the haystack. It’s about telling you what the haystack looks like.