Thursday, July 2, 2009

Of Views, Cubes, Patents and Haystacks

In keeping with a past commitments to do so, I plan on staying purely technical in this posting. The first topic I want to bring up relates to database views. The second one deals with OLAP “cubing”. I want to discuss those in the context of the XSPRADA analytical engine RDM/x.

Many conventional database engines support view and materialized views and I believe Oracle actually came up with the concept (I am not 100% sure of that) but numerous other vendors support them including Postgres, MySQL, DB2, SQL Server, etc. In either case, the idea is simple. A view is basically a logical rendition of a SQL query. It looks and behaves like a table. Views can be logical or materialized. A logical view is really just a “pointer” to the actual data. A materialized view actually contains data and is implemented as a full-fledged database object. There are several reasons to use views.

One is performance enhancement and convenience (caching a nasty JOIN or OLAP summaries dynamically, meaning original data changes trigger recalculations). One is abstraction. Another is security and access control (you can grant specific privileges to views while keeping underlying data protected from users). The important thing to remember is that views were conceived in response to a problem, namely that database performance can really suck without them (especially in warehousing applications).

If you didn’t have inherent performance issues in conventional database engines, you wouldn’t need views to begin with. And herein lies the reason why the XSPRADA engine doesn’t support classical views: it simply doesn’t need them. The engine caches and materializes query results dynamically on the fly. As such, “materialized views” are generated on an as-needed basis. But this is true for everything else going on inside the XSPRADA engine because the adaptive data restructuring (ADR) feature dictates that all forms of optimizations be materialized for optimal logical and I/O performance on the fly as queries come in.

Who determines when this happens? The optimizer and the algebraic engine (now officially known as XSPRADA ALGEBRAIX under US Patent 11/383,477). If and when the system detects more than one query involving a particular set of joins, it materializes (and likely caches) the results for subsequent use, assuming that a request pattern has emerged. Similarly, it is always possible to “store” a complex query into a named result set using the INTO extension. For example, you can write something like:

Select p.product_name, s.store_city, sum(f.store_sales) tots from sales_facts f

Join product p on p.product_id = f.product_id

Join store s on s.store_id = f.store_id

Group by s.store_city, p.product_name

Order by tots desc INTO ds;

At that point, the ‘ds’ table will contain the results and can be used independently and updated at will. Currently, the XSPRADA engine will not automatically update ‘ds’ should any change occur to any of the underlying tables. Regardless, ‘ds’ now becomes a regular RDM/x table like any other. There is no distinction between views and tables in RDM/x. And even though views are defined in RDM/x, this is done for sheer syntactic convenience, and using them yields no technical advantage.

On the OLAP front, the system behaves in a similar way. This is why I always say, with RDM/x you don’t need to pre-structure or “cube” your data. The engine does that automatically internally for the user based on his/her query patterns!

When you feed RDM/x a query like this:

SELECT s.store_city, sum(f.total_sales) from store s, sales_fact f

Where s.store_id = f.store_id

GROUP BY s.store_city

RDM/x automatically starts building a “cube” for it. What does that mean? The system creates a set of mathematical expressions relating the detail data to the aggregates you’ve just requested. In the process, instances of the data in intermediate and final form are realized. This city-based slice is “remembered” by the system and any subsequent query against that slice or variation thereof will benefit from immense optimization (and performance increase). This is how subsequent queries just get faster and faster in the XSPRADA system. And if you slice along another dimension next, say products, for example, the same optimization and ADR is performed for that slice. If you ask enough questions involving enough dimensions, you get closer and closer to instantaneous response time. Adding a dimension may cause a slight performance dip on the first shot, but the penalty is nowhere near as painful as having to add another slice or recompute aggregates (cells) in a conventional “manual cubing” system (say like Mondrian or SSAS to name a few).

People have asked (namely Chris Webb, an MDX guru who runs an excellent blog) if there is some sort of internal structure that makes the query run faster over time. The answer is that the re-structuring of the data, combined with the algebraic system and the methods for accessing the data internally (in memory and storage) are what makes these queries get faster with time. Another excellent question was “can you edit these structures?” And the answer is no, as the system is entirely automated and requires no user intervention. Importantly, DML operations (INSERT, UPDATE, DELETE) on the original detail data cause automatic updates of the aggregated data, as long as the aggregates continue to be expressed in terms of the original data. Simply put, the system maintains internal integrity automatically unless, of course, you change the original query.

MDX and dynamic “view” updates are features likely to be implemented by RDM/x in the near future. Without MDX, you lack the convenience of being able to talk to RDM/x directly from clients such as Excel (or SAP for that matter, who has apparently adopted MDX now), although you can clearly question RDM/x via Excel using OLAP queries. MDX is obviously more powerful and convenient. This doesn’t take away from the internal “magic” occurring inside RDM/x when handling OLAP workloads.

And I have just a few more lines to make one last point. Technically speaking, RDM/x is currently heavily biased in favor of OLAP workloads. This is not a consequence of the technology being used per say but rather reflects marketing priorities. I sometimes meet people who try out our product to do search. For example, they’ll try to grab a few select rows from 100,000,000 rows of data. This is not where RDM/x currently shines! I always tell them: if you want to do this, you’re better off using a conventional OLTP system like SQL Server or Oracle. RDM/x is not about finding the proverbial needle in the haystack. It’s about telling you what the haystack looks like.

4 comments:

  1. Views are implemented for more than just performance reasons, and they probably deserve to be mentioned, as I think they are still important.

    1) Views provided query abstraction. Commonly views are defined for easy reporting because complex business cases can be represented as materialized or logical views and these views can be joined together, without business users needing to understand the complex data relationships backing the business relationships.

    2) Views can be used for security. A view could be defined like:
    select * from private_table where username_field = USERNAME().

    3) Views can be updatable, which is a very useful mechanism if an often poorly used one.

    ReplyDelete
  2. Justin, yes I believe I mentioned (1) and (2) -- As I point out we don't directly update them (yet) - I'd like to see an approach much like your FlexViews actually (if I understood them properly, which is a big "if" but seems kinda cool to me!)

    http://flexviews.sourceforge.net/about.html

    ReplyDelete
  3. The goal of Flexviews is to provide a "fast refreshable" materialized view for MySQL as long as it follows the familiar SELECT, PROJECT, AGGREGATE and JOIN pattern. This is the pattern used in your example queries, and almost any basic join between dimensions and facts can be materialized and maintained efficiently, as long as their are indexes in place to compute the deltas efficiently. In MySQL you already need those indexes to perform joins anyway, so this conditions is always met.

    More complex queries involving unions, outer joins, inline views or other more complex queries can be materialized, but they must be fully computed on refresh. They aren't locked during refresh though, so you can compute them while active queries hit the database.

    It supports point in time refresh and a continuous propagation delta method which results in low impact delta calculation and application.

    It has a programmatic interface which generates the SQL for the view. Non programmers would have a hard time working with it, which is why I want to tack a MySQL proxy interface onto the front of it so that I can accomplish things like 'CREATE MATERIALIZED VIEW' and have the stored procedures called automatically in the background.

    ReplyDelete
  4. I think this is the critical joint too:

    "They aren't locked during refresh though, so you can compute them while active queries hit the database."

    Thanks for the insight on that.

    ReplyDelete