Tuesday, May 5, 2009

MAD About You.

I haven’t had a minute to sit down and blog lately. Our upcoming “pre-release” software, slated for Cinco de Mayo, has absorbed all my efforts. I flew back from Austin, TX last week to participate in final engineering and release touches. At the same time, one of our major Defense prospects just reactivated a huge project, so it’s all hands on deck at XSPRADA these days. I love it when a plan comes together.

Nevertheless, I recently picked up a paper called “MAD Skills: New Analysis Practices for Big Data”, referenced on Curt Monash’s blog by one of its authors, Joe Hellerstein. Joe is a CS professor at UC Berkeley, if I understand correctly, and a consultant for Greenplum. I expected to read a lot of pro-Greenplum content in there, but I don’t feel the major arguments presented are specifically tied to this vendor’s architecture or features per se. What makes this paper really valuable, in my opinion, is that it “resulted from a fairly quick, iterative discussion among data-centric people with varying job descriptions and training.” In other words, it is user-driven (and not just a bunch of theoretical PhD musings) and, as such, probably an accurate reflection of what I consider to be a significant shift in the world of BI lately.

Namely, that the “user” base is shifting from the IT type to the analyst and business person type. Note I didn’t say “end user”, because the end user is still the business consumer. What’s really changing is the desire (and ability) to cut out the middleman (read: the IT “gurus”) at every step of the way, from data ingestion to results production. To put it in terms Marx would have liked, the means of production are shifting from dedicated technical resources to the actual consumers :) This is particularly true in the on-demand world, but probably increasingly true elsewhere as well. I think the paper outlines and defines this change.

“The conceptual and computational centrality of the EDW makes it a mission-critical, expensive resource, used for serving data-intensive reports targeted at executive decision-makers. It is traditionally controlled by a dedicated IT staff that not only maintains the system, but jealously controls access to ensure that executives can rely on a high quality of service.”

The starting premise of the paper is that the industry is changing in its concept of, and approach to, the EDW. Whereas in the past an “Inmon-ish” view of the EDW was predicated on one central repository containing a single comprehensive “true” view of all enterprise data, the new model (and practical reality) is really more “Kimball-ish” in the sense that the whole of the EDW comprises the sum of its parts. And its parts are disparate, needing to be integrated in real time, without expectations of “perfect data” (DQ), in an instant-gratification world. This new approach is premised on the MAD model: Magnetic, Agile and Deep.

Magnetic, because the new architecture needs to “attract” a multitude of data sources naturally and dynamically. Agile, because it needs to adapt flexibly to fast-shifting business requirements and technical challenges without missing a beat. And Deep, because it needs to support exploratory and analytical work at all altitudes (detail and big picture) and for numerous user types simultaneously.

“Traditional Data Warehouse philosophy revolves around a disciplined approach to modeling information and processes in an enterprise. In the words of warehousing advocate Bill Inmon, it is an “architected environment” [12]. This view of warehousing is at odds with the magnetism and agility desired in many new analysis settings.”

“The EDW is expected to support very disparate users, from sales account managers to research scientists. These users' needs are very different, and a variety of reporting and statistical software tools are leveraged against the warehouse every day.”

Fulfilling this new reality requires a new type of analytical engine, in my opinion, because the old ones are premised on a model that, apparently, has not delivered successfully or sufficiently for BI over time. So the question arises: what set of architectural features does a new analytical engine need in order to support the MAD model as described in this paper? And more importantly (from where I stand), is the XSPRADA analytical engine MAD enough?

To support magnetism, an engine should make it easy to ingest any type of data.

“Given the ubiquity of data in modern organizations, a data warehouse can keep pace today only by being “magnetic": attracting all the data sources that crop up within an organization regardless of data quality niceties.”

The “physical” data presented to the XSPRADA engine must consist of CSV text files; currently, CSV is the only way of presenting source data to the database. CSV is the least-common-denominator data format, which means the engine can handle pretty much any data source, provided it can be morphed into CSV. Since the vast majority of enterprise data lends itself to CSV export, that covers a fairly wide array of data sources. How easy is it to “load” data into the engine? Very easy, especially since the engine doesn’t “load” data per se but works off the bits on disk directly. All it takes is a CSV file (one per table) and a SQL DDL statement such as “CREATE TABLE…FROM <file>” to “present” the data to the engine. “The central philosophy in MAD data modeling is to get the organization's data into the warehouse as soon as possible.” I’d say this fulfills that philosophy.
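
To make that concrete, here’s a rough sketch of what “presenting” a CSV file looks like. The table, columns and file path are made up, and the DDL syntax is illustrative, not official XSPRADA syntax:

    -- Present an existing CSV file to the engine as a table.
    -- There is no separate bulk-load step: the engine works
    -- off the bits on disk directly.
    CREATE TABLE sales (
        order_id  INTEGER,
        region    VARCHAR(32),
        amount    DECIMAL(10,2)
    ) FROM 'c:\data\sales.csv';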

To support agility, an engine shouldn’t dictate means and methods of usage and must be flexible:

“Given growing numbers of data sources and increasingly sophisticated and mission-critical data analyses, a modern warehouse must instead allow analysts to easily ingest, digest, produce and adapt data at a rapid pace. This requires a database whose physical and logical contents can be in continuous rapid evolution… we take the view that it is much more important to provide agility to analysts than to aspire to an elusive ideal of full integration”

The external representation of the data on disk and the internal “logical” modeling of the data change dynamically based on incoming queries. The XSPRADA engine is “adaptive” in that sense: it constantly looks at incoming queries and at the data on disk to determine the optimal way of storing and rendering it internally. This feature is called Adaptive Data Restructuring (ADR). On the flexibility side, the XSPRADA engine is schema-agnostic, a fairly unique feature that allows users to “flip” schemas on the fly.

For example, it’s possible to present entire data sets to the engine based on an all-VARCHAR schema (i.e., make every column VARCHAR). Maybe you don’t know the real schema at the time, perhaps you don’t care about it, or perhaps the optimal schema can only be determined after some analysis is performed. Or perhaps there are inconsistencies or DQ issues in the data preventing a valid “load” against a rigid schema. Or maybe it was just easier and quicker to export all the data as string types in the CSV. In any case, the XSPRADA engine will happily ingest that data. Later on, you can CAST each field as needed into a new table on the fly and run queries against the new model, or try other models as needed. Similarly, ingestion validation is kept to a minimum by design. For example, it’s quite possible to load an 80-character string into a CHAR(3) field, something conventional databases won’t allow. The implications of this, from a performance and flexibility angle, are impressive. The XSPRADA database lends itself to internal transformation; hence it favors an ELT model, minus the “L”.
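
Here’s a hypothetical sketch of that all-VARCHAR workflow. Again, the names and exact syntax are invented for illustration:

    -- 1. Ingest everything as strings; no validation fights up front.
    CREATE TABLE raw_events (
        event_date  VARCHAR(64),
        user_id     VARCHAR(64),
        amount      VARCHAR(64)
    ) FROM 'c:\data\events.csv';

    -- 2. Once the real schema becomes clear, CAST into a typed
    --    table on the fly; the transformation happens inside the
    --    engine, ELT-style.
    CREATE TABLE events AS
    SELECT CAST(event_date AS DATE)          AS event_date,
           CAST(user_id    AS INTEGER)       AS user_id,
           CAST(amount     AS DECIMAL(10,2)) AS amount
    FROM   raw_events;

    -- 3. Query the new model; if the typing turns out wrong,
    --    flip to another schema the same way.
    SELECT user_id, SUM(amount) AS total
    FROM   events
    GROUP  BY user_id;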

“In recent years, there is increasing pressure to push the work of transformation into the DBMS, to enable parallel execution via SQL transformation scripts. This approach has been dubbed ELT since transformation is done after loading.”

And finally, to support depth, an engine should allow rich analytics, provide the ability to “focus” in and out on the data, and offer a holistic, unsegmented view of the entire data set:

“Modern data analyses involve increasingly sophisticated statistical methods… analysts often need to see both the forest and the trees in running these algorithms… The modern data warehouse should serve both as a deep data repository and as a sophisticated algorithmic runtime engine.”

A salient feature of the XSPRADA engine is its ability to handle multiple “types” of BI work at the same time. For example, it’s possible to mix OLAP, data mining, reporting and ad-hoc workloads simultaneously on the same data (and all of it) without resorting to “optimization” tricks for each mode. Similarly, the need for logical database partitioning doesn’t exist in the XSPRADA engine. Duplicating and re-modeling data islands on separate databases (physical or logical) for use by different departments is neither necessary nor recommended.

In an OLAP use case, there is no need to pre-define or load multidimensional cubes. The very act of querying consistently (meaning more than once) along fact and dimension axes causes the engine to realize that this particular section of data is being accessed “multi-dimensionally”. It then starts cubing information internally, aggregating as indicated (if needed) by incoming queries. In this mode, the engine may decide a columnar storage approach is optimal and restructure the data accordingly. In a data mining use case, the approach is likely different, because incoming queries are “incremental” (often pinpointed) and results are used to generate new queries without pre-determined patterns. The engine will likely start by eliminating vast “wasteland” areas of the data (the forest) from consideration, then proceed to optimize specific islands of interest (the trees) as they become more relevant to the queries.
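
For the OLAP case, picture a hypothetical star-schema rollup like the one below (schema invented for illustration). Issued repeatedly along the same fact and dimension axes, this is the kind of pattern that would tip the engine into cubing that region of the data:

    -- Repeated rollups along the same dimension axes signal
    -- multidimensional access; no cube was pre-defined or loaded.
    SELECT   d.year, p.category, SUM(f.revenue) AS revenue
    FROM     fact_sales f
    JOIN     dim_date    d ON f.date_key    = d.date_key
    JOIN     dim_product p ON f.product_key = p.product_key
    GROUP BY d.year, p.category;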

So overall, I think the XSPRADA analytical engine was indeed designed with “MAD-ness” from the get-go, even if the term didn’t exist years ago. It’s the approach and the philosophy that really matters. In that respect, we’re definitely headed for the MAD-house :)
