I haven’t had a minute to sit down and blog lately. Our upcoming “pre-release” software, slated for Cinco de Mayo, has absorbed all my efforts. I flew back from Austin, TX last week, where I’d been helping put the final engineering and release touches in place. At the same time, one of our major Defense prospects just re-activated a huge project, so it’s all hands on deck at XSPRADA these days. I love it when a plan comes together.
Nevertheless, I recently picked up on a paper called “MAD Skills: New Analysis Practices for Big Data”, referenced on Curt Monash’s blog by one of its authors, Joe Hellerstein. Joe is a CS professor at UC Berkeley, if I understand correctly, and a consultant for Greenplum. I expected to read a lot of pro-Greenplum content in there, but I don’t feel the major arguments presented are specifically tied to this vendor’s architecture or features per se. What makes this paper really valuable, in my opinion, is that it “resulted from a fairly quick, iterative discussion among data-centric people with varying job descriptions and training.” In other words, it is user-driven (and not just a bunch of theoretical PhD musings) and, as such, probably an accurate reflection of what I consider to be a significant shift in the world of BI lately.
Namely, that the “user” base is shifting from the IT type to an analyst and business-person type. Note I didn’t say “end-user”, because the end user is still the business consumer. What’s really changing is the desire (and ability) to cut out the middle man (read: the IT “gurus”) at every step of the way, from data ingestion to results production. To put it in terms Marx would have liked, the means of production are shifting from dedicated technical resources to actual consumers :) This is particularly true in the on-demand world, but probably pervasive elsewhere as well. I think the paper outlines and defines this change.
“The conceptual and computational centrality of the EDW makes it a mission-critical, expensive resource, used for serving data-intensive reports targeted at executive decision-makers. It is traditionally controlled by a dedicated IT staff that not only maintains the system, but jealously controls access to ensure that executives can rely on a high quality of service.”
The starting premise of the paper is that the industry is changing in its concept of, and approach to, the EDW. Whereas in the past an “Inmon-ish” view of the EDW was predicated on one central repository containing a single comprehensive “true” view of all enterprise data, the new model (and practical reality) is really more “Kimball-ish” in the sense that the whole of the EDW comprises the sum of its parts. And its parts are disparate, needing to be integrated in real time, without expectations of “perfect data” (DQ) in an instant-gratification world. This new model is premised on being MAD: Magnetic, Agile and Deep.
Magnetic because the new architecture needs to “attract” a multitude of data sources naturally and dynamically. Agile because it needs to adapt flexibly to fast-shifting business requirements and technical challenges without missing a beat. And deep because it needs to support exploratory and analytical endeavors at all altitudes (detail and big picture) and for numerous user types simultaneously.
“Traditional Data Warehouse philosophy revolves around a disciplined approach to modeling information and processes in an enterprise. In the words of warehousing advocate Bill Inmon, it is an “architected environment” [12]. This view of warehousing is at odds with the magnetism and agility desired in many new analysis settings.”
“The EDW is expected to support very disparate users, from sales account managers to research scientists. These users' needs are very different, and a variety of reporting and statistical software tools are leveraged against the warehouse every day.”
Fulfilling this new reality requires a new type of analytical engine, in my opinion, because the old ones are premised on a model which, apparently, has not delivered successfully or sufficiently for BI over time. So the question arises: what set of architectural features does a new analytical engine need in order to support the MAD model described in this paper? And more importantly (from where I stand), is the XSPRADA analytical engine MAD enough?
To support magnetism, an engine should make it easy to ingest any type of data.
“Given the ubiquity of data in modern organizations, a data warehouse can keep pace today only by being “magnetic”: attracting all the data sources that crop up within an organization regardless of data quality niceties.”
The “physical” data presented to the XSPRADA engine must consist of CSV text files. Currently, CSV is the only way of presenting source data to the database. CSV is the least-common-denominator data format, so the engine can handle pretty much any data source provided it can be morphed to CSV. Since the vast majority of enterprise data lends itself to CSV export, that covers a fairly wide array of sources. How easy is it to “load” data into the engine? Very easy, especially since the engine doesn’t “load” data per se but works off the bits on disk directly. All it takes is a CSV file (one per table) and a SQL DDL statement along the lines of “CREATE TABLE … FROM …” pointing at that file, as sketched below.
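To make that concrete, here is a minimal ingestion sketch. The table name, column list and exact FROM-file clause are my own illustrative assumptions; the only piece confirmed above is the “CREATE TABLE … FROM” fragment itself.

    -- Hypothetical ingestion sketch: schema and FROM-file syntax are
    -- illustrative assumptions, not documented XSPRADA grammar.
    CREATE TABLE sales (
        sale_date   DATE,
        region      VARCHAR(32),
        product_id  INTEGER,
        amount      DECIMAL(12,2)
    ) FROM 'sales.csv';

From that point on, the engine works off the CSV bits on disk directly; there is no separate bulk-load step.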
To support agility, an engine shouldn’t dictate means and methods of usage and must be flexible:
“Given growing numbers of data sources and increasingly sophisticated and mission-critical data analyses, a modern warehouse must instead allow analysts to easily ingest, digest, produce and adapt data at a rapid pace. This requires a database whose physical and logical contents can be in continuous rapid evolution… we take the view that it is much more important to provide agility to analysts than to aspire to an elusive ideal of full integration”
And finally, to support depth, an engine should allow rich analytics, provide an ability to “focus” in and out on the data, and provide a holistic un-segmented view of the entire data set:
“Modern data analyses involve increasingly sophisticated statistical methods… analysts often need to see both the forest and the trees in running these algorithms… The modern data warehouse should serve both as a deep data repository and as a sophisticated algorithmic runtime engine.”
A salient feature of the XSPRADA engine is its ability to handle multiple “types” of BI work at the same time. For example, it’s possible to mix OLAP, data mining, reporting and ad-hoc workloads simultaneously on the same data (and all of it) without resorting to “optimization” tricks for each mode. Similarly, the need for logical database partitioning doesn’t exist in the XSPRADA engine. Duplicating and re-modeling data islands on separate databases (physical or logical) for use by different departments is neither necessary nor recommended.
In an OLAP use case, there is no need to pre-define or load multidimensional cubes. The very act of querying consistently (meaning more than once) along fact and dimension axes causes the engine to realize that this particular section of data is being accessed “multi-dimensionally”. It then starts cubing information internally, aggregating as indicated (and if needed) by incoming queries. In this mode, the engine might decide a columnar storage approach is optimal and restructure the data accordingly. In a data mining use case, the approach is likely different, because incoming queries are “incremental” (often pinpointed) and results are used to generate new queries without pre-determined patterns. The engine will likely start by eliminating vast “wasteland” areas of the data (the forest) from consideration, then proceed to optimize specific islands of interest (the trees) as they become more relevant to the queries.
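For illustration, here is roughly what those two access patterns look like in plain SQL against the hypothetical sales table from the ingestion sketch above (my example schema and predicates, not actual workloads):

    -- OLAP-style pattern: consistent aggregation along fact and
    -- dimension axes; repeated queries of this shape are what prompt
    -- the engine to start cubing and aggregating internally.
    SELECT region, sale_date, SUM(amount) AS total
    FROM sales
    GROUP BY region, sale_date;

    -- Mining-style pattern: a pinpointed, incremental probe whose
    -- result drives the next query; the engine can prune the "forest"
    -- and optimize the islands of interest (the "trees") as they recur.
    SELECT product_id, COUNT(*) AS n
    FROM sales
    WHERE region = 'Southwest' AND amount > 1000
    GROUP BY product_id;

The point is not the SQL itself, but that both shapes can hit the same data at the same time, with the engine adapting its internal representation to each.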
So overall, I think the XSPRADA analytical engine was indeed designed with “MAD-ness” in mind from the get-go, even if the term didn’t exist years ago. It’s the approach and the philosophy that really matter. In that respect, we’re definitely headed for the MAD-house :)