Monday, August 17, 2009

Oh yeah? Well my database is SMALLER than your database!

Contrary to popular belief, smaller is not always better, unless of course you’re talking about analytical database engines. In that respect, it’s hard to find an ADBMS that can fit on removable media like a CD or a USB stick. For example, I don’t think SQL Server, Oracle, DB2, Greenplum, Aster, ParAccel, or the myriad other ADBMS vendors can fit all their bits in a tight spot. Even in the open source realm, I doubt you can wedge InfoBright (MySQL) or IceBreaker (Ingres) onto a stick, much less schlep their bits around as an email attachment.

One exception to this is the V-stick from Vertica. When I first read about it, I thought it was a hoax, but apparently not. It’s pretty cool too, because it includes the OS, web server, GUI and the engine all together on a 16GB thumb drive. How an engine like Vertica, designed around distributed MPP, can possibly operate representatively (with terabyte-size data) on a thumb drive is beyond me, and I’ve never heard of anyone actually using this gizmo, but I’d sure love to get my hands on one and review it if it’s still available.

The other exception, of course, is RDM/x, the XSPRADA database engine. The reason is simple: its total deployment footprint is around 10MB. That includes the 32- and 64-bit ODBC drivers and a couple of DLLs. The engine itself is currently around 6MB, and last I looked the installer clocked in at 16,760KB. This means you can actually deploy RDM/x onto a memory stick if you want to. I tried it, and it works. It’s pretty cool. But after a while I wondered: why would anyone care about this?

The reason is two-fold. First, it’s really easy to try out software that is small and self-contained without expending large amounts of time and resources. Yes, you can download RDM/x from our website, but in many cases (like secured, firewalled enterprises) that’s not an option.

Second, it means we’re a good candidate for embedded applications. Because if I can fit my database engine on a stick (or in an email), I can probably embed it in instruments and devices as well, either as raw C++ code or as libraries.

But for quick POCs, size and simplicity really do matter. Say you’re suddenly tasked with evaluating options for deploying a BI solution inside your company. Suppose you’re a Microsoft shop. Suppose additional capex is not an option, and suppose further you have a week to show results (namely a set of nicely formatted reports, pivot tables or dashboards). Now what? If you have significant in-house experience with SQL Server and the associated SSAS, SSIS, SSRS, and Excel (and assuming you have a clear and deep understanding of the business scope and goals to begin with), you’re probably going to:

(1) Figure out where your source data is coming from (connection strategies)

(2) Model your DW (figure out grain on facts, dimensions, etc.; you’ll need to figure out BIDS and SSAS)

(3) Establish some preliminary ETL process (including incremental loads; you’ll need to figure out SSIS)

(4) Load your warehouse (if you screw it up, you’ll need to drop it and do it over)

(5) Set up an SSAS cube structure (figure out SSAS via SSMS or BIDS, then publish the thing)

(6) Figure out what queries to generate (talk to a DW DBA or learn MDX)

(7) Figure out what BI tool to use (Excel or browser, depending on policies and audience)

(8) Generate the reports (canned or ad-hoc)/dashboards/pivot tables for the POC

Now, if you have no prior experience with the Microsoft BI toolset and you can whip this little project up in a week, guess what: you need to quit your job and start a consulting company, because clearly, as a NYC recruiter once told me, “you’re so money”. But if you’re a normal person with little prior BI experience (and the terms ROLAP, MOLAP, SCD and MDX don’t ring a bell), you’re in a bind.

So another thing you can do is download a tiny analytical database (the XSPRADA RDM/x engine, for example) and throw, say, 100GB of data at it (this is just a small POC, remember?), then plop Excel on top of it and generate some really cool reports or pivot tables to show the boss (in under a week) that it can be done. How hard is that to do? This hard:

Figure out where your source data is coming from.

Yup, that one is pretty universal in the BI world. The difference here is that all your data sources will export as CSV to feed the XSPRADA engine. So at least that’s consistent across all sources (be they structured, semi-structured or not). CSV is the lingua franca of data formats, so your connection "strategy" is this: get everything out as CSV. Plain and simple.
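For the curious, here’s a minimal sketch of what “get everything out as CSV” might look like against a SQL Server source, using Python and pyodbc purely for illustration (the DSN, table name and file path are made up; the same plumbing works from C++, Java or .NET just as well):

import csv
import pyodbc

# Hypothetical system DSN pointing at the source SQL Server instance.
conn = pyodbc.connect("DSN=SourceSqlServer;Trusted_Connection=yes")
cur = conn.cursor()
cur.execute("SELECT * FROM dbo.Sales")  # hypothetical source table

with open(r"c:\exports\sales.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cur.description])  # header row
    for row in cur:
        writer.writerow(row)

conn.close()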

Model your data warehouse.

That’s always a smart thing to do for obvious reasons, although the XSPRADA engine is schema-agnostic and you can feed it normalized or star/snowflake models at will. The secret phrase is: “we don’t care”! So for a quick POC, if you find yourself "forced" to feed RDM/x a 3NF model, no worries.

Establish some preliminary ETL process.

RDM/x runs against initial CSV data islands directly off disk. Point to the CSV files using the XSPRADA SQL extensions for DDL and you’re done. You’ll likely be doing this via script or code (C++, Java or .NET to the ODBC driver directly or via a JDBC-ODBC bridge). For incremental loads, just plop the new CSV files on disk and point RDM/x to them using the INSERT INTO…FROM extension. This process can be done in real time without disruption while other queries are running. No hassle there.
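As a rough sketch of driving this from a script (the DSN and file paths are hypothetical, and the DDL strings just mirror the elided example in the next section, so check the XSPRADA docs for the exact syntax), the whole “ETL process” boils down to a couple of execute calls over ODBC:

import pyodbc

# Hypothetical DSN registered for the RDM/x ODBC driver.
conn = pyodbc.connect("DSN=XSPRADA_RDMX", autocommit=True)
cur = conn.cursor()

# Initial load: point the engine at CSV files already sitting on disk.
# The table definition is elided here, just as in the post's own example.
cur.execute(r'CREATE TABLE ... FROM "c:\data\file1.csv;c:\data\file2.csv"')

# Incremental load: a new CSV file shows up, point RDM/x at it as well.
cur.execute(r'INSERT INTO ... FROM "c:\data\file3.csv"')

conn.close()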

Load your warehouse.

That’s executing a single line of SQL DDL code such as:

CREATE TABLE … FROM "c:\file1.csv;c:\file2.csv;…;c:\file32.csv";

or

INSERT INTO … FROM "c:\file1.csv;c:\file2.csv;…;c:\file32.csv";

Made a mistake, or want to modify the schema and “reload” real quick? Not a problem. Simply re-issue the same DDL command and the table/schema is instantly updated. From a trial-and-error perspective (which, in a POC situation, is fairly typical), that’s a high-five.

Set up an SSAS cube structure (figure out SSAS via SSMS or BIDS)

There is no concept of cubes inside the XSPRADA engine. RDM/x automatically slices and dices based on incoming queries in real time. So if you want to “cube”, just feed the engine slicing OLAP queries and it restructures and aggregates on the fly. There’s no need to pre-define or pre-load cubes, or to deal with hierarchies or materialized views. I blogged about this earlier. RDM/x is a lot like Luke 11:9: ask and you shall receive.
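In practice that means you skip the cube and just send ordinary aggregating SQL. A hypothetical slicing query (the table and column names are made up) fired over the same ODBC connection might look like this:

import pyodbc

conn = pyodbc.connect("DSN=XSPRADA_RDMX")  # same hypothetical DSN as above
cur = conn.cursor()

# A garden-variety slice-and-dice aggregation, no pre-built cube required.
cur.execute("""
    SELECT region, product_line, SUM(revenue) AS total_revenue
    FROM sales
    GROUP BY region, product_line
    ORDER BY total_revenue DESC
""")
for region, product_line, total_revenue in cur.fetchall():
    print(region, product_line, total_revenue)

conn.close()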

Figure out what queries to generate (talk to DW DBA)

That’s where an external tool using MDX (along with an MDX expert!) can come in handy (most people don’t roll their own SQL for OLAP, although it can certainly be done in POC mode). One cool thing about RDM/x is its ability to “withstand” poorly formulated SQL: because queries are optimized against the internal mathematical model, RDM/x is typically more “SQL-forgiving” than most other engines, and even a sloppy query is likely transformed internally to still yield good performance. So even if you’re no SQL guru, the RDM/x engine is still on your side.

Figure out what BI tool to use (Excel, no brainer)

Connect Excel to the XSPRADA engine directly via ODBC or connect Mondrian to RDM/x (via bridge) then connect Excel to Mondrian via the SimbaO2X ODBO/XMLA connector. Alternatively, make the argument that using OSS like Pentaho or Jaspersoft against RDM/x directly is more flexible and accessible (not to mention cheaper!) than messing with Excel. Depending on your user base and corporate standards, that argument may or may not hold water.
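Whichever way you go, it’s worth sanity-checking the DSN that Excel (or Mondrian) will be using before you ever open a workbook; a quick smoke test against the same hypothetical DSN and table does the trick:

import pyodbc

# Same hypothetical RDM/x DSN that Excel would point at via ODBC.
conn = pyodbc.connect("DSN=XSPRADA_RDMX")
print(conn.cursor().execute("SELECT COUNT(*) FROM sales").fetchone()[0])
conn.close()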

Generate the reports/dashboards/KPI/Pivot Table/ad-hoc queries required by management.

Exactly the same way you would using any other tool and/or SQL.

At the end of the day (or in our case, the week), it’s all about “time to results” and “pain to results”. In those situations, smaller and simpler clearly have a significant advantage over the rest. And speaking of smaller, I have run over my allocated space for this posting :)

6 comments:

  1. Jerome,

    Great article. I would like to link to your post from my site if that's OK with you. I also wanted to let you know that the final image for the V-Stick weighed in at 8GB. I may be putting together a 4GB version in order to keep the manufacturing costs down. Let me know if you're interested and I will put one aside for you.

  2. Charles, my pleasure. Would love to get linked from your site and thanks for reading my blog. I would definitely be interested in evaluating the final product when ready so happy to take you up on that as well!
    Best,
    J.

  3. Interesting post, I would also love to try out a V-Stick. As an Infobright user, to comment on that: I think I could fit a usable installation on a stick. Bare-bones Linux is a few hundred megs (let's say 2 gigs to be on the safe side), and a full Infobright 3.1 Enterprise Edition installation (64-bit, minus data of course) is 43 megs on my server today. So given a 32 gig USB stick, some empty space and a conservative compression estimate of 1:5 (way less than I typically see in production), you might even fit your 100GB POC data set preloaded on the stick along with the environment!

    Not too shabby. And of course, the company also ships a preinstalled dbms/BI suite virtual machine download (Jasper or Pentaho, take your pick), so feel free to verify my claim.

    Not affiliated, other than a pretty satisfied customer.

  4. @oa - I assume you are a MySQL shop if you're using InfoBright? I'd like to know more about your application - can you get me your contact info?

  5. Jerome,

    Great article. May I re-post this article on www.bcsolution.com? Naturally, I will give you the props and links back to your site.

  6. Hi Charles, you're more than welcome but FYI I am no longer either with XSPRADA or in the BI/database industry :)
    Best,
    J.
