Wednesday, August 26, 2009

Mole Whackers Need not Apply

Two completely different events caught my attention lately. One of them is a post by Curt Monash called Bottleneck Whack-A-Mole, and the other is the much-publicized alliance for “BI in the Cloud” comprising RightScale, Talend, Jaspersoft and Vertica.

In the post, Curt describes software development (or developing a good software product) as “a process of incremental improvement”. Fair enough. The analogy he draws is between constantly fixing and improving performance bottlenecks and the annoying (if entertaining) arcade game of Whack-A-Mole, where you have to be quick enough to clobber the critters as they randomly pop up from below. He then makes the point that “Improving performance in, for example, a database management system has a lot in common with Whack-A-Mole.” Having spent most of my life designing, developing, improving and testing commercial and enterprise software applications, I have to say I don’t totally agree with his analogy, for several reasons.

First, call me old-fashioned, but I’m an ardent believer that building software is a deterministic endeavor. Whack-A-Mole engineering is not. The age-old controversy about software being more of an art than a science may never be resolved, but at the end of the day, I feel software is (or should be) a scientific, engineering-driven, deterministic endeavor like any other engineering discipline. With Whack-A-Mole engineering, buildings and airplanes fall to the ground. That’s not good. In my experience, those who seek to “romanticize” software engineering are typically averse to proper planning, design and testing, dismissing them as too “dry” or unworthy an endeavor. That’s nonsense.

Second, there is a distinct difference between the way you develop “regular” software and the way you develop “system software”, and I’ve learned this from sitting in the front row for the past several years at XSPRADA, watching database software being built from the ground up. It’s a little bit like the difference between building a tree house and a major commercial skyscraper. And I believe that playing Whack-A-Mole while trying to bring up a building is a scary proposition at best (especially for future tenants). And yet, the example Curt provides involves Oracle’s Exadata, of all products! He states: “When I spoke to Oracle’s development managers last fall, they didn’t really know how many development iterations would be needed to get the product truly unclogged.” That statement is mind-boggling to me.

For one thing, it suggests that Exadata is “clogged” (ouch), but worse, that their engineering people have no clue how they might eventually (if ever) snake the blockages out of it! So basically, it’s a trial-and-error approach to building a database. Notwithstanding their “professed optimism” that it wouldn’t take “many iterations at all” to finally figure things out, it certainly doesn’t give me (or any reasonable person) a warm feeling about a multi-million-dollar product claiming to be the world's ultimate analytical machine.

I think there’s a lot to be said for sound engineering practices, proper planning and testing, setting expectations and deterministic engineering management in the world of system software. That Oracle (or Netezza for that matter, also referenced in the post) might just be going along whacking moles instead is a scary proposition indeed. Even if this little game is limited to “performance engineering” as Curt suggests (as if there were a more important endeavor in an ADBMS), that’s a serious allegation in my book. I say leave the arcade games to the kids, and let the real engineers design and implement database and system software, please. There’s no room for amateurs in this game.

On to my next point of interest: the new Gang of Four in the Cloud (with apologies to design pattern aficionados) comprising RightScale, Talend, Vertica and Jaspersoft has recently promoted and demonstrated a “bundled” on-demand package for the cloud. I attended their webcast yesterday and was impressed, but with reservations.

Each of these vendors is impressive on its own, no doubt about it. But it seems to me the bundled proposition might be confusing at best to the unwary customer. This new offering is billed by the marketing folks as “Instant BI, just add water”, which drives me nuts. Look, it might be simple in theory, and it might take only a few minutes to set up the stack on your own (as Yves de Montcheuil from Talend claims), but it’s still a long way from there to actually accomplishing anything serious in a few simple clicks. Sorry, not going to happen anytime soon.

You still have to work your way through provisioning and instance management (RightScale), data integration and loading (Talend), feeding and configuring the database (Vertica), and setting up the reports/analytics you might need (Jaspersoft). All of which can be accomplished just as easily (or not) internally, by the way. It’s true you’d still have to purchase or license Vertica for in-house use, which may or may not match the SaaS pricing, I’m not sure (and either way, Vertica has a SaaS offering as well), but the other components are open source, so I’m not sure I see the big advantage there.

An interesting thing I noticed as well is that some people didn’t seem to understand what RightScale’s role was in the whole offering. This tells me they don’t really grasp the intricacies of “the cloud”, because instance and infrastructure management for enterprises in the cloud is not trivial, and you do need something like RightScale to grease the wheels (it’s an abstraction layer, really). But I think many people assume moving to the cloud is “magic” and makes all these issues disappear. If that were the case, you wouldn’t need RightScale in the mix. Beware of under-managed expectations, I'd say.

Additionally, the pricing model (which is supposed to be so much simpler in the cloud) is confusing at best, as each vendor has its own menu. The best answer I can remember getting was “starting at $1,700 per month” – I’m not sure what to make of that. So I think from an engineering/technical standpoint, this endeavor is noble, but from a “let’s make things simpler and transparent for the user” perspective, there’s still a lot of work to be done. In other words, it's a nice play for the vendors holding hands, but I'm not sure how beneficial it might be to the average enterprise user.

As usual, caveat emptor – Beware promises of a holy grail in BI as there is no such thing. It’s all about work. Hard, detailed and careful work with proper planning and budgeting. In that respect, setting up successful BI solutions is a lot like running and implementing software projects. There are no shortcuts, and it’s not a job for mole whackers.

Monday, August 17, 2009

Oh yeah? Well my database is SMALLER than your database!

Contrary to popular edict, smaller is not always better, unless of course you’re talking about analytical database engines. In that respect, it’s hard to find an ADBMS that can fit on physical media like a CD or a USB stick. For example, I don’t think SQL Server, Oracle, DB2, Greenplum, Aster, ParAccel, or the myriad of other ADBMS vendors can fit all their bits in a tight spot. Even in the open source realm, I doubt you can wedge InfoBright (MySQL) or IceBreaker (Ingres) onto a stick, much less schlep their bits around as an email attachment.

One exception to this is the V-stick from Vertica. When I first read about this, I thought it was a hoax, but apparently not. It’s pretty cool too, because it includes the O/S, web server, GUI and the engine all together on a 16GB thumb drive. How an engine like Vertica, designed around distributed MPP, can possibly operate representatively (using terabyte-size data) on a thumb drive is beyond me, and I’ve never heard of anyone actually using this gizmo, but I’d sure love to get my hands on one and review it if it’s still available.

The other exception, of course, is RDM/x, the XSPRADA database engine. The reason is simple: its total deployment footprint is around 10MB. That includes the 32/64-bit ODBC drivers and a couple of DLLs. The engine itself is currently around 6MB. Last I looked, the installer clocked in at 16,760KB. This means you can actually deploy RDM/x onto a memory stick if you want to. I tried it, it works. It’s pretty cool. But after a while I wondered, why would anyone care about this?

The reason is two-fold. First, it’s really easy to try out software that is small and self-contained without expending large amounts of time and resources. Yes, you can download RDM/x from our website, but in many cases (like secured, firewalled enterprises), that’s not an option.

Second, it means we’re a good candidate for embedded applications. Because if I can fit my database engine on a stick (or in an email), I can probably embed it in instruments and devices as well, either as raw C++ code or as libraries.

But for quick POCs, size and simplicity really do matter. Say you’re suddenly tasked with evaluating options for deploying a BI solution inside your company. Suppose you’re a Microsoft shop. Suppose additional capex is not an option, and suppose further you have a week to show results (namely a set of nicely formatted reports, pivot tables or dashboards). Now what? If you have significant in-house experience with SQL Server and the associated SSAS, SSIS, SSRS, and Excel (and assuming you have a clear and deep understanding of the business scope and goals to begin with) you’re probably going to:

(1) Figure out where your source data is coming from (connection strategies)

(2) Model your DW (figure out grain on facts, dimensions etc, need to figure out BIDS and SSAS)

(3) Establish some preliminary ETL process (including incremental loads, need to figure out SSIS)

(4) Load your warehouse (if you screw it up, you’ll need to drop it and do it over)

(5) Set up an SSAS cube structure (figure out SSAS via SSMS or BIDS, then publish the thing)

(6) Figure out what queries to generate (talk to DW DBA or learn MDX)

(7) Figure out what BI tool to use (Excel or browser, depends on policies and audience)

(8) Generate the reports (canned or ad-hoc)/dashboards/pivot tables for the POC

Now, if you have no prior experience with the Microsoft BI toolset, and you can whip this little project up in a week, guess what, you need to quit your job and start a consulting company because clearly, as a NYC recruiter once told me “you’re so money”. But if you’re a normal person with little prior BI experience (and the terms ROLAP, MOLAP, SCD and MDX don’t ring a bell), you’re in a bind.

So another thing you can do is download a tiny analytical database (say, the XSPRADA RDM/x engine, for example) and throw, say, 100GB of data at it (this is just a small POC remember?), then plop Excel on top of it and generate some really cool reports or pivot tables to show the boss (in under a week) it can be done. How hard is that to do? This hard:

Figure out where your source data is coming from.

Yup, that one is pretty universal in the BI world. The difference here is that all your data sources will export to CSV to feed the XSPRADA engine. So at least that’s consistent across all sources (be they structured, semi-structured or not). CSV is the lingua franca of data formats, so your connection "strategy" is this: get everything out as CSV. Plain and simple.

Model your data warehouse.

That’s always a smart thing to do for obvious reasons, although the XSPRADA engine is schema-agnostic and you can feed it normalized or star/snowflake models at will. The secret phrase is: “we don’t care”! So for a quick POC, if you find yourself "forced" to feed RDM/x a 3NF model, no worries.

Establish some preliminary ETL process.

RDM/x runs against initial CSV data islands directly off disk. Point to the CSV files using the XSPRADA SQL extensions for DDL and you’re done. You’ll likely be doing this via script or code (C++, Java or .NET to the ODBC driver directly or via a JDBC-ODBC bridge). For incremental loads, just plop the new CSV files on disk and point RDM/x to them using the INSERT INTO…FROM extension. This process can be done in real time without disruption while other queries are running. No hassle there.
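To make that concrete, here is roughly what such a script might issue for an incremental load. This is a sketch only: the table name and file path are hypothetical, and I’m assuming the ellipsis in the INSERT INTO…FROM form shown in the next step stands for the target table name.

-- hypothetical incremental load: drop a new CSV extract on disk and append it,
-- in real time, while other queries keep running
INSERT INTO sales FROM "c:\extracts\sales_week2.csv";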

Load your warehouse.

That’s executing a single line of SQL DDL code, such as:

CREATE TABLE … FROM "c:\file1.csv;c:\file2.csv;…;c:\file32.csv";

or

INSERT INTO … FROM "c:\file1.csv;c:\file2.csv;…;c:\file32.csv";

Made a mistake, or want to modify the schema and “reload” real quick? Not a problem. Simply re-issue the same DDL command and the table/schema is instantly updated. From a trial-and-error perspective (which, in a POC situation, is fairly typical), that’s a high-five.

Set up an SSAS cube structure (figure out SSAS via SSMS or BIDS)

There is no concept of cubes inside the XSPRADA engine. RDM/x automatically slices and dices based on incoming queries in real time. So if you want to “cube”, just feed the engine slicing OLAP queries. RDM/x automatically restructures and aggregates in real time. There is no need to pre-define or pre-load cubes, or to deal with hierarchies and materialized views. I blogged about this earlier. RDM/x is a lot like Luke 11:9 – Ask and you shall receive.
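In other words, an ordinary slice-and-aggregate SQL query is all it takes. Here’s the kind of query I mean (the table and column names are made up for illustration):

-- hypothetical example: total sales by region and quarter for one product line
SELECT region, fiscal_quarter, SUM(sale_amount) AS total_sales
FROM sales
WHERE product_line = 'Widgets'
GROUP BY region, fiscal_quarter
ORDER BY region, fiscal_quarter;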

Figure out what queries to generate (talk to DW DBA)

That’s where an external tool using MDX (along with an MDX expert!) can come in handy (most people don’t roll their own SQL for OLAP, although it can certainly be done in POC mode). One cool thing about RDM/x is its ability to “withstand” poorly formulated SQL, because queries are optimized against the internal mathematical model. RDM/x is typically more “SQL-forgiving” than most other engines, and a poorly formulated query is likely to be transformed internally and still yield optimal performance. So even if you’re no SQL guru, the RDM/x engine is still on your side.
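As a purely hypothetical illustration (table names invented for the example), the first query below drags in a correlated subquery where a plain join would do; a query optimized against a mathematical model should, in principle, treat the two formulations as equivalent:

-- clumsy formulation: a correlated subquery evaluated per customer
SELECT c.customer_id, c.customer_name,
       (SELECT SUM(o.amount) FROM orders o WHERE o.customer_id = c.customer_id) AS total_spent
FROM customers c;

-- cleaner, equivalent formulation
SELECT c.customer_id, c.customer_name, SUM(o.amount) AS total_spent
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.customer_name;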

Figure out what BI tool to use (Excel, no brainer)

Connect Excel to the XSPRADA engine directly via ODBC, or connect Mondrian to RDM/x (via the bridge) and then connect Excel to Mondrian via the SimbaO2X ODBO/XMLA connector. Alternatively, make the argument that using OSS like Pentaho or Jaspersoft against RDM/x directly is more flexible and accessible (not to mention cheaper!) than messing with Excel. Depending on your user base and corporate standards, that argument may or may not hold water.

Generate the reports/dashboards/KPI/Pivot Table/ad-hoc queries required by management.

Exactly the same way you would using any other tool and/or SQL.

At the end of the day (or in our case, the week), it’s all about “time to results” and “pain to results”. In those types of situations, smaller and simpler clearly have a significant advantage over the rest. And speaking of smaller, I have run over my allocated space for this posting :)

Friday, August 14, 2009

Bits & Pieces Summer Posting

I thought I would do a “freeform post” today to celebrate the lazy 2009 Summer and the fact that most of the planet (but not the BI world for some reason) seems to be on vacation at the moment.

As you know, I’ve been following ParAccel with interest for a short while now, wondering how they would deploy those $22M of Sales & Marketing greenbacks they just scored. I read this interesting article about them recently, and it looks like “customer acquisition” might be part of their strategy. Good move. Unfortunately, I couldn’t determine what the “other database products” were, or who the other “columnar-MPP database” vendor might be. I can only surmise it might be Vertica. If anyone knows who else OfficeMax looked at, please share the wealth.

For all my whining about missing TDWI in my own backyard (San Diego), it seems I didn’t miss much after all, according to Merv Adrian, who posted about the conference shortly thereafter. From the looks of it, the highlight might have been a sunset ride on the Lyzasoft yacht in San Diego Bay. Talk about good PR!

Andy Hayler wrote a good piece about the demise of Dataupia called No Data Utopia. In it he refers to the “awkwardly named” Dataupia. Yes, Dataupia is a weird name. But so are several others, such as Kickfire, Tokutek, or Calpont, for example. And although XSPRADA is admittedly rather funky, at least it’s an acronym (Extended Set Processing for Rapid Algebraic Data Access). To me, the money quote in there is “you need to have a clearly differentiated position in such a crowded market”. It’s pretty much what I’ve been saying for a while (and common sense if you ask me). At the moment, I don’t see any of the players in this market besides XSPRADA with a “clearly differentiated position” on anything. At the end of the day, it’s still all about the prisons that are columns and rows.

Netezza announced the long-awaited (not) TwinFin product line, prompting a flurry of nasty posts from competitors like Kognitio, and an interesting Monash post about data warehouse pricing. Much like traditional software license pricing, ADBMS prices seem to be racing to the bottom (stay tuned for $19.95 per terabyte while supplies last!). And as someone recently said, the bottom is open source. Should be interesting to see what happens. Personally, I think a lot of this stuff is going to become commoditized. The play looks a lot like printers or razors, where the actual hardware is sold dirt cheap, but the ink or blades cost a fortune to replenish. Caveat emptor.

A relatively new ADBMS vendor called XtremeData has emerged. It looks like they’re based in the US (Schaumburg, IL to be precise), but the actual brains of the operation are in India somewhere. They’ve certainly been vocal on several BI blogs (namely, DBMS2). To me the funniest thing is their “ChalkTalks with Faisal” screencasts. Faisal is apparently their India-based CTO. The entire presentation is like a NetMeeting whiteboard session where Faisal keeps talking while drawing stuff on a whiteboard in a Flintstonish manner. All that’s missing are the stick figures. It wouldn’t be so bad if someone had noticed how horrible the audio is, thanks to the incessant noise of the marker scratching on the board. It sounds exactly like squealing puppies in the background. It’s totally distracting, albeit very amusing.

SQL Server 2008 R2 is out, but without the Gemini (or Madison, née Datallegro) pieces, I guess. That’s Office 2010, parts of which are going into the cloud if I understand correctly. This whole Madison/Gemini “revolution” in BI is starting to get a little, shall we say, boring for lack of materialization. Not sure what’s going on with Microsoft lately, but I’m getting more and more concerned. Even .NET seems to be taking a backseat to Java/J2EE. It wasn’t like that a year ago. I am sensing a scary downward spiral. One thing’s for sure, I have yet to see anything remotely connected to .NET or C# in the BI programming world, save for those .NET C# extensions Aster provided in their engine recently for doing MapReduce (and of course the PushBI initiative, but even there…). To me it seems the entire ADBMS/BI code stack is Java on Linux (SuSE, Red Hat and CentOS). I’m talking about the supporting/ecosystem tools of course, not the underlying engines (those are mostly C/C++ I believe).

I could be wrong, but it’s not looking good for Microsoft. My prediction: this behemoth will eventually split into a myriad of smaller entities, some of which will survive, some of which won’t. The sum of the parts may be worth more than the whole.

Ingres has actually managed to generate some buzz in the US lately by announcing it is teaming up with a company called VectorWise (well, it’s a research outfit actually) to develop a “project”. No customers yet, but a lot of very fancy PhD types in Amsterdam (they did MonetDB/X100), and Ingres is rumored to have a first option to buy VectorWise should the venture be successful. Time will tell. It always amazes me that Ingres isn’t more of a household name in the US. In Europe, they’re very popular (at least that’s what my French Ingres experts tell me).

Finally, a recent article claims that BI is used by only 8% of employees in the enterprise, and that’s, of course, only counting the shops that have implemented it to begin with. Actually, I find that number high and would have guessed more like 5%. This is not surprising in light of the economic woes and scandals we’ve witnessed recently. In most of these cases, it wasn’t a lack of technology or resources at play but rather a conscious choice to ignore reality. The tools are there. The desire is not.

In my opinion, this disconnect is very apparent in numerous retail outfits: places like Home Depot, Whole Foods, AT&T or Circuit City (RIP), for example, which clearly have the resources and tools to perform and exploit top-notch business intelligence but still consistently manage to provide mediocre service or products at best.

At Home Depot, they can’t (or won't) keep track of inventory correctly. I was once told it was because there was too much theft (both internal and external) to update the databases frequently enough! Consequently, they can’t tell you if they have some items in the store or not. Not all items, just some of them. And when they can, they’re unable to locate them physically inside the store (as in what aisle and section). That’s shocking to me given what we constantly hear about RFID and data warehousing investments at these big-box places. But Lowe's does a much better job of this, so clearly it’s not a technology issue.

At Whole Foods (at least the one in Irvine where I live), the quality and quantity of their product is inconsistent at best. Some days the self-serve fish is fresh, and some days it is not (and has been sitting out too long, looking like a nice fat food-poisoning lawsuit waiting to happen). Like the proverbial “box of chocolates”, you never know what you’re going to get. Similarly, their checkouts are never balanced: you’ll see huge lines at several of them while employees sit idle at empty others. Every time I go there, some happy-go-lucky line manager with a bright idea of the week makes it harder for me to shop and enjoy it there, which is why I never set foot in the place any more (instead, I go to an even more expensive supermarket). Apparently no one is keeping track of this, least of all the department managers, who seem to frequently act on impulse in trial-and-error fashion. Yet surely Whole Foods has an uber-BI stack running somewhere in Austin giving them a “big picture” on a store-by-store basis, right? You’d think someone would be paying attention (or maybe even using it)? But clearly they are not, or someone would be fixing these problems (or at least genuinely addressing customer complaints, which they won’t because it's "inconvenient" for them, as they like to put it).

I don’t think it’s so much about the difficulty of implementing and leveraging BI as the article suggests. I think it’s about genuine laziness on the part of upper management. Because, at the end of the day, if you’re the CEO of a place like this, you need to get your highly compensated butt out on the floor to truly see, taste and smell what’s going on. You need to talk to your employees, your customers, and get in their shoes (incognito if possible). I’m always shocked to hear C-level people lamenting the fact that their CRM system isn’t giving them enough visibility into their customers. Or better yet, they need to “understand the customer” better. Who’s kidding who? Fact is they simply don’t give a hoot most of the time. And if they’re too lazy or too self-important to do that, they’re not likely to pay much attention to BI tools and warehouses either no matter how fancy or ubiquitous the software might be. That's really just the nature of what "service" has become in the US lately. In order to improve BI usage, we will have to improve the quality of Management first and put real folks back in charge.