Monday, December 7, 2009

The Wailing Wall of Open Source BI

Henry David Thoreau once wrote: "The mass of men lead lives of quiet desperation". Much the same can be said of the multitude of users struggling with open source reporting and analysis tools like Mondrian or Jaspersoft. The difference, of course, is that those folks happen to be pretty vocal. And nowhere more so than on those vendors' own "support" forums.

I double-quote the term purposely. Because what you witness on these forums falls far (very far) short of what I consider to be minimally acceptable customer support levels. Now, it's a fact that many people find solace in these communities - after all, given the sheer number of questions posed there, some are bound to get answered quickly and (hopefully) correctly. And it's a fact that many people achieve success (or some level of it) with open source BI solutions. At what cost, we can only surmise, but clearly, resilient, persistent and courageous people are getting some work done on these platforms.

But more often than not, the levels of post abandonment (ignored questions) and the arrogant responses are high enough to be shocking. I am stunned at the number of times where posting users are treated like it's their fault. The implication is that they are stupid or negligent. As a matter of fact, if you spend the time analyzing the language semantics, what you see are fearful, timid, often desperate users mustering the courage to post questions in the hope that somehow, someday, they will be answered by the Grand Wizards of [fill in the OSS BI vendor name]. This is not unlike the Wailing Wall in Jerusalem, where masses of people stick written prayers between the stones.



I find that these forum "masters" are too often condescending and consequently, they generate a "master-pupil" environment I find somewhat repulsive. It reminds me a lot of the old school systems in Europe, where Professors would stand on an elevated stage above the class to seal their social superiority. There's nothing wrong with respect but in my opinion, it has to be earned. When you berate, ignore or insult those who seek to learn from you, you are far from deserving respect, Grasshopper.

I do realize that these vendors provide some level of "enterprise" support for those wishing to license the software but given what you see in the public forums, it's hard to imagine it can be any better on the paid side because, quite honestly, Open Source companies simply don't "get" service. They are too often blinded by the magnificence of their code or product and forget that without users and healthy bi-directional communities, there wouldn't be a company, Open Source or not.

My experience navigating through several Open Source forums was similar. At the time I figured, it's just me. I'm an impatient SOB and it's my problem. But when you start analyzing these forums (and I spent hours doing so), you quickly realize the disease is widespread. All of a sudden, it's not just you anymore - almost everyone has the darn infection! I often wonder why people put up with this nonsense. Deep inside, I know the answer of course: because they don't have a choice. Either they can't find an easier product to use, or they are compelled to use it for organizational reasons (as in, my boss told me to check this out <sigh>). I do believe the advent of SaaS BI analytical tools spells the end of an era for the multitudes who must suffer through Open Source to get even the simplest of reports and dashboards out in less than three months!

But don't take my word for it. After all, I work on the SaaS side of BI and am likely biased. Even worse, I actually blogged a while ago about the merits of using Mondrian for OLAP. Indeed, after many months of putzing with it, I was able to accomplish something useful but I have twenty years of experience in IT. Read that again: two decades of messing around with difficult stuff and figuring out how to make it work. So sure, for someone like me (and if forced into it) you bet open source can work. But what about the poor guys who don't have that kind of experience? What about the people tasked with just getting reports or analytics done? I feel bad for those folks who invest time and considerable effort in open source solutions only to be left at the altar of success. And if you don't believe me when I say they're out there en masse, then feast your eyes on the following selected quotes and links from Pentaho and Jaspersoft OSS BI forums (they are reproduced verbatim, grammar, spelling and emotional content intact).


"It seems I first have to dive in pentaho source code to figure how the printing function in pentaho is working"

"I am new user to Pentaho. I downloaded mondrian-3.1.2.13008. Now please advise me the steps I need to follow to configure modrian in my system."

"Although I'm no expert, I think Spreadsheet services has been discontinued, well I've never seen a Pentaho Employee mention otherwise"

"I've been trying to use IIF all morning but I'm getting nowhere."

"I have two questions, which I barely dare to ask, but I am sitting here and can't get over my problem:"

"I tryed all combinations. The only combination that works is fact table,dimension tables and aggregates tables in the same Mysql schema, and a datasource that point to that schema."

"Hi everybody. Though I've run the demo applications and read almost every thread about this, still I can't get an exact understanding of the relation existing between Pentaho Server and Mondrian."

"I think that the preconfigured examples of JSPs, catalogs etc. are confusing at best - bad practice at worst."

"I am fairly new to Mondrian, and I am stuck at ths point"

"This might be very trivial for some of you but I'm really having a hard time figuring this out. So any help is greatly appreciated. Thanks so much."

"I've goggle this one down to my knuckles. I suspect my chopsticks are tangled on the Mondrian/Pentaho side? I would appreciate any thoughts the community might have."

"Nobody can help? If so, does anyone know where else I can go to find the answer ... I've pretty much exhausted Google's list of MDX links, and am getting a little desperate now. Don't want to tell my client "it can't be done" without good cause..."

"Desperate for assistance. Last big issue to overcome and deliver my pentaho bi solution."

"Need some Help again.. I told my Boss that I will get the prototype ready by tommorow. Any help would be appreciated"

"I'm quite desperate.... ... please... help...."

"Helloo...???? Anyone there?? My question may be silly, still any of your ideas would be much helpfull..Am really desperate to find the logic.. Am not good in java and am in verge of my project and solving this issue resolves half of my task...Please respond..!!!"

"We are pretty desperate by now. We tried to manually re-insert the UNC server part into the file name, but this is getting dirtier and dirtier as we build new jobs."

"Sorry bothering you again, but I'm in a desperate situationn. We are trying to run the new Pentaho GA, but we have a lot of problems when we try to run this with Oracle."

"I'm really fed up now trying to make charts working over couple of days. I have created a database connection and given the field names correctly. Can someone please help me? I'm really desperate to get this working.."

"I know this has been logged by both myself and others before, however I am desperate for an answer here."

"I  checkd out all the sample dashboards of cdf.. but am clueless to create a dashboard of my own.. is there a dashboard builder for cdf??"

"I dont consider myself a fool when it comes to working with technology, and I have found myself frustrated and confused beyond belief at what should be a seemingly simple task..."

"And...regarding the timestamp....how in the heck to I tell that to not show up if I have no idea where it's coming from? Sorry if I seem frustrated, but I am."

"I've just recently begun working with Pentaho, so forgive me if this is a known issue or feature...I did a search on these forums and couldn't find an answer...a frustrated contractor in DC."

"I got frustrated that i cannot get a decent quality, well formatted pdf output."

"However, I do have to say I really had to grind through the first 10 days or so in order to get this thing down to a point of not hitting hurdles every step of the way. In fact, I've got a colleague who is really pretty frustrated and trying to jump ship"

"A few weeks ago I was very frustrated that I was completely lost withing thePentaho suite. I cannot afford training (I'm broke, plus I live too far away from the traing venues) so my only means at this stage is all of the above"

"I’m not trying to start a ruckus or anything I just wanted to state when I am coming from and what I would like to see from Pentaho. Judging from the many unanswered post on this subject I think there are many more like me out there."

"Sorry about the begginer question. But i'm getting frustrated"

"I am getting increasingly more frustrated it is quite hard to get started. The documentation seems only to touch the very basics. "

"I am a bit frustrated right now as I haven't been able to solve any of my problems."

"...first, *I* don't think i'm an idiot. what frustrated me was the lack of TASK-ORIENTED information i found. i had a fairly simple, fairly small TASK i needed to GET DONE... i dont care about kettle or apis or what great problems Kettle could solve... I had a TASK and an IRATE BOSS."

"I'm pretty frustrated at this point: I mean seriously how much more simple could a transformation be!?!?!?!?!?!"

"If I am posting about that, it is because I ve been frustrated much!"

"I was very excited to find Pentaho and eagerly wanted to create my first Dashboard. Now I’m just completely frustrated. And as I search these posts, I see thousands of people are looking for the same basic answer with little to no response from Pentaho: How do you create/deploy a non-sample dashboard; i.e. how do you actually use this thing?"

"Maybe a non-IT guy doesn't want to deal with SQL queries at all..."

Now let's look at a couple of "answers" to many of these posts (unfortunately, questions outnumber answers significantly):

"You use the search button. Version 2/3/3.5 brings in lots of changes to the way Pentaho does things, please mention which version of the BI server you are running.
Join the Unofficial Pentaho IRC channel on freenode. Server: chat.freenode.net Channel: ##pentaho - Please try and make an effort and search the wiki and forums before posting!"

"If all that is true then it is clearly a bug in mondrian and you should report it in jira."
Here, the moderator seems to question whatever the premise of the question was (as if people bothered posting lies...)

"The procedure is exactly as I have told you. You must have done something wrong somewhere. Just check through carefully;"
Typical "you must have screwed up something" response. How encouraging and a little condescending if you ask me.

"If not, please post the error. We're not clairvoyant."

"There might be a simpler way... search the doc."

In this post, a user actually bothers pasting a novel sized exception asking for help. Specifically, he asks: "Any help or even pointers to documentation more than Spring/Acegi provides would be greatly appreciated".

What does the Jasper guy reply with?

"JAAS authentication can certainly be done. Have a look at the Acegi documentation...The forum is also a great resource." Is he kidding? Did he even read the post?

Here's another classic: a user is complaining about the advanced search in the Jasper forums. The moderator's suggestion is terse: "it is what it is". Then he suggests using Google! Are these people serious?

How about this comment: "I am desperate to use JasperReports , however I am having hard time understainding it".  Or this one: "Still can't compile - getting desparate...I am getting nowhere trying to get my reports to compile."

What can we conclude from this mass of desperation, frustration and confusion? Why would these vendors create and maintain such an environment to begin with? Because quite honestly, if I were looking for a BI platform for a critical project or client, and came upon these forums in the process, I'd run so fast the other way it's not even funny.

As I mentioned above, if you have some serious software development and IT background, the luxury of time (and the right hardware and software), and a propensity for hacking complex nebulous systems, or the courage to read a 600-page book on working with Pentaho, for example, these open source solutions might just do the trick. If your organization or client is insisting on going the Open Source route, and you have no say in the tooling selection, then clearly you have to deal with the pain, hope for the best, and pray for a check (or a job) at the end of the ordeal.

But if you're just a normal analyst or consultant tasked with a BI implementation on a tight deadline with an impatient boss breathing down your neck, there is an alternative. And if you're just a guy (or gal) who needs to get the job done now (as in, this week) without the drama and headaches of open source theatrics, you'll likely be looking at a SaaS BI solution.  Now, SaaS BI may not turn out to be your bag, but you owe it to yourself to at least give it a whirl and here's why (putting my pitching hat on):

It takes very little time, effort and money to try it out. In many cases, it's free. There's nothing to install or configure. It takes hours or days, not weeks or months, to build and showcase prototypes. Most of the time (as in 80%) the functionality is sufficient to do the job. And the feature set grows dramatically with each release. Oh, and there's updated documentation. But here's the kicker: you'll actually get just-in-time support from people who actually care about your success. As a matter of fact, they're invested in it. They will welcome your questions and use them to improve the offering. They won't ignore you and they won't treat you like an inferior species.

I know this is sounding a lot like "Miracle on 34th Street". Must be the season. Truth be told, SaaS BI is not for everyone, and your mileage may vary. But it's worth an honest shot. Because the alternative is an endless Wailing Wall of pain, misery and isolation that no one in BI should have to put up with anymore.




Sunday, November 29, 2009

Of CRM Golden Calves and Ten Commandments of SaaS Landlordship.




I have a pet peeve concerning two things I’ve been meaning to write about lately, namely CRM and multi-tenancy. I recently read an excellent article referencing both concepts and thought it might be a good stepping stone to a quick blog post.

In his article Don’t Get Conned – The many disguises worn by software-as-a-service, Matt Wallach discusses CRM software and the concept of SaaS multi-tenancy. Matt’s business is life sciences, and he discusses the state of CRM software for his industry. His argument is along the lines of “be wary of on-premise wolves in SaaS sheep’s clothing”. The wolves, in this case, are on-premise CRM vendors who simply throw servers over a wall and call it a SaaS day. The sheep (aka the good guys) are those genuine SaaS vendors riding a multi-tenant architecture. He then proceeds to define what multi-tenancy means using the well-known “neighborhood versus apartment” analogy. Sounds innocuous enough (and the piece is well-written) but here’s the problem I have.

I am going to get a lot of heat for this statement, but truth be told, CRM is a hoax. There is no such thing as CRM.  It’s all smoke and mirrors, and you cannot nurture customer relationships (or any other form of relationship) using a bunch of bits. It’s a cop-out. I know of CRM market segments and sizes. I am well aware of CRM dissemination and popularity in corporate America. I don’t question its omnipresence in virtually all businesses in one form or another. My claim, however, is that it does not and cannot work as billed.  In numerous cases, it is simply a waste of “feel-good” money. Here’s why.

The companies and industries making the most use of CRM packages are those with the worst customer service. This in itself should be eye-opening. Airlines are perfect examples. Can anyone think of an industry with worse customer service? I can’t. Yet they spend gazillions on CRM and boast about it frequently. How about retail? When’s the last time these guys used CRM to do anything besides manage their “rewards programs” (who are they rewarding please tell me?) or push advertising gimmicks? How about Telecoms? How many people are happy with their cell carriers in the US? When was the last time your interaction with a carrier’s customer service department was either useful, efficient, or pleasant? How about the automobile industry or even car rental outfits? When was the last time you were wowed by one of those big corporate CRM consumers?

Now, I’m not saying all companies who use CRM manage customers poorly. Some do get lucky. And others use it as a tool, not an end in itself, which is perfectly fine. What I am pointing out is that customer service quality seems inversely proportional to the resources spent on CRM at the corporate level. And that’s ironic. The problem is in the C-suite. These folks see CRM as a panacea but inherently, deep down, these people do not have customer service “genes”. They have Board of Directors genes, profit margin genes, or MBA ones, but they don’t know the first thing about worshipping customers – it’s a form of autism on their part.

Because in my experience, customer service is not something you can learn on the fly (although I suspect it is taught in business schools).  And it’s not something you can acquire via software. Companies, like people, are either born with it or not. It’s a nature not a nurture trait. And no amount of software or resources will change the behavior of a company whose culture isn’t obsessively centered on the customer. It sounds obvious, and everyone talks about it, but most of the time, it’s nothing more than lip service.  It’s a sham I tell you.

Luckily, the solution is simple: throw the software out the window. Really, honestly, you don’t need it. You can’t even prove you need it anyway. Most of the time, you don’t know how to interpret or filter the results. And worse yet, you have too much information. You get overloaded with it in a drunken stupor and it ends up clouding your judgment. And shortly thereafter, you throw common sense out the window. Malcolm Gladwell makes excellent anecdotal arguments against information overload in Blink, one of his classics. It’s a fascinating read (specifically I refer to section 4 of chapter 4). CRM and software overload affect corporate management the same way the overload Gladwell describes affects ER doctors’ decision making. Too much information leads to disastrous results.

I’m not suggesting the demise of customer or performance management software as a whole obviously. Keep the databases, the warehouses and the BI in-house (because analysis and insight are still obviously crucial) but take your big honking CRM software and pull the plug. You will likely learn more (and spend a whole lot less) by talking to your five top customers. If you want to stay in control and get 360-degree customer views, sign up for something like TeamSupport – it’s all you need. Most importantly, do something truly revolutionary: get out of the building and walk a mile in your customers’ shoes.

Order your own product, fly your own airline (in coach), rent your own cars, dial your customer service numbers repeatedly – from a roaming zone, for example – visit your retail stores (incognito) and talk to your customers. Order your own food, visit your own bathrooms, wear your own clothes. Engage customers one on one, on the streets, in the stores, at their home, face to face, and listen like you mean it. Listen like your life (and job) depended on it – because they do. You cannot manage customer relationships from a spreadsheet or a SQL query any more than you can steer a marriage successfully via the web (yes I know, there’s probably an iPhone app for that). It’s that simple. Your fancy CRM system is the blue pill. Take the red one!

So now, having sufficiently ticked off everyone in the CRM industry, let me tackle the SaaS “badge of honor” also known as multi-tenancy. Here’s a reality check: no one really knows what multi-tenancy means. How do I know this? Because if you ask three people what it means you will get five different answers.  And when you search for it online, you get multiple descriptions, some more convincing than others. So unless I missed something, there is no official definition, only qualified opinions.

To me, the gist of it involves partitioning software application space very quickly and cheaply to accommodate exponential mass market adoption cycles (there’s a mouthful). Exactly how you achieve this, from a technical perspective, is open to interpretation. Some people partition inside the database and share schemas (like Salesforce.com, where every tenant shares database tables), others partition at the application layer where processes may serve multiple requests, some consider virtualization to be a form of multi-tenancy, and yet others implement a hybrid approach.
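For the technically curious, here is roughly what the "share the tables" flavor can look like. This is only a minimal sketch in Python using the standard sqlite3 module, and every name in it (the accounts table, the tenant_id column, the helper functions, the tenants "acme" and "globex") is made up for illustration. It is not how Salesforce.com or any other vendor actually does it; it just shows the general idea of one set of tables where every row carries a tenant key and every query is scoped by that key.

```python
import sqlite3


def open_shared_store():
    """Create one in-memory database whose tables are shared by all tenants."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table: every row is tagged with the tenant that owns it.
    conn.execute(
        "CREATE TABLE accounts ("
        " tenant_id TEXT NOT NULL,"
        " name TEXT NOT NULL,"
        " revenue REAL NOT NULL)"
    )
    # Index on the tenant key so per-tenant lookups stay cheap as rows pile up.
    conn.execute("CREATE INDEX idx_accounts_tenant ON accounts (tenant_id)")
    return conn


def add_account(conn, tenant_id, name, revenue):
    """Insert a row on behalf of a single tenant."""
    conn.execute(
        "INSERT INTO accounts (tenant_id, name, revenue) VALUES (?, ?, ?)",
        (tenant_id, name, revenue),
    )


def accounts_for(conn, tenant_id):
    """Return only the rows that belong to the given tenant, sorted by name."""
    cursor = conn.execute(
        "SELECT name, revenue FROM accounts WHERE tenant_id = ? ORDER BY name",
        (tenant_id,),
    )
    return cursor.fetchall()


# Two made-up tenants living side by side in the same table.
conn = open_shared_store()
add_account(conn, "acme", "Northwind", 1200.0)
add_account(conn, "acme", "Contoso", 800.0)
add_account(conn, "globex", "Initech", 500.0)

print(accounts_for(conn, "acme"))
print(accounts_for(conn, "globex"))
```

Notice that "provisioning" a new tenant in this sketch costs nothing beyond inserting rows with a new key, which is the appeal of the approach. The flip side is that data segregation depends entirely on every query carrying the tenant filter, so a real system would enforce that somewhere central rather than trusting each caller.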

At the end of the day, multi-tenancy is a design methodology. Multi-tenancy (however you implement it) allows faithful cloud landlords to follow basic rules. If Moses had come down from “The Cloud” instead of Mount Sinai, he might have brought these back instead:

(1) You shall provision tenants quickly without affecting other parts of the system.
(2) You shall be able to provide deterministic system metrics on a 24/7 basis.
(3) You shall let non-technical staff (or software) handle provisioning.
(4) You shall monitor your platform’s health day and night.
(5) You shall implement and frequently simulate doomsday scenarios.
(6) You shall make it as easy to provision one tenant as it is for one hundred.
(7) You shall segregate and protect thy tenant’s data at all cost.
(8) You shall upgrade transparently and automatically for all tenants.
(9) You shall allow tenants some degree of customization but not more.
(10) You shall have a simple, clear and consistent pricing model.

In my opinion, what goes on behind the scenes to support this is irrelevant because it’s irrelevant to the user. The user just wants to be “on-boarded” and provisioned pronto and safely without hassle. If you happen to be able to pull this off technically and cost-effectively, then you’re multi-tenant! If not, you have some pain coming your way.

Whether or not you can support thousands of tenants per instance (however you define “instance”) is, for all intents and purposes, really your problem, not the user’s. And I have yet to meet a prospect whose first and foremost concern involves “multi-tenancy”.

This might suggest we cease using the term as a marketing asset because I don’t think it resonates clearly. In my experience, the only things that matter to a cloud user are simplicity, service levels, trust, and cost. So let’s give users exactly what matters to them, explain it simply, and keep it real.

Yours in BI.


Friday, November 27, 2009

S&M: Where the SaaS rubber doesn’t meet the BI road.








I was lucky enough to be at Dreamforce 2009 last week and wanted to pen down a few thoughts while the event is still fresh in my mind. I don’t think there was any earth-shattering news there, and I got the feeling (both onsite and online) that a lot of people didn’t really grasp the value of Benioff’s announcement (or strategy) about “socializing” the platform with Chatter.  I, for one, certainly couldn’t make sense of Colin Powell’s presence at one of the keynotes (not sure what he can possibly offer the world of SaaS but maybe I missed something).  But overall it was an enlightening conference and here are some of my impressions (and they pertain mostly to the SaaS BI realm).

First, the sheer number of bodies at the event was impressive. I understand 18,000 people took part and that is quite a large crowd given how undersold (to put it mildly) other conferences have been this year. Businesses have been reluctant to invest in conferences in 2009 as evidenced by abysmal attendance numbers and the rising popularity of “virtual conferencing”. So if one conference was preferred over all others, it must have been Dreamforce 2009, because it seemed like everybody and his mother sent people there.

Second, I was impressed by the level of “education” the typical attendee exhibited. I didn’t really see or hear people asking basic “big picture” questions. Rather, the inquiries were very focused, deep, and to the point, revealing mature customers (buyers) who had done some serious homework. Actually, most of these folks have had meaningful experience in the cloud (some good, some bad) and knew how to hit the right vendor pressure points. From my standpoint, it is always vastly better (and more enriching) to deal with well-educated buyers in a no-nonsense approach. This is exactly the user profile I experienced at Dreamforce while manning the GoodData booth.

Third, and I realize this is subjective, there are a lot of small clueless companies out there having nothing to do with cloud per se who clutter these shows for the publicity of slapping “cloud” onto their marketing literature. I’m not going to name names, but let’s just say when you sell gardening shoes, mailboxes, or kitchen countertops, I’m not sure you should be spending marketing dollars on Dreamforce.

Fourth, I didn’t pick up any “religious” fervor at the show from either the buyer or the vendor side. I assumed this was going to be a major rah-rah for everything cloud (with Open Source type of fervor) but I found the discussions to be much more measured and rational with most people objectively comparing both approaches (when applicable) with few pre-conceived notions. I believe this is a sign of industry maturation as people are getting better at separating the wheat from the chaff. I feel that SaaS limitations are well understood by most (not all) people and that expectations are becoming more reasonable.

Fifth, I believe that SaaS beachheads have been claimed.  This is particularly true in the BI space. This has a lot to do with perception obviously but in my opinion, the winners and losers have already been tagged.  Most companies (if not all) are fairly new in the cloud space yet already, they have public reputations as in “these guys aren’t serious” or “these folks are the ones you want to talk to”.  Obviously first-to-market matters a great deal in any industry and cloud is no different.  Except with SaaS of course, you can be first to market with a virtual product (vaporware, in the on-premise world), and it takes longer for people to read the fine print but, eventually, they do.  When people have an interest in a particular SaaS domain, they go directly to the “top dogs” without stopping anywhere else.  I believe there is plenty of space for new companies in the cloud but for those having established an early lead (perceived or not, and with compelling technology), the future seems bright.

Sixth, everybody in the BI cloud space faces the exact same problems.  And no one has clear answers so this is still a very “trial and error” process.  The difference is between vendors who admit this, and those who don’t (mostly to themselves).  This is a sweeping statement but overwhelmingly, when you talk to other vendors, the same themes come out time and again.  Namely, how to scale fast enough (or onboard efficiently with minimal customer “touch”), and how to control sales and marketing costs which are turning out to be higher than anticipated.   Although adoption is growing, in my opinion, the technical hurdles are not remotely as high as the business ones.

Most of the BI cloud vendors have managed to get by on minimal engineering costs.  Offshore labor is cheap enough that you can afford competent engineering teams in India, Central Europe or China (to name a few) for literally dollars a day and minimal liability.  Some of these vendors are making ends meet with two-man engineering teams!  And only two I know of have engineering teams exceeding ten people. So clearly, the money pit is elsewhere.

And elsewhere is S&M (no, not the fun kind) namely Sales and Marketing. The original proposition for cloud was that the “service”, unlike traditional enterprise software, was going to kind of sell itself.  S&M budgets were going to be minimal. No more travelling field sales force, expensive face-to-face customer visits, pre or post-sales engineers.  It was all going to be “automatic” and on the web. But my limited experience contradicts this.

Because now, adoption and competition are growing. For example nowadays in SaaS BI, you have dozens of vendors. You skim enough to get to the “serious” ones (see #5 above) and now you’re left with maybe four or five guys. Next year, there will likely be ten serious contenders. The more competition you have, the longer sales cycles get and the higher costs climb. Next thing you know, you’re back to boots on the ground and to a more traditional enterprise software sales model. This is the danger facing many SaaS players these days. CEOs and investors are edgy about this emerging trend. It breaks the anticipated mold.

The other problem is what I call customer “touch-too-much”.  In a SaaS model, efficient on-boarding is crucial. This is not only about flipping the proverbial switch – because properly-engineered multi-tenant systems achieve this quite well – but more about the time it takes to get a user’s business problem solved.  Namely, the costly interaction spent on a given customer to handle specific needs and the amount of customization needed to achieve satisfaction (and final sign-off on the purchase order). POCs, sales cycles and marketing costs are growing.

This is a huge problem in the BI space because the requirements phase can be long and even there agility is not necessarily a Holy Grail.  The basic problem is simple: it is very difficult to remove the “human factor” when implementing BI.  Cookie cutter never satisfies a particular business problem entirely. The money’s in the customization and the subject matter expertise.  You must solve difficult business problems, not engineering ones.  The same challenges apply to software engineering, and history has seen a flurry of “blissful automation” endeavors fail in that space (remember 4GL?).  At the end of the day, you can’t remove what’s between the chair and the keyboard, and you can’t efficiently and consistently automate it in software – whether in the cloud or not.

Now, this is not so bad for a company like Salesforce.com because they’re a platform play. So by definition, they provide efficient functionality (large offering surface), but it’s all fairly mediocre and “cookie cutter” – users are free to (try and) customize their modules as they see fit. In the analytics space, for example, Salesforce reporting and analysis is shallow. Consequently, users looking for customer data insight and trending statistics (say for pipeline analysis) will look for integrated solutions fitting their specific business needs.

But for the after-market players who “plug” into something like Salesforce, it’s a major hurdle. Unless they can very quickly and cheaply customize their offerings for a myriad of different business cases, and move through POCs quickly, the SaaS hosting and costing model won’t do them much good. In my opinion, most existing BI SaaS vendors are currently struggling with this conundrum. LucidEra seems to have as well. Its demise was a shot across the bow. The first SaaS BI player to move past this problem wins the game and keeps the investors happy.

I have a hard time thinking of a SaaS BI vendor currently striking a reasonable balance between zero S&M and massive S&M.  On one end of the scale, I see folks adamantly opposed to spending a dime on marketing and expecting serendipitous results. On the other end, I see vendors placing heavy bets on misguided or highly-targeted verticals.  I see both strategies as precarious and hope for more level-headedness in the coming months.

No matter which way this goes, we're in for a wild ride. No prisoners will be taken. It's a great time to be in this business.

Yours in BI.

Tuesday, October 27, 2009

Of Birthrights and DNA in the BI Cloud World

I have been so busy getting ramped up on GoodData that I haven’t had a minute to post here in over a week.  On Friday, I am headed out to our R&D Center in Prague, Czech Republic - looks horrible doesn't it? :) - for a week to do a brain-meld with our engineering team and deep-dive the GoodData architecture. I’ve learned a lot about it in the past two weeks but clearly, there’s nothing like sitting down with the hard-core developers and discovering how the sausage has really been made for the past two years.

One of the unique things about this platform is the fact it was designed for the cloud from the get-go. “Born and bred in the cloud” as they say.  And that was several years ago, before the cloud “explosion”, when the idea of running a full BI stack in the cloud was just a fantasy to most people.  They said it couldn’t be done (matter of fact, some folks still claim that, but reality proved them wrong in 2009) but Roman Stanek and his guys did it and I think the results speak for themselves (here's one of many examples).

I think the “born and bred in the cloud” stamp is important because there is so much hype and hyperbole about all things “cloud” these days. Especially in the BI space.  A lot of times when you hear or read things about clouds, it tends to feel like this (with apologies to my Thai readers who obviously understand that script).

To separate the wheat from the chaff, you really have to ask the right questions. Namely, is your architecture really cloud-based or did you simply throw bits over the wall onto a bunch of hosted boxes (or VMs)?  Do you understand the meaning and implications of real multi-tenancy? Are you simply shifting technology because it’s hip (marketing-wise), or do you truly possess cloud DNA? And on the non-technical side, do the economics make sense?

In the mid-1990s, ASPs (application service providers) were all the rage. Everyone wanted to “go ASP”.  I worked for some of them. But the “software” crawled, got in the users’ way, was hell to maintain and deploy, and the costing model didn’t work.  Many (most) went belly-up. But hey, at the time, it sounded good and the VCs wrote checks.  But already back then, painful lessons were learned.  Namely that it’s virtually impossible to change horses in the middle of a race. 

The inherent value of “born and bred in the cloud” platforms is not simply technical to me.  Clearly the engineering is intricate and “cool”, but what’s really at stake is what Roman calls “price-based costing”.  (Speaking of pricing and clouds, check out Roman’s latest blog post, it’s à propos!).  This is a Peter Drucker concept but it applies really nicely to cloud-based economics.    

Price-based costing is defined as “choosing a desired sales price and costing out production to meet that sales price with a desired profit margin”.  The idea is to first determine what you want to charge for a product or service, and then work the development economics backwards to achieve that price point. When you’re starting from scratch with a cloud-based technology, you can pull that off, because parameters, cost and elasticity are linearly deterministic.  The up-front costs are fairly low and the flexibility is fantastic.  Additionally, the tools are cheap, commonly available and simple.  
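To make the arithmetic concrete, here is a minimal sketch of working backwards from a price to an allowable cost. The function names and the numbers are purely hypothetical (they are not GoodData's figures), and the margin is assumed to be expressed as a fraction of the sales price.

```python
def target_cost(price, margin):
    """Allowable cost per unit, given a chosen price and a desired margin.

    Assumption: margin is a fraction of the sales price (0.25 means 25%).
    """
    if not 0 <= margin < 1:
        raise ValueError("margin must be a fraction in [0, 1)")
    return price * (1 - margin)


def fits_price_point(actual_cost, price, margin):
    """True if the actual cost to serve leaves the desired margin intact."""
    return actual_cost <= target_cost(price, margin)


if __name__ == "__main__":
    # Hypothetical: charge 100 per account per month and keep a 25% margin.
    print(target_cost(100, 0.25))           # allowable cost per account
    print(fits_price_point(60, 100, 0.25))  # a cost of 60 fits the price point
    print(fits_price_point(90, 100, 0.25))  # a cost of 90 does not
```

The point of the exercise is the direction of the calculation: the price is the input and the cost is the output, not the other way around.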

Consider that the GoodData backend is almost all coded in Perl on Linux with a REST interface!  (I’m not suggesting Perl or Linux are “simplistic” but I worked the MSFT development stack all my life, and I can see orders of magnitude in complexity reduction here).  

So now, if you have legacy bits and a “ground-based” engineering platform (and people), you’re stuck on a price structure in this bearish economy, and you can’t just flip to a cloud model as a panacea, no matter what your Marketing people tell you.  Whether you sell proprietary or open source software, you’re in the same predicament. Sure, you can provide less functionality or cripple your offering (or worse yet, try to make do with fewer people), but just throwing it inside a “cloud” won’t do it economically, and the outcome will likely be painful. The DNA just isn’t there.

So when some BI vendors simply slap some code (open source or not) onto AMIs (say like Mondrian for example) and call it a BI on-demand offering with a ROLAP engine, it makes me wonder if they “get” the concept at all.  Mind you I ponder the same thing about big league BI tooling players who simply re-package their (oh so heavy) bits into virtual boxes and call it a cloud day as well. 

With a genuine “born and bred” cloud model, you have fine grained leeway in adjusting the pricing dial. Guess what, you can even turn it down all the way to zero and still survive for a long time at given data volumes.  At GoodData, for example, you can get 10MB accounts for free.  Not a huge amount, certainly, yet sufficient to “play with”.  But no matter how many people sign up for that, GoodData’s cloud platform can handle it both volume wise and cost wise at ridiculous scale and at the flip of a switch.  Scalability in a “born and bred” cloud platform is very (did I say very?) cheap.  And pricing can be finely tuned by throttling the technology, not the other way around. And this isn’t about taking a loss for marketing purposes. GoodData, like any other sound business (except OSS maybe), isn’t about giving stuff away for free, believe you me.

So I guess my point would be: caveat emptor.  The next time you’re considering replacing that big honking on-premise BI tool with the vendor’s “new improved” cloud-based on-demand version, check under the hood and make sure that “born and bred” DNA is there (hint: not likely).  And the next time you’re considering a “native” SaaS cloud vendor for your BI analytics, ask the right questions about what makes their backend tick.  You might be in for an interesting smoke and mirrors show.


Monday, October 19, 2009

On Bringing the Good News While Clearing Out Fungus.

I wanted to spend a little time sharing some good news with you.  I am going to be Technical Evangelist for a relatively new company called GoodData.  As evangelism means "bringing the Good News" to the world, this is truly a match made in Heaven for yours truly, as I will now be bringing the GoodData news to the world.


Many of you will have heard about GoodData before of course.  Most people in our industry know it was founded by serial entrepreneur Roman Stanek of NetBeans and Systinet fame.  Sun acquired NetBeans circa 1999 for $10M and HP grabbed Systinet in 2006 for a mere bag of shells ($100M of them actually). 


GoodData is backed by heavy-hitting investors including Andreessen Horowitz  (Marc was also an initial angel investor) but also O'Reilly AlphaTech Ventures  and General Catalyst.  It's always nice when you can point to Marc Andreessen pitching your company on CNNMoney.  GoodData also cleared another funding round last week as anyone following our industry will have heard.  This kind of news does not go unnoticed these days.


So what is GoodData about?  GoodData is an on-demand business intelligence platform for collaborative analytics (phew!).  This means the software runs in the cloud (Amazon's EC2 to be precise), so clearly the model benefits from all the usual technical, deployment, and financial benefits SaaS brings to the table.  But more importantly, and what really sold me on the concept, is the following equation (and those who follow me know my affinity for all things KISS):
gd = BI - BS


In English, this says GoodData is business intelligence without all the "bullshiitake" (term borrowed from my hero Guy Kawasaki).  So let's talk a little bit about that breed of mushroom.

  • When you need to implement a BI project but depend on IT to set up infrastructure and purchase software before you can even get started, that's bullshiitake.
  • When you need a PhD or significant training to start using a complicated overkill BI tool, that's bullshiitake.
  • When it takes you weeks or even months to analyze time-critical data for your company and show results to an impatient boss, that's bullshiitake.
  • When you can't collaborate with your peers and customers (internal or external) or integrate with outside apps while doing agile BI, that's bullshiitake.

GoodData represents a new way of thinking and acting about BI by removing these "mushrooms".  In that sense, GoodData is really re-inventing how BI consumers and producers interact, behave, grow, and produce results together.   And that, to me, goes way beyond a pure technology play.  It is a “next curve” move and an industry-shifting vision I wanted to help forge.


So in the coming weeks and months, I am going to be talking about the technical and social aspects of this industry-shifting platform. You'll be able to reach me at my new work email and I will be twitting on the @gooddata stream as well.

My pitch to BI professionals will be simple. If you seek independence, control of your environment, quick and deep results, and the flexibility to enhance your company’s bottom line, then you need to try out GoodData. We represent a fundamental new way of thinking about BI.  My role will be to show you how and why.

Sunday, October 11, 2009

BI in a New York Minute is Born





In my last post, I mentioned I was going to do a "trial & error" run on a recent concept of providing weekly BI tidbits (news items) for people on the go.  Having had some time to reflect on this endeavor a little further over the weekend, and after consulting with some key people, I've decided to try another tack.  Fact is, I prefer keeping my main blog for "deeper" less frequent posts. 


So instead, I've decided to create another way to disseminate this information. Additionally, I'm giving up on the idea of posting items on a weekly basis. Why cover weekly periods when, in fact, BI news is occurring pretty much 24/7 in real time nowadays, including weekends?


To deliver better, I have created a new blog called "BI News in a New York Minute" where I simply post tidbits in real time.  When I get them, you get them.  People can subscribe to this stream via the blog (obviously) or by pulling feeds from http://biminute.blogspot.com/atom.xml.  Additionally, I've linked BI News in a New York Minute to Twitter so updates are also broadcast there when available (almost, but not annoyingly, in real time).  So if you aren't following me in twitland yet, point here and click "follow" :)


This is just one modality of a more general media concept I am working on for the Business Intelligence industry.  In the meantime, I am hopeful (and grateful if) you will give me feedback and suggestions!


Yours in BI.
J.

Friday, October 9, 2009

Your Weekly BI News in a “New York Minute”

I’m a glutton for BI information to the point of addiction. A day rarely goes by without my “scanning” three hundred or so feeds, blogs, twits, websites and other news and information sources pertaining to business intelligence.  I did the same thing when I was on the software development side, mind you, but using significantly fewer online sources and way more books. I find BI to be so dynamic an industry that by the time most books are published, the information is already obsolete.  And that’s just on the technical side.  On the business front, things change and evolve even faster.

So I started looking around for an aggregated “digest” of the day’s salient BI news points. I was looking for something that spanned the multitude of industry areas in BI from engineering and product news (platform to front-end), to business development, partnerships, personnel (who quit, joined or got fired), distribution (SaaS and cloud issues), venture funding, important conferences, and so on.  In a nutshell, all the areas that affect our industry one way or another. To my surprise, I couldn’t find anything readily available.  Something someone could take on a train or pull down to a mobile device during a daily commute to get a quick “scoop” on the week’s happenings in BI.

Then I figured, why not try my hand at compiling such a list on a weekly basis.  I know this is fairly subjective, but while scanning my own sources, I always “pull over” several items for more in-depth examination. I do this very quickly (I’m a very fast reader) and often “bucketize” these links using Delicious.com for further examination and/or analysis.  Clearly people have different areas of interest but I figured this pruned list could benefit some folks with no more than a “New York Minute” to spare.

I’m not sure how best to format and present this yet. I didn’t want to start off with some sort of mailing list offering because, quite honestly, I’ve never stuck with a mailing list subscription more than a couple of weeks myself.  So for now I’m going to try and post this on my blog on a weekly basis every Friday.  If I detect minimal interest, then I’ll try another approach.  I'll also include more items in future weeks.  This is more of a "trial and error" endeavor at this point.  Meanwhile, and without further ado (or order), here’s my BI Digest for the week of October 5th, 2009 "in a New York Minute".
  • Discreet SaaS player 1010Data scores a big contract in the retail space by signing up Dollar General.
  • MicroStrategy is far from Open Source but giving away software for free nevertheless now. Trojan horse or good value?
  • Awesome webinar on what was learned in the past 20 years in BI compliments of Claudia Imhoff and WhereScape
  • "But can it core a apple"? On integrating OLTP and OLAP in the same database engine. One of the most commented (and educational, for me) Curt Monash posts I’ve seen.
  • First of a two-part in-depth series on MapReduce vs. relational databases.
  • You might not be paying your DBA enough money to put up with this (alternatively, you might be using the wrong DBMS platform ).
  • BI and DW “implementors“ take heed. You may need more duct tape after all (this doesn’t just apply to the software development community, believe me).
  • For a deep (and I mean deep) dive into what customer analytics in retail are all about.
  • A really novel (read: smart) way to engage your market without the BS, hassle and expenses of conventional physical conferences as described by Merv Adrian who tried it.
  • Dashboards in Excel 2010 (yawn). Sorry it’s been hard to get excited about MSFT BI in, oh say about 18 months now.
  • Can (or should) InfoBright actually kick MonetDB’s rear-end? The numbers are (almost) in.
  • In-memory databases and the Kognitio men who love them.
  • Well, looks like nobody is using EC2 after all (NOT!)
  • Larry invites Marc to Oracle Open World next week. See rumors fly...
  • At least one person is excited about Gemini (but it’s a monkey so..).
  • If you live and breathe “fabric” and grok fiber over Ethernet, this one’s for you.


And that's the way BI is.  Goodnight, and enjoy your weekend!

Wednesday, September 30, 2009

Baking MapReduce into Database Engines - Worth the Reduction Sauce?

MapReduce (implementations include Hadoop and CloudDB) has gained popularity in the industry. It also serves as marketing fodder for several new-breed ADBMS vendors who now claim to support it in various forms. So what is really behind this magic pixel dust, what problems does it solve, and how relevant is it to someone deciding on a new (or additional) ADBMS platform these days?

First, let’s point out MapReduce is not a technology but an algorithm. Wikipedia defines an algorithm as “an effective method for solving a problem using a finite sequence of instructions.” In MapReduce’s case, the problem being solved is the processing and analysis of very large data sets. The solution is a parallelized divide-and-conquer approach and works like this. First, you split up the “problem” into small manageable chunks. Second, you fan out each chunk in parallel to individual “work units” (maps). Third, you take individual results from each unit and recombine them into your final result (reducers). In SQL parlance, conceptually, it’s like doing a select aggregate with a group by.
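The three steps above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the sales rows, the function names and the use of a thread pool are all made up for the example, and real frameworks such as Hadoop add distribution, shuffling and fault tolerance on top. Conceptually it computes what `SELECT region, SUM(amount) FROM sales GROUP BY region` would.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split(rows, n_chunks):
    """Step 1: cut the input into roughly equal chunks."""
    return [rows[i::n_chunks] for i in range(n_chunks)]


def map_chunk(chunk):
    """Step 2: one work unit sums amounts per key for its own chunk only.

    (This folds a local pre-aggregation into the map step to keep it short.)
    """
    partial = defaultdict(int)
    for region, amount in chunk:
        partial[region] += amount
    return dict(partial)


def reduce_partials(partials):
    """Step 3: recombine the per-chunk results into the final answer."""
    total = defaultdict(int)
    for partial in partials:
        for region, amount in partial.items():
            total[region] += amount
    return dict(total)


def map_reduce(rows, n_chunks=3):
    chunks = split(rows, n_chunks)
    # Fan the chunks out to parallel workers, then gather the partial sums.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_partials(partials)


# Hypothetical input rows: (region, amount)
SALES = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 1)]

if __name__ == "__main__":
    print(map_reduce(SALES))
```

Note that the result does not depend on how the rows are chunked, which is exactly the "split and recombine without information loss" property the approach relies on.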

There are file-based and database-centric applications of MapReduce in existence. Of course, the presumption is that your “problem space” can be split up into distinct pieces and recombined without information loss. And not all problems are either large enough or mathematically suited to this approach. But luckily data management is, by definition, perfectly well suited because, as a mathematician once told me, “every data management task can be broken down into two and only two activities: partitioning and equivalence”.

Some folks think MapReduce is a modern breakthrough concept, but they’re wrong. The application of this algorithm to the management of large data is nothing new, as pointed out by Dr. Stonebraker in a 2008 posting. What’s “new” about MapReduce is that Google has popularized it. And the thought is, if Google can process and analyze the entire world via MapReduce, then clearly MapReduce must be the Holy Grail of monster data management. But Google has unique challenges (gargantuan data volumes), and some very impressive (and plentiful) gray matter at its disposal.

Because the interesting thing about this “divide and conquer” approach is that, although fairly easy to conceptualize, it’s incredibly hard to implement properly. The human brain is really not “wired” to think in parallel. Research has shown that the top brains can at best juggle seven different objects simultaneously (and for short time periods). To understand the intellectual challenges at play here, I strongly recommend watching this Google video series.

As I understand it, implementing MapReduce correctly and efficiently is probably as hard as conquering multi-threaded programming. And in twenty years, I have met three people who really understood multi-threading correctly and two of them were Russian PhDs. I've had battle-tested architects tell me they would rather shave with broken glass than tackle the risk and difficulty of multi-threading (luckily, they weren’t designing operating or flight-control systems!). My point is, it takes some pretty special skill and talent to do it right. Nothing inherently wrong with that, but it’s neither quick nor cheap.

So why then would database vendors race to support MapReduce? After all, dealing with and managing relational systems is complicated enough as is. But at least people have been trained in the art for decades, and SQL is the lingua franca. So the pitfalls and solutions are well established. Additionally, Codd’s premise guaranteed abstraction by separating the logical layer (SQL and a normalized schema) from the physical one (hardware and storage). But MR is heavy with non-standard cross-layer implementation details by necessity. Clearly a step backward from the KISS principle (even a “major step backwards” if you buy into Dr. Stonebraker’s argument that MapReduce is offensive).

Regardless, three well-known new-breeders, namely Aster Data, Vertica and Greenplum, jumped on the bandwagon early on and announced “MapReduce implementations” for their products. I wondered what compelled them to invest time and resources into something that didn’t seem essential (or cheap) to the market at large. Are users really clamoring for MapReduce support in their warehouse engines?

To learn more, I went to YouTube and checked out Aster’s video “In Database MapReduce Applications”. In it, I learned that graph theory problems (think: travelling salesman) were well suited to MapReduce but not SQL. Examples included social networking (LinkedIn), government (intelligence), telecom (routing statistics), retail (CRM, affinities), and finance (risk, fraud). Pretty much anything that can be modeled using interconnected nodes. But a connection from a node to another is really a “relation”, and so clearly well suited to a “relational engine”. So I might have missed something.

I also learned that existing applications typically extracted data from the database, performed some analytic work on it, and then pushed the data back into the store. In other words, they couldn’t perform processing inside the database. I found that generalization hard to swallow but reminiscent of numerous past battles on whether “business logic” belongs in the application or database layer.

Aster’s implementation of MapReduce is “deep inside” their engine, from what I understand. One example I could find was yet another YouTube video called “In-Database MapReduce Example: Sessionize”. In it, Shawn Kung shows a MapReduce function being used inside a SQL statement to “sessionize” user IDs in a clickstream context. Aster also provides very basic how-to’s on their website and blog. Clearly Aster is targeting this new MapReduce capability at the DBA side of their users, and it looks a lot like leveraging UDFs to me. Aster’s conclusion: “we need to think beyond conventional databases.” I’m all for that!
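For readers who have not met the term, here is a rough sketch of what “sessionizing” a clickstream means. It is plain Python, not Aster's in-database syntax, and the 30-minute timeout, the tuple layout and the function name are my own assumptions for illustration: clicks are grouped by user, ordered by time, and a new session number starts whenever the gap between two consecutive clicks exceeds the timeout.

```python
from collections import defaultdict


def sessionize(clicks, timeout=1800):
    """Assign a session number to each click.

    clicks:  iterable of (user_id, timestamp_in_seconds) pairs (assumed layout)
    timeout: gap in seconds that starts a new session (assumed 30 minutes)
    Returns a list of (user_id, timestamp, session_number), ordered by user
    and then by time.
    """
    # Partition the clicks by user. In MapReduce terms each user's clicks
    # are independent, so each partition could go to a separate worker.
    by_user = defaultdict(list)
    for user_id, ts in clicks:
        by_user[user_id].append(ts)

    result = []
    for user_id in sorted(by_user):
        session = 0
        previous = None
        for ts in sorted(by_user[user_id]):
            # A long silence between two clicks opens a new session.
            if previous is not None and ts - previous > timeout:
                session += 1
            result.append((user_id, ts, session))
            previous = ts
    return result


if __name__ == "__main__":
    sample = [("u1", 0), ("u2", 50), ("u1", 100), ("u1", 4000), ("u2", 60)]
    for row in sessionize(sample):
        print(row)
```

Because each user's clicks can be processed without looking at anyone else's, the problem partitions cleanly, which is presumably why it makes a popular MapReduce demo.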

Next, I wanted to learn about Vertica’s implementation. Especially since Vertica’s own Dr. Stonebraker had initially nailed MapReduce pretty hard as mentioned above. But Vertica’s new position seems to be that MapReduce is a-ok after all, provided it remains external and doesn't pollute the purity of the relational engine. I couldn't find much on YouTube or their website save for a press release dated 8/4/09 stating “With version 3.5, Vertica also introduces native support for MapReduce via connectivity to the standard Hadoop framework”. It seems the “scoop” on the Vertica/MapReduce wedding is best described in their corporate blog. Basically Vertica is OK with "integrating" or connecting to but not "ingesting" MapReduce (via Hadoop) if I understand clearly.

I was also able to glean some tidbits from Omer Trajman on Twitter. Namely that Vertica supports Hadoop “adapters” which allow you to read and write into the database (which is basically the press release). I wish I had more in-depth information about Vertica’s MR functionality but even a basic search for the term on their overly busy website yields zero information and, unless I missed it, I couldn’t find any relevant webcasts either.

Greenplum were, if I am not mistaken, first to support MapReduce. Greenplum has the best MR resource online if you ask me. It’s clear, detailed and full of insight. Greenplum has a merged/cooperative DBA/programmer approach in their offering. Programmers can write maps and reducers in their language of choice, leveraging DBA-generated data sets as (and if) needed, and DBAs can use MR functions along with SQL without (presumably) getting their hands dirty. There isn’t much to add to this excellent resource so I won’t.

So having mapped out all these facts, what can we reduce from it (I’m so funny) and more importantly, should any of this stuff matter to prospects when evaluating ADBMS vendors? IMHO, you might benefit from a MR-enabled ADBMS if:

(1) You have petabytes (or more) of data, an MPP architecture, and a search, scientific research, or mining problem a high-performance SQL engine cannot handle.

(2) You don’t have heavy legacy systems. Integrating (or migrating) existing business and relational code with a new-breed MR-enabled engine can’t be fun, quick or cheap. You might be one of the lucky few with pet projects on the table.

(3) You’re in Academia and have access to numerous cheap and competent programming resources, lots of metal, plenty of time, and limited pressure to succeed.

(4) Your organization has a track record of successful projects dependent on symbiotic working relationships between your DBAs and your programmers. In my experience, DBAs and programmers don’t work well together. They have different goals and approaches. And it seems intellectual and political integration of both resources would be a sine qua non condition to success with an MR database product.

Short of that, I can’t imagine too many people lining up at the MR-ADBMS vendors’ doors simply based on their MapReduce capabilities. And I don’t think vendors make that case either. In my opinion, supporting MR in the product simply says “Hey, look at me, I’m at the forefront of technology. See how smart I am.” But as a buyer, I’d be a little concerned about overreach.

In fact, I wonder how these vendors spread resources efficiently (and economically!) between database engine building, cloud provisioning (which Aster and Vertica now pitch), and MapReduce integration. I suppose marketing requires less focus than engineering as a discipline but still, that’s a lot on one’s plate.

Friday, September 25, 2009

SELECT SUM(blessings) FROM working_in_bi

If you’re working in the business intelligence industry, you should really count your blessings.  Mind you, you can do that if you’re employed anywhere these days.  But there’s something special about what I call the “BI Family”.  I’m not referring to a specific segment of BI, and I am not focusing on any particular job function. I’m talking about working in the BI industry as a whole.


I feel somewhat qualified to comment on this because, in the past twenty years, I’ve worked in numerous industries.  To name a few: life sciences, research, semiconductor, telecom, payroll, accounting, systems integration, consulting, audio-visual, online media, financials, and insurance.  So I have a lot of background to compare against.  And believe me, there are worse adoption outcomes than membership in the “BI Family”.  Here’s my subjective top ten list of why working in this industry is really cool (no specific order).


#1. Brains
People are really smart. I’m not suggesting other industries spawn dummies, but the proportion of high IQs in the BI world always amazes me.  It’s often humbling and always stimulating.


#2. Strength
As we all know, the BI market is not only huge but growing to the tune of 8-10% a year. Other industries are not so healthy, to say the least.  Simply put, BI is clearly not a fad, and no one questions its future.  It fuels our free-market system by protecting businesses and providing them with competitive tools.


#3. Meaning
Guy Kawasaki says: “make meaning”.  He’s right.  Life’s too short to be in a meaningless industry.  And if BI isn’t about making meaning then I don’t know what is.  The whole purpose of the industry is to make meaning and support critical decision making.  This industry yields real-life significant solutions to crucial sectors like health, research, medicine, and defense (to name a very few).  


#4. Quality of life
In light of ominous HR predictions, news of recurring layoffs, and current employment trends in the IT industry, the BI sector has been relatively spared. Clearly I haven’t done any formal polling but I get the “vibe” that people are generally pretty happy to be in this game.  And why not. Compensation is generally good.  And location-wise, BI companies are concentrated around national funding clusters, namely Northern California and the New York/Boston areas.  These comprise some of the most magnificent (and most expensive, granted) landscapes and vibrant urban areas in the country.  Other industries have centers in, shall we say, less compelling geographic areas.


#5. Funding
I lamented the VC situation in my previous post, and clearly this doesn’t apply to all segments of BI, but if you have any sort of compelling BI proposition with the word “cloud” in your business plan, trust me you will get a VC’s attention.  Maybe not an official invite to pitch in person, but most likely a phone call.  In other industries, that opportunity is long gone.


#6. Gratification
BI projects used to take many months (sometimes years) to implement (when they even got completed).  But nowadays the industry is in “agile” mode.  And those who don’t embrace that won’t likely be in this business much longer.  This means you get to build solutions and see results quickly.  That’s gratifying. 


#7. Globalism
BI is world-wide.  True, so are most other industries, but from my experience, there is less of an “us versus them” attitude.  It has a fraternal feel to it.  Hands and minds seamlessly reach across continents in ways I have not experienced elsewhere. (I’ll go hug a tree now).


#8. Bozos
Most people are genuinely nice and unassuming.  I know this sounds naïve at best but it’s true.  I have not seen the level of ego, axe grinding, or personal animosity frequent in other industries.  Every contact I’ve initiated from top analysts to CEOs in this industry has been followed up promptly with courteous, genuine and insightful discussion. I’ve found most people to be more generous with their time and advice than in many other industries. Maybe they fear less for their jobs or fancy titles. In either case, the BI industry is fairly low on the bozo scale.


#9. Passion
People in BI are passionate about their field.  I’m not saying they get out of bed every morning to go save the world (onward BI soldiers), but overall they value their work and their contribution.  Most people I’ve met in this business are workaholics.  They know their stuff inside-out and boy do they love to talk about it.  Passion signals a great, vibrant industry.  Additionally (and this is key), there seems to be better customer advocacy in this industry than others.  It’s not perfect, but vendors often do listen and react accordingly.


#10. Innovation
I hate to reveal this well-kept secret (don’t tell anyone!), but there isn’t a lot of desire to innovate in the financial, accounting or insurance fields, for example.  I’d be preaching to the choir by pointing out the myriad of new-breed ADBMS players out there, but also the multitude of new OLAP, data mining and analysis products, approaches and new (non-relational) ways of looking at data, cloud BI, EC2, etc.  We’ve seen orders of magnitude of both hardware and software innovation in the BI world.  It is a rich intellectual field teeming with innovation levels typical of a “new frontier” because it is.


So what’s the point of this apologist diatribe?  Just to remind people in this field to count their blessings. There are many worse places and industries to be in.  And it’s easy to take things for granted in the heat and excitement of daily business life.     


In the past eighteen months I’ve met many challenges in this business.  From coding to QA, to technical writing, from sales engineering to evangelism, from product management to market analysis. You name it.  So I’ve seen a lot of the facets in a very intense, very short amount of time.  


And I’ve also been lucky to interact with numerous players in the industry, many of whom have generously spent time and resources supporting my self-education efforts with their insight, connections, and advice.  You guys (and gals) know who you are and I thank you for the help. There are many mensches in this business.


For the first time in my life, I think I can say I’ve found a home here in the BI industry.  I’ve never felt this way in the past twenty years, and I’m not exactly sure how to explain it, but like the old pair of shoes my wife keeps insisting I ditch, it just feels right and I’d like to keep it that way.

Tuesday, September 22, 2009

Please Stop Making More ADBMS Sausage

If you’re thinking about building a new startup in the high-performance analytical database (ADBMS) market, hats off to you: kudos and respect, my brother. I’ve been in the kitchen, and I’ve seen the sausage being made. But let me tell you something: you might be a day late and a dollar short to the party.

In the past years, I’ve often pondered where the new-breed high-performance analytical database industry was headed. Will the existing players manage to survive? And is there a chance in hell for new ones to succeed in this market? If you had asked me to peek into my newly minted BI crystal ball in early 2009, I would have said “no way”. Why? Because at the time I was predicting the demise of a majority of the twelve or so players in this space based on the observation that the field was too crowded and too expensive. I figured, in this cut-throat competitive space, and with tough economic times ahead, we’d be lucky to see two, maybe three survivors come 2010.

Since then, not only have several other significant players, technologies and business models popped up (for example, Groovy, XtremeData, VectorWise, Hadoop/MR and the OSS guys), but we have clearly not seen the level of attrition I was anticipating. Nobody has (officially) gone out of business save Dataupia as best I can tell, and Datallegro got a check from Steve Ballmer. Sure, some folks are experiencing tougher times than others (I dare say several are hanging by a thread) but overall, resilience has been the name of the game. So what gives?

From my point of view, there are two types of new-breeders really: those living in “comfortably numb” mode (quietly outliving peers may not be a bad strategy these days), and those kicking it into high-gear with a vengeance. In the latter category I can’t help but think of Vertica, Netezza, Aster Data and ParAccel. It takes a lot of “cojones“, cash and luck to build a new ADBMS company. But even blessed with all these, and given the proper planetary alignments, I would advise anyone considering a start from scratch nowadays to ponder the following points.

First, it is insanely expensive and complex to develop systems software to produce a complete analytical database engine (and I mean “complete” in a holistic Product Management sense). Tony Bain highlights some of the challenges database startups face in his excellent series starting here (things are not significantly different between OLTP and OLAP in this respect). But systems software is a different animal than your run-of-the-mill corporate enterprise application. On average you’re looking at 200,000 man-hours and anywhere from $60M to $100M to fund such a venture to completion. This is just to get rolling. Additionally, you cannot just put out a database product and call it a day.

Maintaining (enhancing) a behemoth of Oracle or SQL Server stature runs hundreds of millions of dollars every single year. Everything from equipment to talent costs more when developing database software. Believe you me, the folks who will work on your SQL optimizer, inter-fabric communications, parallel or compression schemes better not be affordable newbies. Your development platforms won’t likely be the average ho-hum laptop attached to cheap storage. An efficient QA or Performance Group will cost a small fortune in payroll and redundant equipment. And seasoned performance architects don’t run the streets. You cannot assemble a database engine product by cobbling together open source bits and distributed talent like a new Web 2.0 RIA venture. You can’t take a Kia (even a dozen of them) to the Indy 500.

Second, I think it’s no longer possible to find sufficient levels of Venture Capital funding for such endeavors. My feelings on this issue are reinforced when I read articles like this, or this. I think the writing has been on the wall for several years now. Reports of the VC’s demise are greatly exaggerated, but the funds, the endurance, and the risk acceptance levels are gone. Small bets on small returns are in. Large bets on IPO-driven returns are out (for now). Even if you manage to score a major industry name like Mike Stonebraker on your Board, I think VCs in this space (those that are left and not now engaged in M&A) will say “talk to the hand”. Investors in numerous existing new-breeders are biting their nails to the bone (or looking for ways out). So to me, the train has left the station. And unless you can pony up your own seed money, trying to fund such a project via institutional money is currently, in my opinion, an exercise in futility.

Third, the field is already too crowded and spread out very thin. For a great overview of the major players out there, don’t miss Bloor Research’s competitive analysis paper. A lot of existing players do not have sufficient “boots on the ground” to make headway against larger ones, much less established powerhouses. Heck, even going against Vertica’s deep-pocketed marketing is no piece of cake. Worse yet, in this business, success is not guaranteed by technical superiority. I know it sounds heretical saying this about an industry dominated by performance claim testosterone, but it’s true.

Besides technical prowess, you need to get the word out louder than everyone else. Unfortunately, everyone has the same “word”. In a crowded space this means you have to yell “Fire!” pretty darn loud and relentlessly to get noticed. I think a lot of “database deals” are sealed on the golf course, more so than in POCs or bake-offs. Mind you, this is probably the case with most enterprise software. But to get your foot in the door, you need BOD and Prima Donna action. BOD are well-connected heavy hitters on your board. Prima donnas are the star sales guys (or gals) currently working for your competition. Those you’ll have to poach with sweetheart deals to come work for you, a totally unproven new-breeder with a year of runway to go.

Fourth, pricing pressure in this business is relentlessly choking. This is a consequence of my previous point. A little over a year ago, word on the street was $100K/TB retail ($50K/TB street price) but now we’re seeing $20K/TB retail (TwinFin land), which probably means you can do $10K/TB on the street. Aster Data is pitching an appliance for $50K (1TB, includes Dell hardware), and Oracle’s new improved Exadata V2 (SATA storage) even touts $5,700/TB. At these margins, you’re basically talking about giving stuff away, and in a lot of cases, I suspect that’s what’s going on. So unless you’ve been around the block a bit and have some ammo in the bank, I don’t know how a newcomer can sustain this type of pricing “carpet bombing”. As if that weren’t enough, you have OSS and cloud players breathing down your neck. Customers expect more for less and perception of BI as a “commodity” is growing. In this pricing environment, survivability for a newbie is improbable at best.
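To put rough numbers on that slide, here is a quick back-of-the-envelope sketch in Python. It only uses the per-terabyte figures quoted above; the constant names are mine, the $10K/TB street price is my own guess as already noted, and comparing Exadata's list price against last year's retail number mixes vendors, so treat the result as an order-of-magnitude illustration and nothing more.

```python
def percent_drop(old_price, new_price):
    """Return the percentage decline from old_price to new_price, to one decimal."""
    if old_price <= 0:
        raise ValueError("old_price must be positive")
    return round((old_price - new_price) / old_price * 100, 1)


# Per-terabyte figures quoted in the paragraph above, in US dollars.
RETAIL_LAST_YEAR = 100_000
STREET_LAST_YEAR = 50_000
RETAIL_NOW = 20_000
STREET_NOW = 10_000  # the post's own guess, not a published price
EXADATA_V2 = 5_700

if __name__ == "__main__":
    print("retail drop (%):", percent_drop(RETAIL_LAST_YEAR, RETAIL_NOW))
    print("street drop (%):", percent_drop(STREET_LAST_YEAR, STREET_NOW))
    print("Exadata V2 vs. last year's retail (%):",
          percent_drop(RETAIL_LAST_YEAR, EXADATA_V2))
```

Both the retail and the street figures work out to the same proportional decline in a little over a year, which is the whole point: the discount structure stayed put while the floor fell out from under it.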

Fifth, several windows of “technology opportunity” for ADBMS are closing. For example, if your great idea for a new ADBMS company involves a columnar approach, you might be too late to the party. If your “innovation” hinges on massive parallelism, compression, in-memory caching/cubing schemes, super-fast intra-nodal fabrics, hybrid MPP/SMP, hybrid row-column storage (PAX-like), or yet another SQL chip accelerator or super-duper FPGA, you might have missed the boat (on the other hand, if you figured out how to do analytics on compressed encrypted data, then you might be on to something).

I believe the top new-breeders did all the technical legwork in the past 4-5 years. It took Mike Stonebraker long enough, but Vertica pretty much put columnar on the map. And most of that engineering is now mature enough (and well proven) to warrant acquisition interest from the big boys. Initially, the big guys took a “wait and see” attitude (remember, OLTP butters their bread anyway, not analytical OLAP) but now, having seen results and traction on others’ dime, I think they’re ready to pony up some cash (classic buy vs. build decision) to absorb the bits and pieces suiting their marketing strategies. By doing so, the good ole boys re-invent themselves and say “hey look, we have columnar technology as well now!” (How Sybase didn’t corner this market with IQ is beyond me, especially having read Seth Grimes’s excellent paper about it).

Better yet, by integrating new technologies into existing code bases, the big dogs can say “hey, we have the best of both worlds for OLTP and OLAP” (Oracle’s latest Exadata comes to mind). And perhaps “look, we have SMP on the processing side and MPP in the storage layer”, or vice-versa, thereby returning to the old “one-size-fits-all” GP-RDBMS paradigm so criticized by Stonebraker (but so convenient for the corporate user).

And clearly, given the growing popularity of “operational analytics”, an OLTP+OLAP offering is compelling. So I think the “proof-of-concept” window for many new-breed technologies, specifically MPP columnar (but others as well, for instance, acceleration hardware, where Ingres is picking up VectorWise, or MPP where Microsoft snapped up Datallegro), has closed. The winners (and their results) are in and acquirers will likely make their move in 2010. This dovetails nicely with a recent TDWI survey claiming half the respondents plan to replace their DW platform between 2010 and 2012 (apparently this is presented here on October 7th).

All this being said, is it possible that a brand new software endeavor currently in stealth-mode development in Nepal might suddenly dominate the analytical database scene within months? How about a revolutionary FPGA/SQL Chip/Flash/Optical hardware contraption that could blow the hinges off industry standards and benchmarks? Sure why not. Real innovation is usually unexpected, and often unintended. But I don’t see it being driven by the classic “VC funds startup makes big database scores big IPO” model much longer.

When I look at things like Hadoop and current developments in the OSS space for VLDB analytics, I still have trouble grasping the business model, but I clearly see “life force” innovation at work here. A year ago I would never have expected a place like Visa to stray from the “Big Threes” but nowadays these guys are messing with Hadoop! People are also doing amazing things with MapReduce implementations and BigTable KV types of massive data storage systems.

How does open source fare against the five points mentioned above? Pretty darn well if you ask me. Costs are significantly lower, venture capital is not needed or minimal, engineering is crowd-sourced, there’s more breathing room, market entry is viral and massive, distribution and testing self-fueled, and pricing (or lack thereof) better controlled.

Additionally, open source seems shielded from the “Borg Effect”. I don’t see how massive proprietary shops like Oracle, IBM or Microsoft can successfully “absorb” these entities. I don’t think Larry has a clue what to do with MySQL. He can’t really unload it, but he can’t really integrate it either. Darn Trojan horse! In 2008, Infobright went Open Source and raised $10M in the process. Looking at the results, I think these guys were smart!

So if you’re thinking about building yet another high-performance analytical database engine the classical way (and not going OSS), my advice to you is simple: unless you have $60M in the bank and technology significant (and new) enough to impress people like Daniel Abadi (good luck on that one), you might be climbing up a greased pole. I'm not saying it's impossible, mind you, but enough cooks have been making the same sausage over the last five to six years to last us a while. Maybe it's time to look at the next curve.


Thursday, September 17, 2009

Vaya Con Dios - The Day I left XSPRADA

Effective today, I will no longer be working at XSPRADA. There, I said it. Catharsis, take me away! It was definitely a tough decision but this is par for the course in the startup world, and God knows I’ve been there done that, but in this case, my close long-time personal relationship with the founders and unwavering worship of the technology for the past ten years make this particularly bittersweet.

I was very lucky to work with some of the best, most resilient people this industry has to offer and I don’t claim this lightly. As you know, I am parsimonious with compliments. But "they" say that tough times never last, only tough people do. This is why I am still convinced that XSPRADA technology will someday take its proper place in the world come what may.

A failed endeavor is only one during which you have learned nothing. And in this case, the XSPRADA opportunity has enriched me in personal and professional ways beyond my wildest expectations. So I depart a richer, more experienced man for it. In an industry too often clouded by “smoke and mirrors”, I am always reminded of the following advice I received a while back:

  1. Don't profess about things you are not sufficiently familiar with.
  2. Don't assume something isn't true that might be, or is true that isn't.
  3. Listen at least twice as long as you talk.
  4. Ask twice as many questions as you answer and LISTEN to the answers.
  5. When asked about something and you don't know, say "I don't know."
  6. If there is even the slightest possibility that you could be wrong, acknowledge it....
  7. and don't forget to thank people for their time and advice.

You combine that advice with the one found here and believe you me you’ve got yourself one kick-ass sales engineer there. They don’t run the streets. I’d say these points could (or should) constitute the Seven Commandments of the Sales Engineer (and probably any other profession, except perhaps politician). They were given to me when I started by Chris Piedmonte, founder of XSPRADA, and a guy whose courage, integrity and technical brilliance are, in my book, without equal.

Be that as it may, I must soldier on. Naturally, I will not be able to discuss topics pertaining to XSPRADA technology as an insider from now on, but Lord knows there is sufficient ADBMS/BI material out there that’s interesting enough to cover and discuss on a regular basis. Namely, the new “PAX Analytica” movement as originally brought up by Curt Monash and professionally laid-out by Daniel Abadi as usual in his excellent post.

Also, the old one-size fits all (OLAP+OLTP+whatever) versus dedicated engines (columnar OLAP) has been revived with the recent ORCL announcement touting the new improved Exadata V2 (exit HP, enter Sun). This deserves addressing in more detail. It leads to serious implications for current and potential customers.

Relevant as well is a discussion about the future. Will the “big boys” end up swallowing “new-breed” technology and integrating it (what I call the Borg effect, as discussed in Daniel Lemire's excellent post)? Or will they become obsolete, allowing the new-breeders to survive long-term as independent replacement entities?

And then there’s a recent thread about analytical speed which is very relevant at this point in time I believe. Finally, a little bird told me there’s about to be some really interesting rumble (again) pertaining to the infamous (or not, depending on which side you’re on) TPC organization. Indeed, there’s no lack of interesting topics out there!

Additionally, I think there’s a compelling story behind employment search and provisioning in the BI industry, so I’ll be penning some thoughts about that as I go along. In my experience, you can infer a lot about an industry’s state by checking its recruiting culture and pulse. Joy Chen claims in a recent blog posting that 54% of workers plan to resign after the recession. If this prediction is correct, the impact on our industry is sure to be felt and that, IMHO, is worth discussing.

So what’s really happening behind the employment scene in BI? I’ll be sharing some thoughts about that as I embark on the new path the BI Gods have charted for me. So thanks for sticking around, and as they say where I come from (well, ok maybe a little further South) Vaya con Dios!

Wednesday, August 26, 2009

Mole Whackers Need not Apply

Two completely different events caught my attention lately. One of them is a post by Curt Monash called Bottleneck Whack-A-Mole, and the other is the much-publicized alliance for “BI in the Cloud” comprising RightScale, Talend, Jaspersoft and Vertica.

In the post, Curt describes software development (or developing a good software product) as “a process of incremental improvement”. Fair enough. The analogy he draws is between constantly fixing and improving performance bottlenecks and the annoying (if entertaining) arcade game of Whack-A-Mole where you have to be fast enough to clobber enough of the critters as they randomly pop up from below. He then makes the point that “Improving performance in, for example, a database management system has a lot in common with Whack-A-Mole.” Having spent most of my life designing, developing, improving and testing commercial and enterprise software applications, I have to say I don’t totally agree with his analogy for several reasons.

First, call me old-fashioned, but I’m an ardent believer in the fact that software building is deterministic. Whack-A-Mole engineering is not. The age-old controversy about software being more of an art than a science may never be resolved, but at the end of the day, I feel software is (should be) a scientific, engineering-driven, deterministic endeavor like any other engineering discipline. With Whack-A-Mole engineering, buildings and airplanes fall to the ground. That’s not good. In my experience, those who seek to “romanticize” software engineering are typically averse to proper planning, design and testing, which they dismiss as too “dry” or unworthy an endeavor. That’s nonsense.

Second, there is a distinct difference in the way you develop “regular” software from “system software” and I’ve learned this from sitting in the front row the past several years at XSPRADA watching database software being built from the ground up. It’s a little bit like the difference between building a tree house and a major commercial skyscraper. And I believe that playing Whack-A-Mole games while trying to bring up a building is a scary proposition at best (especially for future tenants). And yet, the example Curt provides involves Oracle’s Exadata, of all products! He states: “When I spoke to Oracle’s development managers last fall, they didn’t really know how many development iterations would be needed to get the product truly unclogged” – This statement is mind-boggling to me.

Because for one thing, it suggests that Exadata is “clogged” (ouch) but worse, that their engineering people have no clue as to how they might eventually (if ever) snake the blockages out of it! So, basically it’s a trial and error approach to building a database. Notwithstanding their “professed optimism” that it wouldn’t take “many iterations at all” to finally figure things out, it certainly doesn’t give me (or any reasonable person) a warm feeling about a multi-million dollar product claiming to be the world's ultimate analytical machine.

I think there’s a lot to be said for sound engineering practices, proper planning and testing, setting expectations and deterministic engineering management practices in the world of system software. That Oracle (or Netezza for that matter, also referenced in the post) might just be going along whacking moles instead is a scary proposition indeed. Even if this little game is limited to “performance engineering” as Curt suggests (as if there were a more important endeavor in an ADBMS), that’s a serious allegation in my book. I say leave the arcade games to the kids, and let the real engineers design and implement database and system software please. There’s no room for amateurs in this game.

On to my next point of interest: the new Gang of Four in the Cloud (with apologies to design pattern aficionados) comprising RightScale, Talend, Vertica and Jaspersoft have recently promoted and demonstrated a “bundled” on-demand package for the cloud. I attended their webcast yesterday and was impressed, but with reservations.

Each of these vendors is impressive on its own, no doubt about it. But it seems to me the bundled proposition might be confusing at best to the unwary customer. This new offering is billed by the marketing folks as “Instant BI, just add water” which drives me nuts. Look, it might be simple in theory, and it might take a few minutes to set up the stack on your own (as Yves de Montcheuil from Talend claims) but it’s still a long way to actually accomplishing anything serious in a few simple clicks. Sorry, not going to happen anytime soon.

You still have to work your way through provisioning and instance management (RightScale), data integration and loading (Talend), feeding and configuring the database (Vertica), and setting up the reports/analytics you might need (Jaspersoft). All of which can be accomplished just as easily (or not) internally by the way. It’s true you’d still have to purchase or license Vertica internally, which may or may not match the SaaS pricing I’m not sure (and either way, Vertica has a SaaS offering as well) but the other components are open source so, I’m not sure I see the big advantage there.

An interesting thing I noticed as well is that some people didn’t seem to understand what RightScale’s role was in the whole offering. This tells me they don’t really grasp the intricacies of “the cloud” – because instance and infrastructure management for enterprise in the cloud is not trivial and you do need something like RightScale to grease the wheels (it’s an abstraction layer really), but I think many people assume moving to the cloud is “magic” and makes all these issues disappear. If that were the case, you wouldn’t need RightScale in the mix. Beware undermanaging expectations I'd say.

Additionally, the pricing model (which is supposed to be so much simpler in the cloud) is confusing at best as each vendor has its own menu. The best answer to that I can remember was “starting at $1,700 per month” – I’m not sure what to make of that. So I think from an engineering/technical standpoint, this endeavor is noble, but from a “let’s make things simpler and transparent for the user” perspective, there’s still a lot of work to be done. In other words, it's a nice play for the vendors holding hands, but I'm not sure how beneficial it might be to the average enterprise user.

As usual, caveat emptor – Beware promises of a holy grail in BI as there is no such thing. It’s all about work. Hard, detailed and careful work with proper planning and budgeting. In that respect, setting up successful BI solutions is a lot like running and implementing software projects. There are no shortcuts, and it’s not a job for mole whackers.

Monday, August 17, 2009

Oh yeah? Well my database is SMALLER than your database!

Contrary to popular edict, smaller is not always better, unless of course you’re talking about analytical database engines. In that respect, it’s hard to find an ADBMS that can fit on hard media like a CD or a USB stick. For example, I don’t think SQL Server, Oracle, DB2, Greenplum, Aster, ParAccel, or the myriad of other ADBMS vendors can fit all their bits in a tight spot. Even in the open source realm, I doubt you can wedge InfoBright (MySQL) or IceBreaker (Ingres) onto a stick, much less shlep their bits around as an email attachment.

One exception to this is the V-stick from Vertica. When I first read about this, I initially thought it was a hoax but apparently not. It’s pretty cool too because it includes the O/S, web server, GUI and the engine all together on a 16GB thumb drive. How an engine like Vertica, designed around distributed MPP, can possibly operate representatively (using terabyte-size data) on a thumb drive is beyond me, and I’ve never heard of anyone actually using this gizmo but I’d sure love to get my hands on one and review it if it’s still available.

The other exception of course is RDM/x, the XSPRADA database engine. The reason is simple: its total deployment footprint is around 10MB. That includes the 32/64-bit ODBC drivers and a couple of DLLs. The engine itself is currently around 6MB. Last I looked the installer clocked in at 16,760KB. This means you can actually deploy RDM/x onto a memory stick if you want to. I tried it, it works. It’s pretty cool. But after a while I wondered, why would anyone care about this?

The reason is two-fold. First, it’s really easy to try out software that is small and self-contained without expending large amounts of time and resources. Yes, you can download RDM/x from our website but in many cases (like secured firewalled enterprises), that’s not an option.

Second, it means we’re a good candidate for embedded applications. Because if I can fit my database engine on a stick (or in an email), I can probably embed it in instruments and devices as well either as raw C++ code or libraries.

But for quick POCs, size and simplicity really do matter. Say you’re suddenly tasked with evaluating solutions to deploy a BI solution inside your company. Suppose you’re a Microsoft shop. Suppose additional capex is not an option, and suppose further you have a week to show results (namely a set of nicely formatted reports, pivot tables or dashboards). Now what? If you have significant in-house experience with SQL Server and associated SSAS, SSIS, SSRS, and Excel (and assuming you have a clear and deep understanding of the business scope and goals to begin with) you’re probably going to:

(1) Figure out where your source data is coming from (connection strategies)

(2) Model your DW (figure out grain on facts, dimensions etc, need to figure out BIDS and SSAS)

(3) Establish some preliminary ETL process (including incremental loads, need to figure out SSIS)

(4) Load your warehouse (if you screw it up, then need to drop and do it over)

(5) Set up an SSAS cube structure (figure out SSAS via SSMS or BIDS then publish the thing)

(6) Figure out what queries to generate (talk to DW DBA or learn MDX)

(7) Figure out what BI tool to use (Excel or browser, depends on policies and audience)

(8) Generate the reports (canned or ad-hoc)/dashboards/pivot tables for the POC

Now, if you have no prior experience with the Microsoft BI toolset, and you can whip this little project up in a week, guess what, you need to quit your job and start a consulting company because clearly, as a NYC recruiter once told me “you’re so money”. But if you’re a normal person with little prior BI experience (and the terms ROLAP, MOLAP, SCD and MDX don’t ring a bell), you’re in a bind.

So another thing you can do is download a tiny analytical database (say, the XSPRADA RDM/x engine, for example) and throw, say, 100GB of data at it (this is just a small POC remember?), then plop Excel on top of it and generate some really cool reports or pivot tables to show the boss (in under a week) it can be done. How hard is that to do? This hard:

Figure out where your source data is coming from.

Yup, that one is pretty universal in the BI world. Difference here is all your data sources will export as CSV to feed the XSPRADA engine. So at least that’s consistent across all sources (be they structured, semi-structured or not). CSV is data format lingua-franca so your connection "strategy" is this: get everything out as CSV. Plain and simple.
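As a rough illustration of the "get everything out as CSV" strategy, here is a minimal Python sketch that dumps the result of a query to a CSV file using only the standard library. Everything in it is an assumption for illustration: the in-memory SQLite database stands in for whatever source system you actually have, the `sales` table and the function names are made up, and whether the engine wants a header row is not something I'm asserting here, hence the `include_header` switch.

```python
import csv
import os
import sqlite3
import tempfile


def export_query_to_csv(conn, query, csv_path, include_header=True):
    """Run a query on a DB-API connection and write the result set to a CSV file."""
    cursor = conn.cursor()
    cursor.execute(query)
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if include_header:
            # cursor.description has one entry per column; item 0 is its name.
            writer.writerow([col[0] for col in cursor.description])
        row_count = 0
        for row in cursor:
            writer.writerow(row)
            row_count += 1
    return row_count


def demo():
    """Export a tiny hypothetical 'sales' table and read the CSV back."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real source system
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)", [("east", 10.0), ("west", 5.5)]
    )
    csv_path = os.path.join(tempfile.mkdtemp(), "sales.csv")
    exported = export_query_to_csv(
        conn, "SELECT region, amount FROM sales ORDER BY region", csv_path
    )
    conn.close()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return exported, list(csv.reader(handle))


if __name__ == "__main__":
    print(demo())
```

The same function should work against any driver that follows the Python DB-API, so each source only needs its own connection and query while the CSV side stays identical.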

Model your data warehouse.

That’s always a smart thing to do for obvious reasons although the XSPRADA engine is schema-agnostic and you can feed it normalized or star/snowflake models at will. The secret phrase is: “we don’t care”! So for a quick POC, if you find yourself "forced" to feed RDM/x a 3NF model, no worries.

Establish some preliminary ETL process.

RDM/x runs against initial CSV data islands directly off disk. Point to the CSV files using the XSPRADA SQL extensions for DDL and you’re done. You’ll likely be doing this via script or code (C++, Java or .NET to the ODBC driver directly or via a JDBC-ODBC bridge). For incremental loads, just plop the new CSV files on disk and point RDM/x to them using the INSERT INTO…FROM extension. This process can be done in real time without disruption while other queries are running. No hassle there.

Load your warehouse.

That’s executing a single line of SQL DDL code such as

CREATE TABLE ….FROM “c:\file1.csv;c:\file2.csv…c:\file32.csv”; or INSERT INTO…FROM “c:\file1.csv;c:\file2.csv…c:\file32.csv”;

Made a mistake or want to modify the schema and “reload” real quick? Not a problem. Simply re-issue the same DDL command and the table/schema is instantly updated. From a trial and error perspective (which, in a POC situation, is fairly typical), that’s a high-five.
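Since you'll likely be scripting this, here is a small hedged sketch of how the file list for such a statement could be assembled in Python. It only reproduces what is visible in the example above: a FROM keyword followed by a quoted, semicolon-separated list of paths. The part of the statement before FROM is elided in this post, so the helper takes it as an opaque string from the caller. The `sales` name, the file paths, the helper names and the use of straight double quotes are all assumptions for illustration, not documented RDM/x syntax.

```python
def csv_source_clause(csv_paths):
    """Join CSV paths into a quoted, semicolon-separated FROM list."""
    paths = list(csv_paths)
    if not paths:
        raise ValueError("at least one CSV path is required")
    for path in paths:
        # A semicolon or quote inside a path would corrupt the list.
        if ";" in path or '"' in path:
            raise ValueError("unsupported character in path: %r" % (path,))
    return 'FROM "' + ";".join(paths) + '"'


def build_load_statement(statement_prefix, csv_paths):
    """Append the CSV source clause to a caller-supplied statement prefix."""
    return "%s %s;" % (statement_prefix.strip(), csv_source_clause(csv_paths))


if __name__ == "__main__":
    # Hypothetical table name and file paths, for illustration only.
    files = [r"c:\file1.csv", r"c:\file2.csv"]
    print(build_load_statement("INSERT INTO sales", files))
```

The resulting string would then be handed to whatever ODBC or JDBC layer you use to talk to the engine. Keeping the prefix separate means the same helper serves both the initial load and incremental loads.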

Set up an SSAS cube structure (figure out SSAS via SSMS or BIDS)

There is no concept of cubes inside the XSPRADA engine. RDM/x automatically slices and dices based on incoming queries in real time. So if you want to “cube” just feed the engine slicing OLAP queries. RDM/x automatically restructures and aggregates in real time. No need to pre-define or pre-load cubes, deal with hierarchies or materialized views. I blogged about this earlier. RDM/x is a lot like Luke 11:9 – Ask and you shall receive.
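For readers wondering what "feeding the engine slicing OLAP queries" looks like in practice, here is a minimal sketch of two slices over the same fact data, written as plain GROUP BY queries. It runs against an in-memory SQLite database purely as a stand-in (SQLite does none of the automatic restructuring described above), and the `sales` table, its columns and the sample rows are all hypothetical.

```python
import sqlite3

# Dimensions a caller may slice by; anything else is rejected so that the
# column name can be safely interpolated into the SQL text.
ALLOWED_DIMENSIONS = {"region", "product"}


def build_demo_db():
    """Create a tiny hypothetical fact table in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (region TEXT, product TEXT, year INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?, ?)",
        [
            ("east", "widget", 2008, 10.0),
            ("east", "gadget", 2009, 4.0),
            ("west", "widget", 2009, 6.0),
            ("east", "widget", 2009, 5.0),
        ],
    )
    return conn


def slice_totals(conn, dimension, year):
    """Total the amount per value of one dimension, restricted to one year."""
    if dimension not in ALLOWED_DIMENSIONS:
        raise ValueError("unknown dimension: %r" % (dimension,))
    query = (
        f"SELECT {dimension}, SUM(amount) FROM sales "
        f"WHERE year = ? GROUP BY {dimension} ORDER BY {dimension}"
    )
    return conn.execute(query, (year,)).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(slice_totals(db, "region", 2009))
    print(slice_totals(db, "product", 2009))
```

The point of the sketch is that each "cube view" is just another query over the same table: no cube definition or pre-aggregation step appears anywhere in the code.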

Figure out what queries to generate (talk to DW DBA)

That’s where an external tool using MDX (along with an MDX expert!) can come in handy (most people don’t roll their own SQL for OLAP, although it can certainly be done in POC mode). One cool thing about RDM/x is its ability to “withstand” poorly-formulated SQL because the queries are optimized against the internal mathematical model. RDM/x is typically more “SQL-forgiving” than most other engines. And a poorly formulated query is likely transformed internally to still yield optimal performance. So even if you’re no SQL guru, the RDM/x engine is still on your side.

Figure out what BI tool to use (Excel, no brainer)

Connect Excel to the XSPRADA engine directly via ODBC or connect Mondrian to RDM/x (via bridge) then connect Excel to Mondrian via the SimbaO2X ODBO/XMLA connector. Alternatively, make the argument that using OSS like Pentaho or Jaspersoft against RDM/x directly is more flexible and accessible (not to mention cheaper!) than messing with Excel. Depending on your user base and corporate standards, that argument may or may not hold water.

Generate the reports/dashboards/KPI/Pivot Table/ad-hoc queries required by management.

Exactly the same way you would using any other tool and/or SQL.

At the end of the day (or in our case, the week), it’s all about “time to results” and “pain to results”. In those types of situations, smaller and simpler clearly has a significant advantage over the rest. And speaking of smaller, I have run over my allocated space for this posting :)

Friday, August 14, 2009

Bits & Pieces Summer Posting

I thought I would do a “freeform post” today to celebrate the lazy 2009 Summer and the fact that most of the planet (but not the BI world for some reason) seems to be on vacation at the moment.

As you know I’ve been following ParAccel with interest for a short while now wondering how they would deploy those $22M of Sales & Marketing greenbacks they just scored. I read this interesting article about them lately and it looks like “customer acquisition” might be part of their strategy. Good move. Unfortunately, I couldn’t determine who the “other database products” or the other “columnar-MPP database” vendor might refer to. I can only surmise it might be Vertica. If anyone knows who else OfficeMax looked at, please share the wealth.

For all my whining about missing TDWI in my own backyard (San Diego) lately, it seems I didn’t miss much after all according to Merv Adrian who posted about the conference shortly thereafter. From the looks of it, the highlight might have been a sunset ride on the Lyzasoft yacht in the San Diego bay. Talk about good PR!

Andy Heyler wrote a good piece about the demise of Dataupia called No Data Utopia. In it he refers to the “awkwardly named” Dataupia. Yes Dataupia is a weird name. But so are several others such as Kickfire, Tokutek, or Calpont, for example. And although XSPRADA is admittedly rather funky, at least it’s an acronym (Extended Set Processing for Rapid Algebraic Data Access). To me the money quote in there is “you need to have a clearly differentiated position in such a crowded market”. It’s pretty much what I’ve been saying for a while (and common sense if you ask me). At the moment, I don’t see any of the players in this market besides XSPRADA with a “clearly differentiated position” on anything. At the end of the day, it’s still all about the prisons that are columns and rows.

Netezza announced the long-awaited (not) TwinFin product line prompting a flurry of nasty posts from competitors like Kognitio, and an interesting Monash post about data warehouse pricing. Much like traditional software license pricing, ADBMS prices seem to be reaching for the bottom (stay tuned for $19.95 per terabyte while supplies last!). And as someone recently said, the bottom is open source. Should be interesting to see what happens. Personally, I think a lot of this stuff is going to become commoditized. The play looks a lot like printers or razors, where the actual hardware is sold dirt cheap, but the paper or blades cost a fortune to replenish. Caveat emptor.

A relatively new ADBMS vendor called XtremeData has emerged. It looks like they’re based in the US (Schaumburg, IL to be precise) but the actual brains of the operation are in India somewhere. They’ve certainly been vocal on several BI blogs (namely, DBMS2). To me the funniest thing is their “ChalkTalks with Faisal” screencasts. Faisal is apparently their India-based CTO. The entire presentation is like a Netmeeting whiteboard session where Faisal keeps talking while drawing stuff on a whiteboard in a Flintstonish manner. All that’s missing are the stick figures. It wouldn’t be so bad if they realized the audio is horrible, due to the incessant noise from the marker writing on the board. It sounds exactly like squealing puppies in the background. It’s totally distracting albeit very amusing.

SQL Server 2008 R2 is out but without the Gemini (or Madison née Datallegro) pieces I guess. That’s Office 2010, parts of which are going into the cloud if I understand correctly. This whole Madison/Gemini “revolution” in BI is starting to get a little, shall we say, boring for lack of materialization. Not sure what’s going on with Microsoft lately but I’m getting more and more concerned. Even .NET seems to be taking a backseat to Java/J2EE. It wasn’t like that a year ago. I am sensing a scary downward spiral. One thing’s for sure, I have yet to see anything remotely connected to .NET or C# in the BI programming world, save for those .NET C# extensions Aster provided in their engine recently for doing MapReduce (and of course the PushBI initiative but even there…). To me it seems the entire ADBMS/BI code stack is Java on Linux (SuSE, Red Hat and CentOS). I’m talking about the supporting/ecosystem tools of course, not the underlying engines (those are mostly C/C++ I believe).

I could be wrong but, it’s not looking good for Microsoft. My prediction: this behemoth will eventually split off into a myriad of smaller entities, some of which will survive, some of which won’t. The sum of the parts may be worth more than the whole.

Ingres has actually managed to generate some buzz in the US lately by announcing it is teaming up with a company called VectorWise (well, it’s a research outfit actually) to develop a “project”. No customers yet but a lot of very fancy PhD types in Amsterdam (they did MonetDB/X100) and a first option to buy VectorWise is rumored should the venture be successful. Time will tell. It always amazes me that Ingres isn’t more of a household name in the US. In Europe, they’re very popular (at least that’s what my French Ingres experts tell me).

Finally, a recent article is claiming that BI is used by only 8% of employees in the enterprise and that’s, of course, only counting the shops that have implemented it to begin with. Actually I find that number high and would have guessed more like 5%. This is not surprising in light of the economic woes and scandals we’ve witnessed recently. In most of these cases, it wasn’t a lack of technology or resources at play but rather a conscious choice to ignore reality. The tools are there. The desire is not.

In my opinion, this disconnect is very apparent in numerous retail outfits: places like Home Depot, Whole Foods, AT&T or Circuit City (RIP), for example, which clearly have the resources and tools to perform and exploit top-notch business intelligence but still consistently manage to provide mediocre service or product at best.

At Home Depot, they can’t (or won't) keep track of inventory correctly. I was once told it was because there was too much theft (both internal and external) to update the databases frequently enough! Consequently, they can’t tell you if they have some items in the store or not. Not all items, just some of them. And when they can, they’re unable to locate them physically inside the store (as in what aisle and section). That’s shocking to me given what we constantly hear about RFID and data warehousing investments at these large box places. But Lowe's does a much better job of this so clearly, it’s not a technology issue.

At Whole Foods (at least the one in Irvine where I live) the quality and quantity of their product is inconsistent at best. Some days the self-serve fish is fresh, and sometimes it is not (and has been sitting out for too long looking like a nice fat food poisoning lawsuit waiting to happen). Like the proverbial “box of chocolates”, you never know what you’re going to get. Similarly, their checkouts are never balanced. You’ll see huge lines at several of them while employees sit idle at empty others. Every time I go there some happy-go-lucky line manager with a bright idea of the week makes it harder for me to shop and enjoy it there which is why I never set foot in the place any more (instead, I go to an even more expensive supermarket). Apparently no one is keeping track of this. Least of all the department managers who seem to frequently act on impulse in trial-and-error fashion. Yet surely Whole Foods has an uber-BI stack running somewhere in Austin giving them a “big picture” on a store by store basis right? You’d think someone would be paying attention (or maybe even using it)? But clearly they are not, or someone would be fixing these problems (or at least genuinely addressing customer complaints, which they won’t because it's "inconvenient" for them, as they like to put it).

I don’t think it’s so much about the difficulty of implementing and leveraging BI as the article suggests. I think it’s about genuine laziness on the part of upper management. Because, at the end of the day, if you’re the CEO of a place like this, you need to get your highly compensated butt out on the floor to truly see, taste and smell what’s going on. You need to talk to your employees, your customers, and get in their shoes (incognito if possible). I’m always shocked to hear C-level people lamenting the fact that their CRM system isn’t giving them enough visibility into their customers. Or better yet, they need to “understand the customer” better. Who’s kidding who? Fact is they simply don’t give a hoot most of the time. And if they’re too lazy or too self-important to do that, they’re not likely to pay much attention to BI tools and warehouses either no matter how fancy or ubiquitous the software might be. That's really just the nature of what "service" has become in the US lately. In order to improve BI usage, we will have to improve the quality of Management first and put real folks back in charge.

Tuesday, July 28, 2009

How I Learned to Love Mondrian (confessions of a WISA guy)

I’ve been playing with Pentaho’s Mondrian for almost a year now on and off. I have to say, every time I mess with that stack I am more and more impressed by its richness and capabilities. And twelve months ago, when I started learning it, I was what you could call “severely LAMP-challenged”. I’ve sure made a lot of progress since then and wanted to talk about this as I figured it might help other Microsofties out there needing (or wanting) to put a toe in these mysterious LAMP/OSS waters.

The first thing I ever did with Mondrian was figure out how to install it on a Windows platform. I had three reasons for doing so. First, we didn’t have appropriate Linux hardware/software in house at the time. Second, I have way more experience on Windows so it’s a lot easier for me. And third, I wanted to do it locally and avoid dealing with cross-platform bridging at the moment (our ODBC drivers are Windows only as well). Path of least resistance is an engineering mantra in my book.

Lucky for me I had worked with Java and Apache Tomcat in the dot-com days so I had no trouble pulling and installing the JRE/JDK and the web server itself (which comes as a Windows service). Next, I deployed the Mondrian WAR file into the Tomcat webapps folder which caused it to be automatically “deployed” as a web application. Way easier than deploying ASP.NET applications (but you didn’t hear this from me).

Then, I fired up the Mondrian landing page, clicked on the JPivot link and, of course, kaboom. Yes, without a JDBC driver, Mondrian is not a happy camper. It took me a little longer to figure out the Sun JDBC-ODBC bridge and how to plug the corresponding connection string into numerous Mondrian files to replace the default connections there (which are all for MySQL if I recall).

The Mondrian documentation isn’t great but if you Google long enough you can usually find some other poor slob with a similar problem and, with luck, published solutions online. [Side note: I once had the nerve to email Julian Hyde, their Chief Architect, about some technical question. He abruptly suggested I don’t bother him and use the “community” forums instead]. Unfortunately those forums are often useless for non-enterprise (read: non-paying) users. There are flurries of unanswered questions and problems up there. This is a generic OSS problem I suppose. You get what you pay for :)

So finally I had my bridge set up, along with a DSN called MondrianFoodMart (the default) pointing at the default Access database (distributed with Mondrian). And now I was able to fire up Mondrian on top of the database and do a couple drills, run a couple MDX queries. Bliss.
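For anyone trying to reproduce the bridge setup, here is roughly what the connect string ends up looking like, sketched as a small Python helper. This is only an illustration: the helper name is mine, the catalog path is the FoodMart location I assume from the sample WAR, and the driver class is the old Sun bridge class (which no longer ships with recent Java versions). Adjust all three to your own install.

```python
def mondrian_connect_string(dsn, catalog, driver="sun.jdbc.odbc.JdbcOdbcDriver"):
    """Build a Mondrian connect string that reaches an ODBC DSN via the JDBC-ODBC bridge.

    dsn     -- name of the Windows ODBC DSN (e.g. MondrianFoodMart)
    catalog -- path to the Mondrian schema XML (assumed location, check your WAR)
    driver  -- JDBC driver class; the Sun bridge class is assumed here
    """
    parts = [
        ("Provider", "mondrian"),
        ("Jdbc", f"jdbc:odbc:{dsn}"),
        ("JdbcDrivers", driver),
        ("Catalog", catalog),
    ]
    # Mondrian connect strings are semicolon-separated key=value pairs.
    return ";".join(f"{key}={value}" for key, value in parts) + ";"


connect = mondrian_connect_string("MondrianFoodMart", "/WEB-INF/queries/FoodMart.xml")
print(connect)
```

The same string has to replace the default MySQL one everywhere it appears in the web application, which is why building it once and pasting it around saves some grief.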

Next, I took that Access database and exported it to CSV format with corresponding DDL. This is the way to feed the XSPRADA engine. Fired up our RDM/x server and ran the DDL scripts. Then re-defined the MondrianFoodMart DSN to point to us via our ODBC driver (32-bit only, 64-bit won't fly with the bridge - painful lessons learned...). Reloaded the Mondrian page, and voila! Mondrian was now talking to RDM/x and displaying the Sales cube.

One point of the exercise was being able to show Mondrian OLAP on top of our database. Another was being able to show our database’s behavior in time as more and more queries come in (hint: it gets faster). Now, with Mondrian, this is a little tricky because the platform is heavily cache-based. Mondrian shoots initial queries at a relational system and proceeds to cache heavily as it aggregates results. So the more you use it, the more it caches. Obviously the reason for this is that Mondrian is designed to run on top of relational databases, and not “OLAP-intelligent” engines such as ours. It has to translate the MDX into straight SQL queries every time, as it fills its caches initially. Nevertheless it does re-hit the database as needed when you start slicing and dicing on new dimensions or facts, as one would expect. So you can actually see our engine’s “dinosaur tail” behavior as I once described in a previous post.
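To make the caching behavior concrete, here is a toy sketch (not Mondrian's actual cache, just the general idea). It uses an in-memory SQLite table as a stand-in for the relational backend; the table, the column names and the CachingAggregator class are all made up for illustration. Repeating a slice is served from cache, while a new dimension goes back to the database.

```python
import sqlite3

ALLOWED_DIMENSIONS = {"region", "product"}  # guard against arbitrary SQL text


class CachingAggregator:
    """Toy OLAP layer: aggregates come from SQL once, then from an in-memory cache."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}    # dimension name -> {member: total}
        self.db_hits = 0   # how many times we actually queried the database

    def total_by(self, dimension):
        if dimension not in ALLOWED_DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        if dimension not in self.cache:
            self.db_hits += 1
            sql = f"SELECT {dimension}, SUM(unit_sales) FROM sales GROUP BY {dimension}"
            self.cache[dimension] = dict(self.conn.execute(sql).fetchall())
        return self.cache[dimension]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, unit_sales INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("west", "beer", 3), ("west", "wine", 2), ("east", "beer", 5)],
)

agg = CachingAggregator(conn)
agg.total_by("region")    # first request: goes to the database
agg.total_by("region")    # same slice again: served from cache
agg.total_by("product")   # new dimension: goes back to the database
print(agg.db_hits)
```

This is why it takes a bit of care to show the database's own behavior under Mondrian: only the cache misses ever reach the engine.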

Now, the MondrianFoodMart database is fine for setting up the stack, but it’s not particularly interesting in so far as data volume goes if you’re in the VLDB space. More recently, I attempted to set up the TPC-H/SSB sample data under Mondrian, meaning I tried to manually create the XML defining some SSB cube. There is a whole fairly complex XML language Mondrian uses to define and connect fact tables (measures) with associated dimensions. They have a UX driven tool called Workbench but I could never get it to work on my system (and didn’t have enough time to keep messing with it). With online help and using the FoodMart.xml sample file, I was able to get a basic cube up in about a day. Nothing fancy, but now I can OLAP into arbitrarily large data sets and that’s a good thing.
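To give a feel for that XML, here is a minimal sketch that generates a bare-bones cube definition in the Mondrian 3 schema style. This is not the file I actually built. The schema and cube names are hypothetical, and the table and column names (lineorder, customer, lo_custkey, c_custkey, c_region, c_nation, lo_revenue) are assumed from the standard SSB layout, so check them against your own DDL.

```python
import xml.etree.ElementTree as ET


def build_ssb_schema():
    """Return a minimal Mondrian-style schema: one fact table, one dimension, one measure."""
    schema = ET.Element("Schema", name="SSB")            # hypothetical schema name
    cube = ET.SubElement(schema, "Cube", name="LineOrders")
    ET.SubElement(cube, "Table", name="lineorder")       # the fact table

    # A dimension joins the fact table (foreignKey) to a dimension table (primaryKey).
    dim = ET.SubElement(cube, "Dimension", name="Customer", foreignKey="lo_custkey")
    hier = ET.SubElement(dim, "Hierarchy", hasAll="true", primaryKey="c_custkey")
    ET.SubElement(hier, "Table", name="customer")
    ET.SubElement(hier, "Level", name="Region", column="c_region")
    ET.SubElement(hier, "Level", name="Nation", column="c_nation")

    # A measure is a fact column plus an aggregator.
    ET.SubElement(cube, "Measure", name="Revenue", column="lo_revenue",
                  aggregator="sum", formatString="#,###")
    return schema


schema_xml = ET.tostring(build_ssb_schema(), encoding="unicode")
print(schema_xml)
```

A real cube needs more dimensions (date, part, supplier) and usually more attributes per level, but the fact table, dimension and measure pattern above is the core of it.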

As cool as it is seeing our stuff run under Mondrian, I always dreamed of doing the same thing under Excel (as in 75% market share, yeah I want to support that please). Until recently, I thought this would not be possible until we implemented MDX in the engine but then I saw the light. It is called the SimbaO2X connector and it rocks!

[Start Commercial] Did I mention how much I love this company Simba? They pretty much wrote the book on data connectivity. Within a day they had me a 30-day trial version of their O2X offering, no questions asked. And follow-up to boot. Their stuff works, and they know how to take care of people. What a concept! [End Commercial]

This SimbaO2X puppy lets ODBO clients (say like Excel) talk to XML/A OLAP servers (say like Mondrian). Note, there is a similar offering from Pentaho called Pentaho Spreadsheet Services. It carries a small yearly license fee from what I understand. Supposedly you can email Pentaho sales for additional information and a local contact. I’m still waiting for their reply. Hey it’s OSS…Did I mention you get what you pay for?

Either way, the relevant fact is that the SimbaO2X connector works without a hitch. I am finally able to create and manage pivot tables from Excel, talking to Mondrian (via XML/A), talking to RDM/x (via ODBC)! This is the bomb! I need to really get a deeper understanding of MDX capabilities now. But the more I learn about it the more impressed I get, and the better demos I can do.
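For the curious, what travels between the connector and Mondrian is an XML/A request, i.e. MDX wrapped in a SOAP envelope. Below is a rough sketch of building such an Execute request. It is not what SimbaO2X literally emits; the helper name is mine and the DataSourceInfo and Catalog values are assumptions that depend on how your Mondrian datasources are declared.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def xmla_execute_envelope(mdx, data_source, catalog):
    """Wrap an MDX statement in a minimal XML/A Execute SOAP envelope."""
    return (
        '<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">'
        "<SOAP-ENV:Body>"
        '<Execute xmlns="urn:schemas-microsoft-com:xml-analysis">'
        f"<Command><Statement>{escape(mdx)}</Statement></Command>"
        "<Properties><PropertyList>"
        f"<DataSourceInfo>{escape(data_source)}</DataSourceInfo>"
        f"<Catalog>{escape(catalog)}</Catalog>"
        "<Format>Multidimensional</Format>"
        "</PropertyList></Properties>"
        "</Execute>"
        "</SOAP-ENV:Body>"
        "</SOAP-ENV:Envelope>"
    )


mdx = "SELECT {[Measures].[Unit Sales]} ON COLUMNS, {[Store].Children} ON ROWS FROM [Sales]"
envelope = xmla_execute_envelope(
    mdx,
    data_source="Provider=Mondrian;DataSource=MondrianFoodMart;",  # assumed value
    catalog="FoodMart",                                            # assumed value
)

# Parse it back to confirm the envelope is well-formed XML.
root = ET.fromstring(envelope)
statement = root.find(".//{urn:schemas-microsoft-com:xml-analysis}Statement")
print(statement.text)
```

The response comes back as XML too, which the connector then translates into the ODBO rowsets Excel expects.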

The Canary in the Gold Mine?

I’ve been claiming for a while that data mining and predictive analytics (PA) were the new hills to conquer in BI and this morning the news came out that IBM had plopped down big money for SPSS. IBM is also investing R&D dollars in ways to manipulate data directly while encrypted and/or compressed. This particular research fascinates me because I believe it will be key to SaaS acceptance, where security is still a significant push-back for obvious reasons. This means analytics might actually have a future on the cloud. And this is important IMHO because this allows for significant progress in the UX systems required to use (drive) mining engines efficiently. The kind of improvements that cannot be generated and deployed quickly enough with fat client implementations. I’m thinking of really interesting things like www.spezify.com for example.

Another interesting trend is pushing analytical capabilities deep into the database engine either via stored procedures or user-defined functions in one or more programming languages (much like .NET inside SQL Server, for example). All this leads me to believe that insightful BI players have been turning their guns on solving the next big pain point of BI which is, IMHO, data mining and predictive analytics. This embedded capability relates to the deep kind of analytics I once blogged about in the context of Greenplum’s MAD paper.

So does this mean we’re all done with OLAP? Not likely, but I think a certain peak has been reached where OLAP has become “bearable”. I don’t really have a 3-5 year “future outlook” on OLAP at this point. Is it still hard to cube and do MDX? Yes. Is it still a pain in the behind to set up large SSAS analytics? You bet. Is setting up a production version of Pentaho’s Mondrian ROLAP for the faint of heart? Not exactly. But there are now multiple alternatives out there in both hardware (faster COTS components, FPGAs, GPUs, MPP) and software (columnar, ALGEBRAIX) realms.

Our own ADBMS at XSPRADA is designed and tuned specifically for OLAP workloads in its present form. Products such as ours have helped “commoditize” OLAP work by shifting design and pre-structuring efforts (cubing, slicing and dicing) from the user (DBA) to the software itself. This is done automatically and based on queries coming in. There is no need to configure cubes, mixed workloads are supported, and all the user really has to do is ask questions. It’s that simple really. Let the software worry about the darn cubes!

So I guess my point is, if there are people still struggling (read: losing time and money) with OLAP in the enterprise, I have to say it’s because they’re either poorly advised or simply not opening their eyes to new tools and techniques currently available. At this point OLAP pain is no longer a necessity. It’s an uneducated choice. From a technical standpoint, it has been addressed. Let’s move on to the next problem please. This is why I think the industry is poised to tackle another challenge now, namely data mining and predictive analytics. Even Curt Monash in a recent blog about the SPSS acquisition writes:

“So far business intelligence/predictive analytics integration has been pretty minor, because nobody’s figured out how to do it right, but some day that will change. Hmm — I feel another “Future of … ” post coming on”.

Sorry Curt, I beat you to it :)

Mining is a totally different segment of the business intelligence endeavor. When you do OLAP, you’re asking “tell me what happened and why”. When you do mining, you have no clue what happened and much less why. In mining you’re asking “tell me what I should be looking at” or “tell me what’s interesting in this data?” And predictively, you’re asking “tell me what’s likely to happen” – as in, show me the crystal ball. Mining is not a pre-structured, pre-indexed kind of “cubing” world. It’s an ad-hoc discovery process. It’s iterative. Much like the way a human brain functions when discovering information, and trying to make sense of it. This “human-like” behavior is actually one of QlikView’s usability pitches. In mining, the relational model is a hindrance, not an asset, because relationships are not necessarily canned or static. Predictive analytics are more of an art than a science as well. These concepts don’t fit nicely in pre-structured, tabulated formats.

Additionally, mining and PA are creative endeavors (whereas OLAP is not). This is why it’s important to let users define their own “stuff” so they can trial-and-error through the problem. Conventional database engines don’t support this type of workload elegantly. It’s simply not “structured” nicely like OLTP or OLAP. You can’t easily (or cost-effectively) try, erase and re-start with conventional engines. They're not forgiving.

So what’s needed are systems that can first intelligently process data upstream in ELT mode because acquiring statistics on incoming data (at varying rates) is an important step for analytics. XSPRADA’s engine starts analyzing data statistically upon initial presentation. More importantly, it keeps doing so automatically in real time, and continuously via comprehensive optimization. This is a unique feature that causes the system to continuously re-evaluate system resources against queries and data to seek out additional or more effective optimizations.

Next, you need systems that can tell you where NOT to look. Because in this type of work, pertinent data is often clustered in very specific areas (as in 5% of 100TB perhaps). And user questions tend to hit within small percentages of those clusters. Yes there are always exceptions, but generally-speaking, that’s what happens. So what you DON’T want are systems that spend a lot of time scanning boatloads of data (needle in the haystack). What you need is intelligent software that can quickly eliminate vast areas of informational “no-man’s land” based on incoming queries. In such a problem space, throwing additional monies at ever more powerful metal is a self-defeating approach. It’s the software, stupid! :)
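A simple way to picture “knowing where NOT to look” is plain min/max data skipping, sketched below. To be clear, this is a generic textbook technique I am using for illustration, not a description of how ALGEBRAIX works internally. The Block class, the block size and the sample data are all made up.

```python
from dataclasses import dataclass


@dataclass
class Block:
    rows: list  # values stored in this block
    lo: int     # smallest value in the block
    hi: int     # largest value in the block


def make_blocks(values, size):
    """Split values into fixed-size blocks and record min/max statistics for each."""
    blocks = []
    for start in range(0, len(values), size):
        chunk = values[start:start + size]
        blocks.append(Block(chunk, min(chunk), max(chunk)))
    return blocks


def range_query(blocks, low, high):
    """Return matching values and how many blocks actually had to be scanned."""
    hits, scanned = [], 0
    for block in blocks:
        if block.hi < low or block.lo > high:
            continue  # statistics prove nothing in here can match: skip it
        scanned += 1
        hits.extend(v for v in block.rows if low <= v <= high)
    return hits, scanned


values = list(range(1000))
blocks = make_blocks(values, 100)
hits, scanned = range_query(blocks, 250, 260)
print(len(hits), "matches,", scanned, "of", len(blocks), "blocks scanned")
```

Here nine of the ten blocks are never touched. The point is that the win comes from software deciding what to ignore, not from faster hardware chewing through everything.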

As it turns out, XSPRADA’s ALGEBRAIX technology is very good at eliminating "useless" (read: at a given time) data spaces. Not only that, but it also shines at inferring subtle relationships between different entities. The kind of relationships a human wouldn’t even think of asking on her own. It’s also very good at recognizing patterns (both in queries and targeted result sets).

In a way, you would expect that a system built on pure mathematical foundation would be particularly well suited to data mining workloads. And it sure is. This is the beauty of having a “wide” and rich enough technology that is as easily and readily applicable to a multitude of different BI problems. It means you don’t need to re-invent the wheel or re-architect your system every time a new problem space opens up. And that, in the business intelligence technology world is a rare find indeed.

Monday, July 27, 2009

Who leaves a country packed with ponies to come to a non-pony country?

Usually I think of myself as a “hot-shot” when it comes to XSPRADA technology and its applications. This is because I’ve been involved with it for ten years, and that kind of history builds bonds. In a word, having lived and breathed it for so long, I’m severely biased, but at least, I’m aware of it. It’s a completely different story when you talk to a user who also happens to be biased from experience running the stuff to solve real problems. That’s when you hear praise that makes you step back and go “wow, we really do shine here above and beyond”. There is nothing sweeter than an adamant customer evangelist.

One such person we shall call Tim (why not, since it's his real name). He works for a major DOD contractor out here in California. Tim asked me to withhold his company’s name for obvious reasons. He’s been running POC projects using XSPRADA technology for years. As a matter of fact, Tim once ran a real-time CEP version of our engine (which can handle both real time and historical input) for a demo bid project he needed to put together where no other vendor could come close.

Tim is the ultimate engineer’s engineer and one of the smartest folks in the “information” field I’ve ever met. He’s got experience galore and has been around the block a few times. Currently there are other groups in Tim’s shop running the XSPRADA engine for other purposes, and he keeps abreast of those POCs as well. I could tell you what they entail but then I’d have to kill you. Suffice to say that the engineering being done there would blow most people’s minds (as in, holy cow, we're actually doing this?!?).

It turns out Tim is biased as well, but he’s biased from a user perspective. This is "been-there-done-that" advocacy. And that, in my book, is far more compelling than any argument coming from an insider like myself. Although I don’t know too many people who can explain and position the technology as well as I can (I’m so modest, no pictures please!), Tim had the following analysis recently and I thought it was so “perfect” I had to reproduce it here (bolding my own):

“One of the distinctions I’ve been using lately to explain the difference between set based data processing and most everything else (row based, column based, partition based,…) is that most other DBs are based on defining somewhat arbitrary bins of data of fixed size or dimension (tables with fixed columns in RDBMs, column collections for things like Vertica, et al., and chunks in Google’s BigTable designs, to name a few). Then there is significant overhead to partition the incoming/outgoing data to fit into these fixed containers. Inevitably, any operation on these artificial partitions will include wasted processing or I/O on irrelevant data that just “happens” to live in the affected partitions. This is a huge waste of time and resources. In addition, these bins are continually reused by means of destructive updates which require them to be locked during transactions to avoid data corruptions. This is the other main source of waste in that significant delays are now imposed not only on the relevant data involved in the operation, but also on collateral data that might be holding up other operations unnecessarily. These two effects are mutually opposed: larger partitions would help the I/O problem, but at the expense of exacerbating the locking problem. And vice-versa. By contrast set based data systems, like XSP, use completely variable sized containers (the sets) dynamically partitioned based on operational relevance and not on any predetermined partition sizes. In this way, the amount of irrelevant data moved across the I/O boundary for any data operation is significantly reduced. And because these sets are immutable, there is no locking interference with other concurrent data operations.”

To put it in Southern California speak, I was like, wow, this dude really gets it! And there isn’t much I can add to Tim’s conclusions. I bow to the completeness of his understanding; his analysis stands on its own.

The context of our exchange was about pre-structured or “canned” mechanisms used by other database engines versus the flexibility of our approach. It’s what I naively call “bucketizing”. And this topic is of course related to the ADR functionality I was discussing in my last post. But it also pertains to the “schema agnosticism”, parallelism, and ACID aspects of the database I’ve mentioned in the past.

The XSPRADA engine dynamically adapts to queries and data. It avoids the inherent rigidity prevalent in all other technologies. Yes it’s compelling to handle analytics via columns, but that’s only one of many ways you can address the problem. And if your technology is “columnar” in nature, it’s the ONLY way you can address the problem. You’ve put all your eggs in one basket. At the end of the day, you’re still looking at the world in a tabular format (which happens to be vertical). So it’s what I call the one-trick pony approach.

There’s nothing wrong with one-trick ponies if you need to solve a very specific business problem quickly and efficiently. But from a holistic business perspective, it’s a scary proposition. If you’re running an enterprise, you want flexibility. You want the ability to address problems as they come up using all available means at your disposal. You’re looking for a wide array of tools and methods to win battles, not a single weapon system. And this is what XSPRADA technology offers: “the ability to apply the right technique for any question for any data at any time”.

If Tim is concerned with waste and inefficiency, it’s because his shop deals with tera and petabyte volumes of data with limited shelf-life. In other words, waste is not an option. This isn’t about airline transactions messing up your trip or bank accounts being debited incorrectly by the way. This is about national security and people living or dying in real-life tactical situations.

In applications like this, one-trick ponies don’t cut the mustard. And this is why Tim and several other groups in his company have been looking at XSPRADA technology for years. There simply isn’t anything out there that can meet their requirements, and believe me they’ve tried all the usual suspects. To Tim and his colleagues, the unfair competitive advantage XSPRADA can deliver to their company (and clients) is well worth the risk of evaluating technology that’s a little out of the ordinary.

Friday, July 17, 2009

ADR and how I got kicked out by Kickfire

I had the opportunity to brief Merv Adrian (major BI industry analyst) recently about XSPRADA and in discussing competitive technological differentiation, I highlighted the three major points that make XSPRADA uniquely stand out in a sea of “new-breed” ADBMS vendors, namely our algebraic engine (aka ALGEBRAIX), adaptive data restructuring (ADR), and temporal invariance. Those of you who have honored me with their readership in the past will no doubt be familiar with those terms and concepts. (side note: Merv is the one who coined this ADBMS term that I shamelessly borrow constantly now -- thank you Merv!)

I want to focus a little bit on ADR in this post because in the midst of our discussion, Merv asked me a really good question about it. He said “if you’re busy doing ADR, and more and more queries come in and more and more users get on board, what will the impact be on performance?”. Excellent point! Quite honestly, no one had ever asked me this before so after our call, I did a little more research and came up with a few more questions on my own. All of which I’d like to discuss here.

First of all, to recap, ADR is the process of adaptively restructuring data both logically (say the structures in RAM for example) and on storage (disk) based on the nature of queries coming in. What does this mean? Well for example if a query is clearly pulling only certain columns from a table (as is typically the case in OLAP), ADR will pull these columns out and optimally lay them out on disk for more efficient access. ADR could also include indexing (as in bitmap, in low-cardinality cases) and any other means at its disposal to optimize questions pertaining to these columns. Sound familiar? It should as this is basically the principle behind columnar systems. However ADR doesn’t stop there.

It may, for example, decide that sharding row blocks is a more efficient strategy given a particular query pattern and set out to do just that as well. It may decide that duplicating certain pieces of information on given disks is more I/O efficient. In memory, it may decide to implement different indexing schemes depending on the nature of the queries. In short, ADR has absolute “carte blanche” to take every means at its disposal to optimize the system in real time.
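To illustrate the general idea (and only the idea), here is a toy store that watches which columns queries ask for and, once a pattern repeats often enough, keeps a narrow copy of just those columns around. This is my own simplified sketch, not the XSPRADA implementation. The AdaptiveStore class, the threshold and the sample rows are all invented for the example.

```python
from collections import Counter


class AdaptiveStore:
    """Toy store that restructures itself based on the column sets queries ask for."""

    def __init__(self, rows, threshold=3):
        self.rows = rows            # base data in row layout: list of dicts
        self.threshold = threshold  # requests needed before we restructure
        self.usage = Counter()      # column set -> number of requests seen
        self.projections = {}       # column set -> narrow copy of the data

    def select(self, columns):
        key = tuple(sorted(columns))
        self.usage[key] += 1
        if key in self.projections:
            return "projection", self.projections[key]
        # No restructured copy yet: read the wide rows and project on the fly.
        data = [tuple(row[c] for c in key) for row in self.rows]
        if self.usage[key] >= self.threshold:
            self.projections[key] = data  # lay the hot columns out separately
        return "base", data


rows = [{"id": i, "region": "west" if i % 2 == 0 else "east", "sales": i * 10}
        for i in range(4)]
store = AdaptiveStore(rows)
sources = [store.select(["region", "sales"])[0] for _ in range(4)]
print(sources)
```

The first three requests read the base rows; by the fourth, the store has restructured itself and answers from the narrow copy. The real thing has far more options than a column projection, but the feedback loop from queries to layout is the same.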

Unlike most other ADBMS out there, the XSPRADA engine is a living breathing entity constantly striving for optimal performance. And it has more than a few tools at its disposal (in other words, not a one-trick pony, compliments of the underlying mathematics) -- But so the question is indeed legitimate: how does this impact performance if at all?

The answer, unsurprisingly, is it depends. First, it’s important to realize the design principle behind ADR. It is called “crowdsourcing”. The philosophy is that the more people hit the database from all angles, the better chance there is of being able to optimize the database. From a technical perspective, overload is always a possibility, as with any other system. If too much is submitted at a given time, the system could theoretically run out of resources and impact performance. But when properly configured (balanced) with proper amounts of storage, the system should not degrade with additional users and queries coming in.

Note that concurrent queries are also beneficial to the system. This is because data streams can be shared between query processes in parallel, thereby reducing disk I/O. Additionally, ADR is not user or query-specific. So there is no risk of one ADR action resulting from Query1 “clobbering” another ADR action resulting from Query2. When sharing is possible, it is implemented and maintained in the mathematical model (algebraic integrity). But all queries from all users enter the “algebraic space” together in one big pool. This means everyone’s actions benefit everyone else. The XSPRADA engine is a very populist one :)
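The stream-sharing point can be sketched in a few lines: two queries that need the same data are fed from a single pass instead of each triggering its own scan. Again, this is a generic illustration under my own assumptions (the shared_scan function, the query names and the sample rows are all hypothetical), not the engine's actual mechanism.

```python
def shared_scan(rows, queries):
    """Answer several filter-and-sum queries with a single pass over the data.

    queries maps a query name to (predicate, column_to_sum).
    Returns the per-query totals and the number of rows read.
    """
    totals = {name: 0 for name in queries}
    rows_read = 0
    for row in rows:              # the data stream is read exactly once...
        rows_read += 1
        for name, (predicate, column) in queries.items():
            if predicate(row):    # ...and every concurrent query gets to see each row
                totals[name] += row[column]
    return totals, rows_read


rows = [{"region": "west" if i % 2 == 0 else "east", "sales": i} for i in range(10)]
queries = {
    "west_sales": (lambda r: r["region"] == "west", "sales"),
    "big_sales": (lambda r: r["sales"] >= 5, "sales"),
}
totals, rows_read = shared_scan(rows, queries)
print(totals, rows_read)
```

Run separately, the two queries would read 20 rows between them; sharing the stream reads 10. Scale that up to terabytes and a few dozen users and the I/O savings get interesting.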

On a completely different topic, I wanted to relate a funny incident that happened to me on the way to the Forum recently. Well okay, it wasn’t exactly a forum per se but rather the newly minted Kickfire on-demand trial process. They call it Cloud-based Trial.

I enthusiastically signed up for the trial and got an assigned time-slice for today. Within minutes, I get the following email from Kickfire’s Karl Van den Bergh, who is Kickfire’s Vice President of Marketing and Business Development (phew! Talk about long names to match long titles).

“Thanks for your interest in Kickfire. Our trialing system is reserved for prospects and partners. As our companies are somewhat competitive we are not able to give access to XSPRADA at this time.”

Wow. Cold, man.

But next morning, I get this email from the company’s trial support team:

“Congratulations! This email confirms that your Kickfire On-Demand Trial will begin at 10:00am (PST) on 07/17/2009.”

Then thirty minutes later I get this:

“jerome, Your reservation has been deleted. Reservation #sc14a5eb4e82d0dc.”

By that time I’m thinking okay, somebody at Kickfire Trial Support finally got their behinds kicked (pun intended) for daring to confirm my trial session. Apologizing for this confusion is Karl again:

"Apologies for the registration mix up this morning - we have a number of system administrators who weren't in synch."

Yeah, I’d say. But wait, there’s more if you order now! Early this morning, after my session was supposed to end, I get a phone call from Kickfire asking me how my trial went! I couldn’t help but blurt out “you guys are really confused”. Then I explained that Karl had branded me persona non grata, at which point I was promptly dropped like a bad case of H1N1.

Now, don’t get me wrong. I love this Kickfire appliance concept. And I have a positive, informative relationship with technical folks there who are nothing if not extremely competent (and nice people to boot). And although I cannot play with the Kickfire appliance either locally or remotely, from what I’ve heard and read, these guys have a top-notch team, great technology, a seasoned CEO and attractive market positioning.

In thinking about this, my initial reaction was that I was stupidly naïve. Honestly, it never occurred to me that Kickfire would ban any “competitor” (or anyone else for that matter) from remotely playing with their software. The reason is that we don’t do this at XSPRADA. As a matter of fact, anyone can pull our bits for free from our website, including Kickfire (which already has). So to me this has a strange, secretive, “we have something to hide” feel which, given what I know about Kickfire, really took me by surprise.

But it's true I have this weird concept about openness that many other vendors share, and I’m happy to put our stuff in competitive hands (after all, the customers certainly will!) anytime and get any feedback from the experience, negative or not. In my experience, few competitors will go out and trash another vendor’s offering, much less try to reverse-engineer it (at least not in this industry). Maybe I’m foolish.

However, it’s true I don’t have a “marketing” bone in my body. As an engineer and evangelist, I’ve always been adamant about transparency, peer review and feedback. So maybe it’s a good thing I don’t handle XSPRADA Marketing, or Kickfire’s for that matter! :)

Thursday, July 2, 2009

Of Views, Cubes, Patents and Haystacks

In keeping with a past commitment to do so, I plan on staying purely technical in this posting. The first topic I want to bring up relates to database views. The second one deals with OLAP “cubing”. I want to discuss those in the context of the XSPRADA analytical engine RDM/x.

Many conventional database engines support views and materialized views. I believe Oracle actually came up with the concept (I am not 100% sure of that), but numerous other vendors support them, including Postgres, MySQL, DB2, SQL Server, etc. Either way, the idea is simple. A view is basically a logical rendition of a SQL query. It looks and behaves like a table. Views can be logical or materialized. A logical view is really just a “pointer” to the actual data. A materialized view actually contains data and is implemented as a full-fledged database object. There are several reasons to use views.

One is performance enhancement and convenience (caching a nasty JOIN or OLAP summaries dynamically, meaning original data changes trigger recalculations). Another is abstraction. A third is security and access control (you can grant specific privileges to views while keeping underlying data protected from users). The important thing to remember is that views were conceived in response to a problem, namely that database performance can really suck without them (especially in warehousing applications).

If you didn’t have inherent performance issues in conventional database engines, you wouldn’t need views to begin with. And herein lies the reason why the XSPRADA engine doesn’t support classical views: it simply doesn’t need them. The engine caches and materializes query results dynamically on the fly. As such, “materialized views” are generated on an as-needed basis. But this is true for everything else going on inside the XSPRADA engine because the adaptive data restructuring (ADR) feature dictates that all forms of optimizations be materialized for optimal logical and I/O performance on the fly as queries come in.

Who determines when this happens? The optimizer and the algebraic engine (now officially known as XSPRADA ALGEBRAIX under US Patent 11/383,477). If and when the system detects more than one query involving a particular set of joins, it materializes (and likely caches) the results for subsequent use, assuming that a request pattern has emerged. Similarly, it is always possible to “store” a complex query into a named result set using the INTO extension. For example, you can write something like:

SELECT p.product_name, s.store_city, SUM(f.store_sales) tots FROM sales_facts f
JOIN product p ON p.product_id = f.product_id
JOIN store s ON s.store_id = f.store_id
GROUP BY s.store_city, p.product_name
ORDER BY tots DESC INTO ds;

At that point, the ‘ds’ table will contain the results and can be used independently and updated at will. Currently, the XSPRADA engine will not automatically update ‘ds’ should any change occur to any of the underlying tables. Regardless, ‘ds’ now becomes a regular RDM/x table like any other. There is no distinction between views and tables in RDM/x. And even though views are defined in RDM/x, this is done for sheer syntactic convenience, and using them yields no technical advantage.
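For readers without RDM/x at hand, the “stored result does not follow its source tables” behavior can be reproduced in any SQL engine. The sketch below uses Python's built-in sqlite3 module, with the standard CREATE TABLE ... AS SELECT standing in for the INTO extension (SQLite has no INTO clause of that kind). The schema is cut down and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store (store_id INTEGER PRIMARY KEY, store_city TEXT);
    CREATE TABLE sales_facts (store_id INTEGER, store_sales REAL);
    INSERT INTO store VALUES (1, 'Austin'), (2, 'Boston');
    INSERT INTO sales_facts VALUES (1, 10.0), (1, 5.0), (2, 7.5);

    -- Stand-in for "... INTO ds": keep the result as a regular table.
    CREATE TABLE ds AS
        SELECT s.store_city, SUM(f.store_sales) AS tots
        FROM sales_facts f
        JOIN store s ON s.store_id = f.store_id
        GROUP BY s.store_city;
""")

# Change the underlying detail data after 'ds' has been created.
conn.execute("INSERT INTO sales_facts VALUES (2, 100.0)")

# 'ds' still holds the totals as of creation time...
snapshot = dict(conn.execute("SELECT store_city, tots FROM ds"))

# ...while re-running the original query sees the new row.
live = dict(conn.execute("""
    SELECT s.store_city, SUM(f.store_sales)
    FROM sales_facts f
    JOIN store s ON s.store_id = f.store_id
    GROUP BY s.store_city
"""))

print(snapshot)
print(live)
conn.close()
```

The Boston total in `snapshot` stays at its original value while `live` picks up the extra 100.0. That is the price of treating a stored result as just another table.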

On the OLAP front, the system behaves in a similar way. This is why I always say, with RDM/x you don’t need to pre-structure or “cube” your data. The engine does that automatically internally for the user based on his/her query patterns!

When you feed RDM/x a query like this:

SELECT s.store_city, SUM(f.total_sales) FROM store s, sales_fact f
WHERE s.store_id = f.store_id
GROUP BY s.store_city

RDM/x automatically starts building a “cube” for it. What does that mean? The system creates a set of mathematical expressions relating the detail data to the aggregates you’ve just requested. In the process, instances of the data in intermediate and final form are realized. This city-based slice is “remembered” by the system and any subsequent query against that slice or variation thereof will benefit from immense optimization (and performance increase). This is how subsequent queries just get faster and faster in the XSPRADA system. And if you slice along another dimension next, say products, for example, the same optimization and ADR is performed for that slice. If you ask enough questions involving enough dimensions, you get closer and closer to instantaneous response time. Adding a dimension may cause a slight performance dip on the first shot, but the penalty is nowhere near as painful as having to add another slice or recompute aggregates (cells) in a conventional “manual cubing” system (say like Mondrian or SSAS to name a few).
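As a rough mental model only, the “remembered slice” behaves like a cache keyed by dimension: the first question along a dimension pays for a scan, and later questions along the same dimension do not. The toy below is plain memoization. It is not the algebraic machinery RDM/x uses, and the `SliceCache` class and its sample rows are hypothetical.

```python
from collections import defaultdict


class SliceCache:
    """Toy model: remember each aggregate slice after its first computation."""

    def __init__(self, rows):
        self.rows = rows    # detail data: a list of dicts
        self.slices = {}    # (dimension, measure) -> {value: total}
        self.scans = 0      # how many times the detail data was scanned

    def total_by(self, dimension, measure="store_sales"):
        key = (dimension, measure)
        if key not in self.slices:
            # First request along this dimension: scan and remember.
            self.scans += 1
            totals = defaultdict(float)
            for row in self.rows:
                totals[row[dimension]] += row[measure]
            self.slices[key] = dict(totals)
        return self.slices[key]


rows = [
    {"store_city": "Austin", "product_name": "widget", "store_sales": 10.0},
    {"store_city": "Austin", "product_name": "gadget", "store_sales": 5.0},
    {"store_city": "Boston", "product_name": "widget", "store_sales": 7.5},
]
cache = SliceCache(rows)
cache.total_by("store_city")      # first city query: one scan
cache.total_by("store_city")      # same slice again: no scan
cache.total_by("product_name")    # new dimension: one more scan
print(cache.scans, cache.total_by("store_city"))
```

Three different questions cost two scans here. The more dimensions have already been asked about, the fewer scans are left to pay for.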

People have asked (namely Chris Webb, an MDX guru who runs an excellent blog) if there is some sort of internal structure that makes the query run faster over time. The answer is that the re-structuring of the data, combined with the algebraic system and the methods for accessing the data internally (in memory and storage), are what make these queries get faster with time. Another excellent question was “can you edit these structures?” And the answer is no, as the system is entirely automated and requires no user intervention. Importantly, DML operations (INSERT, UPDATE, DELETE) on the original detail data cause automatic updates of the aggregated data, as long as the aggregates continue to be expressed in terms of the original data. Simply put, the system maintains internal integrity automatically unless, of course, you change the original query.
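One generic way to keep aggregates in step with DML is to apply each change to the running totals as it happens, rather than recomputing from scratch. The sketch below shows that textbook technique for a SUM grouped by city. I am not claiming this is what happens inside RDM/x, and the `MaintainedTotals` class is invented for the example.

```python
from collections import defaultdict


class MaintainedTotals:
    """Detail rows plus a per-city SUM that is adjusted on every change."""

    def __init__(self):
        self.detail = {}                  # row_id -> (city, amount)
        self.totals = defaultdict(float)  # city -> running sum

    def insert(self, row_id, city, amount):
        self.detail[row_id] = (city, amount)
        self.totals[city] += amount

    def delete(self, row_id):
        city, amount = self.detail.pop(row_id)
        self.totals[city] -= amount

    def update(self, row_id, amount):
        # An update is a delete of the old value plus an insert of the new one.
        city, _ = self.detail[row_id]
        self.delete(row_id)
        self.insert(row_id, city, amount)


sales = MaintainedTotals()
sales.insert(1, "Austin", 10.0)
sales.insert(2, "Austin", 5.0)
sales.insert(3, "Boston", 7.5)
sales.update(2, 8.0)   # Austin: 15.0 -> 18.0
sales.delete(1)        # Austin: 18.0 -> 8.0
print(dict(sales.totals))
```

This only works because the total is defined purely in terms of the detail rows, which is the same condition stated above for the automatic updates.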

MDX and dynamic “view” updates are features likely to be implemented by RDM/x in the near future. Without MDX, you lack the convenience of being able to talk to RDM/x directly from clients such as Excel (or SAP for that matter, who has apparently adopted MDX now), although you can clearly question RDM/x via Excel using OLAP queries. MDX is obviously more powerful and convenient. This doesn’t take away from the internal “magic” occurring inside RDM/x when handling OLAP workloads.

And I have just a few more lines to make one last point. Technically speaking, RDM/x is currently heavily biased in favor of OLAP workloads. This is not a consequence of the technology being used per se but rather reflects marketing priorities. I sometimes meet people who try out our product to do search. For example, they’ll try to grab a few select rows from 100,000,000 rows of data. This is not where RDM/x currently shines! I always tell them: if you want to do this, you’re better off using a conventional OLTP system like SQL Server or Oracle. RDM/x is not about finding the proverbial needle in the haystack. It’s about telling you what the haystack looks like.

Monday, June 29, 2009

ParAccel's Big Ad-Venture

I was just about to write a piece about the sad state of venture funding in the analytical database and BI space over the past years when BAM! Out of nowhere comes the announcement that ParAccel just scored $22M in a C-round investment. More importantly, as Merv Adrian points out, “ParAccel’s previous investors participated as well.”

This last bit of information is even more telling than their ability to raise new money in this zero-point-zero economy. Because if you look at past events surrounding Dataupia (and even Lucidera, in a way), it’s clear to me that there is a lot of “investor fatigue” in the BI industry at the moment. In my book, this is partly due to the fact that a lot, if not most, of these investors had no clue about the data management market or BI as a whole (but at the time, it sounded cool, and the guys next door were touting it so…) or how much it takes to actually build a full-fledged high-performance analytical engine from scratch (which is about 200,000 man-hours just on the engineering and maybe $20-60M if you’re lucky). So of course when the going gets tough, these guys get jittery and bail out. No surprise there.

My prediction is that, by year end, there’s going to be a lot of bodies left “on the carpet” as we say in French (I think it's an old boxing term). ParAccel’s new money, combined with their recent marketing coup involving a 30TB TPC-H benchmark, will surely help them survive these tough times provided they don’t squander the funds and from what I’ve seen so far, these guys tread lightly and wisely.

Not to say this makes the BI capital markets suddenly look better. The speculation “out there” (meaning in the “connected” unofficial circles) is that the average VC returns for the 2000-2009 decade (which won’t be reported before April 2010) will likely be close to zero or negative. The spectacular returns achieved in 1999 will drop out. Clearly VCs would not take kindly to this information being published and the NVCA has not made it public. Nevertheless if you follow VC performance sites like MoneyTree and the NVCA you can see the writing on the wall and it isn’t pretty. There are a lot of angel and VC investors looking for a way out nowadays at any cost. And to paraphrase Dr. Evil, when this happens, “people DIE!”

According to this article, venture capital is getting depleted. It’s hard to raise serious cash these days, imagine that! I’ll go one further and speculate that the venture capital system as we know it in the US may not be around another ten years. At least not in its present form. By then, the major source of funding will likely be big government (European-style) as risk will become more and more demonized and “regulated” (and you can’t regulate risk by definition, or it isn’t risk!). But what do I know? I’m not a VC.

For full-service SEs, this product is for you.

I have not written specifically about my SE job before here at XSPRADA because I try to keep this blog fairly technical, or at least directly related to database and BI issues, but I want to make an exception this time. Reason being I recently discovered a very interesting SaaS offering called TeamSupport that makes my job easier. And I figure, if it does that for me, chances are it can also do it for others in the profession. So I felt compelled to share the wealth.

First, to put things in context, I discovered TeamSupport completely by chance interacting with their COO/VP Sales Eric Harrington on a LinkedIn group. Eric offered to give a quick and dirty online demo and five minutes later, he produced. That in itself was impressive to me.

TeamSupport is a cross between an issue/bug management tracking system (not unlike JIRA, FogBugz, Team System or Bugzilla, all of which I have used in the past) and a CRM system (not unlike Salesforce.com or RightNow.com), although they don’t bill themselves as being CRM per se. The TeamSupport pitch is about “bridging the gap” between customer service/support, product development, engineering and QA. Given the nature of my work as a full-service SE, this seemed like a pretty compelling tool to me.

Now, typical sales engineers in larger shops ride in tandem with Sales people (AEs) on most accounts and are mandated with “greasing the rails” of sales, as I like to put it. But I typically don’t go out with AEs (nothing personal, we just don’t have any) and always pretty much face clients and prospects (both technical and executive) on my own. By necessity, I have a very close and tight relationship with our engineering group, and can typically remember specific tracking issues by JIRA number. Similarly, as I also handle a lot of the sales/support side of things, I’m often up on Salesforce.com managing and supporting accounts, and mining opportunities and what have you.

As great as JIRA is and as useful as Salesforce can be, they don’t play nicely together out of the box for this purpose. TeamSupport integrates functionality from both sides of the house. As it is on-demand software, I was able to create an account and log on in no time. I just love the simple clean UX of this product. On the left pane menu are all the entities I can create, edit and manage such as issues, features, tasks, bugs, users, customers and products. In the middle is the workspace corresponding to the selected menu item. But make no mistake this software is rich, rich, rich.

So, for example, I use Features to enter feature requests from users’ and prospects’ wish lists. If I click on Features, I immediately see my list by ticket number. I can therefore track those with Engineering but also product management (as in, when can we accomplish this and should we?) and give customers feedback on progress (or at least estimates). You can associate Features with one or more customers. This is useful when more than one customer makes a similar request (guess what, that’s common).

I use Tasks to keep track of what I need to handle or resolve on a daily basis. Those too can be associated with multiple customers. That’s cool. You can subscribe to tasks and get email notifications when they are modified by other users (TeamSupport is free for up to three users, by the way).

In the Bugs section, I can enter pertinent items from our internal JIRA tracking system. I can assign or link those to various corporate groups like Engineering, QA or Sales Engineering.

In the Knowledge section, I like to enter resolutions to past issues or problems. These are likely to come up for other customers or prospects so it’s a great way to keep track of those for future reference.

In Customers, I entered all my current customers or prospects. I don’t differentiate on that. Paying customer or evaluating customer, whether you’ve paid your money or are just kicking the tires, you get the same high level of service from me. Priority to existing customers, clearly, but same level of service.

In Products, I can enter all versions of existing product lines and link those with customers and prospects (as in who bought what or who is currently trying what version and which maintenance release). You can of course track issues and features by product. We do maintenance releases fairly often so this is a great way for me to track critical feature enhancements/bug fixes on a per-version level.

If I click on Dashboard, I get an immediate 30,000-foot picture of where I stand at this point in time. I can then drill into my tickets or customers at will. I now have all I need on hand to help me cover everything from the most minute technical details to the most important pain point gleaned from my last interaction with a prospect.

TeamSupport can also ingest our JIRA database. All you have to do is export your database and they help you get it loaded. This is great on the ticketing side. They are also working on a future API to do this automatically. On the CRM side, they are integrating with Salesforce as well, and I will be beta-testing that effort shortly to bring in accounts from that side into TeamSupport.

Last but not least, TeamSupport lets you set up support portals for each customer. This is important (and unique) for several reasons. First, it makes us look a lot more polished than just handling everything via email (or Twitter). Second, it creates a history of interaction that can be mined and referenced at will (this in itself is very valuable business intelligence!). Third, it facilitates a push model of interaction with the customer because each side gets change notifications. This means I can respond in real time to questions and issues. I like real time.

As with most great software, it’s sort of hard to do it justice in a quick write-up. You kind of just “get it” as soon as you start using it. It’s elegant, simple, fast, and easy to use. It just flows and it’s intuitive. I actually enjoy using the darn thing! Go figure. I forget what the exact subscription costs are but it’s dirt cheap (ahem, I mean cost-effective) considering the value provided out of the box. I should conclude by saying that I am in no way connected to this company. I had never heard of them before last month, and have zero affiliations with anyone there. I can tell you they built a really valuable product if you’re in technical sales. Did I mention their support is stellar? Enough said: if you want to try this product out, shoot my friend Eric a quick email or twit him up at TeamSupport. Tell him I sent you :)