Monday, June 29, 2009

ParAccel's Big Ad-Venture

I was just about to write a piece about the sad state of venture funding in the analytical database and BI space over the past years when BAM! Out of nowhere comes the announcement that ParAccel just scored $22M in a C-round investment. More importantly, as Merv Adrian points out, “ParAccel’s previous investors participated as well.”

This last bit of information is even more telling than their ability to raise new money in this zero-point-zero economy. Because if you look at past events surrounding Dataupia (and even Lucidera, in a way), it’s clear to me that there is a lot of “investor fatigue” in the BI industry at the moment. In my book, this is partly due to the fact that a lot, if not most, of these investors had no clue about the data management market or BI as a whole (but at the time, it sounded cool, and the guys next door were touting it so…) or how much it takes to actually build a full-fledged high-performance analytical engine from scratch (which is about 200,000 man-hours just on the engineering and maybe $20-60M if you’re lucky). So of course when the going gets tough, these guys get jittery and bail out. No surprise there.

My prediction is that, by year end, there’s going to be a lot of bodies left “on the carpet” as we say in French (I think it's an old boxing term). ParAccel’s new money, combined with their recent marketing coup involving a 30TB TPC-H benchmark, will surely help them survive these tough times, provided they don’t squander the funds. And from what I’ve seen so far, these guys tread lightly and wisely.

That’s not to say this makes the BI capital markets suddenly look better. The speculation “out there” (meaning in the “connected” unofficial circles) is that the average VC returns for the 2000-2009 decade (which won’t be reported before April 2010) will likely be close to zero or negative. The spectacular returns achieved in 1999 will drop out. Clearly, VCs would not take kindly to this information being published, and the NVCA has not made it public. Nevertheless, if you follow VC performance sites like MoneyTree and the NVCA you can see the writing on the wall, and it isn’t pretty. There are a lot of angel and VC investors looking for a way out nowadays at any cost. And to paraphrase Dr. Evil, when this happens, “people DIE!”

According to this article, venture capital is getting depleted. It’s hard to raise serious cash these days, imagine that! I’ll go one further and speculate that the venture capital system as we know it in the US may not be around in another ten years. At least not in its present form. By then, the major source of funding will likely be big government (European-style) as risk becomes more and more demonized and “regulated” (and you can’t regulate risk by definition, or it isn’t risk!). But what do I know? I’m not a VC.

Full-Service SEs, This Product Is for You

I have not written specifically about my SE job before here at XSPRADA because I try to keep this blog fairly technical, or at least directly related to database and BI issues, but I want to make an exception this time. Reason being I recently discovered a very interesting SaaS offering called TeamSupport that makes my job easier. And I figure, if it does that for me, chances are it can also do it for others in the profession. So I felt compelled to share the wealth.

First, to put things in context, I discovered TeamSupport completely by chance interacting with their COO/VP Sales Eric Harrington on a LinkedIn group. Eric offered to give a quick and dirty online demo and five minutes later, he produced. That in itself was impressive to me.

TeamSupport is a cross between an issue/bug tracking system (not unlike JIRA, FogBugz, Team System or Bugzilla, all of which I have used in the past) and a CRM system, although they don’t bill themselves as CRM per se. The TeamSupport pitch is about “bridging the gap” between customer service/support, product development, engineering and QA. Given the nature of my work as a full-service SE, this seemed like a pretty compelling tool to me.

Now, typical sales engineers in larger shops ride in tandem with Sales people (AEs) on most accounts and are mandated with “greasing the rails” of sales, as I like to put it. But I typically don’t go out with AEs (nothing personal, we just don’t have any) and always pretty much face clients and prospects (both technical and executive) on my own. By necessity, I have a very close and tight relationship with our engineering group, and can typically remember specific tracking issues by JIRA number. Similarly, as I also handle a lot of the sales/support side of things, I’m often up on managing and supporting accounts, and mining opportunities and what have you.

As great as JIRA is and as useful as Salesforce can be, they don’t play nicely together out of the box for this purpose. TeamSupport integrates functionality from both sides of the house. As it is on-demand software, I was able to create an account and log on in no time. I just love the simple clean UX of this product. On the left pane menu are all the entities I can create, edit and manage such as issues, features, tasks, bugs, users, customers and products. In the middle is the workspace corresponding to the selected menu item. But make no mistake this software is rich, rich, rich.

So, for example, I use Features to enter feature requests from user and prospects’ wish lists. If I click on Features, I immediately see my list by ticket number. I can therefore track those with Engineering but also product management (as in, when can we accomplish this and should we?) and give customers feedback on progress (or at least estimates). You can associate Features with one or more customers. This is useful when more than one customer makes a similar request (guess what, that’s common).

I use Tasks to keep track of what I need to handle or resolve on a daily basis. Those too can be associated with multiple customers. That’s cool. You can subscribe to tasks and get email notifications when they are modified by other users (TeamSupport is free for up to three users, by the way).

In the Bugs section, I can enter pertinent items from our internal JIRA tracking system. I can assign or link those to various corporate groups like Engineering, QA or Sales Engineering.

In the Knowledge section, I like to enter resolutions to past issues or problems. These are likely to come up for other customers or prospects so it’s a great way to keep track of those for future reference.

In Customers, I entered all my current customers and prospects. I don’t differentiate on that. Paying customer or evaluating customer, whether you’ve paid your money or are just kicking the tires, you get the same high level of service from me. Existing customers get priority, clearly, but the level of service is the same.

In Products, I can enter all versions of existing product lines and link those with customers and prospects (as in who bought what or who is currently trying what version and which maintenance release). You can of course track issues and features by product. We do maintenance releases fairly often so this is a great way for me to track critical feature enhancements/bug fixes on a per-version level.

If I click on Dashboard, I get an immediate 30,000 foot picture of where I’m at this point in time. I can then drill into my tickets or customers at will. I now have all I need on hand to help me cover everything from the most minute technical details to the most important pain point gleaned from my last interaction with a prospect.

TeamSupport can also ingest our JIRA database. All you have to do is export your database and they help you get it loaded. This is great on the ticketing side. They are also working on a future API to do this automatically. On the CRM side, they are integrating with Salesforce as well, and I will be beta-testing that effort shortly to bring in accounts from that side into TeamSupport.

Last but not least, TeamSupport lets you set up support portals for each customer. This is important (and unique) for several reasons. First, it makes us look a lot more polished than just handling everything via email (or Twitter). Second, it creates a history of interaction that can be mined and referenced at will (this in itself is very valuable business intelligence!). Third, it facilitates a push model of interaction with the customer because each side gets change notifications. This means I can respond in real time to questions and issues. I like real time.

As with most great software, it’s sort of hard to do it justice in a quick write-up. You kind of just “get it” as soon as you start using it. It’s elegant, simple, fast, and easy to use. It just flows and it’s intuitive. I actually enjoy using the darn thing! Go figure. I forget what the exact subscription costs are but it’s dirt cheap (ahem, I mean cost-effective) considering the value provided out of the box. I should conclude by saying that I am in no way connected to this company. I had never heard of them before last month, and I have zero affiliations with anyone there. I can tell you they built a really valuable product if you’re in technical sales. Did I mention their support is stellar? Enough said: if you want to try this product out, shoot my friend Eric a quick email or twit him up at TeamSupport. Tell him I sent you :)

Thursday, June 25, 2009

Pawn to King IV

I recently read (sorry, it's in French) that SAP and Jive had partnered up to offer BI in an Enterprise 2.0 context (aka enterprise social networking). The same happened in France with Dassault Systèmes and blueKiwi Software. I’ve also been following the new “BI for the masses” trend exemplified by offerings like PushBI and RoamBI. Then several weeks ago I saw the Google Wave presentation at I/O and thought to myself: geez, wouldn’t it be cool if people could share and collaborate using data wavelets under management from a BI Wave “bot” dedicated to a specific community of users?

I think BI is on its way to becoming commoditized. I see it moving up from the technologists directly into the hands (and mobile devices) of the users. Years from now, people will look back on monolithic enterprise data warehouses and their infrastructures and wonder “how could people ever have lived like this?”

I think it’s really crucial for progressive companies in the BI space to ask themselves what their users will look like ten or twenty years down the line. It’s obviously important to understand current users, but doing so is relatively easy. Anticipating what today’s young people will purchase in 10-20 years as senior corporate buyers (and decision makers) is not so straightforward. But as today’s “kids” are the people you’ll be selling corporate BI to in the future, doesn’t it make sense to (1) analyze their mindset and (2) start reaching out to them now?

Those of you who read this blog regularly know that I worship Guy Kawasaki. In his book Reality Check, there’s a chapter called “Get a Clue: The Global Youth Market”. In it, he interviews Kathleen Gasperini, the cofounder of Label Networks. These folks analyze “global youth culture” for major corporations. It’s a fascinating read (as is most of the book).

You’re thinking: great, but what does selling Nike shoes or Levi Strauss jeans to kids have to do with pitching enterprise software to business intelligence users? I say a lot. Because if you understand the behavior and expectations of upcoming generations of BI buyers, you’ll understand your future BI customer and gain an unfair competitive advantage in the process. Now, I don’t purport to understand current youth behavioral traits, but I do make the following subjective observations based on past experience.

Instant gratification. Immediacy has become a birthright. Expectations of “just-in-time” are prevalent in everything they do, purchase or share. I don’t think these people are the kind who will sit around the office waiting six months for a $4M BI project and associated resources to get provisioned, staffed, configured, and maybe then approved. These folks are going to want something up and running in days. Anyone not operating in the same timeframes will be left in the dust.

Social consciousness. Business is business, but “good” business wins points. A “good” business does not exist exclusively for pecuniary purposes. Making meaning (as Guy K. would say), as opposed to just making a quick buck, will matter. Being “green” will matter (not sure what that means but it’s a hot button). I know this sounds awfully naïve, but young people are experiencing the ability to “make a difference in the world” (something past generations may have missed out on) and respect those who strive to do so. They’ll take that to the corporate world as well. Help them do so. They mean it.

24/7 Connectivity. Young people are constantly connected to the internet “matrix”. It’s a 24/7 world for them. Like an addiction, this isn’t a habit you casually kick with age. Accessing future BI buyers outside the realm of the “matrix” will be futile. Accessing or supporting users during “regular business hours” will get you laughed out of the market. Not reaching out to or monitoring social networks will be foolish at best. The thought of instant, always-on BI following users around on mobile devices 24/7 may make some people laugh, but this is how young people already live. There’s no reason they’ll leave this behind in the context of “the office” in years to come.

Do-it-yourself is the new mantra. Self-empowerment is alive and well. As Bob Marley used to sing, “when one door is closed, don’t you know, another is open”. Unless you provide the tools and infrastructure to “DIY” you probably won’t get much traction (we see this happening now with open source already). Empower users because they’ll be savvy and used to it.

The current economic turmoil and its endless revelations, combined with the technological advances that have shifted the way young people see, communicate with, and gauge the world around them, will translate into a new type of enterprise BI buyer in the coming decades. If I were running a BI company, I’d want to bring the “good word” into the schools now, and start cultivating my future clients in the dorms, in the research centers, in the classrooms and on their mobile devices.

I could be wrong on this, but my money is on the bet that the upcoming generation of BI buyers is an entirely different animal. In this business, like chess, it pays to think several moves ahead.

Wednesday, June 24, 2009

Leave the Gun, Take the Cannoli

I grew up in New York City and spent enough time in Jersey back East to have witnessed a couple interesting brawls in my life (I even had neighbors who dug holes for a living or worked in waste management) but it’s been a while since I’ve seen anything like the recent scuffle among industry analysts and vendors regarding the recently published ParAccel TPC-H benchmark. Maron!

It all started innocently enough two days ago when Merv Adrian, BI industry analyst emeritus, published the news in his blog titled “ParAccel Rocks the TPC-H – Will See Added Momentum”.

Now, it’s not every day that a vendor publishes audited TPC-H benchmarks (“audited” being the key word, as the process runs around $100K from what I understand). Very few companies besides the Big Three have the deep pockets and technology prowess to accomplish that. Furthermore, ParAccel did its benchmark based on 30TB which isn’t exactly a small chunk of data. And so Merv made the point that, at the very least, the news should certainly help put ParAccel on the map. To quote him: “This is a coup for ParAccel, whose timing turns out to be impeccable”.

Immediately, this was picked up by none other than Curt Monash, BI analyst to the stars (and I say that quite seriously), who happens to despise the very concept of TPC benchmarks for reasons he clearly outlines in a recent post entitled “The TPC-H benchmark is a blight upon the industry”. To pull a couple of money-quotes from the site:

“...the TPC-H is irrelevant to judging an analytic DBMS’ real world performance.”

“In my opinion, this independent yardstick [the TPC-H] is too warped to be worth the trouble of measuring with.”

“I was suggesting that buyers don’t pay the TPC-H much heed. (CAM)”

“TPC-Hs waste hours of my time every year. I generally am scathing whenever they come up.”

Now, notwithstanding the TPC-H issues, I think Curt will concede that he doesn’t particularly appreciate or trust ParAccel as a company either, as the following statements show:

“I would not advise anybody to consider ParAccel’s product, for any use, except after a proof-of-concept in which ParAccel was not given the time and opportunity to perform extensive off-site tuning. I tend to feel that way about all analytic DBMS, but it’s a particular concern in the case of ParAccel.”

“I’d categorically advise against including ParAccel on a short list unless the company confirms it is willing to do a POC at the prospect’s location.”

“The system built and run in that benchmark — as in almost all TPC-Hs — is ludicrous. Hence it should be of interest only to ludicrously spendthrift organizations.”

“Based on past experience, I’d be very skeptical of ParAccel’s competitive claims, even more than I would be of most other vendors’.”

The combination of published TPC benchmarks and the originator of the benchmark seem to have created what Curt himself refers to as “the perfect storm”. To say he doesn’t like either would be a gross understatement :)

Both blogs immediately started getting “opinionated” comments from the public at large, including ParAccel’s VP of Marketing Kim Stanick and a gentleman named Richard Gostanian who may or may not be connected to Sun Microsystems (depending on which Twits you read). Sun supplied the hardware for the ParAccel benchmark. To cite a couple quotes from the comments, Richard Gostanian responds:

“Perusing your website, I detect a certain hostility towards ParAccel.” – (No kidding!)

“Indeed you do more to harm your own credibility than raise doubts about ParAccel.”

“…TPC-H is the only industry standard, objective, benchmark that attempts to measure the performance, and price-performance, of combined hardware and software solutions for data warehousing.”

“So Curt, pray tell, if ParAccel’s 30 TB result wasn’t “much of an accomplishment”, how is it that no other vendor has published anything even remotely close?”

Then Kim Stanick says: “It [TPC-H] is the most credible general benchmark to-date.”

And an anonymous reader chimes in:

“After reading Curt’s post about ParAccel and Kim this is obviously personal…I wonder why the little fella didn’t have a fit over Oracles 1TB TPC-H? Check his bio. He consults for Oracle.”

To which Curt replies (among other things):

“As for your question as to why other vendors don’t do TPC-Hs — perhaps they’re too busy doing POCs for real customers and prospects to bother.”

Ouch! This nasty sudden melee took me by surprise at a time when I was considering blogging about the whole TPC-H system for analytical engines anyway. I’ve wondered for quite a while whether or not publishing such metrics actually helped “new breed” startups like ourselves from a marketing and sales standpoint. Given the high cost and resource drain, what’s the return on this investment? What’s more, I have yet to meet a prospect or user who either cares or knows about TPC-H benchmarks. So far, the only people I’ve ever seen show any interest in the matter are venture capitalists and investors, which tells me right there that something is amiss (or maybe that’s why the small players take the plunge, I don’t know).

As some of you may know, XSPRADA is a recent member of the TPC, alongside other industry startups like Kickfire, Vertica, ParAccel and Greenplum. Numerous other startups in the same category are not members, and they don’t seem to fare any worse. Furthermore, as best I can tell, even some existing members (namely Greenplum and Vertica) don’t publish audited benchmarks. Yet clearly these two vendors don’t seem negatively affected by the lack thereof.

Although we at XSPRADA have conducted TPC-H benchmarks (and continue to do so) internally, we have never attempted to get them audited and published. If a prospect asked me about it, I would recommend we help him run those benchmarks in-house on his own hardware anyway! Even if we had $100K to blow on getting audited benchmarks, I’m not sure it would make sense to pursue.

I’m usually a pretty opinionated black & white guy, but with respect to this TPC-H business, I tend to centerline. Strangely enough, I identify with both sides of the argument. On the one hand, I don’t believe the benchmarks to be totally useless. Having been involved in generating our internal results, I can vouch for the fact that it takes a lot of tedious work and kick-ass engineering to even complete the list. By no stretch of the imagination is this a small, inconsequential feat. Doing so on anything above 10TB is, in my opinion, nothing to sneeze at. If nothing else, being able to handle the SQL for all 22 queries is a decent achievement. And then of course, there’s the notion that even “trying” to do it is noble in itself. In that sense I tip my hat to the small guys who pulled it off.

On the other hand, I don’t feel the benchmarks are holistically useful for evaluation purposes. As a prospect looking at several vendors, they might figure in my check-list but not more significantly than others I consider more important. Namely: how easy is the product to work with, what resources does it consume (human and metal), how does it play in the BI ecosystem as a whole (connectivity), and last but not least, what kind of support and viability will the vendor provide? I’m a little weird that way. I tend to evaluate companies based on their people over most everything else. But that’s just me.

At the end of the day (and everyone does seem to agree on that), what matters are onsite POCs. Nothing can beat running your own data on your own metal. I want a vendor to hand me the keys and go “ok, have a good ride, call me if you need anything” and mean it. BMW sells cars this way. Enough said. This is what I drive toward when helping people evaluate our offering.

It remains to be seen how much of this brouhaha will benefit ParAccel in the long run. They say there’s no such thing as bad publicity. If they end up getting recognition and sales from it, then they have chosen wisely, and no one can take that away from them. Personally, I wish them the best. I believe the more numerous we are in this upstart game, the better it is for us, and more importantly, for our customers. So I say leave the guns, and take the cannoli.

Tuesday, June 23, 2009

In-House or SaaS? How About Both?

Chuck Hollis just penned another interesting blog post about the economics behind private clouds for the enterprise. There is a lot of talk about on-premise versus on-demand SaaS these days in the BI community (and when I say SaaS I mean either private or public).

From a financial standpoint, the two models are fairly well established. Basically, on-premise is budgeted as capital expenditure, while on-demand is budgeted as operational expenditure. Much like the difference between buying your TV and paying for your electric bill on a monthly basis, or the difference between buying and leasing a car.

With on-premise you buy a lot of expensive stuff and it depreciates (and loses value) over time. With on-demand you rent a service for a short critical amount of time as needed. In many ways, the parallels are strikingly close to hiring in-house software developers versus outsourcing to consultants. The business case for either direction is easily conceived.

In-house developers are a long-term investment. They will learn the business and be allocated as needed on a per-project basis. The project is likely to be long-term. Much like capital equipment, they also need to be “upgraded” periodically – namely allowed and encouraged to keep up with technology so they remain productive and far-sighted. They also need to be provisioned with tools and resources to do their jobs effectively. The most progressive shops understand that. Not unlike expensive heavy metal and software licenses, they also tend to get worn out or outdated with time. And they’re expensive to replace and renew.

Consultants (either remote or in-house) are a quick-fix solution, usually applied to a pressing problem when in-house resources are either not available or not competent to handle the pressing business need. They’re expensive, but they (hopefully) get the job done quickly, get you the answers you need when you need them, and ride out into the sunset. Like on-demand offerings, they can be shut down at will, but they also carry vendor lock-in risk.

Having been in the software business for two decades, I’ve been on both sides of this fence numerous times. In my experience, the best shops implement a hybrid approach with strong internal cores supplemented as needed by top-notch “gun slingers”. In the best of cases, I’ve seen synergy and knowledge transfer occur between the two entities (when the politics were right) with significant benefit to the enterprise.

I think the same thing will probably occur in the BI space. I would imagine large shops will probably have both in-house staff and equipment, backed up by quickly ramped up SaaS offerings dedicated to what I call “transient data mart needs”. I could be wrong about this, but hybrid approaches (in business and technology) are usually more flexible and not necessarily conflicting. They can also feed off each other in positive ways.

I don’t believe the choices are going to be 100% on-premise or 100% on-demand. The trick for the CIOs out there is going to be determining which projects and which needs are better served by internal (strategic) or external (tactical) solutions in an agile way. In that sense there is really no “battle” between the two approaches. They should be considered complementary parts of an intelligent BI strategy toolbox.

This ACID Leaves No Bitter Taste

In my previous post below I received a comment (question) from Swany about inserts in the XSPRADA database engine RDM/x. Specifically, he (or she) asked:

“What happens if there is an error during the SELECT .. INTO? Are such inserts ACID or will I get partial data in a table if the system crashes?”

This is of course an excellent question, and I thought it was worth addressing on a wider level beyond just incremental loads. To place this in context, recall that the ACID properties are a set of rules pertaining to transactional database management systems: Atomicity, Consistency, Isolation and Durability. If a transactional database does not meet these conditions, it is not considered “reliable”. I won’t bore the reader with yet another ACID definition; Wikipedia has a reasonable description.

Now the first question of course is whether analytical database engines supporting OLAP style work can or should meet the same criteria as classical transactional OLTP systems. By definition, analytical systems are biased for read access and updates are supposed to be rare, but inserts certainly occur as incremental loads are performed on warehouses (and data marts) at various intervals (from hours to weeks typically). In either case, an analytical engine clearly needs to handle data changes in an ACID way or data loss and corruption can occur. Similarly, data value and integrity need to be protected (locked) from concurrent (possibly conflicting) access patterns. Internal database structures are vulnerable to corruption during these transactions. So how does XSPRADA technology handle these issues?

The answer, not surprisingly, lies in XSPRADA’s “magic sauce”, namely, the mathematics of Extended Set Processing (XSP). To appreciate this, one needs to understand that all entities inside the XSPRADA mathematical model (i.e., tables, rows, fields, etc.) are represented as extended sets. And all extended sets by definition are immutable. This means updates to the system are implemented by creating additional extended sets, and the original ones are never mutated or deleted by subsequent processing. This ensures that data sets in the system are never corrupted or, worse, deleted by mistake.

Internally, the XSPRADA data model is fully contained in a “universe” of extended sets. This is the set of all sets. Sets in this universe are related to each other via algebraic relations (hence the “relational” part of Relational Data Miner or RDM/x). Depending on the state of the system at a specific point in time, sets are either “realized” (materialized) to disk, or “virtual”, meaning they have an internal mathematical representation defined by algebraic expressions involving other sets, but no physical existence. (This has repercussions concerning “materialized views” which I’ll attempt to discuss in a future post).

So when sets are modified by internal operations, both new and old sets remain in existence within the universe. This means updates and inserts never actually change information, but only add to it. To complete the transaction, RDM updates the universe metadata to include knowledge of the newly created sets (if any) along with the algebraic relations linking them to their original brethren. Genesis is maintained. This is crucial because, unlike in conventional DBMS systems, the original data never needs to be re-generated (or created) to achieve rollback. The universe is only updated once all operations have completed successfully. If an error occurs, no harm no foul, as the previous state of the universe was maintained and still exists. In essence, the INSERT, UPDATE, DELETE functionality of the XSPRADA database is merely a logical emulation of conventional DBMS DML. Each of these statements internally results in an additive activity. In fact, UPDATE is internally implemented as DELETE+INSERT. So to recover from a failed set of operations (a transaction gone south) the system simply deletes any incomplete sets and does not update the universe metadata! This mechanism enforces atomicity and consistency natively without any need for additional programming or complexity.
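To make the additive model concrete, here is a minimal Python sketch of the idea. This is purely illustrative, not XSPRADA's actual code, and every name in it is my own invention: data sets are immutable, a transaction "commits" with a single metadata update, and a failure before that point simply leaves the old universe in place.

```python
# Illustrative sketch only -- not XSPRADA code. Sets are immutable tuples;
# the catalog maps a table name to the set currently visible to readers.
class Universe:
    def __init__(self):
        self.sets = {}     # set_id -> immutable tuple of rows
        self.catalog = {}  # table name -> visible set_id

    def insert(self, table, new_rows):
        """Emulate INSERT additively: build a NEW set, never mutate the old."""
        old_rows = self.sets.get(self.catalog.get(table), ())
        new_id = len(self.sets)
        combined = old_rows + tuple(new_rows)  # may raise mid-transaction
        self.sets[new_id] = combined
        self.catalog[table] = new_id  # the commit point: one metadata update

u = Universe()
u.sets[0] = ((1000, 12, "now is the time"),)
u.catalog["demo"] = 0
u.insert("demo", [(1001, 45, "for all good men")])  # normal incremental load

def bad_rows():  # a load that dies halfway through
    yield (1002, 76, "to come to the aid")
    raise RuntimeError("disk full")

try:
    u.insert("demo", bad_rows())
except RuntimeError:
    pass
# The exception fired before the commit point, so the catalog is untouched:
# readers never see partial data, and set 0 still exists, unmodified.
```

Rollback costs nothing here because nothing was ever overwritten, which is exactly the property the underlying mathematics is claimed to provide for free.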

On the isolation front, lists of in-process and pending operations are maintained in dynamic pipelines for each extended set. The system examines these pipelines and algebraically identifies any potential conflicts between operands. Again, the mathematics allows this to occur natively. So if the results of pending operations do not affect in-process operations, the system executes them concurrently and immediately. Conversely, if the mathematics identify a potential conflict or deadlock, pending operations are queued until conflicting running operations have completed.
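That conflict check can be caricatured in a few lines of Python. This is my own invented representation, assuming each operation declares the sets it reads and produces; the real system is said to derive this algebraically rather than from declared sets.

```python
# Illustrative only: an operation starts immediately unless the sets it
# touches overlap with sets a running operation is producing or consuming.
in_flight = []  # (reads, writes) of currently running operations
queued = []     # operations waiting for a conflict to clear

def try_start(reads, writes):
    """Run now if independent of all in-flight operations, else queue."""
    for r, w in in_flight:
        if (reads & w) or (writes & (r | w)):
            queued.append((reads, writes))
            return False
    in_flight.append((reads, writes))
    return True

started = try_start({"orders"}, {"orders_v2"})  # nothing running: starts now
blocked = try_start({"orders_v2"}, {"report"})  # reads a set in flight: queued
```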

Durability is the last remaining condition. The system maintains all realized sets in persistent storage (disk drives). Although sets or parts thereof can be (and often are) kept in cache, any complete set also has an image on disk. This prevents system failures from affecting the durability of realized sets. When the system restarts, it also restarts any operations that were executing at the time of failure.

For all these reasons, XSPRADA technology is actually superior to conventional database mechanisms for enforcing ACID, as the enforcement is inherently “built-in” via the mathematics underlying the system at all times. As I mentioned in the last post, the engine is also time-invariant, meaning it can always be queried as of a given point in the past. The ability to do this without any external programming or internal modeling is significant. One use case that immediately comes to mind (to me anyway) in the wake of the recent Wall Street disasters is being able to ask a financial database to yield answers as if it were being queried months or years ago. Imagine being able to roll back time to analyze or audit results that supported past decisions and the people who signed off on them. What a concept!
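Since old sets survive every commit, time invariance is essentially a catalog lookup. A hypothetical Python sketch of that idea (invented names throughout; RDM/x exposes nothing like this API, but the principle is the same):

```python
import bisect

# Illustrative only: a catalog that remembers which set was visible when,
# so a query "as of" any past moment just resolves against older metadata.
class TemporalCatalog:
    def __init__(self):
        self.history = {}  # table -> sorted list of (commit_time, set_id)

    def commit(self, table, t, set_id):
        self.history.setdefault(table, []).append((t, set_id))

    def as_of(self, table, t):
        """Return the set id visible at time t (last commit at or before t)."""
        entries = self.history.get(table, [])
        i = bisect.bisect_right(entries, (t, float("inf"))) - 1
        return entries[i][1] if i >= 0 else None

cat = TemporalCatalog()
cat.commit("trades", 10, 0)  # initial load
cat.commit("trades", 20, 1)  # later incremental load
```

Auditing "what did the database say last quarter?" then amounts to calling `as_of` with an earlier timestamp, with no replaying of logs required.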

If you'd like to take the XSPRADA database out for a spin (and pull the plug in the middle of using it just to see what happens), feel free to get the trial bits from our website.

Monday, June 22, 2009

Can Rover Roll Over Too?

In this post, I want to stay in gear-head mode and "demo" a couple of neat tricks from the XSPRADA RDM/x analytical engine puppy. I’m going to address quick prototyping capabilities (thanks to schema agnosticism), incremental inserts, and time invariance.

We’re going to do this with some really simple tables and data structures just to highlight the concepts behind the engine’s functionality. So suppose we start with a CSV file called ‘demo.csv’ consisting of the following rows (NOTE: the file must be terminated by CR-LF):

1000,12,"now is the time"

1001,45,"for all good men"

1002,76,"to come to the aid"

Now we want to load this into RDM/x so we use the XSPRADA SQL extension CREATE TABLE FROM as follows:

create table demo(id int,c1 int,c2 char(128)) from "C:\demo.csv"

Then we do a SELECT * on the table just to make sure all is well (for example, using QTODBC or any other ODBC-compliant SQL front-end tool) and we see:

1000 12 now is the time

1001 45 for all good men

1002 76 to come to the aid

And if I look at my schema in the QTODBC object browser I see this as expected:

So this is fine and dandy when you know the actual schema of your data. But what if you don’t, or you’re not sure, or you’re out to determine what the best schema might be for your application?

OK, well, what we can do is simply load everything as text types. So now we do:

create table demo(id char(128),c1 char(128),c2 char(128)) from "C:\demo.csv"

The original ‘demo’ table is overwritten with the new schema which now becomes:

And now we can play around with that until we’re satisfied. Assume we suspect the optimal way to work with this data might be using INT for the index, a double precision type for the c1 column, and an 80-char VARCHAR for the last field. We might take our existing table and “flip” it into a new schema (called flipped) as such:

select cast(id as int), cast(c1 as double precision), cast(c2 as varchar(80)) from demo into flipped

Note how we convert the schema into a new one directly into a new table on the fly. Now if we select from flipped we see:

1000 12.0 now is the time

1001 45.0 for all good men

1002 76.0 to come to the aid

And the schema for the new table, as expected, is:

Now if we needed to do some querying on numerics for C1, we could. Note we didn’t even have to explicitly create or schema the ‘flipped’ table. RDM/x took care of that automatically, making these types of operations ideal for quick prototyping. Incidentally, we could have done this directly on the existing table as well. This type of flexibility is pretty cool and allows you to “play” with the data in a trial and error mode.
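The flip itself can be sketched in a few lines of Python (purely illustrative, nothing to do with the engine's internals): each column of the all-text table is cast to its target type, just as the CAST expressions above do.

```python
# The all-text 'demo' table, as loaded with every column as char(128)
rows = [("1000", "12", "now is the time"),
        ("1001", "45", "for all good men"),
        ("1002", "76", "to come to the aid")]

# "Flip" it: cast id to int, c1 to double, c2 to an 80-char string
flipped = [(int(id_), float(c1), c2[:80]) for id_, c1, c2 in rows]
```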

Now suppose we come across some additional data and we wish to INSERT this information into our existing database. This is typically what happens with daily incremental loads into a data warehouse. On a periodic basis, chunks of data are added to (typically) fact tables. Our incremental CSV looks as such:

2000,132,"to be "

2001,465,"or not to be"

2002,786,"that is the question"

There are several ways to handle this using RDM/x. The most direct one is to simply use the XSPRADA “INSERT INTO FROM” SQL extension as follows:

INSERT INTO demo FROM "c:\demo_062209.csv"

In this case RDM/x is told “hey, there is more data being presented to you so, algebraically, union it with the existing data”. As RDM/x does not “load” data in the conventional sense of the term or support bulk loading (as it doesn’t need to), additional data can be presented virtually in real-time with little or no effect on the ability to query the system simultaneously.

This is what’s called “non-disruptive live updates” or concurrent load and query capability. Initially, users are a little surprised that bulk inserts are not supported. But in fact, all that’s needed is “dropping” the incremental data somewhere on disk and telling RDM/x of its existence. RDM/x can ingest this information as fast as it can be written to disk.

Another way to do this is by loading the incremental “chunk” into its own table as such:

create table demo_062209(id int,c1 double precision,c2 varchar(80)) from "C:\demo_062209.csv"

And then doing

select * from demo union select * from demo_062209 into demo

A bit of subtlety there: when you do this, you’re effectively “updating” the existing demo table with the unioned results of the incremental. So you take the old demo table, add the incremental, then flip that back into the original demo table. You could have saved the “old” demo table first before doing this as such:

select * from demo into demo_current

Notice how these “dynamic” tables really behave like variables in a loosely-typed programming language. They are conceptually related to conventional database “views” (RDM/x does support view objects in the database, but in a purely semantic way). As a matter of fact, all tables and views in RDM/x are essentially the same thing, materialized (to disk and/or memory) on a JIT basis anyway.

Of interest here is that you can always recover past instances of any data in RDM/x via a nifty little feature called “time invariance”. This feature is unique in the world of databases to the best of my knowledge. Essentially, you can query RDM/x much like a “time machine”, asking it to yield results as if the question were being asked in the past.

So suppose you had inserted your incremental into the ‘demo’ table by mistake, and suppose the insert occurred at a known time such as '2009-06-22 15:13:52.775000000' (which you can determine if you are logging your SQL statements to the database using the usrtrace option of the RDM/x ODBC driver). You can always recover the state of ‘demo’ at that time by issuing:

SELECT * FROM demo AT TIMESTAMP '2009-06-22 15:13:52.775000000' into recover

Now your ‘recover’ table will contain ‘demo’ exactly the way it was at that time.

The implications are far-reaching because you can essentially always query the RDM/x database at any point in the past. RDM/x never deletes information UNLESS the information has been explicitly deleted using a DROP TABLE or DELETE type of DML statement AND the garbage collector kicks in. Short of that, the database maintains information and integrity through time.
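Conceptually (and this is my analogy, not the actual RDM/x mechanism), time invariance falls out of the additive model: every version of a table survives with the timestamp at which it became current, so an AT TIMESTAMP query is just a lookup of the newest version at or before the requested time.

```python
# Toy model of a time-invariant table: versions are appended, never
# replaced, so any past state can be recovered by timestamp.
# (ISO-format timestamp strings compare correctly as plain strings.)

class TimeInvariantTable:
    def __init__(self):
        self.history = []  # list of (timestamp, rows), appended in time order

    def write(self, ts, rows):
        self.history.append((ts, tuple(rows)))

    def as_of(self, ts):
        # newest version written at or before ts
        current = None
        for t, rows in self.history:
            if t <= ts:
                current = rows
        return current

demo = TimeInvariantTable()
demo.write("2009-06-22 15:00:00", [(1000, 12)])
demo.write("2009-06-22 15:13:53", [(1000, 12), (2000, 132)])  # the mistaken insert

demo.as_of("2009-06-22 15:13:52")  # the table as it was before the mistake
```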

These are just a couple of nifty tricks from one original database engine called RDM/x.

Friday, June 19, 2009

Choose Wisely Grasshopper

If you’ve been tasked with implementing or supplementing business intelligence offerings in your organization lately (read: a data mart, for example), the choices have gotten quite a bit more complicated. A while ago, you would have your pick of three or four vendors. If one of those vendors was already in-house handling your transactional systems, guess what, he was likely to also handle your warehousing needs. Or at least try to.

Nowadays, the decision path is a little more involved because the BI world is no longer ruled by a small oligarchy (namely Oracle, Microsoft and IBM). The proliferation of “new-breed” analytical engine vendors has greatly expanded a buyer’s options. There are around twenty-some players in this field now. Not only that, but delivery options have expanded as well.

Nowadays, you can get BI delivered and running in-house sitting on commodity or custom hardware. You can buy canned appliances. You can go proprietary bits. You can go Open Source. Or you can tap the “cloud” with an on-demand subscription-based model. You can pick a columnar vendor, or a row-based one. You can choose an MPP architecture, or an SMP implementation. You can even stick with the “gorillas” if that makes you feel better (and money is no object). These different paths are all strategically different both technically and economically. The modern BI strategist is compelled to choose wisely in an unforgiving economy where failure is no longer an option (this time they mean it).

So for the sake of argument, I’m going to assume the following buyer profile:

  1. Money is _definitely_ an object.
  2. IT resources are non-existent, limited, not available, or not inclined to help.
  3. A lengthy proof-of-concept (POC) cycle is not an option.
  4. Project timeline is measured in weeks not months or years.
  5. C-level people want to see incremental results starting today.
  6. A single DBA is left standing in your organization (but next week, maybe zero).
  7. Your ass is on the line.

I think this describes a fairly common scenario these days. Faced with such odds, I think most people opt for the path of least economic and implementation resistance. Most people don’t have $2-6M hanging around including staff and a six-twelve month window to implement a full-blown Oracle, Microsoft or IBM solution. Those days are simply gone (good riddance on that). It would appear at first glance that the only remaining alternatives would be open source (OSS) or on-demand SaaS software.

Now, OSS is attractive from a cost basis, as most freebies tend to be. Yes, you do pay for support if you’re pulling down “enterprise” versions of the software, but in many cases, some buyers get away with using the free versions for a while, at least for quick POCs. In some cases, the free versions are either limited or incomplete in functionality (like InfoBright, for example) so that could be a “gotcha” depending on your application needs. Similarly, non-enterprise versions usually depend on the community for support. If you’re in a jam and need serious dedicated support on a moment’s notice, you’ll need to pony up for an enterprise version or wait until “the community” comes up with an adequate answer to your problem (if ever).

Additionally, OSS does have hidden costs, not the least of which are installation, setup, configuration, and maintenance. But, if you happen to be a Linux shop and have enough LAMP developers on staff with sufficient expertise and time, and your management happens to be accepting of the whole OSS concept, it might just be a viable option.

Another way to get up and running quickly is to go the Cloud (SaaS) service route. In that scenario, you pay a monthly fee to access a BI platform in the cloud. This hands-off approach is certainly attractive in many cases provided data volumes and security restrictions do not get in the way. Shlepping 10-100TB of data offsite is not something most people consider yet. But, for smaller data sizes, SaaS is certainly an option. Vertica, Kognitio and Aster Data come to mind as the latest new-breeders to provide cloud-based services to customers (either on proprietary or public cloud platforms like EC2). There is a flurry of other on-demand BI players as described in several of my past posts. Of course, the downsides there tend to be upload time, limited functionality and vendor lock-in.

Now if I can get on my soapbox for a minute, I’m going to pitch a third option. At XSPRADA, we’ve developed a 5MB high-performance analytical database running as a Windows service on commodity hardware and Server 2003 or 2008 x64 operating systems. You can install this puppy internally (your data stays nice and safe in-house) in about 2.5 minutes including ODBC drivers. You then point it at your CSV data on disk and start firing off queries immediately. The more you ask, the faster it gets. And you can show results in minutes, not weeks or months. No cubing, no indexing, no pre-structuring, no partitioning, none of that nonsense. That’s the bottom line in a nutshell. I promise you business intelligence is not the rocket science so many folks make it out to be! For those who might be hesitating between open source, SaaS or much more expensive in-house solutions, I think this is a pretty unique proposition. There really is nothing like it anywhere else on the market, and it just so happens we have a 30-day trial going on right now if you visit our website. :)

Thursday, June 4, 2009

Yabadabadoo and off to Austin!

I’m headed out to Austin again tomorrow. Last I checked it was 94F over there with 60% humidity. Guess I won’t be going on my daily power walk for a little while lest I keel over. That’s OK though, I’ll get my workouts at Bone Daddy’s. I’m jazzed up about the trip as usual. I love that town. It has a relaxed, genuine, homey feel I have not found in many other US cities. When people ask “how ya doing?” they actually mean it and expect an answer!

I have prospects, clients and partners to visit. Now that our pre-release RDM/x software is available, I’m getting to see how people actually use these bits in real-life situations and that’s fairly exciting for many reasons.

Product claims are one thing. But it’s quite another to actually see people’s reaction when they discover the software. When folks install our software, the usual reaction is “huh? Is this all there is to it? What did I miss?” – The answer is “you missed nothing” – RDM/x deploys and installs as a 5MB Windows service. Read my lips: a 5-megabyte Windows service EXE is all you need to query TB-level data sets faster and more easily than you’re used to. Install the bits, present data, connect via the ODBC driver, and ask questions. Done. Naturally, given users’ past experience and expectations with classic analytical engines, it’s hard to swallow initially. And speaking of shockers, here’s a doozy of a benchmark you can find on the TPC website.

That’s right folks, $6 million with Godzilla metal and full-blown enterprise software will buy you queries into a mere 1TB of data. Or, you could install and start a 5MB Windows service on a $20,000 server.

What else am I picking up? People are incredulous about the fact that there’s no need for partitioning, indexing, pre-structuring, worrying about rigid schemas and data models. In one case, I was actually able to load 18GB of data by pulling everything in as strings. These particular data sets were CSV clickstreams for a web analytics application (conveniently, they’re often text log files which suits us well as CSV is the only format we currently ingest). The ETL attempts had become tedious due to poor data quality and uncertain schemas. It wasn’t clear initially what the best schema might be for the type of data at hand, so experimentation was needed. So we took more of an ELT approach and just DDL’d everything in as text. We then quickly flipped internally to another schema, casting columns as needed on the fly. And if that schema turned out to be less efficient or interesting than originally thought, we just flipped the whole dataset to another one in one line of SQL. This type of just-in-time flexibility in presenting and transforming data is seldom (if ever) found in more ponderous solutions. For web-analytics-type applications (where DQ is typically poor and schemas fairly dynamic) this is a huge competitive advantage.

Another sweet spot is OLAP cubes. Or lack thereof I should say. In typical OLAP applications, building cubes is lengthy, complicated and expensive (as anyone who ever footed an analytical consultant’s bill will attest to). Re-aggregating or adding dimensions is also painfully slow and tedious as data volumes increase (and they always do). A lot of times these processes are set up to run nightly as batch processes. If you’re lucky and if you’ve done your engineering properly (lots of ifs), you show up in the morning and it’s done. But you can’t add a dimension in real-time without impact on the entire system. And you need to plan for your questions up-front, as in “what shall I slice & dice this year?” With RDM/x, you don’t need to set up dimensions explicitly. The very act of querying in a dimensional way alerts the engine to that effect and it starts building “cubes” internally and aggregating as needed, optimizing for specified dimensions. Then adding a dimension is painless – all you need to do is actually query against it more than once. With RDM/x, once is an event, twice is a pattern.
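Here's a hedged sketch of the "twice is a pattern" idea (the class and threshold are mine, invented for illustration; this is not the RDM/x API): count how often each dimensional grouping is requested, and only materialize the aggregate once the same pattern shows up a second time.

```python
from collections import Counter, defaultdict

class AdaptiveAggregator:
    """Toy illustration: aggregates are materialized only for repeated
    query patterns -- once is an event, twice is a pattern."""

    def __init__(self, rows):
        self.rows = rows              # list of dicts, e.g. {"region": ..., "amount": ...}
        self.seen = Counter()         # how often each grouping has been asked for
        self.materialized = {}        # grouping -> precomputed sums (the "cube")

    def query(self, dims):
        key = tuple(dims)
        self.seen[key] += 1
        if key in self.materialized:
            return self.materialized[key]     # served from the internal cube
        agg = defaultdict(float)
        for r in self.rows:
            agg[tuple(r[d] for d in dims)] += r["amount"]
        result = dict(agg)
        if self.seen[key] >= 2:               # twice is a pattern: keep the cube
            self.materialized[key] = result
        return result

a = AdaptiveAggregator([
    {"region": "TX", "amount": 10.0},
    {"region": "TX", "amount": 5.0},
    {"region": "CA", "amount": 7.0},
])
a.query(("region",))   # first ask: computed on the fly
a.query(("region",))   # second ask: same pattern, so the aggregate is kept
```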

This “anticipatory” self-adjusting behavior is actually at the heart of RDM/x, and it is one area where the product distinguishes itself from the competition: the software continuously re-structures both data and queries as more and more questions come in. This means you see response time decrease as quantity and complexity of queries increase. This is fundamentally different from conventional database behavior, where query response time “flat-lines” up to a given number (not including load time) and pretty much hovers there until and unless an “expert” can optimize either code or configuration (if possible at all). And when you add to or complicate your query patterns, performance suffers.

The RDM/x response-time profile is more of what I call the “dinosaur tail” effect: big initial hump, then exponential taper down over time. Here’s my Fred Flintstone rendition of the effect using MS Paint (nostalgia?):

It is always a blast observing people as they first witness this unique behavior. It’s like discovering a new life form. :) And I’m looking forward to repeating the experience in Austin next week and throughout the country soon enough. If you care for a head-start, pull down the bits from our website and give it a whirl!