Monday, May 25, 2009

Data Warehousing for Dummies

Every so often I’ll be talking to someone implementing our solution at a customer or POC site and the question comes up “what’s so special about your database anyway?” and “So, is it really different from MySQL, SQL Server or Oracle?” or “I don’t understand why your database talks SQL since it’s not a normal database like Oracle”. Better yet: “What else do I need to do after installation to get this working?” Usually these questions come from non-LOB folks who are tasked with implementing a particular solution using our product. Typically these people tend to be experienced software developers or DBAs. People who, as Joel Spolsky likes to say, are “smart and get shit done”. The type of folks you can throw a problem at and say “Ok, go solve it using this new tool.”

But as often happens in large organizations, they may not have been briefed fully by management on the features/functionality of the new tool needing evaluation. Or maybe this is their first exposure to BI. They may also never have encountered or worked with an analytical database product. I know that, several years ago, if you’d asked me what the difference was between OLTP and OLAP I would have blurted something like “one is for transactional stuff, the other for reporting” and been in the right ballpark but no cigar.

So when these questions come up, I am always ecstatic to be able to share what I’ve learned with the guys in the trenches doing the real work! The first thing I do is give a very general view of the differences between transactional (operational) and analytical use cases. Then I’ll try and give a 30,000 foot picture of data warehousing and its history. I’ll mention Kimball and Inmon, of course, then several books and a series of blogs, websites and YouTube videos for further exploration.

But this weekend, I discovered the holy grail of data warehousing 101. I was hanging out at my local Borders looking to trade my 40% off coupon in exchange for yet another good data warehousing/BI book when I noticed the yellow “Data Warehousing for Dummies” on the bottom shelf. Obviously I couldn’t resist picking it up, especially since I have yet to meet anyone remotely “dumb” in this business.

To my great surprise, I noticed the author was none other than Tom Hammergren, the owner of Balanced Insight, one of the top BI Software Innovator firms in the country. To say that Tom is a warehousing and BI guru is an understatement. This much I knew. But I had NO idea he was also an accomplished writer who could present this complicated subject in clear, simple terms anyone can understand and relate to! From now on, whenever someone asks me for the quick low-down about BI and data warehousing, I’ll be referring him or her to Tom’s book.

Now to answer the above questions about our own product RDM/x. There is nothing magical about our analytical database, at least from a usage standpoint. RDM/x talks and walks like any other database product out there on the market using ODBC. The magic is on the inside, certainly not in the interface (thankfully) as it supports a significant subset of the SQL-92 standard (minus TCL and DCL). Is RDM/x “really” different from SQL Server, Oracle, MySQL or DB2 though? You bet.

RDM/x is designed for data analysis, not transactional processing. As a matter of fact, RDM/x is the smallest, nimblest, on-premise solution available on the market that will let you query terabytes of data within minutes of installation. And that’s why I believe most people are a little confused from the get-go: they’re used to large footprint multi-module database clients with 500 page installation and setup manuals, followed by complicated tuning and optimization techniques involving indexing, partitioning, and all that “good” stuff. When they see a 5 megabyte piece of software installing as a Windows Service, ready, willing and able to handle queries on gigabytes or terabytes of data within minutes, they think they’re missing something. How can BI be this simple? Well it can. The proof is in the pudding, and since we’re allowing you to download a fully functional 30-day evaluation from our website effective now, the best I can do is recommend you take me up on that assertion and try it for yourself.

Thursday, May 21, 2009

Running RDM/x on Amazon EC2 is DICEE!

I want to keep this post fairly brief because there is so much stuff going on at XSPRADA lately that I find myself pressed for time from 6AM to midnight on a typical day which usually also includes weekends, but that’s the price you pay for building a revolution. Ask Fidel, he knows.

First of all, I finally had time last Sunday to record a screencast explaining how to install and set up our RDM/x software.  In the process, I discovered that CamStudio and the Microsoft Windows x64 decoder were my friends, reducing a 1.2GB video to 35MB (phew!).

Second, I want to talk about my recent epiphany with EC2.  Early this week I decided to see if I could install and run our Windows-based database engine RDM/x on some sort of cloud platform, because I don’t think anyone in their right mind in enterprise software can afford to ignore this trend any longer.  My purpose was certainly not to set up a full-fledged production system up there, but rather to set up a quick and dirty demonstration system so people could either duplicate it or use it on the fly to test-drive our software, for example.

After poking around a bit I settled on Amazon’s EC2.  They seemed like the only “big-time” player supporting WinTel boxes (our software runs on 64-bit Windows Server 2003 and 2008) and they have enough credibility and market “karma” at this point to alleviate most basic concerns about reliability and security.  So after checking out possible configuration tools and hitting our CEO up for some plastic, I signed up for EC2 and started exploring this brave new world.

It turns out EC2 is really several “platform” components comprising: the actual EC2 O/S instance (a VM blade) known as an AMI (Amazon Machine Image), persistent storage called EBS (Elastic Block Store) which presents as “volumes” you mount onto the AMI, and persistent (hot/cold) storage (also used for EBS snapshots) called S3.  There’s also a queuing system called SQS but that didn’t enter my mix.   Confused yet? It’s not that bad once you get used to it :)

For the configuration tooling, you can use command line tools (which I suspect most *NIX/LAMP people prefer), a Firefox plugin, or the web-based AWS (Amazon Web Services) Console.  I used the latter two to compare.

I brought up one of the standard Windows AMIs as a Windows 2003 R2 64-bit datacenter server.  I actually tried two different instance types: one extra-large standard with 15GB of RAM and 4 cores, and one large high-CPU with 7GB of RAM and 8 cores.  I found better performance on the 4-core box with twice the RAM, so I ended up sticking with that one.

For credentials, you need to generate a key-pair, and then plug in the private part into a dialog box which then spits out an admin password for the new instance. You then connect remotely to your instance using Remote Desktop Connection (or SSH if you’re talking to a *nix instance).

Right off the bat my instance came with four attached 500GB “hard drives”.  These volumes are more like flash drives, I think. This is not persistent storage, but it’s pretty darn fast.  For “real” storage you need to create Elastic Block Store (EBS) “volumes” and attach them to your instance.  So I did just that and slapped 4 additional 500GB drives onto my box, then converted each drive to an NTFS mount point (because this is best practice for our particular application). Unfortunately, I extracted a maximum of 22-25MB/sec I/O to and from these volumes.  I had read somewhere that these dynamic “block devices” were more like instant SANs but in fact, Amazon Silver Support (another pay-for service but well worth it if you ask me) stated the following to me in an email:

“Even though it presents a block interface, EBS isn't intended to be equivalent to fibre-channel SAN storage. The performance you should expect from EBS would more closely align with a NAS device. You can stripe several EBS devices together for higher I/O rates, but your rates will be limited by various shared components in the system, including the network between your instance and the storage servers. Larger instance types will typically see better performance than smaller instance types.”

“More closely align with a NAS device”.  Uh oh. That means gating at the NIC level.  The systems are clearly not set up for intense I/O data processing needs, at least not using the standard EC2 configuration models currently available.  Nevertheless, I was still able to do sufficient work with sufficient data to build a reasonable “functional demo” machine.  And that was my goal from the outset. Given this took me about 1.5 days to figure out, at an average cost of around $16/day (not including the additional Silver support fees), I am very impressed with this cloud platform to say the least, and I’m sure Jeff Bezos is basking in the bliss of my endorsement :)
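Amazon’s point about striping and shared components is easy to sketch as back-of-the-envelope math: aggregate throughput grows with the number of striped volumes until some shared path (the NIC, in my case) gates it. The per-volume rate below matches what I measured; the network cap is a hypothetical number for illustration, not an Amazon figure.

```python
# Back-of-the-envelope model of striped EBS throughput.
# The nic_cap_mb_s value is an illustrative assumption, not a measured
# or published Amazon number.

def striped_throughput(volumes, per_volume_mb_s, nic_cap_mb_s):
    """Aggregate I/O scales with volume count until a shared path gates it."""
    return min(volumes * per_volume_mb_s, nic_cap_mb_s)

# With the ~25 MB/s per volume I saw, and an assumed ~100 MB/s network path:
for n in (1, 2, 4, 8):
    print(n, "volumes ->", striped_throughput(n, 25, 100), "MB/s")
# -> 25, 50, 100, 100 MB/s: past 4 volumes, striping buys nothing more.
```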

Quite honestly, this cloud business is no joke.  I haven’t seen, heard and felt such a buzz around a new “platform” in the industry since I got my hands on Windows 3.0 in the early nineties.  It was the same “oh my God” emotional feeling at the time, or DICEE as Guy Kawasaki likes to put it (Deep, Intelligent, Complete, Elegant and Emotive).


Saturday, May 16, 2009

Software that Sucks

Here’s a classic from the tech press that really caught my attention recently: 

“In the context of software, the word “Enterprise” has now officially come to mean software that sucks. Enterprise Software hit the nadir of suckitude (sic) at the launch of “Enjoy SAP”.  This is like the American Dental Association launching “Enjoy Root Canal”.  SAP is certainly an easy target, but let’s face it, “Enterprise Software” is generally a poorly integrated mess.  Working with Enterprise Software feels a bit like walking through an industrial landfill or an airport hangar.  Nothing is built to human scale.”

This was written on the SOA Center blog by none other than Software AG’s Chief Strategist Miko Matsumura.  His use of the techno-political term “suckitude” is one for the annals of our new post-TARP technology world.  If nothing else, the current situation seems to be facilitating proverbial “paradigm shifts” (namely, on-demand software) while encouraging more anti-status-quo “frank-speak” from industry figureheads.  I’m all for that.

Because, notwithstanding all the pain, suffering and incertitude in the economy lately, one of the really brilliant consequences of this world-wide mess is that people are starting to say out loud what everyone’s been thinking silently for years.  Even in the sacrosanct enterprise software glass mansions, people who matter are starting to throw stones.  When major industry players start talking straight and using technical terms like “sucking”, you know the BS gloves are off.   I think established players, platforms and ways of doing business and thinking about customers are all up for questioning at this point.  Sunshine is the best disinfectant.

And speaking of gloves off, SAP and industry shifts, this old article from April 2008 refers to a slugging match between two industry titans at the Churchill Club.  One is Marc Benioff of Salesforce and the other Dr. Hasso Plattner of SAP fame.  There’s a video of the exchange on YouTube.  I know it’s a long one, but I assure you it’s worth watching entirely if you care at all about the on-demand versus on-premise religious wars of late.

I am not going to comment at length on the video as anyone can draw their own conclusions, but I did want to point out what I consider some key points, and throw in a few gold nuggets.  

First, the body language between those two guys is simply priceless.  It is more than obvious from the get-go that they can’t stand each other.  You can catch the vibe even in that one picture in the article (and throughout the video).  Benioff’s looking away from Dr. Plattner constantly (he fidgets with his wedding band incessantly), and Dr. Plattner is reflective in his own world as in “why the hell am I here”.  To my amazement, at the end of the video, they both reveal that this is their very first in-person meeting!  Incidentally, one audience member does ask Dr. Plattner at the end why he agreed to do this.  His answer: “for the challenge”.  Not sure what that means.

Second, the verbal jousting between the two is fairly aggressive.  I don’t think these guys have much respect for each other notwithstanding their pseudo-polite claims to the contrary.  If you asked me whether Benioff hates Microsoft or SAP more, I’d be tempted to say SAP.

At one point Benioff states: “We have been passionate about moving obstacles out of the way of the old enterprise software companies.”  I guess this is one major tenet of the on-demand adherents.  Power to the users!  In my opinion, Dr. Plattner really does buy the on-demand proposition but not “religiously”, and either way, he can’t say it in public.  He knows SAP screwed it up in the past.  I’m not sure he believes in SAP’s ability to execute such a shift internally.  And I bet he wouldn’t mind buying Salesforce outright with one check.  He implies as much several times but then claims he doesn’t want to get into a bidding war with Oracle.  Hogwash.

Throughout the video, both contestants score evenly, in my opinion, on the arrogance meter.  I guess they can both afford to be that way, but it does take a certain piece of the “human” side away from each.  For Dr. Plattner, I think the Germanic personality comes through more than genuine arrogance.  After all, he doesn’t need it at this point.  The guy built and ran a $40B company.  Enough said.  Benioff often has this “do the right thing” Google-ish “morality” in several other interviews and videos.  But when you watch him in action here, the only thing that comes out is a ruthless, self-convinced warrior (it’s no coincidence his favorite read is Sun Tzu’s The Art of War).  Although conviction and the ability to back it up are noble (and key to business success), I’ve always feared people immutably driven by their own dogma (mind you, I actually buy into on-demand big time).  But as my high-school math teacher used to say, “you can never shelter yourself from a surprise”.

Finally, as I was lauding “frank-speak” earlier, I did want to point out that Dr. Plattner uses the term “shit” several times during the exchange.  Initially, referring to Salesforce grabbing Dupont from them he states “Why did he win DuPont?  Because we had a shitty CRM system, and he had a much better one.”  Then later, referring to a customer still using code Dr. Plattner himself wrote: “…Shit! There is a customer in America still using the code I wrote.” Then referring to SAP’s earlier attempt at on-demand CRM: “…Shit, yeah!  It was better than our CRM on-demand.”   I find that endearing.

To conclude, if you truly want to understand the ongoing (and upcoming) battles between the SaaS and on-premise proponents of the enterprise software industry, you owe it to yourself to watch this video or, at the very least, pull down the transcript. And bring some popcorn!



Thursday, May 7, 2009

Tidbits and Check this Guy out

I read this article a couple of weeks ago and thought about one of our field test partners (telecom) because they had some political issues shipping us some data due to (very legitimate) privacy concerns – as in their CSO going “are you guys out of your f$##$ing minds?!?”.  

As it turns out, there are several data obfuscation tools out there on the market, including DMSuite’s offering as described in this article. I’m curious if most companies’ privacy policies make an exception for data that’s been altered by such a tool and if so, is there some sort of standard or certification these tools must meet?  If you know anything about that, I’d appreciate some insight.

I didn’t know until last night that there actually is a CIQP Certification.  What is CIQP you ask? Come on, get with the program!  Everyone knows what a Certified Information Quality Professional is!  There is a whole website dedicated to DQ as well.  I had never heard about this professional category.  If anyone reading this happens to be in that category and/or CIQP Certified, I’d love to chat with you and learn more about it.

For those of you who think the economy really sucks, your deduction is likely valid.  Nevertheless, BI and on-demand software market indexes seem pretty healthy to me as this article demonstrates.  My conclusion: I’d rather be in the “avant-garde” BI enterprise software sector than working for SAP or Oracle at this point :)

I discovered Guy Kawasaki’s Entrepreneurial Lectures delivered at Stanford in 2003-2004 via this videocast series and sat there mesmerized listening to every single clip for hours.   

Guy (who now runs this blog and this company) successfully evangelized the Mac in the mid-80s and now runs a VC firm called Garage Technology Ventures.  His reputation and track record are legendary.  In the clips, he lectures young Stanford engineers-to-be on how to become successful entrepreneurs, change the world, and keep their soul in the process.  These are the points (or quotes) from his lectures that were etched on my mind:

  • Make meaning and make the world a better place.
  • Don’t write a mission statement, write a Mantra.
  • If your product is not unique and adds no value, you’re doing something stupid.
  • Don’t ask people [customers] to do things you wouldn’t do.
  • Be a Mensch.
  • Hire infected people.
  • Suck down.  The higher you go in the enterprise, the thinner the oxygen.
  • A milestone is something that increases the valuation of your company.
  • The valuation formula for a startup is: add $500,000 per engineer and subtract $250,000 per MBA.
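
The last bullet practically writes itself as code. A tongue-in-cheek sketch of Guy’s formula, nothing more:

```python
def startup_valuation(engineers, mbas):
    """Guy Kawasaki's startup valuation formula:
    add $500,000 per engineer, subtract $250,000 per MBA."""
    return engineers * 500_000 - mbas * 250_000

# A team of 10 engineers and 2 MBAs:
print(startup_valuation(10, 2))  # -> 4500000
```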

All these points are perfectly in line with my personal experience.  But I’d never heard anyone formalize them in such an entertaining way before! And I don’t know that any comment can properly decorate any of these either. It’s one of those “you either get it or you don’t” kind of things.  It can’t be taught or inculcated by anything other than passion-driven experience.

Tuesday, May 5, 2009

MAD About You.

I haven’t had a minute to sit down and blog lately. Our upcoming “pre-release” software slated for Cinco de Mayo has absorbed all my efforts. I just flew back from Austin, TX last week to participate in final engineering and release touches. At the same time, one of our major Defense prospects just re-activated a huge project so it’s all hands on deck at XSPRADA these days. I love it when a plan comes together.

Nevertheless, I recently picked up on a paper called “MAD Skills: New Analysis Practices for Big Data” referenced on Curt Monash’s blog by one of its authors, Joe Hellerstein. Joe is a CS professor at UC Berkeley, if I understand correctly, and a consultant for Greenplum. I expected to read a lot of pro-Greenplum content in there but I don’t feel the major arguments presented are specifically tied to this vendor’s architecture or features per se. What makes this paper really valuable, in my opinion, is that it “resulted from a fairly quick, iterative discussion among data-centric people with varying job descriptions and training.” – In other words, it is user-driven (and not just a bunch of theoretical PhD musings) and as such, probably an accurate reflection of what I consider to be a significant shift in the world of BI lately.

Namely, that the “user” base is shifting from the IT type to an analyst and business person type. Note I didn’t say “end-user”, because the end user is still the business consumer. But what’s really changing is the desire (and ability) to cut out the middle man (read: the IT “gurus”), in essence, at every step of the way from data ingestion to results production. To put it in terms Marx would have liked, the means of production are shifting from dedicated technical resources to actual consumers :) -- This is particularly true in the on-demand world, but probably pervasive throughout as well. I think the paper outlines and defines this change.

“The conceptual and computational centrality of the EDW makes it a mission-critical, expensive resource, used for serving data-intensive reports targeted at executive decision-makers. It is traditionally controlled by a dedicated IT staff that not only maintains the system, but jealously controls access to ensure that executives can rely on a high quality of service.”

The starting premise of the paper is that the industry is changing in its concept and approach to EDW. Whereas in the past an “Inmon-ish” view of the EDW was predicated on one central repository containing a single comprehensive “true” view of all enterprise data, the new model (and practical reality) is really more “Kimball-ish” in the sense that the whole of EDW comprises the sum of its parts. And its parts are disparate, needing to be integrated in real time, without expectations of “perfect data” (DQ) in an instant-gratification world. This new model is premised on a MAD model: Magnetic, Agile and Deep.

Magnetic because the new architecture needs to “attract” a multitude of data sources naturally and dynamically. Agile because it needs to adapt flexibly to fast-shifting business requirements and technical challenges without missing a beat. And deep because it needs to support exploratory and analytical endeavors at all altitudes (detail/big picture) and for numerous user types simultaneously.

“Traditional Data Warehouse philosophy revolves around a disciplined approach to modeling information and processes in an enterprise. In the words of warehousing advocate Bill Inmon, it is an “architected environment" [12]. This view of warehousing is at odds with the magnetism and agility desired in many new analysis settings.”

“The EDW is expected to support very disparate users, from sales account managers to research scientists. These users' needs are very different, and a variety of reporting and statistical software tools are leveraged against the warehouse every day.”

Fulfilling this new reality requires a new type of analytical engine, in my opinion, because the old ones are premised on a model which, apparently, has not delivered successfully or sufficiently over time for BI. So the question arises: what set of architectural features does a new analytical engine need to have in order to support the MAD model as described in this paper? And more importantly (from where I stand), is the XSPRADA analytical engine MAD enough?

To support magnetism, an engine should make it easy to ingest any type of data.

“Given the ubiquity of data in modern organizations, a data warehouse can keep pace today only by being “magnetic": attracting all the data sources that crop up within an organization regardless of data quality niceties.”

The “physical” data presented to the XSPRADA engine must consist of CSV text files. Currently CSV is the only possible way of presenting source data to the database. CSV is the least common data format denominator, so this means the engine can handle pretty much any data source provided it can be morphed to CSV. Since the vast majority of enterprise data lends itself to CSV export, that covers a fairly wide array of data sources. How easy is it to “load” data into the engine? Very easy, especially since the engine doesn’t “load” data per se but works off the bits on disk directly. All it takes is a CSV file (one per table) and a SQL DDL statement such as “CREATE TABLE…FROM ” to “present” data to the engine. “The central philosophy in MAD data modeling is to get the organization's data into the warehouse as soon as possible.” – I’d say this fulfills that philosophy.
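To make the “morph to CSV, then present” workflow concrete, here’s a minimal sketch. Python’s csv module does the export; the table name, file path, column types, and the exact shape of the CREATE TABLE…FROM statement are hypothetical stand-ins, not the real RDM/x syntax:

```python
import csv

# Step 1: morph source data (from any system that can dump rows) into CSV,
# the engine's least-common-denominator ingestion format.
rows = [
    ("2009-05-01", "ACME", 1200.50),
    ("2009-05-02", "GLOBEX", 980.00),
]
with open("sales.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Step 2: "present" the file to the engine with one DDL statement.
# The syntax below merely paraphrases the post's CREATE TABLE...FROM
# and is illustrative only.
ddl = "CREATE TABLE sales (day DATE, account VARCHAR(32), amount FLOAT) FROM 'sales.csv'"
print(ddl)
```

No bulk loader, no index builds: in this model the DDL statement is the whole “load”.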

To support agility, an engine shouldn’t dictate means and methods of usage and must be flexible:

“Given growing numbers of data sources and increasingly sophisticated and mission-critical data analyses, a modern warehouse must instead allow analysts to easily ingest, digest, produce and adapt data at a rapid pace. This requires a database whose physical and logical contents can be in continuous rapid evolution… we take the view that it is much more important to provide agility to analysts than to aspire to an elusive ideal of full integration”

The external representation of data on disk and the internal “logical” modeling of the data change dynamically based on incoming queries. The XSPRADA engine is “adaptive” in that sense and constantly looks at incoming queries and data on disk to determine the optimal way of storing and rendering it internally. This feature is called Adaptive Data Restructuring (ADR). On the flexibility side, the XSPRADA engine is schema-agnostic. This is a fairly unique feature that allows users to “flip” schemas on the fly.

For example, it’s possible to present entire data sets to the engine based on an all-VARCHAR schema (i.e. make every column VARCHAR). Maybe you don’t know the real schema at the time, or perhaps you don’t care about it, or perhaps the optimal schema can only be determined after some analysis is performed. Or perhaps there are inconsistencies in the data or DQ issues preventing a valid “load” based on a rigid schema. Or maybe it was just easier and quicker to export all the data as string types in the CSV. In any case, the XSPRADA engine will happily ingest that data. Later on, you can CAST each field as needed into a new table on the fly and run queries against the new model, or try others as needed. Similarly, ingestion validation is kept to a minimum by design. For example, it’s quite possible to load an 80-char string into a CHAR(3) field. This is not possible with conventional databases. The implications of this from a performance and flexibility angle are impressive. The XSPRADA database lends itself to internal transformation; hence it favors an ELT model, minus the “L”.
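The flip-the-schema-later workflow boils down to generating two statements: an all-VARCHAR DDL for ingestion, and a CAST-based SELECT once the real types are known. A small sketch of that pattern follows; the column names, the VARCHAR width, and the CREATE TABLE AS SELECT syntax are hypothetical illustrations, only the two-step pattern itself comes from the post:

```python
def varchar_ddl(table, columns, width=255):
    """Step 1: present everything as VARCHAR when the real schema is unknown."""
    cols = ", ".join(f"{c} VARCHAR({width})" for c in columns)
    return f"CREATE TABLE {table} ({cols})"

def cast_select(src, dst, typed_columns):
    """Step 2: once the right types emerge, CAST into a new table on the fly."""
    casts = ", ".join(f"CAST({c} AS {t}) AS {c}" for c, t in typed_columns)
    return f"CREATE TABLE {dst} AS SELECT {casts} FROM {src}"

print(varchar_ddl("raw_sales", ["day", "account", "amount"]))
print(cast_select("raw_sales", "sales", [("day", "DATE"), ("amount", "FLOAT")]))
```

The point is that choosing types becomes a query-time decision rather than a load-time gamble: if the first CAST model turns out wrong, you just generate another.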

“In recent years, there is increasing pressure to push the work of transformation into the DBMS, to enable parallel execution via SQL transformation scripts. This approach has been dubbed ELT since transformation is done after loading.”

And finally, to support depth, an engine should allow rich analytics, provide an ability to “focus” in and out on the data, and provide a holistic un-segmented view of the entire data set:

“Modern data analyses involve increasingly sophisticated statistical methods… analysts often need to see both the forest and the trees in running these algorithms… The modern data warehouse should serve both as a deep data repository and as a sophisticated algorithmic runtime engine.”

A salient feature of the XSPRADA engine is its ability to handle multiple “types” of BI work at the same time. For example, it’s possible to mix OLAP, data mining, reporting and ad-hoc workloads simultaneously on the same data (and all of it) without resorting to “optimization” tricks for each mode. Similarly, the need for logical database partitioning doesn’t exist in the XSPRADA engine. Duplicating and re-modeling data islands on separate databases (physical or logical) for use by different departments is neither necessary nor recommended.

In an OLAP use case, there is no need to pre-define or load multidimensional cubes. The very act of querying consistently (meaning more than once) based on fact and dimension axes causes the engine to realize that this particular section of data is being accessed “multi-dimensionally”. It then starts cubing information internally, aggregating as indicated (if needed) by incoming queries. In this mode, perhaps the engine will decide a columnar storage approach is optimal and will re-structure the data accordingly. In a data mining use case, the approach is likely different because incoming queries are “incremental” (often pinpointed) and results are used to generate new queries without pre-determined patterns. The engine will likely start by eliminating vast “wasteland” areas of the data (the forest) from consideration as needed, then proceed to optimize specific islands of interest (the trees) as they become more relevant in the queries.

So overall, I think the XSPRADA analytical engine was indeed designed with “MAD-ness” from the get-go, even if the term didn’t exist years ago. It’s the approach and the philosophy that really matters. In that respect, we’re definitely headed for the MAD-house :)