Thursday, June 4, 2009

Yabadabadoo and off to Austin!

I’m headed out to Austin again tomorrow. Last I checked it was 94F over there with 60% humidity. Guess I won’t be going on my daily power walk for a little while lest I keel over. That’s OK though, I’ll get my workouts at Bone Daddy’s. I’m jazzed up about the trip as usual. I love that town. It has a relaxed, genuine, homey feel I have not found in many other US cities. When people ask “how ya doing?” they actually mean it and expect an answer!

I have prospects, clients, and partners to visit. Now that our pre-release RDM/x software is available, I'm getting to see how people actually use the bits in real-life situations, and that's exciting for a number of reasons.

Product claims are one thing. But it's quite another to actually see people's reactions when they discover the software. When folks install our software, the usual reaction is "huh? Is this all there is to it? What did I miss?" The answer: you missed nothing. RDM/x deploys and installs as a 5 MB Windows Service. Read my lips: a 5 MB Windows Service EXE is all you need to query terabyte-scale data sets faster and more easily than you're used to. Install the bits, present your data, connect through the ODBC driver, and start asking questions. Done. Naturally, given users' past experience and expectations with classic analytical engines, it's hard to swallow at first. And speaking of shockers, here's a doozy of a benchmark you can find on the TPC website.

That's right, folks: $6 million worth of Godzilla metal and full-blown enterprise software will buy you queries into a mere 1 TB of data. Or you could install and start a 5 MB Windows Service on a $20,000 server.
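For the curious, here's roughly what that "install and start" experience looks like from the client side. This is only a sketch: I'm using Python with the pyodbc module, and the DSN and table names ("RDMX", "clicks") are placeholders I made up, not product defaults.

```python
# Minimal sketch of a first RDM/x session over ODBC (Python + pyodbc).
# "RDMX" is a hypothetical DSN pointing at the service; "clicks" is a
# hypothetical table presented from your data. Substitute your own names.
import pyodbc

conn = pyodbc.connect("DSN=RDMX")   # the 5 MB Windows Service answers here
cur = conn.cursor()

# No partitioning, no indexing, no tuning pass: just ask the question.
cur.execute("SELECT COUNT(*) FROM clicks")
print(cur.fetchone()[0])

conn.close()
```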

What else am I picking up? People are incredulous that there's no need for partitioning, indexing, pre-structuring, or worrying about rigid schemas and data models. In one case, I was able to load 18 GB of data by pulling everything in as strings. These particular data sets were CSV clickstreams for a web analytics application (conveniently, they tend to be text log files, which suits us well since CSV is the only format we currently ingest). The ETL attempts had become tedious due to poor data quality and uncertain schemas: it wasn't clear up front what the best schema would be for this kind of data, so experimentation was needed. So we took more of an ELT approach and simply DDL'd everything in as text. We then quickly flipped internally to another schema, casting columns as needed on the fly. And when that schema turned out to be less efficient or interesting than originally thought, we flipped the whole dataset to yet another one in a single line of SQL. This kind of just-in-time flexibility in presenting and transforming data is seldom (if ever) found in more ponderous solutions. For web analytics applications, where data quality is typically poor and schemas fairly dynamic, this is a huge competitive advantage.
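Here's an illustrative version of that load-as-text-then-cast flow. The SQL is generic and the table and column names are invented for the example; the actual RDM/x DDL and CSV-ingestion syntax may well differ:

```python
import pyodbc

conn = pyodbc.connect("DSN=RDMX")   # hypothetical DSN, as above
cur = conn.cursor()

# Step 1: ELT style, DDL everything in as text; no schema commitment yet.
cur.execute("""
    CREATE TABLE clicks_raw (
        ts      VARCHAR(32),
        user_id VARCHAR(64),
        url     VARCHAR(2048),
        bytes   VARCHAR(16)
    )
""")
# ... ingest the CSV clickstream into clicks_raw here ...

# Step 2: flip the whole dataset to a typed schema, casting on the fly.
cur.execute("""
    CREATE TABLE clicks AS
    SELECT CAST(ts AS TIMESTAMP)  AS ts,
           user_id,
           url,
           CAST(bytes AS INTEGER) AS bytes
      FROM clicks_raw
""")
conn.commit()
```

If the typed schema turns out to be a dud, you repeat step 2 with different casts; nothing upstream has to change.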

Another sweet spot is OLAP cubes. Or the lack thereof, I should say. In typical OLAP applications, building cubes is lengthy, complicated, and expensive (as anyone who has ever footed an analytical consultant's bill will attest). Re-aggregating or adding dimensions also gets painfully slow and tedious as data volumes increase (and they always do). Often these processes are set up to run as nightly batches. If you're lucky and you've done your engineering properly (lots of ifs), you show up in the morning and it's done. But you can't add a dimension in real time without impact on the entire system. And you need to plan your questions up front, as in "what shall I slice & dice this year?" With RDM/x, you don't need to set up dimensions explicitly. The very act of querying in a dimensional way alerts the engine, which starts building "cubes" internally and aggregating as needed, optimizing for the dimensions it sees. Adding a dimension is painless: all you need to do is query against it more than once. With RDM/x, once is an event, twice is a pattern.
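I obviously can't show RDM/x internals here, so take this as a toy model of the idea rather than the actual engine: count the dimensional query shapes you see, and materialize a rollup the second time a shape appears.

```python
from collections import defaultdict

class AnticipatoryAggregator:
    """Toy illustration of "once is an event, twice is a pattern."
    Not the RDM/x implementation; it just shows the triggering idea."""

    def __init__(self):
        self.seen = defaultdict(int)   # dimension set -> times queried
        self.cubes = {}                # dimension set -> materialized rollup

    def query(self, rows, dims, measure):
        key = tuple(sorted(dims))
        self.seen[key] += 1
        if key in self.cubes:
            return self.cubes[key]     # served from the internal "cube"
        result = defaultdict(int)
        for row in rows:
            result[tuple(row[d] for d in key)] += row[measure]
        if self.seen[key] >= 2:        # second sighting: materialize it
            self.cubes[key] = dict(result)
        return dict(result)
```

The first query on a new dimension set is computed from scratch; the second triggers materialization; from the third on, the "cube" answers.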

This "anticipatory," self-adjusting behavior is actually at the heart of RDM/x, and it's one area where the product distinguishes itself from the competition: the software continuously restructures both data and queries as more and more questions come in. That means you see response time decrease as the quantity and complexity of queries increase. This is fundamentally different from conventional database behavior, where query response time "flat-lines" at some level (not counting load time) and pretty much hovers there until and unless an "expert" optimizes the code or configuration (if that's possible at all). And when you add to or complicate your query patterns, performance suffers.

The RDM/x response-time profile is more of what I call the “dinosaur tail” effect: big initial hump, then exponential taper down over time. Here’s my Fred Flintstone rendition of the effect using MS Paint (nostalgia?):
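For those who prefer numbers to cave art, one simple way to model that tail (my stylization, with made-up constants, not measured results) is an exponential decay toward a floor:

```python
import math

def dinosaur_tail(n_queries, t0=120.0, t_floor=2.0, rate=0.15):
    """Illustrative response time (seconds) after n_queries:
    a big initial hump tapering exponentially toward a floor.
    All constants are invented for the sketch."""
    return t_floor + (t0 - t_floor) * math.exp(-rate * n_queries)

for n in (0, 5, 20, 50):
    print(n, round(dinosaur_tail(n), 1))   # 120.0, 57.7, 7.9, 2.1
```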

It is always a blast observing people as they first witness this unique behavior. It's like discovering a new life form :) And I'm looking forward to repeating the experience in Austin next week, and throughout the country soon enough. If you'd like a head start, pull down the bits from our website and give it a whirl!
