Tuesday, March 31, 2009

2 = 2

Trawling through various “sales engineering” blogs recently, I happened to come across this one http://tinyurl.com/dz8nma by Xavier Petit who is apparently a fellow “paisan” from France.  Xavier seems like a sharp guy with an interesting manipulation of written English but nevertheless full of excellent insight, in my opinion.  Not the least of which is his reference to this other site here http://www.morenewmath.com.  Part of my job sometimes involves trying to explain or demo esoteric mathematical concepts to people who want to “drill-down” into our product’s technology (rare, but it happens) so you can imagine I was eager to learn more about this “new math” approach to performing successful demos.  Little did I realize initially that these were just mathematically-expressed (terse) philosophical principles (equations about life) – some of which are really pertinent to what XSPRADA is about.  For instance:

I wish I had a dollar (ok, in this economy, make that a Jackson) for every time I’d heard someone say “you guys can’t do this, it’s impossible”.  Now we have working software to prove them wrong.

A lot of folks initially said we were crazy for even attempting to disrupt the relational “status-quo”.  Five years later, here we are, and in good company I might add.

The concept of uniqueness is very hard to define mathematically.  Our ability to do just that is one pillar of our math-based technology.

Why disappoint people (and the market) by making overhyped promises and under-delivering?  We went the other route.

When perception matches reality over time, you have a successful product (think Rolex or BMW). People often ask me “what’s your value proposition”? I say “You tell me, here’s the software, here’s what it does, but don’t take my word, try it out.”  They’ve all heard the “better, faster, cheaper” pitch before, so why waste their time when it’s better spent actually using the software?

And my personal favorite, which has nothing to do with software, databases, or XSPRADA but fulfills me as a wine lover.  A votre santé! :-)

Sunday, March 29, 2009

Mind your own business (intelligence)

As I mentioned previously, I’m not a BI power user.  Until twelve months ago, I couldn’t have told you the difference between HOLAP, ROLAP or MOLAP, much less explained what a slowly-changing dimension was, or what the difference was between star and snowflake schemas.  

Nevertheless, I do have a technical background and I’ve learned a boatload about BI in the past year.  In the past months, I’ve been trying to put myself in the mid-market LOB user’s shoes by using any and all available tools I could find to “do BI” using our ODBC-talking analytical database as the backend (I wanted to do this from a non-enterprise perspective).

I tried to get my hands on every product I could get, from front-end analysis packages (either thick client or web-based) to ETL tools, both proprietary and open source.  Those are part of what I call the BI “ecosystem”.  My testing had several goals:

  1. How easy is the tool to install, set up and use without direction, manuals or training?
  2. How easy is it to connect to our database engine via ODBC?
  3. How easy is it to extract data from our engine and do some very basic analysis?

So I loaded the small Pentaho Mondrian sample database (FoodMart) into the XSPRADA engine and went to work.

Microsoft Excel  -- www.microsoft.com/excel

With Excel 2007, I generated an ODBC connection and quickly pulled in a data subset using Microsoft Query.  Then, I inserted a pivot table based on that data island in a new sheet, and did basic slice/dice analysis on the data.  Total time spent: all of maybe 5 minutes.  There’s a small instructional video on how to do this at http://tinyurl.com/ckfwp6.  Not rocket science.

QlikView  -- www.qliktech.com

You can download evaluation copies of QlikView from the website. When I first ran it, I thought, hey, this is MDX for dummies with an Excel twist (workbook paradigm).  QlikTech actually bills itself as “Excel on steroids”.  You can get results with QlikView fairly quickly out of the box.  Yes, it takes a bit of training and practice to do anything really significant, as the product is very rich (and can do some serious ETL in the process) but overall it’s very visual and intuitive.  And if you run into any issues, their sales engineers are readily available and quite effective.

Tableau -- www.tableausoftware.com

I pulled Tableau evaluation bits from their website and also attended one of their webinars.  Unfortunately, Tableau doesn’t talk to ODBC. I’m not sure why.  That pretty much ended my evaluation right there.

AltoSoft  -- www.altosoft.com

I tried Insight Studio but I couldn’t extract table schemas from our system via ODBC.  Quite honestly, I think it’s because our driver probably doesn’t support some level 1 API the tool is using.  So then I tried pulling from my FoodMart database inside SQL Server 2008.  Even then, I never “got” the tool.  It requires some training and/or patience I simply don’t have.  It’s too complicated right off the bat.  I see I can create some KPIs, but then the UI isn’t really obvious or smooth.  It just doesn’t “flow” and there’s some web-based dashboard stuff too which itself isn’t exactly self-explanatory and pops up tons of little Cassini web servers on my system.

Panorama -- www.panoramasoftware.com

For obvious reasons, these folks only talk to Microsoft SQL Server and SAP back ends.  Sharp UI and good usability, but, in my case, of little use.

Talend  -- www.talend.com

Talend is a French open source (OSS) company. (Incidentally, in this economy, they managed to recently score a $12M round --  not too shabby).  I tried their Open Studio tool and was able to do some very basic ETL to generate and convert CSV files to XSPRADA CSV specifications. This is important because at the moment, this is the only format you can present to our engine. So I wanted to see how easily someone could extract data from, say, Oracle or SQL Server and convert it to our specs.  The answer is: easily. That tool has a lot of knobs and switches and deep functionality but I was able to do simple stuff fairly quickly without diving too deep into Java code.  Given my OSS biases, I was impressed.

Pentaho  -- www.pentaho.com

Pentaho distributes an open source OLAP platform called Mondrian that talks SQL via MDX to numerous underlying databases, including ours.  I’ve been using Mondrian on Windows with the built-in JDBC-ODBC bridge to our database for months now without a problem.  With Mondrian’s web-based UI you can slice and dice FoodMart data any which way you like.  Where it gets a little hairy is actually setting up the cubing structures for your own data.  That does take expertise and is not for the casual LOB user (think: lots of trial and error). But with enough effort and community support, numerous people and organizations have apparently done just that.

Birst  -- www.birst.com

I was very excited to hear about this new company because it seemed they had a really new approach to hosted on-demand BI.  They have a subscription-based service.  So I quickly created an account and tried to upload some simple data (SQL’s from my ODBC connection) and see immediate results. Unfortunately that was about a week ago and since then, I still haven’t been able to either load or analyze data on their site due to various technical glitches.  This is really unfortunate because Birst’s claim to fame is “ease of use” but if they can work the quirks out (and find some really good UX people) I think there’s huge potential in their “hands-off” approach which is basically what I was looking for, namely “here’s some data, here’s a question, show me something quickly without my having to become a cube expert”.

Gooddata  -- www.gooddata.com

This company is another “on-demand” hosted BI offering.  The application is in beta mode and it shows.  Nevertheless, I was able to upload data and run slice/dice charts in a very intuitive manner within minutes.  One of the major issues I see is performance.  Even with small data sets, upload and processing operations are very slow (I believe they run off of Amazon cloud services).  I am hopeful the UI/performance/usability quirks will be resolved in a production version because this is the only on-demand player I’ve seen that seems to “get it” so far.  Ironically enough, they offer the Mondrian sample data set as a “default” project to play with when you create an account.  Felt right at home.

I draw the following conclusions/recommendations from this little experiment:

Doing BI is not easy (duh).  Although some of these tools are more intuitive than others, we’re still a long way from “easy” BI.  Even with Excel you still have to know something about pivot tables, cubes (if you’re connecting to SSAS) and connections.  With Mondrian, it really helps to know MDX.  In that context, the on-demand solutions are simpler.  And of course with hosted approaches, you don’t need a license key for Office.

LOB users can’t use OSS tools without “expert” help.  There is no way a typical business person can suffer the slings and arrows of Mondrian or Talend quirks and “gotchas” on his or her own.

A lot of vendors are really very confused about what constitutes good UX.  In some cases, I truly wonder if these folks ever run usability studies.  Please take the app home to a spouse or parent and have them use it.  I swear that’s helpful.  My mom is an Excel wizard yet can barely plug in a USB wireless mouse.

Unless you have a tool that is easier to set up and use than Excel, don’t bother!  I mean what’s the point? Excel has massive market penetration.  Nobody is going to care about a new application unless it can beat Excel hands down in usability and performance.  Also, you can share Excel spreadsheets and data in the “cloud” nowadays using various providers (like Google) so again, unless you have some compelling collaboration features to beat that, think again.

On the enterprise side, people are stuck in lock-in situations driven by licensing discounts and vendor pressure.  If you’re an Oracle, IBM or Microsoft shop, chances are you will be using a certain set of BI tools no matter what.  And I don’t think a Microsoft shop will be running Cognos, or an IBM shop SSAS.  Companies like QlikTech have a really tough time penetrating the large enterprise, I believe.

But everywhere else, new approaches are not only available, but also politically possible.  So I think the commoditization of BI will have to come from the bottom up. In that respect, the foundation currently being laid by on-demand and hosted BI services is much more important to the future of BI than meets the eye.

Thursday, March 12, 2009

There are only two types of people I cannot stand: people who are intolerant of other peoples' cultures, and the Dutch.

In my quest to mine the business intelligence market to the last drop, when I’m not busy doing other things like talking to people, setting up POCs, coding, looking at product features or evaluating what I call “ecosystem” products like BI and ETL platforms, I spend a heck of a lot of time online reading blogs and scanning websites.

This market is very dynamic and it’s important to keep up and spot trends and pain points early on. This isn’t unlike an NSA type of endeavor really. There are gazillions of “signals” out there and it’s up to me to pick up the pieces and make sense out of them as best I can.

When I do this successfully, I can not only enhance our competitive edge, but also discuss intelligently how we fit into the business intelligence “big picture”. In this business, someone who is not constantly educating himself is in a world of hurt.

To do this, I scan several dozen blogs and websites on an almost daily basis (see my partial list following this posting). This activity, combined with talking to folks in the BI world, product specialists, consultants, being involved in coding and testing a full-fledged database product, and attending webinars for the past year has given me a certain outlook on the market that carries “no baggage”. Let me explain.

I don’t come at this with the perspective of someone who has been immersed exclusively in the BI world for decades, or someone who has the hands-on background designing and implementing data warehouses his entire career. I’m just a software engineer who happens to have worked for a multitude of different businesses and in numerous domains, not just BI. So I feel this gives me a more “detached” view of the DW/BI market than an industry expert with BI-specific experience might have. That being said, the following are nothing more than observation-based personal opinions about the warehousing and BI market developed over the past 12 months. I’d love to hear your thoughts as well – at the risk of being branded a mere neophyte.

And then there was light…
Acceptance of “non-standard” (read: not from Microsoft, IBM or Oracle) approaches to analytical databases (including non-relational approaches) is going mainstream in the warehousing realm. Four to six years ago, if you talked to anyone in the enterprise about trying a “non-relational” database, they’d flip the bozo bit on you instantly. Nowadays, although relational “bigots” still (and always will) exist, BI people are more open-minded, mostly thanks to dozens of “new-breed” players in the market with well-documented successful implementations (not the least of which are Netezza, Teradata, Vertica and SybaseIQ). To use a phrase I detest, people are finally “thinking outside the Big Three box”.

Gimme a little OLTP with that OLAP would you?
The distinction between operational (transactional) and warehousing (analytical) business activity is blurring. Operational and analytical business efforts are often integrated. Warehousing and analytics is no longer “the crazy uncle in the attic” project. Another sign of this is the recent desire to enable ever more frequent insert/updates to warehouse stores than in the past. Not content with infrequent batch updates, people are now looking at efficiently pushing update/inserts real time into their warehouse. The updates often come from both an ODS and external data sources. Unfortunately most analytical databases are designed around the assumption that warehousing is mostly read-only (InfoBright ICE doesn’t even support DML) and optimize as such. I think they’re up against hard times unless they can “transactionalize” quickly.

Open source: cheaper than free?
Open source is having a large impact in analytics for both economic and practical reasons. Nowadays, businesses can set up data marts using engines like InfoBright (MySQL), for example, and use BI/Integration tools like Talend’s Open Studio or Pentaho’s Mondrian. Clearly a lot of this is driven by current economics, but open source deployment and licensing models are competing head-on with proprietary solutions on many levels.

Hey buddy, want some good BI?
Everyone’s talking about BI. Microsoft is running TV ad spots about it during prime-time and pushing the concept in a very public manner. After buying up DATAllegro last year, they announced the Madison project, put out best-practice configurations for Dell and HP “appliances” hosting SQL Server, (http://www.intelligententerprise.com/channels/information_management/showArticle.jhtml;jsessionid=GSPUKPDEIPLJCQSNDLPSKHSCJUNN2JVN?articleID=214502509) and have been discreetly adding warehouse-oriented features to SQL since 2007 (http://msdn.microsoft.com/en-us/library/cc278097.aspx). What’s more, they own the BI desktop with Excel and Office 14 promises a significant OLAP functional push. Vaporware? Maybe, but I think they’re gunning to eat a lot of BI folks’ lunches out there in the coming months. If anyone can commoditize DW and BI technology, it’s Microsoft.

Performance, shperformance
“Performance” is a big mystery. No one really knows how to define it in the DW/BI space. Is it load time? Is it query response time? Is it data presentation to result time? Does it include backup time? How about data recovery? Opinions differ. Service level expectations are often misguided or unrealistic. People don’t seem to care much about TPC-H or TPC-DS performance metrics.

The proof is in the concept
POC POC POC! People want to see numbers on their data on their hardware up close and personal. That’s the way it should be. There’s too much hype in the industry, and people tune out the bullshit http://www.biblogs.com/2009/03/12/rejecting-stale-tech-marketing-words/ -- Who can blame them? These folks have been abused and lied to for decades now. As I’ve been on the other side of the fence many times, I can relate. That’s why I always keep it real simple when talking about our product. Everyone’s heard the “better, faster, simpler” spiel a million times, so why insult their intelligence? Give them a set of keys and let them test-drive the darn product! POCs should be run as described at http://www.altosoftcommunity.com/ and http://www.altosoftcommunity.com/?p=61 and http://www.dbms2.com/2009/03/02/ideas-for-bi-pocs/. There isn’t a single solution in this market that applies perfectly to every customer across the board and any vendor claiming otherwise is either naïve or disingenuous.

Cloud cool-aid
“Cloud computing” madness has taken hold in the DW/BI industry as well. Vertica and Kognitio are big pushers of warehouse cloud hosting solutions. The buzzword now is DaaS (data as a service). I find it hard to get excited about this recent trend. I don’t mean to Andy Rooney the whole concept, but in enterprise warehousing and BI, given the amounts of data involved and the security, governance, availability, and SLA issues, I just don’t “grok” it. Maybe for small segments of “cold” data? I don’t know. I see the craze, I feel the buzz, I get the marketing upside, and I notice the traction but…I am not of the body. The power of clouds doesn’t compel me.

Here’s my list of DW/BI industry blogs:

Tuesday, March 10, 2009

"Out of nothing I have created a strange new universe". [Janos Bolyai]

Right off the bat, let me refer you to an interesting discussion of extended set theory on the web at http://www.xprogramming.com/xpmag/xstSomeThoughts.htm written by Ron Jeffries. It's not a bad conceptual introduction but I'm going to try to do it one better in simplicity below.

In our last post we discussed the notion of “distinguishability” among classical set theory members, and how XSPRADA extended set theory allowed one to clearly differentiate between and uniquely identify extended set members. These extended set members become values in extended set “couplets” and an XSPRADA “couplet” is an XSPRADA “ordered pair”. Each member is defined by a scope and a constituent piece, either of which is interchangeable at will, but neither of which can exist without the other. Now, in cases where order and identity are irrelevant, classical set theory can be used to manipulate these members. But in cases where we do need to identify, discern or order set members, we apply extended set theory as follows:

Suppose you have a set of elements {A, B, C}. We want to represent it both mathematically and physically (in a computer) so we can do calculations on it using the mathematics to answer questions, and then bring answers back to the computer. For example, we may want to count members in a set (cardinality), or calculate the intersection of two sets. So typically you store this information on the computer in one way or another. Then you “read” back the physical data (via disk I/O), map it to your mathematical ontology (in the software), do your calculations (count or intersect operation), then bring results back to the “real world”, namely as a SQL result. The faster you can do this, the better off everyone is.

In a computer, you can’t simply store {A,B,C} because the triplet can be laid out in many different ways (well, six actually, or factorial of 3) in memory (or storage). So which one do you pick? Clearly you can’t store every possible permutation efficiently. Even if you wanted to, you’d still have to somehow indicate which permutation you meant in a given context. So engineers came up with a hack called “indexing”. But now you also have to store the index values along with the data, even though the index itself is not native to the information. Back on the math side, when you load this record back into the model, there is still no inherent notion of order. To the model, it makes no difference if you load back A B C or C B A or B A C because they are all equivalent. So you have to load the index as well. Then you need to do some math “magic” to tie the two, and propagate the magic all the way back to a result, which you must then move back to the physical world as well. It’s the quintessential anvil to the ankle problem.
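To make the counting concrete, here is a minimal Python sketch (my own illustration, not anything from the XSPRADA engine). It enumerates the physical layouts of the three members and shows that a classical set, modeled here with Python's `frozenset`, cannot tell them apart:

```python
from itertools import permutations

members = ["A", "B", "C"]

# Every possible physical layout of the three members: 3! = 6 orderings.
layouts = list(permutations(members))

# A classical set carries no notion of order, so once loaded back into
# the model all six layouts collapse into one and the same set.
distinct_sets = {frozenset(layout) for layout in layouts}

print(len(layouts))        # number of layouts
print(len(distinct_sets))  # number of distinct classical sets
```

That collapse is exactly why a separate index has to travel along with the data.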

Extended set theory models physical storage (RAM or disk) using newly defined set members and so right off the bat we would express our initial set as {1.A, 2.B, 3.C}. This cannot be confused with an entirely different set known as {1.B, 2.C, 3.A}. When “materializing” this abstraction to disk or RAM, the representation is clearly as follows:

0x002200 A
0x002201 B
0x002202 C

In other words, we have now modeled computer addressing using extended set scopes. This is one of the many possible applications of extended set theory for information management.
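As a rough sketch of that mapping, assume each extended set member is a (scope, constituent) pair and that scope 1 lands at the base address 0x002200 used in the listing. The function names `materialize` and `load` are hypothetical, chosen only for this illustration:

```python
BASE_ADDRESS = 0x002200  # base address taken from the listing above

def materialize(extended_set, base=BASE_ADDRESS):
    """Map each (scope, constituent) member to an address -> value entry."""
    return {base + scope - 1: constituent for scope, constituent in extended_set}

def load(memory, base=BASE_ADDRESS):
    """Rebuild the extended set from an address -> value layout."""
    return {(address - base + 1, value) for address, value in memory.items()}

abc = {(1, "A"), (2, "B"), (3, "C")}  # stands for {1.A, 2.B, 3.C}
memory = materialize(abc)

for address in sorted(memory):
    print(f"0x{address:06x} {memory[address]}")
```

Because the scope is part of each member, going from the model to storage and back again loses nothing, and no side index is involved.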

Now take a different arrangement, for example:

0x002200 C
0x002201 A
0x002202 B

When you move this back to the extended math model (in the software), you end up with {1.C, 2.A, 3.B}. No need for sleight-of-hand indexing or other “unnatural” steps. There is zero ambiguity.

Now suppose you have 2 different sets and wish to determine if they are “equivalent” as follows:

0x002200 A
0x002201 B
0x002202 C

0x002200 C
0x002201 A
0x002202 B

Using classical set theory, you don’t have enough information (without indexing hacks) to mathematically make the call. In extended set theory, you end up comparing two sets {1.A, 2.B, 3.C} and {1.C, 2.A, 3.B}. You can see 1.A and 1.C are clearly distinct, as are 2.B and 2.A and 3.C and 3.B. Why? Because extended set theory says two members are equal only if both their scopes and constituents are equal! Consequently, you can say with certainty “these two sets are different”.
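The comparison can be sketched in a few lines of Python, again modeling members as (scope, constituent) pairs. The helper `constituents` is a hypothetical name that simply throws the scopes away to get the classical view:

```python
first = {(1, "A"), (2, "B"), (3, "C")}   # {1.A, 2.B, 3.C}
second = {(1, "C"), (2, "A"), (3, "B")}  # {1.C, 2.A, 3.B}

def constituents(extended_set):
    """Classical view: keep the constituents, discard the scopes."""
    return {constituent for _, constituent in extended_set}

# Classical comparison sees the same three elements on both sides.
classical_equal = constituents(first) == constituents(second)

# Extended comparison requires scope AND constituent to match.
extended_equal = first == second

print(classical_equal, extended_equal)
```

The classical view cannot separate the two layouts, while the extended view separates them with no extra bookkeeping.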

Let’s apply this to finding the intersection between these two sets. In classical set theory, intersecting sets {A, B, C} and {C, A, B} yields {A, B, C}. Intuitively, both sets seem to contain the same elements. But do they really? It depends. In extended set theory, the question is: do {1.A, 2.B, 3.C} and {1.C, 2.A, 3.B} have anything in common? And the answer is a resounding no, as none of the elements has a matching scope and constituent. In this example, if order matters, classical set theory cannot help us.
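Here is the same idea as a small Python sketch, using the two sets from the comparison above ({1.A, 2.B, 3.C} and {1.C, 2.A, 3.B}) with members modeled as (scope, constituent) pairs. It illustrates the principle only and makes no claim about how the engine does it:

```python
first = {(1, "A"), (2, "B"), (3, "C")}   # {1.A, 2.B, 3.C}
second = {(1, "C"), (2, "A"), (3, "B")}  # {1.C, 2.A, 3.B}

# Classical intersection: compare constituents only, ignoring position.
classical = {c for _, c in first} & {c for _, c in second}

# Extended intersection: a member survives only if the same constituent
# sits under the same scope in both sets.
extended = first & second

print(sorted(classical))
print(sorted(extended))
```

The classical result keeps all three elements, while the extended result is empty because no scope holds the same constituent in both sets.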

If you’ve been following me so far and thinking, this is really stupidly simple, you’re catching on. Simple in concept? Yes. As Einstein once said, “Make everything as simple as possible, but not simpler”. Simple in implementation? That’s another story. But the fact is, the implications for information management are mind-boggling. Why? Six reasons:

First, since the mathematics formally models order and relationships, and can manipulate these internally via a formal algebra, humans don’t need to pre-index or model information before exploiting it. The software extracts that information from the data directly. More interestingly, it does so by examining the bits and bytes right off storage with no loading or pre-canned processing logic needed.

Second, every piece of information presented to the system is internalized in the math engine. This means the original format of the information (be it relational or unstructured, CSV, XML, or video) becomes moot once inside the math engine. Once modeled, any information (bits and bytes) becomes sets, which can be manipulated and operated on consistently.

Third, because the mathematical model being maintained is complete and closed, the integrity of the “universe” is maintained at all times. This means information is immutable. It cannot be changed or deleted. This feature is called “temporal invariance”. Consequently, you can actually query the database as if you were asking questions weeks or months ago, not unlike a time machine.
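One common way to get this kind of “time machine” behavior, sketched here purely as an analogy, is an append-only store where nothing is overwritten and every read names a point in time. The class `AppendOnlyStore` and its methods are hypothetical and say nothing about how XSPRADA implements temporal invariance:

```python
from bisect import bisect_right

class AppendOnlyStore:
    """Toy store: values are only ever appended, never changed or deleted."""

    def __init__(self):
        self._history = {}  # key -> list of (timestamp, value), ascending

    def put(self, key, value, timestamp):
        versions = self._history.setdefault(key, [])
        if versions and timestamp <= versions[-1][0]:
            raise ValueError("timestamps must increase for a given key")
        versions.append((timestamp, value))

    def get(self, key, as_of):
        """Return the value that was current at time `as_of`, or None."""
        versions = self._history.get(key, [])
        index = bisect_right([t for t, _ in versions], as_of)
        return versions[index - 1][1] if index else None

store = AppendOnlyStore()
store.put("price", 100, timestamp=1)
store.put("price", 120, timestamp=5)

print(store.get("price", as_of=3))  # the answer as it stood at time 3
print(store.get("price", as_of=5))  # the answer as it stands at time 5
```

Asking the same question with an earlier `as_of` returns the answer the system would have given back then.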

Fourth, every query (question) fed to the system is transformed into its mathematically equivalent set of algebraic expressions. As such it can be optimized mathematically. Expressions and relations are also cached and exploited in real time. This lets the system discover existing relationships that a human user could not have foreseen (and much less asked about). It also lets the system adapt to “poor SQL” or poorly formulated questions without impacting results or performance.

Fifth, the math engine partitions queries. As a matter of fact, it can be shown mathematically that the number of possible query “types” in a given business intelligence domain is finite. No matter how complex or convoluted a user or machine-generated query is, the engine can determine without ambiguity which partition “bucket” it belongs to. Because this taxonomy is well-defined and finite, spectacular optimization techniques can be applied.

Sixth, it can be mathematically demonstrated that a finite number of operations suffice to handle any possible workload presented to the engine. The mathematics prove that any query can be answered using a small set of well-defined operations such as union, intersect and complement, for example. Consequently, the XSPRADA math engine need only implement a fairly small number of operations, and no more. As these operations can be threaded and parallelized in the software, this makes for a very highly optimizable high-performance system.
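A toy Python sketch shows the flavor of the argument: given only union, intersect and complement over a fixed universe, other operations fall out by composition. The universe of ten row ids and the function names are assumptions made for this example, not the engine's actual operation set:

```python
UNIVERSE = frozenset(range(10))  # hypothetical universe of row ids

# The three primitive operations.
def union(a, b):
    return frozenset(a) | frozenset(b)

def intersect(a, b):
    return frozenset(a) & frozenset(b)

def complement(a):
    return frozenset(x for x in UNIVERSE if x not in a)

# Derived operations, built only from the primitives above.
def difference(a, b):
    return intersect(a, complement(b))

def symmetric_difference(a, b):
    return union(difference(a, b), difference(b, a))

west_coast = {1, 2, 3, 4}
gucci = {3, 4, 5}

print(sorted(difference(west_coast, gucci)))
print(sorted(symmetric_difference(west_coast, gucci)))
```

Since each primitive works on independent members, it is easy to picture them being split across threads, which is the optimization angle described above.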

So from a technical perspective, XSPRADA technology and its application to database software is a genuine breakthrough. But more importantly, for business intelligence purposes (reporting, OLAP, decision support, or data mining) the rigorous application of formal mathematics to the software implementation of an analytical database engine offers unequaled competitive advantages. I’ll describe those in real-world terms in a subsequent post, and I'll also dig up a couple of places and instances where this technology was actually successfully applied in the past as well!

Thursday, March 5, 2009

Smashing baby!

In the last post I promised to dive a little deeper into the math-based XSPRADA technology without putting readers to sleep.  That’s a tall order, as most people flee at the mere mention of mathematics (and end-users typically don’t care about internal implementation details) but this story is entertaining as well.  And I need to set the stage explaining what makes XSPRADA technology so valuable to the industry, and different from the rest, so here goes.

In classical set theory, everything is defined in terms of sets (groups).  Sets contain “members”, and several things can be done with them.  One is deciding if they are indeed members or not.  Another is defining “relationships” between members.  And another is defining operations on sets.  So for example, one might ask a membership question such as “is this handbag in our system”?    A relationship might be defined between a product and a manufacturer as in “give me all handbags manufactured by Gucci”.  An operation might define “intersection” between sets.   For example, intersecting product with geography might yield “all the Gucci handbags sold on the West coast”.
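For the programmers in the audience, the three ideas map directly onto Python's built-in sets. The product ids, makers and regions below are made up for illustration:

```python
# Hypothetical catalog data, invented for this example.
handbags = {"handbag-001", "handbag-002", "handbag-003"}
made_by = {("handbag-001", "Gucci"), ("handbag-002", "Prada"), ("handbag-003", "Gucci")}
sold_in = {("handbag-001", "West Coast"), ("handbag-002", "West Coast"), ("handbag-003", "East Coast")}

# Membership: "is this handbag in our system?"
in_system = "handbag-001" in handbags

# Relationship: "give me all handbags manufactured by Gucci"
gucci = {product for product, maker in made_by if maker == "Gucci"}

# Operation: intersect product with geography
west_coast = {product for product, region in sold_in if region == "West Coast"}
gucci_on_west_coast = handbags & gucci & west_coast

print(in_system, sorted(gucci), sorted(gucci_on_west_coast))
```

Note that nothing in these sets says which handbag comes first, which is the limitation the rest of this post is about.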

The most popular databases since the late 1960s are based on classical set theory (whence the name “relational”) as described in the famous 1970 E. F. Codd paper.  This technology has remained unchanged since then.  But around the same time, David Childs and Franklin Westervelt, two researchers at the University of Michigan, looked at how classical set theory was used to model databases and concluded there had to be a better, faster, more intelligent way to manage information.

Why?  Because they premised that classical set theory was handicapped when applied to computer systems.  Indeed, it did not provide a mechanism to uniquely identify set members.  Although seemingly innocuous, this is a major hindrance.  That’s right, although classical set theory clearly defines “ordered” pairs (couplets) there is no formal way to distinguish between { a, b } and {b, a} or between member ‘a’ and ‘b’ for that matter.  So the term “ordered”, in the classical sense, is relative, and unintuitive.  All the mathematics can say about a set is that it has X elements.  Period.  How you want to represent that on paper or inside a computer system is an “exercise left for the reader”, and a painful one at that.  Why?  Because if the math cannot distinguish set members uniquely and much less order them, programmers have to implement the functionality via software. And that gets complicated, expensive and slow.

The mathematics itself is not flawed.  Ordered pairs (couplets) always existed.  At issue is whether this pair is well-defined or not.  It certainly isn’t conveniently defined.  Classical set theory doesn’t allow us to deal one-to-one with technical reality, namely computing systems, where uniqueness (identity) and order rule.  Yet CPUs, memory banks, disk drives, and all computers could not exist without them.  For example, you couldn’t have memory addressing in RAM without the ability to define and identify unique addresses.  And multiple values can’t simultaneously reside in the same memory or disk location.   Identity and order are as fundamental to computers and software as they are to the human mind.

And so the question became, is it possible to “extend” the math framework to create inherent set member “distinguishability” so mathematics can fully model the real world, making databases faster, simpler and more efficient?  Fifty years later, the answer is yes, as XSPRADA technology proves.  In my next post, I’ll explain how and outline the practical benefits.

Tuesday, March 3, 2009

Mini-Me, you complete me.

Ok in this case, it's certainly no one remotely "mini". More like "maxi": I received an email from Curt Monash recently and he pointed out some needed updates and a typo in a previous entry here.

First of all, Curt posted the latest PPT deck from his presentation at TDWI recently (which I was not able to attend unfortunately, which is a shame since I'm only 4 hours driving from Vegas and have a "permanent" room at the Encore and...did I mention how convenient a trip it is for me? Did I mention the steak frites at Mon Ami Gabi? My favorite dinner hangout in Sin City? Yeah, moving right on...)

So Curt's latest deck on how to pick an analytical database vendor is at:


Furthermore, and embarrassingly so, I misspelled Curt's website which is at



So...thanks for completing, I mean correcting me Dr. Monash.