Big Honking Databases: "Out of nothing I have created a strange new universe". [Janos Bolyai]

Right off the bat, let me refer you to an interesting discussion of extended set theory on the web at http://www.xprogramming.com/xpmag/xstSomeThoughts.htm written by by Ron Jeffries. It's not a bad conceptual introduction but I'm going to try to do it one better in simplicity below.

In our last post we discussed the notion of “distinguishability” among classical set theory members, and how XSPRADA extended set theory allowed one to clearly differentiate between and uniquely identify extended set members. These extended set members become values in extended set “couplets” and an XSPRADA “couplet” is an XSPRADA “ordered pair”. Each member is defined by a scope and a constituent piece, either of which is interchangeable at will, but neither of which can exist without the other. Now, in cases where order and identity are irrelevant, classical set theory can be used to manipulate these members. But in cases where we do need to identify, discern or order set members, we apply extended set theory accordingly as such:

Suppose you have a set of elements {A, B, C}. We want to represent both mathematically and physically (in a computer) so we can do calculations on it using the mathematics to answer questions, and then bring answers back to the computer. For example, we may want to count members in a set (cardinality), or calculate the intersection of two sets. So typically you store this information on the computer in one way or another. Then you “read” back the physical data (via disk I/O), map it to your mathematical ontology (in the software), do your calculations (count or intersect operation), then bring results back to the “real world” namely as a SQL result. The faster you can do this, the better off everyone is.

In a computer, you can’t simply store {A,B,C} because all possible permutations of the triplet {A,B,C} can be represented in many different ways (well, six actually, or factorial of 3) in memory (or storage). So which one do you pick? Clearly you can’t store every possible combination efficiently. Even if you wanted to, you’d still have to somehow indicate which combination you meant in a given context. So engineers came up with a hack called “indexing”. But now you also have to store the index values along with the data, even though the index itself is not native to the information. Back on the math side, when you load this record back into the model, there is still no inherent notion of order. To the model, it makes no difference if you load back A B C or C B A or B A C because they are all equivalent. So you have to load the index as well. Then you need to do some math “magic” to tie the two, and promulgate the magic all the way back to a result, which you must then move back to the physical world as well. It’s the quintessential anvil to the ankle problem.

Extended set theory models physical storage (RAM or disk) using newly defined set members and so right off the bat we would express our initial set as {1.A, 2.B, 3.C}. This cannot be confused with an entirely different set known as {1.B, 2.C, 3.A}. When “materializing” this abstraction to disk or RAM, the representation is clearly as follows:

ADDRESS VALUE
0x002200 A
0x002201 B
0x002202 C

In other words, we have now modeled computer addressing using extended set scopes. This is one of the many possible applications of extended set theory for information management.

Now take a different arrangement, for example:

ADDRESS VALUE
0x002200 C
0x002201 A
0x002202 B

When you move this back to the extended math model (in the software), you end up with {1.C, 2.A, 3.B}. No need for slight-of-hand indexing or other “unnatural” steps. There is zero ambiguity.

Now suppose you have 2 different sets and wish to determine if they are “equivalent” as follows:

ADDRESS VALUE
0x002200 A
0x002201 B
0x002202 C

ADDRESS VALUE
0x002200 C
0x002201 A
0x002202 B

Using classical set theory, you don’t have enough information (without indexing hacks) to mathematically make the call. In extended set theory, you end up comparing two sets {1.A, 2.B, 3.C} and {1.C, 2.A, 3.B}. You can see 1.A and 1.C are clearly distinct, as are 2.B and 2.A and 3.C and 3.B. Why? Because extended set theory says two members are equal only if both their scopes and constituents are equal! Consequently, you can say with certainty “these two sets are different”.

Let’s apply this to finding intersection between these two sets. In classical set theory intersecting sets {A, B, C} and {C, B, A} yields {A, B, C}. Intuitively, both sets seem to contain the same elements. But do they really? It depends. In extended set theory, the question is: do {1.A, 2.B, 3.C} and {1.C, 2.B, 3.A} have anything in common? And the answer is a resounding no as none of the elements has matching scope and constituents. In this example, if order matters, classical set theory cannot help us.

If you’ve been following me so far and thinking, this is really stupidly simple, you’re catching on. Simple in concept? Yes. As Einstein once said, “Make everything as simple as possible, but not simpler”. Simple in implementation? That’s another story. But the fact is, the implications for information management are mind-boggling. Why? Six reasons:

First, since the mathematics formally models order and relationships, and can manipulate these internally via a formal algebra, humans don’t need to pre-index or model information before exploiting it. The software extracts that information from the data directly. More interestingly, it does so by examining the bits and bytes right off storage with no loading or pre-canned processing logic needed.

Second, every piece of information presented to the system is internalized in the math engine. This means the original format of the information (be it relational or unstructured, CSV, XML, or video) becomes moot once inside the math engine. Once modeled, any information (bits and bytes) becomes sets, which can be manipulated and operated on consistently.

Third, because the mathematical model being maintained is complete and closed, the integrity of the “universe” is maintained at all times. This means information is immutable. It cannot be changed or deleted. This feature is called “temporal invariance”. Consequently, you can actually query the database as if you were asking questions weeks or months ago, not unlike a time machine.

Fourth, every query (question) fed to the system is transformed into its mathematically equivalent set of algebraic expressions. As such it can be optimized mathematically. Expressions and relations are also cached and exploited in real time. This lets the system discover existing relationships that a human user could not have foreseen (and much less asked about). It also lets the system adapt to “poor SQL” or poorly formulated questions without impacting results or performance.

Fifth, the math engine partitions queries. As a matter of fact, it can be shown mathematically that the number of possible query “types” in a given business intelligence domain is finite. No matter how complex or convoluted a user or machine-generated query is, the engine can determine without ambiguity which partition “bucket” it belongs to. Because this taxonomy is well-defined and finite, spectacular optimization techniques can be applied.

Sixth, it can be mathematically demonstrated that a finite number of operations suffice to handle any possible workload presented to the engine. The mathematics prove that any query can be answered using a small set of well-defined operations such as union, intersect and complement, for example. Consequently, the XSPRADA math engine need only implement a fairly small number of operations, and no more. As these operations can be threaded and parallelized in the software, this makes for a very highly optimizable high-performance system.

So from a technical perspective, XSPRADA technology and its application to database software is a genuine breakthrough. But more importantly, for business intelligence purposes (reporting, OLAP, decision support, or data mining) the rigorous application of formal mathematics to the software implementation of an analytical database engine offers unequaled competitive advantage. I’ll describe those in real-world terms in a subsequent post, and I'll also dig up a couple places and instances where this technology was actually succesfully applied in the past as well!

2 comments:

AnonymousApril 9, 2010 at 4:50 AM
This is a fascinating post, Jerome! I've been thinking about the implications of XST in the context of linked data best practice/principles.

I'm wondering if you would agree that the Memento Project exhibits the sort of "temporal invariance" you are describing?

John Erickson
UnknownApril 9, 2010 at 4:57 AM
Geez I dunno John, I've long since then left XSPRADA (now www.algebraixdata.com)- you might try and hook up with Chris Piedmonte or Gary Sherman there to discuss further. If you need an intro, let me know.
Best,
J.

Big Honking Databases

Tuesday, March 10, 2009

"Out of nothing I have created a strange new universe". [Janos Bolyai]

2 comments:

About Me

Small sample of blogs I follow

Search This Blog

Tracer

Followers

Blog Archive