Sunday, February 22, 2009

Try the hot pockets, they're breathtaking!

So, as we saw earlier, XSPRADA belongs to a new generation of software companies building specialized analytical database engines for business intelligence purposes. And as previously described, each of these companies concocts a “special recipe” of “new-age” engineering features to offer simpler, faster and cheaper results than the competition. At least, that’s the marketing spiel. So what makes XSPRADA a little different from the rest? Let’s re-examine the secret-sauce ingredients from our last post and see how (and if) each is used in the XSPRADA kitchen:

Row-based or column-based?
Neither. XSPRADA doesn’t think in terms of rows or columns, or any other pre-determined canned data structure, tabulated or not. The XSPRADA approach is one of “adaptive data re-structuring” where data is re-arranged on disk based on queries being asked. If it makes sense to re-structure a given piece of information in a particular way, then XSPRADA does just that. This means that, in some cases, the information will indeed be re-arranged into columns, but such an arrangement is not a hard-coded rule. Rather, it is adapted to the queries being received, taking into consideration the nature and type of the data itself. So, it is entirely possible to have the same information re-arranged on disk in numerous different ways simultaneously to satisfy different querying patterns.
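
As a made-up illustration (standard SQL against a hypothetical sales table, not XSPRADA-specific syntax), consider two very different query shapes arriving against the same data:

    -- Query 1: touches two columns across the whole table (a shape that favors a columnar arrangement)
    SELECT region, SUM(amount)
    FROM sales
    GROUP BY region;

    -- Query 2: pulls every column for a single row (a shape that favors a row-style arrangement)
    SELECT *
    FROM sales
    WHERE order_id = 1234567;

If both patterns keep showing up, the claim is that the engine may end up keeping the same data arranged both ways on disk rather than forcing a single layout on everything.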

Caching and direct memory access
In the XSPRADA engine, all queries are automatically converted to their algebraic (mathematical) form and cached as such in memory. Result sets may or may not end up materialized on disk as needed. In that respect, “materialized views” are automatically generated in RAM by the engine, based on usage and without explicit user intervention. Basically, if you hit a certain area of the data frequently, you can be confident it will end up being cached fairly quickly. A similar process occurs when a user starts sending in “slice and dice” queries. XSPRADA detects this behavior and immediately starts building an OLAP cube internally, possibly aggregating information as needed to answer subsequent queries faster. As such, there is no need to pre-define cubes or cache aggregates explicitly, because these steps are taken automatically based on usage patterns.
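
To make that concrete with a hypothetical example (the table and the materialized-view statement below are generic Oracle-style SQL, not XSPRADA's): in a conventional warehouse, a recurring slice-and-dice pattern like this one is usually answered by hand-building a summary structure up front, whereas here the equivalent aggregate is said to appear in memory automatically once the pattern is detected.

    -- The same aggregate keeps arriving, only the filters change...
    SELECT region, product, SUM(amount)
    FROM sales
    WHERE sale_date BETWEEN DATE '2009-01-01' AND DATE '2009-01-31'
    GROUP BY region, product;

    -- ...which a conventional system would typically anticipate with something like:
    CREATE MATERIALIZED VIEW sales_by_region_product AS
    SELECT region, product, sale_date, SUM(amount) AS total
    FROM sales
    GROUP BY region, product, sale_date;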


Separating engine functions to judiciously place some "closer to the data" for pre-processing queries
There is no need for such logical-physical layer optimizations in the XSPRADA system because the engine is already as close to the bits on disk as it can be. As a matter of fact, the XSPRADA engine reads data directly off the disk store and requires no specific “loading” step. The data is ready for use as soon as a SQL CREATE TABLE (or INSERT INTO) statement is issued.
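
In practice, a session might look like the hypothetical snippet below (plain SQL; the staging table is made up), with no bulk-load utility, CREATE INDEX, or statistics-gathering step in between:

    CREATE TABLE sales (order_id INT, region VARCHAR(32), sale_date DATE, amount DECIMAL(12,2));

    INSERT INTO sales
    SELECT order_id, region, sale_date, amount FROM staging_sales;

    -- queryable immediately
    SELECT region, SUM(amount) FROM sales GROUP BY region;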


Maximizing storage I/O
XSPRADA manages its own disk I/O using a channelized, streamed architecture. The more attached storage you have, the more channels you can support and the faster the I/O becomes. As a matter of fact, the XSPRADA engine scales better with disk space than with CPU cores. The XSPRADA I/O system is streamed rather than random-access based, as many conventional transactional systems are. Through adaptive data restructuring, the XSPRADA engine dynamically paginates the most frequently accessed data into contiguous “islands,” yielding much higher performance when streaming to and from those dedicated pages.


Automatic indexing algorithms either at load time, processing time or both.
The XSPRADA engine requires no indexes, no FK/PK constraints, nor any other standard database constraint typically seen in OLTP systems to pre-define data relationships. XSPRADA indexes the data internally based on inspection and the queries asked. Internal relationships are also inferred automatically based on usage patterns. This is particularly attractive in mixed-workload environments where different areas of the data may have different relationships and/or models; in conventional systems, addressing both “worlds” at the same time in ad-hoc fashion is neither practical nor fast.
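
As a sketch (hypothetical tables, standard SQL), the kind of ad-hoc join that would normally lean on declared keys and supporting indexes is simply issued as-is, and the relationship is left for the engine to discover:

    -- no PRIMARY KEY, FOREIGN KEY, or CREATE INDEX statements anywhere
    SELECT c.segment, SUM(s.amount)
    FROM sales s
    JOIN customers c ON c.customer_id = s.customer_id
    GROUP BY c.segment;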


Proprietary optimization of SQL queries
XSPRADA aggressively optimizes all incoming SQL queries. The SQL is transformed into algebraic expressions, then mathematically optimized, processed and stored as such. As a consequence, the system is more forgiving of poorly tuned SQL queries than others might be. XSPRADA reduces all SQL queries into its internal mathematical model, whose integrity is maintained at all times. Based on your current query and on past history, XSPRADA can determine a more efficient query path than the one your SQL suggests and re-arrange the query accordingly.
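
For example (hypothetical tables, standard SQL), the two formulations below ask the same question, assuming customer_id is unique in customers; an optimizer working on the underlying algebraic expression should reduce both to the same internal form and choose its own execution path regardless of which one you typed:

    -- formulation 1: subquery
    SELECT region, SUM(amount)
    FROM sales
    WHERE customer_id IN (SELECT customer_id FROM customers WHERE segment = 'Retail')
    GROUP BY region;

    -- formulation 2: explicit join
    SELECT s.region, SUM(s.amount)
    FROM sales s
    JOIN customers c ON c.customer_id = s.customer_id
    WHERE c.segment = 'Retail'
    GROUP BY s.region;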


Custom hardware solutions (SMP, grid or MPP based, share-nothing or share-all implementations)
At the moment, XSPRADA runs on multi-core multi-disk SMP commodity hardware. XSPRADA is a pure software play.


Software compression
XSPRADA does not implement any form of compression at the current time. Although it is of course possible to compress volumes using NTFS (on Windows platforms), such a mechanism is independent of, and transparent to, the XSPRADA software.

Now, there is undoubtedly more to this than just a few simple feature/functionality points. But the question remains: what makes XSPRADA technology significantly different or better than others? The answer to that is called ALGEBRAIX, but that’s just a fancy term for some pretty impressive mathematics called extended set theory. In a subsequent post, we’ll explore how that magic works without boring the reader to death :)

Wednesday, February 18, 2009

Group, we have some newcomers here today with us

Until five or six years ago, if you were a decent-sized organization looking at a relational database to run business operations, you were evaluating Microsoft SQL Server, Oracle or IBM DB2 offerings. Sure, there were plenty of other players in the market but those three were the "safe" ones to use and, as the saying went, no one ever got fired for buying from the "Big Three". Competition was scarce and IT budgets were healthy. Life was pretty good, especially on the vendor side.

As analytical and BI applications became more and more prevalent in the industry, it quickly became obvious that the "Big Three" had significant scalability and cost issues when employed in a data warehousing scenario. For one thing, these legacy products were designed for transactional processing, not analytical work, which are two very different use cases. For another, information volumes were growing exponentially, well beyond the software’s capability to keep up. (Keep in mind there hadn't been any significant breakthroughs in relational database technology in nearly 40 years.) What's worse, all these legacy systems still cost big bucks while failing to solve new pain points quickly enough.

To address this, over the past few years, dozens of new companies started offering dedicated analytical database engines. In other words, some very bright people started rethinking and reengineering databases specifically to handle analytics on a massive scale. And some very desperate customers started paying serious attention to these new offerings, using them, and spreading the word. Some of the better-known players include Teradata, Sybase (with its IQ product), Netezza, and Vertica. Today virtually all large enterprises rely on a proprietary analytical database solution to manage their business both strategically and tactically.

If you really want to learn from the best on the topic, suffice it to say that Curt Monash is probably the ultimate authority in this business. His recently posted PowerPoint deck on how to select a data warehouse DBMS (http://www.monash.com/uploads/How-to-buy-data-warehouse-DBMS-February-2009.ppt) is about as clear and informative as anything else out there. It also has an extensive list of "new-breed" players in the market. And anyone seriously following this game should be reading Curt's blog daily at www.dbm2.com.
That being said, let me try and simplify even more.

In one way or another, these "new breed" players all boast some version of a "secret sauce" that makes them uniquely suited to handling terabytes of information, giving them "simpler, faster, cheaper" advantages. At the end of the day, and minus all the marketing hoopla and chest-beating frenzy, these folks more or less make the following arguments:

  • Information stores are growing so fast that conventional OLTP databases cannot keep up with current and projected data volumes.
  • Conventional engines cannot serve analytical needs because they were designed for high-volume operational transactions, where handling whole rows is efficient for write-many/read-few usage patterns.
  • In data warehouse situations, reads are frequent, while writes are not and most reads pull fairly small chunks of data.
  • This being the case, you can't build row-centric databases to address transactional issues and expect them to also double as scalable, performant and affordable analytical engines.

At that point, pretty much every company has a proprietary "secret sauce" to address these issues usually comprising one or more of the following ingredients:

Replace row-based with column-based designs at the logical and storage levels.
This approach lets the software deal with entire columns of data in one fell swoop instead of managing rows. So if you have a 20-column table and a query hitting only 3 of those columns, you can bring back just those 3 columns in one shot, instead of bringing back all the rows and then filtering out the unwanted fields. This is a grossly oversimplified explanation, but that's the general approach taken by the many "columnar" players in this field, including Vertica, InfoBright and Sybase, which was actually the first to market with this approach via Sybase IQ.
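
Concretely (a hypothetical wide table, standard SQL), a query like the one below touches only 3 of the table's 20 columns; a column store reads just those three off disk, while a row store drags every column of every qualifying row through the I/O path first:

    SELECT customer_id, sale_date, amount    -- 3 columns out of 20
    FROM sales
    WHERE sale_date >= DATE '2009-01-01';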

In-memory caching (DIM)
You look at the queries coming in and determine which areas of your data can be cached in memory for better performance. Almost all players do something in this area. It's simply sound design, given that most analytical queries (specifically OLAP) tend to hit a small proportion of the entire information store (once you get into the tens of terabytes).
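
For instance, if a dashboard keeps re-issuing something like the hypothetical query below every few minutes, the recent slice of data it touches is an obvious candidate to keep pinned in memory:

    -- the same narrow, recent slice requested over and over
    SELECT region, COUNT(*), SUM(amount)
    FROM sales
    WHERE sale_date >= DATE '2009-02-15'
    GROUP BY region;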

Move engine functions judiciously "closer to the data" for pre-processing queries
This is one way to try and improve disk I/O, since one of the major bottlenecks in this space is moving bytes around. The CPU-to-disk speed ratio has remained around 1000:1 for decades now. Even with newer SSD technology, the discrepancy is still significant. As the fastest way to do something is to not do it at all, minimizing byte transport yields crucial performance advantages. If you can calculate and generate preliminary results very close to the storage layer and only move those up at the last minute as necessary, you can relieve I/O bottlenecks. Oracle’s Exadata (its analytical offering) apparently makes extensive use of this approach.
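
A hypothetical example of the idea: with a filter as selective as the one below, evaluating the WHERE clause (and the column projection) at the storage layer means only the few surviving rows travel up the stack, instead of entire blocks of raw table data.

    SELECT order_id, amount
    FROM sales
    WHERE sale_date = DATE '2009-02-17'
      AND region = 'EMEA';    -- the filter is applied near the disks, not in the database server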

Maximizing storage I/O
Some companies implement their own disk caching and I/O operations at the O/S level using streaming or block transfer approaches.

Automatic indexing algorithms either at load time, processing time or both.
Most new-breed players do some sort of indexing internally, either during the initial data load or as queries come in. Indexes and relationships can be inferred from the questions being asked and subsequent results optimized. Bitmap indexing on low-cardinality fields is also a well-established technique that can have significant performance advantages, especially in columnar architectures. In large data warehouses, data integrity is typically enforced during the ETL phase. Also, warehouse database schemas are typically not very normalized and rely mainly on surrogate keys. There are few complex relationships to enforce and no real-time external data insertions to validate against them.
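
As an example of the low-cardinality case (Oracle-style syntax, shown purely as an illustration with made-up column names), bitmap indexes let predicates like these be answered by combining compact bit vectors rather than scanning rows:

    CREATE BITMAP INDEX idx_sales_region ON sales (region);    -- only a handful of distinct values
    CREATE BITMAP INDEX idx_sales_status ON sales (status);

    SELECT COUNT(*)
    FROM sales
    WHERE region = 'EMEA' AND status = 'SHIPPED';    -- resolved by ANDing the two bitmaps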

Proprietary optimization of SQL queries
SQL queries for business intelligence can be aggressively optimized, even in an ad-hoc environment. BI queries are typically easy to categorize, analyze, and consequently optimize.
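
Most of them boil down to variations of one recognizable shape, which is part of what makes them so optimizable; a hypothetical star-schema example:

    -- filter a couple of dimensions, join to the fact table, group, aggregate
    SELECT d.calendar_month, p.category, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_date d ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    WHERE d.calendar_year = 2008
    GROUP BY d.calendar_month, p.category;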

Custom hardware solutions
SMP, grid or MPP based, share-nothing or share-all implementations. At odds here are two different “schools” of technical thought: those who feel it is more efficient to distribute processing power among independent networked “nodes” (the MPP shared-nothing approach) and those who believe it is better to share massive processing power and possibly storage (the SMP shared-everything approach). There are technical and cost benefits and drawbacks to both approaches, as is usually the case in the engineering realm. For a great article on this topic, go to http://ianalyze.net/2009/02/shared-nothing-shared-something-shared.html

Software compression
Compression can offer significant storage and load-time benefits. The sweet spot is being able to query compressed streams directly, without paying the CPU overhead of decompression.

Which "secret sauce" or mix thereof will work better? The answer is always a resounding "it depends". It depends on the business case, on the data, on the economics and the time frame. It depends on a lot of things, and there are few "canned" answers in this business. In my next blog, I will examine how XSPRADA fits into this story and what it is they do a little differently in the world of analytical engines.

Wednesday, February 4, 2009

You're switched on!

Last year, a startup database company founded in 2004 asked me if I'd be interested in coming on board as a sales engineer, as they were about to release their first product in the US market. They needed someone with solid technical experience in the enterprise software field. An engineer who had been around the block a couple of times, was totally comfortable around Visual Studio and CVS, and had a clue what “full table scan”, ROLAP, and “petabyte” meant. Someone who could whip up a technical presentation for a CEO at 9am and follow up with a deep dive for a skeptical Chief Architect and his DBAs at 11am. A technologist who loved to engage customers (fairly rare), listen to them, stand in their shoes, and help them out.

Having followed the company's progress since inception, even writing code for it in the early days, I knew the product fairly well. I understood where it came from, what it did, and how it did it. More importantly, I knew enough about current information management challenges to realize the competitive upside was gigantic. The nature of this technology was revolutionary and nothing like it had ever been attempted since the days of E. F. Codd. I had seen early proof-of-concept versions of the software and knew its capability and potential. I knew I could help build and evangelize the product. I knew I could explain its merits, and I knew we could use it to solve major pain points in the market. And that's how, after twenty years of hands-on coding and architecture, I finally re-joined the "dark side" and became a sales and applications engineer once again.

At XSPRADA, I act as both a client-facing and an internal technical resource. My mandate is to work with customers and partners to see how our software can help them solve their business problems. In the process I try to flush out their needs and challenges and use our technology to address them. The most important part of my job is listening. The most rewarding part is the “aha!” moment. In subsequent posts, I’ll dive into what XSPRADA does a little more and explore the burgeoning analytical database business I now live in, who the players are and what they're up to.

Tuesday, February 3, 2009

Let myself introduce...myself.

This will be my first post on my first blog about really big databases, business intelligence, the analytics industry, and the men & women who make it all happen. My name is Jerome. I was born in France, grew up in NYC, did my time in Jersey and now live in Southern California, not too far from the Pacific. The dog picture on this page is my buddy Domino. He's a howling Canaan/wolf mix with a passion for fruit.

I've spent about twenty years in the technology field doing software development, design, architecture, management, and consulting. I started out of college as an applications engineer, but quickly realized that baking bits and messing with Turbo C until the wee hours of the night was more fulfilling to me. I ended up specializing in C, then C++. When Microsoft Windows 2.0 came out, I jumped on the bandwagon and never looked back. I've worked on so many different types of applications and technology areas that I currently have to look at an old resume to remember them all!

In 2001 I picked up C# and .NET when “managed code” and the CLR replaced ATL/COM/C++, and eventually became a hands-on "architect" and .NET expert. I managed engineering teams, and thrived in “agile” environments where early and continuous customer engagement is sacrosanct. I ended up specializing in "plumbing" architecture by building service-oriented platforms with WCF. I always loved to “connect the dots”. Throughout the years I worked for dozens of different companies of all sizes in numerous industries, everything from tiny startups to mega corporations. At one time, I travelled fairly often in and out of the country and enjoyed that a lot. I once did a demo for Mercedes-Benz in Stuttgart, Germany and witnessed the first SL-500 prototype zipping around on their test track. At a Danish company’s annual sales meeting at Hamlet’s pad, Kronborg Castle near Copenhagen, I trained attendees in the use of the new flagship software we had recently minted in the US. (Yes, the place is definitely haunted.) I once flew to Tel Aviv, Israel on three days’ notice to help AOL integrate their software with the ICQ team after AOL had bought Mirabilis for half a billion dollars. In the process, I ran into the Prime Minister at the time, Benjamin Netanyahu, at a Knesset honor award reception for the founders.

I started two software consulting companies.  One back East in Jersey and one in Southern California.  That experience really sharpened my people and business skills.  In the process, I learned how to absorb information quickly, and separate the substance from the nonsense.  I learned how to deal with non-technical people on a regular basis (namely, those who write the checks), how to explain complicated concepts to lay people, and how to market myself in good and bad times.  I learned how to manage crisis situations.  I learned how to navigate effectively across different business areas, and understand their drives and motivations.  In two decades I learned what it takes to build software, deploy on time, manage engineering teams, drive releases, evaluate technical risk, and execute consistently.  The whole software lifecycle from concept to delivery to final payment and, hopefully, repeat business.  Most importantly, I learned how to make clients and employers happy by managing their expectations and always delivering more than promised.  At the end of the day, the software has to work, but more importantly, the client has to look good!

The software business is a tough gig, but building software is a deterministic endeavor.  Many are called, but few are chosen. Those few are smart, honest, consistent, and detail-obsessed.