Big Honking Databases: Can Rover Roll Over Too?

In this post, I want to stay in gear-head mode and "demo" a couple of neat tricks from the XSPRADA RDM/x analytical engine puppy. I’m going to address quick prototyping capabilities (thanks to schema agnosticism), incremental inserts, and time invariance.

We’re going to do this with some really simple tables and data structures just to highlight the concepts behind the engine’s functionality. So suppose we start with a CSV table called ‘demo.csv’ consisting of the following rows (NOTE: the file must be terminated by CR-LF):

1000,12,"now is the time"

1001,45,"for all good men"

1002,76,"to come to the aid"
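If you need to produce such a file yourself, a minimal Python sketch along these lines writes the three sample rows with explicit CR-LF terminators. It is not part of RDM/x; the function name `write_crlf_csv` and the constant `ROWS` are made up for illustration.

```python
import csv

# The three sample rows from the post.
ROWS = [
    (1000, 12, "now is the time"),
    (1001, 45, "for all good men"),
    (1002, 76, "to come to the aid"),
]


def write_crlf_csv(path, rows):
    """Write rows as comma-separated lines, each terminated by CR-LF."""
    # newline="" keeps Python from translating the terminator we request.
    with open(path, "w", newline="", encoding="ascii") as handle:
        writer = csv.writer(
            handle,
            quoting=csv.QUOTE_NONNUMERIC,  # quote text fields, leave numbers bare
            lineterminator="\r\n",
        )
        writer.writerows(rows)


if __name__ == "__main__":
    write_crlf_csv("demo.csv", ROWS)
```

Opening the file with `newline=""` matters on every platform: without it, Python may rewrite the line endings behind your back.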

Now we want to load this into RDM/x so we use the XSPRADA SQL extension CREATE TABLE FROM as follows:

create table demo(id int,c1 int,c2 char(128)) from "C:\demo.csv"

Then we do a SELECT * on the table just to make sure all is well (for example, using QTODBC or any other ODBC-compliant SQL front-end tool) and we see:

1000 12 now is the time

1001 45 for all good men

1002 76 to come to the aid

And if I look at my schema in the QTODBC object browser I see this as expected:

So this is fine and dandy when you know the actual schema of your data. But what if you don’t, or you’re not sure, or you're out to determine what the best schema might be for your application?

OK, well, what we can do is simply load everything as text types. So now we do:

create table demo(id char(128),c1 char(128),c2 char(128)) from "C:\demo.csv"

The original ‘demo’ table is overwritten with the new schema which now becomes:

And now we can play around with that until we’re satisfied. Assume we suspect the optimal way to work with this data might be using INT for the index, a double precision type for the c1 column, and an 80-char VARCHAR for the last field. We might take our existing table and “flip” it into a new schema (called flipped) as such:

select cast(id as int), cast(c1 as double precision), cast(c2 as varchar(80)) from demo into flipped

Note how we convert the schema into a new one directly into a new table on the fly. Now if we select from flipped we see:

1000 12.0 now is the time

1001 45.0 for all good men

1002 76.0 to come to the aid

And the schema for the new table, as expected, is:

Now if we needed to do some querying on numerics for c1, we could. Note that we didn’t even have to explicitly create the ‘flipped’ table or define its schema. RDM/x took care of that automatically, making these types of operations ideal for quick prototyping. Incidentally, we could have done this directly on the existing table as well. This type of flexibility is pretty cool and allows you to “play” with the data in a trial-and-error mode.
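For readers without RDM/x at hand, the "flip" idea can be roughly approximated in a conventional engine. The sketch below uses Python's built-in sqlite3 module purely as a stand-in: SQLite has no SELECT ... INTO, so CREATE TABLE ... AS SELECT is swapped in, and the helper names `build_text_demo` and `flip_schema` are hypothetical.

```python
import sqlite3

# The sample rows, loaded as text only (the "everything is char" schema).
ROWS = [
    ("1000", "12", "now is the time"),
    ("1001", "45", "for all good men"),
    ("1002", "76", "to come to the aid"),
]


def build_text_demo():
    """Create an in-memory 'demo' table with every column stored as text."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE demo (id CHAR(128), c1 CHAR(128), c2 CHAR(128))")
    conn.executemany("INSERT INTO demo VALUES (?, ?, ?)", ROWS)
    return conn


def flip_schema(conn):
    """Derive a typed 'flipped' table from 'demo' using CAST on each column."""
    conn.execute("DROP TABLE IF EXISTS flipped")
    conn.execute(
        "CREATE TABLE flipped AS "
        "SELECT CAST(id AS INT) AS id, "
        "CAST(c1 AS DOUBLE PRECISION) AS c1, "
        "CAST(c2 AS VARCHAR(80)) AS c2 "
        "FROM demo"
    )
    return conn.execute("SELECT id, c1, c2 FROM flipped ORDER BY id").fetchall()


if __name__ == "__main__":
    for row in flip_schema(build_text_demo()):
        print(row)
```

As in the post, the target table is never declared up front; its columns fall out of the casts in the query.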

Now suppose we come across some additional data and we wish to INSERT this information into our existing database. This is typically what happens with daily incremental loads into a data warehouse. On a periodic basis, chunks of data are added to (typically) fact tables. Our incremental CSV, demo_062209.csv, looks like this:

2000,132,"to be "

2001,465,"or not to be"

2002,786,"that is the question"

There are several ways to handle this using RDM/x. The most direct one is to simply use the XSPRADA “INSERT INTO FROM” SQL extension as follows:

INSERT INTO demo FROM "c:\demo_062209.csv"

In this case RDM/x is told “hey, there is more data being presented to you so, algebraically, union it with the existing data”. As RDM/x does not “load” data in the conventional sense of the term or support bulk loading (as it doesn’t need to), additional data can be presented virtually in real-time with little or no effect on the ability to query the system simultaneously.

This is what’s called “non-disruptive live updates” or concurrent load and query capability. Initially, users are a little surprised that bulk inserts are not supported. But in fact, all that’s needed is “dropping” the incremental data somewhere on disk and telling RDM/x of its existence. RDM/x can ingest this information as fast as it can be written to disk.

Another way to do this is by loading the incremental “chunk” into its own table as such:

create table demo_062209(id int,c1 double precision,c2 varchar(80)) from "C:\demo_062209.csv"

And then doing

select * from demo union select * from demo_062209 into demo

A bit of subtlety there: when you do this, you’re effectively “updating” the existing demo table with the unioned results of the incremental. So you take the old demo table, add the incremental, then flip that back into the original demo table. You could have saved the “old” demo table first before doing this as such:

select * from demo into demo_current
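To make the sequence concrete, here is a rough analogue of "save the old table, then union the incremental back into demo" in a conventional engine, again using Python's sqlite3 as a stand-in rather than RDM/x. SQLite cannot select into the table it is reading from, so the sketch goes through a temporary table (`demo_merged`, a name invented here); `make_tables`, `union_into_demo` and `count` are likewise hypothetical helpers.

```python
import sqlite3

BASE_ROWS = [
    (1000, 12.0, "now is the time"),
    (1001, 45.0, "for all good men"),
    (1002, 76.0, "to come to the aid"),
]
INCREMENTAL_ROWS = [
    (2000, 132.0, "to be "),
    (2001, 465.0, "or not to be"),
    (2002, 786.0, "that is the question"),
]


def make_tables():
    """Create 'demo' and the incremental 'demo_062209' in memory."""
    conn = sqlite3.connect(":memory:")
    for name, rows in (("demo", BASE_ROWS), ("demo_062209", INCREMENTAL_ROWS)):
        conn.execute(
            f"CREATE TABLE {name} (id INT, c1 DOUBLE PRECISION, c2 VARCHAR(80))"
        )
        conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?)", rows)
    return conn


def union_into_demo(conn):
    """Snapshot 'demo' as 'demo_current', then rebuild 'demo' as the union."""
    conn.execute("DROP TABLE IF EXISTS demo_current")
    conn.execute("CREATE TABLE demo_current AS SELECT * FROM demo")
    conn.execute(
        "CREATE TABLE demo_merged AS "
        "SELECT * FROM demo UNION SELECT * FROM demo_062209"
    )
    conn.execute("DROP TABLE demo")
    conn.execute("ALTER TABLE demo_merged RENAME TO demo")
    conn.commit()


def count(conn, table):
    """Return the number of rows in the given (trusted) table name."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


if __name__ == "__main__":
    conn = make_tables()
    union_into_demo(conn)
    print(count(conn, "demo"), count(conn, "demo_current"))
```

The extra drop-and-rename steps are exactly the bookkeeping that the single SELECT ... INTO statement above spares you in RDM/x.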

Notice how these “dynamic” tables really behave like variables in a loosely-typed programming language. They are conceptually related to conventional database “views” (RDM/x does support view objects in the database as well, but only in a purely semantic sense). As a matter of fact, all tables and views in RDM/x are essentially the same thing and materialized (to disk and/or memory) on a JIT basis anyway.

Of interest here is that you can always recover past instances of any data in RDM/x via a nifty little feature called “time invariance”. This feature is unique in the world of databases to the best of my knowledge. Essentially, you can query RDM/x much like a “time machine”, asking it to yield results as if the question were being asked in the past.

So suppose you had inserted your incremental into the ‘demo’ table by mistake and suppose your insert had occurred at a given time such as '2009-06-22 15:13:52.775000000' (and you can tell this if you are logging your SQL statements to the database using the usrtrace option of the RDM/x ODBC driver). You can always recover the state of ‘demo’ at the time by issuing:

SELECT * FROM demo AT TIMESTAMP '2009-06-22 15:13:52.775000000' into recover

Now your ‘recover’ table will contain ‘demo’ exactly the way it was at that time.

The implications are far-reaching because you can essentially always query the RDM/x database at any point in the past. RDM/x never deletes information UNLESS the information has been explicitly deleted using a DROP TABLE or DELETE type of statement AND the garbage collector kicks in. Short of that, the database maintains information and integrity through time.
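One way to picture time invariance is as an append-only store in which every row carries the time it arrived, so a query "as of" some instant simply ignores anything stamped later. The Python toy below is only a mental model and says nothing about how RDM/x works internally; the class `VersionedTable` and its methods are invented for illustration. The post does not say whether AT TIMESTAMP includes rows stamped exactly at the given instant, so the sketch assumes an inclusive bound and queries a moment before the mistaken insert.

```python
from datetime import datetime


class VersionedTable:
    """Toy append-only table: rows are stamped on arrival, never overwritten."""

    def __init__(self):
        self._rows = []  # list of (inserted_at, row) pairs

    def insert(self, rows, at):
        """Append rows, remembering the time they were presented."""
        self._rows.extend((at, row) for row in rows)

    def select(self, as_of=None):
        """Return all rows, or only those present at or before 'as_of'."""
        if as_of is None:
            return [row for _, row in self._rows]
        return [row for stamp, row in self._rows if stamp <= as_of]


def build_demo():
    """Load the original rows, then the 'mistaken' incremental later on."""
    demo = VersionedTable()
    demo.insert(
        [(1000, 12, "now is the time"),
         (1001, 45, "for all good men"),
         (1002, 76, "to come to the aid")],
        at=datetime(2009, 6, 22, 15, 0, 0),  # assumed time of the first load
    )
    demo.insert(
        [(2000, 132, "to be "),
         (2001, 465, "or not to be"),
         (2002, 786, "that is the question")],
        at=datetime(2009, 6, 22, 15, 13, 52, 775000),  # timestamp from the post
    )
    return demo


if __name__ == "__main__":
    demo = build_demo()
    recover = demo.select(as_of=datetime(2009, 6, 22, 15, 13, 52))
    print(len(demo.select()), len(recover))
```

Because nothing is overwritten, "recovering" the earlier state is just a filtered read, which is the intuition behind the AT TIMESTAMP query shown earlier.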

These are just a couple of nifty tricks from one original database engine called RDM/x.

7 comments:

Justin Swanhart, June 22, 2009 at 2:24 PM
Can you explicitly add foreign keys or indexes at table creation time?

Are unique constraints supported?

What happens if there is an error during the SELECT .. INTO? Are such inserts ACID or will I get partial data in a table if the system crashes?

Are custom or composite data types available?
Justin Swanhart, June 22, 2009 at 2:25 PM
Oh, and is there a way to use non CRLF terminated data? I just produced 100GB of raw data that I can't use otherwise :D
Unknown, June 22, 2009 at 2:29 PM
Hey Swany,
No indexes, no PK/FK needed or definable at DDL time. Neither are unique constraints. UDTs are not supported. No data is moved on a select into, only the "equations" are updated in the algebraic cache until materialization is needed. You will not get partial data.
Unknown, June 22, 2009 at 2:44 PM
@Swany -- currently CSV files have to be comma-separated and CRLF terminated. How did you produce your data?
Justin Swanhart, June 22, 2009 at 3:27 PM
I produced them with SELECT ... INTO OUTFILE on a linux MySQL daemon.

I can reproduce the file with the LINES TERMINATED BY '\r\n' clause to get it terminated for windows.
Unknown, June 22, 2009 at 3:29 PM
@Swany -- yeah that would be the best route. Email me jerome.pineau@xsprada.com if you need any help setting up/playing with RDM/x.
Unknown, June 22, 2009 at 3:59 PM
As a general statement, I will be addressing the XSPRADA approach to ACID in my next blog. Stand by, it will be interesting :)

Monday, June 22, 2009
