Cool Technology that's been around for a while

I'm adding this entry at 35,000 feet. I'm on the way back from a customer meeting and the first thing I see as I enter the plane is a large WiFi sticker on the slide of the plane. It turns out that Delta now has 1800+ planes flying with WiFi enabled. I paid $12.95 and now I'm able to anything on my laptop that I could have done back home (although I must admit I'm quite a bit more cramped). This may not be that new, but its the first time I've had a chance to do this and I think it is pretty cool.
The other thing I think is pretty cool that I wanted to share is Informix support for time series data. Infomrix is the only relational database that has access methods and functions designed specifically for time series data. In case you were wondering, a time series is a set of related data that changes over time. For instance the stock trades for IBM, or the smart meter energy usage, over time. In a way you can think of a time series as an array only instead of asking for the 1st, 2nd, and 3rd records in the array you can ask for the Jan 1st, Jan 2nd, and Jan 3rd items in the array - in other words an array accessed by time. Typical access to time series data is by time range, meaning queries tend to look at all the data in a time range before moving to another time series.
Informix has taken advantage of this and built storage which is optimized for this kind of access. We cluster data for a particular time series to minimize the number of I/O's needed to retrieve data for a particular series. This means that if you want all the IBM stock trades for Jan 1st, 2nd and 3rd we insure that those pieces of data are clustered together on the same physical disk page.
The other thing we do is a sort of compression to insure the data is small as possible. One of the things we found is that quite often time series data can be sparsely populated. Because of this we only create pages for time series where there is actually data. For instance, if you had stock data for IBM for 2007 and 2009 but not 2008 we will not reserve any space for 2008. Later when that data becomes available we will add it into the series.
Another way we save space is that we do not store any NULL value columns. If you define a time series to hold 6 columns of data and you enter a record that has 2 columns that contain NULL those NULLs will not take any space. This is because we add a header to every record that indicates which columns (if any) are NULL. Relational databases use a value to indicate NULL typically, which means NULLS take the same amount of space as non-NULL values. Since NULL columns tend to be common in time series this can lead to a lot of space savings.
We also save space by not storing the timestamps if we don't have to. We can do this because it is very simple to calculate the timestamp of intervalized data. For example if you are storing smart meter data that comes in every 15 minutes all you need to know is the timestamp of the first entry in the time series. After that it is very ease to do the datetime math to calculate the timestamp for every other entry in the series.
To finish up with space savings, there is also space save by not requiring an "id" attached to each record. For example, if you had a time series of stock data and used a standard relational schema you would have to add some sort of stock id to each record so that you determine what stock the data belogned to. With our time series approach this is not required. We would store the "id" once and then every record would be associated with that entry and not need to have an "id" attached.
We have just completed a POC for a smart meter company and one of the things that drove them to look at Informix and time series was this space savings. In their case they have 3.5 million smart meters each generating a record every 15 minutes, and they want to save 25 months worth of data. Doing the math you get:
3,500,000 * 96 intervals/day * 760 days = 255,360,000,000 (about 255 billion records!)
Right off the bat we are going to save the size of an id which is 8 bytes, the size of a timestamp which is 11 bytes, so 19 bytes times 255 billion is quite a savings. This is pretty much what we saw at the POC - we used about 1/3 the space that Oracle did.
The other savings will be in the number of rows that have to be managed. In the case of time series there is one huge row per meter - so 3.5 million rows vs 255 billion rows. Index maintenance, statistics maintenance, storage management all become quite a problem.
All in all I think this technology is pretty cool and something which we should use more often. In another entry I will go into some detail about the query side of time series which is equally a good story. I'll also talk about how the time series fits with BI queries.
- 452 reads
-
- Feed: Kevin Brown
- Original article













