Range Queries: Is the Bottleneck Seeks or Bandwidth?

Posted on Monday, January 21, 2008

Last time I talked about point queries.  The conclusion was that big databases and point queries don’t mix.  It’s ok to do them from time to time, but it’s not how you’re going to use your database, unless you have a lot of time.  Today, I’d like to talk about range queries, which seem much more useful for the analysis of big databases, say in a business intelligence setting.

Recall that the focus is on the storage engine (a la MySQL) level, and a database on a single disk—the one we are using for illustration is the 1TB Hitachi Deskstar 7K1000.  It has a disk seek time 14ms and transfer rate of around 69MB/s [See tomshardware.com] Now imagine filling the disk with random pairs, each 8 bytes.  So that’s 62.5 billion pairs.

Range Queries

Suppose the above data is stored in a B-tree, and that you’d like to iterate over all the data in order by key.  Further suppose that the B-tree has 4KB blocks.  Thus each leaf has 256 key-value pairs.  The tree has around 244 million leaves.  Before we figure out how long it’s going to take to look at those leaves, we have to understand how aged the B-tree is. 

A B-tree that you get from loading up the data in sorted order in one fell swoop has all the leaves laid out in a relatively nice order, so that sibling leaves tend to be on the same track, and you get lots of leaves for each disk seek.  As the B-tree ages (either from doing insertions & deletions, or from inserting things in a more random order to begin with), the leaves tend to be scattered all over the place.  Let’s assume that the tree is well aged to see how bad things can get.

A range query over the whole data set in our scenario will give a disk seek per leaf, which would take 3.4 x 10^6 seconds (= 244 x 10^6 x 14 x 10^-3 seconds) or 39 days.  For one query.

Now the question is, can you do better?  The first thing to note is that I have made my argument as to how long a B-tree would take, and in particular one that’s aged and has a specific block size.  In the point query case, I made an argument about *any* data structure.  In this case, I can’t.  I can’t say that there isn’t some other data structure that ages well, for example, and thus does better.  For example, in the extreme case where the data has been sorted on disk, it can be read at full bandwidth, in which case it would take 10 hours to do a range query.  That’s still a longish time, but it’s not 39 days, and it means that the B-tree example is using about 1% of available disk bandwidth.

So there are certainly ways to fix this.  Use a bigger block size for the B-tree (and kill insertion performance).  Sort the data before insertion into the B-tree (and kill insertion performance).  The point is that unlike point queries, the natural lower bound is band width, not disk seeks, and the textbook data structure for storing data on disk has poor performance for this operation. 

As I’ve already argued, range queries are the natural query type for large data sets, not least because of the argument above, that point queries can’t even be implemented on disks, but also more positively because large data is useful for large-scale market analysis, where aggregate statistics about the data set are mined.  And that’s implemented by range queries.

Conclusion

Unlike point queries, where performance will only be gained by (expensive!) hardware changes, data structural changes are key for range queries.  In a way, we already know this.  What is a data warehouse if not a database that’s been optimized for range queries at the expense of insertion times. 

In future postings, we’ll see if there are any options to data warehouse for achieving fast range queries (hint: there are).  In the next posting, we’ll look at insertion rates.

Comments (0)

The Trouble with Point Queries

Posted on Friday, January 18, 2008

Insertion and Queries

Databases are complicated beasts, but I’d like to focus on the storage engine, just the part that talks to the storage system, and doesn’t have to worry about SQL, etc.: just transactions, concurrency, compression, updates and queries.  In the next couple of blog entry, I’d like to just focus on updates (insertions and deletions) and queries (point and range).  (This delineation between the
front end and the storage engine is clearly architected in MySQL.) And in particular, I’d like to explore which features of a disk limit performance for which operations.

The question is how fast can these operations go?  Point queries are the slow ones, so let’s start with them first.  Suppose you have data on a disk—say a 1TB Hitachi Deskstar 7K1000.  It has a disk seek time 14ms and transfer rate of around 69MB/s [See tomshardware.com] Now imagine filling the disk with a table, where each row has two 8 byte integers.  So that’s 62.5 billion rows (=1TB/16 bytes).

Point Queries

Now let’s access those data in random order.  I still haven’t said how the data are stored, but the point is that it doesn’t matter.  If you are doing a random point query, you need to do a disk seek.  So to access all the data points, it takes almost 28 years (= 62.5 billion rows * 14 ms/row / 32x10^9 ms/year).  That’s assuming no OS overhead on the accesses.

Let’s just repeat: If you fill up a big disk with small records, it takes years to read the data off the disk.  Maybe the file system doesn’t let you fill up the disk completely.  Maybe 16 bytes per row is too small.  Maybe you have a 1TB disk with a slightly lower seek time.  We started at 28 years, so there’s a lot of wiggle room here, and we’d still be talking about years.  Also recall that this analysis is independent of how the data are stored.  No data-structural or software fixes will solve this problem, or even mitigate it.  You could distribute your data on lots of small disks, and then if you could look into the future of your point-query stream, you could pipeline your queries. 

If you can keep your pipeline up to depth 10, you could have 10 simultaneous disk seeks going on 10 separate disks (and you’d have to go through some rigamarole to make sure that you don’t end up with multiple of these 10 seeks going to the same disk).  So even if you go to all this trouble, and you get lucky, you still have to wait 2.8 years for your point queries to finish.  If you start with something that’s this slow, you need too much parallelization to get any kind of reasonable performance.

Thus, the only way to implement a largish database that has a workload that’s heavy on point queries is with hardware, by which I mean, without hard disks.

Next time we’ll talk about range queries.

Comments (0)

Welcome to my blog

Posted on Thursday, January 17, 2008

Welcome to the first entry of my blog.  My name is Martin Farach-Colton.  I’m co-founder and CTO of a startup called Tokutek.  I’m also a professor of Computer Science at Rutgers University. 

Tokutek is based on a bunch of research I did with Michael Bender (SUNY Stony Brook) and Bradley Kuszmaul (MIT).  The purpose of this blog is—more or less—to explain why we started this company.  What I mean is that we have a new way at looking at memory systems based on a relatively new algorithmic approach called Cache Oblivious Algorithmics.  I’d like to tell you what that means for systems and users.

Our new approach is encapsulated in the products we’ll be producing over the coming months and years.  But this is not the place to talk about our products.  Instead, I’d like to talk about my take on memory systems, and what can be done to improve databases and file systems.

The topics here will be a mix of the practical (when does a DB do a get?), medium (when is a disk like a tape?) and theoretical (what’s the tradeoff between insertion time and query time in a dictionary data structure?).

In short, Tokutek was founded on a technological breakthrough, and as a technologist, which, I suppose, is a fancy word for geek, I’d like to tell you about it.

I hope you enjoy!

Comments (0)