I’ve been waving my hands about lower bounds. Well, sometimes I haven’t been waving my hands, because the lower bounds are tight. But in other cases (lenient insertions, range queries), the lower bounds are very far from what we’re used to.
So now, for a bit of math:
Brodal and Fagerberg showed in 2003 that there’s a tradeoff between insertions and queries. The insertions they consider are lenient. Well, any lower bound for lenient is a lower bound for strict, but they also gave upper bounds, so it matters. Also, they don’t know from lenient, but if you look at their upper bound, they are implementing lenient insertions. The queries they consider are, unfortunately, point queries. That’s too bad for us, because we’ve already seen that point queries are just too slow to be of interest on hard disks.
Still, they have matching upper and lower bounds, so let’s see what they say:
They show that, for any p between 0 and 1, if Query time < 1/p * log_B N then Insertion time > (1/pB^{1-p} log_B N). And visa versa, that is, we get a lower envelope of Query, Insertion times that are possible. Let try p=1. This says that if Query time is at most log_B N (which a B-tree gets), then insertion time is at least log_B N (also what a B-tree gets). So a B-tree is optimal for one point of this tradeoff.
More interestingly, when p=1/2, the query time doubles to 2log_B N. The insertion time is then no less than 2/sqrt{B} * log_B N. Wow. Taking B=4K, sqrt{B} = 64, and we get something that’s 32 times faster than a B-tree. But we’re taking about a lower bound. Maybe it’s not a very good lower bound. Maybe there’s no data structure that does this well.
It turns out there is (please see their paper for more details). Consider what that means: you can speed up insertions by a factor of 32 by slowing down queries by a factor of 2. That’s a pretty favorable tradeoff! Unfortunately, their solution doesn’t do range queries any better than B-trees—in fact, it’s somewhat more cumbersome, though no worse asymptotically than B-trees.
Conclusions
The Brodal & Fagerberg is quite nice. They’ve really captured something interesting about storing data on disk. However, it’s not the result we need, because the tradeoff that I think is interesting “in the field” is one between insertions and range queries, the kind of thing that translates to a MySQL database. But range queries depend on locality on secondary storage, and the typical mathematical models in use for thinking about external memory (the Disk Access Model and the Cache-Oblivious Model, about which more later) don’t capture this sort of locality. They focus on intra-block locality, not inter-block locality. I don’t know of any theoretical results that address the problem we have been considering.
So if you can’t do theory, you do hacks. Next time, we’ll start looking at some of the hacks that people use (or claim to use) to speed up their databases. Which of them play nicely with disks?
Comments (1)
Sorry for the delay, now on to range queries and lenient updates. Let’s call them queries and updates, for short. So far, I’ve shown that B-trees (and any of a number of other data structures) are very far from the “tight bound.” I’ll say a bound is a tight if it’s a lower bound and you can come up with data structure that matches it.
So how do we match the bandwidth bound for queries and updates? I already mentioned in passing how to do this, but let’s look more closely.
Fast Updates
The way to get fast updates is to log them. You can easily saturate disk bandwidth by writing out the insertion, deletion and update requests with no index.
A query now will typically start by sorting the data. Even a point query requires looking at all the data, but a range query requires looking at all the data log times (in order to sort it), or using a large amount of extra storage. Let’s focus on sorting for range queries.
For queries that range over all the data, sorting may be tolerable (after all, you’d be doing a lot better in terms of bandwidth than a B-tree gets over the same query!). This solution does, however, have some drawbacks. First, for smaller queries, where you only look at x% of the data, the fraction of bandwidth you are getting is x/log N, for appropriate base of the log. You can beat this somewhat if you put a lot of work into it. But you’re still going to get very little of your bandwidth.
[2 points if you’ve figured out that some queries don’t actually require sorting the data or using lots of space. 2 more points if you figured out that end users don’t want this kind of noise in their database, where some queries turn out to be unpredictably slow. A final 2 points for figuring out that almost no one wants to learn enough about a data base to be able to predict which queries will be slow.]
Fast Queries
The way to get fast queries is to lay out the data on disk carefully. That is, the best you could do would be to take all your data, sort it, and insert it in order into a sensible database (MySQL + InnoDB will do nicely) that will lay out the leaves more or less in order. Then range queries become linear scans on the disk, and you can get close to bandwidth rates. This is one of the main tricks for getting performance out of a data warehouse. Dumping and reloading data for this purpose is also one of the time-consuming things that DBAs do.
Inserting data now requires batching data, sorting it and inserting it. The time per insertion might not be so bad, but the latency is killer. It is not uncommon for data to become available for query in the data base a day or two after it comes into the system.
Conclusion
As we try to tease apart what’s fast and what’s inherently slow in a database, we get the idea that there’s some basic tradeoff between updates (even lenient ones) and range queries. Is this tradeoff inherent, or is there some way to get bandwidth speeds on both lenient updates and range queries?
Next time we’ll look at some tradeoff lower bounds and try to get closer to some answer.
Comments (0)