How Fast Can Updates Run?

Posted on Monday, February 11, 2008

Last time, I introduced the notion of strict and lenient updates.  Now it’s time to see what the performance characteristics are of each.

Just to rehash, we are focusing on the storage engine (a la MySQL) level, and we are looking at a database on a single disk—the one we are using for illustration is the 1TB Hitachi Deskstar 7K1000.  It has a disk seek time 14ms and transfer rate of around 69MB/s [See tomshardware.com] We will insert and delete random pairs, each 8 bytes.  So that’s 62.5 billion pairs to fill the disk.

Strict Updates

These are the easier update types to analyze.  Please review the definition of strict updates from the last blog entry.  Now notice that each insertion or deletion requires a point query.  For example, during an insertion, in order to determine if there’s already a row with a particular key value in the database, one must look up that key.  In order to tell how many rows are removed during a delete operation (if any), one must look up the key being deleted.  There’s no getting around the fact that the disk head needs to move to the location on disk of the information being returned.

Since these point queries are not the main point of what’s going on, and since many people might not realize that they are even happening, I’m going to call them cryptogets (the get because a point query is often referred to as a get at the storage engine level).

Since no strict update can be faster than a point query, filling up the disk with data takes at least 28 years.  And once again, this is independent of the data structure used.

Lenient Updates

What about lenient updates?  The only lower bound is the 10 hour bandwidth lower bound we’ve already talked about.  Ten hours is the time it takes to write the data out sequentially on disk.  And certainly, if you don’t need to return any information at the time of an update, you could just log the updates, which is to say, you could write them out sequentially to disk.

That makes queries slow, but the point is that a log is a (very simple) data structure that matches the disk-bandwidth lower bound.

What about B-trees?  A B-tree index will do (at least) one disk seek per update.  That means that it will take at least 28 years to finish filling the disk.  This ratio is even worse than B-trees achieved for range queries.

[Yes, yes, I know, what about presorting the data before inserting?  Won’t that make things much faster?  Yes, it will, but that’s the subject of another blog entry.]

Conclusions

B-trees do a disk seek per insertion during which they swap in all the information needed to implement a strict update.  They don’t get any performance benefit from implementing lenient updates.  I believe that this is the reason that people don’t make the distinction between different classes of B-trees.

So far I’ve focused on lower bounds.  In future blog entries, I’ll talk about algorithms for actually implementing lenient updates.  Ok, here comes the punch line: lenient updates stomp all over strict updates, performancewise.

But first, we’ll look at the tradeoff between updates and range queries.  Stay tuned.

Comments (0)

Updates & Discipline

Posted on Tuesday, February 05, 2008

So far, I’ve analyzed point and range queries.  Now it’s time to talk about insertions and deletions.  We’ll call the combination updates.  Updates come in two flavors, and today we’ll cover both.

Depending on the exact settings of your database, the updates give a varying amount of feedback.  For example, when a key is deleted, all rows with that key are deleted (assuming the database allows duplicate keys).  The normal behavior is to return the number of rows deleted.  The normal behavior when deleting a key that has no corresponding rows in the database is to return an error message.  On insertion, one can allow duplicate or not.  In the latter case, the storage engine returns an error message if a duplication insertion is attempted. 

We’ll see that the details of error messages have a profound influence on the lower-bound arguments I’ve been making (and we’ll see a bit later that they have a profound influence on how quickly you can actually implement these operations).  Here, we define two classes of update operations:


Strict Updates

These are the insertions and deletions that come by default in MySQL, and many other databases. 


  • Insertions let you know if there’s already a row with that key in the database.

  • Unique-key indices: Returns an error on the insertion of a repeated key.

  • Deleting a key will tell you how many rows had that key.

  • Deletion of a non-existing key will return an error.

As I say, these are the types of updates that we are all used to.

Lenient Updates

The alternative to strict updates is lenient updates.  In a sense, they are equivalent, in that there’s nothing you can do with one that you can’t do with the other.  It’s just going to turn out that lenient updates can be made to run much faster.


  • Insertions do not let you know if there’s already a row with that key in the database.

  • Unique-key indices: Overwrites with new value on the insertion of a repeated key.  Does not return an error.

  • Deleting a key will not tell you how many rows had that key.

  • Deletion of a non-existing key will not return an error.

Conclusions

What’s the upshot?  I’ve been presenting operations in terms of lower bounds that are dominated by disk seeks (slow!) versus those dominated by disk bandwidth (much faster).  See if you can use this analysis to understand the behavior of these two types of updates.

Comments (0)