Tuesday, April 8, 2008

PostgreSQL MVCC and VACUUM

PostgreSQL uses MVCC (multi-version concurrency control) to provide transactional semantics. In the simplest form, MVCC lets each transaction work on a different copy of a row in a table. What it means is whenever a row is updated, a new version of the row is created. The existing readers continue to see the old version of the row whereas new transactions see the new version. This is rather a very simplified presentation of MVCC. In the real world, there can be many versions of the row and a transaction can see one of those versions. The visibility depends on the transaction start time and the transaction isolation level.

The great advantage of MVCC is that readers don't wait for the writers to finish their work. Similarly, writers don't wait for the readers because they work on independent versions of the row. The downside is though that the table may get bloated with the multiple versions of a row. Sooner or later all but one version become invisible to all current or new transactions. The invisible versions are said to be DEAD rows. A maintenance activity called as VACUUM is required to recover these dead rows. The VACUUM scans the entire table, collect the DEAD rows and removes the corresponding index entries. This requires multiple passes over the table. As you can imagine, this can conflict with the normal database operations and can considerably slow down the operations, especially for heavily updated, large tables.

One may argue why we just don't remove the DEAD rows as and when we find them during normal database operations. The problem with that is we also need to cleanup the corresponding index entries. Otherwise the index entries may point nowhere, thus creating dangling references or even worse may point to unrelated rows (if the DEAD row is replaced by some other row), thus giving wrong results.
Finally, it's not possible to reach to the index entries from the heap tuples, making it extremely difficult to remove the dead rows easily.

PostgreSQL has the facility of AutoVacuum which is like a daemon service which periodically checks the tables and automatically vacuums them based on certain configuration parameters. This greatly simplifies the maintenance task.

You can find PostgreSQL documentation here.