How to Troubleshoot PostgreSQL Alerts

So you have your PostgreSQL application deployed on Engine Yard Cloud and everything is going great. You have enabled a few extensions, have added basic redundancy by spinning a database replica, and are busy developing new features. One day though, you look at the dashboard and see this message:

What do these alerts mean? Is the database at risk? Should you escalate to support? This post will help you understand PostgreSQL dashboard alerts and correlate them to the health of your database and application.

Monitoring and the checkpoint check

The alerts I showed you popped up in one of our mission-critical applications. This blog discusses the steps I performed to troubleshoot the cause of the problem and the resolution. But first a little bit of background.

We monitor the health of your PostgreSQL database using a combination of our own custom checks and Bucardo's check_postgres scripts. I'll wave a big wand here and tell you that either Collectd or Nagios (depends on your stack and features) consume the results of these checks and present them to the Engine Yard dashboard.

The following documentation page provides an explanation of the alerts Engine Yard issues for PostgreSQL. Today I'll focus only on the alert I received but refer to the documentation if you see something different in your application's dashboard.

Let's take a closer look at the message:

POSTGRES_CHECKPOINT CRITICAL: Last checkpoint was 16204 seconds ago

I know from this message that the_checkpointcheck originated the alert and the severity of the alert iscritical_. In human talk, the message means that the database has not had a checkpoint for about 4.5 hours! Here is another example:

POSTGRES_CHECKPOINT WARNING: Last checkpoint was 1265 seconds ago

This message means that the database has not had a checkpoint for about 21 minutes.

We issue a WARNING severity when checkpoint delays range from 20 to 30 minutes. For anything that exceeds 30 minutes, the severity of the alert goes to CRITICAL. A checkpoint is a point in the transaction log sequence at which all data files have been updated to reflect the information in the log and flushed to disk. If your system crashes recovery will start from the last known checkpoint. So the checkpoint check helps us confirm two things: that your database consistently takes forward the position in which recovery is started, and in the case of replicas that your standby is keeping up with its master (since the activity the replica sees is what the master has sent it).

For more information about checkpoints and replication, please refer to the Postgresql replication write-ahead-log (WAL) documentation.

Back to my App

Now we understand that the alert I received means that there was a problem with the database replica and its ability to checkpoint. The database logs showed nothing out of order, so I logged into the server console and discovered the following:

# psql
psql (9.1.9, server 9.1.3)
Type "help" for help.

The psql prompt showed me that there was a version mismatch between the database server binaries and the running psql process. This typically happens after a stack update (that includes a minor version bump of your database) is applied on a running environment, and the database process is not restarted. The database server is left in a state where its effectively running two versions at the same time. To ensure that the postgresql process is running the latest version of the database, you__MUST always restart the database process after upgrading your environment.

The stack and version update was absolutely necessary as it included critical security patches - See April 4, 3013 - PostgreSQL security update. But when the stack was applied, the person who applied the stack update didn't restart the postgresql process as outlined in the upgrade instructions. This is typically not a problem, as replication is known to work between patch level versions, but we hit a replication bug in the 9.1.3 to 9.1.9 upgrade which caused replication to break.

Our Solution

So in a nutshell, our database replica became unable to receive WAL archives from its master, checkpoints started falling behind, and we were alerted. Restarting the database process would have solved the problem but instead we decided to utilize the maintenance window to upgrade the server to PostgreSQL 9.2 and create a new replica.

I performed an in-place upgrade of the database master (something that professional services has a lot of experience with) and within minutes the application was back online running the latest version of PostgreSQL.

But troubleshooting this alert made me aware of the issues with our current upgrade process:

  • Documentation on alerts was lacking. There is no place to quickly look up the alerts we present in the UI and their meaning.`
  • Our upgrade message did not remind us to restart the database process (though the release notes did).`
  • An unexpected replication bug between patch versions caused my database replica to become stale.`

Here is what we we'll do to make sure you don't experience the problems I did last week.

New PostgreSQL alert documentation

PostgreSQL alerts will be explained in a new documentation page. We'll work on documenting MySQL and Riak alerts as well.

Improved stack upgrade messages

We will enhance stack release notes with icons to visually indicate if a process restart is needed when a new version of a database is available.

Ability to lock your database version

Without a doubt, we want customers to keep their database stacks up to date with security releases and patches. But it would be fantastic to be able to lock your entire database version (to the patch level) and still receive stack updates.

We have developed (and are internally testing) a toggle to lock your database version. With this feature, I can schedule a maintenance window (to restart the database process when I'm ready) while continuing to receive stack updates. We are still working on documentation but if this feature is something that interests you, please open a support ticket and let me know. It should be in limited access soon!

Hopefully now you have a little more context and information available to interpret the alerts we display in your environment. Exciting things are happening in the Engine Yard's Data stack (think new clusters!). A little hint for the curious, Tasha Drew's excellent weekly recap of engineering always includes juicy details on what we are up to ;)