It never fails. Someone manages to break their Pacemaker cluster, and Henrik starts preaching his usual sermon of why Pacemaker is terrible and why you should never-ever use it. And when that someone is GitHub, which we all know, use and love, then that sermon gets a bit of excess attention. Let's take a quick look at the facts.
The week of September 10, GitHub suffered a couple of outages which caused a total downtime of 1 hour and 46 minutes, as Jesse precisely pointed out in a blog post. Exhibiting the excellent transparency that GitHub always offers whenever its infrastructure is affected by issues (remember their role-model behavior in an SSH security incident a few months back), Jesse explains, in a very detailed way, what happened on one of their Pacemaker clusters.
Now, all of what follows is based exclusively on the information in that blog post of Jesse's. I have no inside knowledge of the incident, so my picture may be incomplete or skewed. But here's my take on it anyway. I do encourage you to read Jesse's post full-length, as the rest of this post otherwise won't make much sense. I'll just quote certain pieces of it and comment on them here.
Please note: nothing in this post should be construed as a put-down of GitHub's excellent staff. They run a fantastic service and do an awesome job. It's just that their post-mortem seems to have created some misconceptions in the MySQL community about the Pacemaker stack as a whole, and those I'd like to help rectify. Also, I'm posting this in the hope that it provides useful insight to both the GitHub folks, and to anyone else facing similar issues.
Enable Maintenance Mode when you should
From the original post:
Monday's migration caused higher load on the database than our operations team has previously seen during these sorts of migrations. So high, in fact, that they caused Percona Replication Manager's health checks to fail on the master. In response to the failed master health check, Percona Replication Manager moved the 'active' role and the master database to another server in the cluster and stopped MySQL on the node it perceived as failed.

At the time of this failover, the new database selected for the 'active' role had a cold InnoDB buffer pool and performed rather poorly. The system load generated by the site's query load on a cold cache soon caused Percona Replication Manager's health checks to fail again, and the 'active' role failed back to the server it was on originally. At this point, I decided to disable all health checks by enabling Pacemaker's maintenance-mode; an operating mode in which no health checks or automatic failover actions are performed. Performance on the site slowly recovered as the buffer pool slowly reached normal levels.
Now there are actually several issues in there, even at this early stage. Maintenance mode is generally the right thing to do here, but you enable it before making large changes to the configuration, and you disable it when done. If you're uncomfortable with the cluster manager taking its hands off the entire cluster, and you know what you're doing, you could also just disable cluster management and monitoring on a specific resource. Both approaches are explained here.
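For the record, both approaches are one-liners in the crm shell. Here's a sketch (the resource name `p_mysql` is just an illustrative placeholder, not GitHub's actual configuration):

```shell
# Before the migration: put the entire cluster in maintenance mode.
# Pacemaker stops monitoring and managing all resources.
crm configure property maintenance-mode=true

# ... perform the schema migration ...

# When done: hand control back to Pacemaker.
crm configure property maintenance-mode=false

# Alternatively, take just one resource out of management
# while leaving the rest of the cluster under Pacemaker's control:
crm resource unmanage p_mysql
# ... and re-enable management afterwards:
crm resource manage p_mysql
```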
Also, as far as "health checks failing" on the master is concerned, pretty much the only thing that is likely to cause such a failure in this instance is a timeout, and you can adjust those even on a per-operation basis in Pacemaker. But even that is unnecessary if you enable maintenance mode at the right time.
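To illustrate the per-operation tuning: the monitor operations on a master/slave MySQL resource carry their own timeouts, which you can raise independently for the master and slave roles. A sketch (resource ID, agent, and values are illustrative, not GitHub's actual settings):

```shell
# Hypothetical master/slave MySQL primitive with per-operation
# monitor timeouts; the Master-role monitor gets a more generous
# timeout to ride out load spikes during migrations.
primitive p_mysql ocf:heartbeat:mysql \
    op monitor interval="30s" timeout="60s" \
    op monitor interval="10s" role="Master" timeout="120s"
```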
"Maintenance mode" really means maintenance mode
The following morning, our operations team was notified by a developer of incorrect query results returning from the node providing the 'standby' role. I investigated the situation and determined that when the cluster was placed into maintenance-mode the day before, actions that should have caused the node elected to serve the 'standby' role to change its replication master and start replicating were prevented from occurring.
Well, of course. In maintenance mode, Pacemaker takes its hands off your resources. If you're enabling maintenance mode right in the middle of a failover, then that's not exactly a stellar idea. If you do, then it's your job to complete those actions manually.
I determined that the best course of action was to disable maintenance-mode to allow Pacemaker and the Percona Replication Manager to rectify the situation.
"Best" might be an exaggeration, if I may say so.
A segfault and rejected cluster messages
Upon attempting to disable maintenance-mode, a Pacemaker segfault occurred that resulted in a cluster state partition.
OK, that's bad, but what exactly segfaulted? crmd? attrd? pengine? Or the master Heartbeat process? But the next piece of information would have me believe that the segfault really isn't the root cause of the cluster partition:
After this update, two nodes (I'll call them 'a' and 'b') rejected most messages from the third node ('c'), while the third node rejected most messages from the other two.
Now it's a pity that we don't have any version information and logs, but this looks very much like the "not in our membership" issue present up to Pacemaker 1.1.6. This is a known issue, the fix is to update to a more recent version (here's the commit, on GitHub of course), and the workaround is to just restart the Pacemaker services on the affected node(s) while in maintenance mode.
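That workaround amounts to this (a sketch for the Heartbeat-based stack the post describes; the service name may differ on your distribution):

```shell
# With the cluster already in maintenance mode, resources keep
# running untouched while we bounce the cluster stack on the
# affected node to let it rejoin the membership cleanly.
crm configure property maintenance-mode=true

# On the affected node only:
service heartbeat restart

# Once the node has rejoined, resume normal cluster management:
crm configure property maintenance-mode=false
```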
A non-quorate partition running MySQL?
Despite having configured the cluster to require a majority of machines to agree on the state of the cluster before taking action, two simultaneous master election decisions were attempted without proper coordination. In the first cluster, master election was interrupted by messages from the second cluster and MySQL was stopped.
Now this is an example of me being tempted to say, "logs or it didn't happen." If you've got the default no-quorum-policy of "block", and you're getting a non-quorate partition, and you don't have any resources with operations explicitly configured to ignore quorum, then "two simultaneous master election decisions" can only refer to the Designated Coordinator (DC) election, which has no bearing whatsoever on MySQL master status. Luckily, Pacemaker allows us to take a meaningful snapshot of all cluster logs and status after the fact with crm_report. It would be quite interesting to see a tarball from that.
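Generating such a snapshot is a single command; the time window and destination below are illustrative:

```shell
# Collect logs, the CIB, PE inputs, and node status from all
# cluster nodes for the given window into one tarball.
crm_report -f "2012-09-10 00:00" -t "2012-09-12 00:00" /tmp/github-outage-report
```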
In the second, single-node cluster, node 'c' was elected at 8:19 AM, and any subsequent messages from the other two-node cluster were discarded. As luck would have it, the 'c' node was the node that our operations team previously determined to be out of date. We detected this fact and powered off this out-of-date node at 8:26 AM to end the partition and prevent further data drift, taking down all production database access and thus all access to github.com.
That's obviously a bummer, but really, if that partition is non-quorate, and Pacemaker hasn't explicitly been configured to ignore that, no cluster resources would start there. Needless to say a working fencing configuration would have helped oodles, too.
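For comparison, a minimal fencing setup via IPMI looks roughly like this. Everything here is a placeholder (device addresses, credentials, node names); the point is just that fencing would have powered off the errant node automatically, rather than leaving that decision to a human at 8:26 AM:

```shell
# Hypothetical STONITH device for node 'c' using its IPMI
# management board; params are placeholders.
crm configure primitive st_node_c stonith:external/ipmi \
    params hostname="c" ipaddr="10.0.0.13" userid="admin" passwd="secret" \
    op monitor interval="60s"

# Never run a node's fencing device on the node itself:
crm configure location l_st_node_c st_node_c -inf: c

# And make sure fencing is actually enabled cluster-wide:
crm configure property stonith-enabled=true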
Your cluster has no crystal ball, but it does have a command line
I'll skip over most of the rest of the GitHub post, because it's an explanation of how these backend issues affected GitHub users. I'll just hop on down to this piece:
The automated failover of our main production database could be described as the root cause of both of these downtime events. In each situation in which that occurred, if any member of our operations team had been asked if the failover should have been performed, the answer would have been a resounding no.
Well, you could have told your Pacemaker of that fact beforehand. Enable maintenance mode and you're good to go.
There are many situations in which automated failover is an excellent strategy for ensuring the availability of a service. After careful consideration, we've determined that ensuring the availability of our primary production database is not one of these situations. To this end, we've made changes to our Pacemaker configuration to ensure failover of the 'active' database role will only occur when initiated by a member of our operations team.
That splash you just heard was the bath water. The scream was the baby being tossed out with it.
Automated failover is a pretty poor strategy in the middle of a large configuration change. And Pacemaker gives you a simple and easy interface to disable it, by changing a single cluster property. Failure to do so may result in problems, and in this case it did.
When you put a baby seat on the passenger side of your car, you disable the air bag to prevent major injury. But if you take that baby seat out and an adult passenger rides with you, are you seriously saying you're going to manually initiate the air bag in case of a crash? I hope you're not.
Finally, our operations team is performing a full audit of our Pacemaker and Heartbeat stack focusing on the code path that triggered the segfault on Tuesday.
That's probably a really good idea. For anyone planning to do the same, we can help.
This article originally appeared on my blog on the hastexo.com website (now defunct).