Pacemaker's best-kept secret: crm_report
Posted on Sat 30 January 2016 in hints-and-kinks • 3 min read
Whenever things in Pacemaker go wrong (say, resource failover doesn’t work as expected, or your cluster didn’t properly recover after a node shutdown), you’ll want to find out exactly why that happened. Of course, the actual reason for the malfunction may be buried somewhere deep in your cluster configuration or setup, so you might need to look at quite a few different sources to pin it down.
Sometimes, too, you want to enlist the help of a colleague, or maybe even our help, to get to the bottom of the issue. And sometimes it’s not practical to give someone access to the system just to trigger the problem and watch what breaks.
Thankfully, Pacemaker ships with a utility that helps you collect
everything you or someone else might need to look at, in a simple,
compact format. Unfortunately, few people, including even long-time
Pacemaker users, know that it exists: it’s called crm_report.
Running crm_report
crm_report’s command syntax is quite simple. You just tell it
how far in the past you want the report to start, and which directory
you want to collect data in:
crm_report -f "2016-01-25 00:00:00" /tmp/crm_report
The directory you specify must not exist. If it does, crm_report
will refuse to run, rather than clobber or mess up your existing
report data.
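Because of that restriction, it can help to derive a fresh destination name for every run. The following is a minimal sketch, not something Pacemaker provides: fresh_report_dest is a hypothetical helper name, and the timestamp suffix is just one possible naming convention.

```shell
# fresh_report_dest is a hypothetical helper, not something Pacemaker ships.
# It prints a timestamped destination path and fails if that path (or the
# tarball crm_report would create next to it) already exists.
fresh_report_dest() {
    base="${1:-/tmp/crm_report}"
    dest="${base}-$(date +%Y%m%d-%H%M%S)"
    if [ -e "$dest" ] || [ -e "${dest}.tar.bz2" ]; then
        echo "refusing to reuse existing path: $dest" >&2
        return 1
    fi
    printf '%s\n' "$dest"
}

# Example use (commented out so that sourcing this snippet has no side effects):
# dest="$(fresh_report_dest)" && crm_report -f "2016-01-25 00:00:00" "$dest"
```

The check is only a convenience: crm_report itself still has the final say on whether it accepts the destination.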
By analyzing your logs all the way back to a start date you specify,
crm_report makes it unnecessary for you to actually try to
reproduce the problem. All you need is a rough idea of when the issue
occurred, and then you give crm_report a timestamp a little earlier
than that as its start date.
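If you would rather compute that earlier timestamp than work it out by hand, something along these lines might do. This is a sketch under assumptions: report_start is a hypothetical helper name, the 30-minute default margin is arbitrary, and the -d option requires GNU date.

```shell
# report_start is a hypothetical helper, not part of Pacemaker. It prints a
# timestamp a number of minutes before the given one, in the same
# "YYYY-MM-DD HH:MM:SS" form used for -f in this post.
# Assumes GNU date (for the -d option).
report_start() {
    incident="$1"           # when you think the problem occurred
    margin_min="${2:-30}"   # lead time in minutes; 30 is an arbitrary default
    epoch="$(date -d "$incident" +%s)" || return 1
    date -d "@$((epoch - margin_min * 60))" "+%Y-%m-%d %H:%M:%S"
}

# Example use (commented out; needs a Pacemaker node and a destination
# directory that does not exist yet):
# crm_report -f "$(report_start "2016-01-25 01:15:00")" /tmp/crm_report
```

Going through epoch seconds avoids relying on how date parses relative expressions that follow a clock time.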
You can also specify the end of the period you’re interested in. Suppose you know the exact 10-minute window in which the problem occurred. In that case, you could run:
crm_report -f "2016-01-25 01:15:00" -t "2016-01-25 01:25:00" /tmp/crm_report
Either way, crm_report will collect relevant log data for the
specified time window on the host it is run on, and then connect to
the other cluster nodes (via ssh) and do the same there. The latter
behavior can be disabled by adding the -S or --single-node option,
but there usually isn’t a good reason to do that. In the end,
everything will be rolled into one tarball at
/tmp/crm_report.tar.bz2.
You can then pull the report tarball off the node (with scp,
rsync, whatever you prefer), and then share it with whoever needs
it. Note that the tarball can contain sensitive information such as
passwords, so be careful whom you share it with.
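Before sharing, a quick keyword scan of the unpacked report can point you at files worth a closer look. This is only a rough sketch: list_sensitive_files is a hypothetical helper name, the keyword list is an assumption, and the paths in the usage comment are illustrative.

```shell
# list_sensitive_files is a hypothetical helper, not part of Pacemaker.
# It prints the files under an unpacked report directory that mention
# common credential keywords (case-insensitive). The keyword list is an
# assumption; extend it for your environment.
list_sensitive_files() {
    report_dir="$1"
    grep -rliE 'passw|secret' "$report_dir"
}

# Example use (commented out; the paths are illustrative):
# mkdir /tmp/review && tar -xjf /tmp/crm_report.tar.bz2 -C /tmp/review
# list_sensitive_files /tmp/review
```

A match is only a hint for manual review, and an empty result does not prove the report is free of secrets.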
What’s in a crm_report tarball?
There’s a bunch of truly helpful information in a crm_report-generated
tarball. Depending on how your cluster is configured and
what problems were detected, it will contain, among other things:
- your current Pacemaker Cluster Information Base (CIB),
- your Corosync configuration,
- Corosync Blackbox output (if qb-blackbox is installed on your cluster nodes; you can read more about blackbox support here),
- drbd.conf and all your DRBD resource configuration files (if your cluster runs DRBD),
- sysinfo.txt, a text file including your kernel, distro, Pacemaker version, and version information for all your installed packages,
- your Syslog, filtered for the time period you specified in your crm_report command invocation,
- diffs for critical system information, if crm_report detected discrepancies between nodes.
In other words, it contains pretty much everything that needs to be shared in a critical troubleshooting situation.
Why isn’t this more widely known?
To be perfectly honest, we have no idea. crm_report has been in
Pacemaker for years, and even prior to its existence, there was a
predecessor named hb_report. It’s an extraordinarily useful utility,
yet when we ask customers to send a crm_report tarball during a
Pacemaker troubleshooting engagement, the usual response is, “a
what?”
We hope this post makes crm_report known to a wider audience, so it
gets the love it deserves.
This article originally appeared on the hastexo.com website (now defunct).