Both appear to be functionally equivalent
You are giving an example of the situation where everything is working. The point of monitoring is to figure out when things go wrong. I've seen too many cases where /proc/mounts showed nothing for the Lustre target even though the Lustre service wasn't really stopped. Lustre lives in the kernel and can screw things up like that.
What is required is a single RA that can start and stop an individual Lustre service, and monitor that service to determine if the underlying hardware might be in use.
It is potentially disastrous for the monitor function to tell Pacemaker that the Lustre resource isn't running when in fact it might still be writing to the storage.
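Just to make the shape of that concrete, here is a minimal sketch of a per-target RA in Python. The device, mountpoint, parameter names and the plain mount/umount calls are illustrative assumptions, not a finished OCF agent:

#!/usr/bin/env python3
# Rough sketch only: one RA per Lustre target, with start/stop/monitor
# actions.  Device, mountpoint and the mount/umount invocations are
# illustrative assumptions, not a finished OCF agent.
import os
import subprocess
import sys

OCF_SUCCESS, OCF_ERR_GENERIC, OCF_NOT_RUNNING = 0, 1, 7

# Pacemaker passes resource parameters as OCF_RESKEY_* environment variables.
DEVICE = os.environ.get("OCF_RESKEY_device", "/dev/mapper/ost0")
MOUNTPOINT = os.environ.get("OCF_RESKEY_mountpoint", "/mnt/ost0")

def start():
    # Server targets are started by mounting them with fstype "lustre".
    rc = subprocess.call(["mount", "-t", "lustre", DEVICE, MOUNTPOINT])
    return OCF_SUCCESS if rc == 0 else OCF_ERR_GENERIC

def stop():
    rc = subprocess.call(["umount", MOUNTPOINT])
    return OCF_SUCCESS if rc == 0 else OCF_ERR_GENERIC

def monitor():
    # The hard part: deciding when it is actually safe to report
    # OCF_NOT_RUNNING.  See the cross-check sketch further down.
    return OCF_ERR_GENERIC

if __name__ == "__main__":
    action = sys.argv[1] if len(sys.argv) > 1 else "monitor"
    handlers = {"start": start, "stop": stop, "monitor": monitor}
    sys.exit(handlers.get(action, lambda: OCF_ERR_GENERIC)())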
Lustre server services are not the same as client mounts. If a client mount doesn't appear in /proc/mounts, there is little chance that anything can ever use it again on that system. With a Lustre service, using "mount" to start the servers has always been a little bit hackish (not that there weren't perfectly good reasons for going that route). Because Lustre lives in the kernel, there are too many ways for the mount entry to disappear while the Lustre service is still working with the drive.
Granted, mntdev might not be perfect either.
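As a rough illustration of the kind of cross-check I have in mind, a monitor could refuse to declare the target stopped unless both /proc/mounts and the OSD's mntdev view agree. The osd-*.*.mntdev parameter path and the return-code mapping here are assumptions; this is a sketch, not a finished monitor:

#!/usr/bin/env python3
# Hedged sketch: cross-check /proc/mounts against the OSD's own view of its
# backing device before telling Pacemaker a Lustre target is stopped.
import subprocess
import sys

OCF_SUCCESS, OCF_ERR_GENERIC, OCF_NOT_RUNNING = 0, 1, 7

def mounted_in_proc_mounts(device, mountpoint):
    """True if the target appears as a 'lustre' mount in /proc/mounts."""
    with open("/proc/mounts") as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 3 and fields[0] == device \
                    and fields[1] == mountpoint and fields[2] == "lustre":
                return True
    return False

def device_held_by_osd(device):
    """True if any Lustre OSD still reports the device as its mntdev."""
    try:
        out = subprocess.run(["lctl", "get_param", "-n", "osd-*.*.mntdev"],
                             capture_output=True, text=True, check=False)
    except FileNotFoundError:
        return False  # no lctl at all: modules almost certainly not loaded
    return device in out.stdout.split()

def monitor(device, mountpoint):
    if mounted_in_proc_mounts(device, mountpoint):
        return OCF_SUCCESS
    if device_held_by_osd(device):
        # The mount entry is gone but the OSD still holds the device: do NOT
        # report "not running", or Pacemaker may try to mount it elsewhere.
        return OCF_ERR_GENERIC
    return OCF_NOT_RUNNING

if __name__ == "__main__":
    sys.exit(monitor("/dev/mapper/ost0", "/mnt/ost0"))  # example values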
There might be reasons to have extra, completely advisory healthLUSTRE RAs. I'm not passing judgement on that. But having those does not excuse us from having proper Lustre service RAs as well. The goals of the "monitor" function in a Lustre service RA will necessarily be different from the needs of broader Lustre system monitoring. It might be argued, though, that that sort of monitoring should be performed in tools other than Pacemaker.
Here is a concrete example: when Lustre's "healthcheck" reports a problem, that is something a monitoring system should report to the sysadmin so they can resolve it. But many of the situations that healthcheck reports are not relevant to a proper Pacemaker RA for Lustre. Many of those situations can't be resolved by shooting the node and moving the resource to a failover partner node. So the RA's monitor action should only report a problem when that problem has a reasonable chance of being solved by failover. Otherwise we can wind up with Lustre services bouncing back and forth between nodes and just exacerbating the problems on a system where there was already an issue.
Telling the problem situations apart is, perhaps, easier said than done. Especially with Lustre.
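For what it's worth, in the monitor action that philosophy could look something like this: fail the monitor only on the target's own running state and keep health_check advisory. The "healthy" string and the is_target_running hook are assumptions for illustration:

#!/usr/bin/env python3
# Hedged sketch: treat Lustre's health_check as advisory inside the RA's
# monitor action.  Only conditions that failover can plausibly fix should
# change the return code; everything else is just logged for the admin.
import subprocess
import sys
import syslog

OCF_SUCCESS, OCF_ERR_GENERIC, OCF_NOT_RUNNING = 0, 1, 7

def health_check():
    """Return the node-wide health_check string, or None if unavailable."""
    try:
        out = subprocess.run(["lctl", "get_param", "-n", "health_check"],
                             capture_output=True, text=True, check=False)
    except FileNotFoundError:
        return None
    return out.stdout.strip() or None

def monitor(is_target_running):
    if not is_target_running():
        return OCF_NOT_RUNNING
    health = health_check()
    if health is not None and health != "healthy":
        # Many of the conditions health_check can flag will not be cured by
        # shooting this node, so log them instead of failing the monitor and
        # bouncing the target between servers.
        syslog.syslog(syslog.LOG_WARNING,
                      "Lustre health_check reports: %s (advisory only)" % health)
    return OCF_SUCCESS

if __name__ == "__main__":
    sys.exit(monitor(lambda: True))  # plug in the real mounted/held check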
Hi Chris,
Your point is well made, and I understand the principle behind your approach. My concern is that there is a lot of cross-over with the existing Filesystem RA, and providing the same level of comprehensive coverage for Lustre might create duplication (in particular for the LDISKFS volumes). I was looking to reduce duplication of effort, if possible.
Even so, we could lift relevant logic from ocf:*:Filesystem and specialise it to our specific needs, so there's no fundamental objection. I'd still advocate for keeping a ZFS volume management RA separate from the Lustre mount / umount RA.
With regard to monitoring, we need to establish the monitoring criteria that would trigger a "valid" failover event, even if that list is necessarily small. So far, the only established criterion is determining whether a given target is mounted. What else would you suggest for an RA?