LU-8458: Pacemaker script to monitor Lustre servers status

Details

    • Type: New Feature
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.10.0

    Description

      A new script, to be used with Pacemaker, that monitors Lustre server status and is compatible with both ZFS- and LDISKFS-based Lustre server installations.

      This RA is able to monitor a Lustre server using Pacemaker's clone feature.

      pcs resource create [Resource Name] ocf:pacemaker:healthLUSTRE \
      dampen=[seconds, e.g. 5s] \
      --clone
      

      where:

      • dampen: the time to wait (dampening interval) before further changes are acted on

      This script should be installed in /usr/lib/ocf/resource.d/heartbeat/ on both Lustre servers, with permissions set to 755.
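
      For example, a minimal installation sketch (the source filename and its location in the current directory are assumptions for illustration):

      # Copy the resource agent into the OCF resource directory and make it
      # executable; run this on both Lustre servers.
      install -m 755 ./healthLUSTRE /usr/lib/ocf/resource.d/heartbeat/healthLUSTRE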

      Default values:

      • dampen 5s

      Default timeouts (see the sketch after this list):

      • start timeout 60s
      • stop timeout 20s
      • monitor timeout 60s interval 10s
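
      As a sketch, assuming standard pcs operation syntax, these defaults could be stated explicitly when creating the resource:

      pcs resource create healthLUSTRE ocf:pacemaker:healthLUSTRE dampen=5s \
          op start timeout=60s \
          op stop timeout=20s \
          op monitor timeout=60s interval=10s \
          --clone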

      Compatible and tested:

      • pacemaker 1.1.13
      • corosync 2.3.4
      • pcs 0.9.143
      • RHEL/CentOS 7.2

      Example configuration procedure:

      pcs resource create healthLUSTRE ocf:pacemaker:healthLUSTRE dampen=5s  --clone 
      
      targets=`crm_mon -1|grep 'OST'| awk '{print $1}'` 
      
      for i in $targets; do pcs constraint location $i rule score=-INFINITY lustred lt 1 or not_defined lustred; done 
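
      A possible way to verify the configuration afterwards (illustrative; the exact output depends on the cluster):

      crm_mon -1                    # the cloned healthLUSTRE resource should be running on both servers
      pcs constraint location show  # one score=-INFINITY rule per OST target should be listed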
      

          Activity

            pjones Peter Jones added a comment -

            Landed for 2.10


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25664/
            Subject: LU-8458 pacemaker: Resource to manage Lustre Target
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 000a1aab890cf9a4fa4279ae449b7b7279fba512

            malkolm Malcolm Cowe (Inactive) added a comment -

            Hi Chris,

            Your point is well made, and I understand the principle behind your approach. My concern is that there is a lot of cross-over with the existing Filesystem RA, and providing the same level of comprehensive coverage for Lustre might create duplication (in particular for the LDISKFS volumes). I was looking to reduce duplication of effort, if possible.

            Even still, we could lift relevant logic from ocf:*:Filesystem and specialise it according to our specific needs, so there's no fundamental objection. I'd still advocate for keeping a ZFS volume management RA separate from the Lustre mount / umount RA.

            With regard to monitoring, we need to establish the monitoring criteria that would trigger a "valid" failover event, even if that list is necessarily small. So far, the only established criterion is determining whether a given target is mounted. What else would you suggest for an RA?

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/22297/
            Subject: LU-8458 pacemaker: Script to monitor Server status
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d35970fc24d730dab28f28990875fc6244fe116c

            morrone Christopher Morrone (Inactive) added a comment -

            "Both appear to be functionally equivalent"

            You are giving an example for the situation when everything is working. The point of monitoring is to figure out when things go wrong. I've seen too many times when /proc/mounts didn't show anything for the Lustre target even though the Lustre service wasn't really stopped, because Lustre lives in the kernel and can screw things up like that.

            What is required is a single RA that can start and stop an individual Lustre service, and monitor that service to determine if the underlying hardware might be in use.

            It is potentially disastrous for the monitor function to tell Pacemaker that the Lustre resource isn't running when in fact it might be writing to the storage.

            Lustre server services are not the same as client mounts. If a client mount doesn't appear in /proc/mounts, there is little chance that anything can ever use it again on the system. With a Lustre service, using "mount" to start the servers has always been a little bit hackish (not that there weren't perfectly good reasons for going that route). Because Lustre lives in the kernel, there are too many ways that the mount can disappear while the Lustre service is still working with the drive.

            Granted, mntdev might not be perfect either.

            There might be reasons to have extra, completely advisory healthLUSTRE RAs. I'm not making an evaluation on that. But having those does not excuse us from having proper Lustre service RAs as well. The goals of the "monitor" function in a Lustre service RA will necessarily be different from the needs of broader Lustre system monitoring. It might be argued, though, that that sort of monitoring should be performed in tools other than Pacemaker.

            Here is a concrete example: when Lustre's "healthcheck" reports a problem, that is something a monitoring system should report to the sysadmin so they can resolve it. But many of the situations that healthcheck reports are not relevant to a proper Pacemaker RA for Lustre. Many of those situations can't be resolved by shooting the node and moving the resource to a failover partner node. So the RA's monitor section should only report a problem when the problem has a reasonable chance of being solved by failover. Otherwise we can wind up with Lustre services bouncing back and forth between nodes and just exacerbating the problems on a system where there was already an issue.

            Telling the problem situations apart is, perhaps, easier said than done. Especially with Lustre.

            malkolm Malcolm Cowe (Inactive) added a comment -

            That's fair. I would suggest, then, that it would be helpful to establish what monitoring is needed and from there decide whether to incorporate that into a Lustre-specific "filesystem" RA, or to create a canonical "healthLUSTRE" cloned resource, per the original submission on this ticket.

            The script in patch 25664 monitors /proc/mounts, while the example in lustre-tools-llnl reads from osd-zfs.*.mntdev.

            From /proc/mounts:

            [root@ct66-oss1 ~]# awk '/lustre/' /proc/mounts
            /dev/sda /mnt/demo-OST0000 lustre ro 0 0
            /dev/sdc /mnt/demo-OST0004 lustre ro 0 0

            And from osd-{zfs,ldiskfs}.mntdev:

            [root@ct66-oss1 ~]# lctl get_param osd*.*.mntdev
            osd-ldiskfs.demo-OST0000.mntdev=/dev/sda
            osd-ldiskfs.demo-OST0004.mntdev=/dev/sdc

            Both appear to be functionally equivalent, and neither differs appreciably from the ocf:heartbeat:Filesystem RA with regard to monitoring.

            ocf:*:Filesystem also makes it straightforward to add mount options and has the benefit of being available in a canonical distribution. It does have drawbacks, not the least of which is its general ignorance of Lustre, and the potential for unintended side-effects that might need to be more carefully explored.

            I'm not arguing against Lustre-specific RAs, only that we should be clear about what is required. This ticket was created in recognition of a need for specialist monitoring of health status, and was implemented with the healthLUSTRE and healthLNET scripts. These are used to create location constraints in Pacemaker. Separating health monitoring from the mount RA should reduce duplication of code and make it easier to test the individual scripts.

            If we want to create more comprehensive RAs that comprise logic for LDISKFS, ZFS, health monitoring, mount options and so on, so be it, but let's not spend time on this needlessly.
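
            As a sketch, the two checks compared above could be combined into a monitor-style shell function; the target name and the /mnt/<target> mount point layout are assumptions taken from the example output:

            check_target_mounted() {
                # $1 is a Lustre target label, e.g. demo-OST0000 (placeholder)
                local target="$1"

                # The target should appear as a lustre-type mount in /proc/mounts
                # (mount point layout /mnt/<target> assumed from the example above)
                grep -q "/mnt/${target} lustre" /proc/mounts || return 1

                # Cross-check that the OSD layer still reports a backing device
                lctl get_param -n "osd-*.${target}.mntdev" >/dev/null 2>&1 || return 1

                return 0
            }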

            morrone Christopher Morrone (Inactive) added a comment -

            Malcolm, yes, the Filesystem RA will issue mount/umount, but it certainly leaves something to be desired in the monitoring area.

            malkolm Malcolm Cowe (Inactive) added a comment -

            Andreas, that script was added to contrib in 2010 and has not had a meaningful update since. I agree it would be worth cross-referencing (as a sanity check), but I don't think it is worth keeping in the long term.

            Unclear why a Lustre-specific mount script (per change 25664) is required, though. Pacemaker's standard ocf::heartbeat::Filesystem RA is pretty much file system agnostic and has been used to manage LDISKFS installs for years. Pretty sure it could mount / umount ZFS targets as well, provided the pool is imported. The healthcheck scripts for LNet and Lustre (the 21812 or 22297 changesets) were developed to plug the Lustre-specific gaps that Filesystem cannot cover.
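
            As a sketch of what the stock Filesystem RA approach could look like for a single target, the resource name, device and mount point below are placeholders borrowed from the earlier example output:

            pcs resource create demo-OST0000 ocf:heartbeat:Filesystem \
                device=/dev/sda directory=/mnt/demo-OST0000 fstype=lustre \
                op monitor interval=30s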

            adilger Andreas Dilger added a comment -

            I notice that lustre/contrib/lustre_server.sh is also a pacemaker resource script for monitoring Lustre servers. It would be worthwhile to look through that script and see if there is anything in there that should be added to the new script, or if it is redundant/inferior and should be removed.

            gerrit Gerrit Updater added a comment -

            Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: https://review.whamcloud.com/25664
            Subject: LU-8458 pacemaker: Resource to manage Lustre Target
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 261b9b81df6035e86adffc557024109aeae2e2f1

            People

              Assignee: gabriele.paciucci Gabriele Paciucci (Inactive)
              Reporter: gabriele.paciucci Gabriele Paciucci (Inactive)
              Votes: 0
              Watchers: 12
