LU-8458

Pacemaker script to monitor Lustre servers status

Details

    • Type: New Feature
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.10.0

    Description

      A new resource agent (RA) script for Pacemaker that monitors Lustre server status, compatible with both ZFS- and LDISKFS-based Lustre server installations.

      This RA monitors a Lustre server using Pacemaker's clone resources, so the same monitor runs on every node.

      pcs resource create [Resource Name] ocf:pacemaker:healthLUSTRE \
      dampen=[seconds] \
      --clone
      

      where:

      • dampen: the time to wait (dampening) for further changes to occur before acting on them

      This script should be installed in /usr/lib/ocf/resource.d/heartbeat/ on both Lustre servers with permissions 755.
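As a sketch, the installation step above could be done with install(1); the assumption here is that the RA file is named healthLUSTRE and sits in the current directory:

```shell
# Sketch: copy the RA into the OCF resource.d tree with mode 755.
# Assumption: the script file is named "healthLUSTRE" and is in the
# current directory; repeat on both Lustre servers.
install -m 755 healthLUSTRE /usr/lib/ocf/resource.d/heartbeat/healthLUSTRE
```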

      Default values:

      • dampen 5s

      Default timeout:

      • start timeout 60s
      • stop timeout 20s
      • monitor timeout 60s interval 10s
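The defaults listed above can also be spelled out explicitly at creation time. A sketch, using pcs 0.9 syntax; the resource name healthLUSTRE is illustrative:

```shell
# Sketch: create the clone with the documented dampen default and the
# default operation timeouts made explicit.
pcs resource create healthLUSTRE ocf:pacemaker:healthLUSTRE dampen=5s \
    op start timeout=60s \
    op stop timeout=20s \
    op monitor timeout=60s interval=10s \
    --clone
```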

      Compatible and tested:

      • pacemaker 1.1.13
      • corosync 2.3.4
      • pcs 0.9.143
      • RHEL/CentOS 7.2

      Example of procedure to configure:

      pcs resource create healthLUSTRE ocf:pacemaker:healthLUSTRE dampen=5s --clone

      targets=$(crm_mon -1 | grep 'OST' | awk '{print $1}')

      for i in $targets; do pcs constraint location $i rule score=-INFINITY lustred lt 1 or not_defined lustred; done
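The targets= line above scrapes resource names out of crm_mon output. A self-contained sketch of that extraction, using a hypothetical fragment of `crm_mon -1` output (assumption: one resource per line, with the resource name in the first whitespace-delimited field):

```shell
# Hypothetical crm_mon -1 fragment; MGT/OST0000/OST0001 are placeholder names.
sample=' MGT (ocf::heartbeat:Filesystem): Started lustre1
 OST0000 (ocf::heartbeat:Filesystem): Started lustre1
 OST0001 (ocf::heartbeat:Filesystem): Started lustre2'

# Same pipeline as the procedure above: keep lines mentioning OST,
# print the first field (the resource name).
targets=$(printf '%s\n' "$sample" | grep 'OST' | awk '{print $1}')
printf '%s\n' "$targets"
```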
      


          Activity


            Andreas, that script was added to contrib in 2010 and has not had a meaningful update since. Agree it would be worth cross-referencing (as a sanity check), but I don't think it is worth keeping in the long term.

            Unclear why a Lustre-specific mount script (per change 25664) is required, though. Pacemaker's standard ocf::heartbeat::Filesystem RA is pretty much file system agnostic and has been used to manage LDISKFS installs for years. Pretty sure it could mount / umount ZFS targets as well, provided the pool is imported. The healthcheck scripts for LNet and Lustre  (the 21812 or 22297 changeset) that were developed plug the Lustre-specific gaps that Filesystem cannot.
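For reference, managing a Lustre target with the stock Filesystem RA the comment describes looks roughly like this; the device path, mount point, and resource name are placeholders, not values from this ticket:

```shell
# Sketch: an LDISKFS-backed Lustre target under the generic Filesystem RA.
# /dev/mapper/ost0 and /mnt/ost0 are hypothetical names.
pcs resource create ost0 ocf:heartbeat:Filesystem \
    device=/dev/mapper/ost0 directory=/mnt/ost0 fstype=lustre
```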

            malkolm Malcolm Cowe (Inactive) added a comment

            I notice that lustre/contrib/lustre_server.sh is also a pacemaker resource script for monitoring Lustre servers. It would be worthwhile to look through that script and see if there is anything in there that should be added to the new script, or if it is redundant/inferior and should be removed.

            adilger Andreas Dilger added a comment

            Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: https://review.whamcloud.com/25664
            Subject: LU-8458 pacemaker: Resource to manage Lustre Target
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 261b9b81df6035e86adffc557024109aeae2e2f1

            gerrit Gerrit Updater added a comment
            tanabarr Tom Nabarro (Inactive) added a comment (edited)

            the patch with the relevant script is https://review.whamcloud.com/#/c/21812/ , the one mentioned above (https://review.whamcloud.com/#/c/22297/) contains an empty script


            your script should actually be monitoring "lctl get_param health_check" instead of direct /proc access, since direct access will break when this file moves to /sys/fs/lustre
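As a sketch of the suggested approach (assuming `lctl get_param -n health_check` prints `healthy` on a good node), the check can be written against the lctl output instead of a hard-coded /proc path. The function name check_health is illustrative:

```shell
# check_health evaluates a raw health_check value, so the logic can be
# exercised without a live Lustre stack.
check_health() {
    if [ "$1" = "healthy" ]; then
        echo "OK"
    else
        echo "NOT HEALTHY: ${1:-no output}"
    fi
}

# On a real server (assumption: lctl is in PATH):
#   check_health "$(lctl get_param -n health_check 2>/dev/null)"
check_health healthy
```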

            gabriele.paciucci Gabriele Paciucci (Inactive) added a comment

            Gabriele Paciucci (gabriele.paciucci@intel.com) uploaded a new patch: http://review.whamcloud.com/22297
            Subject: LU-8458 pacemaker: Script to monitor Server status
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 75a266bb4fc2fa00ee896ecce3bd14af31130dbe

            gerrit Gerrit Updater added a comment

            People

              Assignee: Gabriele Paciucci (Inactive)
              Reporter: Gabriele Paciucci (Inactive)
              Votes: 0
              Watchers: 12
