Details

    • Type: New Feature
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.10.0

    Description

      A new script, to be used as a Pacemaker resource agent (RA), for managing ZFS pools and Lustre targets.

      The RA manages ZFS pools (import/export) and Lustre targets (mount/umount). A resource is created with a command of the form:

      pcs resource create <Resource Name> ocf:heartbeat:LustreZFS \
           pool="<ZFS Pool Name>" \
           volume="<ZFS Volume Name>" \
           mountpoint="<Mount Point>" \
           OCF_CHECK_LEVEL=10
      

      where (a complete example with concrete values is sketched after this list):

      • pool is the name of the ZFS pool, created in advance
      • volume is the name of the ZFS volume created on the pool during the Lustre format (mkfs.lustre)
      • mountpoint is the mount point, created in advance on both Lustre servers
      • OCF_CHECK_LEVEL is optional and enables an extra monitor of the pool status
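
      For illustration, a complete invocation with concrete values might look like the following; the resource name, pool, volume, and mount point are hypothetical examples only:

      pcs resource create lustre-ost0 ocf:heartbeat:LustreZFS \
           pool="ostpool0" \
           volume="ost0" \
           mountpoint="/lustre/ost0" \
           OCF_CHECK_LEVEL=10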

      This script should be installed in /usr/lib/ocf/resource.d/heartbeat/ on both Lustre servers, with permissions set to 755.
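
      For example, assuming the agent file is named LustreZFS and sits in the current directory, the following copies it into place on each server:

      install -m 755 LustreZFS /usr/lib/ocf/resource.d/heartbeat/LustreZFS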

      The script provides protection against double imports of the pools. To activate this functionality, configure the hostid protection in ZFS using the genhostid command.
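
      A minimal sketch of enabling this on each server (run once per node; the generated hostid must differ between the two servers):

      genhostid    # writes a random host identifier to /etc/hostid
      hostid       # print the value to verify it differs between the nodes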

      Default values:

      • no defaults

      Default timeouts (these can be overridden per resource; see the example after this list):

      • start timeout 300s
      • stop timeout 300s
      • monitor timeout 300s, interval 20s
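
      For example, the operation timeouts could be adjusted with pcs for a hypothetical resource named lustre-ost0 (values shown are illustrative):

      pcs resource update lustre-ost0 \
           op start timeout=300s \
           op stop timeout=300s \
           op monitor timeout=300s interval=20s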

      Compatible with and tested on:

      • pacemaker 1.1.13
      • corosync 2.3.4
      • pcs 0.9.143
      • RHEL/CentOS 7.2

          Activity

            [LU-8455] Pacemaker script for Lustre and ZFS

            Hi Vaughn,

            Try using the path ocf:pacemaker:Lustre-MDS-ZFS, instead of ocf:heartbeat:Lustre-MDS-ZFS. You can also verify the list of available RAs using the command pcs resource list. For example:

            [root@ct66-mds2 ~]# pcs resource list ocf:pacemaker | awk 'tolower($0) ~ /lustre|lnet/'
            ocf:pacemaker:Lustre-MDS-ZFS - Lustre and ZFS management when the MDT and MGT
            ocf:pacemaker:LustreZFS - Lustre and ZFS management
            ocf:pacemaker:healthLNET - LNet connectivity
            ocf:pacemaker:healthLUSTRE - lustre servers healthy
            

             

            malkolm Malcolm Cowe (Inactive) added a comment

            I've been trying to use the script to create the HA volume/dataset resources with the following syntax:

            pcs resource create hail-mgt ocf:heartbeat:Lustre-MDS-ZFS pool="ha.mds" volume="mgt" mountpoint="/lustre/hail/mgt"

            Each attempt returns the following error:

            Error: Unable to create resource 'ocf:heartbeat:Lustre-MDS-ZFS', it is not installed on this system (use --force to override)

            I can see that the agent script, Lustre-MDS-ZFS, is dropped into the correct location when I run this syntax with the debug option enabled. I also see the script being run with a return value of 0. I'm not exactly sure what the problem is. Could it be missing some binary that I'm not seeing in the debug output? Anyway, I would greatly appreciate some guidance with solving this.

            Here are the details about my configuration:

            Red Hat Enterprise Linux Server release 7.3 (Maipo)
            pcs-0.9.152-10.el7.x86_64
            pacemaker-1.1.15-11.el7_3.2.x86_64
            corosync-2.4.0-4.el7.x86_64

            fence-agents-common-4.0.11-47.el7_3.2.x86_64
            fence-agents-powerman-4.0.11-7.ch6.x86_64
            libxshmfence-1.2-1.el7.x86_64

            This is being deployed in a diskless 2 node HA Lustre environment. Please let me know if you require me to open a ticket concerning this issue.

            veclinton Vaughn E. Clinton (Inactive) added a comment (edited)
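
            As an editorial aside, two quick checks that help with this kind of "not installed" error are to list the OCF provider directories to see where the agent actually landed, and to ask pcs to describe it under that provider (the provider shown below is only an example):

            ls /usr/lib/ocf/resource.d/*/Lustre-MDS-ZFS
            pcs resource describe ocf:pacemaker:Lustre-MDS-ZFS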

            Added a new version, 0.99.5, that fixes the umount issue.

            gabriele.paciucci Gabriele Paciucci (Inactive) added a comment

            Using "umount -f" will evict all of the clients at unmount time, which means they will get IO errors for any in-flight writes when the server is started again. This is useful if the administrator knows that the clients have already been stopped, for example.

            I don't think that using the "-f" flag will significantly reduce the amount of time that the unmount will take, but if you have numbers to that effect it would be interesting to see. It may be possible to improve the performance of the non-f unmount if that is a real issue.

            adilger Andreas Dilger added a comment
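
            One possible compromise, not taken from any of the scripts discussed here, is for the RA's stop action to try a clean unmount first and fall back to -f only as the Pacemaker stop timeout approaches. A rough sketch, where the 240s budget is an assumed value chosen to stay under a 300s stop timeout:

            # attempt a clean unmount for up to 240s, then force it
            timeout 240 umount "$MOUNTPOINT" || umount -f "$MOUNTPOINT"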

            Hi bmerchant, thank you for this feedback. I would like to consult with adilger about this.
            I'm using umount -f because I don't want to wait a long time to stop the file system when several clients are connected: if I hit the stop timeout in pacemaker, pacemaker will perform a stonith of the node.
            But at the same time the -f option disables recovery, and could this cause issues with data?
            adilger, could you advise please? Thank you.

            gabriele.paciucci Gabriele Paciucci (Inactive) added a comment

            We have been using this script and ran into an issue on manual failover (i.e. pcs resource disable/enable or move/relocate). On resource stop the script force unmounts Lustre targets (umount -f). In live tests this caused the recovery flags to not be set on OSTs, causing client eviction on remount. Removing the -f flag resolves the problem, but we are curious if removing it could present any other issues.

            I do notice Christopher Morrone's script omits the -f flag.

            bmerchant Bradley Merchant (Inactive) added a comment

            There is a considerable amount of duplication between this ticket and LU-8458. One of them should probably be closed.

            adilger Andreas Dilger added a comment

            Hi morrone,
            As I already mentioned in our discussion, I found problems (and weaknesses) in pacemaker when managing two colocated resources. One of the problems was moving (from your example) MGS from jet1 to jet2: the only way to do this is to move jet1-1, but this means that if the MGS resource fails, the other resource is not moved... I also tried the resource group concept, but without success.
            I tested this on RHEL 6; maybe it is fixed in RHEL 7. Could you please test this?

            I also suggest using the resource-stickiness option to avoid any failback (see the example below).

            gabriele.paciucci Gabriele Paciucci (Inactive) added a comment
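
            For reference, a cluster-wide stickiness default can be set with pcs; the value below is only an illustration:

            pcs resource defaults resource-stickiness=200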

            I promised to share my version of pacemaker OCF resource agent scripts by the end of this week so here they are:

            https://github.com/LLNL/lustre-tools-llnl/blob/1.8.2/scripts/zpool
            https://github.com/LLNL/lustre-tools-llnl/blob/1.8.2/scripts/lustre

            Keep in mind that these are still a work in progress. They are certainly rough around the edges. The general idea is that zpool manages zpool import/export, and lustre manages mount/umount of individual ZFS datasets.

            In pacemaker I have ordering and colocation constraints to express the relationship between the two resources. Here is an example set of constraints that are working on our testbed. This is for two MDS nodes with shared storage. The zpool of the first MDS is used by two Lustre targets: the MGS and the first MDT. This is a real-world example of needing to support two Lustre filesystems in the same pool (all of our production filesystems are like that, even though DNE exists only on testbeds for now).

                  <rsc_location id="jet1-1_loc_20" rsc="jet1-1" node="jet1" score="20"/>
                  <rsc_location id="jet1-1_loc_10" rsc="jet1-1" node="jet2" score="10"/>
                  <rsc_location id="MGS_loc_20" rsc="MGS" node="jet1" score="20"/>
                  <rsc_location id="MGS_loc_10" rsc="MGS" node="jet2" score="10"/>
                  <rsc_order id="MGS_order" first="jet1-1" then="MGS" kind="Mandatory"/>
                  <rsc_colocation id="MGS_colocation" rsc="MGS" with-rsc="jet1-1" score="INFINITY"/>
                  <rsc_location id="lquake-MDT0000_loc_20" rsc="lquake-MDT0000" node="jet1" score="20"/>
                  <rsc_location id="lquake-MDT0000_loc_10" rsc="lquake-MDT0000" node="jet2" score="10"/>
                  <rsc_order id="lquake-MDT0000_order" first="jet1-1" then="lquake-MDT0000" kind="Mandatory"/>
                  <rsc_colocation id="lquake-MDT0000_colocation" rsc="lquake-MDT0000" with-rsc="jet1-1" score="INFINITY"/>
                  <rsc_location id="jet2-1_loc_20" rsc="jet2-1" node="jet2" score="20"/>
                  <rsc_location id="jet2-1_loc_10" rsc="jet2-1" node="jet1" score="10"/>
                  <rsc_location id="lquake-MDT0001_loc_20" rsc="lquake-MDT0001" node="jet2" score="20"/>
                  <rsc_location id="lquake-MDT0001_loc_10" rsc="lquake-MDT0001" node="jet1" score="10"/>
                  <rsc_order id="lquake-MDT0001_order" first="jet2-1" then="lquake-MDT0001" kind="Mandatory"/>
                  <rsc_colocation id="lquake-MDT0001_colocation" rsc="lquake-MDT0001" with-rsc="jet2-1" score="INFINITY"/>
            

            I wrote a script to generate the pacemaker cib.xml for an entire lustre server cluster starting from an ldev.conf file. It is also on github:

            https://github.com/LLNL/lustre-tools-llnl/blob/1.8.2/scripts/ldev2cib

            ldev2cib is also a work-in-progress. Currently stonith is entirely disabled. We'll work on adding stonith once the local powerman stonith agent is ready to be integrated into the whole.

            Yes, the lustre script is zfs-only at this time. It could be expanded to work with ldiskfs too if that makes sense. I'm also open to keeping ldiskfs and zfs lustre RA scripts separate.

            morrone Christopher Morrone (Inactive) added a comment
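
            For readers unfamiliar with ldev.conf, a minimal two-MDS sketch consistent with the constraints above might look like the following; the dataset names after the colon are assumptions, not taken from the comment:

            # local  foreign  label            device
            jet1     jet2     lquake-MDT0000   zfs:jet1-1/mdt0
            jet2     jet1     lquake-MDT0001   zfs:jet2-1/mdt1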

            Oh, I see! The reboot isn't necessarily as bad as I was thinking, because the node simply booting again does not, in theory, introduce usage of a Lustre device. You are relying on pacemaker not to start the device on that node after reboot. Which is odd, because you trust pacemaker there, but not enough to trust it to start the resource in only one place elsewhere (which is why you are attempting direct stonith control).

            I guess it isn't quite as scary as I originally thought. But the script needs to have options passed in to list the possible nodes that the resource can live on, rather than assuming there are only two nodes in the entire cluster and that the other node must be the only node that needs power cycling. As long as it is doing that, it definitely shouldn't be anywhere but contrib.

            morrone Christopher Morrone (Inactive) added a comment

            Could you elaborate on what went wrong in pacemaker?

            You are leaving out that your script explicitly introduces a new failure mode with its racy, reboot-based approach to atomic pool access. Do your new failure mode and all the added complexity really balance out the rare failure mode in pacemaker?

            I think I would rather rely on Pacemaker's reliability than a racy reboot mechanism.

            morrone Christopher Morrone (Inactive) added a comment

            People

              Assignee: gabriele.paciucci Gabriele Paciucci (Inactive)
              Reporter: gabriele.paciucci Gabriele Paciucci (Inactive)
              Votes: 0
              Watchers: 15
