Details

    • Type: New Feature
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.10.0

    Description

      A new script, to be used as a Pacemaker resource agent (RA), for managing ZFS pools and Lustre targets.

      The RA manages ZFS pools (import/export) and Lustre targets (mount/umount). A resource is created with a command of the form:

      pcs resource create <Resource Name> ocf:heartbeat:LustreZFS \
           pool="<ZFS Pool Name>" \
           volume="<ZFS Volume Name>" \
           mountpoint="<Mount Point>" \
           OCF_CHECK_LEVEL=10
      

      where (a complete example with concrete values is sketched after this list):

      • pool is the name of the ZFS pool, created in advance
      • volume is the name of the ZFS volume created on the pool during the Lustre format (mkfs.lustre)
      • mountpoint is the mount point, created in advance on both Lustre servers
      • OCF_CHECK_LEVEL is optional and enables an extra monitor of the pool status
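
      For illustration, a complete invocation with concrete values might look like the following; the resource name, pool, volume, and mount point are hypothetical examples only:

      pcs resource create lustre-ost0 ocf:heartbeat:LustreZFS \
           pool="ostpool0" \
           volume="ost0" \
           mountpoint="/lustre/ost0" \
           OCF_CHECK_LEVEL=10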

      This script should be installed in /usr/lib/ocf/resource.d/heartbeat/ on both Lustre servers, with permissions set to 755.
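
      For example, assuming the agent file is named LustreZFS and sits in the current directory, the following copies it into place on each server:

      install -m 755 LustreZFS /usr/lib/ocf/resource.d/heartbeat/LustreZFS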

      The script provides protection against double imports of the pools. To activate this functionality, configure the hostid protection in ZFS using the genhostid command.
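
      A minimal sketch of enabling this on each server (run once per node; the generated hostid must differ between the two servers):

      genhostid    # writes a random host identifier to /etc/hostid
      hostid       # print the value to verify it differs between the nodes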

      Default values:

      • no defaults

      Default timeouts (these can be overridden per resource; see the example after this list):

      • start timeout 300s
      • stop timeout 300s
      • monitor timeout 300s, interval 20s
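
      For example, the operation timeouts could be adjusted with pcs for a hypothetical resource named lustre-ost0 (values shown are illustrative):

      pcs resource update lustre-ost0 \
           op start timeout=300s \
           op stop timeout=300s \
           op monitor timeout=300s interval=20s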

      Compatible with and tested on:

      • pacemaker 1.1.13
      • corosync 2.3.4
      • pcs 0.9.143
      • RHEL/CentOS 7.2

          Activity

            [LU-8455] Pacemaker script for Lustre and ZFS

            Hi Vaughn,

            Try using the path ocf:pacemaker:Lustre-MDS-ZFS, instead of ocf:heartbeat:Lustre-MDS-ZFS. You can also verify the list of available RAs using the command pcs resource list. For example:

            [root@ct66-mds2 ~]# pcs resource list ocf:pacemaker | awk 'tolower($0) ~ /lustre|lnet/'
            ocf:pacemaker:Lustre-MDS-ZFS - Lustre and ZFS management when the MDT and MGT
            ocf:pacemaker:LustreZFS - Lustre and ZFS management
            ocf:pacemaker:healthLNET - LNet connectivity
            ocf:pacemaker:healthLUSTRE - lustre servers healthy
            

             

            malkolm Malcolm Cowe (Inactive) added a comment

            I've been trying to use the script to create the HA volume/dataset resources with the following syntax:

            pcs resource create hail-mgt ocf:heartbeat:Lustre-MDS-ZFS pool="ha.mds" volume="mgt" mountpoint="/lustre/hail/mgt"

            Each attempt returns the following error:

            Error: Unable to create resource 'ocf:heartbeat:Lustre-MDS-ZFS', it is not installed on this system (use --force to override)

            I can see that the agent script, Lustre-MDS-ZFS, is dropped into the correct location when I run this syntax with the debug option enabled. I also see the script being run with a return value of 0. I'm not exactly sure what the problem is. Could it be missing some binary that I'm not seeing in the debug output? Anyway, I would greatly appreciate some guidance with solving this.

            Here are the details about my configuration:

            Red Hat Enterprise Linux Server release 7.3 (Maipo)
            pcs-0.9.152-10.el7.x86_64
            pacemaker-1.1.15-11.el7_3.2.x86_64
            corosync-2.4.0-4.el7.x86_64

            fence-agents-common-4.0.11-47.el7_3.2.x86_64
            fence-agents-powerman-4.0.11-7.ch6.x86_64
            libxshmfence-1.2-1.el7.x86_64

            This is being deployed in a diskless 2 node HA Lustre environment. Please let me know if you require me to open a ticket concerning this issue.

            veclinton Vaughn E. Clinton (Inactive) added a comment (edited)
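
            As an editorial aside, two quick checks that help with this kind of "not installed" error are to list the OCF provider directories to see where the agent actually landed, and to ask pcs to describe it under that provider (the provider shown below is only an example):

            ls /usr/lib/ocf/resource.d/*/Lustre-MDS-ZFS
            pcs resource describe ocf:pacemaker:Lustre-MDS-ZFS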

            Added a new version, 0.99.5, that fixes the umount issue.

            gabriele.paciucci Gabriele Paciucci (Inactive) added a comment

            Using "umount -f" will evict all of the clients at unmount time, which means they will get IO errors for any in-flight writes when the server is started again. This is useful if the administrator knows that the clients have already been stopped, for example.

            I don't think that using the "-f" flag will significantly reduce the amount of time that the unmount will take, but if you have numbers to that effect it would be interesting to see. It may be possible to improve the performance of the non-f unmount if that is a real issue.

            adilger Andreas Dilger added a comment
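
            One possible compromise, not taken from any of the scripts discussed here, is for the RA's stop action to try a clean unmount first and fall back to -f only as the Pacemaker stop timeout approaches. A rough sketch, where the 240s budget is an assumed value chosen to stay under a 300s stop timeout:

            # attempt a clean unmount for up to 240s, then force it
            timeout 240 umount "$MOUNTPOINT" || umount -f "$MOUNTPOINT"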

            Hi bmerchant, thank you for this feedback. I would like to consult with adilger about this.
            I'm using umount -f because I don't want to wait a long time to stop the file system when several clients are connected: if I hit the stop timeout in pacemaker, pacemaker will perform a stonith of the node.
            But at the same time the -f option disables recovery, and could this cause issues with data?
            adilger, could you advise please? Thank you.

            gabriele.paciucci Gabriele Paciucci (Inactive) added a comment

            We have been using this script and ran into an issue on manual failover (i.e. pcs resource disable/enable or move/relocate). On resource stop the script force unmounts Lustre targets (umount -f). In live tests this caused the recovery flags to not be set on OSTs, causing client eviction on remount. Removing the -f flag resolves the problem, but we are curious if removing it could present any other issues.

            I do notice Christopher Morrone's script omits the -f flag.

            bmerchant Bradley Merchant (Inactive) added a comment

            There is a considerable amount of duplication between this ticket and LU-8458. One of them should probably be closed.

            adilger Andreas Dilger added a comment

            Hi morrone,
            As I already mentioned in our discussion, I found problems (and weaknesses) in pacemaker when managing two colocated resources. One of the problems was moving (from your example) MGS from jet1 to jet2: the only way to do this is to move jet1-1, but this means that if the MGS resource fails, the other resource is not moved... I also tried the resource group concept, but without success.
            I tested this on RHEL 6; maybe it is fixed in RHEL 7. Could you please test this?

            I also suggest using the resource-stickiness option to avoid any failback (see the example below).

            gabriele.paciucci Gabriele Paciucci (Inactive) added a comment
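
            For reference, a cluster-wide stickiness default can be set with pcs; the value below is only an illustration:

            pcs resource defaults resource-stickiness=200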

            I promised to share my version of pacemaker OCF resource agent scripts by the end of this week so here they are:

            https://github.com/LLNL/lustre-tools-llnl/blob/1.8.2/scripts/zpool
            https://github.com/LLNL/lustre-tools-llnl/blob/1.8.2/scripts/lustre

            Keep in mind that these are still a work in progress. They are certainly rough around the edges. The general idea is that zpool manages zpool import/export, and lustre manages mount/umount of individual ZFS datasets.

            In pacemaker I have ordering and colocation constraints to express the relationship between the two resources. Here is an example set of constraints that are working on our testbed. This is for two MDS nodes with shared storage. The zpool of the first MDS is used by two Lustre targets: the MGS and the first MDT. This is a real-world example of needing to support two Lustre filesystems in the same pool (all of our production filesystems are like that, even though DNE exists only on testbeds for now).

                  <rsc_location id="jet1-1_loc_20" rsc="jet1-1" node="jet1" score="20"/>
                  <rsc_location id="jet1-1_loc_10" rsc="jet1-1" node="jet2" score="10"/>
                  <rsc_location id="MGS_loc_20" rsc="MGS" node="jet1" score="20"/>
                  <rsc_location id="MGS_loc_10" rsc="MGS" node="jet2" score="10"/>
                  <rsc_order id="MGS_order" first="jet1-1" then="MGS" kind="Mandatory"/>
                  <rsc_colocation id="MGS_colocation" rsc="MGS" with-rsc="jet1-1" score="INFINITY"/>
                  <rsc_location id="lquake-MDT0000_loc_20" rsc="lquake-MDT0000" node="jet1" score="20"/>
                  <rsc_location id="lquake-MDT0000_loc_10" rsc="lquake-MDT0000" node="jet2" score="10"/>
                  <rsc_order id="lquake-MDT0000_order" first="jet1-1" then="lquake-MDT0000" kind="Mandatory"/>
                  <rsc_colocation id="lquake-MDT0000_colocation" rsc="lquake-MDT0000" with-rsc="jet1-1" score="INFINITY"/>
                  <rsc_location id="jet2-1_loc_20" rsc="jet2-1" node="jet2" score="20"/>
                  <rsc_location id="jet2-1_loc_10" rsc="jet2-1" node="jet1" score="10"/>
                  <rsc_location id="lquake-MDT0001_loc_20" rsc="lquake-MDT0001" node="jet2" score="20"/>
                  <rsc_location id="lquake-MDT0001_loc_10" rsc="lquake-MDT0001" node="jet1" score="10"/>
                  <rsc_order id="lquake-MDT0001_order" first="jet2-1" then="lquake-MDT0001" kind="Mandatory"/>
                  <rsc_colocation id="lquake-MDT0001_colocation" rsc="lquake-MDT0001" with-rsc="jet2-1" score="INFINITY"/>
            

            I wrote a script to generate the pacemaker cib.xml for an entire lustre server cluster starting from an ldev.conf file. It is also on github:

            https://github.com/LLNL/lustre-tools-llnl/blob/1.8.2/scripts/ldev2cib

            ldev2cib is also a work-in-progress. Currently stonith is entirely disabled. We'll work on adding stonith once the local powerman stonith agent is ready to be integrated into the whole.

            Yes, the lustre script is zfs-only at this time. It could be expanded to work with ldiskfs too if that makes sense. I'm also open to keeping ldiskfs and zfs lustre RA scripts separate.

            morrone Christopher Morrone (Inactive) added a comment
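
            For readers unfamiliar with ldev.conf, a minimal two-MDS sketch consistent with the constraints above might look like the following; the dataset names after the colon are assumptions, not taken from the comment:

            # local  foreign  label            device
            jet1     jet2     lquake-MDT0000   zfs:jet1-1/mdt0
            jet2     jet1     lquake-MDT0001   zfs:jet2-1/mdt1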

            Oh, I see! The reboot isn't necessarily as bad as I was thinking, because the node simply booting again does not, in theory, introduce usage of a Lustre device. You are relying on pacemaker not to start the device on that node after reboot. Which is odd, because you trust pacemaker there, but not enough to trust it to start the resource in only one place elsewhere (which is why you are attempting direct stonith control).

            I guess it isn't quite as scary as I originally thought. But the script needs to have options passed in to list the possible nodes that the resource can live on, rather than assuming there are only two nodes in the entire cluster and that the other node must be the only node that needs power cycling. As long as it is doing that, it definitely shouldn't be anywhere but contrib.

            morrone Christopher Morrone (Inactive) added a comment

            Could you elaborate on what went wrong in pacemaker?

            You are leaving out that your script explicitly introduces a new failure mode with its racy, reboot-based approach to atomic pool access. Do your new failure mode and all the added complexity really balance out the rare failure mode in pacemaker?

            I think I would rather rely on Pacemaker's reliability than a racy reboot mechanism.

            morrone Christopher Morrone (Inactive) added a comment

            People

              Assignee: gabriele.paciucci Gabriele Paciucci (Inactive)
              Reporter: gabriele.paciucci Gabriele Paciucci (Inactive)
              Votes: 0
              Watchers: 15
