[LU-8455] Pacemaker script for Lustre and ZFS Created: 01/Aug/16 Updated: 22/Oct/17 Resolved: 19/Sep/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | New Feature | Priority: | Minor |
| Reporter: | Gabriele Paciucci (Inactive) | Assignee: | Gabriele Paciucci (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: | |
| Issue Links: | |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
A new script to be used in Pacemaker to manage ZFS pools and Lustre targets. This resource agent (RA) is able to import/export ZFS pools and mount/umount Lustre targets.

pcs resource create <Resource Name> ocf:heartbeat:LustreZFS \
  pool="<ZFS Pool Name>" \
  volume="<ZFS Volume Name>" \
  mountpoint="<Mount Point>" \
  OCF_CHECK_LEVEL=10

where pool is the name of the ZFS pool, volume is the ZFS dataset formatted as a Lustre target, and mountpoint is the directory on which the target is mounted.

This script should be located in /usr/lib/ocf/resource.d/heartbeat/ on both Lustre servers with permission 755. The script provides protection from double imports of the pools. To activate this functionality it is important to configure the hostid protection in ZFS using the genhostid command (see the sketch after this list).

Default values:
Default timeout:
Compatible and tested:
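A minimal sketch of the hostid setup mentioned above, assuming standard RHEL tooling (genhostid and hostid); run once on each Lustre server:

# Generate /etc/hostid so each server has a unique host ID; with a hostid
# set, "zpool import" (without -f) refuses pools that appear active on
# another host, which is what the double-import protection relies on.
[ -f /etc/hostid ] || genhostid
hostid   # print the resulting host ID for verification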
|
| Comments |
| Comment by Gerrit Updater [ 08/Aug/16 ] | ||||||||||||||||
|
Gabriele Paciucci (gabriele.paciucci@intel.com) uploaded a new patch: http://review.whamcloud.com/21812 | ||||||||||||||||
| Comment by Gabriele Paciucci (Inactive) [ 01/Sep/16 ] | ||||||||||||||||
|
Version 0.99.2
| ||||||||||||||||
| Comment by Gabriele Paciucci (Inactive) [ 01/Sep/16 ] | ||||||||||||||||
|
TBD:
| ||||||||||||||||
| Comment by Christopher Morrone [ 01/Sep/16 ] | ||||||||||||||||
|
Other suggestions:
FYI, I hope to get our local resource agents in better shape by the end of next week. I'll share what I have when the scripts are more presentable. My thinking is that the zpool resource agent might be packaged along with ZFS, and the lustre resource agent might be packaged with Lustre. See this for resource agent packaging guidance: http://www.linux-ha.org/doc/dev-guides/_installing_and_packaging_resource_agents.html

For our local lustre resource agent, I am taking the approach of having the user pass in the Lustre service name so that the script can monitor /proc/fs/lustre/ to see if the service is running.
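A hypothetical sketch of that kind of check as a shell function (the function name and the exact /proc/fs/lustre layout are my assumptions, not the actual RA code):

# Return success if a named Lustre service (e.g. lquake-OST000a, MGS) is
# registered under /proc/fs/lustre; OSTs appear under obdfilter/, MDTs
# under mdt/, the MGS under mgs/.
lustre_service_is_running() {
    local svc="$1"
    ls -d /proc/fs/lustre/*/"$svc" >/dev/null 2>&1
}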
| Comment by Gabriele Paciucci (Inactive) [ 02/Sep/16 ] | ||||||||||||||||
|
Hi morrone,
1. The stonith hack was requested by adilger due to the fact that we don't have MMP in place for ZFS, and again this is perfectly managed by Pacemaker (tested in production).
2. The script can be used to manage multiple Lustre services. This is the output of crm_mon in production:
Cluster name: kapollo_oss
Last updated: Thu Jun 2 03:55:58 2016 Last change: Wed May 25 04:42:25 2016 by root via cibadmin on kapollo01
Stack: corosync
Current DC: kapollo02 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
2 nodes and 16 resources configured
Online: [ kapollo01 kapollo02 ]
Full list of resources:
kapollo01-ipmi (stonith:fence_ipmilan): Started kapollo02
kapollo02-ipmi (stonith:fence_ipmilan): Started kapollo01
ost00 (ocf::heartbeat:LustreZFS): Started kapollo02
ost01 (ocf::heartbeat:LustreZFS): Started kapollo01
ost02 (ocf::heartbeat:LustreZFS): Started kapollo02
ost03 (ocf::heartbeat:LustreZFS): Started kapollo01
ost04 (ocf::heartbeat:LustreZFS): Started kapollo02
ost05 (ocf::heartbeat:LustreZFS): Started kapollo01
ost06 (ocf::heartbeat:LustreZFS): Started kapollo02
ost07 (ocf::heartbeat:LustreZFS): Started kapollo01
ost08 (ocf::heartbeat:LustreZFS): Started kapollo02
ost09 (ocf::heartbeat:LustreZFS): Started kapollo01
Clone Set: healthLNET-clone [healthLNET]
Started: [ kapollo01 kapollo02 ]
Clone Set: healthLUSTRE-clone [healthLUSTRE]
Started: [ kapollo01 kapollo02 ]
PCSD Status:
kapollo01: Online
kapollo02: Online
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
3. Okay, I started with this approach in my very first version of this effort (not released, but presented at LAD14): one RA to manage the zpool, one RA to mount/umount Lustre. After several tests I decided to have a single script, because the colocation constraints in Pacemaker gave me odd results and are not solid enough. I can reconsider this in the future.
4. I'm on this.
5. I'm interested in your approach, and also in seeing whether your experiments with Pacemaker's constraints give you better results.
6. The general Lustre status (/proc/fs/lustre/health_status) is monitored by this agent in
| Comment by Christopher Morrone [ 06/Sep/16 ] | ||||||||||||||||
STONITH is an excellent idea. But Pacemaker handles that at higher levels; the resource agent script should not have anything to do with it. As far as I can tell, your script doesn't do real STONITH (at least it didn't before, and it is only optional now). It has a very racy reboot in place instead. Dangerous and scary if you ask me. We wouldn't run that in production. Maybe you can explain the theory of operation a bit more to assuage my concern.
But your resource agent makes node assumptions and tries to trigger power control and such that will interfere with the higher level Pacemaker's own attempts to move services around. It seems especially racy and dangerous to have multiple services on a node.
I'm not talking about the global "health_status"; I'm talking about looking for proc entries for the service being managed by the instance of the resource agent. At the moment I am looking for the ZFS dataset in /proc/fs/lustre/osd-zfs/*/mntdev. If I find it, I know that Lustre still has some stake in the dataset. I can't say that it is entirely sufficient, but I know for a fact that Lustre services are not always shut down when the devices disappear from /proc/mounts.

Although the more that I think about it, maybe .../mntdev isn't what I want either. I don't think we are terribly concerned in the lustre RA about whether the disk/dataset is in use. Before a dataset can be moved to another node, the zpool RA will have to be moved to that other node, and zpool export can't succeed if the device is still in use. So I can most likely leave actual zpool usage state to the zpool RA.

So back to the lustre RA. I want the lustre RA to be able to detect the situation where the umount succeeds, but the Lustre service has not stopped. My working assumption is that the /proc/fs/lustre/osd-zfs/*/mntdev entry will also stick around until the service has actually stopped. I assume that the service keeps a reference on that device internally until it completes shutdown. But I could be wrong.

The advantage of looking at .../mntdev is simpler configuration. It is not necessary to tell the lustre RA the name of the resource it is managing (e.g. MGS, lquake-MDT0000, lquake-OST000a). But if .../mntdev is not as reliable as I am hoping, then we would have to add a third required configuration parameter: I have dataset and mountpoint currently, and I would have to add servicename if .../mntdev will not suit our needs.

By the way, I am considering making mountpoint optional, because there is already a zfs "mountpoint" property. The RA could look the mountpoint up there if the admins didn't specify it in the pacemaker configuration.
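A rough shell sketch of the checks described above (the function names are hypothetical; the paths are those given in the comment):

# Is the ZFS dataset still referenced by a running Lustre service?
dataset_in_use_by_lustre() {
    local dataset="$1"            # e.g. lquake-mds1/mdt0
    grep -qx "$dataset" /proc/fs/lustre/osd-zfs/*/mntdev 2>/dev/null
}

# Fall back to the zfs "mountpoint" property when the admin did not set one.
dataset_mountpoint() {
    local dataset="$1"
    zfs get -H -o value mountpoint "$dataset"
}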
| Comment by Gabriele Paciucci (Inactive) [ 07/Sep/16 ] | ||||||||||||||||
|
Thank you for your comments; let's start with the first part. Pacemaker manages STONITH at the same level as stonith_admin, according to this documentation: https://github.com/ClusterLabs/pacemaker/blob/e022430d4df217b6764ea3f79ddf63432f98fd66/fencing/README.md

In the broadest terms, stonith works like this:
1. The initiator (an external program such as stonith_admin, or the cluster itself via the crmd) asks the local stonithd, "Hey, can you fence this node?"
2. The local stonithd asks all the stonithd's in the cluster (including itself), "Hey, what fencing devices do you have access to that can fence this node?"
....

As I already mentioned, stonith_admin is integrated in the stonith workflow. The man page for stonith_admin is misleading: the reboot option (-B) actually means power cycle for the fence agent, and the fence option (-F) means power off. In fact, in the stonith_admin source code https://github.com/ClusterLabs/pacemaker/blob/e022430d4df217b6764ea3f79ddf63432f98fd66/fencing/admin.c:

case 'B':
rc = mainloop_fencing(st, target, "reboot", timeout, tolerance);
break;
case 'F':
rc = mainloop_fencing(st, target, "off", timeout, tolerance);
break;
and in the fence_ipmilan python source:

def reboot_cycle(_, options):
output = run_command(options, create_command(options, "cycle"))
return bool(re.search('chassis power control: cycle', str(output).lower()))
The theory that I have implemented is really simple: if the stop command fails on the other node, or that node crashes, Pacemaker schedules a stonith by default. The second stonith command scheduled by my script is not executed, because the stonithd daemon is smart enough not to execute parallel stonith commands for the same host. I completely agree with you that a script shouldn't execute a stonith command, but until we have MMP this is the only way I can imagine to be sure not to corrupt the pool's metadata.
| Comment by Christopher Morrone [ 07/Sep/16 ] | ||||||||||||||||
I don't understand what is misleading about that. The man page says that -B is reboot, the code says that case 'B' is "reboot". The man page says -F is fence, the code says case 'F' is "off". Those statements appear to me to be in full agreement. So I still don't understand how a racy power cycle (i.e. reboot) would be your preferred MMP stand-in.
But that isn't really so simple in practice. Just because one device is unclean, you are killing all devices on the other node and starting failover procedures. Further, there is really no reason to do that unless you don't trust your own RA on that other node. Pacemaker has already run your RA on the other node and determined that the zpool is not running there. If Pacemaker could not run the RA on that other node, then it would have fenced the node on its own. So why is your script second-guessing your own script on the other node, and pacemaker itself? If you can't trust Pacemaker, then...well, you can't trust it and things are going to go wrong in many other ways.

Your script is also introducing a lot of unnecessary service interruptions. If the other node is flat out powered off, then all of the services on that node now have to move, unnecessarily, and other nodes are now doing double duty in hosting Lustre services. Now a sysadmin needs to get involved to fix the situation. Again, unnecessarily.

If instead your script does a reboot, the entire process is racy and dangerous. Your script has no information about what form the reboot/power cycle takes (hard power cycle? Linux clean "reboot" command?) and no information about the timing. Your script doesn't know how long to wait until the other node is no longer using the shared disk resource in question, and has no idea how fast it needs to run to capture the disk resource before the other node grabs it again during its boot-up process. I would argue that no fixed "sleep" numbers will ever be a good idea in that situation. I would suspect that there is a high degree of risk that your script will corrupt a filesystem in production if it is used long enough.
STONITH is totally reasonable, with or without MMP. I disagree that your solution is the only way to not corrupt the pool. I suspect that your approach (at least the one that employs reboots) is more likely to corrupt a pool at some point than the approach of writing a zpool RA with good monitor function, and letting Pacemaker ensure that the resource is never started on two nodes (zpool never imported on two nodes) at the same time. That is Pacemaker's job. | ||||||||||||||||
| Comment by Gabriele Paciucci (Inactive) [ 07/Sep/16 ] | ||||||||||||||||
|
Hi morrone, to close the reboot discussion: the fence agent converts (see the second code snippet above) the "reboot" command into a power cycle (power off/on), not an OS reboot. The "fencing" command is a power off only.
| Comment by Gabriele Paciucci (Inactive) [ 08/Sep/16 ] | ||||||||||||||||
|
The second part of your comment should be placed into the Pacemaker context and logic, and maybe we should review together all the possible scenarios. Pacemaker logic: when a single resource fails to stop, Pacemaker fences the node (https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/High_Availability_Add-On_Reference/s1-resourceoperate-HAAR.html). This behavior can be changed, but it is the default. Possible scenario:
So my stonith procedure may look useless (and if you apply the same exercise to ldiskfs, the MMP protection could in theory also be considered useless), but during my stress tests I saw situations where something went wrong in Pacemaker and, thanks to the additional protection, the pool wasn't corrupted. That clarified, I can improve the script by implementing a variable (OCF_STONITH_ENABLE) to enable/disable this stonith protection in the script (for brave sysadmins).
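A minimal sketch of how such a toggle could look inside the RA, assuming the usual OCF_RESKEY_* parameter convention rather than the OCF_STONITH_ENABLE name; the fencing call and the peer_node variable are illustrative, not the script's actual code:

# Default to the current behavior (stonith protection enabled).
: "${OCF_RESKEY_stonith_enable:=true}"

if [ "$OCF_RESKEY_stonith_enable" = "true" ]; then
    # Ask stonithd to power-cycle the peer before importing the pool;
    # stonithd will not run two fencing actions against the same host in parallel.
    stonith_admin --reboot "$peer_node"
fi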
| Comment by Christopher Morrone [ 08/Sep/16 ] | ||||||||||||||||
|
Could you elaborate on what went wrong in pacemaker? You are leaving out that your script explicitly introduces a new failure mode with its racy reboot approach to atomic pool access. Do your new failure mode, and all the added complexity, really balance out the rare failure mode in pacemaker? I think I would rather rely on Pacemaker's reliability than on a racy reboot mechanism.
| Comment by Christopher Morrone [ 09/Sep/16 ] | ||||||||||||||||
|
Oh, I see! The reboot isn't necessarily as bad as I was thinking, because the node simply booting again does not, in theory, introduce usage of a lustre device. You are relying on pacemaker to not start the device on that node after reboot. Which is odd, because you trust pacemaker there, but not enough to trust it to start the resource in only one place elsewhere (which is why you are attempting direct stonith control). I guess it isn't quite as scary as I originally thought. But the script needs to have options passed in to list the possible nodes that the resource can live on, rather than assuming there are only two nodes in the entire cluster and that the other node must be the only node that needs power cycling. As long as it is doing that, it definitely shouldn't be anywhere but contrib.
| Comment by Christopher Morrone [ 09/Sep/16 ] | ||||||||||||||||
|
I promised to share my version of pacemaker OCF resource agent scripts by the end of this week, so here they are: https://github.com/LLNL/lustre-tools-llnl/blob/1.8.2/scripts/zpool

Keep in mind that these are still a work in progress. They are certainly rough around the edges. The general idea is that zpool manages zpool import/export, and lustre manages mount/umount of individual zfs datasets. In pacemaker I have ordering and colocation constraints to express the relationship between the two resources.

Here is an example set of constraints that are working on our testbed. This is for two MDS nodes with shared storage. The zpool of the first MDS is used by two lustre targets: the MGS and the first MDS. This is a real-world example of needing to support two lustre filesystems in the same pool (all of our production filesystems are like that, even though DNE exists only on testbeds for now).

<rsc_location id="jet1-1_loc_20" rsc="jet1-1" node="jet1" score="20"/>
<rsc_location id="jet1-1_loc_10" rsc="jet1-1" node="jet2" score="10"/>
<rsc_location id="MGS_loc_20" rsc="MGS" node="jet1" score="20"/>
<rsc_location id="MGS_loc_10" rsc="MGS" node="jet2" score="10"/>
<rsc_order id="MGS_order" first="jet1-1" then="MGS" kind="Mandatory"/>
<rsc_colocation id="MGS_colocation" rsc="MGS" with-rsc="jet1-1" score="INFINITY"/>
<rsc_location id="lquake-MDT0000_loc_20" rsc="lquake-MDT0000" node="jet1" score="20"/>
<rsc_location id="lquake-MDT0000_loc_10" rsc="lquake-MDT0000" node="jet2" score="10"/>
<rsc_order id="lquake-MDT0000_order" first="jet1-1" then="lquake-MDT0000" kind="Mandatory"/>
<rsc_colocation id="lquake-MDT0000_colocation" rsc="lquake-MDT0000" with-rsc="jet1-1" score="INFINITY"/>
<rsc_location id="jet2-1_loc_20" rsc="jet2-1" node="jet2" score="20"/>
<rsc_location id="jet2-1_loc_10" rsc="jet2-1" node="jet1" score="10"/>
<rsc_location id="lquake-MDT0001_loc_20" rsc="lquake-MDT0001" node="jet2" score="20"/>
<rsc_location id="lquake-MDT0001_loc_10" rsc="lquake-MDT0001" node="jet1" score="10"/>
<rsc_order id="lquake-MDT0001_order" first="jet2-1" then="lquake-MDT0001" kind="Mandatory"/>
<rsc_colocation id="lquake-MDT0001_colocation" rsc="lquake-MDT0001" with-rsc="jet2-1" score="INFINITY"/>
I wrote a script to generate the pacemaker cib.xml for an entire lustre server cluster starting from an ldev.conf file. It is also on github: https://github.com/LLNL/lustre-tools-llnl/blob/1.8.2/scripts/ldev2cib

ldev2cib is also a work in progress. Currently stonith is entirely disabled. We'll work on adding stonith once the local powerman stonith agent is ready to be integrated into the whole. Yes, the lustre script is zfs-only at this time. It could be expanded to work with ldiskfs too if that makes sense. I'm also open to keeping ldiskfs and zfs lustre RA scripts separate.
| Comment by Gabriele Paciucci (Inactive) [ 10/Sep/16 ] | ||||||||||||||||
|
Hi morrone, I also suggest using the resource-stickiness option to avoid any failback.
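For example, a cluster-wide default stickiness could be set with pcs (the value shown is illustrative):

pcs resource defaults resource-stickiness=200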
| Comment by Andreas Dilger [ 11/Feb/17 ] | ||||||||||||||||
|
There is a considerable amount of duplication between this ticket and | ||||||||||||||||
| Comment by Bradley Merchant [ 16/Feb/17 ] | ||||||||||||||||
|
We have been using this script and ran into an issue on manual failover (i.e. pcs resource disable/enable or move/relocate). On resource stop the script force unmounts Lustre targets (umount -f). In live tests this caused the recovery flags to not be set on OSTs, causing client eviction on remount. Removing the -f flag resolves the problem, but we are curious if removing it could present any other issues. I do notice Christopher Morrone's script omits the -f flag. | ||||||||||||||||
| Comment by Gabriele Paciucci (Inactive) [ 16/Feb/17 ] | ||||||||||||||||
|
Hi bmerchant thank you for this feedback. I would like to consult with adilger about this. | ||||||||||||||||
| Comment by Andreas Dilger [ 16/Feb/17 ] | ||||||||||||||||
|
Using "umount -f" will evict all of the clients at unmount time, which means they will get IO errors for any in-flight writes when the server is started again. This is useful if the administrator knows that the clients have already been stopped, for example. I don't think that using the "-f" flag will significantly reduce the amount of time that the unmount will take, but if you have numbers to that effect it would be interesting to see. It may be possible to improve the performance of the non-f unmount if that is a real issue. | ||||||||||||||||
| Comment by Gabriele Paciucci (Inactive) [ 20/Feb/17 ] | ||||||||||||||||
|
Added a new version, 0.99.5, that fixes the umount issue.
| Comment by Vaughn E. Clinton [ 03/Apr/17 ] | ||||||||||||||||
|
I've been trying to use the script to create the HA volume/dataset resources with the following syntax:

pcs resource create hail-mgt ocf:heartbeat:Lustre-MDS-ZFS pool="ha.mds" volume="mgt" mountpoint="/lustre/hail/mgt"

Each attempt returns the following error:

Error: Unable to create resource 'ocf:heartbeat:Lustre-MDS-ZFS', it is not installed on this system (use --force to override)

I can see that the agent script, Lustre-MDS-ZFS, is dropped into the correct location when I run this syntax with the debug option enabled. I also see the script being run with a return value of 0. I'm not exactly sure what the problem is. Could it be missing some binary that I'm not seeing in the debug output? Anyway, I would greatly appreciate some guidance with solving this. Here are the details about my configuration:

Red Hat Enterprise Linux Server release 7.3 (Maipo)
fence-agents-common-4.0.11-47.el7_3.2.x86_64

This is being deployed in a diskless 2-node HA Lustre environment. Please let me know if you require me to open a ticket concerning this issue.
| Comment by Malcolm Cowe (Inactive) [ 03/Apr/17 ] | ||||||||||||||||
|
Hi Vaughn, Try using the path ocf:pacemaker:Lustre-MDS-ZFS, instead of ocf:heartbeat:Lustre-MDS-ZFS. You can also verify the list of available RAs using the command pcs resource list. For example: [root@ct66-mds2 ~]# pcs resource list ocf:pacemaker | awk 'tolower($0) ~ /lustre|lnet/'
ocf:pacemaker:Lustre-MDS-ZFS - Lustre and ZFS management when the MDT and MGT
ocf:pacemaker:LustreZFS - Lustre and ZFS management
ocf:pacemaker:healthLNET - LNet connectivity
ocf:pacemaker:healthLUSTRE - lustre servers healthy
| ||||||||||||||||
| Comment by Vaughn E. Clinton [ 04/Apr/17 ] | ||||||||||||||||
|
Malcolm, thanks for the response! I really appreciate the help with this since I'm very new to PCS/Pacemaker/Corosync setups. Anyway, I ran the following command with the syntax as you suggested. Here's the return from the command:

pcs resource list ocf:pacemaker | awk 'tolower($0) ~ /lustre|lnet/'

I even attempted with heartbeat, and here's the return for that attempt:

pcs resource list ocf:heartbeat | awk 'tolower($0) ~ /lustre|lnet/'

I did attempt to create the resources anyway and it failed as with the previous attempts:

pcs resource create hail-mgt ocf:pacemaker:LustreZFS pool="ha-mds" volume="mgt" mountpoint="/lustre/hail/mgmt"

I forgot to add the version of the resource-agents RPM that is installed in this environment: resource-agents-3.9.5-82.el7_3.3.x86_64

Again, thanks for the assistance.
| Comment by Malcolm Cowe (Inactive) [ 04/Apr/17 ] | ||||||||||||||||
|
From the output, it looks as though PCS cannot find almost any resources. You probably need to check that the packages are installed correctly. For reference, the packages on my server are:

[root@ct66-mds2 ~]# rpm -qa resource-agents
resource-agents-3.9.5-82.el7_3.6.x86_64
[root@ct66-mds2 ~]# rpm -qa Lustre-ZFS-RA
Lustre-ZFS-RA-0.99.5-1.noarch

The RAs are installed in /usr/lib/ocf/resource.d, in subdirectories for each class. For example, the pacemaker directory on one of my servers looks like this:

[root@ct66-mds2 ~]# ls /usr/lib/ocf/resource.d/pacemaker
ClusterMon  Dummy      healthLNET    HealthSMART     LustreZFS  pingd   Stateful  SystemHealth
controld    HealthCPU  healthLUSTRE  Lustre-MDS-ZFS  ping       remote  SysInfo

The pcs resource list command scans these directories to assemble the list of available RAs. Running pcs resource list with no further arguments should return a large list of available resource agents. If none of the RAs are showing up, but there are files listed in /usr/lib/ocf/resource.d/{heartbeat,pacemaker}, then it is possible that there is a permissions problem. All the RAs need to have the executable bit set; on a default install they will have mode 755 on all files and directories, owned by root. If those are correct, then perhaps something like SELinux is interfering, although I would hope that is unlikely.
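A few quick checks along those lines (paths from the listing above; the exact file names depend on which RAs are installed):

ls -l /usr/lib/ocf/resource.d/pacemaker/LustreZFS   # expect mode 755, owner root
chmod 755 /usr/lib/ocf/resource.d/pacemaker/LustreZFS
getenforce                                          # is SELinux enforcing?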
| Comment by Vaughn E. Clinton [ 05/Apr/17 ] | ||||||||||||||||
|
Checked to see what the resource option could locate with respect to ZFS, and here's what I got:

pcs resource list | grep -i zfs
ls /usr/lib/ocf/resource.d/pacemaker
ls /usr/lib/ocf/resource.d/heartbeat
ls /usr/lib/ocf/resource.d/llnl/

The LLNL agents were installed yesterday by another staff member, and we were able to successfully create the resources using the LLNL RA scripts but not the Intel ones:

Online: [ mds00 mds01 ]
Full list of resources:
hammer_io6 (stonith:fence_powerman): Started mds00

Anyway, if you have any other suggestions, I'd welcome them, because I'd prefer using a vendor RA but will settle for the LLNL one for the moment. Thanks again for the support with this. Cheers,
| Comment by Andreas Dilger [ 16/Sep/17 ] | ||||||||||||||||
|
Is there more to be done here, or should this ticket be closed? I believe the ZFS RA scripts were landed upstream? | ||||||||||||||||
| Comment by Nathaniel Clark [ 18/Sep/17 ] | ||||||||||||||||
|
I think this can be closed. ZFS RA was merged upstream, and the Lustre resource agents are available. | ||||||||||||||||
| Comment by Malcolm Haak - NCI (Inactive) [ 18/Oct/17 ] | ||||||||||||||||
|
My apologies, I see LUSTREhealth has been merged in. Or do we use the RPM attached here? It doesn't worry me what the answer is; it just seems a bit difficult to determine from the current state of the ticket/git repo.
| Comment by Malcolm Cowe (Inactive) [ 19/Oct/17 ] | ||||||||||||||||
|
mhaakddn: Take a look here: http://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services Nathaniel Clark has upstreamed the ZFS RA into the resource-agents project on GitHub, but it will take some time to filter into OS distros. The above-referenced page shows how to download it and incorporate it into a pacemaker cluster.
| Comment by Malcolm Haak - NCI (Inactive) [ 22/Oct/17 ] | ||||||||||||||||
|
Thanks for that! |