Details

    • Type: New Feature
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.10.0

    Description

      A new script to be used in Pacemaker to manage ZFS pools and Lustre targets.

      This resource agent (RA) can manage (import/export) ZFS pools and (mount/umount) Lustre targets.

      pcs resource create <Resource Name> ocf:heartbeat:LustreZFS \
        pool="<ZFS Pool Name>" \
        volume="<ZFS Volume Name>" \
        mountpoint="<Mount Point>" \
        OCF_CHECK_LEVEL=10
      

      where:

      • pool is the name of the ZFS pool, created in advance
      • volume is the name of the volume created on the ZFS pool during the Lustre format (mkfs.lustre)
      • mountpoint is the mount point, created in advance on both Lustre servers
      • OCF_CHECK_LEVEL is optional and enables an extra monitor of the pool status
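For instance, a concrete invocation might look like this (all names below are made-up examples):

```shell
# Create a Pacemaker resource for one OST: import pool "ostpool0" and
# mount the "ost0" volume on /lustre/ost0, with the extra pool monitor.
pcs resource create lustre-OST0000 ocf:heartbeat:LustreZFS \
    pool="ostpool0" \
    volume="ost0" \
    mountpoint="/lustre/ost0" \
    OCF_CHECK_LEVEL=10
```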

      This script should be located in /usr/lib/ocf/resource.d/heartbeat/ on both Lustre servers, with permissions 755.

      The script provides protection from double imports of the pools. To activate this functionality, it is important to configure the hostid protection in ZFS using the genhostid command.
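As a setup sketch (run once per server during provisioning, assuming stock ZFS on Linux):

```shell
# Write a random, persistent host ID to /etc/hostid. With distinct hostids,
# ZFS can detect that a pool was last imported by the other node and refuse
# a plain 'zpool import', which is the double-import protection the RA uses.
genhostid
hostid   # print the value; it must differ between the two Lustre servers
```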

      Default values:

      • no defaults

      Default timeout:

      • start timeout 300s
      • stop timeout 300s
      • monitor timeout 300s interval 20s

      Compatible and tested with:

      • pacemaker 1.1.13
      • corosync 2.3.4
      • pcs 0.9.143
      • RHEL/CentOS 7.2

          Activity

            [LU-8455] Pacemaker script for Lustre and ZFS

            We have been using this script and ran into an issue on manual failover (i.e. pcs resource disable/enable or move/relocate). On resource stop the script force unmounts Lustre targets (umount -f). In live tests this caused the recovery flags to not be set on OSTs, causing client eviction on remount. Removing the -f flag resolves the problem, but we are curious if removing it could present any other issues.

            I do notice Christopher Morrone's script omits the -f flag.

            bmerchant Bradley Merchant (Inactive) added a comment

            There is a considerable amount of duplication between this ticket and LU-8458. One of them should probably be closed.

            adilger Andreas Dilger added a comment

            Hi morrone,
            as I already mentioned in our discussion, I found problems (and weaknesses) in Pacemaker when managing two colocated resources. One of the problems was moving (from your example) the MGS from jet1 to jet2. The only way to do this is moving jet1-1, but this means that if the MGS resource failed, the other resource is not moved... I also tried using the resource group concept, but without success.
            I tested this on RHEL 6; maybe they fixed this issue in RHEL 7. Could you please test this?

            I also suggest using the resource-stickiness option to avoid any failback.
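Setting stickiness cluster-wide with pcs could look like this (the score value is an arbitrary example):

```shell
# Make running resources prefer to stay where they are, preventing
# automatic failback when the original node comes back online.
pcs resource defaults resource-stickiness=200
```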

            gabriele.paciucci Gabriele Paciucci (Inactive) added a comment

            I promised to share my version of pacemaker OCF resource agent scripts by the end of this week so here they are:

            https://github.com/LLNL/lustre-tools-llnl/blob/1.8.2/scripts/zpool
            https://github.com/LLNL/lustre-tools-llnl/blob/1.8.2/scripts/lustre

            Keep in mind that these are still work-in-progress. They are certainly rough around the edges. The general idea is that zpool manages zpool import/export, and lustre manages mount/umount of individual zfs datasets.

            In pacemaker I have ordering and colocation constraints to express the relationship between the two resources. Here is an example set of constraints that are working on our testbed. This is for two MDS nodes with shared storage. The zpool of the first MDS is used by two lustre targets: the MGS and the first MDS. This is a real-world example of needing to support two lustre filesystems in the same pool (all of our production filesystems are like that, even though DNE exists only on testbeds for now).

                  <rsc_location id="jet1-1_loc_20" rsc="jet1-1" node="jet1" score="20"/>
                  <rsc_location id="jet1-1_loc_10" rsc="jet1-1" node="jet2" score="10"/>
                  <rsc_location id="MGS_loc_20" rsc="MGS" node="jet1" score="20"/>
                  <rsc_location id="MGS_loc_10" rsc="MGS" node="jet2" score="10"/>
                  <rsc_order id="MGS_order" first="jet1-1" then="MGS" kind="Mandatory"/>
                  <rsc_colocation id="MGS_colocation" rsc="MGS" with-rsc="jet1-1" score="INFINITY"/>
                  <rsc_location id="lquake-MDT0000_loc_20" rsc="lquake-MDT0000" node="jet1" score="20"/>
                  <rsc_location id="lquake-MDT0000_loc_10" rsc="lquake-MDT0000" node="jet2" score="10"/>
                  <rsc_order id="lquake-MDT0000_order" first="jet1-1" then="lquake-MDT0000" kind="Mandatory"/>
                  <rsc_colocation id="lquake-MDT0000_colocation" rsc="lquake-MDT0000" with-rsc="jet1-1" score="INFINITY"/>
                  <rsc_location id="jet2-1_loc_20" rsc="jet2-1" node="jet2" score="20"/>
                  <rsc_location id="jet2-1_loc_10" rsc="jet2-1" node="jet1" score="10"/>
                  <rsc_location id="lquake-MDT0001_loc_20" rsc="lquake-MDT0001" node="jet2" score="20"/>
                  <rsc_location id="lquake-MDT0001_loc_10" rsc="lquake-MDT0001" node="jet1" score="10"/>
                  <rsc_order id="lquake-MDT0001_order" first="jet2-1" then="lquake-MDT0001" kind="Mandatory"/>
                  <rsc_colocation id="lquake-MDT0001_colocation" rsc="lquake-MDT0001" with-rsc="jet2-1" score="INFINITY"/>
            

            I wrote a script to generate the pacemaker cib.xml for an entire lustre server cluster starting from an ldev.conf file. It is also on github:

            https://github.com/LLNL/lustre-tools-llnl/blob/1.8.2/scripts/ldev2cib

            ldev2cib is also a work-in-progress. Currently stonith is entirely disabled. We'll work on adding stonith once the local powerman stonith agent is ready to be integrated into the whole.
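The core transformation ldev2cib performs can be sketched like this (a simplified illustration, not the actual LLNL script; it assumes the usual four-column ldev.conf layout of local host, failover host, label, device, and emits only the location constraints):

```python
# Sketch: derive Pacemaker location constraints from ldev.conf entries,
# preferring each target's local host (score 20) over its failover peer
# (score 10), in the style of the constraint set shown above.

def ldev_to_locations(ldev_text):
    rules = []
    for raw in ldev_text.splitlines():
        line = raw.split("#")[0].strip()   # drop comments and blank lines
        if not line:
            continue
        local, foreign, label, device = line.split()[:4]
        rules.append(f'<rsc_location id="{label}_loc_20" rsc="{label}" '
                     f'node="{local}" score="20"/>')
        rules.append(f'<rsc_location id="{label}_loc_10" rsc="{label}" '
                     f'node="{foreign}" score="10"/>')
    return "\n".join(rules)

sample = """\
# local  foreign  label            device
jet1     jet2     lquake-MDT0000   zfs:jet1-1/lquake-mdt0
jet2     jet1     lquake-MDT0001   zfs:jet2-1/lquake-mdt1
"""
print(ldev_to_locations(sample))
```

The real script also emits the ordering and colocation constraints tying each target to its zpool resource.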

            Yes, the lustre script is zfs-only at this time. It could be expanded to work with ldiskfs too if that makes sense. I'm also open to keeping ldiskfs and zfs lustre RA scripts separate.

            morrone Christopher Morrone (Inactive) added a comment

            Oh, I see! The reboot isn't necessarily as bad as I was thinking, because the node simply booting again does not, in theory, introduce usage of a Lustre device. You are relying on Pacemaker not to start the device on that node after reboot. Which is odd, because you trust Pacemaker there, but not enough to trust it to start the resource in only one place everywhere else (which is why you are attempting direct stonith control).

            I guess it isn't quite as scary as I originally thought. But the script needs to have options passed in to list the possible nodes that the resource can live on, rather than assuming there are only two nodes in the entire cluster and that the other node must be the only node that needs power cycling. Even with that fixed, it definitely shouldn't be in any place but contrib.

            morrone Christopher Morrone (Inactive) added a comment

            Could you elaborate on what went wrong in pacemaker?

            You are leaving out that your script explicitly introduces a new failure mode with its racy, reboot approach to atomic pool access. Does your new failure mode, and all the added complexity, really balance out the rare failure mode in pacemaker?

            I think I would rather rely on Pacemaker's reliability than a racy reboot mechanism.

            morrone Christopher Morrone (Inactive) added a comment

            The second part of your comment should be placed into the Pacemaker context and logic, and maybe we should review together all the possible scenarios.

            Pacemaker logic: when a single resource fails to stop, Pacemaker fences the node (https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/High_Availability_Add-On_Reference/s1-resourceoperate-HAAR.html). This behavior can be changed, but this is the default.
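This default corresponds to the standard Pacemaker on-fail=fence behavior for stop operations; changing it per resource could look like this (a sketch, with a hypothetical resource name; on-fail=block leaves the resource stopped and requires manual intervention instead of fencing):

```shell
# Do not fence the node when this resource's stop operation fails;
# block the resource and wait for an administrator instead.
pcs resource update lustre-OST0000 op stop on-fail=block timeout=300s
```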

            Possible scenarios:

            | Event | Pacemaker | ZFS RA | Notes |
            |-------|-----------|--------|-------|
            | HW server failure | Scheduled fencing | Scheduled fencing | my script's fencing is not executed; all the resources are moved |
            | Single pool failure or Lustre failure | Stops the resource cleanly | Clean import | single resource moved; all the other resources moved |
            | Single pool failure or Lustre failure | Can't stop the resource, fence is scheduled | Schedules fencing | my script's fencing is not executed; all the resources are moved |

            So my stonith procedure looks useless (and if you do the same exercise for LDISKFS, the MMP protection could in theory also be considered useless), but during my stress tests I saw situations where something went wrong in Pacemaker and, thanks to the additional protection, the pool wasn't corrupted.

            To clarify: I can improve the script by implementing a variable (OCF_STONITH_ENABLE) to enable/disable this stonith protection in the script (for brave sysadmins).

            gabriele.paciucci Gabriele Paciucci (Inactive) added a comment (edited)

            Hi morrone, to close the reboot discussion: the fence agent converts (see the second code snippet) the "reboot" command into a power cycle (power off/on), not an OS reboot. The "fencing" command is a power off only.

            gabriele.paciucci Gabriele Paciucci (Inactive) added a comment (edited)

            The man page for stonith_admin is misleading

            I don't understand what is misleading about that. The man page says that -B is reboot, the code says that case 'B' is "reboot". The man page says -F is fence, the code says case 'F' is "off". Those statements appear to me to be in full agreement.

            So I still don't understand how a racy power cycle (i.e. reboot) would be your preferred MMP stand-in.

            The theory that I have implemented is really simple:
            1. Try a clean import; if this fails, try again, and if it fails again,
            2. Execute the stonith_admin command; if this fails, try again, and if it fails again,
            3. Give up.

            But that isn't really so simple in practice. Just because one device is unclean, you are killing all devices on the other node and starting failover procedures.

            Further, there is really no reason to do that unless you don't trust your own RA on that other node. Pacemaker has already run your RA on the other node and determined that the zpool is not running there. If Pacemaker could not run the RA on that other node, then it would have fenced the node on its own.

            So why is your script second guessing your own script on the other node, and pacemaker itself? If you can't trust Pacemaker, then...well you can't trust it and things are going to go wrong in many other ways.

            Your script is also introducing a lot of unnecessary service interruptions. If the other node is flat out powered off, then all of the services on that node now have to move, unnecessarily, and other nodes are now doing double-duty in hosting Lustre services. Now a sysadmin needs to get involved to fix the situation. Again, unnecessarily.

            If instead your script does a reboot, the entire process is racy and dangerous. Your script has no information about what form the reboot/power cycle takes (hard power cycle? Linux clean "reboot" command?) and no information about the timing. Your script doesn't know how long to wait until the other node is no longer using the shared disk resource in question, and no idea how fast it needs to run to capture the disk resource before the other node grabs it again during its boot up process. I would argue that no fixed "sleep" numbers will ever be a good idea in that situation. I would suspect that there is a high degree of risk that your script will corrupt a filesystem in production if it is used long enough.

            I completely agree with you that a script shouldn't execute a stonith command but until we have MMP, this is the only way I can imagine to be sure to not corrupt the pool's metadata.

            STONITH is totally reasonable, with or without MMP. I disagree that your solution is the only way to not corrupt the pool. I suspect that your approach (at least the one that employs reboots) is more likely to corrupt a pool at some point than the approach of writing a zpool RA with good monitor function, and letting Pacemaker ensure that the resource is never started on two nodes (zpool never imported on two nodes) at the same time. That is Pacemaker's job.

            morrone Christopher Morrone (Inactive) added a comment

            Thank you for your comments, let's start with the first part.

            Pacemaker is managing STONITH at the same level as stonith_admin, according to this documentation: https://github.com/ClusterLabs/pacemaker/blob/e022430d4df217b6764ea3f79ddf63432f98fd66/fencing/README.md

            In the broadest terms, stonith works like this:
            
            1. The initiator (an external program such as stonith_admin, or the cluster itself via the crmd) asks the local stonithd, "Hey, can you fence this node?"
            2. The local stonithd asks all the stonithd's in the cluster (including itself), "Hey, what fencing devices do you have access to that can fence this node?"
            ....
            

            as I already mentioned stonith_admin is integrated in the stonith workflow.

            The man page for stonith_admin is misleading: the reboot option (-B) actually means power cycle for the fence agent, and the fence option (-F) means power off. In fact, in the stonith_admin source code https://github.com/ClusterLabs/pacemaker/blob/e022430d4df217b6764ea3f79ddf63432f98fd66/fencing/admin.c:

                case 'B':
                    rc = mainloop_fencing(st, target, "reboot", timeout, tolerance);
                    break;
                case 'F':
                    rc = mainloop_fencing(st, target, "off", timeout, tolerance);
                    break;
            

            and in the fence_ipmilan python source:

            def reboot_cycle(_, options):
                output = run_command(options, create_command(options, "cycle"))
                return bool(re.search('chassis power control: cycle', str(output).lower()))
            

            The theory that I have implemented is really simple:
            1. Try a clean import; if this fails, try again, and if it fails again,
            2. Execute the stonith_admin command; if this fails, try again, and if it fails again,
            3. Give up.
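That flow can be sketched roughly as follows (illustrative only; the callables are placeholders for the 'zpool import' and 'stonith_admin' invocations the real RA makes):

```python
# Sketch of the start flow described above: retry a clean import,
# escalate to fencing the peer, retry the import once more, then give up.
# import_pool / fence_peer stand in for the real commands.

def start_pool(import_pool, fence_peer, attempts=2):
    for _ in range(attempts):           # 1. try a clean import (twice)
        if import_pool():
            return "started"
    for _ in range(attempts):           # 2. try stonith_admin (twice)
        if fence_peer():
            # peer is fenced; the import should now succeed
            return "started" if import_pool() else "failed"
    return "failed"                     # 3. give up
```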

            We should note that if the stop command fails on the other node, or the node crashes, Pacemaker schedules a stonith by default. The second stonith command scheduled by my script is not executed, because the stonithd daemon is smart enough not to execute parallel stonith commands for the same host.

            I completely agree with you that a script shouldn't execute a stonith command, but until we have MMP, this is the only way I can imagine to be sure to not corrupt the pool's metadata.

            gabriele.paciucci Gabriele Paciucci (Inactive) added a comment (edited)

            1. The stonith hack was requested by Andreas Dilger due to the fact that we don't have MMP in place for ZFS, and again, this is perfectly managed by pacemaker (tested in production).

            STONITH is an excellent idea. But Pacemaker handles that at higher levels, the resource agent script should not have anything to do with it. As far as I can tell, your script doesn't do real STONITH (at least it didn't before, and it is only optional now). It has a very racy reboot in place instead. Dangerous and scary if you ask me. We wouldn't run that in production.

            Maybe you can explain the theory of operation a bit more to me to assuage my concern.

            2. The script should be used to manage more Lustre services. This is the output of crm_mon in production:

            But your resource agent makes node assumptions and tries to trigger power control and such that will interfere with the higher level Pacemaker's own attempts to move services around. It seems especially racy and dangerous to have multiple services on a node.

            6. The general lustre status (/proc/fs/lustre/health_status) is monitored by this agent in LU-8458

            I'm not talking about the global "health_status", I'm talking about looking for proc entries for the service being managed by the instance of the resource agent.

            At the moment I am looking for the ZFS dataset in /proc/fs/lustre/osd-zfs/*/mntdev. If I find it, I know that Lustre still has some stake in the dataset. I can't say that it is entirely sufficient, but I know for a fact that lustre services are not always shut down when the devices disappear from /proc/mounts.

            Although the more that I think about it, maybe .../mntdev isn't what I want either. I don't think we are terribly concerned in the lustre RA about whether the disk/dataset is in use. Before a dataset can be moved to another node, the zpool RA will have to be moved to that other node. zpool export can't succeed if the device is still in use. So I can most likely leave actual zpool usage state to the zpool RA.

            So back to the lustre RA. I want the lustre RA to be able to detect the situation where the umount succeeds, but the lustre service has not stopped. My working assumption is that the /proc/fs/lustre/osd-zfs/*/mntdev will also stick around until the service has actually stopped. I assume that the service keeps a reference on that device internally until it completes shut down. But I could be wrong.

            The advantage of looking at .../mntdev is simpler configuration. It is not necessary to tell the lustre RA the name of the resource it is managing (e.g. MGS, lquake-MDT0000, lquake-OST000a). But if .../mntdev is not as reliable as I am hoping, then we would have to add a third required configuration parameter. I have dataset and mountpoint currently, and I would have to add servicename if .../mntdev will not suit our needs.
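A minimal sketch of that mntdev check (the dataset name is a made-up example):

```shell
# Succeed while osd-zfs still reports the dataset, i.e. the Lustre service
# has not finished stopping even if the target is gone from /proc/mounts.
grep -qx "ostpool0/ost0" /proc/fs/lustre/osd-zfs/*/mntdev 2>/dev/null
```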

            By the way, I am considering making mountpoint optional, because there is already a zfs "mountpoint" property. The RA could look the mountpoint up there if the admins didn't specify it in the pacemaker configuration.
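Reading the property could be as simple as (hypothetical dataset name):

```shell
# Print only the mountpoint value for the dataset (no header, one column)
zfs get -H -o value mountpoint ostpool0/ost0
```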

            morrone Christopher Morrone (Inactive) added a comment

            People

              Assignee: gabriele.paciucci Gabriele Paciucci (Inactive)
              Reporter: gabriele.paciucci Gabriele Paciucci (Inactive)
              Votes: 0
              Watchers: 15