[LU-8455] Pacemaker script for Lustre and ZFS Created: 01/Aug/16  Updated: 22/Oct/17  Resolved: 19/Sep/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.10.0

Type: New Feature Priority: Minor
Reporter: Gabriele Paciucci (Inactive) Assignee: Gabriele Paciucci (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File Lustre-ZFS-RA-0.99.5-1.noarch.rpm    
Issue Links:
Duplicate
duplicates LU-8458 Pacemaker script to monitor Lustre se... Resolved
Related
Rank (Obsolete): 9223372036854775807

 Description   

A new script to be used in Pacemaker to manage ZFS pools and Lustre targets.

This RA is able to manage (import/export) ZFS pools and Lustre Target (mount/umount).

pcs resource create <Resource Name> ocf:heartbeat:LustreZFS \
    pool="<ZFS Pool Name>" \
    volume="<ZFS Volume Name>" \
    mountpoint="<Mount Point>" \
    OCF_CHECK_LEVEL=10

where:

  • pool is the name of the ZFS pool, created in advance
  • volume is the name of the ZFS volume (dataset) created on the pool during the Lustre format (mkfs.lustre)
  • mountpoint is the mount point, created in advance on both Lustre servers
  • OCF_CHECK_LEVEL is optional and enables an extra monitor check on the status of the pool

This script should be located in /usr/lib/ocf/resource.d/heartbeat/ on both Lustre servers with permissions 755.

The script provides protection from double imports of the pools. In order to activate this functionality it is important to configure hostid protection in ZFS using the genhostid command.
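
For example, a complete setup on one node could look like the following (the pool, volume, and mount point names are examples only, and the op timeouts simply restate the defaults listed below):

# copy the RA to both servers and set the permissions described above
install -m 755 LustreZFS /usr/lib/ocf/resource.d/heartbeat/LustreZFS

# configure the ZFS hostid protection needed for the double-import check
genhostid

# create a resource for one OST (names are illustrative)
pcs resource create ost00 ocf:heartbeat:LustreZFS \
    pool="ostpool00" \
    volume="ost00" \
    mountpoint="/lustre/ost00" \
    OCF_CHECK_LEVEL=10 \
    op start timeout=300s op stop timeout=300s \
    op monitor timeout=300s interval=20s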

Default values:

  • no defaults

Default timeout:

  • start timeout 300s
  • stop timeout 300s
  • monitor timeout 300s interval 20s

Compatible and tested:

  • pacemaker 1.1.13
  • corosync 2.3.4
  • pcs 0.9.143
  • RHEL/CentOS 7.2


 Comments   
Comment by Gerrit Updater [ 08/Aug/16 ]

Gabriele Paciucci (gabriele.paciucci@intel.com) uploaded a new patch: http://review.whamcloud.com/21812
Subject: LU-8455 setup: Pacemaker script for Lustre and ZFS
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3b22f9c4ecf519f2b9334a2e6a362145d7cac4a4

Comment by Gabriele Paciucci (Inactive) [ 01/Sep/16 ]

Version 0.99.2
ChangeLog:

  • Created two variables to make it easy to change the sleep time during the import and start/stop phases
  • Created a variable to select the stonith mode to use (power down/power cycle)
  • Added a disclaimer about the current 2-way cluster design (and I am now thinking about a smarter way to make this more generic).

Comment by Gabriele Paciucci (Inactive) [ 01/Sep/16 ]

TBD:

  • remove the 2-way cluster limitation
  • add additional tests to monitor: "I've seen too many times that the umount command succeeds and lustre is no longer listed as mounted, but the lustre service hasn't really stopped yet."
  • rpm integration and directory to host the script in the OCF dir
Comment by Christopher Morrone [ 01/Sep/16 ]

Other suggestions:

  • Purge all of the stonith related code. Resource agents shouldn't do that. It also seems to assume that there will only be one Lustre service per node...
  • Don't assume only one Lustre service per node
  • Don't assume only one Lustre service per zpool (zpools can host multiple datasets, and therefore multiple lustre services). But to support this, the cleanest way is to have separate zpool and lustre resource agent scripts.
  • Fix crazy indenting.

FYI, I hope to get our local resource agents in better shape by the end of next week. I'll share what I have when the scripts are more presentable. My thinking is that the zpool resource agent might be packaged along with ZFS, and the lustre resource agent might be packaged with Lustre. See this for resource agent packaging guidance:

http://www.linux-ha.org/doc/dev-guides/_installing_and_packaging_resource_agents.html

For our local lustre resource agent, I am taking the approach of having the user pass in the Lustre service name so that the script can monitor /proc/fs/lustre/ to see if the service is running.
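
Something along these lines, as a rough sketch only (the helper name and the example target name are hypothetical):

# returns 0 if a /proc/fs/lustre entry exists for the named target
lustre_service_running() {
    local svc="$1"                      # e.g. lquake-OST000a
    ls -d /proc/fs/lustre/*/"$svc" > /dev/null 2>&1
}

lustre_service_running "lquake-OST000a" && echo "service is running"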

Comment by Gabriele Paciucci (Inactive) [ 02/Sep/16 ]

Hi morrone,

1. the stonith hack was requested by adilger due to the fact that we don't have MMP in place for ZFS, and again this is perfectly managed by pacemaker (tested in production).

2. The script can be used to manage multiple Lustre services. This is the output of crm_mon in production:

Cluster name: kapollo_oss
Last updated: Thu Jun  2 03:55:58 2016          Last change: Wed May 25 04:42:25 2016 by root via cibadmin on kapollo01
Stack: corosync
Current DC: kapollo02 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
2 nodes and 16 resources configured

Online: [ kapollo01 kapollo02 ]

Full list of resources:

 kapollo01-ipmi (stonith:fence_ipmilan):        Started kapollo02
 kapollo02-ipmi (stonith:fence_ipmilan):        Started kapollo01
 ost00  (ocf::heartbeat:LustreZFS):     Started kapollo02
 ost01  (ocf::heartbeat:LustreZFS):     Started kapollo01
 ost02  (ocf::heartbeat:LustreZFS):     Started kapollo02
 ost03  (ocf::heartbeat:LustreZFS):     Started kapollo01
 ost04  (ocf::heartbeat:LustreZFS):     Started kapollo02
 ost05  (ocf::heartbeat:LustreZFS):     Started kapollo01
 ost06  (ocf::heartbeat:LustreZFS):     Started kapollo02
 ost07  (ocf::heartbeat:LustreZFS):     Started kapollo01
 ost08  (ocf::heartbeat:LustreZFS):     Started kapollo02
 ost09  (ocf::heartbeat:LustreZFS):     Started kapollo01
 Clone Set: healthLNET-clone [healthLNET]
     Started: [ kapollo01 kapollo02 ]
 Clone Set: healthLUSTRE-clone [healthLUSTRE]
     Started: [ kapollo01 kapollo02 ]

PCSD Status:
  kapollo01: Online
  kapollo02: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

3. Okay, I started with this approach in my very first version of this effort (not released, but presented at LAD14): one RA to manage the zpool, one RA to mount/umount Lustre. After several tests I decided to have a single script, because the colocation constraints in pacemaker gave me funny results and are not solid enough. I can reconsider this in the future.

4. I'm on this

5. I'm interested in your approach, and I'm also interested in seeing whether your experiments with Pacemaker's constraints give you better results.

6. The general lustre status (/proc/fs/lustre/health_status) is monitored by this agent in LU-8458

Comment by Christopher Morrone [ 06/Sep/16 ]

1. the stonith hack was requested by Andreas Dilger due to the fact that we don't have MMP in place for ZFS, and again this is perfectly managed by pacemaker (tested in production).

STONITH is an excellent idea. But Pacemaker handles that at higher levels; the resource agent script should not have anything to do with it. As far as I can tell, your script doesn't do real STONITH (at least it didn't before, and it is only optional now). It has a very racy reboot in place instead. Dangerous and scary if you ask me. We wouldn't run that in production.

Maybe you can explain the theory of operation a bit more to me to assuage my concern.

2. The script can be used to manage multiple Lustre services. This is the output of crm_mon in production:

But your resource agent makes node assumptions and tries to trigger power control and such that will interfere with the higher level Pacemaker's own attempts to move services around. It seems especially racy and dangerous to have multiple services on a node.

6. The general lustre status (/proc/fs/lustre/health_status) is monitored by this agent in LU-8458

I'm not talking about the global "health_status", I'm talking about looking for proc entries for the service being managed by the instance of the resource agent.

At the moment I am looking for the ZFS dataset in /proc/fs/lustre/osd-zfs/*/mntdev. If I find it, I know that Lustre still has some stake in the dataset. I can't say that it is entirely sufficient, but I know for a fact that lustre services are not always shut down when the devices disappear from /proc/mounts.

Although the more that I think about it, maybe .../mntdev isn't what I want either. I don't think we are terribly concerned in the lustre RA about whether the disk/dataset is in use. Before a dataset can be moved to another node, the zpool RA will have to be moved to that other node. zpool export can't succeed if the device is still in use. So I can most likely leave actual zpool usage state to the zpool RA.

So back to the lustre RA. I want the lustre RA to be able to detect the situation where the umount succeeds, but the lustre service has not stopped. My working assumption is that the /proc/fs/lustre/osd-zfs/*/mntdev will also stick around until the service has actually stopped. I assume that the service keeps a reference on that device internally until it completes shut down. But I could be wrong.

The advantage of looking at .../mntdev is simpler configuration. It is not necessary to tell the lustre RA the name of the resource it is managing (e.g. MGS, lquake-MDT0000, lquake-OST000a). But if .../mntdev is not as reliable as I am hoping, then we would have to add a third required configuration parameter. I have dataset and mountpoint currently, and I would have to add servicename if .../mntdev will not suit our needs.
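
As a rough sketch, assuming the dataset name appears verbatim in those files (the dataset name below is an example):

dataset="ostpool00/ost00"
if grep -qsx "$dataset" /proc/fs/lustre/osd-zfs/*/mntdev; then
    echo "Lustre still holds a reference on $dataset"
fi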

By the way, I am considering making mountpoint optional, because there is already a zfs "mountpoint" property. The RA could look the mountpoint up there if the admins didn't specify it the pacemaker configuration.
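
For instance, the lookup could be as simple as this (sketch only; it assumes the dataset's mountpoint property is actually set rather than left as "legacy"):

mountpoint=$(zfs get -H -o value mountpoint "$dataset")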

Comment by Gabriele Paciucci (Inactive) [ 07/Sep/16 ]

Thank you for your comments, let's start with the first part.

Pacemaker is managing STONITH at the same level as stonith_admin, according to this documentation: https://github.com/ClusterLabs/pacemaker/blob/e022430d4df217b6764ea3f79ddf63432f98fd66/fencing/README.md

In the broadest terms, stonith works like this:

1. The initiator (an external program such as stonith_admin, or the cluster itself via the crmd) asks the local stonithd, "Hey, can you fence this node?"
2. The local stonithd asks all the stonithd's in the cluster (including itself), "Hey, what fencing devices do you have access to that can fence this node?"
....

as I already mentioned stonith_admin is integrated in the stonith workflow.

The man page for stonith_admin is misleading: the reboot option (-B) actually means power cycle for the fence agent, and the fence option (-F) means power off. In fact, in the stonith_admin source code https://github.com/ClusterLabs/pacemaker/blob/e022430d4df217b6764ea3f79ddf63432f98fd66/fencing/admin.c:

    case 'B':
        rc = mainloop_fencing(st, target, "reboot", timeout, tolerance);
        break;
    case 'F':
        rc = mainloop_fencing(st, target, "off", timeout, tolerance);
        break;

and in the fence_ipmilan python source:

def reboot_cycle(_, options):
	output = run_command(options, create_command(options, "cycle"))
	return bool(re.search('chassis power control: cycle', str(output).lower()))
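
In other words, at the command line the two calls being discussed are (the node name is just an example):

stonith_admin -B oss02    # "reboot": the fence agent performs a power cycle
stonith_admin -F oss02    # "fence": the fence agent performs a power off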

The theory that I have implemented is really simple:
1. Try a clean import; if this fails, try again; if it fails again,
2. Execute the stonith_admin command; if this fails, try again; if it fails again,
3. Give up.
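
A rough sketch of that flow, not the actual RA code (retry counts, sleep times and the peer-node handling are simplified assumptions):

import_with_fencing() {
    local pool="$1" peer="$2"
    # 1. try a clean import, twice
    zpool import "$pool" && return 0
    sleep 5
    zpool import "$pool" && return 0
    # 2. fence the peer through stonithd, twice
    stonith_admin -B "$peer" || { sleep 5; stonith_admin -B "$peer"; } || return 1
    sleep 30    # allow the power cycle to take effect
    # 3. last attempt, then give up
    zpool import "$pool"
}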

Note that if the stop command fails on the other node, or if the node crashes, Pacemaker schedules a stonith by default. The second stonith command, scheduled by my script, is not executed because the stonithd daemon is smart enough not to execute parallel stonith commands for the same host.

I completely agree with you that a script shouldn't execute a stonith command, but until we have MMP this is the only way I can imagine to be sure not to corrupt the pool's metadata.

Comment by Christopher Morrone [ 07/Sep/16 ]

The man page for stonith_admin is misleading

I don't understand what is misleading about that. The man page says that -B is reboot, the code says that case 'B' is "reboot". The man page says -F is fence, the code says case 'F' is "off". Those statements appear to me to be in full agreement.

So I still don't understand how a racy power cycle (i.e. reboot) would be your preferred MMP stand-in.

The theory that I have implemented is really simple:
1. Try a clean import; if this fails, try again; if it fails again,
2. Execute the stonith_admin command; if this fails, try again; if it fails again,
3. Give up.

But that isn't really so simple in practice. Just because one device is unclean, you are killing all devices on the other node and starting failover procedures.

Further, there is really no reason to do that unless you don't trust your own RA on that other node. Pacemaker has already run your RA on the other node and determined that the zpool is not running there. If Pacemaker could not run the RA on that other node, then it would have fenced the node on its own.

So why is your script second guessing your own script on the other node, and pacemaker itself? If you can't trust Pacemaker, then...well you can't trust it and things are going to go wrong in many other ways.

Your script is also introducing a lot of unnecessary service interruptions. If the other node is flat out powered off, then all of the services on that node now have to move, unnecessarily, and other nodes are now doing double-duty in hosting Lustre services. Now a sysadmin needs to get involved to fix the situation. Again, unnecessarily.

If instead your script does a reboot, the entire process is racy and dangerous. Your script has no information about what form the reboot/power cycle takes (hard power cycle? Linux clean "reboot" command?) and no information about the timing. Your script doesn't know how long to wait until the other node is no longer using the shared disk resource in question, and no idea how fast it needs to run to capture the disk resource before the other node grabs it again during its boot up process. I would argue that no fixed "sleep" numbers will ever be a good idea in that situation. I would suspect that there is a high degree of risk that your script will corrupt a filesystem in production if it is used long enough.

I completely agree with you that a script shouldn't execute a stonith command, but until we have MMP this is the only way I can imagine to be sure not to corrupt the pool's metadata.

STONITH is totally reasonable, with or without MMP. I disagree that your solution is the only way to not corrupt the pool. I suspect that your approach (at least the one that employs reboots) is more likely to corrupt a pool at some point than the approach of writing a zpool RA with good monitor function, and letting Pacemaker ensure that the resource is never started on two nodes (zpool never imported on two nodes) at the same time. That is Pacemaker's job.

Comment by Gabriele Paciucci (Inactive) [ 07/Sep/16 ]

Hi morrone, to close the reboot discussion: the fence agent (second code snippet) converts the "reboot" command into a power cycle (power off/on), not into an OS reboot. The "fencing" command is a power off only.

Comment by Gabriele Paciucci (Inactive) [ 08/Sep/16 ]

The second part of your comment should be placed into the Pacemaker context and logic, and maybe we should review together all the possible scenarios.

Pacemaker logic: when a single resource fails to stop, Pacemaker fences the node (https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/High_Availability_Add-On_Reference/s1-resourceoperate-HAAR.html). This behavior can be changed, but it is the default.
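
For reference, such an override on a single resource could look something like this (the resource name is illustrative, sketch only):

pcs resource update ost00 op stop on-fail=block timeout=300s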

Possible scenario:

Event                                  | Pacemaker                                 | ZFS RA            | Notes
HW server failure                      | Scheduled fencing                         | Scheduled fencing | my script's fencing is not executed, all the resources moved
Single Pool Failure or Lustre failure  | Stop clean the resource                   | Clean import      | single resource moved, all the other resources moved
Single Pool Failure or Lustre failure  | Can't stop the resource, fence is scheduled | Schedule fencing | my script's fencing is not executed, all the resources moved

So my stonith procedure looks useless (and if you do the same exercise for LDISKFS, even the MMP protection could in theory be considered useless), but during my stress tests I saw situations where something went wrong in pacemaker and, thanks to the additional protection, the pool wasn't corrupted.

With that clarified: I can improve the script by implementing a variable (OCF_STONITH_ENABLE) to enable/disable this stonith protection in the script (for brave sysadmins).

Comment by Christopher Morrone [ 08/Sep/16 ]

Could you elaborate on what went wrong in pacemaker?

You are leaving out that your script explicitly introduces a new failure mode with its racy reboot approach to atomic pool access. Does your new failure mode, and all the added complexity, really balance out the rare failure mode in pacemaker?

I think I would rather rely on Pacemaker's reliability than a racy reboot mechanism.

Comment by Christopher Morrone [ 09/Sep/16 ]

Oh, I see! The reboot isn't necessarily as bad as I was thinking, because the node simply booting again does not, in theory, introduce usage of a lustre device. You are relying on pacemaker to not start the device on that node after reboot. Which is odd, because you trust pacemaker there, but not enough to trust it to start the resource in only one place (which is why you are attempting direct stonith control).

I guess it isn't quite as scary as I originally thought. But the script needs to have options passed in to list the possible nodes that the resource can live on, rather than assuming there are only two nodes in the entire cluster and that the other node must be the only node that needs power cycling. As long as it is doing that, it definitely shouldn't be anywhere but contrib.

Comment by Christopher Morrone [ 09/Sep/16 ]

I promised to share my version of pacemaker OCF resource agent scripts by the end of this week so here they are:

https://github.com/LLNL/lustre-tools-llnl/blob/1.8.2/scripts/zpool
https://github.com/LLNL/lustre-tools-llnl/blob/1.8.2/scripts/lustre

Keep in mind that these are still work-in-progress. They are certainly rough around the edges. The general idea is that zpool manages zpool import/export, and lustre manages mount/umount of individual zfs datasets.

In pacemaker I have ordering and colocation constraints to express the relationship between the two resources. Here is an example set of constraints that is working on our testbed. This is for two MDS nodes with shared storage. The zpool of the first MDS is used by two lustre targets: the MGS and the first MDS. This is a real-world example of needing to support two lustre services in the same pool (all of our production filesystems are like that, even though DNE exists only on testbeds for now).

      <rsc_location id="jet1-1_loc_20" rsc="jet1-1" node="jet1" score="20"/>
      <rsc_location id="jet1-1_loc_10" rsc="jet1-1" node="jet2" score="10"/>
      <rsc_location id="MGS_loc_20" rsc="MGS" node="jet1" score="20"/>
      <rsc_location id="MGS_loc_10" rsc="MGS" node="jet2" score="10"/>
      <rsc_order id="MGS_order" first="jet1-1" then="MGS" kind="Mandatory"/>
      <rsc_colocation id="MGS_colocation" rsc="MGS" with-rsc="jet1-1" score="INFINITY"/>
      <rsc_location id="lquake-MDT0000_loc_20" rsc="lquake-MDT0000" node="jet1" score="20"/>
      <rsc_location id="lquake-MDT0000_loc_10" rsc="lquake-MDT0000" node="jet2" score="10"/>
      <rsc_order id="lquake-MDT0000_order" first="jet1-1" then="lquake-MDT0000" kind="Mandatory"/>
      <rsc_colocation id="lquake-MDT0000_colocation" rsc="lquake-MDT0000" with-rsc="jet1-1" score="INFINITY"/>
      <rsc_location id="jet2-1_loc_20" rsc="jet2-1" node="jet2" score="20"/>
      <rsc_location id="jet2-1_loc_10" rsc="jet2-1" node="jet1" score="10"/>
      <rsc_location id="lquake-MDT0001_loc_20" rsc="lquake-MDT0001" node="jet2" score="20"/>
      <rsc_location id="lquake-MDT0001_loc_10" rsc="lquake-MDT0001" node="jet1" score="10"/>
      <rsc_order id="lquake-MDT0001_order" first="jet2-1" then="lquake-MDT0001" kind="Mandatory"/>
      <rsc_colocation id="lquake-MDT0001_colocation" rsc="lquake-MDT0001" with-rsc="jet2-1" score="INFINITY"/>

I wrote a script to generate the pacemaker cib.xml for an entire lustre server cluster starting from an ldev.conf file. It is also on github:

https://github.com/LLNL/lustre-tools-llnl/blob/1.8.2/scripts/ldev2cib

ldev2cib is also a work-in-progress. Currently stonith is entirely disabled. We'll work on adding stonith once the local powerman stonith agent is ready to be integrated into the whole.

Yes, the lustre script is zfs-only at this time. It could be expanded to work with ldiskfs too if that makes sense. I'm also open to keeping ldiskfs and zfs lustre RA scripts separate.

Comment by Gabriele Paciucci (Inactive) [ 10/Sep/16 ]

Hi morrone,
as I already mentioned in our discussion, I found problems (and weaknesses) in pacemaker when managing two colocated resources. One of the problems was moving (from your example) MGS from jet1 to jet2. The only way to do this is to move jet1-1, but this means that if the MGS resource fails, the other resource is not moved... I also tried using the resource group concept, but without success.
I tested this in RHEL 6; maybe in RHEL 7 they fixed this issue. Could you please test this?

I also suggest using the resource stickiness option to avoid any failback.

Comment by Andreas Dilger [ 11/Feb/17 ]

There is a considerable amount of duplication between this ticket and LU-8458. One of them should probably be closed.

Comment by Bradley Merchant [ 16/Feb/17 ]

We have been using this script and ran into an issue on manual failover (i.e. pcs resource disable/enable or move/relocate). On resource stop, the script force-unmounts Lustre targets (umount -f). In live tests this caused the recovery flags not to be set on OSTs, causing client eviction on remount. Removing the -f flag resolves the problem, but we are curious whether removing it could present any other issues.

I do notice Christopher Morrone's script omits the -f flag.

Comment by Gabriele Paciucci (Inactive) [ 16/Feb/17 ]

Hi bmerchant, thank you for this feedback. I would like to consult with adilger about this.
I'm using umount -f because I don't want to wait a long time to close the file system if several clients are connected. If I hit the stop timeout in pacemaker, pacemaker will perform a stonith of the node.
But at the same time the -f option disables recovery; could this cause issues with data?
adilger, could you advise please. Thank you.

Comment by Andreas Dilger [ 16/Feb/17 ]

Using "umount -f" will evict all of the clients at unmount time, which means they will get IO errors for any in-flight writes when the server is started again. This is useful if the administrator knows that the clients have already been stopped, for example.

I don't think that using the "-f" flag will significantly reduce the amount of time that the unmount will take, but if you have numbers to that effect it would be interesting to see. It may be possible to improve the performance of the non-f unmount if that is a real issue.

Comment by Gabriele Paciucci (Inactive) [ 20/Feb/17 ]

Added a new version 0.99.5 that fixes the umount issue.

Comment by Vaughn E. Clinton [ 03/Apr/17 ]

I've been trying to use the script to create the HA volume/dataset resources with the following syntax:

pcs resource create hail-mgt ocf:heartbeat:Lustre-MDS-ZFS pool="ha.mds" volume="mgt" mountpoint="/lustre/hail/mgt"

Each attempt returns the following error:

Error: Unable to create resource 'ocf:heartbeat:Lustre-MDS-ZFS', it is not installed on this system (use --force to override)

I can see that the agent script, Lustre-MDS-ZFS, is dropped into the correct location when I run this syntax with the debug option enabled. I also see the script being run with a return value of 0. I'm not exactly sure what the problem is. Could it be missing some binary that I'm not seeing in the debug output? Anyway, I would greatly appreciate some guidance with solving this.

Here are the details about my configuration:

Red Hat Enterprise Linux Server release 7.3 (Maipo)
pcs-0.9.152-10.el7.x86_64
pacemaker-1.1.15-11.el7_3.2.x86_64
corosync-2.4.0-4.el7.x86_64

fence-agents-common-4.0.11-47.el7_3.2.x86_64
fence-agents-powerman-4.0.11-7.ch6.x86_64
libxshmfence-1.2-1.el7.x86_64

This is being deployed in a diskless 2 node HA Lustre environment. Please let me know if you require me to open a ticket concerning this issue.

Comment by Malcolm Cowe (Inactive) [ 03/Apr/17 ]

Hi Vaughn,

Try using the path ocf:pacemaker:Lustre-MDS-ZFS, instead of ocf:heartbeat:Lustre-MDS-ZFS. You can also verify the list of available RAs using the command pcs resource list. For example:

[root@ct66-mds2 ~]# pcs resource list ocf:pacemaker | awk 'tolower($0) ~ /lustre|lnet/'
ocf:pacemaker:Lustre-MDS-ZFS - Lustre and ZFS management when the MDT and MGT
ocf:pacemaker:LustreZFS - Lustre and ZFS management
ocf:pacemaker:healthLNET - LNet connectivity
ocf:pacemaker:healthLUSTRE - lustre servers healthy

 

Comment by Vaughn E. Clinton [ 04/Apr/17 ]

Malcolm,

Thanks for the response!  I really appreciate the help with this since I'm very new at PCS/Pacemaker/Corosync setups.

Anyway, I ran the following command with the syntax as you suggested.  Here's the return from the command:

pcs resource list ocf:pacemaker | awk 'tolower($0) ~ /lustre|lnet/'
Error: No resource agents matching the filter.

I even attempted with heartbeat and here's the return for that attempt:

pcs resource list ocf:heartbeat | awk 'tolower($0) ~ /lustre|lnet/'
Error: No resource agents matching the filter.

I did attempt to create the resources anyway and it failed as with the previous attempts:

pcs resource create hail-mgt ocf:pacemaker:LustreZFS pool="ha-mds" volume="mgt" mountpoint="/lustre/hail/mgmt"


Error: Unable to create resource 'ocf:pacemaker:LustreZFS', it is not installed on this system (use --force to override)

I forgot to add the version of the resource-agents RPM that installed in this environment:

resource-agents-3.9.5-82.el7_3.3.x86_64

Again, thanks for the assistance.

Comment by Malcolm Cowe (Inactive) [ 04/Apr/17 ]

From the output, it looks as though PCS cannot find almost any resources. Probably need to check that the packages are installed correctly.

For reference, the packages on my server are:

[root@ct66-mds2 ~]# rpm -qa resource-agents
resource-agents-3.9.5-82.el7_3.6.x86_64
[root@ct66-mds2 ~]# rpm -qa Lustre-ZFS-RA
Lustre-ZFS-RA-0.99.5-1.noarch

The RAs are installed in /usr/lib/ocf/resource.d, in subdirectories for each class. For example, the pacemaker directory on one of my servers looks like this:

[root@ct66-mds2 ~]# ls /usr/lib/ocf/resource.d/pacemaker
ClusterMon  Dummy      healthLNET    HealthSMART     LustreZFS  pingd   Stateful  SystemHealth
controld    HealthCPU  healthLUSTRE  Lustre-MDS-ZFS  ping       remote  SysInfo

The pcs resource list command scans these directories to assemble the list of available RAs. Running pcs resource list with no further arguments should return a large list of available resource agents.

If none of the RAs are showing up, but there are files listed in /usr/lib/ocf/resource.d/{heartbeat,pacemaker}, then it is possible that there is a permissions problem. All the RAs need to have the executable bit set, and on a default install will have mode 755 on all files and directories, owned by root. If they are correct, then perhaps something like SELinux is interfering, although I would hope that that is unlikely.
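
A few quick checks along those lines (illustrative only):

ls -l /usr/lib/ocf/resource.d/heartbeat /usr/lib/ocf/resource.d/pacemaker   # files should be mode 755 and owned by root
pcs resource list | wc -l                                                   # should be large if the RAs are being found
getenforce                                                                  # rule out SELinux enforcement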

Comment by Vaughn E. Clinton [ 05/Apr/17 ]

Checked to see what the resource option could locate with respect to ZFS and here's what I got:

pcs resource list | grep -i zfs
ocf:heartbeat:Lustre-MDS-ZFS - Lustre and ZFS management when the MDT and MGT
ocf:heartbeat:LustreZFS - Lustre and ZFS management
ocf:llnl:lustre - Lustre ZFS OSD resource agent
ocf:llnl:zpool - ZFS zpool resource agent
ocf:pacemaker:Lustre-MDS-ZFS - Lustre and ZFS management when the MDT and MGT
ocf:pacemaker:LustreZFS - Lustre and ZFS management

ls /usr/lib/ocf/resource.d/pacemaker
ClusterMon Dummy healthLNET HealthSMART LustreZFS pingd Stateful SystemHealth
controld HealthCPU healthLUSTRE Lustre-MDS-ZFS ping remote SysInfo

ls /usr/lib/ocf/resource.d/heartbeat
apache Delay exportfs healthLUSTRE iSCSILogicalUnit LVM nfsnotify oralsnr redis Squid
clvm dhcpd Filesystem iface-vlan iSCSITarget MailTo nfsserver pgsql Route symlink
conntrackd docker galera IPaddr Lustre-MDS-ZFS mysql nginx portblock rsyncd tomcat
CTDB Dummy garbd IPaddr2 LustreZFS nagios ocf-rarun postfix SendArp VirtualDomain
db2 ethmonitor healthLNET IPsrcaddr named oracle rabbitmq-cluster slapd Xinetd

 

ls /usr/lib/ocf/resource.d/llnl/
lustre zpool

The LLNL agents were installed yesterday by another staff member and we were able to successfully create the resources using the LLNL RA scripts but not the Intel ones:

Online: [ mds00 mds01 ]

Full list of resources:

 hammer_io6 (stonith:fence_powerman): Started mds00
 hammer_io5 (stonith:fence_powerman): Started mds01
 lustreMDSPool (ocf::llnl:zpool): Started mds00
 lustreMGT (ocf::llnl:lustre): Started mds00
 lustreMDT (ocf::llnl:lustre): Started mds00

Anyway, if you have any other suggestions, I'd welcome them, because I'd prefer using a vendor RA but will settle for the LLNL one for the moment.

Thanks again for the support with this.

Cheers,

Comment by Andreas Dilger [ 16/Sep/17 ]

Is there more to be done here, or should this ticket be closed? I believe the ZFS RA scripts were landed upstream?

Comment by Nathaniel Clark [ 18/Sep/17 ]

I think this can be closed. ZFS RA was merged upstream, and the Lustre resource agents are available.

Comment by Malcolm Haak - NCI (Inactive) [ 18/Oct/17 ]

My apologies, I see LUSTREhealth has been merged in LU-8458. Is the current state of affairs that the ZFS stuff is an exercise for the reader with the regular ZFS agents? Or have these other agents been upstreamed elsewhere?

Or do we use the RPM attached here?

It doesn't worry me what the answer is; it just seems to be a bit difficult to determine from the current state of the ticket/git repo.

Comment by Malcolm Cowe (Inactive) [ 19/Oct/17 ]

mhaakddn: Take a look here:

http://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services

Nathaniel Clark has upstreamed the ZFS RA into the resource-agents project on GitHub, but it will take some time to filter into OS distros. The above-referenced page shows how to download it and incorporate it into a pacemaker cluster.

Comment by Malcolm Haak - NCI (Inactive) [ 22/Oct/17 ]

Thanks for that!
