[LU-8457] Pacemaker script to monitor LNet Created: 01/Aug/16  Updated: 26/Apr/17  Resolved: 26/Apr/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.10.0

Type: New Feature Priority: Minor
Reporter: Gabriele Paciucci (Inactive) Assignee: Gabriele Paciucci (Inactive)
Resolution: Fixed Votes: 0
Labels: lnet, pacemaker

Issue Links:
Blocker
is blocking LU-9168 Add pacemaker resources to lustre rpms Resolved

 Description   

A new script for use with Pacemaker to monitor LNet, compatible with both ZFS-based and LDISKFS-based Lustre server installations.

This resource agent (RA) can monitor a single LNet device using Pacemaker's clone mechanism.

pcs resource create [Resource Name] ocf:pacemaker:healthLNET \
dampen=[seconds 5s] \
multiplier=[number 1000] \
lctl=[true | false] \
device=[device name ib0] \
host_list=[list of NIDs, space separated, if lctl is true; otherwise list of IPs] \
--clone

where:

  • dampen The time to wait (dampening) for further changes to occur
  • multiplier The number by which to multiply the number of connected ping nodes
  • attempts Number of ping attempts, per host, before declaring it dead
  • timeout How long, in seconds, to wait before declaring a ping lost
  • lctl Option to enable lctl ping instead of the normal ping. The default is true
  • device Device used for the LNet network. The same device name is assumed across the cluster
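
The way these parameters interact can be sketched roughly as follows. This is only an illustration, not the healthLNET source: the function names probe_host and lnet_score are hypothetical, the timeout is taken as a plain number of seconds, and the score is assumed to be the number of reachable hosts times the multiplier, as the multiplier description suggests.

```shell
#!/bin/sh
# Hypothetical sketch; not taken from the healthLNET agent.

# Probe one host once: "lctl ping" for NIDs when lctl=true, plain ping otherwise.
probe_host() {
    local host="$1" use_lctl="$2" timeout="$3"
    if [ "$use_lctl" = "true" ]; then
        lctl ping "$host" "$timeout" >/dev/null 2>&1
    else
        ping -n -q -c 1 -W "$timeout" "$host" >/dev/null 2>&1
    fi
}

# Print multiplier * (number of hosts that answered within $attempts tries).
lnet_score() {
    local multiplier="$1" attempts="$2" timeout="$3" use_lctl="$4" host_list="$5"
    local active=0 host try
    for host in $host_list; do
        try=0
        while [ "$try" -lt "$attempts" ]; do
            if probe_host "$host" "$use_lctl" "$timeout"; then
                active=$((active + 1))
                break
            fi
            try=$((try + 1))
        done
    done
    echo $((multiplier * active))
}
```

For example, lnet_score 1000 3 5 true "10.10.130.1@tcp1 10.10.130.2@tcp1" would print 2000 if both NIDs answer, 1000 if only one does, and 0 if neither does.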

This script should be installed in /usr/lib/ocf/resource.d/heartbeat/ on both Lustre servers with permissions 755.

Default values:

  • dampen 5s
  • multiplier 1
  • attempts 3
  • timeout 5s
  • lctl true

Default timeouts:

  • start timeout 60s
  • stop timeout 20s
  • monitor timeout 60s interval 10s
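
If these operation timeouts need to be stated explicitly or overridden, pcs accepts op clauses on pcs resource create. The sketch below only builds and prints the command so it can be reviewed before being run; the helper name build_create_cmd is hypothetical, and the values simply mirror the defaults listed above.

```shell
#!/bin/sh
# Hypothetical helper: prints a pcs command with explicit operation timeouts.
# It does not run pcs itself.
build_create_cmd() {
    name="$1"; device="$2"; hosts="$3"
    printf 'pcs resource create %s ocf:pacemaker:healthLNET' "$name"
    printf ' lctl=true device=%s host_list="%s"' "$device" "$hosts"
    printf ' op start timeout=60s op stop timeout=20s'
    printf ' op monitor interval=10s timeout=60s --clone\n'
}
```

Calling build_create_cmd healthLNET eth1 "10.10.130.1@tcp1 10.10.130.2@tcp1" prints a single command line that can be pasted into a shell on a cluster node.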

Compatible and tested:

  • pacemaker 1.1.13
  • corosync 2.3.4
  • pcs 0.9.143
  • RHEL/CentOS 7.2

Example configuration procedure:

pcs resource create healthLNET ocf:pacemaker:healthLNET dampen=5s multiplier=1000 lctl=true device=eth1 host_list="10.10.130.1@tcp1 10.10.130.2@tcp1" --clone 

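# Collect the names of all OST resources known to the cluster
targets=$(crm_mon -1 | grep 'OST' | awk '{print $1}')

# Ban each OST from any node where the pingd attribute is missing or below 1
for i in $targets; do pcs constraint location "$i" rule score=-INFINITY pingd lt 1 or not_defined pingd; done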
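
The location rule keeps a target off any node where the pingd attribute is undefined or less than 1. A rough sketch of that decision follows, using a hypothetical helper node_banned that takes the attribute value as its argument (an empty string stands for "not defined", and any other value is assumed to be an integer):

```shell
#!/bin/sh
# Hypothetical illustration of "pingd lt 1 or not_defined pingd".
# Returns 0 (true) when the node would get the -INFINITY score.
node_banned() {
    pingd="$1"
    if [ -z "$pingd" ] || [ "$pingd" -lt 1 ]; then
        return 0
    fi
    return 1
}
```

Assuming the attribute holds the multiplier times the number of reachable hosts, with multiplier=1000 and two hosts in host_list a node carries 2000, 1000 or 0. Only 0 (or a missing attribute) bans the OSTs from that node, so losing a single peer does not by itself force a failover.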


 Comments   
Comment by Gerrit Updater [ 01/Sep/16 ]

Gabriele Paciucci (gabriele.paciucci@intel.com) uploaded a new patch: http://review.whamcloud.com/22266
Subject: LU-8457 subject: Pacemaker script to monitor LNet
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8b1ee4646a31818f73dfc18c46f5d38cd48156b7

Comment by Gerrit Updater [ 07/Feb/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/22266/
Subject: LU-8457 pacemaker: Pacemaker script to monitor LNet
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 9018f11cd5a1ab82353e79271163ef51db081e95

Comment by Peter Jones [ 07/Feb/17 ]

Landed for 2.10

Comment by Gerrit Updater [ 07/Feb/17 ]

Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: https://review.whamcloud.com/25297
Subject: LU-8457 pacemaker: Update healthLNET to 0.99.4
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 8594bde54d52a92452535c226ae9da21affc8f8d

Comment by Gerrit Updater [ 26/Apr/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/25297/
Subject: LU-8457 pacemaker: Update healthLNET to 0.99.4
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f5530a0faa24ad836a44bdd8d0ce86bf806fde87
