Details
- Type: Question/Request
- Resolution: Unresolved
- Priority: Blocker
- Affects Version: Lustre 2.10.2
- Environment:
  Servers: Lustre-2.10.2, Kernel: 3.10.0-693.5.2.el7_lustre.x86_64
  Clients: Lustre-2.10.3, Kernel: 3.10.0-693.21.1.el7.x86_64
  Client/Server OS: CentOS Linux release 7.4.1708
Description
I ran tunefs.lustre and it reported success changing the failover NIDs to what they should be. The problem is happening on several OSTs, but the same fix should apply to all of them; I'm assuming I forgot a step when I ran tunefs.lustre. The command I used was:
tunefs.lustre --erase-param failover.node --param failover.node=172.17.1.103@o2ib,172.16.1.103@tcp1 /dev/mapper/mpathg
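Since the same change is needed on the other affected OSTs, the command can simply be repeated per device. A rough sketch only, where the device names are placeholders and the failover NIDs would need to match the actual failover partner of each OST:

# Device names below are hypothetical examples; substitute the real OST
# devices, and adjust the failover NIDs per server pair as appropriate.
for dev in /dev/mapper/mpathg /dev/mapper/mpathh; do
    tunefs.lustre --erase-param failover.node \
        --param failover.node=172.17.1.103@o2ib,172.16.1.103@tcp1 "$dev"
done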
The OST OST0017 is mounted on 172.17.1.103 with the following parameters:
[root@apslstr03 ~]# tunefs.lustre --dryrun /dev/mapper/mpathg
checking for existing Lustre data: found
Reading CONFIGS/mountdata
Read previous values:
Target: lustrefc-OST0017
Index: 23
Lustre FS: lustrefc
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: ,errors=remount-ro
Parameters: failover.node=172.17.1.103@o2ib,172.16.1.103@tcp1 mgsnode=172.17.1.112@o2ib,172.16.1.112@tcp1
mgsnode=172.17.1.113@o2ib,172.16.1.113@tcp1
Permanent disk data:
Target: lustrefc-OST0017
Index: 23
Lustre FS: lustrefc
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: ,errors=remount-ro
Parameters: failover.node=172.17.1.103@o2ib,172.16.1.103@tcp1 mgsnode=172.17.1.112@o2ib,172.16.1.112@tcp1
mgsnode=172.17.1.113@o2ib,172.16.1.113@tcp1
exiting before disk write.
[root@apslstr03 ~]#
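Since tunefs.lustre only rewrites the target's on-disk CONFIGS/mountdata, it may also be worth checking whether the MGS configuration logs (which are what the clients actually read the NIDs from) still carry the old values. A hedged sketch, assuming it is run on the MGS node and that the filesystem name is lustrefc:

# On the MGS node: print the client configuration log and look for records
# that still reference the old NIDs for OST0017.
lctl --device MGS llog_print lustrefc-client | grep -i OST0017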
However, the clients are still displaying errors like this:
May 8 11:43:33 localhost kernel: Lustre: 2028:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1557333772/real 0] req@ffff880bd9296f00 x1632920191594624/t0(0) o8->lustrefc-OST0017-osc-ffff8817ef372000@172.17.1.106@o2ib:28/4 lens 520/544 e 0 to 1 dl 1557333813 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
May 8 11:43:33 localhost kernel: Lustre: 2028:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 65 previous similar messages
May 8 11:45:26 localhost kernel: LNet: 1994:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 172.17.1.106@o2ib: 3 seconds
May 8 11:45:26 localhost kernel: LNet: 1994:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 39 previous similar messages
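For reference, one way to see which NIDs a client currently has on record for this OST (assuming lctl is available on the client and the osc device name matches) is to dump the import state:

# On a client: show the import for OST0017, including the connection
# currently in use and the failover NIDs the client knows about.
lctl get_param osc.lustrefc-OST0017-*.import
# The NID of the server the client is currently trying to reach:
lctl get_param osc.lustrefc-OST0017-*.ost_conn_uuid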
As the name suggests, the failover.node parameter specifies the failover NIDs for a target; it does not reflect the target's primary NID.
Now that you mention that your cluster is down, I am wondering whether your targets have been moved so that their primary NID is now different. If the targets did not move and they are all running on their primary node, then this change to the failover node parameter should not lead to any downtime.
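A quick way to confirm whether a target has moved (a sketch, assuming shell access to the OSS nodes) is to check on each server which Lustre targets it currently has mounted and which NIDs it is answering on:

# Run on each OSS node: list the mounted Lustre targets and confirm which
# server currently has lustrefc-OST0017 mounted.
mount -t lustre
# List the NIDs this server is configured with.
lctl list_nids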