Details
-
Question/Request
-
Resolution: Unresolved
-
Blocker
-
None
-
Lustre 2.10.2
-
None
-
Servers: Lustre-2.10.2, Kernel: 3.10.0-693.5.2.el7_lustre.x86_64
Clients: Lustre-2.10.3, Kernel: 3.10.0-693.21.1.el7.x86_64
Client/Server OS: CentOS Linux release 7.4.1708
Description
I ran tunefs.lustre and successfully changed the failover NIDs to what they should be, but the clients still report connection errors for the OST. The problem affects several OSTs, but whatever fixes one should fix them all.
I assume I missed a step when I ran tunefs.lustre:
tunefs.lustre --erase-param failover.node --param failover.node=172.17.1.103@o2ib,172.16.1.103@tcp1 /dev/mapper/mpathg
The OST OST0017 is mounted on 172.17.1.103 with the following parameters:
[root@apslstr03 ~]# tunefs.lustre --dryrun /dev/mapper/mpathg
checking for existing Lustre data: found
Reading CONFIGS/mountdata
Read previous values:
Target: lustrefc-OST0017
Index: 23
Lustre FS: lustrefc
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: ,errors=remount-ro
Parameters: failover.node=172.17.1.103@o2ib,172.16.1.103@tcp1 mgsnode=172.17.1.112@o2ib,172.16.1.112@tcp1 mgsnode=172.17.1.113@o2ib,172.16.1.113@tcp1
Permanent disk data:
Target: lustrefc-OST0017
Index: 23
Lustre FS: lustrefc
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: ,errors=remount-ro
Parameters: failover.node=172.17.1.103@o2ib,172.16.1.103@tcp1 mgsnode=172.17.1.112@o2ib,172.16.1.112@tcp1 mgsnode=172.17.1.113@o2ib,172.16.1.113@tcp1
exiting before disk write.
[root@apslstr03 ~]#
However, the clients are still displaying errors like this:
May 8 11:43:33 localhost kernel: Lustre: 2028:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1557333772/real 0] req@ffff880bd9296f00 x1632920191594624/t0(0) o8->lustrefc-OST0017-osc-ffff8817ef372000@172.17.1.106@o2ib:28/4 lens 520/544 e 0 to 1 dl 1557333813 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
May 8 11:43:33 localhost kernel: Lustre: 2028:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 65 previous similar messages
May 8 11:45:26 localhost kernel: LNet: 1994:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Timed out tx for 172.17.1.106@o2ib: 3 seconds
May 8 11:45:26 localhost kernel: LNet: 1994:0:(o2iblnd_cb.c:3192:kiblnd_check_conns()) Skipped 39 previous similar messages
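The timed-out request in the log above names both the target and the NID the client tried to reach. As a minimal sketch (not part of the original report), that pair can be pulled out of such a line with sed. The function name parse_conn is hypothetical, and the sample line is a fragment of the first log message above.

```shell
# Fragment of the first client log message above (the part after "o8->").
log_line='o8->lustrefc-OST0017-osc-ffff8817ef372000@172.17.1.106@o2ib:28/4 lens 520/544'

# parse_conn (hypothetical helper): print "<target> <nid>" for a log line
# containing "-><target>-osc-<hex id>@<nid>:<portal info>".
parse_conn() {
    printf '%s\n' "$1" |
        sed -n 's/.*->\([^ ]*\)-osc-[0-9a-f]*@\([^ :]*\):.*/\1 \2/p'
}

parse_conn "$log_line"
# prints: lustrefc-OST0017 172.17.1.106@o2ib
```

Here the client is sending to 172.17.1.106@o2ib, while the OST is mounted on 172.17.1.103, which suggests the clients are still using an older NID for this target rather than the one written by tunefs.lustre.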
Good to hear!
If you need to change a primary NID, I would advise following the dedicated instructions in the Lustre Operations Manual:
http://doc.lustre.org/lustre_manual.xhtml#dbdoclet.changingservernid
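For orientation, here is a rough dry-run sketch of the central step of that procedure as I understand it: running lctl replace_nids on the MGS while the clients and the other targets are stopped. It only prints the command; nothing is executed. The target name and NID list are taken from this ticket purely for illustration, the plan and run helpers are hypothetical, and the exact prerequisites should be checked against the manual section linked above.

```shell
#!/bin/sh
# Dry-run only: prints what would be run, executes nothing.
TARGET="lustrefc-OST0017"                          # target name from this ticket
NEW_NIDS="172.17.1.103@o2ib,172.16.1.103@tcp1"     # illustrative NID list

run() { echo "+ $*"; }   # hypothetical helper: echo instead of execute

plan() {
    # Assumed precondition: all clients and all targets are unmounted,
    # with only the MGS service still running.
    echo "# on the MGS node:"
    run lctl replace_nids "$TARGET" "$NEW_NIDS"
    echo "# afterwards: remount the targets, then the clients"
}

plan
```

Repeating the replace_nids line once per affected OST would cover the case where several targets carry a stale NID.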