Details
- Type: Bug
- Resolution: Won't Fix
- Priority: Blocker
- Fix Version/s: None
- Affects Version/s: Lustre 2.1.6
- Labels: None
- Severity: 3
- Rank: 11068
Description
Hi,
At Meteo France, Lustre clients are randomly evicted with the following message:
1381739209 2013 Oct 14 08:26:49 beaufixoss9 kern err kernel LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 172.23.67.157@o2ib ns: filter-scratch-OST0024_UUID lock: ffff880215b3c6c0/0xe3c417cfbe584025 lrc: 3/0,0 mode: PW/PW res: 87670641/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x20 remote: 0x964c642755ae9fd expref: 4 pid: 20133 timeout 4581190301
Some attempts were made to increase ldlm_timeout, but even with a timeout of 300 seconds the timer still expires and the client is evicted.
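(For reference, two usual ways to adjust this tunable -- I believe both apply to 2.1, though the exact commands we ran may have differed slightly; the value of 300 seconds is the one mentioned above:)
  echo 300 > /proc/sys/lustre/ldlm_timeout       (on each OSS; takes effect immediately, not persistent)
  lctl conf_param scratch.sys.ldlm_timeout=300   (on the MGS; persistent across remounts)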
After collecting some debug logs, it seems there is a communication problem: the client sends the lock callback RPC, but the OSS never receives it, so the lock callback timer eventually expires.
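(In case it helps, the debug logs were gathered with the usual lctl mechanism, roughly along these lines -- the exact debug mask we enabled may have been different:)
  lctl set_param debug=+rpctrace                 (add RPC tracing to the debug mask)
  lctl set_param debug=+dlmtrace                 (add DLM lock tracing)
  lctl dk /tmp/lustre-logs-YYYYMMDD-HHMMSS       (dump the kernel debug buffer to a file)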
Here is an example:
[root@beaufixmngt0 beaufix1046]# grep evict lustre-logs-20131011-082535
00000100:00080000:17.0:1381479935.274633:0:4383:0:(import.c:915:ptlrpc_connect_interpret()) @@@ scratch-OST0017-osc-ffff880454e9b400: evicting (reconnect/recover flags not set: 4) req@ffff88087945e000 x1448401857359985/t0(0) o8->scratch-OST0017-osc-ffff880454e9b400@172.23.80.69@o2ib1:28/4 lens 368/264 e 0 to 0 dl 1381479946 ref 1 fl Interpret:RN/0/0 rc 0/0
00000100:02020000:17.0:1381479935.274641:0:4383:0:(import.c:1299:ptlrpc_import_recovery_state_machine()) 167-0: This client was evicted by scratch-OST0017; in progress operations using this service will fail.
00000100:00080000:17.0:1381479935.285351:0:4383:0:(import.c:1303:ptlrpc_import_recovery_state_machine()) evicted from scratch-OST0017_UUID@172.23.80.69@o2ib1; invalidating
172.23.80.69 = beaufixoss15-ib1
[root@beaufixmngt0 beaufix1046]# head -n1 ../beaufixoss15/lustre-logs-20131011-082458
00010000:00000010:14.1F:1381479186.063701:0:0:0:(ldlm_lock.c:198:ldlm_lock_free()) slab-freed 'lock': 560 at ffff880067e6c480.
[root@beaufixmngt0 beaufix1046]# tail -n 2 ../beaufixoss15/lustre-logs-20131011-082458 | head -n1
00002000:00000001:2.0:1381479898.458969:0:20716:0:(filter_io_26.c:325:filter_do_bio()) Process entered
The OSS debug log therefore covers the window from 1381479186 to 1381479898.
Here is the list of RPCs sent during this time (total of 5 RPCs):
[root@beaufixmngt0 beaufix1046]# grep -f requests lustre-logs-20131011-082535 | grep ptl_send_buf
00000100:00000200:22.0:1381479685.229939:0:4404:0:(niobuf.c:87:ptl_send_buf()) Sending 192 bytes to portal 28, xid 1448401857358474, offset 0
00000100:00000200:46.0:1381479697.155610:0:4424:0:(niobuf.c:87:ptl_send_buf()) Sending 296 bytes to portal 28, xid 1448401857358807, offset 0
00000100:00000200:6.0:1381479797.753246:0:4384:0:(niobuf.c:87:ptl_send_buf()) Sending 456 bytes to portal 6, xid 1448401857359681, offset 0
00000100:00000200:27.0:1381479835.230381:0:4411:0:(niobuf.c:87:ptl_send_buf()) Sending 192 bytes to portal 28, xid 1448401857359731, offset 0
00000100:00000200:6.0:1381479885.229892:0:4409:0:(niobuf.c:87:ptl_send_buf()) Sending 192 bytes to portal 28, xid 1448401857359816, offset 0
[root@beaufixmngt0 beaufix1046]# grep -f requests ../beaufixoss15/lustre-logs-20131011-082458 | grep incoming
00000100:00000040:0.0:1381479885.230810:0:10038:0:(events.c:284:request_in_callback()) incoming req@ffff88005b69f400 x1448401857359816 msgsize 192
-> Only 1 of the 5 RPCs sent by the client was received by the OSS.
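(For completeness: the "requests" file used with grep -f above is just a list of the xids of interest. The exact way we selected those 5 xids is not shown here, but extracting xids from a client debug log can be done along these lines:)
  grep ptl_send_buf lustre-logs-20131011-082535 | sed -n 's/.*xid \([0-9]*\),.*/\1/p' | sort -u > requests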
This example only shows what happens at the LNet level and above; so far we do not see any errors at the InfiniBand level.
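(For the record, typical checks at the lower layers -- assuming infiniband-diags is available on the nodes -- would be, for example:)
  lctl ping 172.23.80.69@o2ib1   (LNet-level connectivity check from the client)
  perfquery                      (error/traffic counters of the local HCA port)
  ibqueryerrors                  (fabric-wide scan for ports with error counters above threshold)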
This is a blocking problem, as it prevents the cluster from being used in production.
I will upload traces to help analysis.
Thanks for your help,
Sebastien.