Details
- Type: Bug
- Resolution: Won't Fix
- Priority: Blocker
- Fix Version/s: None
- Affects Version/s: Lustre 2.1.6
- Component/s: None
- Severity: 3
- Rank: 11068
Description
Hi,
At Meteo France, Lustre clients are randomly evicted, with the following message logged on the OSS:
1381739209 2013 Oct 14 08:26:49 beaufixoss9 kern err kernel LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 172.23.67.157@o2ib ns: filter-scratch-OST0024_UUID lock: ffff880215b3c6c0/0xe3c417cfbe584025 lrc: 3/0,0 mode: PW/PW res: 87670641/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x20 remote: 0x964c642755ae9fd expref: 4 pid: 20133 timeout 4581190301
Some attempts were made to increase ldlm_timeout, but even with a timeout of 300 seconds the timer eventually expires and the client is evicted.
After collecting debug logs, it looks like a communication problem: the client sends the lock callback RPC, but the OSS never receives it, so the callback timer expires.
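For reference, this is roughly how the timeout was raised on the OSSes (a sketch; the /proc path is the one present on these Lustre 2.1 servers, adapt it to your setup):
# Runtime change on each OSS:
cat /proc/sys/lustre/ldlm_timeout          # check the current value
echo 300 > /proc/sys/lustre/ldlm_timeout   # raise the LDLM timeout to 300s
# Same thing via lctl, if preferred:
lctl set_param ldlm_timeout=300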
Here is an example:
[root@beaufixmngt0 beaufix1046]# grep evict lustre-logs-20131011-082535
00000100:00080000:17.0:1381479935.274633:0:4383:0:(import.c:915:ptlrpc_connect_interpret()) @@@ scratch-OST0017-osc-ffff880454e9b400: evicting (reconnect/recover flags not set: 4) req@ffff88087945e000 x1448401857359985/t0(0) o8->scratch-OST0017-osc-ffff880454e9b400@172.23.80.69@o2ib1:28/4 lens 368/264 e 0 to 0 dl 1381479946 ref 1 fl Interpret:RN/0/0 rc 0/0
00000100:02020000:17.0:1381479935.274641:0:4383:0:(import.c:1299:ptlrpc_import_recovery_state_machine()) 167-0: This client was evicted by scratch-OST0017; in progress operations using this service will fail.
00000100:00080000:17.0:1381479935.285351:0:4383:0:(import.c:1303:ptlrpc_import_recovery_state_machine()) evicted from scratch-OST0017_UUID@172.23.80.69@o2ib1; invalidating
172.23.80.69 = beaufixoss15-ib1
[root@beaufixmngt0 beaufix1046]# head -n1 ../beaufixoss15/lustre-logs-20131011-082458
00010000:00000010:14.1F:1381479186.063701:0:0:0:(ldlm_lock.c:198:ldlm_lock_free()) slab-freed 'lock': 560 at ffff880067e6c480.
[root@beaufixmngt0 beaufix1046]# tail -n 2 ../beaufixoss15/lustre-logs-20131011-082458 | head -n1
00002000:00000001:2.0:1381479898.458969:0:20716:0:(filter_io_26.c:325:filter_do_bio()) Process entered
So the OSS debug log covers the window from 1381479186 to 1381479898 (a one-liner to extract this window is sketched below).
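As a side note, the covered window can be extracted in one go, since the epoch timestamp is the fourth colon-separated field of every debug line (a small sketch, using the same OSS log path):
# Print the first and last timestamp (in seconds) found in a Lustre debug log:
awk -F: 'NR==1 {first=$4} {last=$4} END {printf "debug log covers %d to %d\n", int(first), int(last)}' ../beaufixoss15/lustre-logs-20131011-082458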
Here is the list of RPCs sent by the client during this window (5 RPCs in total):
[root@beaufixmngt0 beaufix1046]# grep -f requests lustre-logs-20131011-082535 | grep ptl_send_buf
00000100:00000200:22.0:1381479685.229939:0:4404:0:(niobuf.c:87:ptl_send_buf()) Sending 192 bytes to portal 28, xid 1448401857358474, offset 0
00000100:00000200:46.0:1381479697.155610:0:4424:0:(niobuf.c:87:ptl_send_buf()) Sending 296 bytes to portal 28, xid 1448401857358807, offset 0
00000100:00000200:6.0:1381479797.753246:0:4384:0:(niobuf.c:87:ptl_send_buf()) Sending 456 bytes to portal 6, xid 1448401857359681, offset 0
00000100:00000200:27.0:1381479835.230381:0:4411:0:(niobuf.c:87:ptl_send_buf()) Sending 192 bytes to portal 28, xid 1448401857359731, offset 0
00000100:00000200:6.0:1381479885.229892:0:4409:0:(niobuf.c:87:ptl_send_buf()) Sending 192 bytes to portal 28, xid 1448401857359816, offset 0
[root@beaufixmngt0 beaufix1046]# grep -f requests ../beaufixoss15/lustre-logs-20131011-082458 | grep incoming
00000100:00000040:0.0:1381479885.230810:0:10038:0:(events.c:284:request_in_callback()) incoming req@ffff88005b69f400 x1448401857359816 msgsize 192
-> Only 1 RPC out of the 5 sent by the client was received by the OSS (a scripted version of this cross-check is sketched below).
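To redo this cross-check on other client/OSS log pairs, here is a small shell sketch of the same comparison (file and path names are the ones from this example; adjust as needed):
# Note: the check above used "grep -f requests" to restrict the search to the
# 5 relevant xids; this version simply walks every RPC the client sent.
CLIENT_LOG=lustre-logs-20131011-082535
OSS_LOG=../beaufixoss15/lustre-logs-20131011-082458
grep ptl_send_buf "$CLIENT_LOG" | sed -n 's/.*xid \([0-9]*\),.*/\1/p' | \
while read xid; do
    if grep -q "incoming req.* x$xid " "$OSS_LOG"; then
        echo "xid $xid: received by OSS"
    else
        echo "xid $xid: NOT seen in OSS log"
    fi
done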
This example only shows what happens at the LNet level and above; so far we have seen no errors at the InfiniBand level.
This is a blocking problem, as it prevents the cluster from being used in production.
I will upload traces to help with the analysis.
Thanks for your help,
Sebastien.
Attachments
Activity
Resolution | New: Won't Fix [ 2 ]
Status | Original: Open [ 1 ] | New: Closed [ 6 ]
End date | New: 05/Jun/14
Start date | New: 16/Oct/13
Labels | Original: p4b
Attachment | Original: baioli_ls-al_.02_10_2014.log [ 14146 ]
Attachment | New: baioli_ls-al_.02_10_2014.log [ 14146 ]
Attachment | New: xfer_client_2_10_2014.log [ 14147 ]
Attachment | New: server_log_2_10_2014.log [ 14148 ]
Attachment | New: 2014-02-11-run6-client.out.gz [ 14149 ]
Attachment | New: LU-4112-reproducer-v2.tar.gz [ 13822 ]
Labels | New: p4b
Comment |
[ When the issue occurs, before the client is evicted, all the processes are stuck with the following stack:
[<ffffffffa055dd87>] cl_lock_mutex_get+0x77/0xe0 [obdclass]
[<ffffffffa05610e6>] cl_lock_hold_mutex+0x96/0x6b0 [obdclass]
[<ffffffffa056185e>] cl_lock_request+0x5e/0x1c0 [obdclass]
[<ffffffffa0566bc5>] cl_io_lock+0x3d5/0x570 [obdclass]
[<ffffffffa0566e1a>] cl_io_loop+0xba/0x190 [obdclass]
[<ffffffffa0a2c097>] ll_file_io_generic+0x3a7/0x560 [lustre]
[<ffffffffa0a2c389>] ll_file_aio_read+0x139/0x2c0 [lustre]
[<ffffffffa0a2c849>] ll_file_read+0x169/0x2a0 [lustre]
[<ffffffff811679b5>] vfs_read+0xb5/0x1a0
[<ffffffff81167af1>] sys_read+0x51/0x90
except for one process, which shows this stack instead:
[<ffffffffa055ccce>] cl_lock_state_signal+0x7e/0x160 [obdclass]
[<ffffffffa055ce34>] cl_lock_state_set+0x84/0x1d0 [obdclass]
[<ffffffffa055e5da>] cl_lock_intransit+0x5a/0x150 [obdclass]
[<ffffffffa055f53b>] cl_use_try+0x13b/0x300 [obdclass]
[<ffffffffa055f8a5>] cl_enqueue_try+0x1a5/0x300 [obdclass]
[<ffffffffa0560bfd>] cl_enqueue_locked+0x6d/0x200 [obdclass]
[<ffffffffa056187e>] cl_lock_request+0x7e/0x1c0 [obdclass]
[<ffffffffa0566bc5>] cl_io_lock+0x3d5/0x570 [obdclass]
[<ffffffffa0566e1a>] cl_io_loop+0xba/0x190 [obdclass]
[<ffffffffa09d2097>] ll_file_io_generic+0x3a7/0x560 [lustre]
[<ffffffffa09d2389>] ll_file_aio_read+0x139/0x2c0 [lustre]
[<ffffffffa09d2849>] ll_file_read+0x169/0x2a0 [lustre]
[<ffffffff811679b5>] vfs_read+0xb5/0x1a0
[<ffffffff81167af1>] sys_read+0x51/0x90
Hope this helps. ]
Attachment | New: LU-4112-reproducer.tar.gz [ 13693 ]
Attachment | New: lustre-logs.tar.bz2.aa [ 13637 ]
Attachment | New: lustre-logs.tar.bz2.ab [ 13638 ]