Details
- Type: Bug
- Resolution: Won't Fix
- Priority: Blocker
- Fix Version/s: None
- Affects Version/s: Lustre 2.1.6
- Labels: None
- Severity: 3
- Rank: 11068
Description
Hi,
At Meteo France, Lustre clients are randomly evicted with the following message:
1381739209 2013 Oct 14 08:26:49 beaufixoss9 kern err kernel LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 172.23.67.157@o2ib ns: filter-scratch-OST0024_UUID lock: ffff880215b3c6c0/0xe3c417cfbe584025 lrc: 3/0,0 mode: PW/PW res: 87670641/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x20 remote: 0x964c642755ae9fd expref: 4 pid: 20133 timeout 4581190301
Some attempts were made to increase ldlm_timeout, but even with a timeout of 300 seconds the timer still expires and the client is evicted.
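(For reference, two usual ways to adjust this tunable -- I believe both apply to 2.1, though the exact commands we ran may have differed slightly; the value of 300 seconds is the one mentioned above:)
  echo 300 > /proc/sys/lustre/ldlm_timeout       (on each OSS; takes effect immediately, not persistent)
  lctl conf_param scratch.sys.ldlm_timeout=300   (on the MGS; persistent across remounts)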
After collecting some debug logs, it seems there is a communication problem: the client sends the lock callback RPC, but the OSS never receives it, so the lock callback timer eventually expires.
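(In case it helps, the debug logs were gathered with the usual lctl mechanism, roughly along these lines -- the exact debug mask we enabled may have been different:)
  lctl set_param debug=+rpctrace                 (add RPC tracing to the debug mask)
  lctl set_param debug=+dlmtrace                 (add DLM lock tracing)
  lctl dk /tmp/lustre-logs-YYYYMMDD-HHMMSS       (dump the kernel debug buffer to a file)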
Here is an example:
[root@beaufixmngt0 beaufix1046]# grep evict lustre-logs-20131011-082535
00000100:00080000:17.0:1381479935.274633:0:4383:0:(import.c:915:ptlrpc_connect_interpret()) @@@ scratch-OST0017-osc-ffff880454e9b400: evicting (reconnect/recover flags not set: 4) req@ffff88087945e000 x1448401857359985/t0(0) o8->scratch-OST0017-osc-ffff880454e9b400@172.23.80.69@o2ib1:28/4 lens 368/264 e 0 to 0 dl 1381479946 ref 1 fl Interpret:RN/0/0 rc 0/0
00000100:02020000:17.0:1381479935.274641:0:4383:0:(import.c:1299:ptlrpc_import_recovery_state_machine()) 167-0: This client was evicted by scratch-OST0017; in progress operations using this service will fail.
00000100:00080000:17.0:1381479935.285351:0:4383:0:(import.c:1303:ptlrpc_import_recovery_state_machine()) evicted from scratch-OST0017_UUID@172.23.80.69@o2ib1; invalidating
172.23.80.69 = beaufixoss15-ib1
[root@beaufixmngt0 beaufix1046]# head -n1 ../beaufixoss15/lustre-logs-20131011-082458
00010000:00000010:14.1F:1381479186.063701:0:0:0:(ldlm_lock.c:198:ldlm_lock_free()) slab-freed 'lock': 560 at ffff880067e6c480.
[root@beaufixmngt0 beaufix1046]# tail -n 2 ../beaufixoss15/lustre-logs-20131011-082458 | head -n1
00002000:00000001:2.0:1381479898.458969:0:20716:0:(filter_io_26.c:325:filter_do_bio()) Process entered
The OSS debug log therefore covers the window from 1381479186 to 1381479898.
Here is the list of RPCs sent during this time (total of 5 RPCs):
[root@beaufixmngt0 beaufix1046]# grep -f requests lustre-logs-20131011-082535 | grep ptl_send_buf
00000100:00000200:22.0:1381479685.229939:0:4404:0:(niobuf.c:87:ptl_send_buf()) Sending 192 bytes to portal 28, xid 1448401857358474, offset 0
00000100:00000200:46.0:1381479697.155610:0:4424:0:(niobuf.c:87:ptl_send_buf()) Sending 296 bytes to portal 28, xid 1448401857358807, offset 0
00000100:00000200:6.0:1381479797.753246:0:4384:0:(niobuf.c:87:ptl_send_buf()) Sending 456 bytes to portal 6, xid 1448401857359681, offset 0
00000100:00000200:27.0:1381479835.230381:0:4411:0:(niobuf.c:87:ptl_send_buf()) Sending 192 bytes to portal 28, xid 1448401857359731, offset 0
00000100:00000200:6.0:1381479885.229892:0:4409:0:(niobuf.c:87:ptl_send_buf()) Sending 192 bytes to portal 28, xid 1448401857359816, offset 0
[root@beaufixmngt0 beaufix1046]# grep -f requests ../beaufixoss15/lustre-logs-20131011-082458 | grep incoming
00000100:00000040:0.0:1381479885.230810:0:10038:0:(events.c:284:request_in_callback()) incoming req@ffff88005b69f400 x1448401857359816 msgsize 192
-> Only 1 of the 5 RPCs sent by the client was received by the OSS.
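(For completeness: the "requests" file used with grep -f above is just a list of the xids of interest. The exact way we selected those 5 xids is not shown here, but extracting xids from a client debug log can be done along these lines:)
  grep ptl_send_buf lustre-logs-20131011-082535 | sed -n 's/.*xid \([0-9]*\),.*/\1/p' | sort -u > requests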
This example only shows what happens at the LNet level and above; so far we do not see any errors at the InfiniBand level.
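(For the record, typical checks at the lower layers -- assuming infiniband-diags is available on the nodes -- would be, for example:)
  lctl ping 172.23.80.69@o2ib1   (LNet-level connectivity check from the client)
  perfquery                      (error/traffic counters of the local HCA port)
  ibqueryerrors                  (fabric-wide scan for ports with error counters above threshold)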
This is a blocking problem, as it prevents the cluster from being used in production.
I will upload traces to help analysis.
Thanks for your help,
Sebastien.