[LU-1496] Client evicted frequently due to lock callback timer expiration Created: 08/Jun/12  Updated: 08/Oct/13  Resolved: 27/Aug/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 1.8.x (1.8.0 - 1.8.5)
Fix Version/s: Lustre 1.8.9

Type: Bug Priority: Minor
Reporter: Mitsuhiro Nishizawa Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 6385

 Description   

Our customer is seeing client evictions due to lock callback timer expiration relatively frequently.
The client is not always the same, but it occurred three times on Jun 3rd. As far as the customer has
checked the network, no errors are reported.

<< OSS >>
2012/06/03 17:32:06 kern.err@oss4 kernel[-]: [261289.824573] LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 172.17.11.9@o2ib ns: filter-data-OST0007_UUID lock: ffff8100debd3000/0x2babc16b9ceef3a2 lrc: 3/0,0 mode: PW/PW res: 345385/0 rrc: 8 type: EXT [32768->159743] (req 32768->36863) flags: 0x20 remote: 0xd4d1e1a63a1900b8 expref: 14 pid: 2520 timeout 4556489496
2012/06/03 17:32:08 kern.err@oss4 kernel[-]: [261292.335250] LustreError: 26916:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (107) req@ffff81049eabb000 x1403470492409861/t0 o13><?>@<?>:0/0 lens 192/0 e 0 to 0 dl 1338712334 ref 1 fl Interpret:/0/0 rc -107/0
2012/06/03 17:32:22 kern.err@oss4 kernel[-]: [261306.299203] LustreError: 2435:0:(ldlm_lib.c:1914:target_send_reply_msg()) @@@ processing error (114) req@ffff8102ac2e3c00 x1403470492409878/t0 o8><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1338712442 ref 1 fl Interpret:/0/0 rc -114/0

<< client >>
2012/06/03 17:32:08 kern.err@cnode009 kernel[-]: kernel: LustreError: 11-0: an error occurred while communicating with 172.17.13.36@o2ib. The ost_statfs operation failed with -107
2012/06/03 17:32:08 kern.warning@cnode009 kernel[-]: kernel: Lustre: data-OST0007-osc-ffff81063bc60800: Connection to service data-OST0007 via nid 172.17.13.36@o2ib was lost; in progress operations using this service will wait for recovery to complete.
2012/06/03 17:32:14 kern.warning@cnode009 kernel[-]: kernel: Lustre: 3960:0:(client.c:1482:ptlrpc_expire_one_request()) @@@ Request x1403470492409862 sent from data-OST0007-osc-ffff81063bc60800 to NID 172.17.13.36@o2ib 6s ago has timed out (6s prior to deadline).
2012/06/03 17:32:14 kern.warning@cnode009 kernel[-]: kernel: req@ffff810256624800 x1403470492409862/t0 o8->data-OST0007_UUID@172.17.13.36@o2ib:28/4 lens 368/584 e 0 to 1 dl 1338712334 ref 1 fl Rpc:N/0/0 rc 0/0



 Comments   
Comment by Mitsuhiro Nishizawa [ 08/Jun/12 ]

Uploaded log file to the FTP server: /uploads/LU-1496/20120604.tar.gz

Thanks,

Comment by Peter Jones [ 08/Jun/12 ]

Bobijam

Could you please advise on this one?

Thanks

Peter

Comment by Zhenyu Xu [ 10/Jun/12 ]

From the uploaded logs, it looks like the OST0007 node had some congestion that caused the client evictions. Can you check it (disk, network interface, etc.) and upload Lustre debug logs from it here?
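For reference, one way to capture that information is sketched below (a hedged example rather than a prescription: lctl's dk/debug_kernel subcommand is standard, but the debug mask value and output path here are only illustrative and should be adapted to the site):

$ lctl set_param debug=+dlmtrace     # example: widen the debug mask to include DLM tracing
$ lctl dk /tmp/debug_kernel_oss4     # dump and clear the kernel debug buffer to a file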

Comment by Mitsuhiro Nishizawa [ 14/Jun/12 ]

We asked the customer to check the network and to capture a debug log the next time an eviction occurs.
We will upload it once we receive it. Thanks,

Comment by Mitsuhiro Nishizawa [ 19/Jun/12 ]

The issue occurred again and we captured a debug log. We uploaded it to the FTP server: /uploads/LU-1496/LU-1496_20120620.tar.gz
$ tar tzf LU-1496_20120620.tar.gz
debug_kernel_20120619_235205 ... debug log on the OSS
debug_kernel_cnode050 ... debug log on the evicted client
log-20120620/
log-20120620/cnode050_kernel.txt ... messages on the evicted client
log-20120620/oss4_kernel.txt ... messages on the OSS

2012/06/19 23:52:05 kern.err@oss4 kernel[-]: [1662880.375982] LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 172.17.11.50@o2ib ns: filter-data-OST0007_UUID lock: ffff8101a5e2fa00/0x2babc16ba243f737 lrc: 3/0,0 mode: PW/PW res: 430494/0 rrc: 8 type: EXT [73728->159743] (req 73728->77823) flags: 0x20 remote: 0x225685b920b04720 expref: 8 pid: 2438 timeout 5961707167

Comment by Zhenyu Xu [ 20/Jun/12 ]

OSS4 is heavily loaded, and many clients are competing to write to the same files. The write request from cnode050 to OST0007 (on OSS4) was delayed too long because of the heavy load, and the client was evicted.

OSS4's heavy load

2012/06/19 13:36:27 kern.warning@oss4 kernel[-]: [1626037.218357] Lustre: data-OST0007: slow quota init 150s due to heavy IO load
2012/06/19 13:36:27 kern.warning@oss4 kernel[-]: [1626037.218364] Lustre: Skipped 4 previous similar messages
2012/06/19 13:36:27 kern.warning@oss4 kernel[-]: [1626037.218535] Lustre: data-OST0007: slow quota init 150s due to heavy IO load
2012/06/19 13:36:27 kern.warning@oss4 kernel[-]: [1626037.218541] Lustre: Skipped 1 previous similar message
2012/06/19 13:36:27 kern.warning@oss4 kernel[-]: [1626037.224625] Lustre: data-OST0007: slow direct_io 166s due to heavy IO load
2012/06/19 21:42:41 kern.warning@oss4 kernel[-]: [1655136.964476] Lustre: data-OST0006: slow quota init 101s due to heavy IO load
2012/06/19 21:42:41 kern.warning@oss4 kernel[-]: [1655136.964479] Lustre: data-OST0006: slow quota init 80s due to heavy IO load
2012/06/19 21:42:41 kern.warning@oss4 kernel[-]: [1655136.964485] Lustre: Skipped 48 previous similar messages
2012/06/19 21:42:41 kern.warning@oss4 kernel[-]: [1655136.964489] Lustre: Skipped 48 previous similar messages
2012/06/19 21:42:41 kern.warning@oss4 kernel[-]: [1655136.965584] Lustre: data-OST0006: slow direct_io 131s due to heavy IO load
2012/06/19 21:42:41 kern.info@oss4 kernel[-]: [1655136.966417] Lustre: data-OST0006: slow quota init 31s due to heavy IO load
2012/06/19 21:42:41 kern.info@oss4 kernel[-]: [1655136.966421] Lustre: Skipped 7 previous similar messages

cnode050's write request was evicted because its lock timed out

00000100:00000200:23:1340117424.398493:0:4625:0:(service.c:1046:ptlrpc_hpreq_reorder_nolock()) @@@ high priority req req@ffff8101b6316800 x1403470546287529/t0 o4->2208c67a-8f53-db44-9cbd-e65dba8f887c@NET_0x50000ac110b32_UUID:0/0 lens 448/0 e 0 to 0 dl 1340117467 ref 1 fl New:H/0/0 rc 0/0
00000100:00100000:23:1340117424.398534:0:4625:0:(service.c:1369:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_io_171:2208c67a-8f53-db44-9cbd-e65dba8f887c+9:3937:x1403470546287529:12345-172.17.11.50@o2ib:4

The request was received at 1340117424 and handling started.

00000100:00001000:7:1340117462.705473:0:2587:0:(service.c:762:ptlrpc_at_send_early_reply()) @@@ sending early reply (deadline +5s, margin 63s) for 68+30 req@ffff8101b6316800 x1403470546287529/t0 o4>2208c67a-8f53-db44-9cbd-e65dba8f887c@NET_0x50000ac110b32_UUID:0/0 lens 448/416 e 0 to 0 dl 1340117467 ref 2 fl Interpret:H/0/0 rc 0/0
00000100:00001000:7:1340117462.705480:0:2587:0:(import.c:1468:at_measured()) add 68 to ffff8102ee7ab860 time=60 v=68 (68 1 31 1)

38 seconds later, the server sent an early reply asking the client to give the OST an extra 68 seconds.

00010000:00020000:23:1340117525.229707:0:0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 101s: evicting client at 172.17.11.50@o2ib ns: filter-data-OST0007_UUID lock: ffff8101a5e2fa00/0x2babc16ba243f737 lrc: 3/0,0 mode: PW/PW res: 430494/0 rrc: 8 type: EXT [73728->159743] (req 73728->77823) flags: 0x20 remote: 0x225685b920b04720 expref: 8 pid: 2438 timeout 5961707167

The extent lock protecting the write then timed out, and the client was evicted.
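The timestamps above line up with the reported 101s (simple arithmetic from the log excerpts; that the callback timer started roughly when the conflicting request arrived is an assumption suggested by the matching numbers):

1340117462 - 1340117424 = 38s   ... early reply sent 38 seconds after the request was received
1340117525 - 1340117424 = 101s  ... waiting_locks_callback() fired, matching "timer expired after 101s"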

Comment by Zhenyu Xu [ 20/Jun/12 ]

I think LU-918 and LU-874 addressed this issue.

Comment by Mitsuhiro Nishizawa [ 20/Jun/12 ]

The customer's Lustre version is 1.8.4. Is the issue in LU-874 applicable to 1.8.4?
Can the patch released there be applied?

Comment by Zhenyu Xu [ 20/Jun/12 ]

Not directly; we need to port the patch to 1.8.x.

Comment by Zhenyu Xu [ 21/Jun/12 ]

The ported patch is tracked at http://review.whamcloud.com/3157

Comment by Mitsuhiro Nishizawa [ 21/Jun/12 ]

Thanks, Zhenyu. We will discuss whether we can apply the patch.

Today, the customer encountered another instance of the issue.
2012/06/21 01:04:29 kern.err@oss2 kernel[-]: [1753382.589040] LustreError: 0:0:(ldlm_lockd.c:305:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 172.17.11.29@o2ib ns: filter-data-OST0003_UUID lock: ffff8101127e8c00/0xb2893ade320a8743 lrc: 3/0,0 mode: PW/PW res: 431401/0 rrc: 8 type: EXT [86016->155647] (req 86016->90111) flags: 0x20 remote: 0xa4f5ddef5f4b10d8 expref: 10 pid: 2462 timeout 6052442992

We uploaded the debug log to the FTP server: /uploads/LU-1496/LU-1496_20120621.tar.gz
debug_kernel_20120621_010429.gz ... oss2 debug log
debug_kernel_cnode029.gz ... client debug log
log-20120621/cnode029_kernel.txt ... client messages
log-20120621/oss2_kernel.txt ... oss2 messages

It looks like this is again due to high I/O load on the OSS.
Would you please check if this is the same cause?

Comment by Zhenyu Xu [ 06/Jul/12 ]

Sorry for the late response. Yes, I think the phenomenon is the same.

Comment by Mitsuhiro Nishizawa [ 09/Jul/12 ]

Thanks for checking the log. The customer is currently considering upgrading to a Lustre version with the patch applied.
Please close this ticket. Thank you.

Comment by Peter Jones [ 27/Aug/12 ]

Landed to b1_8 and already landed to master under LU-874

Comment by Melinda McDade [ 08/Oct/13 ]

I am running into this problem while running Lustre 1.8.8. Is there a simple patch, or do I have to upgrade to 1.8.9? I have already increased osc.*.max_rpcs_in_flight to 32, turned off checksums, and set osc.*.max_dirty_mb=512 (see the lctl sketch after the log excerpt below); this helped with some evictions. However, I tend to think that timeout=100 still needs to be increased even with adaptive timeouts. Notice the 100s timeout below even though we have at_min=0 and at_max=800; it is an interesting expiration time given that timeout=100. It could also be related to an actual network socket timeout. Please help.

Melinda

LustreError: 0:0:(ldlm_lockd.c:313:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 34.239.17.202@tcp ns: filter-dlustre-OST0002_UUID lock: ffff8102636cde00/0x392a313d071a080e lrc: 3/0,0 mode: PW/PW res: 961326/0 rrc: 2 type: EXT [0->18446744073709551615] (req 71299072->71303167) flags: 0x20 remote: 0xa1584afedda7b6c expref: 197 pid: 5369 timeout 10277973712
LustreError: 0:0:(ldlm_lockd.c:313:waiting_locks_callback()) Skipped 39 previous similar messages
LustreError: 4057:0:(ldlm_lib.c:1919:target_send_reply_msg())
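For reference, a minimal sketch of how the tunables mentioned above are typically set with lctl on 1.8.x (the parameter paths are assumptions and should be verified against the installed version; lctl set_param changes do not persist across remounts):

$ lctl set_param osc.*.max_rpcs_in_flight=32   # on the clients
$ lctl set_param osc.*.max_dirty_mb=512        # on the clients
$ lctl set_param osc.*.checksums=0             # disable client-side data checksums
$ lctl set_param at_min=0 at_max=800           # adaptive timeout bounds
$ lctl set_param timeout=100                   # obd_timeout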
