[LU-14863] a LustreError "ldlm_request.c:129:ldlm_expired_completion_wait()" in messages log Created: 19/Jul/21  Updated: 21/Jul/21

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: liziyan Assignee: Peter Jones
Resolution: Unresolved Votes: 0
Labels: None
Environment:

OS is CentOS 7.6, lustre servers version is 2.12.4, clients: 2.12.4


Epic/Theme: dne, ldiskfs, mgs
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

 The Lustre client cannot access the server and hangs. Restarting all MDTs resolved the issue temporarily; switching the MDT to the other MDS also resolved it.

Some MDS messages:

Jul 19 08:14:01 hwmds1 kernel: LustreError: 55147:0:(ldlm_lockd.c:681:ldlm_handle_ast_error()) ### client (nid 172.18.0.163@o2ib) returned error from blocking AST (req@ffff9d26f9ae9200 x1705516290544576 status -107 rc -107), evict it ns: mdt-sjtu-MDT0000_UUID lock: ffff9d35ecd87a80/0x338e75bad6f63e8b lrc: 4/0,0 mode: PR/PR res: [0x2000004e9:0x6bd4:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 172.18.0.163@o2ib remote: 0xd3df166cbd51cf50 expref: 632 pid: 55004 timeout: 147526 lvb_type: 0
Jul 19 08:14:01 hwmds1 kernel: LustreError: 55147:0:(ldlm_lockd.c:681:ldlm_handle_ast_error()) Skipped 65 previous similar messages
Jul 19 08:14:01 hwmds1 kernel: LustreError: 138-a: sjtu-MDT0000: A client on nid 172.18.0.163@o2ib was evicted due to a lock blocking callback time out: rc -107
Jul 19 08:14:01 hwmds1 kernel: LustreError: Skipped 65 previous similar messages
Jul 19 08:14:01 hwmds1 kernel: LustreError: 24226:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 0s: evicting client at 172.18.0.163@o2ib ns: mdt-sjtu-MDT0000_UUID lock: ffff9d35ecd87a80/0x338e75bad6f63e8b lrc: 3/0,0 mode: PR/PR res: [0x2000004e9:0x6bd4:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 172.18.0.163@o2ib remote: 0xd3df166cbd51cf50 expref: 630 pid: 55004 timeout: 0 lvb_type: 0
Jul 19 08:20:42 hwmds1 kernel: LustreError: 55060:0:(ldlm_lockd.c:681:ldlm_handle_ast_error()) ### client (nid 172.18.0.162@o2ib) failed to reply to blocking AST (req@ffff9d26faac7980 x1705516291328320 status 0 rc -110), evict it ns: mdt-sjtu-MDT0000_UUID lock: ffff9d0990fa98c0/0x338e75cfcbaa5a9b lrc: 4/0,0 mode: PR/PR res: [0x200011971:0x5c:0x0].0x0 bits 0x13/0x0 rrc: 4 type: IBT flags: 0x60200400000020 nid: 172.18.0.162@o2ib remote: 0x54f767b737123f3d expref: 840 pid: 55082 timeout: 147912 lvb_type: 0
Jul 19 08:20:42 hwmds1 kernel: LustreError: 138-a: sjtu-MDT0000: A client on nid 172.18.0.162@o2ib was evicted due to a lock blocking callback time out: rc -110
Jul 19 08:20:42 hwmds1 kernel: LustreError: 24226:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 115s: evicting client at 172.18.0.162@o2ib ns: mdt-sjtu-MDT0000_UUID lock: ffff9d0990fa98c0/0x338e75cfcbaa5a9b lrc: 3/0,0 mode: PR/PR res: [0x200011971:0x5c:0x0].0x0 bits 0x13/0x0 rrc: 4 type: IBT flags: 0x60200400000020 nid: 172.18.0.162@o2ib remote: 0x54f767b737123f3d expref: 841 pid: 55082 timeout: 0 lvb_type: 0
Jul 19 08:21:06 hwmds1 kernel: LustreError: 28297:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1626653766, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-sjtu-MDT0000_UUID lock: ffff9d349c3933c0/0x338e75cfec19dd8c lrc: 3/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2/0x0 rrc: 232 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 28297 timeout: 0 lvb_type: 0
Jul 19 08:25:15 hwmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1626654315.55113
Jul 19 08:25:42 hwmds1 kernel: LustreError: 55012:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1626654042, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-sjtu-MDT0000_UUID lock: ffff9d3459739680/0x338e75cfeea551c2 lrc: 3/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2/0x0 rrc: 239 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 55012 timeout: 0 lvb_type: 0
Jul 19 08:26:55 hwmds1 kernel: LustreError: 55113:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1626654115, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-sjtu-MDT0000_UUID lock: ffff9d09f0b3eac0/0x338e75cfef52e402 lrc: 3/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2/0x0 rrc: 240 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 55113 timeout: 0 lvb_type: 0
Jul 19 08:27:12 hwmds1 kernel: LustreError: 55138:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1626654132, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-sjtu-MDT0000_UUID lock: ffff9d38a630d680/0x338e75cfef80008f lrc: 3/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2/0x0 rrc: 240 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 55138 timeout: 0 lvb_type: 0
Jul 19 08:27:48 hwmds1 kernel: LustreError: 55131:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1626654168, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-sjtu-MDT0000_UUID lock: ffff9d33f4bec480/0x338e75cfefda3fe5 lrc: 3/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2/0x0 rrc: 240 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 55131 timeout: 0 lvb_type: 0
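To make the pattern in the messages above easier to see, here is a minimal Python sketch that groups LDLM log lines by function, lock mode, resource, and NID, and computes when a "lock timed out" wait expired. It is not a Lustre tool: the names `LINE_RE`, `ENQ_RE`, `summarize`, and `expiry_epoch` are hypothetical, and the regular expressions assume only the message layout visible in this ticket. The sample lines are copied from the log above with their tails trimmed after the `nid:` field.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Assumed layout, taken only from the messages in this ticket:
# "(file.c:LINE:function()) ... mode: X/Y res: [fid] ... nid: N"
LINE_RE = re.compile(
    r"\((?P<file>\w+\.c):(?P<line>\d+):(?P<func>\w+)\(\)\)"
    r".*?mode: (?P<mode>\S+) res: (?P<res>\[[^\]]+\])"
    r".*?nid: (?P<nid>\S+)"
)
ENQ_RE = re.compile(r"enqueued at (?P<at>\d+), (?P<ago>\d+)s ago")

def summarize(lines):
    """Count messages per (function, mode, resource, nid)."""
    counts = Counter()
    for text in lines:
        m = LINE_RE.search(text)
        if m:
            counts[(m["func"], m["mode"], m["res"], m["nid"])] += 1
    return counts

def expiry_epoch(text):
    """Return enqueue time plus waited seconds, or None if absent."""
    m = ENQ_RE.search(text)
    return int(m["at"]) + int(m["ago"]) if m else None

# Excerpts from the ticket, trimmed after the "nid:" field.
SAMPLE = [
    "LustreError: 28297:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) "
    "### lock timed out (enqueued at 1626653766, 300s ago); not entering "
    "recovery in server code, just going back to sleep ns: mdt-sjtu-MDT0000_UUID "
    "lock: ffff9d349c3933c0/0x338e75cfec19dd8c lrc: 3/0,1 mode: --/EX "
    "res: [0x200000004:0x1:0x0].0x0 bits 0x2/0x0 rrc: 232 type: IBT "
    "flags: 0x40210400000020 nid: local",
    "LustreError: 55012:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) "
    "### lock timed out (enqueued at 1626654042, 300s ago); not entering "
    "recovery in server code, just going back to sleep ns: mdt-sjtu-MDT0000_UUID "
    "lock: ffff9d3459739680/0x338e75cfeea551c2 lrc: 3/0,1 mode: --/EX "
    "res: [0x200000004:0x1:0x0].0x0 bits 0x2/0x0 rrc: 239 type: IBT "
    "flags: 0x40210400000020 nid: local",
]

if __name__ == "__main__":
    for key, n in summarize(SAMPLE).items():
        print(n, key)
    ts = expiry_epoch(SAMPLE[0])
    print(ts, datetime.fromtimestamp(ts, tz=timezone.utc).isoformat())
```

Run against the full log, a grouping like this would show that all five ldlm_expired_completion_wait() messages are local EX requests on the same resource, [0x200000004:0x1:0x0], with rrc growing from 232 to 240. The first sample's expiry works out to 00:21:06 UTC, which would match the 08:21:06 syslog stamp if the server clock is UTC+8 (an assumption; the ticket does not state the time zone).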



 Comments   
Comment by Peter Jones [ 19/Jul/21 ]

James

As you opened this ticket in the community LU project rather than the support CTCH project, does that mean that this is a general community issue rather than an issue reported by one of our mutual customers?

Peter

Comment by liziyan [ 21/Jul/21 ]

Hi Peter,

Yes, this is a community issue.

Thanks

Generated at Sat Feb 10 03:13:25 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.