Description
The Lustre client cannot access the server and hangs. Restarting all MDTs resolved the issue temporarily; switching the MDT to the other MDS also resolved it.
Some messages from the MDS:
Jul 19 08:14:01 hwmds1 kernel: LustreError: 55147:0:(ldlm_lockd.c:681:ldlm_handle_ast_error()) ### client (nid 172.18.0.163@o2ib) returned error from blocking AST (req@ffff9d26f9ae9200 x1705516290544576 status -107 rc -107), evict it ns: mdt-sjtu-MDT0000_UUID lock: ffff9d35ecd87a80/0x338e75bad6f63e8b lrc: 4/0,0 mode: PR/PR res: [0x2000004e9:0x6bd4:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 172.18.0.163@o2ib remote: 0xd3df166cbd51cf50 expref: 632 pid: 55004 timeout: 147526 lvb_type: 0
Jul 19 08:14:01 hwmds1 kernel: LustreError: 55147:0:(ldlm_lockd.c:681:ldlm_handle_ast_error()) Skipped 65 previous similar messages
Jul 19 08:14:01 hwmds1 kernel: LustreError: 138-a: sjtu-MDT0000: A client on nid 172.18.0.163@o2ib was evicted due to a lock blocking callback time out: rc -107
Jul 19 08:14:01 hwmds1 kernel: LustreError: Skipped 65 previous similar messages
Jul 19 08:14:01 hwmds1 kernel: LustreError: 24226:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 0s: evicting client at 172.18.0.163@o2ib ns: mdt-sjtu-MDT0000_UUID lock: ffff9d35ecd87a80/0x338e75bad6f63e8b lrc: 3/0,0 mode: PR/PR res: [0x2000004e9:0x6bd4:0x0].0x0 bits 0x13/0x0 rrc: 3 type: IBT flags: 0x60200400000020 nid: 172.18.0.163@o2ib remote: 0xd3df166cbd51cf50 expref: 630 pid: 55004 timeout: 0 lvb_type: 0
Jul 19 08:20:42 hwmds1 kernel: LustreError: 55060:0:(ldlm_lockd.c:681:ldlm_handle_ast_error()) ### client (nid 172.18.0.162@o2ib) failed to reply to blocking AST (req@ffff9d26faac7980 x1705516291328320 status 0 rc -110), evict it ns: mdt-sjtu-MDT0000_UUID lock: ffff9d0990fa98c0/0x338e75cfcbaa5a9b lrc: 4/0,0 mode: PR/PR res: [0x200011971:0x5c:0x0].0x0 bits 0x13/0x0 rrc: 4 type: IBT flags: 0x60200400000020 nid: 172.18.0.162@o2ib remote: 0x54f767b737123f3d expref: 840 pid: 55082 timeout: 147912 lvb_type: 0
Jul 19 08:20:42 hwmds1 kernel: LustreError: 138-a: sjtu-MDT0000: A client on nid 172.18.0.162@o2ib was evicted due to a lock blocking callback time out: rc -110
Jul 19 08:20:42 hwmds1 kernel: LustreError: 24226:0:(ldlm_lockd.c:256:expired_lock_main()) ### lock callback timer expired after 115s: evicting client at 172.18.0.162@o2ib ns: mdt-sjtu-MDT0000_UUID lock: ffff9d0990fa98c0/0x338e75cfcbaa5a9b lrc: 3/0,0 mode: PR/PR res: [0x200011971:0x5c:0x0].0x0 bits 0x13/0x0 rrc: 4 type: IBT flags: 0x60200400000020 nid: 172.18.0.162@o2ib remote: 0x54f767b737123f3d expref: 841 pid: 55082 timeout: 0 lvb_type: 0
Jul 19 08:21:06 hwmds1 kernel: LustreError: 28297:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1626653766, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-sjtu-MDT0000_UUID lock: ffff9d349c3933c0/0x338e75cfec19dd8c lrc: 3/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2/0x0 rrc: 232 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 28297 timeout: 0 lvb_type: 0
Jul 19 08:25:15 hwmds1 kernel: LustreError: dumping log to /tmp/lustre-log.1626654315.55113
Jul 19 08:25:42 hwmds1 kernel: LustreError: 55012:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1626654042, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-sjtu-MDT0000_UUID lock: ffff9d3459739680/0x338e75cfeea551c2 lrc: 3/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2/0x0 rrc: 239 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 55012 timeout: 0 lvb_type: 0
Jul 19 08:26:55 hwmds1 kernel: LustreError: 55113:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1626654115, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-sjtu-MDT0000_UUID lock: ffff9d09f0b3eac0/0x338e75cfef52e402 lrc: 3/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2/0x0 rrc: 240 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 55113 timeout: 0 lvb_type: 0
Jul 19 08:27:12 hwmds1 kernel: LustreError: 55138:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1626654132, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-sjtu-MDT0000_UUID lock: ffff9d38a630d680/0x338e75cfef80008f lrc: 3/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2/0x0 rrc: 240 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 55138 timeout: 0 lvb_type: 0
Jul 19 08:27:48 hwmds1 kernel: LustreError: 55131:0:(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1626654168, 300s ago); not entering recovery in server code, just going back to sleep ns: mdt-sjtu-MDT0000_UUID lock: ffff9d33f4bec480/0x338e75cfefda3fe5 lrc: 3/0,1 mode: --/EX res: [0x200000004:0x1:0x0].0x0 bits 0x2/0x0 rrc: 240 type: IBT flags: 0x40210400000020 nid: local remote: 0x0 expref: -99 pid: 55131 timeout: 0 lvb_type: 0
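To make the pattern in the messages above easier to see, here is a minimal Python sketch that summarizes an MDS syslog excerpt: which client NIDs were evicted and with what return code, and which lock resources server threads were still waiting on. The function name `summarize`, the two regular expressions, and the small errno table are assumptions made for illustration and are not part of any Lustre tool; the errno names are the standard Linux meanings of 107 and 110. The sample lines are copied from the excerpt, with the long lock description trimmed after the `res:` field.

```python
import re
from collections import Counter

# Linux errno names for the return codes seen in the excerpt
# (the log prints them negated, e.g. "rc -107").
ERRNO_NAMES = {107: "ENOTCONN", 110: "ETIMEDOUT"}

# Hypothetical patterns written against the message text in this ticket.
EVICT_RE = re.compile(r"A client on nid (\S+) was evicted due to .*: rc (-?\d+)")
WAIT_RE = re.compile(
    r"lock timed out \(enqueued at (\d+), (\d+)s ago\).*?res: (\[[^\]]+\])"
)


def summarize(lines):
    """Return (evictions, stalled) for an iterable of syslog lines.

    evictions: list of (nid, rc, errno_name) in log order
    stalled:   Counter mapping lock resource -> number of timed-out waits
    """
    evictions = []
    stalled = Counter()
    for line in lines:
        m = EVICT_RE.search(line)
        if m:
            rc = int(m.group(2))
            evictions.append((m.group(1), rc, ERRNO_NAMES.get(-rc, "unknown")))
            continue
        m = WAIT_RE.search(line)
        if m:
            stalled[m.group(3)] += 1
    return evictions, stalled


# Sample input: lines taken from the excerpt above (third line trimmed).
SAMPLE = [
    "Jul 19 08:14:01 hwmds1 kernel: LustreError: 138-a: sjtu-MDT0000: "
    "A client on nid 172.18.0.163@o2ib was evicted due to a lock blocking "
    "callback time out: rc -107",
    "Jul 19 08:20:42 hwmds1 kernel: LustreError: 138-a: sjtu-MDT0000: "
    "A client on nid 172.18.0.162@o2ib was evicted due to a lock blocking "
    "callback time out: rc -110",
    "Jul 19 08:21:06 hwmds1 kernel: LustreError: 28297:0:"
    "(ldlm_request.c:129:ldlm_expired_completion_wait()) ### lock timed out "
    "(enqueued at 1626653766, 300s ago); not entering recovery in server code, "
    "just going back to sleep ns: mdt-sjtu-MDT0000_UUID lock: "
    "ffff9d349c3933c0/0x338e75cfec19dd8c lrc: 3/0,1 mode: --/EX "
    "res: [0x200000004:0x1:0x0].0x0",
]

if __name__ == "__main__":
    evictions, stalled = summarize(SAMPLE)
    for nid, rc, name in evictions:
        print(f"evicted {nid}: rc {rc} ({name})")
    for res, count in stalled.items():
        print(f"{count} timed-out wait(s) on {res}")
```

Run against the full excerpt, a summary like this shows that all five `ldlm_expired_completion_wait()` messages refer to the same resource, `[0x200000004:0x1:0x0]`, each requesting an EX lock after the two client evictions. That is only an observation about the log text, not a diagnosis.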