Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Severity: 3
Description
This issue was created by maloo for Andreas Dilger <adilger@whamcloud.com>
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/3941729f-3585-42e1-bd98-ffe10ef937c1
test_26 failed with the following error:
Timeout occurred after 136 minutes, last suite running was replay-dual
Test session details:
clients: https://build.whamcloud.com/job/lustre-reviews/105764 - 4.18.0-513.24.1.el8_9.x86_64
servers: https://build.whamcloud.com/job/lustre-reviews/105764 - 4.18.0-513.24.1.el8_lustre.x86_64
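For reference, a single replay-dual subtest can usually be re-run from a built Lustre source tree roughly as sketched below; the test node layout and configuration (e.g. cfg/local.sh) are site-specific assumptions and are not taken from this failure:

    # run only subtest 26 of replay-dual from lustre/tests
    # (assumes a configured Lustre test environment, e.g. cfg/local.sh)
    cd lustre/tests
    ONLY=26 sh replay-dual.sh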
In the MDS logs:
[Thu Jun 27 14:41:32 2024] Lustre: Failing over lustre-MDT0003
[Thu Jun 27 14:41:32 2024] LustreError: 63748:0:(client.c:1300:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff99eaa3893740 x1803023090591488/t0(0) o105->lustre-MDT0001@0@lo:15/16 lens 336/224 e 0 to 0 dl 0 ref 1 fl Rpc:QU/0/ffffffff rc 0/-1 job:'' uid:4294967295 gid:4294967295
[Thu Jun 27 14:41:33 2024] LustreError: 68222:0:(ldlm_resource.c:1128:ldlm_resource_complain()) mdt-lustre-MDT0003_UUID: namespace resource [0x2c0000400:0x81:0x0].0x0 (ffff99ea823e7840) refcount nonzero (1) after lock cleanup; forcing cleanup.
[Thu Jun 27 14:41:33 2024] LustreError: Forced cleanup waiting for mdt-lustre-MDT0003_UUID namespace with 2 resources in use, (rc=-110)
[Thu Jun 27 14:41:33 2024] Lustre: lustre-MDT0003: Not available for connect from 10.240.28.164@tcp (stopping)
:
:
CLIENT LOG
==========
[Thu Jun 27 14:42:27 2024] Lustre: 118988:0:(client.c:2361:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1719499292/real 1719499292] req@ffff9c97d3755040 x1803022797924608/t0(0) o36->lustre-MDT0003-mdc-ffff9c97e3d0c000@10.240.26.108@tcp:12/10 lens 488/512 e 0 to 1 dl 1719499347 ref 2 fl Rpc:XQr/200/ffffffff rc 0/-1 job:'tar.0' uid:0 gid:0
:
:
MDS LOG
=======
[Thu Jun 27 14:42:47 2024] Lustre: mdt00_000: service thread pid 10781 was inactive for 74.166 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[Thu Jun 27 14:42:47 2024] Pid: 10781, comm: mdt00_000 4.18.0-513.24.1.el8_lustre.x86_64 #1 SMP Tue Jun 25 03:59:58 UTC 2024
[Thu Jun 27 14:42:47 2024] Call Trace TBD:
[Thu Jun 27 14:42:47 2024] [<0>] ldlm_completion_ast+0x984/0xbd0 [ptlrpc]
[Thu Jun 27 14:42:47 2024] [<0>] ldlm_cli_enqueue_fini+0xa84/0xf80 [ptlrpc]
[Thu Jun 27 14:42:47 2024] [<0>] ldlm_cli_enqueue+0x607/0xa50 [ptlrpc]
[Thu Jun 27 14:42:47 2024] [<0>] osp_md_object_lock+0x1c2/0x2b0 [osp]
[Thu Jun 27 14:42:47 2024] [<0>] lod_object_lock+0x57d/0x800 [lod]
[Thu Jun 27 14:42:47 2024] [<0>] mdt_object_stripes_lock+0x35c/0x4f0 [mdt]
[Thu Jun 27 14:42:47 2024] [<0>] mdt_reint_setattr+0x45e/0x1670 [mdt]
[Thu Jun 27 14:42:47 2024] [<0>] mdt_reint_rec+0x123/0x270 [mdt]
[Thu Jun 27 14:42:47 2024] [<0>] mdt_reint_internal+0x4c6/0x820 [mdt]
[Thu Jun 27 14:42:47 2024] [<0>] mdt_reint+0x5d/0x110 [mdt]
[Thu Jun 27 14:42:47 2024] [<0>] tgt_request_handle+0x3f4/0x1a60 [ptlrpc]
[Thu Jun 27 14:42:47 2024] [<0>] ptlrpc_server_handle_request+0x3ca/0xbf0 [ptlrpc]
[Thu Jun 27 14:42:47 2024] [<0>] ptlrpc_main+0xc9e/0x15c0 [ptlrpc]
:
[Thu Jun 27 14:46:40 2024] LustreError: 10781:0:(ldlm_request.c:140:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1719499292, 308s ago), entering recovery for lustre-MDT0001_UUID@10.240.26.108@tcp ns: lustre-MDT0001-osp-MDT0003 lock: ffff99ea996f18c0/0xa41be56238432e27 lrc: 4/0,1 mode: --/PW res: [0x240000403:0x64:0x0].0x0 bits 0x12/0x0 rrc: 2 type: IBT gid 0 flags: 0x1000000000000 nid: local remote: 0xa41be56238432e2e expref: -99 pid: 10781 timeout: 0 lvb_type: 0
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
replay-dual test_26 - Timeout occurred after 136 minutes, last suite running was replay-dual