[LU-16539] replay-dual test_32: Timeout occurred after x minutes, last suite running was replay-dual Created: 08/Feb/23 Updated: 14/Jun/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Timothy Day <timday@amazon.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/5ae4d92c-fc89-4e8c-bf83-4b09c9e7ead2

test_32 failed with the following error:

Timeout occurred after 132 minutes, last suite running was replay-dual

Test session details:

<<Please provide additional information about the failure here>>

Maloo auto-associated EX-6786 (this isn't community-viewable, but I put it here for the benefit of those that can see it).

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV |
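|
For anyone trying to reproduce this, the failing test can be run in isolation with the standard Lustre test framework. A minimal sketch, assuming an in-tree build checked out as lustre-release and an already-configured test cluster (both assumptions; adjust the path and configuration to your setup):

  # Run only test_32 of the replay-dual suite; ONLY= is the test
  # framework's standard filter for selecting individual tests.
  cd lustre-release/lustre/tests
  ONLY=32 bash replay-dual.sh

Note that the autotest run above only hit the timeout after 132 minutes, so a local run may need a similarly generous timeout before the hang is visible. |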
| Comments |
| Comment by Serguei Smirnov [ 01/Mar/23 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/c9a4a588-8f2d-4d3e-8129-33b0f89ad07a |
| Comment by Etienne Aujames [ 07/Jun/23 ] |
|
+1 on master: https://testing.whamcloud.com/test_sets/40925114-446c-4b59-a1d2-001ad058c500

The 'ost' service seems to hang:

[Mon Jun 5 11:32:14 2023] Lustre: DEBUG MARKER: == replay-dual test 32: gap in update llog shouldn't break recovery ========================================================== 11:32:14 (1685964734)
[Mon Jun 5 11:32:30 2023] Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.25.55@tcp) was lost; in progress operations using this service will wait for recovery to complete
[Mon Jun 5 11:32:30 2023] Lustre: Skipped 71 previous similar messages
[Mon Jun 5 11:32:35 2023] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-61vm4.onyx.whamcloud.com: executing set_default_debug -1 all 4
[Mon Jun 5 11:32:35 2023] LustreError: 166-1: MGC10.240.25.55@tcp: Connection to MGS (at 10.240.25.55@tcp) was lost; in progress operations using this service will fail
[Mon Jun 5 11:32:35 2023] LustreError: Skipped 4 previous similar messages
[Mon Jun 5 11:38:00 2023] LustreError: 94760:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1685964753, 328s ago); not entering recovery in server code, just going back to sleep ns: filter-lustre-OST0004_UUID lock: 000000002d1bbe55/0xec7df65557708c7c lrc: 3/0,1 mode: --/PW res: [0x400000404:0x5c4:0x0].0x0 rrc: 5 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) gid 0 flags: 0x40010080000000 nid: local remote: 0x0 expref: -99 pid: 94760 timeout: 0 lvb_type: 0
[Mon Jun 5 11:38:00 2023] LustreError: 93962:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) Skipped 4 previous similar messages
[Mon Jun 5 11:39:05 2023] ptlrpc_watchdog_fire: 69 callbacks suppressed
[Mon Jun 5 11:39:05 2023] Lustre: ll_ost00_068: service thread pid 93962 was inactive for 393.001 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[Mon Jun 5 11:39:05 2023] Pid: 94760, comm: ll_ost00_074 4.18.0-425.10.1.el8_lustre.x86_64 #1 SMP Wed May 3 16:22:26 UTC 2023
[Mon Jun 5 11:39:05 2023] Call Trace TBD:
[Mon Jun 5 11:39:05 2023] Lustre: Skipped 2 previous similar messages
[Mon Jun 5 11:39:05 2023] [<0>] ldlm_completion_ast+0x57e/0x900 [ptlrpc]
[Mon Jun 5 11:39:05 2023] [<0>] ldlm_cli_enqueue_local+0x30b/0x860 [ptlrpc]
[Mon Jun 5 11:39:05 2023] [<0>] ofd_destroy_by_fid+0x239/0x4a0 [ofd]
[Mon Jun 5 11:39:05 2023] [<0>] ofd_destroy_hdl+0x263/0xa00 [ofd]
[Mon Jun 5 11:39:05 2023] [<0>] tgt_request_handle+0xcc3/0x1920 [ptlrpc]
[Mon Jun 5 11:39:05 2023] [<0>] ptlrpc_server_handle_request+0x31d/0xbc0 [ptlrpc]
[Mon Jun 5 11:39:05 2023] [<0>] ptlrpc_main+0xc52/0x1510 [ptlrpc]
[Mon Jun 5 11:39:05 2023] [<0>] kthread+0x10b/0x130
[Mon Jun 5 11:39:05 2023] [<0>] ret_from_fork+0x35/0x40
[Mon Jun 5 12:26:36 2023] Lustre: 'ost' is processing requests too slowly, client may timeout. Late by 13s, missed 0 early replies (reqs waiting=0 active=78, at_estimate=600, delay=0ms) |
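|
The trace above shows ll_ost00_074 blocked in ldlm_completion_ast under ofd_destroy_by_fid, i.e. the OST object-destroy path is stuck waiting for a PW extent lock (mode --/PW in the lock dump) that is never granted while recovery is in progress. If this reproduces on a live node, a minimal triage sketch using standard lctl/procfs interfaces (the pid is the one from this trace and is otherwise just an example):

  # Stack of the service thread the watchdog flagged as inactive
  cat /proc/94760/stack
  # Per-namespace DLM lock counts on the OSS (filter-* namespaces)
  lctl get_param ldlm.namespaces.filter-*.lock_count
  # Dump the Lustre debug buffer covering the hang window
  lctl dk /tmp/lustre-debug.log |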
| Comment by Andreas Dilger [ 13/Jun/23 ] |
|
Failed 10x on master in the past week. |