[LU-16539] replay-dual test_32: Timeout occurred after x minutes, last suite running was replay-dual Created: 08/Feb/23  Updated: 14/Jun/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-15809 replay-dual test_29: timeout llog_ver... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for Timothy Day <timday@amazon.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/5ae4d92c-fc89-4e8c-bf83-4b09c9e7ead2

test_32 failed with the following error:

Timeout occurred after 132 minutes, last suite running was replay-dual

Test session details:
clients: https://build.whamcloud.com/job/lustre-reviews/92186 - 4.18.0-372.32.1.el8_6.x86_64
servers: https://build.whamcloud.com/job/lustre-reviews/92186 - 4.18.0-372.32.1.el8_lustre.x86_64

<<Please provide additional information about the failure here>>
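For anyone trying to reproduce this outside of autotest, a rough sketch of rerunning just this subtest with the standard test framework (the NAME= config below is an assumption and depends on the local setup; replay-dual needs two client mounts):

# Rerun only replay-dual test_32 against an already-configured test cluster.
# NAME= picks a config file from lustre/tests/cfg/; ncli.sh is the usual
# multi-client example, adjust for the local environment.
cd lustre/tests
NAME=ncli ONLY=32 bash replay-dual.sh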

Maloo auto-associated EX-6786 (this isn't community viewable, but I put it here for the benefit of those who can see it).

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
replay-dual test_32 - Timeout occurred after 132 minutes, last suite running was replay-dual



 Comments   
Comment by Serguei Smirnov [ 01/Mar/23 ]

+1 on master: https://testing.whamcloud.com/test_sets/c9a4a588-8f2d-4d3e-8129-33b0f89ad07a

Comment by Etienne Aujames [ 07/Jun/23 ]

+1 on master: https://testing.whamcloud.com/test_sets/40925114-446c-4b59-a1d2-001ad058c500

The 'ost' service appears to hang: an ll_ost thread gets stuck in ofd_destroy_by_fid() waiting on an extent lock that never completes:

[Mon Jun  5 11:32:14 2023] Lustre: DEBUG MARKER: == replay-dual test 32: gap in update llog shouldn't break recovery ========================================================== 11:32:14 (1685964734)
[Mon Jun  5 11:32:30 2023] Lustre: lustre-MDT0000-lwp-OST0000: Connection to lustre-MDT0000 (at 10.240.25.55@tcp) was lost; in progress operations using this service will wait for recovery to complete
[Mon Jun  5 11:32:30 2023] Lustre: Skipped 71 previous similar messages
[Mon Jun  5 11:32:35 2023] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-61vm4.onyx.whamcloud.com: executing set_default_debug -1 all 4
[Mon Jun  5 11:32:35 2023] LustreError: 166-1: MGC10.240.25.55@tcp: Connection to MGS (at 10.240.25.55@tcp) was lost; in progress operations using this service will fail
[Mon Jun  5 11:32:35 2023] LustreError: Skipped 4 previous similar messages

[Mon Jun  5 11:38:00 2023] LustreError: 94760:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1685964753, 328s ago); not entering recovery in server code, just going back to sleep ns: filter-lustre-OST0004_UUID lock: 000000002d1bbe55/0xec7df65557708c7c lrc: 3/0,1 mode: --/PW res: [0x400000404:0x5c4:0x0].0x0 rrc: 5 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) gid 0 flags: 0x40010080000000 nid: local remote: 0x0 expref: -99 pid: 94760 timeout: 0 lvb_type: 0
[Mon Jun  5 11:38:00 2023] LustreError: 93962:0:(ldlm_request.c:124:ldlm_expired_completion_wait()) Skipped 4 previous similar messages
[Mon Jun  5 11:39:05 2023] ptlrpc_watchdog_fire: 69 callbacks suppressed
[Mon Jun  5 11:39:05 2023] Lustre: ll_ost00_068: service thread pid 93962 was inactive for 393.001 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[Mon Jun  5 11:39:05 2023] Pid: 94760, comm: ll_ost00_074 4.18.0-425.10.1.el8_lustre.x86_64 #1 SMP Wed May 3 16:22:26 UTC 2023
[Mon Jun  5 11:39:05 2023] Call Trace TBD:
[Mon Jun  5 11:39:05 2023] Lustre: Skipped 2 previous similar messages
[Mon Jun  5 11:39:05 2023] [<0>] ldlm_completion_ast+0x57e/0x900 [ptlrpc]
[Mon Jun  5 11:39:05 2023] [<0>] ldlm_cli_enqueue_local+0x30b/0x860 [ptlrpc]
[Mon Jun  5 11:39:05 2023] [<0>] ofd_destroy_by_fid+0x239/0x4a0 [ofd]
[Mon Jun  5 11:39:05 2023] [<0>] ofd_destroy_hdl+0x263/0xa00 [ofd]
[Mon Jun  5 11:39:05 2023] [<0>] tgt_request_handle+0xcc3/0x1920 [ptlrpc]
[Mon Jun  5 11:39:05 2023] [<0>] ptlrpc_server_handle_request+0x31d/0xbc0 [ptlrpc]
[Mon Jun  5 11:39:05 2023] [<0>] ptlrpc_main+0xc52/0x1510 [ptlrpc]
[Mon Jun  5 11:39:05 2023] [<0>] kthread+0x10b/0x130
[Mon Jun  5 11:39:05 2023] [<0>] ret_from_fork+0x35/0x40

[Mon Jun  5 12:26:36 2023] Lustre: 'ost' is processing requests too slowly, client may timeout. Late by 13s, missed 0 early replies (reqs waiting=0 active=78, at_estimate=600, delay=0ms)
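If this reproduces again, a sketch of the data worth grabbing on the OSS while an ll_ost thread is blocked in ofd_destroy_by_fid() (the output file names are just examples, and the filter-* glob assumes the OST namespaces are named as in the log above):

# On the OSS, while the destroy handler is waiting on the extent lock:
lctl dk /tmp/oss-lustre-debug.log      # flush the Lustre debug buffer to a file
dmesg > /tmp/oss-dmesg.log             # keep the console messages quoted above
echo t > /proc/sysrq-trigger           # dump every task's stack, to spot a possible lock holder
lctl get_param ldlm.namespaces.filter-*.lock_count \
               ldlm.namespaces.filter-*.resource_count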
Comment by Andreas Dilger [ 13/Jun/23 ]

Failed 10x on master in the past week.

Generated at Sat Feb 10 03:27:53 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.