[LU-15737] recovery-small: ll_ost00 - service thread hangs. Created: 12/Apr/22  Updated: 28/Jun/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Maloo
Assignee: WC Triage
Resolution: Unresolved
Votes: 0
Labels: None

Issue Links:
Related
is related to LU-10632 recovery-small test 26a fails with ‘c... Resolved
Severity: 3

Description

This issue was created by maloo for Cliff White <cwhite@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/4784602c-4d77-42e2-919b-a194a0137d91

The test fails because the client times out waiting for lustre-OST0000-osc to reach the FULL state.
This appears to be caused by an ll_ost00 service thread hanging on one node:

[ 1927.612591] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == recovery-small test 26a: evict dead exports =========== 09:10:53 \(1649063453\)
[ 1928.007926] Lustre: DEBUG MARKER: == recovery-small test 26a: evict dead exports =========== 09:10:53 (1649063453)
[ 1974.360682] Lustre: ll_ost00_004: service thread pid 11008 was inactive for 43.187 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 1974.364718] Pid: 11008, comm: ll_ost00_004 4.18.0-348.2.1.el8_lustre.x86_64 #1 SMP Sun Apr 3 16:16:31 UTC 2022
[ 1974.366773] Call Trace TBD:
[ 1974.367500] [<0>] ldlm_completion_ast+0x7ac/0x900 [ptlrpc]
[ 1974.368739] [<0>] ldlm_cli_enqueue_local+0x307/0x860 [ptlrpc]
[ 1974.369924] [<0>] ofd_destroy_by_fid+0x235/0x4a0 [ofd]
[ 1974.370992] [<0>] ofd_destroy_hdl+0x263/0xa10 [ofd]
[ 1974.372045] [<0>] tgt_request_handle+0xc93/0x1a40 [ptlrpc]
[ 1974.373224] [<0>] ptlrpc_server_handle_request+0x323/0xbd0 [ptlrpc]
[ 1974.374523] [<0>] ptlrpc_main+0xc06/0x1560 [ptlrpc]
[ 1974.375548] [<0>] kthread+0x116/0x130
[ 1974.376336] [<0>] ret_from_fork+0x35/0x40
[ 1974.664781] Lustre: lustre-OST0005: haven't heard from client 0b624cdc-fcec-4fec-b859-486e2bb9b84b (at 10.240.40.108@tcp) in 47 seconds. I think it's dead, and I am evicting it. exp 000000003f573f19, cur 1649063501 expire 1649063471 last 1649063454
[ 2020.832883] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  recovery-small test_26a: @@@@@@ FAIL: lustre-OST0000-osc-ffff8f8645de7800 state is not FULL 
[ 2021.200277] Lustre: DEBUG MARKER: recovery-small test_26a: @@@@@@ FAIL: lustre-OST0000-osc-ffff8f8645de7800 state is not FULL
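
When triaging similar runs, it can help to pull the hung-thread and eviction messages out of the console log automatically. The following is a minimal sketch in Python, not part of the Lustre test framework. The regular expressions are assumptions derived only from the two message formats quoted above, and the names scan_console, HUNG_RE, EVICT_RE and SAMPLE are hypothetical. Other Lustre versions may word these messages differently.

```python
import re

# Assumed format of the watchdog message, taken from the log excerpt above.
HUNG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+Lustre: (?P<thread>[^\s:]+): service thread pid "
    r"(?P<pid>\d+) was inactive for (?P<secs>\d+(?:\.\d+)?) seconds"
)

# Assumed format of the eviction message, taken from the log excerpt above.
EVICT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+Lustre: (?P<target>[^\s:]+): haven't heard from "
    r"client (?P<uuid>\S+) \(at (?P<nid>[^)]+)\) in (?P<secs>\d+) seconds\."
    r".*\bcur (?P<cur>\d+) expire (?P<expire>\d+) last (?P<last>\d+)"
)


def scan_console(lines):
    """Return (hung, evicted) lists of dicts found in console log lines."""
    hung, evicted = [], []
    for line in lines:
        m = HUNG_RE.search(line)
        if m:
            hung.append({
                "uptime": float(m["ts"]),        # kernel timestamp in seconds
                "thread": m["thread"],
                "pid": int(m["pid"]),
                "inactive": float(m["secs"]),
            })
            continue
        m = EVICT_RE.search(line)
        if m:
            evicted.append({
                "uptime": float(m["ts"]),
                "target": m["target"],
                "client": m["uuid"],
                "nid": m["nid"],
                "silent": int(m["secs"]),
                "cur": int(m["cur"]),            # epoch seconds
                "expire": int(m["expire"]),
                "last": int(m["last"]),
            })
    return hung, evicted


# The two relevant lines from this report, copied from the log above.
SAMPLE = [
    "[ 1974.360682] Lustre: ll_ost00_004: service thread pid 11008 was "
    "inactive for 43.187 seconds. The thread might be hung, or it might "
    "only be slow and will resume later. Dumping the stack trace for "
    "debugging purposes:",
    "[ 1974.664781] Lustre: lustre-OST0005: haven't heard from client "
    "0b624cdc-fcec-4fec-b859-486e2bb9b84b (at 10.240.40.108@tcp) in 47 "
    "seconds. I think it's dead, and I am evicting it. exp "
    "000000003f573f19, cur 1649063501 expire 1649063471 last 1649063454",
]

if __name__ == "__main__":
    hung, evicted = scan_console(SAMPLE)
    for h in hung:
        print(h["thread"], h["pid"], h["inactive"])
    for e in evicted:
        # cur - last should agree with the "in N seconds" figure in the message.
        print(e["target"], e["nid"], e["cur"] - e["last"])
```

In the eviction line, cur minus last (1649063501 - 1649063454) is 47 seconds, which matches the "in 47 seconds" text. The last timestamp, 1649063454, is one second after the test 26a start marker (1649063453).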
