[LU-15488] conf-sanity test_6: client_up failed (MDS hangs cv_wait_common) Created: 27/Jan/22  Updated: 27/Jan/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 9223372036854775807

Description

This issue was created by maloo for eaujames <eaujames@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/2df62cb6-8f14-49b5-a26f-3c278cefa18b

test_6 failed with the following error:

client_up failed

This seems to be linked to ZFS with the b2_12 branch:

Client:

[ 2474.629870] LustreError: 11-0: lustre-MDT0003-mdc-ffffa0037a786000: operation mds_connect to node 10.240.22.128@tcp failed: rc = -114
[ 2479.642804] LustreError: 11-0: lustre-MDT0003-mdc-ffffa0037a786000: operation mds_connect to node 10.240.22.128@tcp failed: rc = -114
[ 2479.644961] LustreError: Skipped 2 previous similar messages
[ 2484.650711] LustreError: 11-0: lustre-MDT0002-mdc-ffffa0037a786000: operation mds_connect to node 10.240.22.127@tcp failed: rc = -114
[ 2484.652857] LustreError: Skipped 2 previous similar messages
[ 2489.658307] LustreError: 11-0: lustre-MDT0003-mdc-ffffa0037a786000: operation mds_connect to node 10.240.22.128@tcp failed: rc = -114
[ 2489.660527] LustreError: Skipped 2 previous similar messages
[ 2495.674146] LustreError: 11-0: lustre-MDT0001-mdc-ffffa0037a786000: operation mds_connect to node 10.240.22.128@tcp failed: rc = -114
[ 2495.676310] LustreError: Skipped 6 previous similar messages
[ 2505.689787] LustreError: 11-0: lustre-MDT0000-mdc-ffffa0037a786000: operation mds_connect to node 10.240.22.127@tcp failed: rc = -114

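For reference, -114 is -EALREADY ("Operation already in progress"), which lines up with the "Export ... already connecting" messages on the MDS side below. A minimal, hypothetical userspace snippet (not Lustre code) just to confirm the errno mapping:

/* Sketch only: decode the rc = -114 seen in the mds_connect log lines.
 * On Linux, errno 114 is EALREADY ("Operation already in progress"). */
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        int rc = -114;                             /* rc from the client log */
        printf("EALREADY = %d\n", EALREADY);       /* 114 on Linux */
        printf("rc = %d -> %s\n", rc, strerror(-rc));
        return 0;
}
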
MDS hangs on txg_wait_synced:

[ 2381.629691] Lustre: lustre-MDT0003: Export ffff902853b4b000 already connecting from 10.240.24.235@tcp
[ 2391.652112] Lustre: lustre-MDT0003: Export ffff902853b4b000 already connecting from 10.240.24.235@tcp
[ 2391.653888] Lustre: Skipped 1 previous similar message
[ 2407.683636] Lustre: lustre-MDT0001: Export ffff9027c0051c00 already connecting from 10.240.24.235@tcp
[ 2407.685496] Lustre: Skipped 6 previous similar messages
[ 2416.767892] LNet: Service thread pid 23710 was inactive for 40.13s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 2416.770934] Pid: 23710, comm: mdt00_002 3.10.0-1160.49.1.el7_lustre.x86_64 #1 SMP Thu Dec 2 08:52:07 UTC 2021
[ 2416.772714] Call Trace:
[ 2416.773264]  [<ffffffffc058b2d5>] cv_wait_common+0x125/0x150 [spl]
[ 2416.774498]  [<ffffffffc058b315>] __cv_wait+0x15/0x20 [spl]
[ 2416.775568]  [<ffffffffc09992ff>] txg_wait_synced+0xef/0x140 [zfs]
[ 2416.776933]  [<ffffffffc141a94b>] osd_trans_stop+0x53b/0x5e0 [osd_zfs]
[ 2416.778189]  [<ffffffffc1238051>] tgt_server_data_update+0x201/0x510 [ptlrpc]
[ 2416.779819]  [<ffffffffc1239144>] tgt_client_new+0x494/0x610 [ptlrpc]
[ 2416.781095]  [<ffffffffc1552495>] mdt_obd_connect+0x465/0x850 [mdt]
[ 2416.782353]  [<ffffffffc119d49b>] target_handle_connect+0xecb/0x2b60 [ptlrpc]
[ 2416.783748]  [<ffffffffc124690a>] tgt_request_handle+0x4fa/0x1570 [ptlrpc]
[ 2416.785101]  [<ffffffffc11ebbcb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 2416.786551]  [<ffffffffc11ef534>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[ 2416.787749]  [<ffffffffa46c5e61>] kthread+0xd1/0xe0
[ 2416.788754]  [<ffffffffa4d95df7>] ret_from_fork_nospec_end+0x0/0x39
[ 2416.789990]  [<ffffffffffffffff>] 0xffffffffffffffff
...
[ 2427.026534] LNet: Service thread pid 23710 completed after 50.39s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).

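For context on the trace above: txg_wait_synced() parks the caller on a condition variable (cv_wait_common in SPL) until the ZFS sync thread has committed the requested transaction group, so the mdt00_002 service thread stays "inactive" for however long the pool takes to sync. A rough userspace sketch of that wait pattern, assuming only pthreads and nothing Lustre/ZFS-specific:

/* Illustrative sketch, not the actual ZFS/SPL code: a caller blocks on a
 * condition variable until the "sync thread" reports the wanted txg done,
 * which is the shape of the wait shown in the stack trace. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t tx_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  tx_cv   = PTHREAD_COND_INITIALIZER;
static unsigned long   synced_txg;              /* last txg written to disk */

/* Caller side: block until "txg" has been synced (cf. txg_wait_synced). */
static void wait_synced(unsigned long txg)
{
        pthread_mutex_lock(&tx_lock);
        while (synced_txg < txg)                /* cv_wait() equivalent */
                pthread_cond_wait(&tx_cv, &tx_lock);
        pthread_mutex_unlock(&tx_lock);
}

/* Sync-thread side: after writing a txg out, wake all waiters. */
static void *sync_thread(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&tx_lock);
        synced_txg++;                           /* pretend txg 1 hit disk */
        pthread_cond_broadcast(&tx_cv);
        pthread_mutex_unlock(&tx_lock);
        return NULL;
}

int main(void)
{
        pthread_t tid;

        pthread_create(&tid, NULL, sync_thread, NULL);
        wait_synced(1);                         /* returns once txg 1 is "synced" */
        pthread_join(tid, NULL);
        printf("txg 1 synced\n");
        return 0;
}

If the sync thread is slow (or the txg cannot sync), every caller of tgt_server_data_update() during connect sits in this wait, which would explain both the hung service thread and the clients retrying with -EALREADY.
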
This seems similar to LU-12510 or LU-10223.

VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
conf-sanity test_6 - client_up failed

