Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
Description
This issue was created by maloo for eaujames <eaujames@ddn.com>
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/2df62cb6-8f14-49b5-a26f-3c278cefa18b
test_6 failed with the following error:
client_up failed
This seems to be linked to ZFS on the b2_12 branch:
Client:
[ 2474.629870] LustreError: 11-0: lustre-MDT0003-mdc-ffffa0037a786000: operation mds_connect to node 10.240.22.128@tcp failed: rc = -114
[ 2479.642804] LustreError: 11-0: lustre-MDT0003-mdc-ffffa0037a786000: operation mds_connect to node 10.240.22.128@tcp failed: rc = -114
[ 2479.644961] LustreError: Skipped 2 previous similar messages
[ 2484.650711] LustreError: 11-0: lustre-MDT0002-mdc-ffffa0037a786000: operation mds_connect to node 10.240.22.127@tcp failed: rc = -114
[ 2484.652857] LustreError: Skipped 2 previous similar messages
[ 2489.658307] LustreError: 11-0: lustre-MDT0003-mdc-ffffa0037a786000: operation mds_connect to node 10.240.22.128@tcp failed: rc = -114
[ 2489.660527] LustreError: Skipped 2 previous similar messages
[ 2495.674146] LustreError: 11-0: lustre-MDT0001-mdc-ffffa0037a786000: operation mds_connect to node 10.240.22.128@tcp failed: rc = -114
[ 2495.676310] LustreError: Skipped 6 previous similar messages
[ 2505.689787] LustreError: 11-0: lustre-MDT0000-mdc-ffffa0037a786000: operation mds_connect to node 10.240.22.127@tcp failed: rc = -114
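For reference, rc = -114 is -EALREADY ("Operation already in progress") on Linux, which matches the server-side "Export ... already connecting" messages below: the client's retried mds_connect is refused while its earlier connect RPC is still being processed. A trivial, standalone check (plain C, nothing Lustre-specific assumed):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            /* On Linux, errno 114 is EALREADY; the client logs the negated value. */
            printf("EALREADY = %d (%s)\n", EALREADY, strerror(EALREADY));
            return 0;
    }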
MDS hangs on txg_wait_synced:
[ 2381.629691] Lustre: lustre-MDT0003: Export ffff902853b4b000 already connecting from 10.240.24.235@tcp
[ 2391.652112] Lustre: lustre-MDT0003: Export ffff902853b4b000 already connecting from 10.240.24.235@tcp
[ 2391.653888] Lustre: Skipped 1 previous similar message
[ 2407.683636] Lustre: lustre-MDT0001: Export ffff9027c0051c00 already connecting from 10.240.24.235@tcp
[ 2407.685496] Lustre: Skipped 6 previous similar messages
[ 2416.767892] LNet: Service thread pid 23710 was inactive for 40.13s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 2416.770934] Pid: 23710, comm: mdt00_002 3.10.0-1160.49.1.el7_lustre.x86_64 #1 SMP Thu Dec 2 08:52:07 UTC 2021
[ 2416.772714] Call Trace:
[ 2416.773264] [<ffffffffc058b2d5>] cv_wait_common+0x125/0x150 [spl]
[ 2416.774498] [<ffffffffc058b315>] __cv_wait+0x15/0x20 [spl]
[ 2416.775568] [<ffffffffc09992ff>] txg_wait_synced+0xef/0x140 [zfs]
[ 2416.776933] [<ffffffffc141a94b>] osd_trans_stop+0x53b/0x5e0 [osd_zfs]
[ 2416.778189] [<ffffffffc1238051>] tgt_server_data_update+0x201/0x510 [ptlrpc]
[ 2416.779819] [<ffffffffc1239144>] tgt_client_new+0x494/0x610 [ptlrpc]
[ 2416.781095] [<ffffffffc1552495>] mdt_obd_connect+0x465/0x850 [mdt]
[ 2416.782353] [<ffffffffc119d49b>] target_handle_connect+0xecb/0x2b60 [ptlrpc]
[ 2416.783748] [<ffffffffc124690a>] tgt_request_handle+0x4fa/0x1570 [ptlrpc]
[ 2416.785101] [<ffffffffc11ebbcb>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[ 2416.786551] [<ffffffffc11ef534>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[ 2416.787749] [<ffffffffa46c5e61>] kthread+0xd1/0xe0
[ 2416.788754] [<ffffffffa4d95df7>] ret_from_fork_nospec_end+0x0/0x39
[ 2416.789990] [<ffffffffffffffff>] 0xffffffffffffffff
...
[ 2427.026534] LNet: Service thread pid 23710 completed after 50.39s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
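The stack shows the mdt00_002 service thread handling the client's connect (target_handle_connect -> mdt_obd_connect -> tgt_client_new -> tgt_server_data_update), which commits the per-client server data synchronously via osd_trans_stop and then sits in ZFS's txg_wait_synced waiting for the transaction group to hit disk. While it waits, the client's connect retries get -EALREADY. The snippet below is only a sketch of that blocking pattern, with hypothetical names and pthreads standing in for the SPL cv_* primitives; it is not the actual ZFS code:

    #include <pthread.h>
    #include <stdint.h>

    struct txg_state {                      /* hypothetical, illustration only */
            pthread_mutex_t lock;
            pthread_cond_t  sync_done;
            uint64_t        synced_txg;     /* last txg fully committed to disk */
    };

    /* Caller sleeps on a condition variable until the wanted transaction
     * group has been synced. If the pool's sync thread is slow or stalled,
     * every request that needs a synchronous commit (such as recording a
     * newly connected client) stalls with it -- the hang seen above. */
    static void txg_wait_synced_sketch(struct txg_state *ts, uint64_t txg)
    {
            pthread_mutex_lock(&ts->lock);
            while (ts->synced_txg < txg)
                    pthread_cond_wait(&ts->sync_done, &ts->lock);
            pthread_mutex_unlock(&ts->lock);
    }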
This seems similar to LU-12510 or LU-10223.
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
conf-sanity test_6 - client_up failed