Details
-
Bug
-
Resolution: Fixed
-
Minor
-
Lustre 2.15.2
-
None
-
3
-
9223372036854775807
Description
This issue was created by maloo for sarah <sarah@whamcloud.com>
This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/e76c370c-50f6-449d-a1ff-7c6e1dd6389a
test_144a failed with the following error:
MDT failover took 252 seconds
Interop between 2.14 server and 2.15.2 client
client 1 demsg
[15447.569384] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == recovery-small test 144a: MDT failover should stop precreation threads ========================================================== 08:27:49 \(1670833669\) [15447.946594] Lustre: DEBUG MARKER: == recovery-small test 144a: MDT failover should stop precreation threads ========================================================== 08:27:49 (1670833669) [15455.634777] LustreError: 11-0: lustre-OST0000-osc-ffff8bba43bf3800: operation ost_setattr to node 10.240.25.240@tcp failed: rc = -19 [15455.638069] LustreError: Skipped 9 previous similar messages [15455.639023] Lustre: lustre-OST0000-osc-ffff8bba43bf3800: Connection to lustre-OST0000 (at 10.240.25.240@tcp) was lost; in progress operations using this service will wait for recovery to complete [15455.641714] Lustre: Skipped 9 previous similar messages [15480.556202] Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us [15483.390342] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null [15483.788979] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-70vm7.onyx.whamcloud.com: executing wait_import_state_mount \(FULL\|IDLE\) osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid [15484.158321] Lustre: DEBUG MARKER: onyx-70vm7.onyx.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid [15484.257913] Lustre: DEBUG MARKER: lctl get_param -n at_max [15484.644084] Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid in FULL state after 0 sec [15485.024198] Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid in FULL state after 0 sec [15574.629865] Lustre: 1011245:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670833677/real 1670833677] req@0000000031b8d95f x1751984860062976/t0(0) o101->lustre-MDT0000-mdc-ffff8bba43bf3800@10.240.25.241@tcp:12/10 lens 4616/70232 e 1 to 1 dl 1670833796 ref 2 fl Rpc:XPQr/0/ffffffff rc 0/-1 job:'touch.0' [15574.634758] Lustre: 1011245:0:(client.c:2295:ptlrpc_expire_one_request()) Skipped 20 previous similar messages [15605.861791] INFO: task touch:1011244 blocked for more than 120 seconds. [15605.867361] Tainted: G OE --------- - - 4.18.0-372.32.1.el8_6.x86_64 #1 [15605.868711] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [15605.869986] task:touch state:D stack: 0 pid:1011244 ppid:1010843 flags:0x00000084 [15605.871366] Call Trace: [15605.871840] __schedule+0x2d1/0x840 [15605.872494] schedule+0x35/0xa0 [15605.873061] rwsem_down_write_slowpath+0x30c/0x5c0 [15605.873902] path_openat+0x34b/0x14f0 [15605.874570] do_filp_open+0x93/0x100 [15605.875200] ? getname_flags+0x4a/0x1e0 [15605.875863] ? __check_object_size+0xac/0x173 [15605.876612] do_sys_open+0x188/0x220 [15605.877245] do_syscall_64+0x5b/0x1b0 [15605.877906] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [15605.878763] RIP: 0033:0x7f460cf5e272 [15605.879394] Code: Unable to access opcode bytes at RIP 0x7f460cf5e248. [15605.880454] RSP: 002b:00007ffd1250d590 EFLAGS: 00000246 ORIG_RAX: 0000000000000101 [15605.881684] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f460cf5e272 [15605.882843] RDX: 0000000000000941 RSI: 00007ffd1250e40d RDI: 00000000ffffff9c [15605.883998] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001 [15605.885152] R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000001 [15605.886307] R13: 0000000000000001 R14: 00007ffd1250e40d R15: 00007f460d1fd374 [15728.741044] INFO: task touch:1011244 blocked for more than 120 seconds. [15728.742186] Tainted: G OE --------- - - 4.18.0-372.32.1.el8_6.x86_64 #1 [15728.743540] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [15728.744802] task:touch state:D stack: 0 pid:1011244 ppid:1010843 flags:0x00000084 [15728.746176] Call Trace: [15728.746643] __schedule+0x2d1/0x840 [15728.747256] schedule+0x35/0xa0 [15728.747813] rwsem_down_write_slowpath+0x30c/0x5c0 [15728.748625] path_openat+0x34b/0x14f0 [15728.749264] do_filp_open+0x93/0x100 [15728.749889] ? getname_flags+0x4a/0x1e0 [15728.750551] ? __check_object_size+0xac/0x173 [15728.751295] do_sys_open+0x188/0x220 [15728.751916] do_syscall_64+0x5b/0x1b0 [15728.752556] entry_SYSCALL_64_after_hwframe+0x61/0xc6 [15728.753405] RIP: 0033:0x7f460cf5e272 [15728.754051] Code: Unable to access opcode bytes at RIP 0x7f460cf5e248. [15728.755123] RSP: 002b:00007ffd1250d590 EFLAGS: 00000246 ORIG_RAX: 0000000000000101 [15728.756352] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f460cf5e272 [15728.757513] RDX: 0000000000000941 RSI: 00007ffd1250e40d RDI: 00000000ffffff9c [15728.758670] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001 [15728.759833] R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000001 [15728.760992] R13: 0000000000000001 R14: 00007ffd1250e40d R15: 00007f460d1fd374 [15767.588855] LustreError: 166-1: MGC10.240.25.241@tcp: Connection to MGS (at 10.240.25.241@tcp) was lost; in progress operations using this service will fail [15767.591087] LustreError: Skipped 5 previous similar messages [15784.933984] Lustre: Evicted from MGS (at 10.240.25.241@tcp) after server handle changed from 0x7461a63767ab7b63 to 0x7461a63767ab8ac8 [15784.935919] Lustre: Skipped 3 previous similar messages [15784.937420] Lustre: MGC10.240.25.241@tcp: Connection restored to 10.240.25.241@tcp (at 10.240.25.241@tcp) [15784.938977] Lustre: Skipped 14 previous similar messages [15792.511591] Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us [15795.354174] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null [15795.750211] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-70vm7.onyx.whamcloud.com: executing wait_import_state_mount \(FULL\|IDLE\) mdc.lustre-MDT0000-mdc-*.mds_server_uuid [15796.139101] Lustre: DEBUG MARKER: onyx-70vm7.onyx.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid [15796.227757] Lustre: DEBUG MARKER: lctl get_param -n at_max [15796.603890] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec [15797.004999] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec [15798.902782] Lustre: DEBUG MARKER: /usr/sbin/lctl mark recovery-small test_144a: @@@@@@ FAIL: MDT failover took 252 seconds [15799.270104] Lustre: DEBUG MARKER: recovery-small test_144a: @@@@@@ FAIL: MDT failover took 252 seconds [15799.693930] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest-1/2022-12-11/lustre-b2_15_full-part-1_44_61_9513f440-0dd9-4552-b71e-4aba705566d8//recovery-small.test_144a.debug_log.$(hostname -s).1670834021.log; dmesg > /autotest/autotest-1/2022-12-11/lustre-b2_15_full-part-1_4
VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
recovery-small test_144a - MDT failover took 252 seconds