Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-16425

Interop recovery-small test_144a: MDT failover took 252 seconds

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • None
    • Lustre 2.15.2
    • None
    • 3
    • 9223372036854775807

    Description

      This issue was created by maloo for sarah <sarah@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/e76c370c-50f6-449d-a1ff-7c6e1dd6389a

      test_144a failed with the following error:

      MDT failover took 252 seconds
      

      Interop between 2.14 server and 2.15.2 client
      client 1 demsg

      [15447.569384] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == recovery-small test 144a: MDT failover should stop precreation threads ========================================================== 08:27:49 \(1670833669\)
      [15447.946594] Lustre: DEBUG MARKER: == recovery-small test 144a: MDT failover should stop precreation threads ========================================================== 08:27:49 (1670833669)
      [15455.634777] LustreError: 11-0: lustre-OST0000-osc-ffff8bba43bf3800: operation ost_setattr to node 10.240.25.240@tcp failed: rc = -19
      [15455.638069] LustreError: Skipped 9 previous similar messages
      [15455.639023] Lustre: lustre-OST0000-osc-ffff8bba43bf3800: Connection to lustre-OST0000 (at 10.240.25.240@tcp) was lost; in progress operations using this service will wait for recovery to complete
      [15455.641714] Lustre: Skipped 9 previous similar messages
      [15480.556202] Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
      [15483.390342] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
      [15483.788979] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-70vm7.onyx.whamcloud.com: executing wait_import_state_mount \(FULL\|IDLE\) osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid
      [15484.158321] Lustre: DEBUG MARKER: onyx-70vm7.onyx.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid
      [15484.257913] Lustre: DEBUG MARKER: lctl get_param -n at_max
      [15484.644084] Lustre: DEBUG MARKER: /usr/sbin/lctl mark osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid in FULL state after 0 sec
      [15485.024198] Lustre: DEBUG MARKER: osc.lustre-OST0000-osc-[-0-9a-f]*.ost_server_uuid in FULL state after 0 sec
      [15574.629865] Lustre: 1011245:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670833677/real 1670833677]  req@0000000031b8d95f x1751984860062976/t0(0) o101->lustre-MDT0000-mdc-ffff8bba43bf3800@10.240.25.241@tcp:12/10 lens 4616/70232 e 1 to 1 dl 1670833796 ref 2 fl Rpc:XPQr/0/ffffffff rc 0/-1 job:'touch.0'
      [15574.634758] Lustre: 1011245:0:(client.c:2295:ptlrpc_expire_one_request()) Skipped 20 previous similar messages
      [15605.861791] INFO: task touch:1011244 blocked for more than 120 seconds.
      [15605.867361]       Tainted: G           OE    --------- -  - 4.18.0-372.32.1.el8_6.x86_64 #1
      [15605.868711] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [15605.869986] task:touch           state:D stack:    0 pid:1011244 ppid:1010843 flags:0x00000084
      [15605.871366] Call Trace:
      [15605.871840]  __schedule+0x2d1/0x840
      [15605.872494]  schedule+0x35/0xa0
      [15605.873061]  rwsem_down_write_slowpath+0x30c/0x5c0
      [15605.873902]  path_openat+0x34b/0x14f0
      [15605.874570]  do_filp_open+0x93/0x100
      [15605.875200]  ? getname_flags+0x4a/0x1e0
      [15605.875863]  ? __check_object_size+0xac/0x173
      [15605.876612]  do_sys_open+0x188/0x220
      [15605.877245]  do_syscall_64+0x5b/0x1b0
      [15605.877906]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
      [15605.878763] RIP: 0033:0x7f460cf5e272
      [15605.879394] Code: Unable to access opcode bytes at RIP 0x7f460cf5e248.
      [15605.880454] RSP: 002b:00007ffd1250d590 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
      [15605.881684] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f460cf5e272
      [15605.882843] RDX: 0000000000000941 RSI: 00007ffd1250e40d RDI: 00000000ffffff9c
      [15605.883998] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
      [15605.885152] R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000001
      [15605.886307] R13: 0000000000000001 R14: 00007ffd1250e40d R15: 00007f460d1fd374
      [15728.741044] INFO: task touch:1011244 blocked for more than 120 seconds.
      [15728.742186]       Tainted: G           OE    --------- -  - 4.18.0-372.32.1.el8_6.x86_64 #1
      [15728.743540] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [15728.744802] task:touch           state:D stack:    0 pid:1011244 ppid:1010843 flags:0x00000084
      [15728.746176] Call Trace:
      [15728.746643]  __schedule+0x2d1/0x840
      [15728.747256]  schedule+0x35/0xa0
      [15728.747813]  rwsem_down_write_slowpath+0x30c/0x5c0
      [15728.748625]  path_openat+0x34b/0x14f0
      [15728.749264]  do_filp_open+0x93/0x100
      [15728.749889]  ? getname_flags+0x4a/0x1e0
      [15728.750551]  ? __check_object_size+0xac/0x173
      [15728.751295]  do_sys_open+0x188/0x220
      [15728.751916]  do_syscall_64+0x5b/0x1b0
      [15728.752556]  entry_SYSCALL_64_after_hwframe+0x61/0xc6
      [15728.753405] RIP: 0033:0x7f460cf5e272
      [15728.754051] Code: Unable to access opcode bytes at RIP 0x7f460cf5e248.
      [15728.755123] RSP: 002b:00007ffd1250d590 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
      [15728.756352] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f460cf5e272
      [15728.757513] RDX: 0000000000000941 RSI: 00007ffd1250e40d RDI: 00000000ffffff9c
      [15728.758670] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
      [15728.759833] R10: 00000000000001b6 R11: 0000000000000246 R12: 0000000000000001
      [15728.760992] R13: 0000000000000001 R14: 00007ffd1250e40d R15: 00007f460d1fd374
      [15767.588855] LustreError: 166-1: MGC10.240.25.241@tcp: Connection to MGS (at 10.240.25.241@tcp) was lost; in progress operations using this service will fail
      [15767.591087] LustreError: Skipped 5 previous similar messages
      [15784.933984] Lustre: Evicted from MGS (at 10.240.25.241@tcp) after server handle changed from 0x7461a63767ab7b63 to 0x7461a63767ab8ac8
      [15784.935919] Lustre: Skipped 3 previous similar messages
      [15784.937420] Lustre: MGC10.240.25.241@tcp: Connection restored to 10.240.25.241@tcp (at 10.240.25.241@tcp)
      [15784.938977] Lustre: Skipped 14 previous similar messages
      [15792.511591] Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/opt/iozone/bin:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lustre/tests/racer:/usr/lib64/lustre/../lustre-iokit/sgpdd-survey:/usr/lib64/lustre/tests:/us
      [15795.354174] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n version 2>/dev/null
      [15795.750211] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-70vm7.onyx.whamcloud.com: executing wait_import_state_mount \(FULL\|IDLE\) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      [15796.139101] Lustre: DEBUG MARKER: onyx-70vm7.onyx.whamcloud.com: executing wait_import_state_mount (FULL|IDLE) mdc.lustre-MDT0000-mdc-*.mds_server_uuid
      [15796.227757] Lustre: DEBUG MARKER: lctl get_param -n at_max
      [15796.603890] Lustre: DEBUG MARKER: /usr/sbin/lctl mark mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
      [15797.004999] Lustre: DEBUG MARKER: mdc.lustre-MDT0000-mdc-*.mds_server_uuid in FULL state after 0 sec
      [15798.902782] Lustre: DEBUG MARKER: /usr/sbin/lctl mark  recovery-small test_144a: @@@@@@ FAIL: MDT failover took 252 seconds 
      [15799.270104] Lustre: DEBUG MARKER: recovery-small test_144a: @@@@@@ FAIL: MDT failover took 252 seconds
      [15799.693930] Lustre: DEBUG MARKER: /usr/sbin/lctl dk > /autotest/autotest-1/2022-12-11/lustre-b2_15_full-part-1_44_61_9513f440-0dd9-4552-b71e-4aba705566d8//recovery-small.test_144a.debug_log.$(hostname -s).1670834021.log;
                             dmesg > /autotest/autotest-1/2022-12-11/lustre-b2_15_full-part-1_4

      VVVVVVV DO NOT REMOVE LINES BELOW, Added by Maloo for auto-association VVVVVVV
      recovery-small test_144a - MDT failover took 252 seconds

      Attachments

        Issue Links

          Activity

            People

              sarah Sarah Liu
              maloo Maloo
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: