Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-15761

cannot finish MDS recovery

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.15.0
    • Lustre 2.15.0
    • 2.15.0-rc3
    • 3
    • 9223372036854775807

    Description

      Hit the same issue twice during soak testing on the same MDS node

      trying to connect to soak-10 ...
      2022-04-15 00:35:34,865:fsmgmt.fsmgmt:INFO     soak-10 is up!!!
      2022-04-15 00:35:45,877:fsmgmt.fsmgmt:INFO     Failing over soaked-MDT0002 ...
      2022-04-15 00:35:45,877:fsmgmt.fsmgmt:INFO     Mounting soaked-MDT0002 on soak-11 ...
      2022-04-15 00:36:42,914:fsmgmt.fsmgmt:INFO     ... soaked-MDT0002 mounted successfully on soak-11
      2022-04-15 00:36:42,914:fsmgmt.fsmgmt:INFO     ... soaked-MDT0002 failed over
      2022-04-15 00:36:42,914:fsmgmt.fsmgmt:INFO     Wait for recovery to complete
      2022-04-15 00:36:43,502:fsmgmt.fsmgmt:DEBUG    Recovery Result Record: {'soak-11': {'soaked-MDT0003': 'COMPLETE', 'soaked-MDT0002': 'WAITING'}}
      2022-04-15 00:36:43,503:fsmgmt.fsmgmt:INFO     soaked-MDT0002 in status 'WAITING'.
      

      on soak-11

      [136090.466173] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages
      [136538.516253] Lustre: soaked-MDT0003: Client ec0f61fb-4e34-4a0b-b633-b84f3e77a68b (at 192.168.1.138@o2ib) reconnecting
      [136538.528122] Lustre: Skipped 8 previous similar messages
      [136570.792064] LustreError: 43760:0:(llog.c:781:llog_process_thread()) soaked-MDT0001-osp-MDT0002 retry remote llog process
      [136570.804339] LustreError: 43760:0:(llog.c:781:llog_process_thread()) Skipped 36771 previous similar messages
      [136690.448382] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reached hard limit: 900, exten
      d: 1
      [136690.462972] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages
      [137139.504191] Lustre: soaked-MDT0003: Client 6734fd00-3e57-422d-a9ca-e8a57bcfd8f1 (at 192.168.1.135@o2ib) reconnecting
      [137139.516055] Lustre: Skipped 10 previous similar messages
      [137170.777941] LustreError: 43760:0:(llog.c:781:llog_process_thread()) soaked-MDT0001-osp-MDT0002 retry remote llog process
      [137170.790212] LustreError: 43760:0:(llog.c:781:llog_process_thread()) Skipped 36789 previous similar messages
      [137180.761466] Lustre: mdt00_016: service thread pid 12073 was inactive for 200.281 seconds. The thread might be hung, or it might only be slow
       and will resume later. Dumping the stack trace for debugging purposes:
      [137180.782540] Pid: 12073, comm: mdt00_016 3.10.0-1160.49.1.el7_lustre.x86_64 #1 SMP Mon Dec 13 22:00:59 UTC 2021
      [137180.793815] Call Trace:
      [137180.796783] [<0>] top_trans_stop+0x882/0xfa0 [ptlrpc]
      [137180.802560] [<0>] lod_trans_stop+0x25c/0x340 [lod]
      [137180.808065] [<0>] mdd_trans_stop+0x2e/0x174 [mdd]
      [137180.813433] [<0>] mdd_create+0x154a/0x1cd0 [mdd]
      [137180.818740] [<0>] mdo_create+0x46/0x48 [mdt]
      [137180.823632] [<0>] mdt_create+0xab1/0xe40 [mdt]
      [137180.828720] [<0>] mdt_reint_create+0x3a0/0x460 [mdt]
      [137180.834388] [<0>] mdt_reint_rec+0x8a/0x240 [mdt]
      [137180.839662] [<0>] mdt_reint_internal+0x76c/0xb50 [mdt]
      [137180.845516] [<0>] mdt_reint+0x67/0x150 [mdt]
      [137180.850456] [<0>] tgt_request_handle+0x92f/0x19c0 [ptlrpc]
      [137180.856729] [<0>] ptlrpc_server_handle_request+0x253/0xc30 [ptlrpc]
      [137180.863875] [<0>] ptlrpc_main+0xbf4/0x15e0 [ptlrpc]
      [137180.869420] [<0>] kthread+0xd1/0xe0
      [137180.873423] [<0>] ret_from_fork_nospec_begin+0x21/0x21
      [137180.879286] [<0>] 0xfffffffffffffffe
      [137290.446156] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reached hard limit: 900, extend: 1
      [137290.460730] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages
      [137575.430810] Lustre: 7111:0:(service.c:1436:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/5), not sending early reply  req@ffff8f5f6cb94380 x1729527267995904/t176094400030(0) o36->f5a20605-7e7a-45f1-bf01-daf0664944ec@192.168.1.118@o2ib:514/0 lens 552/448 e 24 to 0 dl 1649907299 ref 2 fl Interpret:/0/0 rc 0/0 job:''
      [137740.477658] Lustre: soaked-MDT0003: Client ec0f61fb-4e34-4a0b-b633-b84f3e77a68b (at 192.168.1.138@o2ib) reconnecting
      [137740.489550] Lustre: Skipped 9 previous similar messages
      [137770.755234] LustreError: 43760:0:(llog.c:781:llog_process_thread()) soaked-MDT0001-osp-MDT0002 retry remote llog process
      [137770.767489] LustreError: 43760:0:(llog.c:781:llog_process_thread()) Skipped 36738 previous similar messages
      [137890.442958] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reached hard limit: 900, extend: 1
      [137890.457533] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages
      [138341.465763] Lustre: soaked-MDT0003: Client 6734fd00-3e57-422d-a9ca-e8a57bcfd8f1 (at 192.168.1.135@o2ib) reconnecting
      [138341.477619] Lustre: Skipped 11 previous similar messages
      

      have to run abort_recovery

      [188290.202475] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages
      [188327.367445] LustreError: 37182:0:(mdt_handler.c:7419:mdt_iocontrol()) soaked-MDT0002: Aborting recovery for device
      [188327.379129] LustreError: 37182:0:(ldlm_lib.c:2876:target_stop_recovery_thread()) soaked-MDT0002: Aborting recovery
      [188327.390814] Lustre: 43762:0:(ldlm_lib.c:2283:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      [188327.392289] LustreError: 43760:0:(lod_dev.c:424:lod_sub_recovery_thread()) soaked-MDT0001-osp-MDT0002 get update log failed: rc = -11
      [188327.418009] Lustre: 43762:0:(ldlm_lib.c:2273:target_recovery_overseer()) soaked-MDT0002 recovery is aborted by hard timeout
      [188327.486462] Lustre: soaked-MDT0002-osp-MDT0003: Connection restored to 192.168.1.111@o2ib (at 0@lo)
      [188327.487906] Lustre: soaked-MDT0002: Recovery over after 2739:44, of 22 clients 22 recovered and 0 were evicted.
      [188345.146528] Lustre: Failing over soaked-MDT0002
      [188345.224943] LustreError: 3633:0:(mdt_reint.c:2741:mdt_reint_rename()) soaked-MDT0002: cannot lock for rename: rc = -5
      [188345.236924] LustreError: 3633:0:(mdt_reint.c:2741:mdt_reint_rename()) Skipped 1 previous similar message
      [188345.238945] Lustre: soaked-MDT0002: Not available for connect from 192.168.1.126@o2ib (stopping)
      [188345.238946] Lustre: Skipped 2 previous similar messages
      

      Attachments

        1. lu-14495-soak-8debug
          15.14 MB
          Sarah Liu
        2. lu-14495-soak-8dmesg
          381 kB
          Sarah Liu
        3. lu-14495-soak-8trace
          994 kB
          Sarah Liu
        4. lustre-log.1651263727.22864
          102.86 MB
          Sarah Liu
        5. lustre-log.1651732904.9736
          146.86 MB
          Sarah Liu
        6. soak-10-debuglog
          133.07 MB
          Sarah Liu
        7. soak-10-umount-hang-trace
          1016 kB
          Sarah Liu
        8. soak-8dmesg
          337 kB
          Sarah Liu
        9. soak-8trace
          995 kB
          Sarah Liu

        Issue Links

          Activity

            People

              tappro Mikhail Pershin
              sarah Sarah Liu
              Votes:
              0 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: