Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Lustre 2.15.0
    • Lustre 2.15.0
    • 2.15.0-rc3
    • 3

    Description

      Hit the same issue twice during soak testing, on the same MDS node:

      trying to connect to soak-10 ...
      2022-04-15 00:35:34,865:fsmgmt.fsmgmt:INFO     soak-10 is up!!!
      2022-04-15 00:35:45,877:fsmgmt.fsmgmt:INFO     Failing over soaked-MDT0002 ...
      2022-04-15 00:35:45,877:fsmgmt.fsmgmt:INFO     Mounting soaked-MDT0002 on soak-11 ...
      2022-04-15 00:36:42,914:fsmgmt.fsmgmt:INFO     ... soaked-MDT0002 mounted successfully on soak-11
      2022-04-15 00:36:42,914:fsmgmt.fsmgmt:INFO     ... soaked-MDT0002 failed over
      2022-04-15 00:36:42,914:fsmgmt.fsmgmt:INFO     Wait for recovery to complete
      2022-04-15 00:36:43,502:fsmgmt.fsmgmt:DEBUG    Recovery Result Record: {'soak-11': {'soaked-MDT0003': 'COMPLETE', 'soaked-MDT0002': 'WAITING'}}
      2022-04-15 00:36:43,503:fsmgmt.fsmgmt:INFO     soaked-MDT0002 in status 'WAITING'.
      
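      For reference, a minimal sketch of how recovery progress can be checked directly on the failover MDS with lctl (assuming the target name soaked-MDT0002 from the log above; the soak harness presumably polls something equivalent):

      # On soak-11, query the recovery state of the failed-over target
      lctl get_param mdt.soaked-MDT0002.recovery_status
      # The output includes fields such as status, recovery_start,
      # time_remaining and completed_clients/evicted_clients.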

      On soak-11:

      [136090.466173] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages
      [136538.516253] Lustre: soaked-MDT0003: Client ec0f61fb-4e34-4a0b-b633-b84f3e77a68b (at 192.168.1.138@o2ib) reconnecting
      [136538.528122] Lustre: Skipped 8 previous similar messages
      [136570.792064] LustreError: 43760:0:(llog.c:781:llog_process_thread()) soaked-MDT0001-osp-MDT0002 retry remote llog process
      [136570.804339] LustreError: 43760:0:(llog.c:781:llog_process_thread()) Skipped 36771 previous similar messages
      [136690.448382] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reached hard limit: 900, extend: 1
      [136690.462972] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages
      [137139.504191] Lustre: soaked-MDT0003: Client 6734fd00-3e57-422d-a9ca-e8a57bcfd8f1 (at 192.168.1.135@o2ib) reconnecting
      [137139.516055] Lustre: Skipped 10 previous similar messages
      [137170.777941] LustreError: 43760:0:(llog.c:781:llog_process_thread()) soaked-MDT0001-osp-MDT0002 retry remote llog process
      [137170.790212] LustreError: 43760:0:(llog.c:781:llog_process_thread()) Skipped 36789 previous similar messages
      [137180.761466] Lustre: mdt00_016: service thread pid 12073 was inactive for 200.281 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [137180.782540] Pid: 12073, comm: mdt00_016 3.10.0-1160.49.1.el7_lustre.x86_64 #1 SMP Mon Dec 13 22:00:59 UTC 2021
      [137180.793815] Call Trace:
      [137180.796783] [<0>] top_trans_stop+0x882/0xfa0 [ptlrpc]
      [137180.802560] [<0>] lod_trans_stop+0x25c/0x340 [lod]
      [137180.808065] [<0>] mdd_trans_stop+0x2e/0x174 [mdd]
      [137180.813433] [<0>] mdd_create+0x154a/0x1cd0 [mdd]
      [137180.818740] [<0>] mdo_create+0x46/0x48 [mdt]
      [137180.823632] [<0>] mdt_create+0xab1/0xe40 [mdt]
      [137180.828720] [<0>] mdt_reint_create+0x3a0/0x460 [mdt]
      [137180.834388] [<0>] mdt_reint_rec+0x8a/0x240 [mdt]
      [137180.839662] [<0>] mdt_reint_internal+0x76c/0xb50 [mdt]
      [137180.845516] [<0>] mdt_reint+0x67/0x150 [mdt]
      [137180.850456] [<0>] tgt_request_handle+0x92f/0x19c0 [ptlrpc]
      [137180.856729] [<0>] ptlrpc_server_handle_request+0x253/0xc30 [ptlrpc]
      [137180.863875] [<0>] ptlrpc_main+0xbf4/0x15e0 [ptlrpc]
      [137180.869420] [<0>] kthread+0xd1/0xe0
      [137180.873423] [<0>] ret_from_fork_nospec_begin+0x21/0x21
      [137180.879286] [<0>] 0xfffffffffffffffe
      [137290.446156] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reached hard limit: 900, extend: 1
      [137290.460730] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages
      [137575.430810] Lustre: 7111:0:(service.c:1436:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/5), not sending early reply  req@ffff8f5f6cb94380 x1729527267995904/t176094400030(0) o36->f5a20605-7e7a-45f1-bf01-daf0664944ec@192.168.1.118@o2ib:514/0 lens 552/448 e 24 to 0 dl 1649907299 ref 2 fl Interpret:/0/0 rc 0/0 job:''
      [137740.477658] Lustre: soaked-MDT0003: Client ec0f61fb-4e34-4a0b-b633-b84f3e77a68b (at 192.168.1.138@o2ib) reconnecting
      [137740.489550] Lustre: Skipped 9 previous similar messages
      [137770.755234] LustreError: 43760:0:(llog.c:781:llog_process_thread()) soaked-MDT0001-osp-MDT0002 retry remote llog process
      [137770.767489] LustreError: 43760:0:(llog.c:781:llog_process_thread()) Skipped 36738 previous similar messages
      [137890.442958] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reached hard limit: 900, extend: 1
      [137890.457533] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages
      [138341.465763] Lustre: soaked-MDT0003: Client 6734fd00-3e57-422d-a9ca-e8a57bcfd8f1 (at 192.168.1.135@o2ib) reconnecting
      [138341.477619] Lustre: Skipped 11 previous similar messages
      

      Had to run abort_recovery; see the command sketch after the log below.

      [188290.202475] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages
      [188327.367445] LustreError: 37182:0:(mdt_handler.c:7419:mdt_iocontrol()) soaked-MDT0002: Aborting recovery for device
      [188327.379129] LustreError: 37182:0:(ldlm_lib.c:2876:target_stop_recovery_thread()) soaked-MDT0002: Aborting recovery
      [188327.390814] Lustre: 43762:0:(ldlm_lib.c:2283:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      [188327.392289] LustreError: 43760:0:(lod_dev.c:424:lod_sub_recovery_thread()) soaked-MDT0001-osp-MDT0002 get update log failed: rc = -11
      [188327.418009] Lustre: 43762:0:(ldlm_lib.c:2273:target_recovery_overseer()) soaked-MDT0002 recovery is aborted by hard timeout
      [188327.486462] Lustre: soaked-MDT0002-osp-MDT0003: Connection restored to 192.168.1.111@o2ib (at 0@lo)
      [188327.487906] Lustre: soaked-MDT0002: Recovery over after 2739:44, of 22 clients 22 recovered and 0 were evicted.
      [188345.146528] Lustre: Failing over soaked-MDT0002
      [188345.224943] LustreError: 3633:0:(mdt_reint.c:2741:mdt_reint_rename()) soaked-MDT0002: cannot lock for rename: rc = -5
      [188345.236924] LustreError: 3633:0:(mdt_reint.c:2741:mdt_reint_rename()) Skipped 1 previous similar message
      [188345.238945] Lustre: soaked-MDT0002: Not available for connect from 192.168.1.126@o2ib (stopping)
      [188345.238946] Lustre: Skipped 2 previous similar messages
      
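      For reference, the abort is done through lctl on the MDS holding the stuck target; a minimal sketch, assuming the device name soaked-MDT0002 from the log above (the mdt_iocontrol() message corresponds to this ioctl):

      # Abandon recovery for the target; clients still in recovery are evicted
      lctl --device soaked-MDT0002 abort_recovery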

      Attachments

        1. lu-14495-soak-8debug
          15.14 MB
        2. lu-14495-soak-8dmesg
          381 kB
        3. lu-14495-soak-8trace
          994 kB
        4. lustre-log.1651263727.22864
          102.86 MB
        5. lustre-log.1651732904.9736
          146.86 MB
        6. soak-10-debuglog
          133.07 MB
        7. soak-10-umount-hang-trace
          1016 kB
        8. soak-8dmesg
          337 kB
        9. soak-8trace
          995 kB

        Issue Links

          Activity

            [LU-15761] cannot finish MDS recovery

            "Etienne AUJAMES <eaujames@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49470
            Subject: LU-15761 obdclass: fix locking in llog_cat_refresh()
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: cb2086c2f9268db6f65ed2d31384806057a0ee30


            bzzz Alex Zhuravlev added a comment -

            Alex Zhuravlev, please check the log lustre-log.1651732904.9736, this is with full debug, hope it helps

            Yes, it did, thanks a lot. I've got a rough understanding and am trying to construct a reproducer.

            pjones Peter Jones added a comment -

            LU-15761 fix has landed for 2.15. I have reopened LU-14495 to track the remaining issue


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/47185/
            Subject: LU-15761 obdclass: fix locking in llog_cat_refresh()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 69cdcbe3cf732124b3b5be1a1f235aa1b78f6c85

            sarah Sarah Liu added a comment -

            bzzz, please check the log lustre-log.1651732904.9736, this is with full debug, hope it helps

            [33368.708107] LustreError: 8542:0:(llog_cat.c:789:llog_cat_cancel_records()) Skipped 343 previous similar messages
            [33384.336498] Lustre: soaked-MDT0003: haven't heard from client soaked-MDT0003-lwp-OST0005_UUID (at 192.168.1.105@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff916a71261400, cur 1651732900 expire 1651732750 last 1651732673
            [33384.360772] Lustre: Skipped 3 previous similar messages
            [33388.209816] LustreError: 9736:0:(llog_osd.c:626:llog_osd_write_rec()) soaked-MDT0001-osp-MDT0003: index 40453 already set in llog bitmap [0x240000403:0x3:0x0]
            [33388.225651] LustreError: 9736:0:(llog_osd.c:628:llog_osd_write_rec()) LBUG
            [33388.233370] Pid: 9736, comm: mdt00_011 3.10.0-1160.49.1.el7_lustre.x86_64 #1 SMP Sun Apr 3 16:20:30 UTC 2022
            [33388.244363] Call Trace:
            [33388.247123] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
            [33388.252872] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
            [33388.258268] [<0>] llog_osd_write_rec+0x1a3a/0x1a70 [obdclass]
            [33388.264714] [<0>] llog_write_rec+0x293/0x590 [obdclass]
            [33388.270573] [<0>] llog_cat_add_rec+0x1d9/0x980 [obdclass]
            [33388.276617] [<0>] llog_add+0x182/0x1f0 [obdclass]
            [33388.281964] [<0>] sub_updates_write+0x302/0xe3b [ptlrpc]
            [33388.287970] [<0>] top_trans_stop+0x4a2/0xfa0 [ptlrpc]
            [33388.293652] [<0>] lod_trans_stop+0x25c/0x340 [lod]
            [33388.299032] [<0>] mdd_trans_stop+0x2e/0x174 [mdd]
            [33388.304308] [<0>] mdd_create+0x154a/0x1cd0 [mdd]
            [33388.309491] [<0>] mdo_create+0x46/0x48 [mdt]
            [33388.314301] [<0>] mdt_create+0xab1/0xe40 [mdt]
            [33388.319269] [<0>] mdt_reint_create+0x3a0/0x460 [mdt]
            [33388.324829] [<0>] mdt_reint_rec+0x8a/0x240 [mdt]
            [33388.329993] [<0>] mdt_reint_internal+0x76c/0xb50 [mdt]
            [33388.335733] [<0>] mdt_reint+0x67/0x150 [mdt]
            [33388.340562] [<0>] tgt_request_handle+0x92f/0x19c0 [ptlrpc]
            [33388.346728] [<0>] ptlrpc_server_handle_request+0x253/0xc30 [ptlrpc]
            [33388.353774] [<0>] ptlrpc_main+0xbf4/0x15e0 [ptlrpc]
            [33388.359226] [<0>] kthread+0xd1/0xe0
            [33388.363125] [<0>] ret_from_fork_nospec_begin+0x21/0x21
            [33388.368893] [<0>] 0xfffffffffffffffe
            [33388.373074] LustreError: dumping log to /tmp/lustre-log.1651732904.9736
            
            sarah Sarah Liu added a comment -

            okay, I will restart with full debug


            bzzz Alex Zhuravlev added a comment -

            I don't understand where this problem comes from:

            [ 5814.198337] LustreError: 12981:0:(out_handler.c:910:out_tx_end()) soaked-MDT0000-osd: undo for /tmp/rpmbuild-lustre-jenkins-ZEMyr9OQ/BUILD/lustre-2.15.0_RC3_3_gf161c9d/lustre/ptlrpc/../../lustre/target/out_handler.c:445: rc = -524

            I want to debug this. Would it be possible to run the test with full debug enabled and dump the Lustre log when this message appears?
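            For reference, a minimal sketch of how full debugging could be enabled and the log dumped on the affected MDS (the buffer size and output path are arbitrary examples, not values from this ticket):

            # Enable all Lustre debug flags and enlarge the in-memory debug buffer
            lctl set_param debug=-1
            lctl set_param debug_mb=1024
            # Once the message of interest appears, dump the debug buffer to a file
            lctl dk /tmp/lustre-debug.$(date +%s)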


            bzzz Alex Zhuravlev added a comment -

            Thanks for the update. I'm going to check the logs. In the meantime, can you please restart?

            sarah Sarah Liu added a comment - - edited

            A quick update on soak: it did not hit the umount issue during failover, but it did hit LU-14495 on MDS0. The test is still ongoing since panic-on-lbug is disabled. I will upload the trace and debug log here since LU-14495 is closed.
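            For reference, panic-on-LBUG is controlled by the libcfs module; a minimal sketch of how it can be disabled so the node keeps running after an LBUG (parameter names assume the in-tree libcfs defaults):

            # Keep the node up after an LBUG so logs can be collected
            echo 0 > /sys/module/libcfs/parameters/libcfs_panic_on_lbug
            # Commonly also reachable via lctl
            lctl set_param panic_on_lbug=0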

            sarah Sarah Liu added a comment -

            Hi Andreas, last time the failure occurred after 6-7 hours of running; so far soak has been up for over 10 hours and is doing well.


            People

              tappro Mikhail Pershin
              sarah Sarah Liu
              Votes: 0
              Watchers: 13
