Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version: Lustre 2.15.0
    • Fix Version: Lustre 2.15.0
    • Environment: 2.15.0-rc3
    • Severity: 3

    Description

      Hit the same issue twice during soak testing on the same MDS node:

      trying to connect to soak-10 ...
      2022-04-15 00:35:34,865:fsmgmt.fsmgmt:INFO     soak-10 is up!!!
      2022-04-15 00:35:45,877:fsmgmt.fsmgmt:INFO     Failing over soaked-MDT0002 ...
      2022-04-15 00:35:45,877:fsmgmt.fsmgmt:INFO     Mounting soaked-MDT0002 on soak-11 ...
      2022-04-15 00:36:42,914:fsmgmt.fsmgmt:INFO     ... soaked-MDT0002 mounted successfully on soak-11
      2022-04-15 00:36:42,914:fsmgmt.fsmgmt:INFO     ... soaked-MDT0002 failed over
      2022-04-15 00:36:42,914:fsmgmt.fsmgmt:INFO     Wait for recovery to complete
      2022-04-15 00:36:43,502:fsmgmt.fsmgmt:DEBUG    Recovery Result Record: {'soak-11': {'soaked-MDT0003': 'COMPLETE', 'soaked-MDT0002': 'WAITING'}}
      2022-04-15 00:36:43,503:fsmgmt.fsmgmt:INFO     soaked-MDT0002 in status 'WAITING'.
      

      On soak-11:

      [136090.466173] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages
      [136538.516253] Lustre: soaked-MDT0003: Client ec0f61fb-4e34-4a0b-b633-b84f3e77a68b (at 192.168.1.138@o2ib) reconnecting
      [136538.528122] Lustre: Skipped 8 previous similar messages
      [136570.792064] LustreError: 43760:0:(llog.c:781:llog_process_thread()) soaked-MDT0001-osp-MDT0002 retry remote llog process
      [136570.804339] LustreError: 43760:0:(llog.c:781:llog_process_thread()) Skipped 36771 previous similar messages
      [136690.448382] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reached hard limit: 900, extend: 1
      [136690.462972] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages
      [137139.504191] Lustre: soaked-MDT0003: Client 6734fd00-3e57-422d-a9ca-e8a57bcfd8f1 (at 192.168.1.135@o2ib) reconnecting
      [137139.516055] Lustre: Skipped 10 previous similar messages
      [137170.777941] LustreError: 43760:0:(llog.c:781:llog_process_thread()) soaked-MDT0001-osp-MDT0002 retry remote llog process
      [137170.790212] LustreError: 43760:0:(llog.c:781:llog_process_thread()) Skipped 36789 previous similar messages
      [137180.761466] Lustre: mdt00_016: service thread pid 12073 was inactive for 200.281 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [137180.782540] Pid: 12073, comm: mdt00_016 3.10.0-1160.49.1.el7_lustre.x86_64 #1 SMP Mon Dec 13 22:00:59 UTC 2021
      [137180.793815] Call Trace:
      [137180.796783] [<0>] top_trans_stop+0x882/0xfa0 [ptlrpc]
      [137180.802560] [<0>] lod_trans_stop+0x25c/0x340 [lod]
      [137180.808065] [<0>] mdd_trans_stop+0x2e/0x174 [mdd]
      [137180.813433] [<0>] mdd_create+0x154a/0x1cd0 [mdd]
      [137180.818740] [<0>] mdo_create+0x46/0x48 [mdt]
      [137180.823632] [<0>] mdt_create+0xab1/0xe40 [mdt]
      [137180.828720] [<0>] mdt_reint_create+0x3a0/0x460 [mdt]
      [137180.834388] [<0>] mdt_reint_rec+0x8a/0x240 [mdt]
      [137180.839662] [<0>] mdt_reint_internal+0x76c/0xb50 [mdt]
      [137180.845516] [<0>] mdt_reint+0x67/0x150 [mdt]
      [137180.850456] [<0>] tgt_request_handle+0x92f/0x19c0 [ptlrpc]
      [137180.856729] [<0>] ptlrpc_server_handle_request+0x253/0xc30 [ptlrpc]
      [137180.863875] [<0>] ptlrpc_main+0xbf4/0x15e0 [ptlrpc]
      [137180.869420] [<0>] kthread+0xd1/0xe0
      [137180.873423] [<0>] ret_from_fork_nospec_begin+0x21/0x21
      [137180.879286] [<0>] 0xfffffffffffffffe
      [137290.446156] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reached hard limit: 900, extend: 1
      [137290.460730] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages
      [137575.430810] Lustre: 7111:0:(service.c:1436:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/5), not sending early reply  req@ffff8f5f6cb94380 x1729527267995904/t176094400030(0) o36->f5a20605-7e7a-45f1-bf01-daf0664944ec@192.168.1.118@o2ib:514/0 lens 552/448 e 24 to 0 dl 1649907299 ref 2 fl Interpret:/0/0 rc 0/0 job:''
      [137740.477658] Lustre: soaked-MDT0003: Client ec0f61fb-4e34-4a0b-b633-b84f3e77a68b (at 192.168.1.138@o2ib) reconnecting
      [137740.489550] Lustre: Skipped 9 previous similar messages
      [137770.755234] LustreError: 43760:0:(llog.c:781:llog_process_thread()) soaked-MDT0001-osp-MDT0002 retry remote llog process
      [137770.767489] LustreError: 43760:0:(llog.c:781:llog_process_thread()) Skipped 36738 previous similar messages
      [137890.442958] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) soaked-MDT0002: extended recovery timer reached hard limit: 900, extend: 1
      [137890.457533] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages
      [138341.465763] Lustre: soaked-MDT0003: Client 6734fd00-3e57-422d-a9ca-e8a57bcfd8f1 (at 192.168.1.135@o2ib) reconnecting
      [138341.477619] Lustre: Skipped 11 previous similar messages
      

      Had to run abort_recovery:

      [188290.202475] Lustre: 43762:0:(ldlm_lib.c:1965:extend_recovery_timer()) Skipped 29 previous similar messages
      [188327.367445] LustreError: 37182:0:(mdt_handler.c:7419:mdt_iocontrol()) soaked-MDT0002: Aborting recovery for device
      [188327.379129] LustreError: 37182:0:(ldlm_lib.c:2876:target_stop_recovery_thread()) soaked-MDT0002: Aborting recovery
      [188327.390814] Lustre: 43762:0:(ldlm_lib.c:2283:target_recovery_overseer()) recovery is aborted, evict exports in recovery
      [188327.392289] LustreError: 43760:0:(lod_dev.c:424:lod_sub_recovery_thread()) soaked-MDT0001-osp-MDT0002 get update log failed: rc = -11
      [188327.418009] Lustre: 43762:0:(ldlm_lib.c:2273:target_recovery_overseer()) soaked-MDT0002 recovery is aborted by hard timeout
      [188327.486462] Lustre: soaked-MDT0002-osp-MDT0003: Connection restored to 192.168.1.111@o2ib (at 0@lo)
      [188327.487906] Lustre: soaked-MDT0002: Recovery over after 2739:44, of 22 clients 22 recovered and 0 were evicted.
      [188345.146528] Lustre: Failing over soaked-MDT0002
      [188345.224943] LustreError: 3633:0:(mdt_reint.c:2741:mdt_reint_rename()) soaked-MDT0002: cannot lock for rename: rc = -5
      [188345.236924] LustreError: 3633:0:(mdt_reint.c:2741:mdt_reint_rename()) Skipped 1 previous similar message
      [188345.238945] Lustre: soaked-MDT0002: Not available for connect from 192.168.1.126@o2ib (stopping)
      [188345.238946] Lustre: Skipped 2 previous similar messages
      

      Attachments

        1. lu-14495-soak-8debug
          15.14 MB
        2. lu-14495-soak-8dmesg
          381 kB
        3. lu-14495-soak-8trace
          994 kB
        4. lustre-log.1651263727.22864
          102.86 MB
        5. lustre-log.1651732904.9736
          146.86 MB
        6. soak-10-debuglog
          133.07 MB
        7. soak-10-umount-hang-trace
          1016 kB
        8. soak-8dmesg
          337 kB
        9. soak-8trace
          995 kB

        Issue Links

          Activity

            [LU-15761] cannot finish MDS recovery
            sarah Sarah Liu added a comment -

            bzzz, please check the log lustre-log.1651732904.9736; this is with full debug. Hope it helps.

            [33368.708107] LustreError: 8542:0:(llog_cat.c:789:llog_cat_cancel_records()) Skipped 343 previous similar messages
            [33384.336498] Lustre: soaked-MDT0003: haven't heard from client soaked-MDT0003-lwp-OST0005_UUID (at 192.168.1.105@o2ib) in 227 seconds. I think it's dead, and I am evicting it. exp ffff916a71261400, cur 1651732900 expire 1651732750 last 1651732673
            [33384.360772] Lustre: Skipped 3 previous similar messages
            [33388.209816] LustreError: 9736:0:(llog_osd.c:626:llog_osd_write_rec()) soaked-MDT0001-osp-MDT0003: index 40453 already set in llog bitmap [0x240000403:0x3:0x0]
            [33388.225651] LustreError: 9736:0:(llog_osd.c:628:llog_osd_write_rec()) LBUG
            [33388.233370] Pid: 9736, comm: mdt00_011 3.10.0-1160.49.1.el7_lustre.x86_64 #1 SMP Sun Apr 3 16:20:30 UTC 2022
            [33388.244363] Call Trace:
            [33388.247123] [<0>] libcfs_call_trace+0x90/0xf0 [libcfs]
            [33388.252872] [<0>] lbug_with_loc+0x4c/0xa0 [libcfs]
            [33388.258268] [<0>] llog_osd_write_rec+0x1a3a/0x1a70 [obdclass]
            [33388.264714] [<0>] llog_write_rec+0x293/0x590 [obdclass]
            [33388.270573] [<0>] llog_cat_add_rec+0x1d9/0x980 [obdclass]
            [33388.276617] [<0>] llog_add+0x182/0x1f0 [obdclass]
            [33388.281964] [<0>] sub_updates_write+0x302/0xe3b [ptlrpc]
            [33388.287970] [<0>] top_trans_stop+0x4a2/0xfa0 [ptlrpc]
            [33388.293652] [<0>] lod_trans_stop+0x25c/0x340 [lod]
            [33388.299032] [<0>] mdd_trans_stop+0x2e/0x174 [mdd]
            [33388.304308] [<0>] mdd_create+0x154a/0x1cd0 [mdd]
            [33388.309491] [<0>] mdo_create+0x46/0x48 [mdt]
            [33388.314301] [<0>] mdt_create+0xab1/0xe40 [mdt]
            [33388.319269] [<0>] mdt_reint_create+0x3a0/0x460 [mdt]
            [33388.324829] [<0>] mdt_reint_rec+0x8a/0x240 [mdt]
            [33388.329993] [<0>] mdt_reint_internal+0x76c/0xb50 [mdt]
            [33388.335733] [<0>] mdt_reint+0x67/0x150 [mdt]
            [33388.340562] [<0>] tgt_request_handle+0x92f/0x19c0 [ptlrpc]
            [33388.346728] [<0>] ptlrpc_server_handle_request+0x253/0xc30 [ptlrpc]
            [33388.353774] [<0>] ptlrpc_main+0xbf4/0x15e0 [ptlrpc]
            [33388.359226] [<0>] kthread+0xd1/0xe0
            [33388.363125] [<0>] ret_from_fork_nospec_begin+0x21/0x21
            [33388.368893] [<0>] 0xfffffffffffffffe
            [33388.373074] LustreError: dumping log to /tmp/lustre-log.1651732904.9736
            
            sarah Sarah Liu added a comment -

            Okay, I will restart with full debug.


            bzzz Alex Zhuravlev added a comment -

            I don't understand where this problem comes from:

            [ 5814.198337] LustreError: 12981:0:(out_handler.c:910:out_tx_end()) soaked-MDT0000-osd: undo for /tmp/rpmbuild-lustre-jenkins-ZEMyr9OQ/BUILD/lustre-2.15.0_RC3_3_gf161c9d/lustre/ptlrpc/../../lustre/target/out_handler.c:445: rc = -524

            I want to debug this. Would it be possible to run the test with full debug enabled and dump the Lustre debug log upon this message?


            bzzz Alex Zhuravlev added a comment -

            Thanks for the update. I'm going to check the logs; in the meantime, can you please restart?

            sarah Sarah Liu added a comment - edited

            A quick update on soak: it didn't hit the umount issue during failover, but it did hit LU-14495 on MDS0. The test is still ongoing since panic-on-lbug is disabled. I will upload the trace and debug log here since LU-14495 is closed.

            sarah Sarah Liu added a comment -

            Hi Andreas, last time the failure occurred after 6-7 hours of running; so far soak has been up for over 10 hours and is doing well.


            bzzz Alex Zhuravlev added a comment -

            Sarah replied on Skype: I have loaded soak with it yesterday, will check the status a little later and let you know


            adilger Andreas Dilger added a comment -

            Sarah, are you able to restart Soak with Alex's patch on master? You were previously able to hit this issue easily, so hopefully it will tell us quickly if the patch fixes it.


            adilger Andreas Dilger added a comment -

            This may be related to several llog-related test failures that were previously reported, all of them with the "retry remote llog process" message printed.

            Looking at the callers of llog_cat_refresh(), it looks like this could happen in combination with patch https://review.whamcloud.com/40742 "LU-13974 llog: check stale osp object", landed in 2.15.0, which returns -ESTALE more frequently:

            int llog_cat_declare_add_rec()
            {
                    :
            start:
                    down_read_nested(&cathandle->lgh_lock, LLOGH_CAT);
                    :  
                    rc = llog_declare_write_rec(env, cathandle->u.chd.chd_current_log,
                                                rec, -1, th);
                    if (rc == -ESTALE && dt_object_remote(cathandle->lgh_obj)) {
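                            /* the remote llog object is stale: drop the read
                             * lock, refresh the catalog handle, and retry */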
                            up_read(&cathandle->lgh_lock);
                            rc = llog_cat_refresh(env, cathandle);
                            if (rc)
                                    RETURN(rc);
                            goto start;
                    }
            

            adilger Andreas Dilger added a comment -

            It looks like this locking imbalance was introduced by patch https://review.whamcloud.com/33401 "LU-11418 llog: refresh remote llog upon -ESTALE".

            The very strange thing is that this patch landed in 2.12.0-RC1, yet the problem was not noticed until now. It is possible that some other recent landing has made this codepath more likely to be hit, because otherwise it seems like the code would always have been broken.
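
            To make the locking-imbalance failure mode concrete, here is a minimal, generic userspace sketch of the refresh-and-retry pattern quoted above around a read/write lock. It is an illustration only, not the actual Lustre code: the names (cat_lock, refresh_catalog, declare_add_rec) are made up, and a pthread rwlock stands in for the kernel rw_semaphore. The invariant it demonstrates: once the -ESTALE branch drops the read lock to call a refresh helper that takes the same lock for write, every exit from that branch must leave the lock in exactly the state the surrounding code expects, otherwise one path releases, or keeps holding, the lock one time too many.

            #include <errno.h>
            #include <pthread.h>
            #include <stdio.h>

            static pthread_rwlock_t cat_lock = PTHREAD_RWLOCK_INITIALIZER;

            /* stand-in for the refresh helper: takes and drops the write lock */
            static int refresh_catalog(void)
            {
                    pthread_rwlock_wrlock(&cat_lock);
                    /* ... re-read the catalog/log headers here ... */
                    pthread_rwlock_unlock(&cat_lock);
                    return 0;
            }

            /* stand-in for the declare-add path quoted above */
            static int declare_add_rec(int simulate_stale)
            {
                    int rc;

            start:
                    pthread_rwlock_rdlock(&cat_lock);
                    rc = simulate_stale ? -ESTALE : 0;
                    simulate_stale = 0;

                    if (rc == -ESTALE) {
                            /* the read lock must be dropped before calling the
                             * refresh helper, which takes the same lock for
                             * write; the error return below must NOT unlock
                             * again, because the lock is already released here */
                            pthread_rwlock_unlock(&cat_lock);
                            rc = refresh_catalog();
                            if (rc)
                                    return rc;
                            goto start;     /* re-takes the read lock above */
                    }

                    /* ... declare the record write under the read lock ... */
                    pthread_rwlock_unlock(&cat_lock);
                    return rc;
            }

            int main(void)
            {
                    printf("rc = %d\n", declare_add_rec(1));
                    return 0;
            }

            In the ticket the corresponding lock is cathandle->lgh_lock and the refresh helper is llog_cat_refresh(), which is what patch https://review.whamcloud.com/47185 ("fix locking in llog_cat_refresh()", in the Gerrit comment below) addresses.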


            "Alex Zhuravlev <bzzz@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/47185
            Subject: LU-15761 obdclass: fix locking in llog_cat_refresh()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: f161c9dd6fcc11dbb044225a05d6d283d665424f


            People

              Assignee: tappro Mikhail Pershin
              Reporter: sarah Sarah Liu
              Votes: 0
              Watchers: 13

              Dates

                Created:
                Updated:
                Resolved: