LU-13599

LustreError: 30166:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs->rs_nlocks < 8 ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.6
    • Affects Version/s: Lustre 2.12.4
    • Labels: None
    • Severity: 3

    Description

      We hit the following crash last night on one of the MDSes of Fir, running Lustre 2.12.4. The same problem occurred after re-mounting the MDT and going through recovery. I had to kill the robinhood client that was running a purge but also an MDT-to-MDT migration (a single lfs migrate -m 0). After that I was able to remount this MDT. It looks a bit like LU-5185. Unfortunately, we lost the vmcore this time, but if it happens again I'll let you know and attach it.

      [1579975.369592] Lustre: fir-MDT0003: haven't heard from client 6fb18a53-0376-4 (at 10.50.6.54@o2ib2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9150778e2400, cur 1590281140 expire 1590280990 last 1590280913
      [1580039.870825] Lustre: fir-MDT0003: Connection restored to b7b3778b-82b4-4 (at 10.50.6.54@o2ib2)
      [1600137.924189] Lustre: fir-MDT0003: haven't heard from client ed7f1c7c-f5de-4 (at 10.50.4.54@o2ib2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff913f6360b800, cur 1590301302 expire 1590301152 last 1590301075
      [1619876.303011] Lustre: fir-MDT0003: Connection restored to 796a800c-02e4-4 (at 10.49.20.10@o2ib1)
      [1639768.911234] Lustre: fir-MDT0003: Connection restored to 83415c02-51ff-4 (at 10.49.20.5@o2ib1)
      [1639821.000702] Lustre: fir-MDT0003: haven't heard from client 83415c02-51ff-4 (at 10.49.20.5@o2ib1) in 227 seconds. I think it's dead, and I am evicting it. exp ffff913f32f56800, cur 1590340984 expire 1590340834 last 1590340757
      [1647672.215034] Lustre: fir-MDT0003: haven't heard from client 19e3d49f-43e4-4 (at 10.50.9.37@o2ib2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff913f6240bc00, cur 1590348835 expire 1590348685 last 1590348608
      [1647717.069200] Lustre: fir-MDT0003: Connection restored to a4c7b337-bfab-4 (at 10.50.9.37@o2ib2)
      [1667613.717650] Lustre: fir-MDT0003: haven't heard from client 20e68a82-bbdb-4 (at 10.50.6.54@o2ib2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff914d1725c400, cur 1590368776 expire 1590368626 last 1590368549
      [1667717.713398] Lustre: fir-MDT0003: Connection restored to b7b3778b-82b4-4 (at 10.50.6.54@o2ib2)
      [1692403.249073] LustreError: 30166:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs->rs_nlocks < 8 ) failed: 
      [1692403.258985] LustreError: 30166:0:(service.c:189:ptlrpc_save_lock()) LBUG
      [1692403.265867] Pid: 30166, comm: mdt00_002 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
      [1692403.276224] Call Trace:
      [1692403.278866]  [<ffffffffc0aff7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [1692403.285611]  [<ffffffffc0aff87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [1692403.292002]  [<ffffffffc0fb8851>] ptlrpc_save_lock+0xc1/0xd0 [ptlrpc]
      [1692403.298695]  [<ffffffffc14b4bab>] mdt_save_lock+0x20b/0x360 [mdt]
      [1692403.305003]  [<ffffffffc14b4d5c>] mdt_object_unlock+0x5c/0x3c0 [mdt]
      [1692403.311572]  [<ffffffffc14b82e7>] mdt_object_unlock_put+0x17/0x120 [mdt]
      [1692403.318479]  [<ffffffffc150c4fc>] mdt_unlock_list+0x54/0x174 [mdt]
      [1692403.324876]  [<ffffffffc14d3fd3>] mdt_reint_migrate+0xa03/0x1310 [mdt]
      [1692403.331619]  [<ffffffffc14d4963>] mdt_reint_rec+0x83/0x210 [mdt]
      [1692403.337841]  [<ffffffffc14b1273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      [1692403.344586]  [<ffffffffc14bc6e7>] mdt_reint+0x67/0x140 [mdt]
      [1692403.350463]  [<ffffffffc101c64a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      [1692403.357586]  [<ffffffffc0fbf43b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [1692403.365483]  [<ffffffffc0fc2da4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      [1692403.371991]  [<ffffffffb98c2e81>] kthread+0xd1/0xe0
      [1692403.377079]  [<ffffffffb9f77c24>] ret_from_fork_nospec_begin+0xe/0x21
      [1692403.383725]  [<ffffffffffffffff>] 0xffffffffffffffff
      [1692403.388918] Kernel panic - not syncing: LBUG
      [1692403.393363] CPU: 44 PID: 30166 Comm: mdt00_002 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1
      [1692403.406214] Hardware name: Dell Inc. PowerEdge R6415/07YXFK, BIOS 1.10.6 08/15/2019
      [1692403.414040] Call Trace:
      [1692403.416670]  [<ffffffffb9f65147>] dump_stack+0x19/0x1b
      [1692403.421981]  [<ffffffffb9f5e850>] panic+0xe8/0x21f
      [1692403.426954]  [<ffffffffc0aff8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
      [1692403.433351]  [<ffffffffc0fb8851>] ptlrpc_save_lock+0xc1/0xd0 [ptlrpc]
      [1692403.439978]  [<ffffffffc14b4bab>] mdt_save_lock+0x20b/0x360 [mdt]
      [1692403.446259]  [<ffffffffc14b4d5c>] mdt_object_unlock+0x5c/0x3c0 [mdt]
      [1692403.452796]  [<ffffffffc14b82e7>] mdt_object_unlock_put+0x17/0x120 [mdt]
      [1692403.459687]  [<ffffffffc150c4fc>] mdt_unlock_list+0x54/0x174 [mdt]
      [1692403.466057]  [<ffffffffc14d3fd3>] mdt_reint_migrate+0xa03/0x1310 [mdt]
      [1692403.472794]  [<ffffffffc0d3cfa9>] ? check_unlink_entry+0x19/0xd0 [obdclass]
      [1692403.479942]  [<ffffffffc14d4963>] mdt_reint_rec+0x83/0x210 [mdt]
      [1692403.486134]  [<ffffffffc14b1273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      [1692403.492843]  [<ffffffffc14bc6e7>] mdt_reint+0x67/0x140 [mdt]
      [1692403.498721]  [<ffffffffc101c64a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      [1692403.505806]  [<ffffffffc0ff40b1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
      [1692403.513553]  [<ffffffffc0affbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
      [1692403.520811]  [<ffffffffc0fbf43b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [1692403.528673]  [<ffffffffc0fbb565>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
      [1692403.535632]  [<ffffffffb98cfeb4>] ? __wake_up+0x44/0x50
      [1692403.541065]  [<ffffffffc0fc2da4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      [1692403.547537]  [<ffffffffc0fc2270>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [1692403.555106]  [<ffffffffb98c2e81>] kthread+0xd1/0xe0
      [1692403.560158]  [<ffffffffb98c2db0>] ? insert_kthread_work+0x40/0x40
      [1692403.566424]  [<ffffffffb9f77c24>] ret_from_fork_nospec_begin+0xe/0x21
      [1692403.573037]  [<ffffffffb98c2db0>] ? insert_kthread_work+0x40/0x40
      [root@fir-md1-s4 127.0.0.1-2020-05-25-00:59:34]# 
      Message from syslogd@fir-md1-s4 at May 25 09:46:42 ...
       kernel:LustreError: 29280:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs->rs_nlocks < 8 ) failed: 
      
      Message from syslogd@fir-md1-s4 at May 25 09:46:42 ...
       kernel:LustreError: 29280:0:(service.c:189:ptlrpc_save_lock()) LBUG
      
      May 25 09:55:43 fir-md1-s1 kernel: Lustre: fir-MDT0003: Recovery over after 2:45, of 1302 clients 1301 recovered and 1 was evicted.
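
      For context on the assertion and call trace above: ptlrpc_save_lock() stashes a lock handle in the request's reply state so the lock is only dropped once the reply has been acknowledged, and the reply state has room for a small, fixed number of saved locks (8 in this build, per the "rs->rs_nlocks < 8" check). The migrate path in the trace (mdt_reint_migrate -> mdt_unlock_list -> mdt_object_unlock -> mdt_save_lock -> ptlrpc_save_lock) saves one handle per unlocked object into the same reply state, so a long enough lock list overruns that budget and trips the LBUG. Below is a minimal user-space sketch of that invariant; reply_state, lock_handle and save_lock() here are simplified stand-ins for illustration, not the actual ptlrpc definitions.

      #include <assert.h>
      #include <stdio.h>

      #define RS_MAX_LOCKS 8                       /* matches the "< 8" in the assertion */

      struct lock_handle {
              unsigned long cookie;
      };

      struct reply_state {
              int                rs_nlocks;                /* locks saved so far */
              struct lock_handle rs_locks[RS_MAX_LOCKS];   /* fixed-size budget */
      };

      /* Rough analogue of ptlrpc_save_lock(): keep a granted lock referenced by
       * the reply so it is only released after the reply is acked.  When the
       * array is already full, the real code hits the LASSERT seen above. */
      static void save_lock(struct reply_state *rs, struct lock_handle lh)
      {
              assert(rs->rs_nlocks < RS_MAX_LOCKS);        /* the failing assertion */
              rs->rs_locks[rs->rs_nlocks++] = lh;
      }

      int main(void)
      {
              struct reply_state rs = { 0 };

              /* An MDT reint handler that unlocks a whole list of objects (as the
               * migrate path does via mdt_unlock_list()) may save one handle per
               * object into the same reply state; the 9th call aborts here, just
               * as the server LBUGs in the trace above. */
              for (int i = 0; i < 9; i++) {
                      struct lock_handle lh = { .cookie = 0x1000UL + i };
                      save_lock(&rs, lh);
                      printf("saved lock %d of %d\n", rs.rs_nlocks, RS_MAX_LOCKS);
              }
              return 0;
      }

      The patch subjects in the activity below ("fix logic of skipping local locks in reply_state", "fix mti_big_lmm buffer usage") suggest the fix changes how the migrate path fills the reply state rather than raising the 8-lock limit, but see the linked Gerrit changes for the authoritative details.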
      

      Attachments

        Issue Links

          Activity

            [LU-13599] LustreError: 30166:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs->rs_nlocks < 8 ) failed
            pjones Peter Jones added a comment -

            Landed for 2.12.6. Fixed on master as part of a larger change (LU-11025)


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39194/
            Subject: LU-13599 mdt: add test for rs_lock limit exceeding
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 3448cdc16e361d2504f2f5b0982c92d7a0de933d


            sthiell Stephane Thiell added a comment -

            Thanks y'all!


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39521/
            Subject: LU-13599 mdt: fix mti_big_lmm buffer usage
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: b09b533b6f443c359e671e7b65208355d5c201dd


            sthiell Stephane Thiell added a comment -

            Just checking regarding https://review.whamcloud.com/#/c/39521/

            This patch is critical to avoid MDS crashes and has not landed into b2_12 yet.

            Thanks!


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39191/
            Subject: LU-13599 mdt: fix logic of skipping local locks in reply_state
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: dec36101852a8d300a6fdcc28c8d723989544aaa

            pjones Peter Jones added a comment -

            ofaaland I think that we're all set now - just need things to land

            ofaaland Olaf Faaland added a comment - edited

            Hi Peter or Mike,
            Can you talk appropriate folks into reviewing the patches? Thanks


            sthiell Stephane Thiell added a comment -

            Hi Mike,

            I'm glad to report that all 4 MDSes on Fir now have the two patches and so far have been running without any issue, even with multiple parallel lfs migrate -m running on a client. I'll let you know if I see any issue, but it's very promising!

            Thanks so much!

            sthiell Stephane Thiell added a comment - edited

            Thanks Mike!
            I've applied the two patches on one of our MDSes (on top of 2.12.5), which is now running while some MDT-to-MDT migrations are going on. Will keep you posted!

            $ git log --oneline | head -3
            8ac362a LU-13599 mdt: fix mti_big_lmm buffer usage
            1324114 LU-13599 mdt: fix logic of skipping local locks in reply_state
            78d712a New release 2.12.5
            

            People

              Assignee: tappro Mikhail Pershin
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 8

              Dates

                Created:
                Updated:
                Resolved: