Lustre / LU-13599

LustreError: 30166:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs->rs_nlocks < 8 ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.14.0, Lustre 2.12.6
    • Affects Version/s: Lustre 2.12.4
    • Labels: None
    • Severity: 3

    Description

      We hit the following crash on one of the MDSes of Fir last night, running Lustre 2.12.4. The same problem occurred after re-mounting the MDT and recovery. I had to kill the robinhood client that was running a purge as well as an MDT-to-MDT migration (a single lfs migrate -m 0); then I was able to remount this MDT. It looks a bit like LU-5185. Unfortunately, we lost the vmcore this time, but if it happens again, I'll let you know and will attach it.

      [1579975.369592] Lustre: fir-MDT0003: haven't heard from client 6fb18a53-0376-4 (at 10.50.6.54@o2ib2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9150778e2400, cur 1590281140 expire 1590280990 last 1590280913
      [1580039.870825] Lustre: fir-MDT0003: Connection restored to b7b3778b-82b4-4 (at 10.50.6.54@o2ib2)
      [1600137.924189] Lustre: fir-MDT0003: haven't heard from client ed7f1c7c-f5de-4 (at 10.50.4.54@o2ib2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff913f6360b800, cur 1590301302 expire 1590301152 last 1590301075
      [1619876.303011] Lustre: fir-MDT0003: Connection restored to 796a800c-02e4-4 (at 10.49.20.10@o2ib1)
      [1639768.911234] Lustre: fir-MDT0003: Connection restored to 83415c02-51ff-4 (at 10.49.20.5@o2ib1)
      [1639821.000702] Lustre: fir-MDT0003: haven't heard from client 83415c02-51ff-4 (at 10.49.20.5@o2ib1) in 227 seconds. I think it's dead, and I am evicting it. exp ffff913f32f56800, cur 1590340984 expire 1590340834 last 1590340757
      [1647672.215034] Lustre: fir-MDT0003: haven't heard from client 19e3d49f-43e4-4 (at 10.50.9.37@o2ib2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff913f6240bc00, cur 1590348835 expire 1590348685 last 1590348608
      [1647717.069200] Lustre: fir-MDT0003: Connection restored to a4c7b337-bfab-4 (at 10.50.9.37@o2ib2)
      [1667613.717650] Lustre: fir-MDT0003: haven't heard from client 20e68a82-bbdb-4 (at 10.50.6.54@o2ib2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff914d1725c400, cur 1590368776 expire 1590368626 last 1590368549
      [1667717.713398] Lustre: fir-MDT0003: Connection restored to b7b3778b-82b4-4 (at 10.50.6.54@o2ib2)
      [1692403.249073] LustreError: 30166:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs->rs_nlocks < 8 ) failed: 
      [1692403.258985] LustreError: 30166:0:(service.c:189:ptlrpc_save_lock()) LBUG
      [1692403.265867] Pid: 30166, comm: mdt00_002 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
      [1692403.276224] Call Trace:
      [1692403.278866]  [<ffffffffc0aff7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [1692403.285611]  [<ffffffffc0aff87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [1692403.292002]  [<ffffffffc0fb8851>] ptlrpc_save_lock+0xc1/0xd0 [ptlrpc]
      [1692403.298695]  [<ffffffffc14b4bab>] mdt_save_lock+0x20b/0x360 [mdt]
      [1692403.305003]  [<ffffffffc14b4d5c>] mdt_object_unlock+0x5c/0x3c0 [mdt]
      [1692403.311572]  [<ffffffffc14b82e7>] mdt_object_unlock_put+0x17/0x120 [mdt]
      [1692403.318479]  [<ffffffffc150c4fc>] mdt_unlock_list+0x54/0x174 [mdt]
      [1692403.324876]  [<ffffffffc14d3fd3>] mdt_reint_migrate+0xa03/0x1310 [mdt]
      [1692403.331619]  [<ffffffffc14d4963>] mdt_reint_rec+0x83/0x210 [mdt]
      [1692403.337841]  [<ffffffffc14b1273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      [1692403.344586]  [<ffffffffc14bc6e7>] mdt_reint+0x67/0x140 [mdt]
      [1692403.350463]  [<ffffffffc101c64a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      [1692403.357586]  [<ffffffffc0fbf43b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [1692403.365483]  [<ffffffffc0fc2da4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      [1692403.371991]  [<ffffffffb98c2e81>] kthread+0xd1/0xe0
      [1692403.377079]  [<ffffffffb9f77c24>] ret_from_fork_nospec_begin+0xe/0x21
      [1692403.383725]  [<ffffffffffffffff>] 0xffffffffffffffff
      [1692403.388918] Kernel panic - not syncing: LBUG
      [1692403.393363] CPU: 44 PID: 30166 Comm: mdt00_002 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1
      [1692403.406214] Hardware name: Dell Inc. PowerEdge R6415/07YXFK, BIOS 1.10.6 08/15/2019
      [1692403.414040] Call Trace:
      [1692403.416670]  [<ffffffffb9f65147>] dump_stack+0x19/0x1b
      [1692403.421981]  [<ffffffffb9f5e850>] panic+0xe8/0x21f
      [1692403.426954]  [<ffffffffc0aff8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
      [1692403.433351]  [<ffffffffc0fb8851>] ptlrpc_save_lock+0xc1/0xd0 [ptlrpc]
      [1692403.439978]  [<ffffffffc14b4bab>] mdt_save_lock+0x20b/0x360 [mdt]
      [1692403.446259]  [<ffffffffc14b4d5c>] mdt_object_unlock+0x5c/0x3c0 [mdt]
      [1692403.452796]  [<ffffffffc14b82e7>] mdt_object_unlock_put+0x17/0x120 [mdt]
      [1692403.459687]  [<ffffffffc150c4fc>] mdt_unlock_list+0x54/0x174 [mdt]
      [1692403.466057]  [<ffffffffc14d3fd3>] mdt_reint_migrate+0xa03/0x1310 [mdt]
      [1692403.472794]  [<ffffffffc0d3cfa9>] ? check_unlink_entry+0x19/0xd0 [obdclass]
      [1692403.479942]  [<ffffffffc14d4963>] mdt_reint_rec+0x83/0x210 [mdt]
      [1692403.486134]  [<ffffffffc14b1273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      [1692403.492843]  [<ffffffffc14bc6e7>] mdt_reint+0x67/0x140 [mdt]
      [1692403.498721]  [<ffffffffc101c64a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      [1692403.505806]  [<ffffffffc0ff40b1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
      [1692403.513553]  [<ffffffffc0affbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
      [1692403.520811]  [<ffffffffc0fbf43b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [1692403.528673]  [<ffffffffc0fbb565>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
      [1692403.535632]  [<ffffffffb98cfeb4>] ? __wake_up+0x44/0x50
      [1692403.541065]  [<ffffffffc0fc2da4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      [1692403.547537]  [<ffffffffc0fc2270>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [1692403.555106]  [<ffffffffb98c2e81>] kthread+0xd1/0xe0
      [1692403.560158]  [<ffffffffb98c2db0>] ? insert_kthread_work+0x40/0x40
      [1692403.566424]  [<ffffffffb9f77c24>] ret_from_fork_nospec_begin+0xe/0x21
      [1692403.573037]  [<ffffffffb98c2db0>] ? insert_kthread_work+0x40/0x40
      [root@fir-md1-s4 127.0.0.1-2020-05-25-00:59:34]# 
      Message from syslogd@fir-md1-s4 at May 25 09:46:42 ...
       kernel:LustreError: 29280:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs->rs_nlocks < 8 ) failed: 
      
      Message from syslogd@fir-md1-s4 at May 25 09:46:42 ...
       kernel:LustreError: 29280:0:(service.c:189:ptlrpc_save_lock()) LBUG
      
      May 25 09:55:43 fir-md1-s1 kernel: Lustre: fir-MDT0003: Recovery over after 2:45, of 1302 clients 1301 recovered and 1 was evicted.
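
      For context: ptlrpc_save_lock() stashes a server-side lock handle in the RPC reply state so the lock is only released once the client has acknowledged the reply. The reply state carries a fixed array of RS_MAX_LOCKS (8) slots, and the assertion fires when a single request tries to save a ninth lock, as the migrate path in the stack trace above can do when unlocking a long list of objects. A simplified sketch of the assertion site, abridged from the 2.12-era lustre/ptlrpc/service.c (exact signature and bookkeeping vary between versions):

        /* lustre/ptlrpc/service.c, abridged sketch */
        void ptlrpc_save_lock(struct ptlrpc_request *req, struct lustre_handle *lock,
                              int mode, bool no_ack, bool convert_lock)
        {
                struct ptlrpc_reply_state *rs = req->rq_reply_state;
                int idx;

                LASSERT(rs != NULL);
                /* rs_locks[] has RS_MAX_LOCKS == 8 entries; the ninth save LBUGs */
                LASSERT(rs->rs_nlocks < RS_MAX_LOCKS);

                idx = rs->rs_nlocks++;
                rs->rs_locks[idx] = *lock;      /* remember the handle ...  */
                rs->rs_modes[idx] = mode;       /* ... and its LDLM mode    */
                rs->rs_no_ack = no_ack;         /* release without rep-ack? */
        }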
      


          Activity


            sthiell Stephane Thiell added a comment -

            Just checking regarding https://review.whamcloud.com/#/c/39521/

            This patch is critical to avoid MDS crashes and has not landed in b2_12 yet.

            Thanks!

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39191/
            Subject: LU-13599 mdt: fix logic of skipping local locks in reply_state
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: dec36101852a8d300a6fdcc28c8d723989544aaa
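
            The patch subject refers to mdt_save_lock(), the frame right above ptlrpc_save_lock() in the stack trace: it decides whether a lock taken while handling a request is dropped immediately or parked in the reply state until the client acks the reply. A simplified sketch of that decision (condition abridged, not the literal patch; the fix tightens the logic so that local-only locks, such as those mdt_unlock_list() releases per migrated object, no longer consume rs_locks[] slots):

              /* lustre/mdt/mdt_handler.c, simplified sketch */
              static void mdt_save_lock(struct mdt_thread_info *info,
                                        struct lustre_handle *h,
                                        enum ldlm_mode mode, int decref)
              {
                      if (!lustre_handle_is_used(h))
                              return;

                      if (decref || !info->mti_has_trans) {
                              /* Explicit decref, or no transaction the client
                               * will see: release now, nothing to save. */
                              mdt_fid_unlock(h, mode);
                      } else {
                              /* Lock covers a transactional update: keep it
                               * until the reply is acknowledged.  Each call
                               * here consumes one of the eight rs_locks[]
                               * slots checked by the assertion. */
                              ptlrpc_save_lock(mdt_info_req(info), h, mode,
                                               /* no_ack */ false,
                                               /* convert_lock */ false);
                      }
                      h->cookie = 0ull;
              }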

            pjones Peter Jones added a comment -

            ofaaland I think that we're all set now - just need things to land

            ofaaland Olaf Faaland added a comment - edited

            Hi Peter or Mike,
            Can you talk the appropriate folks into reviewing the patches? Thanks

            sthiell Stephane Thiell added a comment -

            Hi Mike,

            I'm glad to report that all 4 MDSes on Fir now have the two patches and so far have been running without any issue, even with multiple parallel lfs migrate -m commands running on a client. I'll let you know if I see any issue, but it's very promising!

            Thanks so much!

            sthiell Stephane Thiell added a comment - edited

            Thanks Mike!
            I've applied the two patches on one of our MDSes (on top of 2.12.5), which is running now while some MDT-to-MDT migrations are going on. Will keep you posted!

            $ git log --oneline | head -3
            8ac362a LU-13599 mdt: fix mti_big_lmm buffer usage
            1324114 LU-13599 mdt: fix logic of skipping local locks in reply_state
            78d712a New release 2.12.5

            tappro Mikhail Pershin added a comment -

            Hello Stephane, this fix had no separate ticket in the master branch but was done as side work in LU-11025. I've extracted the related changes from it for 2.12. Please check if it helps.

            gerrit Gerrit Updater added a comment -

            Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/39521
            Subject: LU-13599 mdt: fix mti_big_lmm buffer usage
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: f2cac96782dc729cfe93272db51d670284b6d7aa

            sthiell Stephane Thiell added a comment -

            Hi Mike,
            Do you have a separate ticket for the problem of ASSERTION( info->mti_big_lmm_used == 0 )? Just wanted to check whether you have had time to backport the patch. We would like to try both patches to avoid MDT crashes during MDT migration, which we're doing quite a lot (we got another crash last night). Thanks!
            Stephane
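
            For reference, mti_big_lmm is a per-thread buffer in mdt_thread_info that the MDT grows on demand to hold large striping EAs, and mti_big_lmm_used marks it as holding live data; the assertion trips when a second user tries to claim the buffer while it is still in use. An illustrative sketch of the ownership pattern (simplified, not the literal Lustre code):

              /* Illustrative sketch of the mti_big_lmm claim/release pattern */
              struct mdt_thread_info {
                      void *mti_big_lmm;      /* grown-on-demand buffer for large EAs */
                      int   mti_big_lmmsize;  /* current size of that buffer          */
                      int   mti_big_lmm_used; /* non-zero while the buffer is claimed */
              };

              static int mdt_big_xattr_get(struct mdt_thread_info *info /* , ... */)
              {
                      /* Two concurrent users of the single per-thread buffer
                       * would overwrite each other, hence the assertion that
                       * fired: ASSERTION( info->mti_big_lmm_used == 0 ) */
                      LASSERT(info->mti_big_lmm_used == 0);
                      info->mti_big_lmm_used = 1;

                      /* ... fetch the xattr into mti_big_lmm, enlarging it as
                       * needed; the caller clears mti_big_lmm_used when done ... */
                      return 0;
              }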

            sthiell Stephane Thiell added a comment -

            Great, thanks Mike!

            tappro Mikhail Pershin added a comment -

            Stephane, this assertion was seen on master and was fixed; I will prepare a patch for 2.12.

            People

              Assignee: tappro Mikhail Pershin
              Reporter: sthiell Stephane Thiell
              Votes: 0
              Watchers: 8

              Dates

                Created:
                Updated:
                Resolved: