Lustre / LU-13599

LustreError: 30166:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs->rs_nlocks < 8 ) failed

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: Lustre 2.12.4
    • Fix Version/s: None
    • Labels:
      None
    • Severity:
      3

      Description

      We hit the following crash on one of the MDSes of Fir last night, running Lustre 2.12.4. The same problem occurred again after remounting the MDT and going through recovery. I had to kill the robinhood client, which was running a purge as well as an MDT-to-MDT migration (a single lfs migrate -m 0); after that I was able to remount this MDT. It looks a bit like LU-5185. Unfortunately, we lost the vmcore this time, but if it happens again I'll let you know and attach it.

      [1579975.369592] Lustre: fir-MDT0003: haven't heard from client 6fb18a53-0376-4 (at 10.50.6.54@o2ib2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9150778e2400, cur 1590281140 expire 1590280990 last 1590280913
      [1580039.870825] Lustre: fir-MDT0003: Connection restored to b7b3778b-82b4-4 (at 10.50.6.54@o2ib2)
      [1600137.924189] Lustre: fir-MDT0003: haven't heard from client ed7f1c7c-f5de-4 (at 10.50.4.54@o2ib2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff913f6360b800, cur 1590301302 expire 1590301152 last 1590301075
      [1619876.303011] Lustre: fir-MDT0003: Connection restored to 796a800c-02e4-4 (at 10.49.20.10@o2ib1)
      [1639768.911234] Lustre: fir-MDT0003: Connection restored to 83415c02-51ff-4 (at 10.49.20.5@o2ib1)
      [1639821.000702] Lustre: fir-MDT0003: haven't heard from client 83415c02-51ff-4 (at 10.49.20.5@o2ib1) in 227 seconds. I think it's dead, and I am evicting it. exp ffff913f32f56800, cur 1590340984 expire 1590340834 last 1590340757
      [1647672.215034] Lustre: fir-MDT0003: haven't heard from client 19e3d49f-43e4-4 (at 10.50.9.37@o2ib2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff913f6240bc00, cur 1590348835 expire 1590348685 last 1590348608
      [1647717.069200] Lustre: fir-MDT0003: Connection restored to a4c7b337-bfab-4 (at 10.50.9.37@o2ib2)
      [1667613.717650] Lustre: fir-MDT0003: haven't heard from client 20e68a82-bbdb-4 (at 10.50.6.54@o2ib2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff914d1725c400, cur 1590368776 expire 1590368626 last 1590368549
      [1667717.713398] Lustre: fir-MDT0003: Connection restored to b7b3778b-82b4-4 (at 10.50.6.54@o2ib2)
      [1692403.249073] LustreError: 30166:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs->rs_nlocks < 8 ) failed: 
      [1692403.258985] LustreError: 30166:0:(service.c:189:ptlrpc_save_lock()) LBUG
      [1692403.265867] Pid: 30166, comm: mdt00_002 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
      [1692403.276224] Call Trace:
      [1692403.278866]  [<ffffffffc0aff7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
      [1692403.285611]  [<ffffffffc0aff87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
      [1692403.292002]  [<ffffffffc0fb8851>] ptlrpc_save_lock+0xc1/0xd0 [ptlrpc]
      [1692403.298695]  [<ffffffffc14b4bab>] mdt_save_lock+0x20b/0x360 [mdt]
      [1692403.305003]  [<ffffffffc14b4d5c>] mdt_object_unlock+0x5c/0x3c0 [mdt]
      [1692403.311572]  [<ffffffffc14b82e7>] mdt_object_unlock_put+0x17/0x120 [mdt]
      [1692403.318479]  [<ffffffffc150c4fc>] mdt_unlock_list+0x54/0x174 [mdt]
      [1692403.324876]  [<ffffffffc14d3fd3>] mdt_reint_migrate+0xa03/0x1310 [mdt]
      [1692403.331619]  [<ffffffffc14d4963>] mdt_reint_rec+0x83/0x210 [mdt]
      [1692403.337841]  [<ffffffffc14b1273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      [1692403.344586]  [<ffffffffc14bc6e7>] mdt_reint+0x67/0x140 [mdt]
      [1692403.350463]  [<ffffffffc101c64a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      [1692403.357586]  [<ffffffffc0fbf43b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [1692403.365483]  [<ffffffffc0fc2da4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      [1692403.371991]  [<ffffffffb98c2e81>] kthread+0xd1/0xe0
      [1692403.377079]  [<ffffffffb9f77c24>] ret_from_fork_nospec_begin+0xe/0x21
      [1692403.383725]  [<ffffffffffffffff>] 0xffffffffffffffff
      [1692403.388918] Kernel panic - not syncing: LBUG
      [1692403.393363] CPU: 44 PID: 30166 Comm: mdt00_002 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1
      [1692403.406214] Hardware name: Dell Inc. PowerEdge R6415/07YXFK, BIOS 1.10.6 08/15/2019
      [1692403.414040] Call Trace:
      [1692403.416670]  [<ffffffffb9f65147>] dump_stack+0x19/0x1b
      [1692403.421981]  [<ffffffffb9f5e850>] panic+0xe8/0x21f
      [1692403.426954]  [<ffffffffc0aff8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
      [1692403.433351]  [<ffffffffc0fb8851>] ptlrpc_save_lock+0xc1/0xd0 [ptlrpc]
      [1692403.439978]  [<ffffffffc14b4bab>] mdt_save_lock+0x20b/0x360 [mdt]
      [1692403.446259]  [<ffffffffc14b4d5c>] mdt_object_unlock+0x5c/0x3c0 [mdt]
      [1692403.452796]  [<ffffffffc14b82e7>] mdt_object_unlock_put+0x17/0x120 [mdt]
      [1692403.459687]  [<ffffffffc150c4fc>] mdt_unlock_list+0x54/0x174 [mdt]
      [1692403.466057]  [<ffffffffc14d3fd3>] mdt_reint_migrate+0xa03/0x1310 [mdt]
      [1692403.472794]  [<ffffffffc0d3cfa9>] ? check_unlink_entry+0x19/0xd0 [obdclass]
      [1692403.479942]  [<ffffffffc14d4963>] mdt_reint_rec+0x83/0x210 [mdt]
      [1692403.486134]  [<ffffffffc14b1273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
      [1692403.492843]  [<ffffffffc14bc6e7>] mdt_reint+0x67/0x140 [mdt]
      [1692403.498721]  [<ffffffffc101c64a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
      [1692403.505806]  [<ffffffffc0ff40b1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
      [1692403.513553]  [<ffffffffc0affbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
      [1692403.520811]  [<ffffffffc0fbf43b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [1692403.528673]  [<ffffffffc0fbb565>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
      [1692403.535632]  [<ffffffffb98cfeb4>] ? __wake_up+0x44/0x50
      [1692403.541065]  [<ffffffffc0fc2da4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
      [1692403.547537]  [<ffffffffc0fc2270>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
      [1692403.555106]  [<ffffffffb98c2e81>] kthread+0xd1/0xe0
      [1692403.560158]  [<ffffffffb98c2db0>] ? insert_kthread_work+0x40/0x40
      [1692403.566424]  [<ffffffffb9f77c24>] ret_from_fork_nospec_begin+0xe/0x21
      [1692403.573037]  [<ffffffffb98c2db0>] ? insert_kthread_work+0x40/0x40
      [root@fir-md1-s4 127.0.0.1-2020-05-25-00:59:34]# 
      Message from syslogd@fir-md1-s4 at May 25 09:46:42 ...
       kernel:LustreError: 29280:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs->rs_nlocks < 8 ) failed: 
      
      Message from syslogd@fir-md1-s4 at May 25 09:46:42 ...
       kernel:LustreError: 29280:0:(service.c:189:ptlrpc_save_lock()) LBUG
      
      May 25 09:55:43 fir-md1-s1 kernel: Lustre: fir-MDT0003: Recovery over after 2:45, of 1302 clients 1301 recovered and 1 was evicted.
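
      For readers unfamiliar with the assertion: its text says that the reply state must hold fewer than 8 saved locks at the moment another lock is saved. The following is only a minimal sketch of that kind of bounded-slot check, not Lustre's actual code. All names (sketch_reply_state, sketch_save_lock, sketch_save_many, SKETCH_MAX_LOCKS) are hypothetical, and the only detail taken from the log is the limit of 8. Unlike the real check, the sketch returns false instead of triggering an LBUG.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limit; the value 8 is taken from the assertion text only. */
#define SKETCH_MAX_LOCKS 8

/* Hypothetical stand-in for a reply state with a fixed number of lock slots. */
struct sketch_reply_state {
    int nlocks;                                  /* slots currently in use */
    unsigned long long locks[SKETCH_MAX_LOCKS];  /* saved lock cookies */
};

/*
 * Save one lock cookie in the reply state.
 * Returns false when every slot is already used, which is the condition
 * that the assertion in the log treats as fatal.
 */
static bool sketch_save_lock(struct sketch_reply_state *rs,
                             unsigned long long cookie)
{
    assert(rs != NULL);
    if (rs->nlocks >= SKETCH_MAX_LOCKS)
        return false;
    rs->locks[rs->nlocks++] = cookie;
    return true;
}

/* Try to save `total` locks in a row and count how many were accepted. */
static int sketch_save_many(struct sketch_reply_state *rs, int total)
{
    int saved = 0;

    for (int i = 0; i < total; i++) {
        if (sketch_save_lock(rs, 0x1000ULL + (unsigned long long)i))
            saved++;
    }
    return saved;
}
```

      In this sketch, a caller that releases a list of objects and saves one lock per object can keep at most 8 of them per reply state; any further attempt is rejected. Whether that is what happened in the mdt_unlock_list path above is an assumption that a vmcore would be needed to confirm.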
      

              People

              • Assignee:
                tappro Mikhail Pershin
                Reporter:
                sthiell Stephane Thiell
              • Votes:
                0
                Watchers:
                7