Details
- Type: Bug
- Resolution: Fixed
- Priority: Minor
- Affects Version: Lustre 2.12.4
Description
We hit the following crash last night on one of the MDSes of Fir, running Lustre 2.12.4. The same problem occurred again after re-mounting the MDT and going through recovery. I had to kill the robinhood client, which was running a purge as well as an MDT-to-MDT migration (a single lfs migrate -m 0); after that I was able to remount this MDT. It looks a bit like LU-5185. Unfortunately, we lost the vmcore this time, but if it happens again I'll let you know and attach it.
[1579975.369592] Lustre: fir-MDT0003: haven't heard from client 6fb18a53-0376-4 (at 10.50.6.54@o2ib2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff9150778e2400, cur 1590281140 expire 1590280990 last 1590280913
[1580039.870825] Lustre: fir-MDT0003: Connection restored to b7b3778b-82b4-4 (at 10.50.6.54@o2ib2)
[1600137.924189] Lustre: fir-MDT0003: haven't heard from client ed7f1c7c-f5de-4 (at 10.50.4.54@o2ib2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff913f6360b800, cur 1590301302 expire 1590301152 last 1590301075
[1619876.303011] Lustre: fir-MDT0003: Connection restored to 796a800c-02e4-4 (at 10.49.20.10@o2ib1)
[1639768.911234] Lustre: fir-MDT0003: Connection restored to 83415c02-51ff-4 (at 10.49.20.5@o2ib1)
[1639821.000702] Lustre: fir-MDT0003: haven't heard from client 83415c02-51ff-4 (at 10.49.20.5@o2ib1) in 227 seconds. I think it's dead, and I am evicting it. exp ffff913f32f56800, cur 1590340984 expire 1590340834 last 1590340757
[1647672.215034] Lustre: fir-MDT0003: haven't heard from client 19e3d49f-43e4-4 (at 10.50.9.37@o2ib2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff913f6240bc00, cur 1590348835 expire 1590348685 last 1590348608
[1647717.069200] Lustre: fir-MDT0003: Connection restored to a4c7b337-bfab-4 (at 10.50.9.37@o2ib2)
[1667613.717650] Lustre: fir-MDT0003: haven't heard from client 20e68a82-bbdb-4 (at 10.50.6.54@o2ib2) in 227 seconds. I think it's dead, and I am evicting it. exp ffff914d1725c400, cur 1590368776 expire 1590368626 last 1590368549
[1667717.713398] Lustre: fir-MDT0003: Connection restored to b7b3778b-82b4-4 (at 10.50.6.54@o2ib2)
[1692403.249073] LustreError: 30166:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs->rs_nlocks < 8 ) failed:
[1692403.258985] LustreError: 30166:0:(service.c:189:ptlrpc_save_lock()) LBUG
[1692403.265867] Pid: 30166, comm: mdt00_002 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1 SMP Thu Nov 7 15:26:16 PST 2019
[1692403.276224] Call Trace:
[1692403.278866] [<ffffffffc0aff7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[1692403.285611] [<ffffffffc0aff87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[1692403.292002] [<ffffffffc0fb8851>] ptlrpc_save_lock+0xc1/0xd0 [ptlrpc]
[1692403.298695] [<ffffffffc14b4bab>] mdt_save_lock+0x20b/0x360 [mdt]
[1692403.305003] [<ffffffffc14b4d5c>] mdt_object_unlock+0x5c/0x3c0 [mdt]
[1692403.311572] [<ffffffffc14b82e7>] mdt_object_unlock_put+0x17/0x120 [mdt]
[1692403.318479] [<ffffffffc150c4fc>] mdt_unlock_list+0x54/0x174 [mdt]
[1692403.324876] [<ffffffffc14d3fd3>] mdt_reint_migrate+0xa03/0x1310 [mdt]
[1692403.331619] [<ffffffffc14d4963>] mdt_reint_rec+0x83/0x210 [mdt]
[1692403.337841] [<ffffffffc14b1273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[1692403.344586] [<ffffffffc14bc6e7>] mdt_reint+0x67/0x140 [mdt]
[1692403.350463] [<ffffffffc101c64a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[1692403.357586] [<ffffffffc0fbf43b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[1692403.365483] [<ffffffffc0fc2da4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[1692403.371991] [<ffffffffb98c2e81>] kthread+0xd1/0xe0
[1692403.377079] [<ffffffffb9f77c24>] ret_from_fork_nospec_begin+0xe/0x21
[1692403.383725] [<ffffffffffffffff>] 0xffffffffffffffff
[1692403.388918] Kernel panic - not syncing: LBUG
[1692403.393363] CPU: 44 PID: 30166 Comm: mdt00_002 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.27.2.el7_lustre.pl2.x86_64 #1
[1692403.406214] Hardware name: Dell Inc. PowerEdge R6415/07YXFK, BIOS 1.10.6 08/15/2019
[1692403.414040] Call Trace:
[1692403.416670] [<ffffffffb9f65147>] dump_stack+0x19/0x1b
[1692403.421981] [<ffffffffb9f5e850>] panic+0xe8/0x21f
[1692403.426954] [<ffffffffc0aff8cb>] lbug_with_loc+0x9b/0xa0 [libcfs]
[1692403.433351] [<ffffffffc0fb8851>] ptlrpc_save_lock+0xc1/0xd0 [ptlrpc]
[1692403.439978] [<ffffffffc14b4bab>] mdt_save_lock+0x20b/0x360 [mdt]
[1692403.446259] [<ffffffffc14b4d5c>] mdt_object_unlock+0x5c/0x3c0 [mdt]
[1692403.452796] [<ffffffffc14b82e7>] mdt_object_unlock_put+0x17/0x120 [mdt]
[1692403.459687] [<ffffffffc150c4fc>] mdt_unlock_list+0x54/0x174 [mdt]
[1692403.466057] [<ffffffffc14d3fd3>] mdt_reint_migrate+0xa03/0x1310 [mdt]
[1692403.472794] [<ffffffffc0d3cfa9>] ? check_unlink_entry+0x19/0xd0 [obdclass]
[1692403.479942] [<ffffffffc14d4963>] mdt_reint_rec+0x83/0x210 [mdt]
[1692403.486134] [<ffffffffc14b1273>] mdt_reint_internal+0x6e3/0xaf0 [mdt]
[1692403.492843] [<ffffffffc14bc6e7>] mdt_reint+0x67/0x140 [mdt]
[1692403.498721] [<ffffffffc101c64a>] tgt_request_handle+0xada/0x1570 [ptlrpc]
[1692403.505806] [<ffffffffc0ff40b1>] ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
[1692403.513553] [<ffffffffc0affbde>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[1692403.520811] [<ffffffffc0fbf43b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[1692403.528673] [<ffffffffc0fbb565>] ? ptlrpc_wait_event+0xa5/0x360 [ptlrpc]
[1692403.535632] [<ffffffffb98cfeb4>] ? __wake_up+0x44/0x50
[1692403.541065] [<ffffffffc0fc2da4>] ptlrpc_main+0xb34/0x1470 [ptlrpc]
[1692403.547537] [<ffffffffc0fc2270>] ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
[1692403.555106] [<ffffffffb98c2e81>] kthread+0xd1/0xe0
[1692403.560158] [<ffffffffb98c2db0>] ? insert_kthread_work+0x40/0x40
[1692403.566424] [<ffffffffb9f77c24>] ret_from_fork_nospec_begin+0xe/0x21
[1692403.573037] [<ffffffffb98c2db0>] ? insert_kthread_work+0x40/0x40

The same assertion hit again after the remount:

[root@fir-md1-s4 127.0.0.1-2020-05-25-00:59:34]#
Message from syslogd@fir-md1-s4 at May 25 09:46:42 ...
 kernel:LustreError: 29280:0:(service.c:189:ptlrpc_save_lock()) ASSERTION( rs->rs_nlocks < 8 ) failed:
Message from syslogd@fir-md1-s4 at May 25 09:46:42 ...
 kernel:LustreError: 29280:0:(service.c:189:ptlrpc_save_lock()) LBUG
May 25 09:55:43 fir-md1-s1 kernel: Lustre: fir-MDT0003: Recovery over after 2:45, of 1302 clients 1301 recovered and 1 was evicted.
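For context, here is a minimal, self-contained C sketch of the invariant that fires above. This is not the Lustre source; the names (reply_state, save_lock, MAX_REPLY_LOCKS) are illustrative stand-ins. The point is that each reply can only park a fixed number of saved locks (8, per the assertion), so a handler that tries to save a ninth lock on one reply, as the migrate path apparently does while unlocking its object list, trips the LASSERT and LBUGs.

/*
 * Minimal sketch of "ASSERTION( rs->rs_nlocks < 8 )".  Not the actual
 * Lustre code; struct/function names here are hypothetical.
 */
#include <assert.h>
#include <stdio.h>

#define MAX_REPLY_LOCKS 8                 /* the "< 8" bound in the LBUG */

struct reply_state {
	int           nlocks;             /* locks already parked on this reply */
	unsigned long locks[MAX_REPLY_LOCKS];
};

/* Park one lock handle on the reply; aborts, LASSERT-style, when full. */
static void save_lock(struct reply_state *rs, unsigned long handle)
{
	assert(rs->nlocks < MAX_REPLY_LOCKS);
	rs->locks[rs->nlocks++] = handle;
}

int main(void)
{
	struct reply_state rs = { 0, { 0 } };

	/* If a migrate saves one lock per object it unlocks on a single
	 * reply, the ninth call aborts -- the LBUG seen on the MDS. */
	for (unsigned long h = 1; h <= 9; h++) {
		printf("saving lock %lu\n", h);
		save_lock(&rs, h);
	}
	return 0;
}

Assuming the migrate path really does accumulate one saved lock per entry in the list it unlocks, any list longer than eight entries would hit this bound.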
Attachments
Issue Links
- is duplicated by LU-13816: LustreError: 18408:0:(mdt_handler.c:892:mdt_big_xattr_get()) ASSERTION( info->mti_big_lmm_used == 0 ) failed: (Resolved)