Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- Fix Version/s: None
- Affects Version/s: Lustre 2.10.8
- Component/s: None
- Environment: Clients: 2.12.0, CentOS 7.6
- Severity: 3
Description
LBUG today on oak-MDT0000; we have never seen this one before. We have had some big data transfers using dsync running on Sherlock (2.12.0 clients). Might be related, or not.
[4954375.921845] LustreError: 15102:0:(tgt_handler.c:628:process_req_last_xid()) @@@ Unexpected xid 5d6425ffe4140 vs. last_xid 5d6425ffe418f req@ffffa1597f41f200 x1642955450237248/t0(0) o101->98bbe778-4f70-8a89-d80e-d6a8120c693b@10.8.2.23@o2ib6:663/0 lens 736/0 e 0 to 0 dl 1567111883 ref 1 fl Interpret:/2/ffffffff rc 0/-1
[4954542.487326] LustreError: 15290:0:(mdt_lib.c:961:mdt_attr_valid_xlate()) Unknown attr bits: 0x60000
[4954542.517377] LustreError: 15290:0:(mdt_lib.c:961:mdt_attr_valid_xlate()) Skipped 3754300 previous similar messages
[4954874.316190] LustreError: 15347:0:(lod_object.c:3919:lod_ah_init()) ASSERTION( !lod_obj_is_striped(child) ) failed:
[4954874.351112] LustreError: 15347:0:(lod_object.c:3919:lod_ah_init()) LBUG
[4954874.373452] Pid: 15347, comm: mdt01_049 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Mon Oct 8 11:21:37 PDT 2018
[4954874.406359] Call Trace:
[4954874.414973] [<ffffffffc08af7cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[4954874.437035] [<ffffffffc08af87c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[4954874.459664] [<ffffffffc135a89f>] lod_ah_init+0x23f/0xde0 [lod]
[4954874.479751] [<ffffffffc13d306b>] mdd_object_make_hint+0xcb/0x190 [mdd]
[4954874.502388] [<ffffffffc13bed50>] mdd_create_data+0x330/0x730 [mdd]
[4954874.523606] [<ffffffffc129140c>] mdt_mfd_open+0xc5c/0xe70 [mdt]
[4954874.544523] [<ffffffffc1291b9b>] mdt_finish_open+0x57b/0x690 [mdt]
[4954874.565743] [<ffffffffc1293478>] mdt_reint_open+0x17c8/0x3190 [mdt]
[4954874.587229] [<ffffffffc1288cb3>] mdt_reint_rec+0x83/0x210 [mdt]
[4954874.607567] [<ffffffffc126a19b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
[4954874.630197] [<ffffffffc126a6c2>] mdt_intent_reint+0x162/0x430 [mdt]
[4954874.651677] [<ffffffffc126d4cb>] mdt_intent_opc+0x1eb/0xaf0 [mdt]
[4954874.672619] [<ffffffffc1275d68>] mdt_intent_policy+0x138/0x320 [mdt]
[4954874.694668] [<ffffffffc0be82dd>] ldlm_lock_enqueue+0x38d/0x980 [ptlrpc]
[4954874.719320] [<ffffffffc0c11c03>] ldlm_handle_enqueue0+0xa83/0x1670 [ptlrpc]
[4954874.743104] [<ffffffffc0c977f2>] tgt_enqueue+0x62/0x210 [ptlrpc]
[4954874.764026] [<ffffffffc0c9b72a>] tgt_request_handle+0x92a/0x1370 [ptlrpc]
[4954874.787245] [<ffffffffc0c4404b>] ptlrpc_server_handle_request+0x23b/0xaa0 [ptlrpc]
[4954874.813872] [<ffffffffc0c47792>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[4954874.835628] [<ffffffff8babdf21>] kthread+0xd1/0xe0
[4954874.852252] [<ffffffff8c1255f7>] ret_from_fork_nospec_end+0x0/0x39
[4954874.873448] [<ffffffffffffffff>] 0xffffffffffffffff
[4954874.890366] Kernel panic - not syncing: LBUG
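For context, the assertion sits on the open/create path shown in the trace (mdt_reint_open -> mdt_finish_open -> mdt_mfd_open -> mdd_create_data -> mdd_object_make_hint -> lod_ah_init). The following is a minimal illustrative sketch, not the actual Lustre source: only lod_ah_init() and lod_obj_is_striped() are names taken from the trace, and the surrounding structure assumes the hint code expects a child object that does not yet have a layout, so a child that is somehow already striped trips the LASSERT and the server LBUGs.

/*
 * Illustrative sketch only -- not the real lod_object.c. It assumes
 * lod_ah_init() prepares allocation (striping) hints for a child
 * object that is about to be given its layout during open/create.
 */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct lod_object {
	bool striped;	/* hypothetical stand-in for the real stripe state */
};

/* stand-in for the real lod_obj_is_striped() named in the assertion */
static bool lod_obj_is_striped(const struct lod_object *obj)
{
	return obj->striped;
}

/* crude stand-in for LASSERT()/LBUG(): panic when the invariant breaks */
#define LASSERT(cond)							\
	do {								\
		if (!(cond)) {						\
			fprintf(stderr, "ASSERTION( %s ) failed\n", #cond); \
			abort();	/* the MDS calls LBUG() here */	\
		}							\
	} while (0)

static void lod_ah_init_sketch(struct lod_object *child)
{
	/*
	 * The hint code assumes the child has no layout yet; if some
	 * other path already striped it, the invariant is violated and
	 * the server panics -- the LBUG reported above.
	 */
	LASSERT(!lod_obj_is_striped(child));

	/* ... would go on to fill in default stripe count/size hints ... */
}

int main(void)
{
	struct lod_object child = { .striped = true };
	lod_ah_init_sketch(&child);	/* reproduces the assertion message */
	return 0;
}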
I do have a crash dump if you're interested. The MDT failover was smooth, so it's not a big deal:
Aug 29 14:04:49 oak-md1-s1 kernel: Lustre: oak-MDT0000: Recovery over after 0:55, of 1464 clients 1464 recovered and 0 were evicted.
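In case it's useful, the panic task's backtrace can be pulled out of such a dump with the crash utility; a sketch with illustrative paths (the debuginfo vmlinux must match the 3.10.0-862.14.4.el7_lustre kernel):

# open the vmcore against the matching debuginfo kernel (paths illustrative)
crash /usr/lib/debug/lib/modules/3.10.0-862.14.4.el7_lustre.x86_64/vmlinux \
      /var/crash/<host-date>/vmcore

crash> log | grep -A40 ASSERTION   # console log around the assertion
crash> bt                          # backtrace of the panicking mdt01_049 thread
crash> ps | grep mdt               # state of the other MDT service threads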
Hi! This issue hit us again today, even though we're now using SSDs on all of Oak's MDTs. I see that Lai's patch above (https://review.whamcloud.com/36100) was almost ready to land and even had Andreas' approval. It would probably be too much effort to port it to 2.10.8 (which we're still running on Oak), but would it be possible to look at the patch again so that it can land in master? That way, this rare issue would be avoided in the future. Thanks!