Details
Type: Bug
Resolution: Duplicate
Priority: Major
Labels: None
Affects Version/s: Lustre 2.1.6, Lustre 2.4.3
Environment:
Clients:
- RHEL6 w/ patched kernel 2.6.32-431.11.2.el6
- Lustre 2.4.3 + bullpatches
Servers:
- RHEL6 w/ patched kernel 2.6.32-220.23.1
- Lustre 2.1.6 + bullpatches
Severity: 3
Rank (Obsolete): 15925
Description
We hit the following LBUG twice on one of our MDTs:
[78073.117731] Lustre: 31681:0:(ldlm_lib.c:952:target_handle_connect()) work2-MDT0000: connection from 38d12a48-aabd-9279-dc69-b78c4e00321c@10.100.62.72@o2ib2 t189645377601 exp ffff880b95bb1c00 cur 1410508503 last 1410508503
[78079.176124] Lustre: 31681:0:(mdt_handler.c:1005:mdt_getattr_name_lock()) Although resent, but still not get child lockparent:[0x22f2b0783:0x34b:0x0] child:[0x22d854b6e:0x85d5:0x0]
[78079.192443] LustreError: 31681:0:(mdt_handler.c:3203:mdt_intent_lock_replace()) ASSERTION( lustre_msg_get_flags(req->rq_reqmsg) & 0x0002 ) failed:
[78079.205971] LustreError: 31681:0:(mdt_handler.c:3203:mdt_intent_lock_replace()) LBUG
[78079.215326] Pid: 31681, comm: mdt_104
[78079.220352]
[78079.220353] Call Trace:
[78079.227394] [<ffffffffa051a7f5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[78079.236100] [<ffffffffa051ae07>] lbug_with_loc+0x47/0xb0 [libcfs]
[78079.243815] [<ffffffffa0d9671b>] mdt_intent_lock_replace+0x3bb/0x440 [mdt]
[78079.252140] [<ffffffffa0daad26>] mdt_intent_getattr+0x3a6/0x4a0 [mdt]
[78079.260391] [<ffffffffa0da6c09>] mdt_intent_policy+0x379/0x690 [mdt]
[78079.268641] [<ffffffffa07423c1>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
[78079.276846] [<ffffffffa07683cd>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
[78079.285614] [<ffffffffa0da7586>] mdt_enqueue+0x46/0x130 [mdt]
[78079.292950] [<ffffffffa0d9c762>] mdt_handle_common+0x932/0x1750 [mdt]
[78079.300987] [<ffffffffa0d9d655>] mdt_regular_handle+0x15/0x20 [mdt]
[78079.309024] [<ffffffffa07974f6>] ptlrpc_main+0xd16/0x1a80 [ptlrpc]
[78079.316979] [<ffffffff810017cc>] ? __switch_to+0x1ac/0x320
[78079.324222] [<ffffffffa07967e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
[78079.331896] [<ffffffff8100412a>] child_rip+0xa/0x20
[78079.338522] [<ffffffffa07967e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
[78079.346599] [<ffffffffa07967e0>] ? ptlrpc_main+0x0/0x1a80 [ptlrpc]
[78079.354520] [<ffffffff81004120>] ? child_rip+0x0/0x20
[78079.361136]
[78079.364683] Kernel panic - not syncing: LBUG
The support engineer was able to identify the client node from the crash dump. Both times, the client was a login node running Lustre 2.4.3.
This looks like LU-5314. The backported patch proposal failed on Maloo (http://review.whamcloud.com/#/c/10902/).
Hi,
We are now running Lustre 2.5.3 + the b2_5 patch http://review.whamcloud.com/#/c/10492/. Since the upgrade, we have been hitting several issues on the MDS/OSS nodes around the LDLM. Are you aware of any complementary fixes that we should apply along with this one?
In the meantime, we are still investigating these issues on site and will report them as soon as possible in new JIRA tickets.