[LU-7466] ASSERTION( new_lock != NULL ) failed: lockh Created: 23/Nov/15  Updated: 08/Sep/16  Resolved: 08/Sep/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Mahmoud Hanafi Assignee: Peter Jones
Resolution: Cannot Reproduce Votes: 0
Labels: None

Attachments: File r743i2n15.ldump.gz    
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The MDS hit an LBUG just after our remote-cluster clients (NIDs 10.153.x.x@o2ib233, connected through routers) experienced connection timeouts to the routers.

<3>LustreError: 19688:0:(ldlm_lockd.c:1347:ldlm_handle_enqueue0()) ### lock on destroyed export ffff88344f2a1800 ns: mdt-nbp8-MDT0000_UUID lock: ffff88357462b980/0x249f5f7747c47748 lrc: 3/0,0 mode: PR/PR res: [0x3603b9ff8:0x134:0x0].0 bits 0x2 rrc: 2 type: IBT flags: 0x50200000000000 nid: 10.153.11.153@o2ib233 remote: 0x10f7e70cb9c0ae16 expref: 125 pid: 19764 timeout: 0 lvb_type: 0
<0>LustreError: 19918:0:(mdt_handler.c:3725:mdt_intent_lock_replace()) ASSERTION( new_lock != NULL ) failed: lockh 0x249f5f7747ef7873
<0>LustreError: 19918:0:(mdt_handler.c:3725:mdt_intent_lock_replace()) LBUG
<4>Pid: 19918, comm: mdt00_084
<4>
<4>Call Trace:
<4> [<ffffffffa04db895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa04dbe97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0e6f47c>] mdt_intent_lock_replace+0x29c/0x410 [mdt]
<4> [<ffffffffa0e755f1>] mdt_intent_reint+0x381/0x410 [mdt]
<4> [<ffffffffa0e72c3e>] mdt_intent_policy+0x3ae/0x770 [mdt]
<4> [<ffffffffa076c2c5>] ldlm_lock_enqueue+0x135/0x980 [ptlrpc]
<4> [<ffffffffa0795ebb>] ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
<4> [<ffffffffa0e73106>] mdt_enqueue+0x46/0xe0 [mdt]
<4> [<ffffffffa0e77ada>] mdt_handle_common+0x52a/0x1470 [mdt]
<4> [<ffffffffa0eb44a5>] mds_regular_handle+0x15/0x20 [mdt]
<4> [<ffffffffa07c50c5>] ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
<4> [<ffffffffa04ed8d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
<4> [<ffffffffa07bda69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
<4> [<ffffffffa07c789d>] ptlrpc_main+0xafd/0x1780 [ptlrpc]
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffffa07c6da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 19918, comm: mdt00_084 Not tainted 2.6.32-431.29.2.el6.20150203.x86_64.lustre253 #1
<4>Call Trace:
<4> [<ffffffff8155946e>] ? panic+0xa7/0x190
<4> [<ffffffffa04dbeeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa0e6f47c>] ? mdt_intent_lock_replace+0x29c/0x410 [mdt]
<4> [<ffffffffa0e755f1>] ? mdt_intent_reint+0x381/0x410 [mdt]
<4> [<ffffffffa0e72c3e>] ? mdt_intent_policy+0x3ae/0x770 [mdt]
<4> [<ffffffffa076c2c5>] ? ldlm_lock_enqueue+0x135/0x980 [ptlrpc]
<4> [<ffffffffa0795ebb>] ? ldlm_handle_enqueue0+0x51b/0x10c0 [ptlrpc]
<4> [<ffffffffa0e73106>] ? mdt_enqueue+0x46/0xe0 [mdt]
<4> [<ffffffffa0e77ada>] ? mdt_handle_common+0x52a/0x1470 [mdt]
<4> [<ffffffffa0eb44a5>] ? mds_regular_handle+0x15/0x20 [mdt]
<4> [<ffffffffa07c50c5>] ? ptlrpc_server_handle_request+0x385/0xc00 [ptlrpc]
<4> [<ffffffffa04ed8d5>] ? lc_watchdog_touch+0x65/0x170 [libcfs]
<4> [<ffffffffa07bda69>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
<4> [<ffffffffa07c789d>] ? ptlrpc_main+0xafd/0x1780 [ptlrpc]
<4> [<ffffffff8100c20a>] ? child_rip+0xa/0x20
<4> [<ffffffffa07c6da0>] ? ptlrpc_main+0x0/0x1780 [ptlrpc]
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20

Nid 10.153.11.153@o2ib233 is node r743i2n15; debug logs from it are attached.

We are still trying to determine the cause of network timeouts.
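
For illustration only, here is a minimal userspace C sketch (not the Lustre source; all names are made up) of the race suggested by the "lock on destroyed export" message: export teardown after the connection timeout drops the lock, so a later lookup of the same lock handle (the "lockh 0x..." cookie in the assertion) returns NULL and the assertion fires, producing the LBUG/panic above.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct lock {
        uint64_t cookie;   /* opaque handle cookie, like "lockh 0x..." */
        int      valid;
};

/* hypothetical one-slot handle table, just for the sketch */
static struct lock *handle_table[1];

static struct lock *handle2lock(uint64_t cookie)
{
        struct lock *lk = handle_table[0];

        /* once export cleanup has torn the lock down, nothing matches */
        if (lk == NULL || !lk->valid || lk->cookie != cookie)
                return NULL;
        return lk;
}

static void export_destroy(void)
{
        /* cleanup of a timed-out client export invalidates its locks */
        free(handle_table[0]);
        handle_table[0] = NULL;
}

int main(void)
{
        struct lock *lk = calloc(1, sizeof(*lk));

        lk->cookie = 0x249f5f7747ef7873ULL;  /* cookie from the crash */
        lk->valid = 1;
        handle_table[0] = lk;

        export_destroy();  /* connection timed out; export destroyed */

        /* the intent handler still holds the old cookie and looks it up */
        struct lock *new_lock = handle2lock(0x249f5f7747ef7873ULL);

        /* equivalent of ASSERTION( new_lock != NULL ) failed: lockh ... */
        assert(new_lock != NULL);
        printf("lock found: cookie %#llx\n",
               (unsigned long long)new_lock->cookie);
        return 0;
}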



 Comments   
Comment by Peter Jones [ 24/Nov/15 ]

Mahmoud

This area of code was significantly reworked in 2.5.4 (mostly tracked under LU-2827), so I think this should be dealt with when you next rebaseline.

Regards

Peter

Comment by Mahmoud Hanafi [ 08/Sep/16 ]

Please close; we have upgraded to 2.7.

Comment by Peter Jones [ 08/Sep/16 ]

ok - thanks Mahmoud
