[LU-3987] LBUG ASSERTION( (!(rc < 0) || (lustre_msg_get_transno(req->rq_repmsg) == 0)) ) failed Created: 20/Sep/13  Updated: 14/Nov/13  Resolved: 14/Nov/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.4
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Mahmoud Hanafi Assignee: Bruno Faccini (Inactive)
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Related
is related to LU-2943 LBUG mdt_reconstruct_open()) ASSERTIO... Resolved
Severity: 3
Rank (Obsolete): 10636

 Description   

This may be a duplicate of LU-2943. We have hit this several times and have reproducer code.

<3>LustreError: 0:0:(ldlm_lockd.c:358:waiting_locks_callback()) ### lock callback timer expired after 151s: evicting client at 10.151.4.243@o2ib ns: mdt-ffff880c2fb68000 lock: ffff880bf1ba3480/0x8107ff778cb60d48 lrc: 3/0,0 mode: CW/CW res: 9011569394/588 bits 0x5 rrc: 512 type: IBT flags: 0x4000030 remote: 0x1cf563a7b3f9d855 expref: 40 pid: 70400 timeout: 4583890257
<3>LustreError: 69191:0:(ldlm_lockd.c:1162:ldlm_handle_enqueue0()) ### lock on disconnected export ffff880c09a0f800 ns: mdt-ffff880c2fb68000 lock: ffff880bf2692900/0x8107ff778cb73a79 lrc: 2/0,0 mode: --/CR res: 8993497057/15684 bits 0x0 rrc: 699 type: IBT flags: 0x0 remote: 0x1cf563a7b3f9d871 expref: -99 pid: 69191 timeout: 0
<0>LustreError: 70380:0:(mdt_open.c:1056:mdt_reconstruct_open()) ASSERTION( (!(rc < 0) || (lustre_msg_get_transno(req->rq_repmsg) == 0)) ) failed: 
<0>LustreError: 70380:0:(mdt_open.c:1056:mdt_reconstruct_open()) LBUG
<4>Pid: 70380, comm: mdt_295
<4>
<4>Call Trace:
<4> [<ffffffffa05a0785>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa05a0d97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0ee5697>] mdt_reconstruct_open+0x7c7/0xa80 [mdt]
<4> [<ffffffffa0ed71f5>] mdt_reconstruct+0x45/0x120 [mdt]
<4> [<ffffffffa0ec4099>] mdt_reint_internal+0x709/0x8e0 [mdt]
<4> [<ffffffffa0ec453d>] mdt_intent_reint+0x1ed/0x500 [mdt]
<4> [<ffffffffa0ec2c09>] mdt_intent_policy+0x379/0x690 [mdt]
<4> [<ffffffffa07f3351>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
<4> [<ffffffffa08191ad>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
<4> [<ffffffffa0ec3586>] mdt_enqueue+0x46/0x130 [mdt]
<4> [<ffffffffa0eb8772>] mdt_handle_common+0x932/0x1750 [mdt]
<4> [<ffffffffa0eb9665>] mdt_regular_handle+0x15/0x20 [mdt]
<4> [<ffffffffa0847b4e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
<4> [<ffffffffa0846f00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
<4> [<ffffffff8100c0ca>] child_rip+0xa/0x20
<4> [<ffffffffa0846f00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
<4> [<ffffffffa0846f00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
<4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 70380, comm: mdt_295 Not tainted 2.6.32-279.19.1.el6.20130213.x86_64.lustre214 #1
<4>Call Trace:
<4> [<ffffffff8151c027>] ? panic+0xa0/0x189
<4> [<ffffffffa05a0deb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa0ee5697>] ? mdt_reconstruct_open+0x7c7/0xa80 [mdt]
<4> [<ffffffffa0ed71f5>] ? mdt_reconstruct+0x45/0x120 [mdt]
<4> [<ffffffffa0ec4099>] ? mdt_reint_internal+0x709/0x8e0 [mdt]
<4> [<ffffffffa0ec453d>] ? mdt_intent_reint+0x1ed/0x500 [mdt]
<4> [<ffffffffa0ec2c09>] ? mdt_intent_policy+0x379/0x690 [mdt]
<4> [<ffffffffa07f3351>] ? ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
<4> [<ffffffffa08191ad>] ? ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
<4> [<ffffffffa0ec3586>] ? mdt_enqueue+0x46/0x130 [mdt]
<4> [<ffffffffa0eb8772>] ? mdt_handle_common+0x932/0x1750 [mdt]
<4> [<ffffffffa0eb9665>] ? mdt_regular_handle+0x15/0x20 [mdt]
<4> [<ffffffffa0847b4e>] ? ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
<4> [<ffffffffa0846f00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
<4> [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
<4> [<ffffffffa0846f00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
<4> [<ffffffffa0846f00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
<4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
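
When comparing occurrences of this LBUG across consoles, it can help to pull the source location and message out of each LustreError line. The following is only a minimal sketch: the function name `parse_lustre_error` and the regular expression are assumptions made for illustration (they are not part of Lustre or any Lustre tool), and the pattern is derived solely from the line layout visible in the log above.

```python
import re

# Assumed layout, inferred from the console log in this ticket:
#   <level>LustreError: pid:N:(file:line:func()) message
LUSTRE_ERR = re.compile(
    r"^(?:<(?P<level>\d)>)?LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) (?P<msg>.*)$"
)


def parse_lustre_error(line):
    """Return a dict describing one LustreError console line, or None.

    Hypothetical helper: tolerates a trailing newline and a literal "^M"
    marker left behind by a serial-console capture.
    """
    line = line.rstrip("\r\n")
    if line.endswith("^M"):
        line = line[:-2]
    match = LUSTRE_ERR.match(line)
    if match is None:
        return None
    info = match.groupdict()
    info["pid"] = int(info["pid"])
    info["line"] = int(info["line"])
    info["msg"] = info["msg"].strip()
    # An LBUG is reported either as the bare message or after an ASSERTION.
    info["is_lbug"] = info["msg"] == "LBUG" or info["msg"].startswith("ASSERTION(")
    return info


if __name__ == "__main__":
    sample = "<0>LustreError: 70380:0:(mdt_open.c:1056:mdt_reconstruct_open()) LBUG^M"
    print(parse_lustre_error(sample))
```

Lines that are not LustreError messages, such as the call-trace entries, simply return `None`, so the helper can be mapped over a whole console capture.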



 Comments   
Comment by Bruno Faccini (Inactive) [ 20/Sep/13 ]

Hello Mahmoud,
I am already in charge of LU-2943, where I back-ported the change from LU-2927.
Can you describe your reproducer and your platform configuration in more detail, so I can run it against my b2_1 back-port?
Maybe you can provide it, if its requirements are not too "specific"?

Comment by Andreas Dilger [ 20/Sep/13 ]

Looks like it is getting an error during replay, though that should never happen. If the error is -EREMOTE (which should only happen for DNE recovery) then this can be closed as a duplicate.
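
The failed assertion has the logical form "rc < 0 implies transno == 0", and the check suggested above amounts to comparing rc with -EREMOTE. As a minimal illustration only (plain Python, not Lustre code; both function names are hypothetical, and `transno` stands in for the value that `lustre_msg_get_transno(req->rq_repmsg)` would return):

```python
import errno


def reconstruct_invariant_holds(rc, transno):
    """Same boolean shape as the ASSERTION in the log:
    (!(rc < 0) || (transno == 0)), i.e. an error reply must carry transno 0.
    """
    return (not rc < 0) or transno == 0


def is_eremote(rc):
    """True if rc is the negative errno value for EREMOTE (kernel-style return code)."""
    return rc == -errno.EREMOTE


if __name__ == "__main__":
    # A successful reply may carry any transaction number.
    print(reconstruct_invariant_holds(0, 1234))
    # An error reply with a non-zero transno is the combination that trips the LBUG.
    print(reconstruct_invariant_holds(-errno.EREMOTE, 1234))
```

The sketch makes no claim about which code path leaves a non-zero transno in the reply; it only restates the condition that the assertion enforces.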

Comment by Bruno Faccini (Inactive) [ 15/Oct/13 ]

Hello Mahmoud,
You did not answer my question about whether your reproducer is available and whether it can be run against the b2_1 back-port of LU-2927 that I created for LU-2943. Can you give me some feedback?

Also, as Andreas pointed out, a good way to confirm that this is related, and likely to be fixed by the same change, would be to check whether the error is -EREMOTE. Since you indicate that the problem is reproducible, is there a recent crash dump or Lustre debug log available?

Comment by Kit Westneat (Inactive) [ 30/Oct/13 ]

Hi Bruno, we recently hit this bug at IU in 2.1.6. I have a core dump I will have the customer upload to the FTP site.

Did your patch for LU-2943 ever land? If not, then I don't think we are running with it. I am going to try building a version with that.

BTW the kernel-debuginfo we have has a different name, but it is for the same kernel.

Comment by Mahmoud Hanafi [ 31/Oct/13 ]

We applied patch#3 from LU-2943 and have not hit the issue.

Comment by Bruno Faccini (Inactive) [ 31/Oct/13 ]

Mahmoud, thanks for your feedback!

Kit, the patch has not landed yet, but since at least the TGCC and now NASA sites have successfully integrated it, I think it will land soon.

Comment by Peter Jones [ 14/Nov/13 ]

duplicate of LU-2943
