[LU-2927] mdt_reconstruct_open() ASSERTION failure Created: 07/Mar/13  Updated: 22/May/13  Resolved: 25/Mar/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.4.0

Type: Bug
Priority: Blocker
Reporter: Andriy Skulysh
Assignee: Di Wang
Resolution: Fixed
Votes: 0
Labels: LB, mq213, patch

Attachments: LU-2927-MDS-console-log.txt, lustre.log.LU-2927.txt
Severity: 3
Rank (Obsolete): 7028

 Description   

The transno should not be cleared for an EREMOTE operation.



 Comments   
Comment by Andriy Skulysh [ 07/Mar/13 ]

PATCH: http://review.whamcloud.com/5632

Comment by Andreas Dilger [ 07/Mar/13 ]

How was this bug hit, and how often is it seen? Ideally there would also be a test case for this.

Comment by Andreas Dilger [ 08/Mar/13 ]

Dropping this from the blocker list. The patch is incorrect and we have no information about how this bug was hit or the symptoms of the failure (stack trace, error logs, etc), or how often it is hit, so no way to know how common or rare the problem is.

Comment by Andriy Skulysh [ 11/Mar/13 ]

Original call trace:

Jan 21 09:34:50 snx11026n003 kernel: [999566.276882] LustreError: 129089:0:(mdt_open.c:1041:mdt_reconstruct_open()) ASSERTION( (!(rc < 0) || (lustre_msg_get_transno(req->rq_repmsg) == 0)) ) failed:
Jan 21 09:34:50 snx11026n003 kernel: [999566.293035] LustreError: 129089:0:(mdt_open.c:1041:mdt_reconstruct_open()) LBUG
Jan 21 09:34:50 snx11026n003 kernel: [999566.301506] Pid: 129089, comm: mdt_500
Jan 21 09:34:50 snx11026n003 kernel: [999566.305899]
Jan 21 09:34:50 snx11026n003 kernel: [999566.305900] Call Trace:
Jan 21 09:34:50 snx11026n003 kernel: [999566.310714] [<ffffffffa0498825>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Jan 21 09:34:50 snx11026n003 kernel: [999566.318711] [<ffffffffa0498e37>] lbug_with_loc+0x47/0xb0 [libcfs]
Jan 21 09:34:50 snx11026n003 kernel: [999566.325867] [<ffffffffa0c9de47>] mdt_reconstruct_open+0x7c7/0xa80 [mdt]
Jan 21 09:34:50 snx11026n003 kernel: [999566.333590] [<ffffffffa0c8f7c5>] mdt_reconstruct+0x45/0x120 [mdt]
Jan 21 09:34:50 snx11026n003 kernel: [999566.340724] [<ffffffffa0c7bd59>] mdt_reint_internal+0x709/0x8f0 [mdt]
Jan 21 09:34:50 snx11026n003 kernel: [999566.348248] [<ffffffffa0c7c20d>] mdt_intent_reint+0x1ed/0x500 [mdt]
Jan 21 09:34:50 snx11026n003 kernel: [999566.355579] [<ffffffffa0c7add9>] mdt_intent_policy+0x369/0x680 [mdt]
Jan 21 09:34:50 snx11026n003 kernel: [999566.363043] [<ffffffffa0724bc1>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
Jan 21 09:34:50 snx11026n003 kernel: [999566.370801] [<ffffffffa074b5fa>] ldlm_handle_enqueue0+0x48a/0xf40 [ptlrpc]
Jan 21 09:34:50 snx11026n003 kernel: [999566.378815] [<ffffffffa0c7b246>] mdt_enqueue+0x46/0x130 [mdt]
Jan 21 09:34:50 snx11026n003 kernel: [999566.385546] [<ffffffffa0c709f2>] mdt_handle_common+0x922/0x1760 [mdt]
Jan 21 09:34:50 snx11026n003 kernel: [999566.393062] [<ffffffffa0c71905>] mdt_regular_handle+0x15/0x20 [mdt]
Jan 21 09:34:50 snx11026n003 kernel: [999566.400428] [<ffffffffa0778e3a>] ptlrpc_server_handle_request+0x43a/0x1000 [ptlrpc]
Jan 21 09:34:50 snx11026n003 kernel: [999566.409382] [<ffffffffa049957e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jan 21 09:34:50 snx11026n003 kernel: [999566.416603] [<ffffffffa04a69bf>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
Jan 21 09:34:50 snx11026n003 kernel: [999566.424462] [<ffffffffa0771ec0>] ? ptlrpc_wait_event+0xb0/0x2b0 [ptlrpc]
Jan 21 09:34:50 snx11026n003 kernel: [999566.432274] [<ffffffff8104f7a3>] ? __wake_up+0x53/0x70
Jan 21 09:34:50 snx11026n003 kernel: [999566.438368] [<ffffffffa077a27a>] ptlrpc_main+0x87a/0x1840 [ptlrpc]
Jan 21 09:34:50 snx11026n003 kernel: [999566.445644] [<ffffffffa0779a00>] ? ptlrpc_main+0x0/0x1840 [ptlrpc]
Jan 21 09:34:50 snx11026n003 kernel: [999566.452859] [<ffffffff8100c1ca>] child_rip+0xa/0x20
Jan 21 09:34:50 snx11026n003 kernel: [999566.458677] [<ffffffffa0779a00>] ? ptlrpc_main+0x0/0x1840 [ptlrpc]
Jan 21 09:34:50 snx11026n003 kernel: [999566.465931] [<ffffffffa0779a00>] ? ptlrpc_main+0x0/0x1840 [ptlrpc]
Jan 21 09:34:50 snx11026n003 kernel: [999566.473166] [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
Jan 21 09:34:50 snx11026n003 kernel: [999566.479105]
Jan 21 09:34:50 snx11026n003 kernel: [999566.481480] Kernel panic - not syncing: LBUG

The LBUG was hit with a single MDT, which is odd in itself.
The idea was to fix the assertion.
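
For readers of the trace, the failed ASSERTION is a logical implication: a negative rc must come with a reply transno of 0. The sketch below restates only that predicate in standalone C. The function name reconstruct_open_invariant is hypothetical and is not part of the Lustre source; rc and transno stand in for the values the real code reads from the request and from lustre_msg_get_transno(req->rq_repmsg).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the predicate in the failed ASSERTION:
 *     (!(rc < 0) || (transno == 0))
 * which reads as "rc < 0 implies transno == 0".
 * This is an illustration only, not the Lustre implementation.
 */
static bool reconstruct_open_invariant(int rc, uint64_t transno)
{
    return !(rc < 0) || transno == 0;
}

/* Small self-check showing which combinations satisfy the predicate. */
static void reconstruct_open_invariant_selfcheck(void)
{
    /* Success with a transno: allowed. */
    assert(reconstruct_open_invariant(0, 5));
    /* Error with a cleared transno: allowed. */
    assert(reconstruct_open_invariant(-EREMOTE, 0));
    /* Error with a non-zero transno: the predicate is false. */
    assert(!reconstruct_open_invariant(-EREMOTE, 5));
}
```

Under the assumption stated in the description, that an EREMOTE result keeps its transno, the third case is the combination that would make a check of this shape fail. Whether that is what actually happened on this MDS is not established by the trace alone.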

Comment by Keith Mannthey (Inactive) [ 12/Mar/13 ]

How do you know the assertion was wrong?

Comment by Di Wang [ 14/Mar/13 ]

http://review.whamcloud.com/#change,5694

Comment by Andreas Dilger [ 20/Mar/13 ]

Andriy, any information on how this bug was triggered? Was it under testing, or some user load? MDS recovery, network errors, etc?

Comment by Peter Jones [ 25/Mar/13 ]

Landed for 2.4

Comment by Ned Bass [ 22/May/13 ]

Andreas, we hit this last night on a production MDS still running 2.3.63. In case you're still interested I'm attaching the console log and lustre debug log. Not sure what the client load was like at the time. I see a lot of changelog and fid2path activity which is probably from a RobinHood scan.
