[LU-2927] mdt_reconstruct_open() ASSERTION failure Created: 07/Mar/13 Updated: 22/May/13 Resolved: 25/Mar/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.4.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Andriy Skulysh | Assignee: | Di Wang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | LB, mq213, patch |
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 7028 |
| Description |
|
The transno shouldn't be cleared for an EREMOTE operation. |
| Comments |
| Comment by Andriy Skulysh [ 07/Mar/13 ] |
| Comment by Andreas Dilger [ 07/Mar/13 ] |
|
How was this bug hit, and how often is it seen? Ideally there would also be a test case for this. |
| Comment by Andreas Dilger [ 08/Mar/13 ] |
|
Dropping this from the blocker list. The patch is incorrect and we have no information about how this bug was hit or the symptoms of the failure (stack trace, error logs, etc), or how often it is hit, so no way to know how common or rare the problem is. |
| Comment by Andriy Skulysh [ 11/Mar/13 ] |
|
Original call trace:

Jan 21 09:34:50 snx11026n003 kernel: [999566.276882] LustreError: 129089:0:(mdt_open.c:1041:mdt_reconstruct_open()) ASSERTION( (!(rc < 0) || (lustre_msg_get_transno(req->rq_repmsg) == 0)) ) failed:
Jan 21 09:34:50 snx11026n003 kernel: [999566.293035] LustreError: 129089:0:(mdt_open.c:1041:mdt_reconstruct_open()) LBUG
Jan 21 09:34:50 snx11026n003 kernel: [999566.301506] Pid: 129089, comm: mdt_500
Jan 21 09:34:50 snx11026n003 kernel: [999566.305899]
Jan 21 09:34:50 snx11026n003 kernel: [999566.305900] Call Trace:
Jan 21 09:34:50 snx11026n003 kernel: [999566.310714] [<ffffffffa0498825>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Jan 21 09:34:50 snx11026n003 kernel: [999566.318711] [<ffffffffa0498e37>] lbug_with_loc+0x47/0xb0 [libcfs]
Jan 21 09:34:50 snx11026n003 kernel: [999566.325867] [<ffffffffa0c9de47>] mdt_reconstruct_open+0x7c7/0xa80 [mdt]
Jan 21 09:34:50 snx11026n003 kernel: [999566.333590] [<ffffffffa0c8f7c5>] mdt_reconstruct+0x45/0x120 [mdt]
Jan 21 09:34:50 snx11026n003 kernel: [999566.340724] [<ffffffffa0c7bd59>] mdt_reint_internal+0x709/0x8f0 [mdt]
Jan 21 09:34:50 snx11026n003 kernel: [999566.348248] [<ffffffffa0c7c20d>] mdt_intent_reint+0x1ed/0x500 [mdt]
Jan 21 09:34:50 snx11026n003 kernel: [999566.355579] [<ffffffffa0c7add9>] mdt_intent_policy+0x369/0x680 [mdt]
Jan 21 09:34:50 snx11026n003 kernel: [999566.363043] [<ffffffffa0724bc1>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
Jan 21 09:34:50 snx11026n003 kernel: [999566.370801] [<ffffffffa074b5fa>] ldlm_handle_enqueue0+0x48a/0xf40 [ptlrpc]
Jan 21 09:34:50 snx11026n003 kernel: [999566.378815] [<ffffffffa0c7b246>] mdt_enqueue+0x46/0x130 [mdt]
Jan 21 09:34:50 snx11026n003 kernel: [999566.385546] [<ffffffffa0c709f2>] mdt_handle_common+0x922/0x1760 [mdt]
Jan 21 09:34:50 snx11026n003 kernel: [999566.393062] [<ffffffffa0c71905>] mdt_regular_handle+0x15/0x20 [mdt]
Jan 21 09:34:50 snx11026n003 kernel: [999566.400428] [<ffffffffa0778e3a>] ptlrpc_server_handle_request+0x43a/0x1000 [ptlrpc]
Jan 21 09:34:50 snx11026n003 kernel: [999566.409382] [<ffffffffa049957e>] ? cfs_timer_arm+0xe/0x10 [libcfs]
Jan 21 09:34:50 snx11026n003 kernel: [999566.416603] [<ffffffffa04a69bf>] ? lc_watchdog_touch+0x6f/0x180 [libcfs]
Jan 21 09:34:50 snx11026n003 kernel: [999566.424462] [<ffffffffa0771ec0>] ? ptlrpc_wait_event+0xb0/0x2b0 [ptlrpc]
Jan 21 09:34:50 snx11026n003 kernel: [999566.432274] [<ffffffff8104f7a3>] ? __wake_up+0x53/0x70
Jan 21 09:34:50 snx11026n003 kernel: [999566.438368] [<ffffffffa077a27a>] ptlrpc_main+0x87a/0x1840 [ptlrpc]
Jan 21 09:34:50 snx11026n003 kernel: [999566.445644] [<ffffffffa0779a00>] ? ptlrpc_main+0x0/0x1840 [ptlrpc]
Jan 21 09:34:50 snx11026n003 kernel: [999566.452859] [<ffffffff8100c1ca>] child_rip+0xa/0x20
Jan 21 09:34:50 snx11026n003 kernel: [999566.458677] [<ffffffffa0779a00>] ? ptlrpc_main+0x0/0x1840 [ptlrpc]
Jan 21 09:34:50 snx11026n003 kernel: [999566.465931] [<ffffffffa0779a00>] ? ptlrpc_main+0x0/0x1840 [ptlrpc]
Jan 21 09:34:50 snx11026n003 kernel: [999566.473166] [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
Jan 21 09:34:50 snx11026n003 kernel: [999566.479105]
Jan 21 09:34:50 snx11026n003 kernel: [999566.481480] Kernel panic - not syncing: LBUG

The LBUG was hit with a single MDT, which is weird in itself. |
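
For readers following the trace, the predicate in the failed ASSERTION is a logical implication: a negative rc must come with a zero transno in the reply. The sketch below restates that predicate in plain userspace C. The names `reply_state_consistent` and `assert_reply_state` are hypothetical helpers for illustration only, not Lustre functions, and the sketch does not reproduce the actual logic in mdt_reconstruct_open().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not Lustre code. Mirrors the predicate printed in
 * the console log: (!(rc < 0) || (transno == 0)).
 * Read as an implication: rc < 0  =>  transno == 0.
 */
bool reply_state_consistent(int rc, uint64_t transno)
{
	return !(rc < 0) || transno == 0;
}

/*
 * Userspace stand-in for an assertion on that predicate. In the kernel
 * trace above the failure led to an LBUG; here assert() simply aborts.
 */
void assert_reply_state(int rc, uint64_t transno)
{
	assert(reply_state_consistent(rc, transno));
}
```

Under this predicate, only the combination of a negative rc with a non-zero transno fails; a successful rc passes with any transno value.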
| Comment by Keith Mannthey (Inactive) [ 12/Mar/13 ] |
|
How do you know the assertion was wrong? |
| Comment by Di Wang [ 14/Mar/13 ] |
| Comment by Andreas Dilger [ 20/Mar/13 ] |
|
Andriy, any information on how this bug was triggered? Was it under testing, or some user load? MDS recovery, network errors, etc? |
| Comment by Peter Jones [ 25/Mar/13 ] |
|
Landed for 2.4 |
| Comment by Ned Bass [ 22/May/13 ] |
|
Andreas, we hit this last night on a production MDS still running 2.3.63. In case you're still interested I'm attaching the console log and lustre debug log. Not sure what the client load was like at the time. I see a lot of changelog and fid2path activity which is probably from a RobinHood scan. |