[LU-3765] 2.5.0<->2.1.5 interop: sanity test 24u: (mdt_handler.c:224:mdt_lock_pdo_init()) ASSERTION( namelen > 0 ) failed Created: 15/Aug/13 Updated: 24/Oct/13 Resolved: 01/Oct/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.6, Lustre 2.5.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Jian Yu | Assignee: | Lai Siyao |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | HB, mn1 |
| Environment: |
Lustre client build: http://build.whamcloud.com/job/lustre-master/1613/ |
||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9699 |
| Description |
|
sanity test 24u hit the following failure on MDS:
11:06:31:Lustre: DEBUG MARKER: == sanity test 24u: create stripe file == 11:06:31 (1376417191)
11:06:31:LustreError: 13255:0:(mdt_handler.c:224:mdt_lock_pdo_init()) ASSERTION( namelen > 0 ) failed:
11:06:31:LustreError: 13255:0:(mdt_handler.c:224:mdt_lock_pdo_init()) LBUG
11:06:31:Pid: 13255, comm: mdt_01
11:06:31:
11:06:31:Call Trace:
11:06:31: [<ffffffffa04d0785>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
11:06:31: [<ffffffffa04d0d97>] lbug_with_loc+0x47/0xb0 [libcfs]
11:06:31: [<ffffffffa0bdea65>] mdt_lock_pdo_init+0xe5/0xf0 [mdt]
11:06:31: [<ffffffffa0c127c6>] mdt_reint_open+0x1f6/0x2940 [mdt]
11:06:31: [<ffffffffa077b764>] ? lustre_msg_add_version+0x74/0xd0 [ptlrpc]
11:06:32: [<ffffffffa0ba256e>] ? md_ucred+0x1e/0x60 [mdd]
11:06:32: [<ffffffffa0be15d5>] ? mdt_ucred+0x15/0x20 [mdt]
11:06:32: [<ffffffffa0bf84ec>] ? mdt_root_squash+0x2c/0x3e0 [mdt]
11:06:32: [<ffffffffa0bfcc51>] mdt_reint_rec+0x41/0xe0 [mdt]
11:06:32: [<ffffffffa0bf3ed4>] mdt_reint_internal+0x544/0x8e0 [mdt]
11:06:32: [<ffffffffa0bf453d>] mdt_intent_reint+0x1ed/0x500 [mdt]
11:06:32: [<ffffffffa0bf2c09>] mdt_intent_policy+0x379/0x690 [mdt]
11:06:32: [<ffffffffa0737391>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
11:06:32: [<ffffffffa075d1ed>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
11:06:32: [<ffffffffa0bf3586>] mdt_enqueue+0x46/0x130 [mdt]
11:06:32: [<ffffffffa0be8772>] mdt_handle_common+0x932/0x1750 [mdt]
11:06:32: [<ffffffffa0be9665>] mdt_regular_handle+0x15/0x20 [mdt]
11:06:32: [<ffffffffa078bbae>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
11:06:32: [<ffffffffa078af60>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
11:06:32: [<ffffffff8100c0ca>] child_rip+0xa/0x20
11:06:32: [<ffffffffa078af60>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
11:06:32: [<ffffffffa078af60>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
11:06:32: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
11:06:32:
11:06:32:Kernel panic - not syncing: LBUG
Maloo report: https://maloo.whamcloud.com/test_sets/e3f3b3d8-0525-11e3-8d88-52540035b04c
More instances: |
| Comments |
| Comment by Jian Yu [ 15/Aug/13 ] |
|
Lustre client build: http://build.whamcloud.com/job/lustre-b2_4/29/
The sanity test 24u also hung and hit the same LBUG on MDS:
08:33:17:Lustre: DEBUG MARKER: == sanity test 24u: create stripe file == 08:33:10 (1376407990)
08:33:17:LustreError: 13776:0:(mdt_handler.c:224:mdt_lock_pdo_init()) ASSERTION( namelen > 0 ) failed:
08:33:17:LustreError: 13776:0:(mdt_handler.c:224:mdt_lock_pdo_init()) LBUG
08:33:17:Pid: 13776, comm: mdt_00
08:33:18:
08:33:18:Call Trace:
08:33:18: [<ffffffffa0472785>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
08:33:18: [<ffffffffa0472d97>] lbug_with_loc+0x47/0xb0 [libcfs]
08:33:18: [<ffffffffa0b6ea65>] mdt_lock_pdo_init+0xe5/0xf0 [mdt]
08:33:18: [<ffffffffa0ba28a6>] mdt_reint_open+0x1f6/0x2940 [mdt]
08:33:18: [<ffffffffa0715754>] ? lustre_msg_add_version+0x74/0xd0 [ptlrpc]
08:33:18: [<ffffffffa0b3356e>] ? md_ucred+0x1e/0x60 [mdd]
08:33:18: [<ffffffffa0b715d5>] ? mdt_ucred+0x15/0x20 [mdt]
08:33:18: [<ffffffffa0b884ec>] ? mdt_root_squash+0x2c/0x3e0 [mdt]
08:33:18: [<ffffffffa0b8cc51>] mdt_reint_rec+0x41/0xe0 [mdt]
08:33:18: [<ffffffffa0b83ed4>] mdt_reint_internal+0x544/0x8e0 [mdt]
08:33:18: [<ffffffffa0b8453d>] mdt_intent_reint+0x1ed/0x500 [mdt]
08:33:18: [<ffffffffa0b82c09>] mdt_intent_policy+0x379/0x690 [mdt]
08:33:18: [<ffffffffa06d1391>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
08:33:18: [<ffffffffa06f71dd>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
08:33:18: [<ffffffffa0b83586>] mdt_enqueue+0x46/0x130 [mdt]
08:33:18: [<ffffffffa0b78772>] mdt_handle_common+0x932/0x1750 [mdt]
08:33:18: [<ffffffffa0b79665>] mdt_regular_handle+0x15/0x20 [mdt]
08:33:18: [<ffffffffa0725b9e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
08:33:18: [<ffffffffa0724f50>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
08:33:18: [<ffffffff8100c0ca>] child_rip+0xa/0x20
08:33:18: [<ffffffffa0724f50>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
08:33:18: [<ffffffffa0724f50>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
08:33:18: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
08:33:18:
08:33:19:Kernel panic - not syncing: LBUG
Maloo report: https://maloo.whamcloud.com/test_sets/ae6aa048-0569-11e3-b127-52540035b04c
The same test passed on Lustre 2.4.0 client with 2.1.5 server:
This is a regression issue on Lustre b2_4 branch. |
| Comment by Di Wang [ 15/Aug/13 ] |
|
This is clearly caused by This happened with SLES11SP2 Lustre client, which in turn acts as an We detected that whenever we are writing to a new file using, fx, As MDS_OPEN_BY_FID is always set on this request, we never need Signed-off-by: Cheng Shao <cheng_shao@xyratex.com> In this patch, it stops sending name for open by FID and set lovea (test 24u) request. But 2.1.5 server cannot handle 1. fix 2.1.5 server to handle this zero name length issue. Please check the open_by_fid part in mdt_reint_open. |
| Comment by Peter Jones [ 16/Aug/13 ] |
|
Lai, could you please help with this one? Thanks, Peter |
| Comment by Jian Yu [ 16/Aug/13 ] |
|
This is blocking the whole test session on Lustre b2_4 client with 2.1.6 server: |
| Comment by Lai Siyao [ 16/Aug/13 ] |
|
IMO once the client specifies MDS_OPEN_BY_FID, the MDS should never open by name, because the name may be invalid or it will cause inconsistency. If this is true, the MDS open-by-fid code can be simplified a lot. Patch is on http://review.whamcloud.com/#/c/7358/ |
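The rule proposed in this comment can be illustrated with a minimal userspace sketch. Everything here is hypothetical: `SKETCH_OPEN_BY_FID`, `struct sketch_open_req` and `sketch_choose_lookup()` are stand-ins invented for illustration, not the MDT code or the patch under review. The sketch assumes that a request without the flag and without a usable name is rejected rather than asserted on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit; NOT the real MDS_OPEN_BY_FID value. */
#define SKETCH_OPEN_BY_FID (UINT64_C(1) << 0)

enum sketch_lookup {
    SKETCH_LOOKUP_BY_FID,
    SKETCH_LOOKUP_BY_NAME,
    SKETCH_LOOKUP_INVALID
};

/* Hypothetical, simplified view of an open request. */
struct sketch_open_req {
    uint64_t    flags;
    const char *name;
    size_t      namelen;
};

/* Decide how the server would resolve the target of an open. */
enum sketch_lookup sketch_choose_lookup(const struct sketch_open_req *req)
{
    assert(req != NULL);

    /* Flag set: ignore any name the client packed, it may be stale. */
    if (req->flags & SKETCH_OPEN_BY_FID)
        return SKETCH_LOOKUP_BY_FID;

    /* No flag: a name lookup needs a non-empty name. */
    if (req->name != NULL && req->namelen > 0)
        return SKETCH_LOOKUP_BY_NAME;

    /* Assumption: malformed input is rejected instead of hitting an LBUG. */
    return SKETCH_LOOKUP_INVALID;
}
```

With this shape the name is never consulted once the flag is present, which is the simplification the comment argues for.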
| Comment by Patrick Farrell (Inactive) [ 22/Aug/13 ] |
|
Lai, Take a look at my latest in |
| Comment by Lai Siyao [ 28/Aug/13 ] |
|
Patch for master is on: These patches enabled getattr/open-by-fid by default, thus either fid or name is packed in these requests, and server can handle op-by-fid correctly. Once these patches are accepted, they need to be backported to 2.4, and also fix 2.1 server code to maintain 2.5 <-> 2.1 interop. I'll continue working on this. |
| Comment by Peter Jones [ 31/Aug/13 ] |
|
We have reverted |
| Comment by Lai Siyao [ 12/Sep/13 ] |
|
Patch for b2_1 is on: http://review.whamcloud.com/#/c/7627/ Now it's ready to continue interop test between 2.5 and 2.1 with these three patches. |
| Comment by Oleg Drokin [ 01/Oct/13 ] |
|
|
| Comment by Andreas Dilger [ 01/Oct/13 ] |
|
This problem was introduced by the patches for |
| Comment by John Hammond [ 16/Oct/13 ] |
|
The MDT is still generally at the mercy of the client to send valid names. Please see http://review.whamcloud.com/#/c/7961/ from |
| Comment by Cheng Shao (Inactive) [ 23/Oct/13 ] |
|
Lai, for your comment below and the related patch here http://review.whamcloud.com/#/c/7627/,
Vitaly has a different thought - we shouldn't make changes in the old version server code to accommodate features introduced in the newer version. The necessary changes should be done solely in the newer version to fix any interop issues with old versions. I think that is a legitimate comment, although fixing the server side is much simpler in this case. Thoughts? |
| Comment by Patrick Farrell (Inactive) [ 24/Oct/13 ] |
|
While I agree with Vitaly in the abstract, I don't see how we can fix this issue on the 2.4 clients. The open_by_fid code on 2.1 is activated when the namelength is 0. SLES11SP2 changes anonymous dentries to be non-null/non-zero name length, so they no longer hit this check. We tried to fix this by having ll_intent_file_open not pass the name, but hit issues. The problem, as I see it, is that we need to only not send the name in the case of a root dentry. We tried that by using that code only when the DCACHE_DISCONNECTED flag was observed, but that still triggered the assertion on the MDS. It's not clear to me why - I thought DCACHE_DISCONNECTED was unique to the root dentry, but the observed behavior suggests not. The other possibility is that DCACHE_DISCONNECTED does uniquely identify the root dentry, but the zero-length root dentry from CentOS/SLES11SP1 comes across in ll_intent_file_open differently than just a 0 namelen and null pointer for the name, and that this difference is essential in avoiding the MDS crashes. [THIS INFORMATION IS INCORRECT. See my reply to Andreas below.] I'm not sure that's possible. That's why we chose to add MDS_OPEN_BY_FID to 2.1. This is a fairly minor patch to 2.1, as all it adds is this way to force 2.1 to do open_by_fid. |
| Comment by Andreas Dilger [ 24/Oct/13 ] |
|
FYI, DCACHE_DISCONNECTED is used on any NFS inode that is not connected to the namespace (i.e. if NFS client does its own getattr-by-handle operation). To determine the root dentry you should check if dentry == sb->s_root. |
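The distinction drawn here can be shown with a small userspace model. The struct and flag names below are stand-ins that only mirror the kernel field names mentioned in the comment (`s_root`, `d_flags`); they are not the kernel definitions, and `SKETCH_DISCONNECTED` is a made-up flag value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a "disconnected" dentry flag; the value is arbitrary. */
#define SKETCH_DISCONNECTED 0x1u

struct sketch_dentry;

/* Stand-in for a superblock that records its root dentry. */
struct sketch_super_block {
    struct sketch_dentry *s_root;
};

/* Stand-in for a dentry with flags and a back-pointer to its superblock. */
struct sketch_dentry {
    unsigned int               d_flags;
    struct sketch_super_block *d_sb;
};

/* Root test by pointer identity, independent of any flag bits. */
bool sketch_is_root(const struct sketch_dentry *dentry)
{
    assert(dentry != NULL && dentry->d_sb != NULL);
    return dentry == dentry->d_sb->s_root;
}
```

A dentry carrying the disconnected flag is not treated as root unless it is the one the superblock points at, which is why a flag test alone cannot isolate the root dentry.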
| Comment by Patrick Farrell (Inactive) [ 24/Oct/13 ] |
|
Andreas: Thanks for the reminder. Reading it again, I see my comment above is flawed. The change in newer kernels affects all anonymous dentries; I wasn't thinking clearly when I wrote that. We don't need to identify merely the root dentries, this issue applies to all anonymous dentries. The problem is that when we change the names as we did, we fail that namelen-related assertion in 2.1. In retrospect, I think we have succeeded in our goal of isolating anonymous dentries, but perhaps there is a difference between passing NULL in to ll_prep_md_op_data for the name and the anonymous dentry names in 2.6.32 kernels. They appear (I haven't tested, but this is my reading of the code) to be pointers to a string containing nothing but the null terminator, but they aren't actually NULL. I wonder if this difference isn't significant. |
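To make the NULL versus empty-string question concrete, here is a minimal sketch of a check that would treat the two differently. It assumes a rule of the form "a non-NULL name must have namelen > 0, and a NULL name must have namelen == 0"; that rule is an assumption modelled on the assertion text in the log, not a copy of the 2.1 server code. `sketch_name_ok()` and `sketch_normalize_name()` are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed consistency rule for a (name, namelen) pair. */
bool sketch_name_ok(const char *name, size_t namelen)
{
    if (name != NULL)
        return namelen > 0;   /* a pointer to "" with length 0 fails here */
    return namelen == 0;      /* NULL with length 0 passes */
}

/*
 * Hypothetical client-side normalization: collapse an empty,
 * non-NULL name into (NULL, 0) before packing the request.
 */
void sketch_normalize_name(const char **name, size_t *namelen)
{
    assert(name != NULL && namelen != NULL);

    if (*name != NULL && (*namelen == 0 || (*name)[0] == '\0')) {
        *name = NULL;
        *namelen = 0;
    }
}
```

Under the assumed rule, `sketch_name_ok("", 0)` is false while `sketch_name_ok(NULL, 0)` is true, so a pointer to an empty string and a NULL pointer would not be interchangeable unless the caller normalizes first.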