[LU-3765] 2.5.0<->2.1.5 interop: sanity test 24u: (mdt_handler.c:224:mdt_lock_pdo_init()) ASSERTION( namelen > 0 ) failed Created: 15/Aug/13  Updated: 24/Oct/13  Resolved: 01/Oct/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.6, Lustre 2.5.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Jian Yu Assignee: Lai Siyao
Resolution: Duplicate Votes: 0
Labels: HB, mn1
Environment:

Lustre client build: http://build.whamcloud.com/job/lustre-master/1613/
Lustre server build: http://build.whamcloud.com/job/lustre-b2_1/191/ (2.1.5)
Distro/Arch: RHEL6.4/x86_64


Issue Links:
Duplicate
duplicates LU-3544 Writing to new files under NFS export... Closed
Related
is related to LU-2875 Remove LASSERT()s on return values fr... Resolved
is related to LU-3233 tgt_cb_last_committed()) ASSERTION( c... Resolved
Severity: 3
Rank (Obsolete): 9699

 Description   

sanity test 24u hit the following failure on MDS:

11:06:31:Lustre: DEBUG MARKER: == sanity test 24u: create stripe file == 11:06:31 (1376417191)
11:06:31:LustreError: 13255:0:(mdt_handler.c:224:mdt_lock_pdo_init()) ASSERTION( namelen > 0 ) failed: 
11:06:31:LustreError: 13255:0:(mdt_handler.c:224:mdt_lock_pdo_init()) LBUG
11:06:31:Pid: 13255, comm: mdt_01
11:06:31:
11:06:31:Call Trace:
11:06:31: [<ffffffffa04d0785>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
11:06:31: [<ffffffffa04d0d97>] lbug_with_loc+0x47/0xb0 [libcfs]
11:06:31: [<ffffffffa0bdea65>] mdt_lock_pdo_init+0xe5/0xf0 [mdt]
11:06:31: [<ffffffffa0c127c6>] mdt_reint_open+0x1f6/0x2940 [mdt]
11:06:31: [<ffffffffa077b764>] ? lustre_msg_add_version+0x74/0xd0 [ptlrpc]
11:06:32: [<ffffffffa0ba256e>] ? md_ucred+0x1e/0x60 [mdd]
11:06:32: [<ffffffffa0be15d5>] ? mdt_ucred+0x15/0x20 [mdt]
11:06:32: [<ffffffffa0bf84ec>] ? mdt_root_squash+0x2c/0x3e0 [mdt]
11:06:32: [<ffffffffa0bfcc51>] mdt_reint_rec+0x41/0xe0 [mdt]
11:06:32: [<ffffffffa0bf3ed4>] mdt_reint_internal+0x544/0x8e0 [mdt]
11:06:32: [<ffffffffa0bf453d>] mdt_intent_reint+0x1ed/0x500 [mdt]
11:06:32: [<ffffffffa0bf2c09>] mdt_intent_policy+0x379/0x690 [mdt]
11:06:32: [<ffffffffa0737391>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
11:06:32: [<ffffffffa075d1ed>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
11:06:32: [<ffffffffa0bf3586>] mdt_enqueue+0x46/0x130 [mdt]
11:06:32: [<ffffffffa0be8772>] mdt_handle_common+0x932/0x1750 [mdt]
11:06:32: [<ffffffffa0be9665>] mdt_regular_handle+0x15/0x20 [mdt]
11:06:32: [<ffffffffa078bbae>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
11:06:32: [<ffffffffa078af60>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
11:06:32: [<ffffffff8100c0ca>] child_rip+0xa/0x20
11:06:32: [<ffffffffa078af60>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
11:06:32: [<ffffffffa078af60>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
11:06:32: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
11:06:32:
11:06:32:Kernel panic - not syncing: LBUG

Maloo report: https://maloo.whamcloud.com/test_sets/e3f3b3d8-0525-11e3-8d88-52540035b04c

More instances:
https://maloo.whamcloud.com/test_sets/369e054c-0059-11e3-bb00-52540035b04c
https://maloo.whamcloud.com/test_sets/0bf3fdbc-f8f5-11e2-8917-52540035b04c
https://maloo.whamcloud.com/test_sets/59b2a818-f504-11e2-a8f6-52540035b04c



 Comments   
Comment by Jian Yu [ 15/Aug/13 ]

Lustre client build: http://build.whamcloud.com/job/lustre-b2_4/29/
Lustre server build: http://build.whamcloud.com/job/lustre-b2_1/215/ (2.1.6)
Distro/Arch: RHEL6.4/x86_64

The sanity test 24u also hung and hit the same LBUG on MDS:

08:33:17:Lustre: DEBUG MARKER: == sanity test 24u: create stripe file == 08:33:10 (1376407990)
08:33:17:LustreError: 13776:0:(mdt_handler.c:224:mdt_lock_pdo_init()) ASSERTION( namelen > 0 ) failed: 
08:33:17:LustreError: 13776:0:(mdt_handler.c:224:mdt_lock_pdo_init()) LBUG
08:33:17:Pid: 13776, comm: mdt_00
08:33:18:
08:33:18:Call Trace:
08:33:18: [<ffffffffa0472785>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
08:33:18: [<ffffffffa0472d97>] lbug_with_loc+0x47/0xb0 [libcfs]
08:33:18: [<ffffffffa0b6ea65>] mdt_lock_pdo_init+0xe5/0xf0 [mdt]
08:33:18: [<ffffffffa0ba28a6>] mdt_reint_open+0x1f6/0x2940 [mdt]
08:33:18: [<ffffffffa0715754>] ? lustre_msg_add_version+0x74/0xd0 [ptlrpc]
08:33:18: [<ffffffffa0b3356e>] ? md_ucred+0x1e/0x60 [mdd]
08:33:18: [<ffffffffa0b715d5>] ? mdt_ucred+0x15/0x20 [mdt]
08:33:18: [<ffffffffa0b884ec>] ? mdt_root_squash+0x2c/0x3e0 [mdt]
08:33:18: [<ffffffffa0b8cc51>] mdt_reint_rec+0x41/0xe0 [mdt]
08:33:18: [<ffffffffa0b83ed4>] mdt_reint_internal+0x544/0x8e0 [mdt]
08:33:18: [<ffffffffa0b8453d>] mdt_intent_reint+0x1ed/0x500 [mdt]
08:33:18: [<ffffffffa0b82c09>] mdt_intent_policy+0x379/0x690 [mdt]
08:33:18: [<ffffffffa06d1391>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
08:33:18: [<ffffffffa06f71dd>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
08:33:18: [<ffffffffa0b83586>] mdt_enqueue+0x46/0x130 [mdt]
08:33:18: [<ffffffffa0b78772>] mdt_handle_common+0x932/0x1750 [mdt]
08:33:18: [<ffffffffa0b79665>] mdt_regular_handle+0x15/0x20 [mdt]
08:33:18: [<ffffffffa0725b9e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
08:33:18: [<ffffffffa0724f50>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
08:33:18: [<ffffffff8100c0ca>] child_rip+0xa/0x20
08:33:18: [<ffffffffa0724f50>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
08:33:18: [<ffffffffa0724f50>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
08:33:18: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
08:33:18:
08:33:19:Kernel panic - not syncing: LBUG

Maloo report: https://maloo.whamcloud.com/test_sets/ae6aa048-0569-11e3-b127-52540035b04c

The same test passed on Lustre 2.4.0 client with 2.1.5 server:
https://maloo.whamcloud.com/sub_tests/397da174-c63d-11e2-ad5d-52540035b04c

This is a regression on the Lustre b2_4 branch.

Comment by Di Wang [ 15/Aug/13 ]

This is clearly caused by

LU-3544 nfs: writing to new files will return ENOENT

This happened with a SLES11SP2 Lustre client, which in turn acts as an
NFS server, exporting a subtree of a Lustre fs through NFS.

We detected that whenever we are writing to a new file using, e.g.,
'echo blah > newfile', it will return ENOENT error. We found
out that this was caused by the anonymous dentry. In SLES11SP2,
anonymous dentries are assigned '/' as the name, instead of an
empty string. When MDT handles the intent_open call, it will look
up the obj by the name if it is not an empty string, and thus
couldn't find it.

As MDS_OPEN_BY_FID is always set on this request, we never need
to send the name in this request. The fid is already available
and should be used in case the file has been renamed.

Signed-off-by: Cheng Shao <cheng_shao@xyratex.com>
Signed-off-by: Patrick Farrell <paf@cray.com>
Change-Id: Ia8bd6f2814d05350d0a197df8a3ffd9729e2081b
Reviewed-on: http://review.whamcloud.com/6920
Reviewed-by: Bob Glossman <bob.glossman@intel.com>
Tested-by: Hudson
Reviewed-by: Alexey Shvetsov <alexxy@gentoo.org>
Reviewed-by: Lai Siyao <lai.siyao@intel.com>
Tested-by: Maloo <whamcloud.maloo@gmail.com>
Reviewed-by: James Simmons <uja.ornl@gmail.com>
Reviewed-by: Oleg Drokin <oleg.drokin@intel.com>

This patch stops the client from sending the name for open-by-FID requests and for the set-lovea request (test 24u), but the 2.1.5 server cannot handle this correctly. So we can either:

1. fix the 2.1.5 server to handle the zero name length; please check the open_by_fid part of mdt_reint_open.
2. or fix the b2_4 client to add the open lock flag to the lovea-setting request, which can avoid the problem as well, IMHO.

Comment by Peter Jones [ 16/Aug/13 ]

Lai

Could you please help with this one?

Thanks

peter

Comment by Jian Yu [ 16/Aug/13 ]

This is blocking the whole test session on Lustre b2_4 client with 2.1.6 server:
https://maloo.whamcloud.com/test_sessions/ac905704-0569-11e3-b127-52540035b04c

Comment by Lai Siyao [ 16/Aug/13 ]

IMO once the client has specified MDS_OPEN_BY_FID, the MDS should never open by name, because the name may be invalid, or it will cause inconsistency. If this is true, the MDS open-by-FID code can be simplified a lot.

Patch is on http://review.whamcloud.com/#/c/7358/
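As a rough illustration of the idea in this comment (it is not the patch at the URL above), the dispatch can be modelled in stand-alone C. Everything here is hypothetical: `sketch_open_req`, `sketch_choose_open_path()` and the `SKETCH_OPEN_BY_FID` bit are invented names, and the flag value is not taken from the Lustre headers. The sketch assumes the server returns an error for a nameless by-name open instead of asserting.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical flag bit standing in for MDS_OPEN_BY_FID; value is made up. */
#define SKETCH_OPEN_BY_FID 0x1u

enum sketch_open_path {
    SKETCH_PATH_BY_FID,   /* resolve the object from the FID only */
    SKETCH_PATH_BY_NAME   /* look the object up by name in the parent */
};

/* Hypothetical, simplified view of an open request as seen by the server. */
struct sketch_open_req {
    unsigned int flags;
    const char *name;     /* may be NULL when the client sends no name */
    size_t namelen;
};

/*
 * Decide how to open. If the by-FID flag is set, the name is ignored
 * entirely. Otherwise a usable name is required, and a missing one is
 * reported as -EINVAL rather than treated as an assertion failure.
 */
int sketch_choose_open_path(const struct sketch_open_req *req,
                            enum sketch_open_path *path)
{
    if (req->flags & SKETCH_OPEN_BY_FID) {
        *path = SKETCH_PATH_BY_FID;
        return 0;
    }
    if (req->name == NULL || req->namelen == 0)
        return -EINVAL;
    *path = SKETCH_PATH_BY_NAME;
    return 0;
}
```

In this model a request with the flag set takes the by-FID path whether or not a name is packed, so a stale or placeholder name such as '/' cannot influence the result.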

Comment by Patrick Farrell (Inactive) [ 22/Aug/13 ]

Lai,

Take a look at my latest in LU-3544. I think it belongs there and not here, but it could go in either. It's about the problems with the proposed patch for LU-3765.

Comment by Lai Siyao [ 28/Aug/13 ]

Patch for master is on:
http://review.whamcloud.com/#/c/7475/
http://review.whamcloud.com/#/c/7476/

These patches enable getattr/open-by-fid by default, so either the FID or the name is packed in these requests, and the server can handle op-by-fid correctly.

Once these patches are accepted, they need to be backported to 2.4, and the 2.1 server code also needs to be fixed to maintain 2.5 <-> 2.1 interop. I'll continue working on this.

Comment by Peter Jones [ 31/Aug/13 ]

We have reverted LU-3544 from b2_4 for now but are continuing to work on a more complete fix on master

Comment by Lai Siyao [ 12/Sep/13 ]

Patch for b2_1 is on: http://review.whamcloud.com/#/c/7627/

Interop testing between 2.5 and 2.1 can now continue with these three patches.

Comment by Oleg Drokin [ 01/Oct/13 ]

LU-3544 was reverted from master as well

Comment by Andreas Dilger [ 01/Oct/13 ]

This problem was introduced by the patches for LU-3544, and is no longer an issue now that the patch has been reverted.

Comment by John Hammond [ 16/Oct/13 ]

The MDT is still generally at the mercy of the client to send valid names. Please see http://review.whamcloud.com/#/c/7961/ from LU-2875.

Comment by Cheng Shao (Inactive) [ 23/Oct/13 ]

Lai, for your comment below and the related patch here http://review.whamcloud.com/#/c/7627/,

Once these patches are accepted, they need to be backported to 2.4, and the 2.1 server code also needs to be fixed to maintain 2.5 <-> 2.1 interop. I'll continue working on this.

Vitaly has a different view: we shouldn't change the old-version server code to accommodate features introduced in a newer version. The necessary changes should be made solely in the newer version to fix any interop issues with old versions. I think that is a legitimate comment, although fixing the server side is much simpler in this case. Thoughts?

Comment by Patrick Farrell (Inactive) [ 24/Oct/13 ]

While I agree with Vitaly in the abstract, I don't see how we can fix this issue on the 2.4 clients.

The open_by_fid code on 2.1 is activated when the name length is 0. SLES11SP2 changes anonymous dentries to have a non-null name with non-zero length, so they no longer hit this check. We tried to fix this by having ll_intent_file_open not pass the name, but hit issues. The problem, as I see it, is that we need to omit the name only in the case of a root dentry.

We tried that by using that code only when the DCACHE_DISCONNECTED flag was observed, but that still triggered the assertion on the MDS. It's not clear to me why - I thought DCACHE_DISCONNECTED was unique to the root dentry, but the observed behavior suggests not.

The other possibility is that DCACHE_DISCONNECTED does uniquely identify the root dentry, but the zero-length root dentry from CentOS/SLES11SP1 comes across in ll_intent_file_open differently than just a 0 namelen and null pointer for the name, and that this difference is essential in avoiding the MDS crashes.

[THIS INFORMATION IS INCORRECT. See my reply to Andreas below.]
So the problem remains: In order to achieve a 2.4/2.5 only fix, we must somehow force open_by_fid on 2.1 servers for root dentries, and only for root dentries. And we need to do it in a way that does not crash the 2.1 server, or cause other problems on 2.4 because of the nulled file name.
[^-- This is not only for root dentries, it is for all anonymous dentries.]

I'm not sure that's possible. That's why we chose to add MDS_OPEN_BY_FID to 2.1. This is a fairly minor patch to 2.1, as all it adds is this way to force 2.1 to do open_by_fid.

Comment by Andreas Dilger [ 24/Oct/13 ]

FYI, DCACHE_DISCONNECTED is used on any NFS inode that is not connected to the namespace (i.e. if the NFS client does its own getattr-by-handle operation). To determine the root dentry you should check if dentry == sb->s_root.
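The distinction can be shown with a small user-space mock. This is only a sketch: the `mock_` structs mimic the `d_flags`, `d_sb` and `s_root` fields of the kernel's dentry and super_block but are not the kernel definitions, and the `MOCK_DCACHE_DISCONNECTED` value is an arbitrary placeholder.

```c
#include <stdbool.h>
#include <stddef.h>

/* Placeholder bit; the real DCACHE_DISCONNECTED value lives in the kernel. */
#define MOCK_DCACHE_DISCONNECTED 0x1u

struct mock_dentry;

/* Mock of the one super_block field used here. */
struct mock_super_block {
    struct mock_dentry *s_root;   /* root dentry of the filesystem */
};

/* Mock of the two dentry fields used here. */
struct mock_dentry {
    unsigned int d_flags;
    struct mock_super_block *d_sb;
};

/* Root test as suggested in the comment: compare against sb->s_root. */
bool mock_is_root(const struct mock_dentry *dentry)
{
    return dentry == dentry->d_sb->s_root;
}

/* Flag test: true for any dentry marked as disconnected, root or not. */
bool mock_is_disconnected(const struct mock_dentry *dentry)
{
    return (dentry->d_flags & MOCK_DCACHE_DISCONNECTED) != 0;
}
```

A dentry obtained by file handle can carry the flag without being the root, so the two predicates answer different questions and the flag test alone cannot single out the root dentry.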

Comment by Patrick Farrell (Inactive) [ 24/Oct/13 ]

Andreas: Thanks for the reminder.

Reading it again, I see my comment above is flawed. The change in newer kernels affects all anonymous dentries; I wasn't thinking clearly when I wrote that. We don't need to identify merely the root dentries; this issue applies to all anonymous dentries.

The problem is that when we change the names as we did, we fail that NAMELEN related assertion in 2.1. In retrospect, I think we have succeeded in our goal of isolating anonymous dentries, but perhaps there is a difference between passing NULL in to ll_prep_md_op_data for the name and the anonymous dentry names in 2.6.32 kernels. They appear (I haven't tested, but this is my reading of the code) to be pointers to a string containing nothing but the null terminator, but they aren't actually NULL.

I wonder if this difference isn't significant.
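To make the question concrete, here is a minimal sketch of how a NULL name, an empty string and the '/' name behave under a pointer check versus a length check. The helper names are hypothetical and are not Lustre or kernel functions; the sketch only assumes that a receiver treats NULL as "no name".

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Name length as computed by a receiver that treats NULL as "no name". */
size_t sketch_namelen(const char *name)
{
    return name != NULL ? strlen(name) : 0;
}

/* Check keyed on the pointer: true only when there is no buffer at all. */
bool sketch_name_absent(const char *name)
{
    return name == NULL;
}

/* Check keyed on the length: true for NULL and for "" alike. */
bool sketch_name_empty(const char *name)
{
    return sketch_namelen(name) == 0;
}
```

NULL and "" agree under the length check but differ under the pointer check, while "/" (the anonymous dentry name reported for SLES11SP2 earlier in this ticket) has length 1 and passes neither. Whether the code paths discussed here key on the pointer or on the length is exactly the open question, and this sketch does not answer it.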
