[LU-4205] Lustre 2.4 API setstripe on a 2.1.5 server causes LBUG ASSERTION( namelen > 0 ) Created: 04/Nov/13  Updated: 14/Nov/13  Resolved: 14/Nov/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.1.5
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Mahmoud Hanafi Assignee: Zhenyu Xu
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Client: SLES11 SP2, Lustre 2.4.0-3nas
Server: CentOS 6, Lustre 2.1.5-2nas
Source at git://github.com/jlan/lustre-nas.git


Severity: 3
Rank (Obsolete): 11433

 Description   

The following code, compiled on a SLES11 SP2 Lustre 2.4.0-3nas client, will cause a Lustre 2.1.5 MDT server to LBUG.

--- bug.c ---
// gcc bug_endeavour_stripe.c -Wl,-Bstatic -llustreapi -Wl,-Bdynamic
#include <stdio.h>
#include <lustre/liblustreapi.h>

int main(int argc, char *argv[])
{
        /* set stripe count to 2 with the default stripe size */
        if (llapi_file_create(argv[1], 0, -1, 2, 0))
                perror("problem");
        return 0;
}
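
For clarity, here is a slightly expanded variant of the reproducer. This is a sketch under two assumptions: that llapi_file_create() follows the usual liblustreapi prototype (name, stripe_size, stripe_offset, stripe_count, stripe_pattern), and that it reports errors via the conventional negative-errno return.

#include <stdio.h>
#include <string.h>
#include <lustre/liblustreapi.h>

int main(int argc, char *argv[])
{
        int rc;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <path-on-lustre-fs>\n", argv[0]);
                return 1;
        }
        /* stripe_size    = 0  -> filesystem default
         * stripe_offset  = -1 -> start on any OST
         * stripe_count   = 2
         * stripe_pattern = 0  -> default (RAID0) */
        rc = llapi_file_create(argv[1], 0, -1, 2, 0);
        if (rc) {
                /* llapi calls conventionally return a negative errno */
                fprintf(stderr, "llapi_file_create(%s): rc = %d (%s)\n",
                        argv[1], rc, strerror(rc < 0 ? -rc : rc));
                return 1;
        }
        return 0;
}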


LustreError: 3291:0:(mdt_handler.c:224:mdt_lock_pdo_init()) ASSERTION( namelen > 0 ) failed:
LustreError: 3291:0:(mdt_handler.c:224:mdt_lock_pdo_init()) LBUG
Pid: 3291, comm: mdt_01

Call Trace:
[<ffffffffa0340785>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa0340d97>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa0bb3a65>] mdt_lock_pdo_init+0xe5/0xf0 [mdt]
[<ffffffffa0be77b8>] mdt_reint_open+0x1f8/0x28a0 [mdt]
[<ffffffffa0634724>] ? lustre_msg_add_version+0x74/0xd0 [ptlrpc]
[<ffffffffa0b7856e>] ? md_ucred+0x1e/0x60 [mdd]
[<ffffffffa0bb65d5>] ? mdt_ucred+0x15/0x20 [mdt]
[<ffffffffa0bcd51c>] ? mdt_root_squash+0x2c/0x3e0 [mdt]
[<ffffffffa0bd1c81>] mdt_reint_rec+0x41/0xe0 [mdt]
[<ffffffffa0bc8ed4>] mdt_reint_internal+0x544/0x8e0 [mdt]
[<ffffffffa0bc953d>] mdt_intent_reint+0x1ed/0x530 [mdt]

<0>LustreError: 2167:0:(mdt_handler.c:224:mdt_lock_pdo_init()) ASSERTION( namelen > 0 ) failed:
<0>LustreError: 2167:0:(mdt_handler.c:224:mdt_lock_pdo_init()) LBUG
<4>Pid: 2167, comm: mdt_00
<4>
<4>Call Trace:
<4> [<ffffffffa052f785>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa052fd97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0cf6a65>] mdt_lock_pdo_init+0xe5/0xf0 [mdt]
<4> [<ffffffffa0d2a7b8>] mdt_reint_open+0x1f8/0x28a0 [mdt]
<4> [<ffffffffa07d2724>] ? lustre_msg_add_version+0x74/0xd0 [ptlrpc]
<4> [<ffffffffa0ca456e>] ? md_ucred+0x1e/0x60 [mdd]
<4> [<ffffffffa0cf95d5>] ? mdt_ucred+0x15/0x20 [mdt]
<4> [<ffffffffa0d1051c>] ? mdt_root_squash+0x2c/0x3e0 [mdt]
<4> [<ffffffffa0d14c81>] mdt_reint_rec+0x41/0xe0 [mdt]
<4> [<ffffffffa0d0bed4>] mdt_reint_internal+0x544/0x8e0 [mdt]
<4> [<ffffffffa0d0c53d>] mdt_intent_reint+0x1ed/0x530 [mdt]
<4> [<ffffffffa0d0ac09>] mdt_intent_policy+0x379/0x690 [mdt]
<4> [<ffffffffa078e351>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
<4> [<ffffffffa07b41ad>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
<4> [<ffffffffa0d0b586>] mdt_enqueue+0x46/0x130 [mdt]
<4> [<ffffffffa0d00772>] mdt_handle_common+0x932/0x1750 [mdt]
<4> [<ffffffffa0d01665>] mdt_regular_handle+0x15/0x20 [mdt]
<4> [<ffffffffa07e2b4e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
<4> [<ffffffff811a65d0>] ? end_bio_bh_io_sync+0x0/0x60
<4> [<ffffffffa07e1f00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
<4> [<ffffffff8100c0ca>] child_rip+0xa/0x20
<4> [<ffffffffa07e1f00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
<4> [<ffffffffa07e1f00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
<4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 2167, comm: mdt_00 Not tainted 2.6.32-279.19.1.el6.20130516.x86_64.lustre215 #1
<4>Call Trace:
<4> [<ffffffff8151c027>] ? panic+0xa0/0x189
<4> [<ffffffffa052fdeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
<4> [<ffffffffa0cf6a65>] ? mdt_lock_pdo_init+0xe5/0xf0 [mdt]
<4> [<ffffffffa0d2a7b8>] ? mdt_reint_open+0x1f8/0x28a0 [mdt]
<4> [<ffffffffa07d2724>] ? lustre_msg_add_version+0x74/0xd0 [ptlrpc]
<4> [<ffffffffa0ca456e>] ? md_ucred+0x1e/0x60 [mdd]
<4> [<ffffffffa0cf95d5>] ? mdt_ucred+0x15/0x20 [mdt]
<4> [<ffffffffa0d1051c>] ? mdt_root_squash+0x2c/0x3e0 [mdt]
<4> [<ffffffffa0d14c81>] ? mdt_reint_rec+0x41/0xe0 [mdt]
<4> [<ffffffffa0d0bed4>] ? mdt_reint_internal+0x544/0x8e0 [mdt]
<4> [<ffffffffa0d0c53d>] ? mdt_intent_reint+0x1ed/0x530 [mdt]
<4> [<ffffffffa0d0ac09>] ? mdt_intent_policy+0x379/0x690 [mdt]
<4> [<ffffffffa078e351>] ? ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
<4> [<ffffffffa07b41ad>] ? ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
<4> [<ffffffffa0d0b586>] ? mdt_enqueue+0x46/0x130 [mdt]
<4> [<ffffffffa0d00772>] ? mdt_handle_common+0x932/0x1750 [mdt]
<4> [<ffffffffa0d01665>] ? mdt_regular_handle+0x15/0x20 [mdt]
<4> [<ffffffffa07e2b4e>] ? ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
<4> [<ffffffff811a65d0>] ? end_bio_bh_io_sync+0x0/0x60
<4> [<ffffffffa07e1f00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
<4> [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
<4> [<ffffffffa07e1f00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
<4> [<ffffffffa07e1f00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
<4> [<ffffffff8100c0c0>] ? child_rip+0x0/0x20



 Comments   
Comment by Peter Jones [ 05/Nov/13 ]

Bobijam

Could you please advise on this issue?

Thanks

Peter

Comment by Zhenyu Xu [ 05/Nov/13 ]

Would you mind trying this debug patch? This is for the 2.1.x server code.

http://review.whamcloud.com/8175

Comment by Mahmoud Hanafi [ 06/Nov/13 ]

No luck...

LustreError: 2987:0:(mdt_internal.h:789:mdt_name()) ASSERTION( namelen > 0 ) failed:
LustreError: 2987:0:(mdt_internal.h:789:mdt_name()) LBUG
Call Trace:
[<ffffffffa0340785>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]

Entering kdb (current=0xffff8807f89b7500, pid 2987) on processor 5 Oops: (null)
due to oops @ 0x0

[<ffffffffa0340785>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa0340d97>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa0bd7d8f>] mdt_reint_open+0x27cf/0x28f0 [mdt]
[<ffffffffa0634724>] ? lustre_msg_add_version+0x74/0xd0 [ptlrpc]
[<ffffffffa0b6656e>] ? md_ucred+0x1e/0x60 [mdd]
[<ffffffffa0ba45d5>] ? mdt_ucred+0x15/0x20 [mdt]
[<ffffffffa0bbb51c>] ? mdt_root_squash+0x2c/0x3e0 [mdt]
[<ffffffffa0bbfc81>] mdt_reint_rec+0x41/0xe0 [mdt]
[<ffffffffa0bb6ed4>] mdt_reint_internal+0x544/0x8e0 [mdt]
[<ffffffffa0bb753d>] mdt_intent_reint+0x1ed/0x530 [mdt]
[<ffffffffa0bb5e49>] mdt_intent_policy+0x379/0x690 [mdt]
[<ffffffffa05f0351>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
[<ffffffffa06161ad>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
[<ffffffffa0bb59e6>] mdt_enqueue+0x46/0x130 [mdt]
[<ffffffffa0bab772>] mdt_handle_common+0x932/0x1750 [mdt]
[<ffffffffa0bac665>] mdt_regular_handle+0x15/0x20 [mdt]
[<ffffffffa0644b4e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
[<ffffffffa0643f00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa0643f00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffffa0643f00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Kernel panic - not syncing: LBUG
Pid: 2930, comm: mdt_01 Not tainted 2.6.32-279.19.1.el6.20130516.x86_64.lustre215 #1
Call Trace:
[<ffffffff8151c027>] ? panic+0xa0/0x189
[<ffffffffa0340deb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
[<ffffffffa0bd7d8f>] ? mdt_reint_open+0x27cf/0x28f0 [mdt]
[<ffffffffa0634724>] ? lustre_msg_add_version+0x74/0xd0 [ptlrpc]
[<ffffffffa0b6656e>] ? md_ucred+0x1e/0x60 [mdd]
[<ffffffffa0ba45d5>] ? mdt_ucred+0x15/0x20 [mdt]
[<ffffffffa0bbb51c>] ? mdt_root_squash+0x2c/0x3e0 [mdt]
[<ffffffffa0bbfc81>] ? mdt_reint_rec+0x41/0xe0 [mdt]
[<ffffffffa0bb6ed4>] ? mdt_reint_internal+0x544/0x8e0 [mdt]
[<ffffffffa0bb753d>] ? mdt_intent_reint+0x1ed/0x530 [mdt]
[<ffffffffa0bb5e49>] ? mdt_intent_policy+0x379/0x690 [mdt]
[<ffffffffa05f0351>] ? ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
[<ffffffffa06161ad>] ? ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
[<ffffffffa0bb59e6>] ? mdt_enqueue+0x46/0x130 [mdt]
[<ffffffffa0bab772>] ? mdt_handle_common+0x932/0x1750 [mdt]
[<ffffffffa0bac665>] ? mdt_regular_handle+0x15/0x20 [mdt]
[<ffffffffa0644b4e>] ? ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
[<ffffffffa0643f00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c0ca>] ? child_rip+0xa/0x20
[<ffffffffa0643f00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffffa0643f00>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Comment by Zhenyu Xu [ 07/Nov/13 ]

An update on what I've tried: I could not reproduce it.

test code: bug_endeavour_stripe.c

server   client   result
------   ------   ------
2.1.5    2.1.5    passed
2.1.5    2.5      passed
2.1.5    2.4.1    passed
2.1.5    2.4.0    passed

Comment by Mahmoud Hanafi [ 07/Nov/13 ]

Was your client SLES11 SP2?

Also, you may notice that with the patch it LBUGs at a different location.

Comment by Zhenyu Xu [ 08/Nov/13 ]

No, my client is RHEL6. I'll try installing a SLES11 SP2 client to run the test.

The LBUG after applying the patch reveals the same underlying error: the MDS did not receive the filename it was supposed to get. There are two points where this assertion is checked, and both are visible in the traces in this ticket: mdt_lock_pdo_init() (mdt_handler.c:224, the original trace) and mdt_name() (mdt_internal.h:789, the trace with the debug patch applied).

Comment by Bob Glossman (Inactive) [ 08/Nov/13 ]

I have been trying to reproduce the reported server panic. So far I have tried both current (2.5.50) and old (2.1.5) CentOS servers. Clients have been current (2.5.50), latest b2_4 (2.4.1+), and old (2.4.1) SLES11 SP2 clients. All but one combination worked fine.

The one combination that didn't work, a 2.5.50 SLES11 SP2 client on a 2.1.5 server, didn't panic the server. It just gave a client error like the following:

# ./bug /mnt/lustre/zzz
error on ioctl 0x4008669a for '/mnt/lustre/zzz' (3): Inappropriate ioctl for device
problem: Inappropriate ioctl for device

I did try some CentOS clients as well, just because I had them handy. They all worked fine too.

In no case was I able to produce a server panic as reported.

Comment by Bob Glossman (Inactive) [ 09/Nov/13 ]

The failure case I reported above was incorrect. At the time I had my Lustre fs mounted at /mnt/l2, not /mnt/lustre, so the error shown was the result of running the reproducer on a path not in a Lustre fs.

Repeating the test with the correct path, it worked fine too. So to sum up: all the combinations I tried succeeded, none generated errors, and none panicked servers.

Comment by Mahmoud Hanafi [ 11/Nov/13 ]

This is where we are crashing.

CLIENT CODE:

int llapi_file_open_pool(const char *name, int flags, int mode,
                         unsigned long long stripe_size, int stripe_offset,
                         int stripe_count, int stripe_pattern, char *pool_name)
{
        struct lov_user_md_v3 lum;
        ...
        if (ioctl(fd, LL_IOC_LOV_SETSTRIPE, &lum)) {   <====== CRASH !!!!
        ...
}

(gdb) print lum
$5 = {lmm_magic = 198249424, lmm_pattern = 0,
  lmm_oi = {{oi = {oi_id = 0, oi_seq = 0},
             oi_fid = {f_seq = 0, f_oid = 0, f_ver = 0}}},
  lmm_stripe_size = 0, lmm_stripe_count = 2,
  {lmm_stripe_offset = 65535, lmm_layout_gen = 65535},
  lmm_pool_name = '\000' <repeats 15 times>, lmm_objects = 0x7fffffffe950}

(gdb) print &lum.lmm_magic
$7 = (__u32 *) 0x7fffffffe920
(gdb) x/x 0x7fffffffe920
0x7fffffffe920: 0x0bd10bd0
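
An aside (an annotation added in editing, not part of the original comment): the lmm_magic printed above decodes to the LOV v1 user magic, consistent with the raw word shown by x/x. A minimal standalone check, using the magic values defined in lustre_user.h:

#include <stdio.h>

/* Magic values as defined in lustre/include/lustre/lustre_user.h */
#define LOV_USER_MAGIC_V1 0x0BD10BD0
#define LOV_USER_MAGIC_V3 0x0BD30BD0

int main(void)
{
        unsigned int magic = 198249424;  /* lmm_magic from the gdb output */

        /* prints: lmm_magic = 0x0bd10bd0 (V1) */
        printf("lmm_magic = 0x%08x (%s)\n", magic,
               magic == LOV_USER_MAGIC_V1 ? "V1" :
               magic == LOV_USER_MAGIC_V3 ? "V3" : "unknown");
        return 0;
}

So the client fills a struct lov_user_md_v3 but, with no pool name given, appears to stamp it with the V1 magic. The magic itself looks sane, which is consistent with the crash being on the server side of the LL_IOC_LOV_SETSTRIPE handling rather than a corrupted request buffer.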

Comment by Oleg Drokin [ 13/Nov/13 ]

I don't easily see it, but do you carry lu3544patch in your affected tree?

Comment by Jay Lan (Inactive) [ 14/Nov/13 ]

If we run a 2.4.1 client, there is no crash on the 2.1.5 MDS, but if we run a 2.4.0 client, the MDS crashes.

Well, 2.4.1 reverted LU-3544, but 2.4.0 did not revert that patch, which is certainly a difference. No NFS mount was involved, though. Do you think LU-3544 could be the cause?

Comment by Zhenyu Xu [ 14/Nov/13 ]

My 2.4.0 doesn't have the original LU-3544 patch, let alone the reversion of it. The last commit of my 2.4.0 is:

commit d3f91c45ec56329c52ff1f15bc56d38f5fe9cf7c
Author:     Oleg Drokin <oleg.drokin@intel.com>
AuthorDate: Fri May 24 16:46:24 2013 -0400
Commit:     Oleg Drokin <oleg.drokin@intel.com>
CommitDate: Fri May 24 16:46:24 2013 -0400

    New tag 2.4.0-RC2
    
    Change-Id: I6cacd097c6f3c5f2a6e80f2338650edae6a1a83c
    Signed-off-by: Oleg Drokin <oleg.drokin@intel.com>

I think the original LU-3544 patch is the culprit; you need to revert it.

The original LU-3544 patch:
commit 2402980a0891e43668f4016e17f2ff872006e0fa
Author:     Patrick Farrell <paf@cray.com>
AuthorDate: Thu Jul 11 11:06:27 2013 -0500
Commit:     Oleg Drokin <oleg.drokin@intel.com>
CommitDate: Tue Jul 23 05:22:28 2013 +0000

    LU-3544 nfs: writing to new files will return ENOENT
...
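
(For reference, reverting it on a tree that contains it would be, e.g., git revert 2402980a0891e43668f4016e17f2ff872006e0fa, then rebuild the client.)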
Comment by Mahmoud Hanafi [ 14/Nov/13 ]

Testing showed that LU-3544 was the cause of this.

This case can be closed.

Comment by Peter Jones [ 14/Nov/13 ]

ok. Thanks Mahmoud!
