[LU-2811] LBUG: stripe_count > LOV_MAX_STRIPE_COUNT Created: 14/Feb/13  Updated: 27/Aug/15  Resolved: 27/Aug/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.3
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Roland Fehrenbacher Assignee: Jian Yu
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Vanilla Kernel 2.6.32.59
12 OSTs
approx. 980 clients running a mix of 1.8.x and 2.3


Issue Links:
Duplicate
is duplicated by LU-5578 LBUG : (mdt_lib.c:543:mdt_dump_lmm())... Closed
Severity: 4
Rank (Obsolete): 6812

 Description   

After running fine for about 8 weeks:

Message from syslogd@pfs1n1 at Thu Feb 14 08:30:05 2013 ...
pfs1n1 kernel: [1452648.918541] LustreError: 15442:0:(mdt_lib.c:541:mdt_dump_lmm()) ASSERTION( stripe_count <= (__s16)160 ) failed:

Message from syslogd@pfs1n1 at Thu Feb 14 08:30:05 2013 ...
pfs1n1 kernel: [1452648.928990] LustreError: 15442:0:(mdt_lib.c:541:mdt_dump_lmm()) LBUG

Message from syslogd@pfs1n1 at Thu Feb 14 08:30:05 2013 ...
pfs1n1 kernel: [1452649.070201] Kernel panic - not syncing: LBUG



 Comments   
Comment by Bruno Faccini (Inactive) [ 14/Feb/13 ]

Hello,
Was there a crash-dump taken at the time of the LBUG ?

Comment by Roland Fehrenbacher [ 14/Feb/13 ]

I can't tell. Even if one was taken, the /tmp cleaning script would have wiped it on reboot.

Comment by Bruno Faccini (Inactive) [ 14/Feb/13 ]

Roland,
I was not referring to a Lustre debug log dumped into /tmp, but to a full system crash dump, which requires kdump or a similar tool to be installed and configured on your system. Do you know whether that is the case?

Comment by Roland Fehrenbacher [ 14/Feb/13 ]

Unfortunately not.

Comment by Roland Fehrenbacher [ 14/Feb/13 ]

The system is part of an HA pair and got reset by the peer. So we couldn't even copy anything from the console.

Comment by Andreas Dilger [ 15/Feb/13 ]

Without at least the stack trace, there isn't really enough information available about what triggered this problem. Is it corruption of an on-disk LOV EA? Was it a corrupted setstripe request from a client? Is there some sanity checking missing in some normal code path? Was this a problem with a 1.8 or 2.1 client?

Roland, some things that could help debugging in the future:

  • connecting a serial cable to each of the MDS nodes and logging to a third system (or to each other), perhaps in conjunction with "conman", which can manage ethernet-attached serial console servers
  • using "netconsole" to send console logs to a remote node via UDP
  • configuring "kdump" and/or "netdump" to capture crash dumps (this gives by far the most debugging information)

Comment by Andreas Dilger [ 15/Feb/13 ]

I'm going to close this bug for now, since there isn't enough information to figure out what went wrong. Please re-open if it happens again.

Comment by Roland Fehrenbacher [ 18/Feb/13 ]

OK. I already suspected that it would be hard to find the cause with so little information. I'll see how easy it will be to get kdump running on the cluster. Thanks for looking at this anyway.

Comment by Peter Jones [ 18/Feb/13 ]

Thanks Roland. The other thing to ask is - have any changes been made to the vanilla Lustre code on either of the releases?

Comment by Roland Fehrenbacher [ 18/Feb/13 ]

Not as far as I know (the SuSE clients are not managed by me; they are running kernel 2.6.32 with Lustre 1.8.x and kernel 3.0 with Lustre 2.3, respectively).

Comment by Tommy Minyard [ 24/Mar/14 ]

We have a reproducer for this error. If you run a newer client such as 2.4.2 and mount an older 2.1.5 Lustre filesystem, users can set the stripe count to more than 160, the maximum stripe count in 2.1. The crash is triggered as soon as they query or access the directory whose stripe count is set to more than 160.
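
The failure mode described above comes down to a server-side routine asserting on a value that a client was able to store. As a minimal sketch of the alternative (this is not Lustre source code, and it is not the content of any patch referenced in this ticket), a check of that kind could reject or report the value instead of asserting. The names `SKETCH_MAX_STRIPE_COUNT` and `sketch_check_stripe_count` are hypothetical; only the limit of 160 is taken from the assertion message.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical constant; 160 mirrors the limit shown in the assertion. */
#define SKETCH_MAX_STRIPE_COUNT 160

/*
 * Validate a stripe count that came from a client request or from disk.
 * Returns 0 when the value is within the limit and -EINVAL otherwise,
 * so the caller can fail the request instead of halting the server.
 */
static int sketch_check_stripe_count(uint16_t stripe_count)
{
    if (stripe_count > SKETCH_MAX_STRIPE_COUNT) {
        fprintf(stderr, "stripe count %u exceeds maximum %d\n",
                (unsigned)stripe_count, SKETCH_MAX_STRIPE_COUNT);
        return -EINVAL;
    }
    return 0;
}
```

With a check of this shape, a request carrying a stripe count of 200 would be refused with an error code, while 160 or fewer would pass.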

Comment by Peter Jones [ 25/Mar/14 ]

Yu, Jian

Could you please take care of this one?

Thanks

Peter

Comment by Jian Yu [ 27/Mar/14 ]

I'll reproduce and investigate the failure.

Comment by James A Simmons [ 27/Mar/14 ]

Can you try patch http://review.whamcloud.com/#/c/9734.

Comment by Jian Yu [ 28/Mar/14 ]

James A Simmons wrote: "Can you try patch http://review.whamcloud.com/#/c/9734."

Sure, thanks!

Before validating the patch, I was able to reproduce the failure on a Lustre 2.1.6 server with a Lustre 2.4.3 client, following the steps from Tommy.

On Lustre 2.4.3 client node:

# mkdir /mnt/lustre/dir
# lfs getstripe -d /mnt/lustre/dir
stripe_count:   1 stripe_size:    1048576 stripe_offset:  -1
# lfs setstripe -c 200 /mnt/lustre/dir
# lfs getstripe -d /mnt/lustre/dir            <------------- hung here

Console log on Lustre 2.1.6 MDS:

Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:0
LustreError: 8510:0:(mdt_lib.c:543:mdt_dump_lmm()) ASSERTION( stripe_count <= (__s16)160 ) failed: 
LustreError: 8510:0:(mdt_lib.c:543:mdt_dump_lmm()) LBUG
Pid: 8510, comm: mdt_00

Call Trace:
 [<ffffffffa0425785>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0425d97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0d4fb72>] mdt_dump_lmm+0x272/0x280 [mdt]
 [<ffffffffa0d495f2>] mdt_getattr_internal+0x672/0xe90 [mdt]
 [<ffffffffa06bb6c0>] ? lustre_swab_mdt_body+0x0/0x150 [ptlrpc]
 [<ffffffffa0d4a035>] mdt_getattr+0x225/0x920 [mdt]
 [<ffffffffa0d40772>] mdt_handle_common+0x932/0x1750 [mdt]
 [<ffffffffa0d41665>] mdt_regular_handle+0x15/0x20 [mdt]
 [<ffffffffa06c8b9e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
 [<ffffffffa06c7f50>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffffa06c7f50>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
 [<ffffffffa06c7f50>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Kernel panic - not syncing: LBUG

I'll back-port the above patch to Lustre b2_1 branch and try again.

Comment by Jian Yu [ 04/Apr/14 ]

Here is the back-ported patch for Lustre b2_1 branch http://review.whamcloud.com/9884. I'll validate it.

Comment by Jian Yu [ 07/Apr/14 ]

With the above patch on Lustre b2_1 branch, the same test passed:

# mkdir /mnt/lustre/dir
# lfs getstripe -d /mnt/lustre/dir
stripe_count:   1 stripe_size:    1048576 stripe_offset:  -1 
# lfs setstripe -c 200 /mnt/lustre/dir
# lfs getstripe -d /mnt/lustre/dir
stripe_count:   200 stripe_size:    1048576 stripe_offset:  -1 
# touch /mnt/lustre/dir/file
# lfs getstripe -i -c -s /mnt/lustre/dir/file
lmm_stripe_count:   160
lmm_stripe_size:    1048576
lmm_stripe_offset:  93
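
In the output above, the directory keeps the requested stripe count of 200, while the file created in it reports 160 stripes. That is consistent with the requested count being clamped to the supported maximum when a file is created. A minimal sketch of such clamping follows, assuming hypothetical names (`SKETCH_MAX_STRIPE_COUNT`, `sketch_effective_stripe_count`); it illustrates the observed behavior and is not taken from the patch itself.

```c
#include <stdint.h>

/* Hypothetical constant; 160 is the limit reported in this ticket. */
#define SKETCH_MAX_STRIPE_COUNT 160

/*
 * Return the stripe count to use for a new file: the requested value,
 * limited to the supported maximum.
 */
static uint16_t sketch_effective_stripe_count(uint16_t requested)
{
    if (requested > SKETCH_MAX_STRIPE_COUNT)
        return SKETCH_MAX_STRIPE_COUNT;
    return requested;
}
```

Under this sketch a request for 200 stripes yields 160, matching the lmm_stripe_count shown for the new file, and a request for 1 stripe is left unchanged.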

Comment by James A Simmons [ 27/Aug/15 ]

Since Lustre 2.1 is no longer supported, we should close this ticket. The solution, if someone needs it, is in this ticket.
