[LU-2811] LBUG: stripe_count > LOV_MAX_STRIPE_COUNT Created: 14/Feb/13 Updated: 27/Aug/15 Resolved: 27/Aug/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Roland Fehrenbacher | Assignee: | Jian Yu |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Vanilla Kernel 2.6.32.59 |
||
| Issue Links: |
|
||||||||
| Severity: | 4 | ||||||||
| Rank (Obsolete): | 6812 | ||||||||
| Description |
|
After running fine for about 8 weeks: Message from syslogd@pfs1n1 at Thu Feb 14 08:30:05 2013 ... Message from syslogd@pfs1n1 at Thu Feb 14 08:30:05 2013 ... Message from syslogd@pfs1n1 at Thu Feb 14 08:30:05 2013 ... |
| Comments |
| Comment by Bruno Faccini (Inactive) [ 14/Feb/13 ] |
|
Hello, |
| Comment by Roland Fehrenbacher [ 14/Feb/13 ] |
|
I can't tell, since even if it was, it would have been wiped by the /tmp cleaning script upon reboot |
| Comment by Bruno Faccini (Inactive) [ 14/Feb/13 ] |
|
Roland, |
| Comment by Roland Fehrenbacher [ 14/Feb/13 ] |
|
Unfortunately not. |
| Comment by Roland Fehrenbacher [ 14/Feb/13 ] |
|
The system is part of an HA pair and got reset by the peer. So we couldn't even copy anything from the console. |
| Comment by Andreas Dilger [ 15/Feb/13 ] |
|
Without at least the stack trace, there isn't really enough information available about what triggered this problem. Is it corruption of an on-disk LOV EA? Was it a corrupted setstripe request from a client? Is there some sanity checking missing in some normal code path? Was this a problem with a 1.8 or 2.1 client? Roland, some things that could help debugging in the future:
|
| Comment by Andreas Dilger [ 15/Feb/13 ] |
|
I'm going to close this bug for now, since there isn't enough information to figure out what went wrong. Please re-open if it happens again. |
| Comment by Roland Fehrenbacher [ 18/Feb/13 ] |
|
OK. I already thought that it would be hard to find the cause with only so much info. I'll see how easy it will be to get kdump on the cluster. Thanks for looking at this anyway. |
| Comment by Peter Jones [ 18/Feb/13 ] |
|
Thanks Roland. The other thing to ask is - have any changes been made to the vanilla Lustre code on either of the releases? |
| Comment by Roland Fehrenbacher [ 18/Feb/13 ] |
|
Not as far as I know (the SuSE clients are not managed by myself, they are running 2.6.32/1.8.x, 3.0/2.3 respectively). |
| Comment by Tommy Minyard [ 24/Mar/14 ] |
|
We have a reproducer for this error. If you run a newer client such as 2.4.2 and mount an older 2.1.5 Lustre filesystem, users can set the stripe count to more than 160, the maximum stripe count in 2.1. The crash is triggered as soon as they query or access a directory whose stripe count is set to more than 160. |
| Comment by Peter Jones [ 25/Mar/14 ] |
|
Yu, Jian, could you please take care of this one? Thanks, Peter |
| Comment by Jian Yu [ 27/Mar/14 ] |
|
I'll reproduce and investigate the failure. |
| Comment by James A Simmons [ 27/Mar/14 ] |
|
Can you try patch http://review.whamcloud.com/#/c/9734. |
| Comment by Jian Yu [ 28/Mar/14 ] |
Sure, thanks! Before validating the patch, I can reproduce the failure on Lustre 2.1.6 server with Lustre 2.4.3 client according to the steps from Tommy.

On Lustre 2.4.3 client node:

# mkdir /mnt/lustre/dir
# lfs getstripe -d /mnt/lustre/dir
stripe_count: 1 stripe_size: 1048576 stripe_offset: -1
# lfs setstripe -c 200 /mnt/lustre/dir
# lfs getstripe -d /mnt/lustre/dir <------------- hung here

Console log on Lustre 2.1.6 MDS:

Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):0:0
LustreError: 8510:0:(mdt_lib.c:543:mdt_dump_lmm()) ASSERTION( stripe_count <= (__s16)160 ) failed:
LustreError: 8510:0:(mdt_lib.c:543:mdt_dump_lmm()) LBUG
Pid: 8510, comm: mdt_00
Call Trace:
[<ffffffffa0425785>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
[<ffffffffa0425d97>] lbug_with_loc+0x47/0xb0 [libcfs]
[<ffffffffa0d4fb72>] mdt_dump_lmm+0x272/0x280 [mdt]
[<ffffffffa0d495f2>] mdt_getattr_internal+0x672/0xe90 [mdt]
[<ffffffffa06bb6c0>] ? lustre_swab_mdt_body+0x0/0x150 [ptlrpc]
[<ffffffffa0d4a035>] mdt_getattr+0x225/0x920 [mdt]
[<ffffffffa0d40772>] mdt_handle_common+0x932/0x1750 [mdt]
[<ffffffffa0d41665>] mdt_regular_handle+0x15/0x20 [mdt]
[<ffffffffa06c8b9e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
[<ffffffffa06c7f50>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c0ca>] child_rip+0xa/0x20
[<ffffffffa06c7f50>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffffa06c7f50>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
[<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Kernel panic - not syncing: LBUG

I'll back-port the above patch to Lustre b2_1 branch and try again. |
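For readers following the trace: the panic comes from an assertion on a value that a client was able to set. The sketch below is illustrative only and is not taken from the Lustre source or from the patches referenced in this ticket. It shows the general alternative of returning an error for an out-of-range stripe count rather than asserting. The names `MAX_STRIPE_COUNT_SKETCH` and `check_stripe_count` are hypothetical; the limit of 160 is the value shown in the assertion message above.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the server-side limit (160 in the log above). */
#define MAX_STRIPE_COUNT_SKETCH 160

/*
 * Check a stripe count that originated outside the server (for example,
 * from a client request or an on-disk layout).
 * Returns 0 when the value is within the limit and -EINVAL otherwise,
 * so a caller can fail one request instead of halting the whole node.
 */
int check_stripe_count(uint16_t stripe_count)
{
    if (stripe_count > MAX_STRIPE_COUNT_SKETCH)
        return -EINVAL;
    return 0;
}
```

With this shape, a request carrying a stripe count of 200 would be rejected with an error code, while 160 and below would be accepted.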
| Comment by Jian Yu [ 04/Apr/14 ] |
|
Here is the back-ported patch for Lustre b2_1 branch http://review.whamcloud.com/9884. I'll validate it. |
| Comment by Jian Yu [ 07/Apr/14 ] |
|
With the above patch on Lustre b2_1 branch, the same test passed:

# mkdir /mnt/lustre/dir
# lfs getstripe -d /mnt/lustre/dir
stripe_count: 1 stripe_size: 1048576 stripe_offset: -1
# lfs setstripe -c 200 /mnt/lustre/dir
# lfs getstripe -d /mnt/lustre/dir
stripe_count: 200 stripe_size: 1048576 stripe_offset: -1
# touch /mnt/lustre/dir/file
# lfs getstripe -i -c -s /mnt/lustre/dir/file
lmm_stripe_count: 160 lmm_stripe_size: 1048576 lmm_stripe_offset: 93 |
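In the output above, the directory keeps its requested stripe count of 200, while a file created in it ends up with 160 stripes. That behaviour is consistent with clamping the requested count to the server maximum when the file is created. The following is a minimal sketch of such clamping, assuming nothing about how the patch actually implements it; `clamp_stripe_count` and its parameters are hypothetical names.

```c
#include <stdint.h>

/*
 * Limit a requested stripe count to the maximum the server supports.
 * Values at or below the maximum are returned unchanged.
 */
uint16_t clamp_stripe_count(uint16_t requested, uint16_t max_supported)
{
    return requested > max_supported ? max_supported : requested;
}
```

Called with a requested count of 200 and a maximum of 160, this returns 160, matching the `lmm_stripe_count` reported for the new file.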
| Comment by James A Simmons [ 27/Aug/15 ] |
|
Since Lustre 2.1 is no longer supported, we should close this ticket. The solution, if someone needs it, is in this ticket. |