[LU-16938] "lfs setstripe -C -1" stripes too widely, should be limited to OST_COUNT Created: 03/Jul/23  Updated: 15/Dec/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: None

Type: Improvement Priority: Major
Reporter: Rajeev Mishra Assignee: Rajeev Mishra
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Related
is related to LU-16623 lod_statfs_and_check() does not skip ... Resolved
is related to LU-13748 'lfs setstripe -C -1' stripes too widely Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I am reaching out to seek clarification regarding the expected behavior of the "lfs setstripe" command when using the -C -1 option.

Currently, it appears that this command is creating a higher stripe count than anticipated. For instance, on my test system, it generated a stripe count of 2727 for a single file. This count exceeds the allowed limit of LOV_MAX_STRIPE_COUNT. 

I am uncertain about the appropriate solution to address this issue related to the "-1" argument. I have contemplated the following options:

1.    Consider making the option -1 illegal, preventing its usage altogether.

2.    Implement a mechanism to automatically set the stripe count to the maximum allowed value (LOV_MAX_STRIPE_COUNT) if the count exceeds this limit.

I would greatly appreciate your input and guidance in this matter. It is worth noting that setting the stripe count higher than LOV_MAX_STRIPE_COUNT leads to other problems, such as the failure of the "llapi_layout_get_by_fd" API to open the file.

Please let me know your input.



 Comments   
Comment by Andreas Dilger [ 03/Jul/23 ]

This should limit the stripe count of the component to LOV_MAX_STRIPE_COUNT, or whatever will fit into the remaining 64KiB xattr space. This was also discussed in LU-13748. I made patch https://review.whamcloud.com/50532 "LU-13748 mdt: remove LASSERT in mdt_dump_lmm()" to fix the crash from overstriping too widely, but didn't fix the stripe_count limit setting.

Comment by Alexander Zarochentsev [ 04/Jul/23 ]

Andreas,
"-C -1" mimics "-c -1", but unlike wide striping across all OSTs, an overstripe count of 2000 is not a sane thing to do. I bet nobody but Lustre testers really wants 2000 stripes per file. That raises the question of dropping support for the special value "-1" for -C.

Comment by Cory Spitz [ 05/Jul/23 ]

+1 to Zam's comment. 2000 overstripes are more likely to hurt than help. We should protect users from doing stupid things. If someone wants the stripe limit, they can specify an appropriate value themselves.

Comment by Andreas Dilger [ 05/Jul/23 ]

Patrick, any comments on this? At one time I was thinking that we might use "-C -2" to mean 2xOST_COUNT overstriping, etc. up to some reasonable maximum. That would make "-C -1" behave the same as "-c -1".

Comment by Patrick Farrell [ 05/Jul/23 ]

Yeah, I think you're both right - The current behavior doesn't make sense and should go, and the suggested behavior from Andreas is reasonable.  My plate is full at the moment, though, so if anyone wants it... heh

By the way, I believe scherementsev fixed the core bug reported here (where -C -1 leads to bad behavior and even possible crashes) in a patch that reworked part of stripe allocation?  Hopefully he remembers which one.  He took option '2', since we should do that regardless.

Comment by Patrick Farrell [ 05/Jul/23 ]

Rajeev, I would support a patch to make -C -1 do the same as '-c -1', or just to return -EINVAL.  Either one is fine with me.  The improved behavior Andreas is suggesting for -2, etc would be neat as well but would involve at least a little work.  (Documentation, tests, etc, even if the implementation is easy)

Comment by Sergey Cheremencev [ 05/Jul/23 ]

I think Patrick is speaking about "LU-16623 lod: handle object allocation consistently".
Checking master with the mentioned patch, I don't see any problem:

[root@vm2 tests]# lfs setstripe -C -1 /mnt/lustre/foo
[root@vm2 tests]# lfs getstripe /mnt/lustre/foo | head -n 4
/mnt/lustre/foo
lmm_stripe_count:  2000
lmm_stripe_size:   1048576
lmm_pattern:       raid0,overstriped 

Guys, does the version where you reproduced the issue include the above patch? Or am I doing something wrong when trying to reproduce it? If so, please give more details.

Comment by Patrick Farrell [ 05/Jul/23 ]

Thanks Sergey.

Comment by Rajeev Mishra [ 05/Jul/23 ]

Patrick and Sergey, I do not have LU-16623 in my workspace. I will update my workspace and let you know if the problem still persists.

Thanks for your help.

Comment by Rajeev Mishra [ 05/Jul/23 ]

With the patch it works well, as shown below:

lfs setstripe -C -1 /mnt/lustre/rajeev
# lfs getstripe /mnt/lustre/rajeev
/mnt/lustre/rajeev
lmm_stripe_count:  2000
lmm_stripe_size:   1048576
lmm_pattern:       raid0,overstriped
lmm_layout_gen:    0
lmm_stripe_offset: 0

Comment by Andreas Dilger [ 06/Jul/23 ]

Rajeev, if you have the cycles, it would be good to implement the "-C -1/-2/-3/..." option to specify 1x/2x/3x/... overstriping of the OSTs, maybe up to 32x the OST count, or at least "-C -1" limiting to OST count? There may be a couple of tests using "-C -1" that need to be changed to e.g. "-C 2000".

Comment by Rajeev Mishra [ 06/Jul/23 ]

@Andreas I will try to add the functionality as suggested. I assume the maximum in any case should not exceed LOV_MAX_STRIPE_COUNT, which is 2000?

Comment by Patrick Farrell [ 06/Jul/23 ]

Definitely not.  That can cause crashes, or at least errors (or it should).

Comment by Andreas Dilger [ 20/Jul/23 ]

Since the core "-C -1" issues were already fixed by LU-13748 and LU-16623, I changed this issue to track the improvement for mapping "-C -1" to use OST_COUNT like "-c -1", and "-C -2" to use "2 * OST_COUNT", etc.
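The proposed mapping can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the lfs implementation: the helper name and the `SKETCH_*` macros are hypothetical, 2000 is the LOV_MAX_STRIPE_COUNT value quoted in this ticket, and the 32x multiplier cap is just the "maybe" suggested above. Rejecting a too-large multiplier with -EINVAL is also an assumption; clamping it would be equally plausible.

```c
#include <assert.h>
#include <errno.h>

/* Assumed limits, see the lead-in: neither macro is a Lustre definition. */
#define SKETCH_MAX_STRIPE_COUNT    2000
#define SKETCH_MAX_OVERSTRIPE_MULT 32

/*
 * Hypothetical helper: resolve the argument given to "-C" into a
 * concrete stripe count for a filesystem with ost_count OSTs.
 *   arg > 0  : explicit stripe count
 *   arg < 0  : |arg| * ost_count, so -1 means 1x, -2 means 2x, ...
 *   arg == 0 : returned as 0; the caller applies the filesystem default
 * The result never exceeds SKETCH_MAX_STRIPE_COUNT.
 */
static int sketch_resolve_overstripe_count(int arg, int ost_count)
{
	long long count;

	assert(ost_count > 0);

	if (arg == 0)
		return 0;

	/* Refuse multipliers beyond the assumed cap. */
	if (arg < -SKETCH_MAX_OVERSTRIPE_MULT)
		return -EINVAL;

	if (arg < 0)
		count = (long long)(-arg) * ost_count;
	else
		count = arg;

	if (count > SKETCH_MAX_STRIPE_COUNT)
		count = SKETCH_MAX_STRIPE_COUNT;

	return (int)count;
}
```

With 2 OSTs this resolves "-C -1" to 2 and "-C -2" to 4; with 1000 OSTs, "-C -3" is capped at 2000.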

Comment by Rajeev Mishra [ 07/Sep/23 ]

I'm currently reviewing the options -c, -C, overstripe-count, and stripe-count to gain a better understanding of their behavior.

I've noticed that there's a bug where the command accepts all of these options simultaneously, even though it appears that they should be mutually exclusive. Presently, all flags can be used together, as demonstrated in the following example:

[root@test2-rocky8 jbs]# lfs setstripe --overstripe-count 1024 --stripe-count -1 -c 10 -C -1 test
[root@test2-rocky8 jbs]# lfs getstripe test | more
test
lmm_stripe_count:  2000
lmm_stripe_size:   1048576
lmm_pattern:       raid0,overstriped
lmm_layout_gen:    0

The documentation for these options is as follows:

    -c, --stripe-count <stripe_count>: Specifies the number of OSTs to stripe a file over. A value of 0 means to use the filesystem-wide default stripe count (default is 1), and -1 means to stripe over all available OSTs.

    -C, --overstripe-count <stripe_count>: Specifies the number of stripes to create, creating more than one stripe per OST if the count exceeds the number of OSTs in the file system. Similar to -c, 0 uses the filesystem-wide default stripe count (default is 1), and -1 means to stripe over all available OSTs.

Now, the question arises: should we consider making these options mutually exclusive? The reason for this consideration is that the behavior of -c is essentially the same as that of -C. If we allow them to be combined, we would need to define the precedence of these options when used together.

Your feedback and input on whether or not we should make these options mutually exclusive would be greatly appreciated.

The fix will stick with the pattern mentioned in the comment above, which means it will use -1, -2, -3, and so on. In simpler terms, the stripe count will be calculated as a multiple of the OST count, up to the maximum stripe count allowed.

Comment by Patrick Farrell [ 07/Sep/23 ]

Interesting!  I think we should make them mutually exclusive, yes.  I would do that in a separate patch from the one which adds '-2, -3' functionality.

Comment by Rajeev Mishra [ 07/Sep/23 ]

Ok I will just take care of n*ostcount as part of this ticket. Thanks Patrick

Comment by Patrick Farrell [ 07/Sep/23 ]

It would be OK to fix that as a separate patch on this ticket; it's small enough that it doesn't need a new ticket.  But it should be a separate patch, that's all.

Comment by Andreas Dilger [ 08/Sep/23 ]

Actually, I will disagree with Patrick here. While "-C M" enables overstriping, it also works when the stripe count M is less than the number of OSTs, and is equivalent to "-c M" in that case. I don't think it "conflicts" with a later "-c N" option. Like many utilities, the last option specified will take precedence.

Comment by Patrick Farrell [ 08/Sep/23 ]

Interesting, OK!  Happy to defer.  I wasn't familiar with "last option takes precedence".

Comment by Josh Schwartz [ 08/Sep/23 ]

> Like many utilities, the last option specified will take precedence.

I would be fine with either (mutually exclusive or last takes precedence in its entirety) but this bothers me:

jupiter-p2:/lus/kjcf08 # mkdir test
jupiter-p2:/lus/kjcf08 # lfs setstripe --overstripe-count 1024 --stripe-count 10 test
jupiter-p2:/lus/kjcf08 # lfs getstripe -d test
test
stripe_count:  10 stripe_size:   1048576 pattern:       raid0,overstriped stripe_offset: -1

Note that we got (and kept) overstriped from the first param, but then picked up the count from the second. If the last option truly took precedence I would expect a stripe count of 10 without overstriped (just like if the first one took precedence I would expect a stripe count of 1024 with overstriped).

It is inconsistent that the behavior is different if you issue them individually, but in the same order:

jupiter-p2:/lus/kjcf08 # lfs setstripe --overstripe-count 1024 test
jupiter-p2:/lus/kjcf08 # lfs getstripe -d test
stripe_count:  1024 stripe_size:   1048576 pattern:       raid0,overstriped stripe_offset: -1
jupiter-p2:/lus/kjcf08 # lfs setstripe --stripe-count 10 test
jupiter-p2:/lus/kjcf08 # lfs getstripe -d test
stripe_count:  10 stripe_size:   1048576 pattern:       raid0 stripe_offset: -1

Here each command does as I would expect; --overstripe-count 1024 by itself yields overstriped and stripe count 1024, and --stripe-count 10 by itself on the same directory removes overstriped (which is what I would expect) yielding stripe count 10 without overstriped.

The fact that combining them causes it to take the overstriped from the first param and the stripe count from the second is surprising. --stripe-count explicitly means not-overstriped and if the rule is that the last one takes precedence, then it should be like the --overstripe-count wasn't there at all instead of the --stripe-count acting as a modifier.
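A minimal sketch of "last option takes precedence in its entirety" is shown here. It assumes the parser keeps one small state struct and that each count option overwrites the whole struct rather than a single field; the struct, enum, and function names are hypothetical and are not taken from the lfs source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical parser state for the two count options. */
struct sketch_stripe_opts {
	int  count;        /* requested stripe count, -1 meaning all OSTs */
	bool overstriped;  /* true only if the winning option was -C */
};

enum sketch_opt {
	SKETCH_OPT_STRIPE_COUNT,      /* -c / --stripe-count */
	SKETCH_OPT_OVERSTRIPE_COUNT,  /* -C / --overstripe-count */
};

/*
 * Apply one parsed option. Both fields are rewritten on every call, so
 * an earlier -C cannot leave its overstriping request behind when a
 * later -c is seen, and vice versa.
 */
static void sketch_apply_opt(struct sketch_stripe_opts *opts,
			     enum sketch_opt which, int value)
{
	assert(opts != NULL);

	opts->count = value;
	opts->overstriped = (which == SKETCH_OPT_OVERSTRIPE_COUNT);
}
```

Applying --overstripe-count 1024 followed by --stripe-count 10 to this state leaves a count of 10 with overstriping not requested, which matches the result of issuing the two commands separately.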

Comment by Patrick Farrell [ 08/Sep/23 ]

Josh,

Makes sense to me.  There's also another possible bug here - how many OSTs do you have on that system?  If it's >= 10, then overstriped shouldn't be set by the server code either, which is also a concern.  Overstriping should only be set on the file when the actual file striping exceeds the number of available OSTs.  (Or at least that was the intent...)

So there may be two things to fix there - proper overriding by later parameters in userspace, so the overstriping flag isn't passed along, and then - if you have >= 10 OSTs, then the server shouldn't set the overstriping pattern regardless of what userspace asked for.  If you have 20 OSTs and give -C 10, overstriping shouldn't be set, because the file is not actually overstriped.  Overstriping set on a not-overstriped file isn't fatal, but it's definitely wrong.

Comment by Andreas Dilger [ 08/Sep/23 ]

There is code in lod_ost_alloc_rr() in the MDS object allocation that should be removing the LOV_PATTERN_OVERSTRIPING flag if it is set unnecessarily:


        /* If there are enough OSTs, a component with overstriping requested
         * will not actually end up overstriped.  The comp should reflect this.
         */
        if (!overstriped)
                lod_comp->llc_pattern &= ~LOV_PATTERN_OVERSTRIPING;

If this isn't being applied consistently, then that would be a bug.
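The effect of that check can be modeled outside the kernel roughly as follows. This is a simplified sketch, not the lod code: the flag values are placeholders chosen for illustration, the function name is hypothetical, and "overstriped" is approximated as the stripe count exceeding the number of usable OSTs, whereas the real allocator decides from what it actually allocated.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit values for illustration; see the Lustre headers for
 * the real pattern definitions. */
#define SKETCH_PATTERN_RAID0        0x001u
#define SKETCH_PATTERN_OVERSTRIPING 0x200u

/*
 * Hypothetical model of the final layout pattern: keep the overstriping
 * bit only when at least one OST must hold more than one stripe.
 */
static unsigned int sketch_final_pattern(unsigned int pattern,
					 unsigned int stripe_count,
					 unsigned int ost_count)
{
	bool overstriped;

	assert(ost_count > 0);

	/* Simplifying assumption: every OST is usable for this file. */
	overstriped = stripe_count > ost_count;

	if (!overstriped)
		pattern &= ~SKETCH_PATTERN_OVERSTRIPING;

	return pattern;
}
```

Under this model, a request for 10 stripes with overstriping on 20 OSTs ends up as plain raid0, while the same request on 2 OSTs keeps the overstriped pattern.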

Comment by Josh Schwartz [ 08/Sep/23 ]

I don't think that is coming into play here because I'm just showing the default striping on a directory. If I actually create a file within the directory I believe it is behaving as you suggest:

jupiter-p2:/lus/kjcf08 # mkdir test
jupiter-p2:/lus/kjcf08 # lfs setstripe --overstripe-count 1024 --stripe-count 10 test
jupiter-p2:/lus/kjcf08 # touch test/foo
jupiter-p2:/lus/kjcf08 # lfs getstripe test | head
test
stripe_count:  10 stripe_size:   1048576 pattern:       raid0,overstriped stripe_offset: -1

test/foo
lmm_stripe_count:  10
lmm_stripe_size:   1048576
lmm_pattern:       raid0,overstriped
lmm_layout_gen:    0
lmm_stripe_offset: 1
	obdidx		 objid		 objid		 group

Here the file is overstriped because I only have 2 OSTs.

This is a bit of a degenerate example, but if I just set --overstripe-count 2, the directory will have a default of overstriped with a stripe count of 2, but files that are created are not overstriped (and have a stripe count of 2):

jupiter-p2:/lus/kjcf08 # lfs setstripe --overstripe-count 2 test
jupiter-p2:/lus/kjcf08 # lfs getstripe -d test
stripe_count:  2 stripe_size:   1048576 pattern:       raid0,overstriped stripe_offset: -1
jupiter-p2:/lus/kjcf08 # touch test/foo
jupiter-p2:/lus/kjcf08 # lfs getstripe test/foo
test/foo
lmm_stripe_count:  2
lmm_stripe_size:   1048576
lmm_pattern:       raid0
lmm_layout_gen:    0
lmm_stripe_offset: 1
	obdidx		 objid		 objid		 group
	     1	     116959791	    0x6f8aa2f	             0
	     0	     117253333	    0x6fd24d5	             0

so I think that part of it is working OK.

Comment by Patrick Farrell [ 08/Sep/23 ]

OK, that's good, then.  The user interface is important but I was more concerned that the server might be marking the layout incorrectly.  Obviously default layouts are a different case.

Comment by Josh Schwartz [ 08/Sep/23 ]

But just to be clear, the inconsistency I'm concerned about can ultimately affect files, e.g. by ending up with a MUCH larger overstripe count than perhaps was intended if one accidentally does something like this:

jupiter-p2:/lus/kjcf08 # mkdir test
jupiter-p2:/lus/kjcf08 # lfs setstripe --overstripe-count 10 --stripe-count -1 test
jupiter-p2:/lus/kjcf08 # touch test/foo
jupiter-p2:/lus/kjcf08 # lfs getstripe test | head
test
stripe_count:  -1 stripe_size:   1048576 pattern:       raid0,overstriped stripe_offset: -1

test/foo
lmm_stripe_count:  2727
lmm_stripe_size:   1048576
lmm_pattern:       raid0,overstriped
lmm_layout_gen:    0
lmm_stripe_offset: 0
	obdidx		 objid		 objid		 group

(note the 2727 value is because I don't have Rajeev's other fix on this system, but on the latest code this would be 2000... still probably not what was expected on a system with 2 OSTs).

Generated at Sat Feb 10 03:31:15 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.