[LU-16938] "lfs setstripe -C -1" stripes too widely, should be limited to OST_COUNT Created: 03/Jul/23 Updated: 15/Dec/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.15.0 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major |
| Reporter: | Rajeev Mishra | Assignee: | Rajeev Mishra |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 | ||||||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||||||
| Description |
|
I am reaching out to seek clarification regarding the expected behavior of the "lfs setstripe" command when using the -C -1 option. Currently, it appears that this command is creating a higher stripe count than anticipated. For instance, on my test system, it generated a stripe count of 2727 for a single file. This count exceeds the allowed limit of LOV_MAX_STRIPE_COUNT.
I am uncertain about the appropriate solution to address this issue related to the "-1" argument. I have contemplated the following options:
1. Consider making the option -1 illegal, preventing its usage altogether.
2. Implement a mechanism to automatically set the stripe count to the maximum allowed value (LOV_MAX_STRIPE_COUNT) if the count exceeds this limit.
I would greatly appreciate your input and guidance in this matter. It is worth noting that setting the stripe count higher than LOV_MAX_STRIPE_COUNT leads to other problems, such as the failure of the "llapi_layout_get_by_fd" API to open the file. Please let me know your input. |
| Comments |
| Comment by Andreas Dilger [ 03/Jul/23 ] |
|
This should limit the stripe count of the component to LOV_MAX_STRIPE_COUNT, or whatever will fit into the remaining 64KiB xattr space. This was also discussed in |
| Comment by Alexander Zarochentsev [ 04/Jul/23 ] |
|
Andreas, |
| Comment by Cory Spitz [ 05/Jul/23 ] |
|
+1 Zam's comment. 2000 overstripes will more likely hurt than help. We should protect the users from doing stupid things. I think if someone wants the stripe limit, then they can specify an appropriate value themselves. |
| Comment by Andreas Dilger [ 05/Jul/23 ] |
|
Patrick, any comments on this? At one time I was thinking that we might use "-C -2" to mean 2xOST_COUNT overstriping, etc. up to some reasonable maximum. That would make "-C -1" behave the same as "-c -1". |
| Comment by Patrick Farrell [ 05/Jul/23 ] |
|
Yeah, I think you're both right. The current behavior doesn't make sense and should go, and the suggested behavior from Andreas is reasonable. My plate is full at the moment, though, so if anyone wants it... heh
By the way, I believe scherementsev fixed the core bug reported here (where -C -1 leads to bad behavior and even possible crashes) in a patch he was doing reworking part of stripe allocation? Hopefully he remembers which one. He took option '2', since we should do that regardless. |
| Comment by Patrick Farrell [ 05/Jul/23 ] |
|
Rajeev, I would support a patch to make -C -1 do the same as '-c -1', or just to return -EINVAL. Either one is fine with me. The improved behavior Andreas is suggesting for -2, etc would be neat as well but would involve at least a little work. (Documentation, tests, etc, even if the implementation is easy) |
| Comment by Sergey Cheremencev [ 05/Jul/23 ] |
|
I think Patrick is speaking about "
[root@vm2 tests]# lfs setstripe -C -1 /mnt/lustre/foo
[root@vm2 tests]# lfs getstripe /mnt/lustre/foo | head -n 4
/mnt/lustre/foo
lmm_stripe_count: 2000
lmm_stripe_size: 1048576
lmm_pattern: raid0,overstriped
Guys, does the version where you reproduced the issue include the above patch? Or am I doing something wrong to reproduce it? If so, please give more details. |
| Comment by Patrick Farrell [ 05/Jul/23 ] |
|
Thanks Sergey. |
| Comment by Rajeev Mishra [ 05/Jul/23 ] |
|
Patrick and Sergey, I do not have LU-16623 in my workspace. I will update my workspace and let you know if the problem still persists. Thanks for your help. |
| Comment by Rajeev Mishra [ 05/Jul/23 ] |
|
With the patch it works well, as shown below:
lfs setstripe -C -1 /mnt/lustre/rajeev
/mnt/lustre/rajeev
lmm_stripe_count: 2000
lmm_stripe_size: 1048576
lmm_pattern: raid0,overstriped
lmm_layout_gen: 0
lmm_stripe_offset: 0 |
| Comment by Andreas Dilger [ 06/Jul/23 ] |
|
Rajeev, if you have the cycles, it would be good to implement the "-C -1/-2/-3/..." option to specify 1x/2x/3x/... overstriping of the OSTs, maybe up to 32x the OST count, or at least "-C -1" limiting to OST count? There may be a couple of tests using "-C -1" that need to be changed to e.g. "-C 2000". |
| Comment by Rajeev Mishra [ 06/Jul/23 ] |
|
@Andreas I will try to add the functionality as suggested. I assume the maximum should in no case exceed LOV_MAX_STRIPE_COUNT, which is 2000? |
| Comment by Patrick Farrell [ 06/Jul/23 ] |
|
Definitely not. That can cause crashes, or at least errors (or it should). |
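The "-C -N" proposal above, together with the LOV_MAX_STRIPE_COUNT cap, could be sketched roughly as follows. This is only an illustration, not the lfs implementation: `resolve_overstripe_count()` and `OVERSTRIPE_MAX_FACTOR` are hypothetical names, the 32x limit is the "maybe up to 32x" figure floated in the thread, and rejecting an explicit count above the maximum (rather than clamping it) is an assumption.

```c
#include <stdint.h>

/* Upper bound on stripes per component, 2000 per the discussion above. */
#define LOV_MAX_STRIPE_COUNT 2000
/* Hypothetical cap on the overstripe multiplier ("maybe up to 32x"). */
#define OVERSTRIPE_MAX_FACTOR 32

/*
 * Hypothetical helper: resolve the argument of "-C" into a stripe count.
 *   arg > 0  : explicit stripe count
 *   arg == 0 : filesystem default (returned as 0 for the caller to resolve)
 *   arg < 0  : |arg| stripes per OST, i.e. |arg| * ost_count, capped at
 *              LOV_MAX_STRIPE_COUNT
 * Returns -1 on invalid input, standing in for -EINVAL.
 */
static int resolve_overstripe_count(int arg, unsigned int ost_count)
{
	uint64_t count;

	if (ost_count == 0)
		return -1;
	if (arg >= 0)
		return arg > LOV_MAX_STRIPE_COUNT ? -1 : arg;
	if (arg < -OVERSTRIPE_MAX_FACTOR)
		return -1;

	/* 64-bit arithmetic so a large OST count cannot overflow. */
	count = (uint64_t)(-arg) * ost_count;
	if (count > LOV_MAX_STRIPE_COUNT)
		count = LOV_MAX_STRIPE_COUNT;
	return (int)count;
}
```

With this reading, "-C -1" on a 2-OST system resolves to 2 stripes (the same as "-c -1"), "-C -3" resolves to 6, and a multiple that would exceed the limit is capped at 2000.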
| Comment by Andreas Dilger [ 20/Jul/23 ] |
|
Since the core "-C -1" issues were already fixed by |
| Comment by Rajeev Mishra [ 07/Sep/23 ] |
|
I'm currently reviewing the options -c, -C, --overstripe-count, and --stripe-count to gain a better understanding of their behavior. I've noticed that there's a bug where the command accepts all of these options simultaneously, even though it appears that they should be mutually exclusive. Presently, all flags can be used together, as demonstrated in the following example:
[root@test2-rocky8 jbs]# lfs setstripe --overstripe-count 1024 --stripe-count -1 -c 10 -C -1 test
[root@test2-rocky8 jbs]# lfs getstripe test | more
test
lmm_stripe_count: 2000
lmm_stripe_size: 1048576
lmm_pattern: raid0,overstriped
lmm_layout_gen: 0
The documentation for these options is as follows:
-c, --stripe-count <stripe_count>: Specifies the number of OSTs to stripe a file over. A value of 0 means to use the filesystem-wide default stripe count (default is 1), and -1 means to stripe over all available OSTs.
-C, --overstripe-count <stripe_count>: Specifies the number of stripes to create, creating more than one stripe per OST if the count exceeds the number of OSTs in the file system. Similar to -c, 0 uses the filesystem-wide default stripe count (default is 1), and -1 means to stripe over all available OSTs.
Now, the question arises: should we consider making these options mutually exclusive? The reason for this consideration is that the behavior of -c is essentially the same as using -C. If we allow mutual inclusion, we would need to define the precedence of these options when used together. Your feedback and input on whether or not we should make these options mutually exclusive would be greatly appreciated. The fix will stick with the pattern mentioned in the comment above, which means it will use -1, -2, -3, and so on. In simpler terms, the stripe count will be calculated as a multiple of the OST count, up to the maximum stripe count allowed. |
| Comment by Patrick Farrell [ 07/Sep/23 ] |
|
Interesting! I think we should make them mutually exclusive, yes. I would do that in a separate patch from the one which adds '-2, -3' functionality. |
| Comment by Rajeev Mishra [ 07/Sep/23 ] |
|
Ok I will just take care of n*ostcount as part of this ticket. Thanks Patrick |
| Comment by Patrick Farrell [ 07/Sep/23 ] |
|
It would be OK to fix that as a separate patch on this ticket, it's small enough it doesn't need a new ticket. But it should be a separate patch, that's all. |
| Comment by Andreas Dilger [ 08/Sep/23 ] |
|
Actually, I will disagree with Patrick here. While "-C M" enables overstriping, it also works when the stripe count M is less than the number of OSTs, and is equivalent to "-c M" in that case. I don't think it "conflicts" with a later "-c N" option. Like many utilities, the last option specified will take precedence. |
| Comment by Patrick Farrell [ 08/Sep/23 ] |
|
Interesting, OK! Happy to defer. I wasn't familiar with "last option takes precedence". |
| Comment by Josh Schwartz [ 08/Sep/23 ] |
|
> Like many utilities, the last option specified will take precedence.
I would be fine with either (mutually exclusive or last takes precedence in its entirety) but this bothers me:
jupiter-p2:/lus/kjcf08 # mkdir test
jupiter-p2:/lus/kjcf08 # lfs setstripe --overstripe-count 1024 --stripe-count 10 test
jupiter-p2:/lus/kjcf08 # lfs getstripe -d test
test
stripe_count: 10 stripe_size: 1048576 pattern: raid0,overstriped stripe_offset: -1
Note that we got (and kept) overstriped from the first param, but then picked up the count from the second. If the last option truly took precedence I would expect a stripe count of 10 without overstriped (just like if the first one took precedence I would expect a stripe count of 1024 with overstriped). It is inconsistent that the behavior is different if you issue them individually, but in the same order:
jupiter-p2:/lus/kjcf08 # lfs setstripe --overstripe-count 1024 test
jupiter-p2:/lus/kjcf08 # lfs getstripe -d test
stripe_count: 1024 stripe_size: 1048576 pattern: raid0,overstriped stripe_offset: -1
jupiter-p2:/lus/kjcf08 # lfs setstripe --stripe-count 10 test
jupiter-p2:/lus/kjcf08 # lfs getstripe -d test
stripe_count: 10 stripe_size: 1048576 pattern: raid0 stripe_offset: -1
Here each command does as I would expect; --overstripe-count 1024 by itself yields overstriped and stripe count 1024, and --stripe-count 10 by itself on the same directory removes overstriped (which is what I would expect) yielding stripe count 10 without overstriped. The fact that combining them causes it to take the overstriped from the first param and the stripe count from the second is surprising. --stripe-count explicitly means not-overstriped and if the rule is that the last one takes precedence, then it should be like the --overstripe-count wasn't there at all instead of the --stripe-count acting as a modifier. |
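One way to read "last takes precedence in its entirety" is that each occurrence of -c or -C overwrites both the stripe count and the overstriping request, so nothing from an earlier option leaks through. The following is a minimal sketch under that assumption; the types and `resolve_stripe_opts()` are hypothetical and do not reflect how lfs actually parses its options.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two options under discussion. */
enum stripe_opt {
	OPT_STRIPE_COUNT,	/* -c / --stripe-count */
	OPT_OVERSTRIPE_COUNT,	/* -C / --overstripe-count */
};

struct stripe_arg {
	enum stripe_opt opt;
	int count;
};

struct stripe_request {
	int count;		/* requested stripe count */
	bool overstriped;	/* whether overstriping was requested */
};

/*
 * Apply the options in command-line order. Each option overwrites BOTH
 * fields, so an earlier -C leaves no trace once a later -c is seen.
 */
static struct stripe_request
resolve_stripe_opts(const struct stripe_arg *args, size_t nargs)
{
	struct stripe_request req = { .count = 0, .overstriped = false };
	size_t i;

	for (i = 0; i < nargs; i++) {
		req.count = args[i].count;
		req.overstriped = (args[i].opt == OPT_OVERSTRIPE_COUNT);
	}
	return req;
}
```

Under this rule, "--overstripe-count 1024 --stripe-count 10" resolves to a count of 10 with no overstriping, which matches the result of issuing the two commands one after the other.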
| Comment by Patrick Farrell [ 08/Sep/23 ] |
|
Josh, Makes sense to me. There's also another possible bug here - how many OSTs do you have on that system? If it's >= 10, then overstriped shouldn't be set by the server code either, which is also a concern. Overstriping should only be set on the file when the actual file striping exceeds the number of available OSTs. (Or at least that was the intent...) So there may be two things to fix there - proper overriding by later parameters in userspace, so the overstriping flag isn't passed along, and then - if you have >= 10 OSTs, then the server shouldn't set the overstriping pattern regardless of what userspace asked for. If you have 20 OSTs and give -C 10, overstriping shouldn't be set, because the file is not actually overstriped. Overstriping set on a not-overstriped file isn't fatal, but it's definitely wrong. |
| Comment by Andreas Dilger [ 08/Sep/23 ] |
|
There is code in lod_ost_alloc_rr() in the MDS object allocation that should be removing the LOV_PATTERN_OVERSTRIPING flag if it is set unnecessarily:
/* If there are enough OSTs, a component with overstriping requested
 * will not actually end up overstriped. The comp should reflect this.
 */
if (!overstriped)
	lod_comp->llc_pattern &= ~LOV_PATTERN_OVERSTRIPING;
If this isn't being applied consistently, then that would be a bug. |
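To illustrate the intent of that check outside the MDS code, here is a self-contained sketch that decides whether an allocation is actually overstriped by looking for an OST index used more than once, and clears the flag otherwise. The flag values and both helper functions are stand-ins for illustration only; the real definitions and the real allocation logic live in the Lustre sources.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative flag values only; the real ones are in the Lustre headers. */
#define PATTERN_RAID0		0x001u
#define PATTERN_OVERSTRIPING	0x200u

/* Return true if any OST index is used by more than one stripe. */
static bool stripes_share_ost(const uint32_t *ost_idx, size_t nstripes)
{
	size_t i, j;

	for (i = 0; i < nstripes; i++)
		for (j = i + 1; j < nstripes; j++)
			if (ost_idx[i] == ost_idx[j])
				return true;
	return false;
}

/* Drop the overstriping flag when the allocation did not reuse any OST. */
static uint32_t fixup_pattern(uint32_t pattern, const uint32_t *ost_idx,
			      size_t nstripes)
{
	if (!stripes_share_ost(ost_idx, nstripes))
		pattern &= ~PATTERN_OVERSTRIPING;
	return pattern;
}
```

For example, two stripes placed on OSTs 1 and 0 end up as plain raid0 even if overstriping was requested, while ten stripes spread over the same two OSTs keep the overstriped flag.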
| Comment by Josh Schwartz [ 08/Sep/23 ] |
|
I don't think that is coming into play here because I'm just showing the default striping on a directory. If I actually create a file within the directory I believe it is behaving as you suggest:
jupiter-p2:/lus/kjcf08 # mkdir test
jupiter-p2:/lus/kjcf08 # lfs setstripe --overstripe-count 1024 --stripe-count 10 test
jupiter-p2:/lus/kjcf08 # touch test/foo
jupiter-p2:/lus/kjcf08 # lfs getstripe test | head
test
stripe_count: 10 stripe_size: 1048576 pattern: raid0,overstriped stripe_offset: -1
test/foo
lmm_stripe_count: 10
lmm_stripe_size: 1048576
lmm_pattern: raid0,overstriped
lmm_layout_gen: 0
lmm_stripe_offset: 1
obdidx objid objid group
Here the file is overstriped because I only have 2 OSTs. This is a bit of a degenerate example, but if I just set --overstripe-count 2 the directory will have a default of overstriped with a stripe count of 2, but files that are created are not overstriped (and have a stripe count of 2):
jupiter-p2:/lus/kjcf08 # lfs setstripe --overstripe-count 2 test
jupiter-p2:/lus/kjcf08 # lfs getstripe -d test
stripe_count: 2 stripe_size: 1048576 pattern: raid0,overstriped stripe_offset: -1
jupiter-p2:/lus/kjcf08 # touch test/foo
jupiter-p2:/lus/kjcf08 # lfs getstripe test/foo
test/foo
lmm_stripe_count: 2
lmm_stripe_size: 1048576
lmm_pattern: raid0
lmm_layout_gen: 0
lmm_stripe_offset: 1
obdidx objid objid group
1 116959791 0x6f8aa2f 0
0 117253333 0x6fd24d5 0
So I think that part of it is working OK. |
| Comment by Patrick Farrell [ 08/Sep/23 ] |
|
OK, that's good, then. The user interface is important but I was more concerned that the server might be marking the layout incorrectly. Obviously default layouts are a different case. |
| Comment by Josh Schwartz [ 08/Sep/23 ] |
|
But just to be clear, the inconsistency I'm concerned about can ultimately affect files, e.g. by ending up with a MUCH larger overstripe count than perhaps was intended if one accidentally does something like this:
jupiter-p2:/lus/kjcf08 # mkdir test
jupiter-p2:/lus/kjcf08 # lfs setstripe --overstripe-count 10 --stripe-count -1 test
jupiter-p2:/lus/kjcf08 # touch test/foo
jupiter-p2:/lus/kjcf08 # lfs getstripe test | head
test
stripe_count: -1 stripe_size: 1048576 pattern: raid0,overstriped stripe_offset: -1
test/foo
lmm_stripe_count: 2727
lmm_stripe_size: 1048576
lmm_pattern: raid0,overstriped
lmm_layout_gen: 0
lmm_stripe_offset: 0
obdidx objid objid group
(note the 2727 value is because I don't have Rajeev's other fix on this system, but on the latest code this would be 2000... still probably not what was expected on a system with 2 OSTs). |