[LU-16749] apply a filter to the configuration log. Created: 18/Apr/23 Updated: 02/Nov/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.9 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Beal | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Epic/Theme: | client |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We discovered that a Lustre server can have entries in its configuration log which affect other filesystems, e.g.:

lctl set_param -P llite.*.max_cached_mb=2048

When a filesystem with this entry in its configuration log is mounted, the client applies the option to all mounted filesystems. I believe that the client should only apply a configuration option from the log to elements of that filesystem.

I can imagine a site with multiple vendors where, for example, one vendor puts options in its log that cause issues for a competitor's filesystem.
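For illustration, a minimal sketch assuming a filesystem named "fs1" (the fsname is hypothetical): the first command below ends up in fs1's configuration log but is applied by clients to every mounted filesystem, while the second is scoped to fs1 alone.

# wildcard form: recorded in fs1's config log, applied to all llite instances
lctl set_param -P llite.*.max_cached_mb=2048
# scoped form: only matches llite devices belonging to fs1
lctl set_param -P llite.fs1-*.max_cached_mb=2048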
|
| Comments |
| Comment by James A Simmons [ 18/Apr/23 ] |
|
A filter exists already. The correct command to use is "lctl set_param llite.my_filesystem-*.max_cached_mb=2048". Using just '*' means the setting applies to everything.
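A pattern can also be checked against the currently mounted filesystems before it is applied (a sketch; "fs1" is a hypothetical fsname):

# list every llite instance the wildcard would touch
lctl get_param -N llite.*.max_cached_mb
# list only the instances belonging to fs1
lctl get_param -N llite.fs1-*.max_cached_mb
|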
| Comment by Andreas Dilger [ 18/Apr/23 ] |
|
This has definitely been a problem in the past, especially when mounting two filesystems with different timeouts, since there is only a single global value for the timeout on a client node.

Since '*' affects all filesystems, this is a bit tricky to fix: a single MGS may be serving multiple local filesystems, so it isn't possible to automatically map '*' in lctl to contain an fsname in the MGS config. It might be possible to constrain this to a specific filesystem during mount, since the other filesystem is also going to re-apply its settings when it mounts. However, not all '*' uses are for the device name (though usually that is true), so this may be more complicated than expected.

James B., getting some specific details of which parameters you had problems with would help to understand what needs to be fixed. As James S. wrote, it is already possible to use fsname-* instead of just * for many cases, though this requires more familiarity with Lustre, and probably updates to the Lustre manual and man pages to emphasize the correct usage.
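For anyone auditing an existing setup: assuming the MGS stores 'lctl set_param -P' records in the "params" configuration log (as in recent Lustre releases), they can be listed on the MGS node with llog_print, e.g.:

# print the permanent parameter records held by the MGS
lctl --device MGS llog_print params
|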
| Comment by James Beal [ 19/Apr/23 ] |
|
James S., I understand that parameters can be set correctly scoped; "lctl set_param -P llite.*.max_cached_mb=2048" is what bit us in particular. It was set in error on a new filesystem while debugging another issue. When the filesystem was put into production, performance on all filesystems plummeted in a hard-to-find way. What I am asking for is that a Lustre client only applies settings fetched from a remote MGS to the filesystem that the client is mounting at that time.
To put this in context, we put the new fileserver into production on the 27th of February. We then had several weeks where we were getting tickets for bad performance, and the filesystems were being hit harder than normal, but when we looked at a filesystem we saw good performance for the level of load visible on it. By the 6th of March I was convinced this was a real issue and started investigating. We found the cause by the 10th, and by the 14th we had returned performance to acceptable levels.
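For the record, once such a stray permanent setting is found, it can be removed from the MGS configuration with the -d option to set_param -P (a sketch of the general approach, not necessarily the exact command we ran):

# delete the offending record from the MGS configuration log
lctl set_param -P -d llite.*.max_cached_mb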
As an aside, and I suspect this is more a specific-vendor comment than a Whamcloud one: there seems to be a practice of setting configuration on the servers where the clients are in a better position to know the best values, e.g. a machine with 2TB of RAM will want a different max_cached_mb than a machine with 64GB.
I am also interested in why there is a max_cached_mb setting at all; why should there be a limit on the amount of RAM used by the cache? (But I feel that would be a new ticket, if it warrants discussion.)
In particular, I can imagine a black-hat filesystem carrying a config entry for a competitor's system to reduce its performance, although I suspect human error, as we saw, is much more likely. |
| Comment by Andreas Dilger [ 20/Apr/23 ] |
|
Definitely the max_cached_mb parameter is a client setting. The fact that it is set on the server is purely for convenience, so that every client mounting the filesystem does not have to update the configuration. As for why this parameter exists, it is to allow admins to limit the amount of RAM used by the filesystem to avoid impacting application memory. Some apps are tuned for a specific amount of RAM, and the admins/users don't want Lustre to interfere with that. Allowing a value like "20%" or "80%" would make this usable on heterogeneous clients and probably would not be hard to implement. It used to be that most nodes in a cluster were the same, and a few would get a client-side setting, but it makes sense to handle this more flexibly.

As for limiting parameters to a single filesystem mount, that is definitely something I'm interested to have fixed; it just hasn't happened yet.
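Until a percentage syntax exists, a site could approximate it client-side at boot time; a sketch, where the 80% figure and the script itself are illustrative only:

# cap the Lustre client cache at 80% of this node's RAM
PCT=80
TOTAL_MB=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) / 1024 ))
lctl set_param llite.*.max_cached_mb=$(( TOTAL_MB * PCT / 100 ))
|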
| Comment by James Beal [ 20/Apr/23 ] |
|
Andreas, can we use this issue for limiting parameters to a single filesystem mount? I would like to see a percentage accepted as a value for max_cached_mb, but I don't understand why the parameter needs to exist at all; the nearest equivalent I know of for the VFS cache is vfs_cache_pressure. It seems to me that the kernel should "just" evict pages if a user-space program wants them. (In a tightly coupled MPI program I can see this could be an issue across multiple nodes, but the use case for genomics is a random soup of decoupled workloads, so I feel we should have the setting at something like 95%.) Shall I make a new issue for the percentage? |
| Comment by Andreas Dilger [ 20/Apr/23 ] |
|
Yes, a separate issue for max_cached_mb=N% would be good. I can assure you that this tunable to limit the Lustre cache size is a feature that some sites want in HPC environments. If the cache limit is high, then page cache management defers to the VM under pressure, but otherwise it can be difficult for the VM to manage pages, because it can take tens of seconds for dirty data to be written out when the server is busy. |
| Comment by Andreas Dilger [ 02/Nov/23 ] |
|
The patch https://review.whamcloud.com/51952 |