Details
Bug
Resolution: Unresolved
Major
None
Lustre 2.3.0, Lustre 2.1.2
https://github.com/chaos/lustre/commits/2.1.1-10chaos, 3000+ OSTs across 4 filesystems
3
4033
Description
Our users found that their application was scaling very poorly on our "zin" cluster. It is a Sandy Bridge cluster, 16 cores per node, roughly 3000 nodes. At relatively low node counts (512 nodes), they found that their performance on zin, now that it is on the secure network, is 1/4 of what it was when zin was on the open network.
One of the few differences is that zin now talks to 3000+ OSTs on the secure network, whereas it only talked to a few hundred OSTs while it was being shaken down on the open network. One of our engineers noted that ldlm_poold was frequently using 0.3% of CPU time on zin.
The application in question is HIGHLY sensitive to system daemons and other CPU noise on the compute nodes because it is highly MPI-coordinated. I created the attached patch (ldlm_poold_period.patch) that allows me to change the sleep interval used by ldlm_poold. Sure enough, if I change the sleep time to 300 seconds, the application's performance immediately improves by 4X.
Having ldlm_poold walk a list of 3000+ namespaces every second and do nothing most of the time (because client namespaces are only actually "recalculated" every 10 seconds) is a very bad design. The patch was just to determine whether that was really the cause.
I will now work on a real fix.
I think that instead of making ldlm_poold's sleep time configurable, I will make both LDLM_POOL_SRV_DEF_RECALC_PERIOD and LDLM_POOL_CLI_DEF_RECALC_PERIOD tunable. Then I will make ldlm_poold sleep dynamically based on the next period in the list of namespaces...although I probably don't want each namespace to have its own starting time.
I probably want to keep the server and client namespace periods in sync with the namespaces of the same type, and then perhaps order the list as well to avoid walking the entire list unnecessarily.
No work needed by Whamcloud right now, except perhaps to comment on my approach if you think there is something that I should be doing differently (or if there is already work in this area that I haven't found).
Attachments
Issue Links
- is related to LU-2924 shrink ldlm_poold workload (Resolved)
I took some time to look at the code for this today. The more I read, the worse things look.
First of all, I think the use of these globals makes the code less readable than it could be:
If we just made a struct with the three variables, and then had two global variables of that type, I think we could reduce the number of special accessor functions to one.
Next, having a thread regularly poll all of the namespaces seems like a poor approach to handling SLV on the client side. I would think that only when a changed SLV comes in over the network would we want to put the involved pool on a work list and send a signal to a sleeping thread to handle the work.
Also, each pool tracks its own last recalculation time in pl_recalc_time, and has its own pl_recalc_period. Why? I see no reason at all for these. If not for the time it takes to walk the namespace list, these recalc times would stay exactly in sync. So why not just have a single value that can actually be adjusted?
On the client side, we wake every second and shuffle the entire client list of namespaces. 9 times out of 10 there will be nothing to do because the pl_recalc_period is 10 seconds in each ns. That is pretty silly.
We could perhaps order the list based on time remaining until the next recalc is needed...but that doesn't seem to be worthwhile.
I can perhaps see walking only a limited number of namespaces each time, as Andreas mentioned above. This means that each namespace will only have its pool recalculated relatively rarely. So then the question is whether it is better to have frequent small bits of noise, or rarer but larger amounts of noise.
In either case, though, we make SLV less effective as a tool to reduce server lock load. Of course, SLV currently seems to be a pretty ineffective way of reducing lock usage when the server is under memory pressure (LU-1128), so maybe I just shouldn't worry about that for now.
I think I will just make ldlm_poold's polling interval configurable. I will also probably synchronize the polling intervals with the wall clock, because it is much better for all of the nodes to see the noise at the same time.
But I think we need to come up with a plan for overhauling this in a future version.