Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- Labels: None
- Affects Version/s: Lustre 2.3.0, Lustre 2.1.2
- Environment: https://github.com/chaos/lustre/commits/2.1.1-10chaos, 3000+ OSTs across 4 filesystems
- 3
- 4033
Description
Our users found that their application was scaling very poorly on our "zin" cluster. It is a Sandy Bridge cluster, 16 cores per node, roughly 3000 nodes. At relatively low node counts (512 nodes), they found that their performance on zin, now that it is on the secure network, is 1/4 of what it was when zin was on the open network.
One of the few differences is that zin now talks to 3000+ OSTs on the secure network, whereas it only talked to a few hundred OSTs while it was being shaken down on the open network. One of our engineers noted that the ldlm_poold was frequently using 0.3% of CPU time on zin.
The application in question is HIGHLY sensitive to system daemons and other CPU noise on the compute nodes because it is highly MPI-coordinated. I created the attached patch (ldlm_poold_period.patch) that allows me to change the sleep interval used by the ldlm_poold. Sure enough, if I change the sleep time to 300 seconds, the application's performance immediately improves by 4X.
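The shape of the change is roughly the following. This is only a minimal sketch of the idea, not the attached patch itself; the parameter name, the recalc_all_pools() stub, and the loop structure are illustrative stand-ins for the real thread in lustre/ldlm/ldlm_pools.c:

    #include <linux/module.h>
    #include <linux/kthread.h>
    #include <linux/sched.h>

    /* illustrative tunable; the real change is ldlm_poold_period.patch */
    static unsigned int ldlm_poold_period = 1;      /* seconds */
    module_param(ldlm_poold_period, uint, 0644);
    MODULE_PARM_DESC(ldlm_poold_period, "ldlm_poold sleep interval (seconds)");

    /* stand-in for the real per-namespace recalculation walk */
    static void recalc_all_pools(void)
    {
        /* walk every namespace and call its pool recalc function */
    }

    /* started with kthread_run() at module init */
    static int ldlm_poold_main(void *arg)
    {
        while (!kthread_should_stop()) {
            recalc_all_pools();
            /* was a fixed 1-second sleep; now tunable, e.g. 300s */
            schedule_timeout_interruptible(ldlm_poold_period * HZ);
        }
        return 0;
    }

Setting a knob like this to 300 seconds is what produced the 4X improvement described above.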
The ldlm_poold walking a list of 3000+ namespaces every second and doing nothing most of the time (because client namespaces are only actually "recalculated" every 10s) is a very bad design. The patch was just to determine if that was really the cause.
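Schematically, each pool's recalculation already guards on its own period, so with a 10-second client period roughly nine out of every ten visits from the 1-second walk do nothing. The snippet below is a simplified illustration; the struct, field, and function names are stand-ins rather than the exact ones in ldlm_pools.c:

    #include <linux/ktime.h>

    struct pool {
        time64_t recalc_time;    /* last recalculation */
        time64_t recalc_period;  /* 10s for client pools */
    };

    /* called for every namespace on every 1-second wakeup of ldlm_poold */
    static int pool_recalc(struct pool *pl)
    {
        time64_t now = ktime_get_seconds();

        /* client pools recalculate every 10s, so most visits
         * return here having done no useful work */
        if (now - pl->recalc_time < pl->recalc_period)
            return 0;

        pl->recalc_time = now;
        /* ... actual lock volume / limit recalculation ... */
        return 0;
    }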
I will now work on a real fix.
I think instead of making the ldlm_poold's sleep time configurable, I will make both LDLM_POOL_SRV_DEF_RECALC_PERIOD and LDLM_POOL_CLI_DEF_RECALC_PERIOD tunable. Then I will make the ldlm_poold dynamically sleep based on the next recalculation time in the list of namespaces...although I probably don't want each namespace to have its own starting time.
I probably want to keep the recalculation periods in sync across namespaces of the same type (server vs. client), and then perhaps order the list as well to avoid walking the entire list unnecessarily.
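Something like the following is the wakeup computation I have in mind. It is only a sketch under the assumption that all server pools share one schedule and all client pools another; the variable names are hypothetical, and the 1s server default is assumed (the 10s client period matches the behavior described above):

    #include <linux/kernel.h>
    #include <linux/ktime.h>

    /* proposed tunables, loosely corresponding to
     * LDLM_POOL_SRV_DEF_RECALC_PERIOD / LDLM_POOL_CLI_DEF_RECALC_PERIOD */
    static unsigned int srv_recalc_period = 1;   /* seconds, assumed default */
    static unsigned int cli_recalc_period = 10;  /* seconds */

    /* one shared schedule per namespace type, so individual
     * per-namespace start times are not needed */
    static time64_t srv_last_recalc;
    static time64_t cli_last_recalc;

    /* seconds until the earliest pending recalculation of either type;
     * ldlm_poold would sleep this long instead of a fixed 1s tick */
    static time64_t poold_seconds_until_next_recalc(void)
    {
        time64_t now = ktime_get_seconds();
        time64_t srv_next = srv_last_recalc + srv_recalc_period;
        time64_t cli_next = cli_last_recalc + cli_recalc_period;
        time64_t next = min(srv_next, cli_next);

        return next > now ? next - now : 0;
    }

With 3000+ namespaces kept on two shared schedules, the daemon never has to walk the whole list just to discover that nothing is due yet; keeping the list sorted by next recalculation time would be the generalization if per-namespace periods ever diverge.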
No work needed by Whamcloud right now, except perhaps to comment on my approach if you think there is something that I should be doing differently (or if there is already work in this area that I haven't found).
Attachments
Issue Links
- is related to: LU-2924 shrink ldlm_poold workload (Resolved)