[LU-12368] concurrent statfs() calls on the client should be blocked Created: 31/May/19 Updated: 12/Aug/20 Resolved: 08/Nov/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.10.8, Lustre 2.12.3 |
| Fix Version/s: | Lustre 2.13.0 |
| Type: | Improvement | Priority: | Minor |
| Reporter: | Andreas Dilger | Assignee: | Andreas Dilger |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | easy |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
If multiple threads on a client are executing statfs() calls concurrently, and the obd_statfs() cache has expired, then each thread will send an OST_STATFS RPC to each OST. With certain statfs-heavy workloads on many-core client nodes, this can result in thousands of needless RPCs being sent from each client every few seconds. Since all of the callers funnel through obd_statfs(), and there is no benefit to having multiple OST_STATFS or MDS_STATFS replies from the same target (they return the same data, and all threads are blocked on the reply), it makes sense to allow only one thread to execute the statfs and have the other threads (interruptibly) wait for it to complete. |
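The idea in the description can be sketched in user space as a "single-flight" cache. This is only an illustration in Python, not the Lustre kernel code: the class `StatfsCache`, the `fetch` callback standing in for the statfs RPC, and the `demo` helper are all hypothetical names, and the wait here is not interruptible the way the description proposes.

```python
import threading
import time


class StatfsCache:
    """Toy single-flight cache (hypothetical, not the Lustre implementation)."""

    def __init__(self, fetch, max_age=1.0):
        self._fetch = fetch            # stand-in for sending the statfs RPC
        self._max_age = max_age        # seconds a cached reply stays valid
        self._cond = threading.Condition()
        self._value = None
        self._stamp = None             # monotonic time of the last refresh
        self._refreshing = False       # True while one thread is fetching

    def _fresh(self):
        return (self._stamp is not None
                and time.monotonic() - self._stamp < self._max_age)

    def statfs(self):
        with self._cond:
            while True:
                if self._fresh():
                    return self._value         # cache hit, no RPC
                if not self._refreshing:
                    self._refreshing = True    # this thread does the refresh
                    break
                self._cond.wait()              # someone else is refreshing
        try:
            value = self._fetch()              # "RPC" runs outside the lock
        except BaseException:
            with self._cond:
                self._refreshing = False       # let a waiter retry
                self._cond.notify_all()
            raise
        with self._cond:
            self._value = value
            self._stamp = time.monotonic()
            self._refreshing = False
            self._cond.notify_all()            # wake all waiting threads
        return value


def demo(n_threads=16):
    """Run n_threads concurrent statfs() calls against an empty cache."""
    calls = []

    def fake_rpc():
        calls.append(1)                # count how many "RPCs" were sent
        time.sleep(0.05)               # simulate network latency
        return {"bavail": 1000}

    cache = StatfsCache(fake_rpc, max_age=60.0)
    results = []
    threads = [threading.Thread(target=lambda: results.append(cache.statfs()))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(calls), results
```

Calling `demo(16)` starts 16 threads against an expired cache. Only the first thread to take the lock calls `fake_rpc`; the others either wait on the condition variable or find the cache already fresh, so all of them return the same reply from a single fetch.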
| Comments |
| Comment by Gerrit Updater [ 25/Jun/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35311 |
| Comment by Gerrit Updater [ 29/Jun/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35380 |
| Comment by Gerrit Updater [ 29/Jun/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35383 |
| Comment by Gerrit Updater [ 12/Jul/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35380/ |
| Comment by Gerrit Updater [ 12/Jul/19 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35485 |
| Comment by Bob Hawkins [ 25/Jul/19 ] |
|
When testing with the default statfs_max_age of 1 second, we regularly observed slow app timesteps (~1 in 10); each timestep makes numerous statfs calls. After increasing statfs_max_age to 30 seconds, we saw a marked improvement: only about 1 in 900 app timesteps was slow. |
| Comment by Andreas Dilger [ 25/Jul/19 ] |
|
Bob, the default statfs cache age has been 1s since the beginning of Lustre. Increasing it to 30s is pretty reasonable, as (IMHO) applications shouldn't really put too much stock in the amount of free space in the filesystem anyway, given that there may be a large number of jobs running concurrently and allocating and freeing space, space is reserved by clients, quota limits may intervene before the filesystem runs out of space, etc. Either there is lots of free space, and the application shouldn't care, or there isn't lots of space and the application may run out of space even if it checks in advance (e.g. some other process consumes the remaining space). Is there any clear indication why the application is doing so many statfs() calls? |
| Comment by Gerrit Updater [ 26/Jul/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35485/ |
| Comment by Patrick Farrell (Inactive) [ 30/Jul/19 ] |
|
Andreas, any reason this ticket is still open? I think we've got the patches all merged. |
| Comment by Andreas Dilger [ 31/Jul/19 ] |
|
Patrick, patch https://review.whamcloud.com/35383 "obdclass: allow 'lfs df' to specify cache age" is still open. I added that patch while working on "obdclass: don't send multiple statfs RPCs" but I'm not sure if it is actually useful to land or not. |
| Comment by Gerrit Updater [ 09/Sep/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35311/ |
| Comment by Gerrit Updater [ 26/Nov/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36874 |
| Comment by Alexey Lyashkov [ 26/Feb/20 ] |
|
In fact, the fix is incorrect. Let's look at