[LU-12368] concurrent statfs() calls on the client should be blocked Created: 31/May/19  Updated: 12/Aug/20  Resolved: 08/Nov/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.8, Lustre 2.12.3
Fix Version/s: Lustre 2.13.0

Type: Improvement Priority: Minor
Reporter: Andreas Dilger Assignee: Andreas Dilger
Resolution: Fixed Votes: 0
Labels: easy

Issue Links:
Related
is related to LU-13296 statfs isn't work properly with MDT s... Resolved

 Description   

If multiple threads on a client are executing statfs() calls concurrently, and the obd_statfs() cache has expired, then each thread will send an OST_STATFS RPC to each OST. With certain statfs-heavy workloads on many-core client nodes, this can result in thousands of needless RPCs being sent from each client every few seconds.

Since all of the callers funnel through obd_statfs(), and there is no benefit to having multiple OST_STATFS or MDS_STATFS replies from the same target (they return the same data, and all threads are blocked on the reply), it makes sense to allow only one thread to execute the statfs and have the other threads (interruptibly) wait for it to complete.
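
To illustrate the idea, here is a minimal userspace sketch of the "one thread fetches, the others wait" pattern, written with POSIX threads. It is not the code from the patches on this ticket: struct statfs_cache, cached_statfs(), fake_statfs_rpc() and run_concurrent_callers() are hypothetical names invented for the example, the RPC is simulated with a short sleep, and the wait is a plain condition-variable wait rather than the interruptible wait proposed above.

```c
/* Hypothetical sketch only: none of these names come from Lustre. */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>
#include <unistd.h>

struct statfs_cache {
	pthread_mutex_t lock;
	pthread_cond_t  done;        /* broadcast when a refresh completes */
	bool            in_flight;   /* some thread is fetching right now */
	bool            valid;       /* cache holds at least one reply */
	time_t          fetched_at;  /* when the cached reply arrived */
	long            free_blocks; /* stand-in for the statfs payload */
	int             rpcs_sent;   /* demonstration counter */
};

static struct statfs_cache cache = {
	.lock = PTHREAD_MUTEX_INITIALIZER,
	.done = PTHREAD_COND_INITIALIZER,
};

/* Stand-in for the network round trip to one target. */
static long fake_statfs_rpc(void)
{
	usleep(50 * 1000);
	return 123456;
}

/* Caller must hold cache.lock. */
static bool cache_fresh(time_t max_age)
{
	return cache.valid && time(NULL) - cache.fetched_at <= max_age;
}

/* Return cached data if fresh; otherwise let exactly one thread refresh. */
long cached_statfs(time_t max_age)
{
	long result;

	pthread_mutex_lock(&cache.lock);
	/* Another thread is already fetching: wait for its reply. */
	while (!cache_fresh(max_age) && cache.in_flight)
		pthread_cond_wait(&cache.done, &cache.lock);

	if (!cache_fresh(max_age)) {
		/* This thread becomes the single fetcher. */
		cache.in_flight = true;
		pthread_mutex_unlock(&cache.lock);
		result = fake_statfs_rpc();	/* lock dropped during the "RPC" */
		pthread_mutex_lock(&cache.lock);
		cache.free_blocks = result;
		cache.fetched_at = time(NULL);
		cache.valid = true;
		cache.rpcs_sent++;
		cache.in_flight = false;
		pthread_cond_broadcast(&cache.done);
	}
	result = cache.free_blocks;
	pthread_mutex_unlock(&cache.lock);
	return result;
}

static void *caller(void *arg)
{
	(void)arg;
	cached_statfs(30);	/* 30s max age, chosen for the demo */
	return NULL;
}

/* Start nthreads concurrent callers; return total simulated RPCs sent. */
int run_concurrent_callers(int nthreads)
{
	pthread_t tid[64];

	assert(nthreads > 0 && nthreads <= 64);
	for (int i = 0; i < nthreads; i++) {
		int rc = pthread_create(&tid[i], NULL, caller, NULL);
		assert(rc == 0);
	}
	for (int i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);
	return cache.rpcs_sent;
}
```

In this sketch, calling run_concurrent_callers(8) on an empty cache should issue a single simulated RPC: the first thread to take the lock becomes the fetcher, threads that arrive during the fetch sleep on the condition variable, and threads that arrive afterwards find a fresh cache. The lock is dropped while the simulated RPC is outstanding so that waiters are not serialized behind the fetch itself.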



 Comments   
Comment by Gerrit Updater [ 25/Jun/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35311
Subject: LU-12368 ptlrpc: make DEBUG_REQ messages consistent
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 5548152268223786e8acc9df6d470672d8b6e403

Comment by Gerrit Updater [ 29/Jun/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35380
Subject: LU-12368 obdclass: don't send multiple statfs RPCs
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: acb1b155fd18a7bc835c9d31ffb7fa06e2c009fa

Comment by Gerrit Updater [ 29/Jun/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35383
Subject: LU-12368 obdclass: allow 'lfs df' to specify cache age
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 2448d63aed511658f4b1b6e61789271c77fb0390

Comment by Gerrit Updater [ 12/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35380/
Subject: LU-12368 obdclass: don't send multiple statfs RPCs
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 1c41a6ac390bf74a135861efcd576a3b433d3c49

Comment by Gerrit Updater [ 12/Jul/19 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35485
Subject: LU-12368 obdclass: don't send multiple statfs RPCs
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: f0819a6e0b676fb80b9a612ce5c48786242da864

Comment by Bob Hawkins [ 25/Jul/19 ]

LU-12368 testing is underway at the customer site, which is running ExaScaler 4.2 and DDN Lustre 2.10.7_ddn5. We noted that llite.*.statfs_max_age was set to its default of 1 (i.e. 1 second).

When testing with the default age of 1, we regularly observed slow app timesteps (~1 in 10). Each timestep makes numerous statfs calls.

After increasing statfs_max_age to 30 seconds, we saw a marked improvement: only 1 in 900 app timesteps was slow.

Comment by Andreas Dilger [ 25/Jul/19 ]

Bob, the default statfs cache age has been 1s since the beginning of Lustre. Increasing it to 30s is pretty reasonable, as (IMHO) applications shouldn't really put too much stock in the amount of free space in the filesystem anyway, given that there may be a large number of jobs running concurrently and allocating and freeing space, space is reserved by clients, quota limits may intervene before the filesystem runs out of space, etc. Either there is lots of free space, and the application shouldn't care, or there isn't lots of space and the application may run out of space even if it checks in advance (e.g. some other process consumes the remaining space).

Is there any clear indication why the application is doing so many statfs() calls?

Comment by Gerrit Updater [ 26/Jul/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35485/
Subject: LU-12368 obdclass: don't send multiple statfs RPCs
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 1839debbba0e70dd5cc9f11c8bc83bcead0114d4

Comment by Patrick Farrell (Inactive) [ 30/Jul/19 ]

Andreas,

Any reason this ticket is still open?  I think we've got the patches all merged.

Comment by Andreas Dilger [ 31/Jul/19 ]

Patrick, patch https://review.whamcloud.com/35383 "obdclass: allow 'lfs df' to specify cache age" is still open. I added that patch while working on "obdclass: don't send multiple statfs RPCs", but I'm not sure whether it is actually useful to land.

Comment by Gerrit Updater [ 09/Sep/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35311/
Subject: LU-12368 ptlrpc: make DEBUG_REQ messages consistent
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: c0fa0ba4a8efcd774f1fe27986a0217c76dedf6d

Comment by Gerrit Updater [ 26/Nov/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36874
Subject: LU-12368 ptlrpc: make DEBUG_REQ messages consistent
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: bcf0777c27333aade617e92b9cfe03877e568e4e

Comment by Alexey Lyashkov [ 26/Feb/20 ]

In fact, the fix is incorrect. Let's look at LU-13296.

Generated at Sat Feb 10 02:51:58 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.