
concurrent statfs() calls on the client should be blocked

Details

    • Improvement
    • Resolution: Fixed
    • Minor
    • Lustre 2.13.0
    • Lustre 2.10.8, Lustre 2.12.3

    Description

      If multiple threads on a client are executing statfs() calls concurrently, and the obd_statfs() cache has expired, then each thread will send an OST_STATFS RPC to each OST. With certain statfs-heavy workloads on many-core client nodes, this can result in thousands of needless RPCs being sent from each client every few seconds.

      Since all of the callers funnel through obd_statfs(), and there is no benefit to having multiple OST_STATFS or MDS_STATFS replies from the same target (they return the same data, and all threads are blocked on the reply), it makes sense to allow only one thread to execute the statfs while the other threads (interruptibly) wait for it to complete.
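
      The idea described above can be illustrated with a small user-space sketch. This is not the Lustre patch and not kernel code: the class name, the fetch callback, and the demo are all hypothetical, and Python's condition-variable wait only stands in for the interruptible wait mentioned above. It shows one thread refreshing an expired cache while the other callers wait and then reuse the result.

```python
import threading
import time


class SingleFlightStatfsCache:
    """Cache one statfs-like result; only one thread refreshes it at a time.

    Hypothetical illustration only, not the obd_statfs() implementation.
    """

    def __init__(self, fetch, max_age=1.0, clock=time.monotonic):
        self._fetch = fetch            # stand-in for the statfs RPC (assumption)
        self._max_age = max_age        # seconds a cached result stays valid
        self._clock = clock
        self._cond = threading.Condition()
        self._value = None
        self._stamp = None             # time of the last successful refresh
        self._refreshing = False       # True while some thread is fetching

    def _fresh(self):
        return (self._stamp is not None
                and self._clock() - self._stamp < self._max_age)

    def get(self):
        with self._cond:
            while True:
                if self._fresh():
                    return self._value
                if not self._refreshing:
                    self._refreshing = True   # this thread becomes the refresher
                    break
                self._cond.wait()             # another thread is already fetching
        try:
            value = self._fetch()             # slow call runs outside the lock
        except BaseException:
            with self._cond:
                self._refreshing = False
                self._cond.notify_all()       # let a waiter retry the fetch
            raise
        with self._cond:
            self._value = value
            self._stamp = self._clock()
            self._refreshing = False
            self._cond.notify_all()           # wake waiters; they see a fresh value
        return value


def demo(num_threads=16):
    """Start num_threads concurrent callers and count backend fetches."""
    calls = []
    results = []
    lock = threading.Lock()

    def fake_statfs():
        with lock:
            calls.append(1)
        time.sleep(0.05)                      # simulate RPC latency
        return {"bavail": 1000}

    cache = SingleFlightStatfsCache(fake_statfs, max_age=5.0)

    def worker():
        result = cache.get()
        with lock:
            results.append(result)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(calls), results


if __name__ == "__main__":
    n_calls, results = demo()
    print("backend fetches:", n_calls, "callers served:", len(results))
```

      In this sketch the slow fetch runs outside the lock, so waiters block only on the condition variable, and a failed fetch wakes the waiters so that one of them can retry.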

          Activity

            [LU-12368] concurrent statfs() calls on the client should be blocked

            shadow Alexey Lyashkov added a comment -

            In fact, the fix is incorrect. See LU-13296.

            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36874
            Subject: LU-12368 ptlrpc: make DEBUG_REQ messages consistent
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: bcf0777c27333aade617e92b9cfe03877e568e4e

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35311/
            Subject: LU-12368 ptlrpc: make DEBUG_REQ messages consistent
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: c0fa0ba4a8efcd774f1fe27986a0217c76dedf6d

            adilger Andreas Dilger added a comment -

            Patrick, patch https://review.whamcloud.com/35383 "obdclass: allow 'lfs df' to specify cache age" is still open. I added that patch while working on "obdclass: don't send multiple statfs RPCs", but I'm not sure whether it is actually useful to land.

            pfarrell Patrick Farrell (Inactive) added a comment -

            Andreas,

            Any reason this ticket is still open?  I think we've got the patches all merged.

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35485/
            Subject: LU-12368 obdclass: don't send multiple statfs RPCs
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 1839debbba0e70dd5cc9f11c8bc83bcead0114d4

            adilger Andreas Dilger added a comment -

            Bob, the default statfs cache age has been 1s since the beginning of Lustre. Increasing it to 30s is pretty reasonable, as (IMHO) applications shouldn't really put too much stock in the amount of free space in the filesystem anyway, given that there may be a large number of jobs running concurrently and allocating and freeing space, space is reserved by clients, quota limits may intervene before the filesystem runs out of space, etc. Either there is lots of free space, and the application shouldn't care, or there isn't lots of space and the application may run out of space even if it checks in advance (e.g. some other process consumes the remaining space).

            Is there any clear indication why the application is doing so many statfs() calls?

            bobhawkins Bob Hawkins added a comment -

            LU-12368 testing is underway right now at the customer site, which runs ExaScaler 4.2 and DDN Lustre 2.10.7_ddn5. We noted that llite.*.statfs_max_age was at its default of 1 (i.e. 1 second).

            When testing with the default age of 1, we regularly observed (~1 in 10) slow app timesteps, where each timestep makes numerous statfs calls.

            After increasing statfs_max_age to 30 seconds, we saw a marked improvement: only about 1 in 900 app timesteps was slow.

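
            To make the effect of the cache age concrete, here is a rough sketch that counts how often a periodically called statfs would find its cache expired. The function name, the 100 ms call interval, and the exact expiry comparison are assumptions for illustration and are not taken from the ticket or from the Lustre source. It only models cache refreshes, not the slow timesteps reported above.

```python
def count_refreshes(max_age_ms, call_interval_ms, duration_ms):
    """Count calls that find the cache expired and would trigger a refetch.

    Assumes a single caller issuing statfs at a fixed interval and a cache
    entry that expires once its age reaches max_age_ms (both hypothetical).
    """
    refreshes = 0
    last_refresh = None
    for now in range(0, duration_ms, call_interval_ms):
        if last_refresh is None or now - last_refresh >= max_age_ms:
            refreshes += 1
            last_refresh = now
    return refreshes


if __name__ == "__main__":
    # One caller issuing statfs every 100 ms for 60 s (made-up workload).
    for age_ms in (1_000, 30_000):
        n = count_refreshes(age_ms, 100, 60_000)
        print(f"max_age={age_ms} ms -> {n} refreshes")
```

            Under these assumptions, the longer cache age cuts the number of refreshes per minute from 60 to 2, independent of how often the application actually calls statfs.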

            Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35485
            Subject: LU-12368 obdclass: don't send multiple statfs RPCs
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: f0819a6e0b676fb80b9a612ce5c48786242da864

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35380/
            Subject: LU-12368 obdclass: don't send multiple statfs RPCs
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 1c41a6ac390bf74a135861efcd576a3b433d3c49

            People

              Assignee: adilger Andreas Dilger
              Reporter: adilger Andreas Dilger