Details

    Description

      When running a single-shared-file IOR on a compute node with a large number of cores, it's possible to trigger soft lockups.  Applying LU-17630 helps but doesn't entirely resolve the issue.  The stack traces logged by the soft lockup watchdog indicate that the cause is heavy contention on the page cache spin lock in delete_from_page_cache().

      RIP: 0010:delete_from_page_cache+0x52/0x70
      [ 9375.915829]  generic_error_remove_page+0x36/0x60
      [ 9375.915837]  cl_page_discard+0x47/0x80 [obdclass]
      [ 9375.915883]  discard_pagevec+0x7d/0x150 [osc]
      [ 9375.915900]  osc_lru_shrink+0x87f/0x8b0 [osc] 
      [ 9375.915913]  lru_queue_work+0xfd/0x230 [osc]
      [ 9375.915925]  work_interpreter+0x32/0x110 [ptlrpc]
      [ 9375.915992]  ptlrpc_check_set+0x5cf/0x1fc0 [ptlrpc]
      [ 9375.916052]  ptlrpcd+0x6df/0xa70 [ptlrpc]
      [ 9375.916176]  kthread+0x14c/0x170

      It looks like this is possible because:
      1. Multiple callers pass 'force=1' to osc_lru_shrink(), which allows multiple threads to run it concurrently.  lru_queue_work() does use 'force=0', which is good.
      2. There is no per-filesystem or per-node limit on how many threads can run osc_lru_shrink().  It is limited only per client_obd, using the 'cl_lru_shrinkers' atomic.

      I'll push a patch for review which adds a per-filesystem limit.  Interestingly, it looks like portions of this may have been implemented long ago but never completed.  The proposed patch still needs to be tested on a system with a large number of OSCs, but I wanted to post it for initial feedback.
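
      The idea of a per-filesystem limit can be illustrated with a small userspace sketch.  This is not the proposed patch and does not use the real Lustre structures: the struct fs_lru_state, its fields, and the shrink_try_enter()/shrink_exit() helpers are hypothetical names, and C11 atomics stand in for the kernel's atomic_t.  It only shows one way a shared counter could cap how many threads are inside a shrink routine at once.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-filesystem state; NOT the actual Lustre structures. */
struct fs_lru_state {
	atomic_int active_shrinkers;	/* threads currently shrinking */
	int max_shrinkers;		/* assumed cap on concurrent shrinkers */
};

/* Try to claim a shrinker slot; returns true if the caller may proceed. */
static bool shrink_try_enter(struct fs_lru_state *fs)
{
	int cur = atomic_load(&fs->active_shrinkers);

	while (cur < fs->max_shrinkers) {
		/* On failure, 'cur' is refreshed with the current count. */
		if (atomic_compare_exchange_weak(&fs->active_shrinkers,
						 &cur, cur + 1))
			return true;
	}
	return false;
}

/* Release a slot previously claimed with shrink_try_enter(). */
static void shrink_exit(struct fs_lru_state *fs)
{
	int prev = atomic_fetch_sub(&fs->active_shrinkers, 1);

	assert(prev > 0);
	(void)prev;	/* silence unused warning when NDEBUG is set */
}

/*
 * Stand-in for a shrink routine: returns the number of pages it would
 * have discarded, or 0 if the limit is reached and the caller backs off.
 */
static long shrink_sketch(struct fs_lru_state *fs, long target)
{
	if (!shrink_try_enter(fs))
		return 0;
	/* ... discard up to 'target' pages here ... */
	shrink_exit(fs);
	return target;
}
```

      In this sketch a caller that fails to get a slot simply returns without touching the page cache, so the number of threads contending on the lock is bounded by max_shrinkers regardless of how many OSCs exist.  Whether a 'force=1' caller should bypass such a limit is a design choice the sketch does not address.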

          Activity

            [LU-18053] Add active osc_lru_shrink() limit

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/56390/
            Subject: LU-18053 osc: add another cond_resched() to osc_lru_shrink()
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: 1329003c03d7a40262cad74c1446313eb94a0f9a

            pjones Peter Jones added a comment -

            Included in 2.16


            "Eric Carbonneau <carbonneau1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/56390
            Subject: LU-18053 osc: add another cond_resched() to osc_lru_shrink()
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: a132ecac9e077ffb42c1e04fd41ba5aa4ecd34f2


            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/55888/
            Subject: LU-18053 osc: add another cond_resched() to osc_lru_shrink()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 66549c1540b2931ae1d1d1ebb50afbf15683baf4


            "Brian Behlendorf <behlendorf1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55888
            Subject: LU-18053 osc: add another cond_resched() to osc_lru_shrink()
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: a9a9e505caf24d337eb77396bc993445c56e94f1


            behlendorf Brian Behlendorf added a comment -

            Our last round of testing included a backport of the LU-17630 patch.  This helped a bit, but wasn't sufficient.  We were still unable to complete a shared-file IOR, and soft lockups were reported on the clients.  https://review.whamcloud.com/c/fs/lustre-release/+/55830 was aimed at further reducing this contention.

            zam Alexander Zarochentsev added a comment -

            Brian,
            can you try https://review.whamcloud.com/c/fs/lustre-release/+/54346 from LU-17630 and check whether the soft lockups are gone?

            Thanks,
            Zam

            gerrit Gerrit Updater added a comment - - edited

            "Brian Behlendorf <behlendorf1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/55830
            Subject: LU-18053 osc: add active osc_lru_shrink() limit
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: e3064da710de98da43f8bb607dab2ab140e087e3

            behlendorf Brian Behlendorf added a comment - - edited

            It seems I'm having a bit of trouble with Gerrit.  For the moment, I've pushed the proposed change to https://github.com/LLNL/lustre/commits/add-fs-lru-shrink-limit/ for reference.  I'll look into pushing it correctly on Monday.


            People

              behlendorf Brian Behlendorf
              behlendorf Brian Behlendorf
              Votes: 0
              Watchers: 9
