
[LU-14125] client starved for grant but OST has plenty of free space

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version: Lustre 2.14.0
    • Fix Version: None
    • Environment: zfs-0.7
      lustre-2.12.5_5.llnl-1.ch6.x86_64
    • Severity: 3

    Description

      Something is causing cur_grant_bytes for some OSCs to drop below 1MB. Which OSCs and which nodes are affected appears to be random. The OSTs themselves have many TB of free space. Sequential writes (e.g. dd if=/dev/urandom of=file_on_ost_3 bs=1M count=40, where the file has just one stripe) produce osc_enter_cache() debug entries reporting that it is falling back to sync I/O. We also see osc_update_grant() report that it got 0 extra grant.
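
      To observe this on an affected client, the rough sequence below works for us (a sketch; the mount point /mnt/lsrza and the stripe index are placeholders for illustration):

      # current grant for the affected OSC, in bytes
      lctl get_param osc.lsrza-OST0003-osc-*.cur_grant_bytes

      # enable cache debugging, clear the log, write a single-stripe file on OST0003
      lctl set_param debug="+cache"
      lctl clear
      lfs setstripe -c 1 -i 3 /mnt/lsrza/file_on_ost_3    # hypothetical path
      dd if=/dev/urandom of=/mnt/lsrza/file_on_ost_3 bs=1M count=40

      # dump the debug log and look for the sync I/O fallback
      lctl dk /tmp/dk.txt
      grep "fall back to sync i/o" /tmp/dk.txt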

      We have not been able to identify a workload or other trigger that pushes cur_grant_bytes low in the first place. We also have not been able to find a workaround that results in the OST returning extra grant.

      We set grant_shrink=0 on all clients (using set_param -P on the MGS) and then stopped and started all the OSTs on the file system. This did not change the symptoms in any obvious way.
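
      For reference, this is roughly what we ran (a minimal sketch, assuming the standard osc.*.grant_shrink tunable):

      # on the MGS: persistently disable grant shrinking for all clients
      lctl set_param -P osc.*.grant_shrink=0

      # on a client: confirm the setting is in effect
      lctl get_param osc.*.grant_shrink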

      Client snippet with debug="+cache":

      00000008:00000020:55.0:1604686702.850333:0:15766:0:(osc_cache.c:1613:osc_enter_cache()) lsrza-OST0003-osc-ffff8b2f37fc1000: grant { dirty: 0/512000 dirty_pages: 0/16450184 dropped: 0 avail: 997461, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 9984, left: 256, waiters: 0 }need:1703936
      
      00000008:00000020:55.0:1604686702.850335:0:15766:0:(osc_cache.c:1543:osc_enter_cache_try()) lsrza-OST0003-osc-ffff8b2f37fc1000: grant { dirty: 0/512000 dirty_pages: 0/16450184 dropped: 0 avail: 997461, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 9984, left: 256, waiters: 0 }need:1703936
      
      00000008:00000020:55.0:1604686702.850337:0:15766:0:(osc_cache.c:1658:osc_enter_cache()) lsrza-OST0003-osc-ffff8b2f37fc1000: grant { dirty: 0/512000 dirty_pages: 0/16450184 dropped: 0 avail: 997461, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 9984, left: 256, waiters: 0 }no grant space, fall back to sync i/o
      
      00000008:00400020:55.0:1604686702.850352:0:15766:0:(osc_io.c:127:osc_io_submit()) 256 1
      00000008:00000020:55.0:1604686702.850385:0:15766:0:(osc_cache.c:1743:osc_update_pending()) obj ffff8b2c6bc58640 ready 0|-|- wr 256|+|- rd 0|- update pending cmd 2 delta 256.
      00000008:00000020:55.0:1604686702.850387:0:15766:0:(osc_cache.c:2297:osc_io_unplug0()) Queue writeback work for client ffff8b1efb0d25e0.
      00000008:00000020:19.0:1604686702.850400:0:20698:0:(osc_request.c:3171:brw_queue_work()) Run writeback work for client obd ffff8b1efb0d25e0.
      00000008:00000020:19.0:1604686702.850402:0:20698:0:(osc_cache.c:2222:osc_check_rpcs()) obj ffff8b2c6bc58640 ready 0|-|- wr 256|+|- rd 0|- 0 in flight
      00000008:00000020:19.0:1604686702.850404:0:20698:0:(osc_cache.c:1697:osc_makes_rpc()) high prio request forcing RPC
      00000008:00000020:19.0:1604686702.850405:0:20698:0:(osc_cache.c:1888:try_to_add_extent_for_io()) extent ffff8b2c6b9c7c30@{[9984 -> 10239/10239], [1|0|+|lockdone|wShu|ffff8b2c6bc58640], [0|256|+|-|          (null)|256|          (null)]} trying to add this extent
      00000008:00000020:19.0:1604686702.850408:0:20698:0:(osc_cache.c:1743:osc_update_pending()) obj ffff8b2c6bc58640 ready 0|-|- wr 0|-|- rd 0|- update pending cmd 2 delta -256.
      00000008:00000020:19.0:1604686702.850441:0:20698:0:(osc_request.c:705:osc_announce_cached()) dirty: 0 undirty: 1879048191 dropped 0 grant: 997461
      00000008:00000020:19.0:1604686702.850443:0:20698:0:(osc_request.c:714:osc_update_next_shrink()) next time 6200398 to shrink grant
      00000008:00000020:60.0:1604686703.244890:0:20699:0:(osc_request.c:727:osc_update_grant()) got 0 extra grant
      

      Server snippet:

      00000020:00000020:1.0:1604686702.851666:0:15401:0:(tgt_grant.c:413:tgt_grant_statfs()) lsrza-OST0003: cli 726797b8-322a-1989-0cb5-3645daf9a6ce/ffff8fe6bb09d800 free: 263316194721792 avail: 263316186333184
      00000020:00000020:1.0:1604686702.851668:0:15401:0:(tgt_grant.c:477:tgt_grant_space_left()) lsrza-OST0003: cli 726797b8-322a-1989-0cb5-3645daf9a6ce/ffff8fe6bb09d800 avail 263316186333184 left 262158115930112 unstable 3407872 tot_grant 1158069646026 pending 3407872
      00000020:00000020:1.0:1604686702.851670:0:15401:0:(tgt_grant.c:519:tgt_grant_incoming()) lsrza-OST0003: cli 726797b8-322a-1989-0cb5-3645daf9a6ce/ffff8fe6bb09d800 reports grant 997461 dropped 0, local 1882456063
      00000020:00000020:1.0:1604686702.851672:0:15401:0:(tgt_grant.c:848:tgt_grant_check()) lsrza-OST0003: cli 726797b8-322a-1989-0cb5-3645daf9a6ce/ffff8fe6bb09d800 granted: 0 ungranted: 1703936 grant: 1882456063 dirty: 0
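
      For comparison, the aggregate grant accounting on the OSS can be sampled directly (a sketch; these assume the standard obdfilter grant and space counters):

      # on the OSS serving OST0003: total granted, dirty, and pending bytes across all exports
      lctl get_param obdfilter.lsrza-OST0003.tot_granted
      lctl get_param obdfilter.lsrza-OST0003.tot_dirty
      lctl get_param obdfilter.lsrza-OST0003.tot_pending

      # free space as seen by the target
      lctl get_param obdfilter.lsrza-OST0003.kbytesavail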
      

      Each pool contains only one OST, and the storage is used for nothing else. All the OSTs have about the same amount of free space. This is the pool containing OST0003.

      [root@brass8:toss-4917-grant]# zpool list
      NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
      brass8   580T   243T   337T         -    25%    41%  1.00x  ONLINE  -
      


          Activity

            ofaaland Olaf Faaland added a comment - - edited

            Disabling grant_shrink has been a good enough workaround for us, so I am removing the topllnl label. We haven't rolled out Lustre 2.15 widely enough to know whether or not the issue is fixed in that version.


            ofaaland Olaf Faaland added a comment -

            We've seen the issue at LLNL again:

            • with 40615 "prevent overflow of o_dropped"
            • without 40564 "grant shrink shouldn't account skipped OSC"
            • without 42129 "prevent overflowing of tgd->tgd_tot_granted"
            • with grant_shrink enabled

            The same clients mounted another file system which had grant_shrink disabled and those OSCs did not encounter the issue.

            Our clients will get 40564 "grant shrink shouldn't account skipped OSC" in the next few weeks, but it typically takes weeks for the issue to become easily detectable.

             


            adilger Andreas Dilger added a comment -

            There is a new patch https://review.whamcloud.com/42129 "LU-14543 target: prevent overflowing of tgd->tgd_tot_granted" that may be of interest here. I'm not 100% sure it is related, since it involves an underflow of tot_grant and/or tot_dirty AFAICS; if that happened it would likely affect all clients, but it is worth mentioning here.


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40615/
            Subject: LU-14125 osc: prevent overflow of o_dropped
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 1da8349175a136df0aadb28ae0e0f64ac0385961


            adilger Andreas Dilger added a comment -

            Olaf, the backported patch https://review.whamcloud.com/40564 "LU-11409 osc: grant shrink shouldn't account skipped OSC" just landed to b2_12 but is not in any tag yet. It is a very simple client-only patch, so it could be added to your clients relatively easily to see if it solves the problem.
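
            In case it is useful, pulling that change from Gerrit into a local client tree is roughly (a sketch; substitute the current patch set number for PS):

            # fetch and cherry-pick change 40564 from Whamcloud Gerrit
            git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/64/40564/PS
            git cherry-pick FETCH_HEAD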

            ofaaland Olaf Faaland added a comment -

            Quoting Andreas: "Cory, thanks for the update. You may be conflating two issues here. LU-11409 is applicable to clients connecting to more than 100 OSTs. It would be useful to know from the reporters on this ticket if that applies to the systems where this problem is being seen."

            At LLNL the clients where we're seeing this problem do connect to more than 100 OSTs (across 3 file systems).


            adilger Andreas Dilger added a comment -

            Cory, thanks for the update. You may be conflating two issues here. LU-11409 is applicable to clients connecting to more than 100 OSTs. It would be useful to know from the reporters on this ticket if that applies to the systems where this problem is being seen.

            Separately, tgt_grant_sanity_check() is a server-side verification, done at disconnect time, that the grants given to the clients match the total granted counters for the target. It is disabled on systems with more than 100 connected clients because the check is O(n^2) in the number of connected clients and adds significant overhead at that point. However, that check is not being triggered in this case (AFAIK), because this isn't a problem with the per-export vs. global counters on the OST, but a disconnect between what the client is counting and what the server is counting.
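
            A quick way to check whether a given OST is over that threshold is to count its exports on the OSS (a sketch, assuming the standard num_exports counter):

            # number of connected exports per OST target
            lctl get_param obdfilter.*.num_exports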

            spitzcor Cory Spitz added a comment - - edited

            FWIW, we have a couple of reports. One customer running 2.12.6 saw performance problems; setting grant_shrink=0 on one set of nodes mitigated the problem. Another customer running ~2.12.4 saw their grant problems disappear with this patch and LU-11409. Related to LU-11409: vsaveliev has spotted that tgt_grant_sanity_check() is effectively a no-op, since the grant check gets turned off with more than 100 exports.


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39324/
            Subject: LU-14125 obdclass: add grant fields to export procfile
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 7a354e82d99d57103ed52cb7872cd64090b43383


            adilger Andreas Dilger added a comment -

            Another site reported that the o_dropped patch alone does not seem to have (fully?) resolved the problem, since they are still seeing clients with low grant. I'm still trying to find out whether the site is running with grant_shrink=1 or not.

            One theory I had is that something in the grant shrink path causes incremental loss of grant because it uses non-PAGE_SIZE grant amounts. I notice that the clients and servers have values that are not even PAGE_SIZE multiples, so maybe there is some kind of rounding problem between the client and server?
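
            As a quick illustration of the misalignment (assuming 4 KiB pages), the avail grant in the client snippet above is not page-aligned while the amount needed is:

            # 997461 bytes of grant = 243 full pages + a 2133-byte remainder,
            # while the 1703936 bytes needed = exactly 416 pages
            echo $(( 997461 / 4096 )) $(( 997461 % 4096 ))      # -> 243 2133
            echo $(( 1703936 / 4096 )) $(( 1703936 % 4096 ))    # -> 416 0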


            People

              Assignee: tappro Mikhail Pershin
              Reporter: ofaaland Olaf Faaland
              Votes: 0
              Watchers: 23
