[LU-14125] client starved for grant but OST has plenty of free space Created: 06/Nov/20  Updated: 29/Jul/22

Status: Reopened
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Major
Reporter: Olaf Faaland Assignee: Mikhail Pershin
Resolution: Unresolved Votes: 0
Labels: llnl, topllnl
Environment:

zfs-0.7
lustre-2.12.5_5.llnl-1.ch6.x86_64


Issue Links:
Duplicate
is duplicated by LU-14124 super slow i/o on client maybe relate... Resolved
Related
is related to LU-11409 only first 100 OSC can shrink grants Resolved
is related to LU-13766 tgt_grant_check() lsrza-OST000a: cli ... Resolved
is related to LU-14901 consume grants if async write fallbac... Open
is related to LU-9704 ofd_grant_check() claims GRANT, real ... Resolved
is related to LU-14543 tgt_grant_discard(): avoid tgd->tgd_... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Something is causing cur_grant_bytes for some OSCs to drop below 1MB; which OSCs and which nodes are affected appears to be random. The OSTs themselves have many TB of free space. A sequential write (e.g. dd if=/dev/urandom of=file_on_ost_3 bs=1M count=40, where the file has just one stripe) produces osc_enter_cache() debug entries reporting that it is falling back to sync I/O. We also see osc_update_grant() report that it got 0 extra grant.

We have not been able to identify a workload or other trigger that pushes cur_grant_bytes low in the first place. We also have not been able to find a workaround that results in the OST returning extra grant.

We set grant_shrink=0 on all clients (using lctl set_param -P on the MGS) and then stopped and started all the OSTs on the file system. This did not change the symptoms in any obvious way.
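
For reference, this is roughly how the client debug entries below can be collected; the file name is illustrative and the OST index matches the file's single stripe:

# on an affected client
lctl get_param osc.lsrza-OST0003*.cur_grant_bytes    # starved OSC, well under 1MB
lctl set_param debug=+cache
dd if=/dev/urandom of=file_on_ost_3 bs=1M count=40
lctl dk /tmp/lustre-debug.txt                        # look for "fall back to sync i/o"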

Client snippet with debug="+cache":

00000008:00000020:55.0:1604686702.850333:0:15766:0:(osc_cache.c:1613:osc_enter_cache()) lsrza-OST0003-osc-ffff8b2f37fc1000: grant { dirty: 0/512000 dirty_pages: 0/16450184 dropped: 0 avail: 997461, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 9984, left: 256, waiters: 0 }need:1703936

00000008:00000020:55.0:1604686702.850335:0:15766:0:(osc_cache.c:1543:osc_enter_cache_try()) lsrza-OST0003-osc-ffff8b2f37fc1000: grant { dirty: 0/512000 dirty_pages: 0/16450184 dropped: 0 avail: 997461, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 9984, left: 256, waiters: 0 }need:1703936

00000008:00000020:55.0:1604686702.850337:0:15766:0:(osc_cache.c:1658:osc_enter_cache()) lsrza-OST0003-osc-ffff8b2f37fc1000: grant { dirty: 0/512000 dirty_pages: 0/16450184 dropped: 0 avail: 997461, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 9984, left: 256, waiters: 0 }no grant space, fall back to sync i/o

00000008:00400020:55.0:1604686702.850352:0:15766:0:(osc_io.c:127:osc_io_submit()) 256 1
00000008:00000020:55.0:1604686702.850385:0:15766:0:(osc_cache.c:1743:osc_update_pending()) obj ffff8b2c6bc58640 ready 0|-|- wr 256|+|- rd 0|- update pending cmd 2 delta 256.
00000008:00000020:55.0:1604686702.850387:0:15766:0:(osc_cache.c:2297:osc_io_unplug0()) Queue writeback work for client ffff8b1efb0d25e0.
00000008:00000020:19.0:1604686702.850400:0:20698:0:(osc_request.c:3171:brw_queue_work()) Run writeback work for client obd ffff8b1efb0d25e0.
00000008:00000020:19.0:1604686702.850402:0:20698:0:(osc_cache.c:2222:osc_check_rpcs()) obj ffff8b2c6bc58640 ready 0|-|- wr 256|+|- rd 0|- 0 in flight
00000008:00000020:19.0:1604686702.850404:0:20698:0:(osc_cache.c:1697:osc_makes_rpc()) high prio request forcing RPC
00000008:00000020:19.0:1604686702.850405:0:20698:0:(osc_cache.c:1888:try_to_add_extent_for_io()) extent ffff8b2c6b9c7c30@{[9984 -> 10239/10239], [1|0|+|lockdone|wShu|ffff8b2c6bc58640], [0|256|+|-|          (null)|256|          (null)]} trying to add this extent
00000008:00000020:19.0:1604686702.850408:0:20698:0:(osc_cache.c:1743:osc_update_pending()) obj ffff8b2c6bc58640 ready 0|-|- wr 0|-|- rd 0|- update pending cmd 2 delta -256.
00000008:00000020:19.0:1604686702.850441:0:20698:0:(osc_request.c:705:osc_announce_cached()) dirty: 0 undirty: 1879048191 dropped 0 grant: 997461
00000008:00000020:19.0:1604686702.850443:0:20698:0:(osc_request.c:714:osc_update_next_shrink()) next time 6200398 to shrink grant
00000008:00000020:60.0:1604686703.244890:0:20699:0:(osc_request.c:727:osc_update_grant()) got 0 extra grant

Server snippet:

00000020:00000020:1.0:1604686702.851666:0:15401:0:(tgt_grant.c:413:tgt_grant_statfs()) lsrza-OST0003: cli 726797b8-322a-1989-0cb5-3645daf9a6ce/ffff8fe6bb09d800 free: 263316194721792 avail: 263316186333184
00000020:00000020:1.0:1604686702.851668:0:15401:0:(tgt_grant.c:477:tgt_grant_space_left()) lsrza-OST0003: cli 726797b8-322a-1989-0cb5-3645daf9a6ce/ffff8fe6bb09d800 avail 263316186333184 left 262158115930112 unstable 3407872 tot_grant 1158069646026 pending 3407872
00000020:00000020:1.0:1604686702.851670:0:15401:0:(tgt_grant.c:519:tgt_grant_incoming()) lsrza-OST0003: cli 726797b8-322a-1989-0cb5-3645daf9a6ce/ffff8fe6bb09d800 reports grant 997461 dropped 0, local 1882456063
00000020:00000020:1.0:1604686702.851672:0:15401:0:(tgt_grant.c:848:tgt_grant_check()) lsrza-OST0003: cli 726797b8-322a-1989-0cb5-3645daf9a6ce/ffff8fe6bb09d800 granted: 0 ungranted: 1703936 grant: 1882456063 dirty: 0

Each pool contains only one OST, and the storage is used for nothing else. All the OSTs have about the same amount of free space. This is the pool containing OST0003.

[root@brass8:toss-4917-grant]# zpool list
NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
brass8   580T   243T   337T         -    25%    41%  1.00x  ONLINE  -


 Comments   
Comment by Olaf Faaland [ 06/Nov/20 ]

We can get a debug patch onto either the server or the client within about a week (it would need to be pushed to Gerrit and pass tests first). For instance, there are currently no CDEBUGs in osc_should_shrink_grant().

We haven't identified a workaround other than stopping and starting servers, which is obviously not good enough. Workaround suggestions would be very useful, as this is significantly impacting our users.

Comment by Peter Jones [ 06/Nov/20 ]

Mike

Could you please advise

Thanks

Peter

Comment by Olaf Faaland [ 06/Nov/20 ]

Thanks Peter

Comment by Olaf Faaland [ 06/Nov/20 ]

For my tracking, my local ticket is TOSS-4917

Comment by Andreas Dilger [ 06/Nov/20 ]

There was a grant-related overflow fixed in patch https://review.whamcloud.com/39380 "LU-13763 osc: don't allow negative grants" that may be causing issues here. If the client thinks it has little or no grant, but the server thinks it has lots of grant, the client will keep asking, but the server will not give it more grant until it uses its current amount.

The patch was landed to b2_12 but not until after 2.12.5 was released.

Mike, for debugging purposes, it would be useful to add per-export grant parameters on the server (obdfilter.*.exports.*.grant*, to match the clients). That would allow comparing the values on the client and server and seeing what kind of difference there is.
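
For example, with per-export grant fields in place, the comparison would look something like this (OST name and client NID taken from this report); the two values should match when the client has no dirty pages:

# on the OSS, for one client export
grep granted: /proc/fs/lustre/obdfilter/lsrza-OST0003/exports/192.168.128.133@o2ib32/export

# on that client
lctl get_param osc.lsrza-OST0003*.cur_grant_bytes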

Comment by Olaf Faaland [ 06/Nov/20 ]

Thanks, Andreas.

Mike, for debugging purposes, it would be useful to add per-export grant parameters on the server (obdfilter.*.exports.*.grant*, to match the clients). That would allow comparing the values on the client and server and seeing what kind of difference there is.

Like this?
https://review.whamcloud.com/#/c/39324/

Comment by Olaf Faaland [ 06/Nov/20 ]

There was a grant-related overflow fixed in patch https://review.whamcloud.com/39380 "LU-13763 osc: don't allow negative grants" that may be causing issues here. If the client thinks it has little or no grant, but the server thinks it has lots of grant, the client will keep asking, but the server will not give it more grant until it uses its current amount.

Ah. Next week, the systems are scheduled to get an update with that patch.

Specifically, this:
https://github.com/LLNL/lustre/compare/2.12.5_5.llnl...2.12.5_10.llnl

Comment by Andreas Dilger [ 06/Nov/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40563
Subject: LU-14125 obdclass: add grant fields to export procfile
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: efeaa656a5533fb6cff20c06447c7bf686db2bce

Comment by James A Simmons [ 06/Nov/20 ]

ORNL is also running into this problem and we have the grant overflow fix patch.

Comment by Olaf Faaland [ 06/Nov/20 ]

ORNL is also running into this problem and we have the grant overflow fix patch.

Thanks James, that's good to know.

Comment by Olaf Faaland [ 06/Nov/20 ]

I'd forgotten, the lustre version running on these two machines already has the per-export grant parameters patch.

From the server:

[root@brass8:192.168.128.133@o2ib32]# cat export 
726797b8-322a-1989-0cb5-3645daf9a6ce:
    name: lsrza-OST0003
    client: 192.168.128.133@o2ib32
    connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, grant_shrink, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, grant_param, lvb_type, short_io, lfsck, bulk_mbits, second_flags, lockaheadv2 ]
    connect_data:
       flags: 0xa0425af2e3440478
       instance: 42
       target_version: 2.12.5.0
       initial_grant: 2097152
       max_brw_size: 1048576
       grant_block_size: 1048576
       grant_inode_size: 4096
       grant_max_extent_size: 1073741824
       grant_extent_tax: 655360
       cksum_types: 0xf7
       max_object_bytes: 9223372036854775807
    export_flags: [  ]
    grant:
       granted: 1882456063
       dirty: 0
       pending: 0

From the client:

[root@rztopaz133:toss-4917-grant]# lctl list_nids
192.168.128.133@o2ib32
[root@rztopaz133:toss-4917-grant]# lctl get_param osc.lsrza-OST0003*.cur_grant_bytes
osc.lsrza-OST0003-osc-ffff8b2f37fc1000.cur_grant_bytes=997461
Comment by Andreas Dilger [ 07/Nov/20 ]

So 1882456063 = 0x7033ffff, which means that it is very close to overflowing, and likely had overflowed at some point in the past?

Would it be possible to check all of the exports on that OSS (and/or other OSSes) and see whether a high granted: value on the OSS together with a low cur_grant_bytes on the client correlates with the clients that are having slow writes? The two values should in theory be identical between client and OSS when the client does not have any dirty cached pages. The correlation may not be 100%, because some clients might not have actually overflowed yet.

Comment by Olaf Faaland [ 09/Nov/20 ]

Hi Andreas, the high granted: line on the OST does not correlate well with low cur_grant_bytes on the clients.

One cluster, rztopaz, has 6 clients with cur_grant_bytes < 1M for OST0003. The OST reports that 238 exports to nodes in that cluster have granted: value of 1882456063.

It's interesting there are so many nodes with granted = 1882456063. 
I spot-checked 5 clients with that value for granted, and cur_grant_bytes on those 5 clients are all over the place.

/proc/fs/lustre/obdfilter/lsrza-OST0003/exports/192.168.128.102@o2ib32/export:       granted: 1882456063
/proc/fs/lustre/obdfilter/lsrza-OST0003/exports/192.168.128.104@o2ib32/export:       granted: 1882456063
/proc/fs/lustre/obdfilter/lsrza-OST0003/exports/192.168.128.108@o2ib32/export:       granted: 1882456063
/proc/fs/lustre/obdfilter/lsrza-OST0003/exports/192.168.128.109@o2ib32/export:       granted: 1882456063
/proc/fs/lustre/obdfilter/lsrza-OST0003/exports/192.168.128.110@o2ib32/export:       granted: 1882456063

vs

e102: osc.lsrza-OST0003-osc-ffff9bbe688de000.cur_grant_bytes=82790394
e104: osc.lsrza-OST0003-osc-ffff9b69c9dea000.cur_grant_bytes=1412694016
e108: osc.lsrza-OST0003-osc-ffffa1006893e000.cur_grant_bytes=271487350
e109: osc.lsrza-OST0003-osc-ffff8ac65aa26800.cur_grant_bytes=1072398336
e110: osc.lsrza-OST0003-osc-ffff8f737194f800.cur_grant_bytes=1875640319
Comment by Andreas Dilger [ 09/Nov/20 ]

Olaf, what values do you have on the client(s) for osc.*.max_pages_per_rpc, osc.*.max_rpcs_in_flight, and osc.*.max_dirty_mb? I'm trying to figure out if there is some kind of overflow in the calculation of the max grant. There are several patches in this area that could be related:

but this is just speculation so far.

The other question is whether there are any error messages in the client logs like "dirty NNNN > dirty_max MMMM", or related to grant?

Some things to try if this is relatively easily reproduced on a client after a fresh mount:

  • run "lctl set_param osc.*.max_dirty_mb=1024" to limit the amount of dirty data per OSC to see if this prevents the overflow. This should still be below (max_rpcs_in_flight * max_pages_per_rpc), typically (8 * 16MB = 128MB) so would not limit performance.
Comment by Olaf Faaland [ 09/Nov/20 ]

Andreas, we have the following (only OST0003 is shown, but the values are the same for all OSTs and all clients):

osc.lsrza-OST0003-osc-ffff8b2f37fc1000.max_pages_per_rpc=256
osc.lsrza-OST0003-osc-ffff8b2f37fc1000.max_rpcs_in_flight=8
osc.lsrza-OST0003-osc-ffff8b2f37fc1000.max_dirty_mb=2000 

No console log warnings or errors on the clients at all, except for reconnections when we bounced OSTs or MDTs.

The current patch stack does not include the negative grants / direct IO patches. It is:

* 27e1e03 (tag: 2.12.5_5.llnl) log lfs setstripe paths to syslog
* f7a3e6e (tag: 2.12.5_4.llnl) LU-13599 mdt: fix mti_big_lmm buffer usage
* 8179deb LU-13599 mdt: fix logic of skipping local locks in reply_state
* 52a48df (tag: 2.12.5_3.llnl) LU-13709 utils: 'lfs mkdir -i -1' doesn't work
* 81479ad LU-13657 kernel: kernel update RHEL8.2 [4.18.0-193.6.3.el8_2]
* 2744b2f LU-13421 kernel: kernel update RHEL8.1 [4.18.0-147.8.1.el8_1]
* 0aaf383 LU-10395 osd: stop OI at device shutdown
* fb8ae25 LU-13766 obdclass: add grant fields to export procfile
* e6ee866 (tag: 2.12.5_2.chaos) LU-13667 ptlrpc: fix endless loop issue
* 049ed85 LU-11623 llite: hash just created files if lock allows
* e1e865f (tag: 2.12.5_1.llnl, tag: 2.12.5_1.chaos) Don't install lustre init script on systemd systems
* 6eebb51 LLNL build customizations
* 407ac6c TOSS-4431 build: build ldiskfs only for x86_64
* 78d712a (tag: v2_12_5, tag: 2.12.5) New release 2.12.5
* 
Comment by Olaf Faaland [ 09/Nov/20 ]

We have several clusters where we see this occur in fairly short order (4-8 hours) after a reboot. But so far we do not have a procedure for reproducing the issue.

Comment by Cameron Harr [ 09/Nov/20 ]

I'm dropping max_dirty_mb to 1024 on Pascal; however, the CZ is relatively clean now after restarting all the OSTs last week (and remounting Lustre on a handful of uncooperative compute nodes), so I don't think we'd be hitting the issue anytime soon anyway.

I will also set it on rztopaz which hasn't had the cleanup.

Comment by Gerrit Updater [ 11/Nov/20 ]

Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/40615
Subject: LU-14125 osc: prevent overflow of o_dropped
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 968f580554f5211ea78bb825d8666cf70a17c288

Comment by Nathan Dauchy (Inactive) [ 12/Nov/20 ]

As another reference point, we seem to be hitting this issue at NOAA as well, with 2.12.3_ddn44 on the clients and lustre-2.12.3_ddn31 on the servers.  Unmounting and remounting a client seems to improve performance, at least for a little while.

Comment by Nathan Dauchy (Inactive) [ 12/Nov/20 ]

A colleague found a way to force reconnection between client and server while working on a different issue, and I tried it in this case and was able to get cur_grant_bytes to renegotiate without the heavy-handed unmount. Is this a dangerous thing to do as a workaround?

[root@Jet:k11 ~]# lctl set_param osc.*.grant_shrink=0

[root@Jet:k11 ~]# lctl get_param osc.*.cur_grant_bytes | sort -nk2 -t = | head -n 5
osc.lfs1-OST0002-osc-ffff8e7aed73c800.cur_grant_bytes=66638
osc.lfs1-OST0018-osc-ffff8e7aed73c800.cur_grant_bytes=651802
osc.lfs1-OST0029-osc-ffff8e7aed73c800.cur_grant_bytes=3013787
osc.lfs1-OST001a-osc-ffff8e7aed73c800.cur_grant_bytes=3146809
osc.lfs1-OST003b-osc-ffff8e7aed73c800.cur_grant_bytes=3363014

[root@Jet:k11 ~]# lctl device_list | egrep "lfs1-OST00(02|18)"
  4 UP osc lfs1-OST0018-osc-ffff8e7aed73c800 d39fcae6-1604-7b71-2bba-0ddeb32aa971 4
 25 UP osc lfs1-OST0002-osc-ffff8e7aed73c800 d39fcae6-1604-7b71-2bba-0ddeb32aa971 4

[root@Jet:k11 ~]# lctl --device 4 deactivate; sleep 1; lctl --device 4 activate
[root@Jet:k11 ~]# lctl --device 25 deactivate; sleep 1; lctl --device 25 activate

[root@Jet:k11 ~]# lctl get_param osc.*.cur_grant_bytes | egrep "lfs1-OST00(02|18)"
osc.lfs1-OST0002-osc-ffff8e7aed73c800.cur_grant_bytes=541450240
osc.lfs1-OST0018-osc-ffff8e7aed73c800.cur_grant_bytes=542035968

Comment by Andreas Dilger [ 12/Nov/20 ]

Nathan,
You can also likely use "--device Xxxx recover" to do the same thing in one step.

However, it should be noted that disconnecting and reconnecting the device like this could cause in-flight RPCs to that OST to be aborted, so this should only be done on quiescent clients after "lctl set_param ldlm.namespaces.*.lru_size=clear" to flush all dirty data from the client.

So it is mostly ok, but not safe to do randomly during the day.
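
Putting that together, a safer version of the workaround on a quiescent client might look like this (device name and index taken from Nathan's example above):

lctl set_param ldlm.namespaces.*.lru_size=clear      # cancel locks / flush dirty data first
lctl dl | grep lfs1-OST0002-osc                      # find the device index (25 above)
lctl --device 25 recover                             # disconnect and reconnect in one step
lctl get_param osc.lfs1-OST0002*.cur_grant_bytes     # grant should now be renegotiated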

Comment by SC Admin (Inactive) [ 13/Nov/20 ]

Hiya,

Using the small dd test from our LU-14124, I've been tracking how the issue appears on our compute nodes over time and setting slow nodes to drain. It's interesting that after they are drained of jobs, perhaps 1/2 to 2/3 of our compute nodes recover and no longer seem to have any grant or I/O issues, but some nodes still have one or more slow OSTs even when there are no user processes left on them.

Before the slow nodes drained of jobs, I used lsof and lfs getstripe to find all files on the slow OSTs that were open from those nodes. In all of these cases there were executables ('txt' in lsof) or .so's ('mem') using the slow OSTs. Sometimes there was a regular file as well, but in most cases the slow OST was only referenced by one executable (e.g. 32 references to a 32-way OpenMP binary) or a few .so's (with e.g. 22 refs each).

(A caveat: this could be a skewed observation for many reasons - unknown node history prior to going slow, two users' jobs are on a disproportionate number of these nodes, there can be a lot of .so's open compared to regular files, and so on - but it seemed like a pattern.)

So could this issue be something to do with mmap rather than regular I/O? On the nodes that recovered, it looked like these 'txt' and 'mem' files were holding the grant low until the processes ended. No idea why some nodes recovered and others did not... perhaps there is more than one thing going on.

Probably way off-track, but I'm reminded of LU-13588, which judging from Oleg's comment had something to do with grant and mmap, and perhaps not entirely with quota.

Also, FYI, max_dirty_mb=1000 doesn't seem to help: we set that on two login nodes after a reboot, and one was broken again within a day or so.

cheers,
robin

Comment by Andreas Dilger [ 13/Nov/20 ]

I was looking at where the cl_lost_grant might be coming from. According to a comment in the code:

/**
 * Free grant after IO is finished or canceled.
 *
 * @lost_grant is used to remember how many grants we have allocated but not
 * used, we should return these grants to OST. There're two cases where grants
 * can be lost:
 * 1. truncate;
 * 2. blocksize at OST is less than PAGE_SIZE and a partial page was
 *    written. In this case OST may use less chunks to serve this partial
 *    write. OSTs don't actually know the page size on the client side. so
 *    clients have to calculate lost grant by the blocksize on the OST.
 *    See filter_grant_check() for details.
 */
static void osc_free_grant(struct client_obd *cli, unsigned int nr_pages,
                           unsigned int lost_grant, unsigned int dirty_grant)

so it is possible that the cl_lost_grant overflow problem from Olaf's patch https://review.whamcloud.com/40615 "LU-14125 osc: prevent overflow of o_dropped" happens more frequently for ZFS OSTs. I can't imagine that this would be the case for ldiskfs OSTs.

There definitely still seems to be some kind of grant accounting error. Even on my idle single-client test system, the client "osc.*.cur_grant_bytes" (about 8MB) does not match what the OST "obdfilter.*.tot_granted" thinks it has granted (about 256KB).
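
A rough sketch of that comparison on the OST side (target name taken from this report's examples); the per-export sum should, in principle, match tot_granted, and on a single-client test system both can be compared directly against the client's cur_grant_bytes:

# on the OSS
lctl get_param obdfilter.lsrza-OST0003.tot_granted
awk '/granted:/ { sum += $2 } END { print sum }' \
    /proc/fs/lustre/obdfilter/lsrza-OST0003/exports/*/export

# on the client
lctl get_param osc.lsrza-OST0003*.cur_grant_bytes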

Comment by Gerrit Updater [ 16/Nov/20 ]

Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/40659
Subject: LU-14125 osc: prevent overflow of o_dropped
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d80c9e075c0422994865435b35929f9af30476a7

Comment by Olaf Faaland [ 17/Nov/20 ]

Our Lustre 2.12 machines all got updates over the last week to a new Lustre version that carries the two grant patches. Everything started out without the symptom because of the client and server reboots. We are still waiting to see whether the issue reappears.

Comment by Gerrit Updater [ 26/Nov/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40563/
Subject: LU-14125 obdclass: add grant fields to export procfile
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 53ee416097a9a77ca0ee352714af02e77489e3f8

Comment by Gerrit Updater [ 26/Nov/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40659/
Subject: LU-14125 osc: prevent overflow of o_dropped
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 82e9a11056a55289c880786da71d8b1125f357b2

Comment by Peter Jones [ 26/Nov/20 ]

The patches have landed for 2.14. Let's reopen if the problem still reappears with them in place.

Comment by Andreas Dilger [ 29/Nov/20 ]

Olaf, any news on the client grant front?

Comment by Nathan Dauchy (Inactive) [ 29/Nov/20 ]

Are the patches to prevent overflow of o_dropped and add debugging information available for 2.12.x?  We have a maintenance window coming up, during which we are planning to upgrade clients to 2.12.5+ anyway.

Comment by Andreas Dilger [ 30/Nov/20 ]

There is patch https://review.whamcloud.com/40615

Comment by Olaf Faaland [ 30/Nov/20 ]

Andreas,
No news. We haven't seen the issue now in about two weeks, but we believe that may be due to the system updates which resulted in everything getting a clean boot.

Comment by Olaf Faaland [ 25/Jan/21 ]

Since my last update in November our production machines for the most part have been running Lustre 2.12.5_10.llnl, which has these grant-related patches landed post 2.12.5:

  • 663e688 LU-12687 osc: consume grants for direct I/O
  • ef65452 LU-13763 osc: don't allow negative grants
  • fb8ae25 LU-13766 obdclass: add grant fields to export procfile

With that patch set, disabling grant_shrink and then unmounting and remounting the OSTs not only gets cur_grant_bytes and tot_granted back in sync but also prevents grant starvation.
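
For the record, that workaround is roughly the following (the persistent parameter is set from the MGS, as in the description; the mountpoint and zpool/dataset names are illustrative for our ZFS OSTs):

# on the MGS: persistently disable grant shrinking for all clients
lctl set_param -P osc.*.grant_shrink=0

# on each OSS: stop and restart the OST so grant state is renegotiated at reconnect
umount /lustre/lsrza/ost0003                          # illustrative mountpoint
mount -t lustre brass8/ost0003 /lustre/lsrza/ost0003  # illustrative zpool/dataset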

Our next update coming up in a few weeks will be running 2.12.6_3.llnl which also has:

  • 8d78d2e LU-14125 osc: prevent overflow of o_dropped

After that update we will re-enable grant_shrink and see if the problem recurs.

Comment by Cory Spitz [ 04/Feb/21 ]

> fb8ae25 LU-13766 obdclass: add grant fields to export procfile
ofaaland, can you say more about that one? I don't see it posted to gerrit yet.

Comment by Olaf Faaland [ 04/Feb/21 ]

Hi Cory,

https://review.whamcloud.com/#/c/39324/5

that patch has been on our 2.12 stack at LLNL for several weeks and hasn't caused us any trouble. I'm not sure why it hasn't landed to b2_12. I added gerrit gatekeeper as a reviewer and posted a query to that effect.

Comment by Andreas Dilger [ 05/Feb/21 ]

The landings to b2_12 have been paused while we focus efforts on getting 2.14.0 out the door. There are a number of patches queued up for 2.12.7 once Oleg has cycles to run them through his test rig.

Comment by Cory Spitz [ 11/Feb/21 ]

ofaaland, ah, thanks for pointing it out. That patch is from this ticket and not directly from LU-13766. I just didn't see a patch in gerrit for LU-13766 (https://review.whamcloud.com/#/q/LU-13766) and thought I was missing something. Thanks for letting us know!

Comment by Olaf Faaland [ 22/Feb/21 ]

My update from Jan 25 said:

Our next update coming up in a few weeks will be running 2.12.6_3.llnl which also has:

  • 8d78d2e LU-14125 osc: prevent overflow of o_dropped

After that update we will re-enable grant_shrink and see if the problem recurs.

Our updates are delayed so it will be another week or two before we can re-enable grant_shrink and then another couple of weeks before we see if the symptoms reappear.

Comment by Andreas Dilger [ 25/Feb/21 ]

I'm reopening this, because I don't think it is clear the problem has been fixed. I believe it was just closed because the two LU-14125 patches were landed for 2.14.0.

Comment by Andreas Dilger [ 25/Feb/21 ]

Another site reported that the o_dropped patch alone does not seem to have (fully?) resolved the problem, since they are still seeing clients with low grant. I'm still trying to find out whether the site is running with grant_shrink=1 or not.

One theory I had is that something in grant shrink is causing an incremental loss of grant because it uses non-PAGE_SIZE grant amounts. I notice that the clients and servers have values that are not even PAGE_SIZE multiples, so maybe there is some kind of rounding problem between the client and server?
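
A quick client-side check for that theory (assuming 4KB pages; the 997461 value from the description would be flagged):

lctl get_param osc.*.cur_grant_bytes |
awk -F= '$2 % 4096 { print $1, $2, "not a PAGE_SIZE multiple" }'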

Comment by Gerrit Updater [ 04/Mar/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39324/
Subject: LU-14125 obdclass: add grant fields to export procfile
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 7a354e82d99d57103ed52cb7872cd64090b43383

Comment by Cory Spitz [ 05/Mar/21 ]

FWIW, we have a couple of reports. One customer running 2.12.6 saw trouble with performance. One set of nodes was set to grant_shrink=0, which mitigated the problem. Another customer running ~2.12.4 saw their grant problems disappear with this patch and LU-11409. Related to LU-11409: vsaveliev has spotted that tgt_grant_sanity_check() is a no-op and the grant check gets turned off with more than 100 exports.

Comment by Andreas Dilger [ 07/Mar/21 ]

Cory, thanks for the update. You may be conflating two issues here. LU-11409 is applicable to clients connecting to more than 100 OSTs. It would be useful to know from the reporters on this ticket if that applies to the systems where this problem is being seen.

Separately, tgt_grant_sanity_check() is a server-side verification of the grants given to the clients vs. the total granted counters for the target at disconnect time, which is disabled on systems with more than 100 connected clients because it adds significant overhead at that point (O(n^2) with the number of connected clients). However, that check is not being triggered in this case (AFAIK) because it isn't a problem with the per-export vs. global counters on the OST, but a disconnect between what the client is counting and what the server is counting.
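
For reference, one quick way to check whether the LU-11409 condition applies on a given client (more than 100 OSC devices) is:

lctl dl | grep -c ' osc '    # LU-11409 is relevant when this count exceeds 100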

Comment by Olaf Faaland [ 09/Mar/21 ]

Cory, thanks for the update. You may be conflating two issues here. LU-11409 is applicable to clients connecting to more than 100 OSTs. It would be useful to know from the reporters on this ticket if that applies to the systems where this problem is being seen.

At LLNL the clients where we're seeing this problem do connect to more than 100 OSTs (across 3 file systems).

Comment by Andreas Dilger [ 09/Mar/21 ]

Olaf, the backported patch https://review.whamcloud.com/40564 "LU-11409 osc: grant shrink shouldn't account skipped OSC" just landed to b2_12 but is not in any tag yet. This is a very simple client-only patch, so could be added to your clients relatively easily to see if it solves the problem.

Comment by Gerrit Updater [ 17/Mar/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40615/
Subject: LU-14125 osc: prevent overflow of o_dropped
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 1da8349175a136df0aadb28ae0e0f64ac0385961

Comment by Andreas Dilger [ 30/Mar/21 ]

There is a new patch https://review.whamcloud.com/42129 "LU-14543 target: prevent overflowing of tgd->tgd_tot_granted" that may be of interest here. I'm not 100% sure it is related, since it involves an underflow of tot_grant and/or tot_dirty AFAICS, so if that happened it would likely affect all clients, but worthwhile to mention it here.

Comment by Olaf Faaland [ 27/Apr/21 ]

We've seen the issue at LLNL again:

  • with 40615 "prevent overflow of o_dropped"
  • without 40564 "grant shrink shouldn't account skipped OSC"
  • without 42129 "prevent overflowing of tgd->tgd_tot_granted"
  • with grant_shrink enabled

The same clients mounted another file system which had grant_shrink disabled and those OSCs did not encounter the issue.

Our clients will get 40564 "grant shrink shouldn't account skipped OSC" in the next few weeks, but it typically takes weeks for the issue to become easily detectable.

 
