[LU-14125] client starved for grant but OST has plenty of free space Created: 06/Nov/20 Updated: 29/Jul/22 |
|
| Status: | Reopened |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.14.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Olaf Faaland | Assignee: | Mikhail Pershin |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | llnl, topllnl |
| Environment: | zfs-0.7 |
| Issue Links: |
|
| Severity: | 3 |
| Description |
|
Something is causing cur_grant_bytes for some OSCs to go below 1MB. Which OSCs and which nodes are affected appears to be random. The OSTs themselves have many TB of free space.

Sequential writes (e.g. dd if=/dev/urandom of=file_on_ost_3 bs=1M count=40, where the file has just one stripe) produce osc_enter_cache() debug entries reporting that the client is falling back to sync I/O. We also see osc_update_grant() report that it got 0 extra grant.

We have not been able to identify a workload or other trigger that pushes cur_grant_bytes low in the first place, nor have we found a workaround that results in the OST returning extra grant. We set grant_shrink=0 on all clients (using set_param -P on the MGS) and then stopped and started all the OSTs on the file system. This did not change the symptoms in any obvious way.

Client snippet with debug="+cache":
00000008:00000020:55.0:1604686702.850333:0:15766:0:(osc_cache.c:1613:osc_enter_cache()) lsrza-OST0003-osc-ffff8b2f37fc1000: grant { dirty: 0/512000 dirty_pages: 0/16450184 dropped: 0 avail: 997461, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 9984, left: 256, waiters: 0 }need:1703936
00000008:00000020:55.0:1604686702.850335:0:15766:0:(osc_cache.c:1543:osc_enter_cache_try()) lsrza-OST0003-osc-ffff8b2f37fc1000: grant { dirty: 0/512000 dirty_pages: 0/16450184 dropped: 0 avail: 997461, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 9984, left: 256, waiters: 0 }need:1703936
00000008:00000020:55.0:1604686702.850337:0:15766:0:(osc_cache.c:1658:osc_enter_cache()) lsrza-OST0003-osc-ffff8b2f37fc1000: grant { dirty: 0/512000 dirty_pages: 0/16450184 dropped: 0 avail: 997461, dirty_grant: 0, reserved: 0, flight: 0 } lru {in list: 9984, left: 256, waiters: 0 }no grant space, fall back to sync i/o
00000008:00400020:55.0:1604686702.850352:0:15766:0:(osc_io.c:127:osc_io_submit()) 256 1
00000008:00000020:55.0:1604686702.850385:0:15766:0:(osc_cache.c:1743:osc_update_pending()) obj ffff8b2c6bc58640 ready 0|-|- wr 256|+|- rd 0|- update pending cmd 2 delta 256.
00000008:00000020:55.0:1604686702.850387:0:15766:0:(osc_cache.c:2297:osc_io_unplug0()) Queue writeback work for client ffff8b1efb0d25e0.
00000008:00000020:19.0:1604686702.850400:0:20698:0:(osc_request.c:3171:brw_queue_work()) Run writeback work for client obd ffff8b1efb0d25e0.
00000008:00000020:19.0:1604686702.850402:0:20698:0:(osc_cache.c:2222:osc_check_rpcs()) obj ffff8b2c6bc58640 ready 0|-|- wr 256|+|- rd 0|- 0 in flight
00000008:00000020:19.0:1604686702.850404:0:20698:0:(osc_cache.c:1697:osc_makes_rpc()) high prio request forcing RPC
00000008:00000020:19.0:1604686702.850405:0:20698:0:(osc_cache.c:1888:try_to_add_extent_for_io()) extent ffff8b2c6b9c7c30@{[9984 -> 10239/10239], [1|0|+|lockdone|wShu|ffff8b2c6bc58640], [0|256|+|-| (null)|256| (null)]} trying to add this extent
00000008:00000020:19.0:1604686702.850408:0:20698:0:(osc_cache.c:1743:osc_update_pending()) obj ffff8b2c6bc58640 ready 0|-|- wr 0|-|- rd 0|- update pending cmd 2 delta -256.
00000008:00000020:19.0:1604686702.850441:0:20698:0:(osc_request.c:705:osc_announce_cached()) dirty: 0 undirty: 1879048191 dropped 0 grant: 997461
00000008:00000020:19.0:1604686702.850443:0:20698:0:(osc_request.c:714:osc_update_next_shrink()) next time 6200398 to shrink grant
00000008:00000020:60.0:1604686703.244890:0:20699:0:(osc_request.c:727:osc_update_grant()) got 0 extra grant
Server snippet:
00000020:00000020:1.0:1604686702.851666:0:15401:0:(tgt_grant.c:413:tgt_grant_statfs()) lsrza-OST0003: cli 726797b8-322a-1989-0cb5-3645daf9a6ce/ffff8fe6bb09d800 free: 263316194721792 avail: 263316186333184
00000020:00000020:1.0:1604686702.851668:0:15401:0:(tgt_grant.c:477:tgt_grant_space_left()) lsrza-OST0003: cli 726797b8-322a-1989-0cb5-3645daf9a6ce/ffff8fe6bb09d800 avail 263316186333184 left 262158115930112 unstable 3407872 tot_grant 1158069646026 pending 3407872
00000020:00000020:1.0:1604686702.851670:0:15401:0:(tgt_grant.c:519:tgt_grant_incoming()) lsrza-OST0003: cli 726797b8-322a-1989-0cb5-3645daf9a6ce/ffff8fe6bb09d800 reports grant 997461 dropped 0, local 1882456063
00000020:00000020:1.0:1604686702.851672:0:15401:0:(tgt_grant.c:848:tgt_grant_check()) lsrza-OST0003: cli 726797b8-322a-1989-0cb5-3645daf9a6ce/ffff8fe6bb09d800 granted: 0 ungranted: 1703936 grant: 1882456063 dirty: 0

Each pool contains only one OST, and the storage is used for nothing else. All the OSTs have about the same amount of free space. This is the pool containing OST0003:
[root@brass8:toss-4917-grant]# zpool list
NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
brass8   580T   243T   337T         -    25%    41%  1.00x  ONLINE  - |
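For anyone trying to confirm the same symptom on another client, a minimal reproduction along the lines of the dd test above could look like this (the file path and OST index are examples, not from the original report):

# OSC grant for the suspect OST (index 3 here) before the test
lctl get_param osc.*OST0003*.cur_grant_bytes

# Capture +cache debug while doing a single-striped sequential write to that OST
lctl set_param debug=+cache
lctl clear
lfs setstripe -c 1 -i 3 /mnt/lsrza/grant-test          # example path on the affected fs
dd if=/dev/urandom of=/mnt/lsrza/grant-test bs=1M count=40
lctl dk /tmp/grant-debug.log
grep "fall back to sync i/o" /tmp/grant-debug.log      # present when the OSC is grant-starved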
| Comments |
| Comment by Olaf Faaland [ 06/Nov/20 ] |
|
We can get a debug patch on either the server or the client within about 1 week (it would need to be pushed to gerrit and pass tests first). For instance, there are no CDEBUGs in osc_should_shrink_grant(). We haven't identified a workaround other than stopping and starting servers, which is obviously not good enough. Workaround suggestions would be very useful, as this is significantly impacting our users. |
| Comment by Peter Jones [ 06/Nov/20 ] |
|
Mike, could you please advise?

Thanks

Peter |
| Comment by Olaf Faaland [ 06/Nov/20 ] |
|
Thanks Peter |
| Comment by Olaf Faaland [ 06/Nov/20 ] |
|
For my tracking, my local ticket is TOSS-4917 |
| Comment by Andreas Dilger [ 06/Nov/20 ] |
|
There was a grant-related overflow fixed in patch https://review.whamcloud.com/39380. The patch was landed to b2_12, but not until after 2.12.5 was released.

Mike, for debugging purposes it would be useful to add per-export grant parameters on the server ("obdfilter.*.exports.*.grant*", to match the clients). That would allow comparing the values on the client and server to see what kind of difference there is. |
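Once those parameters exist, the comparison could look roughly like this, using the OST name from this ticket and the per-export path layout shown in the comments below:

# On the OSS: per-export grant currently held for each client of this OST
grep -H "granted:" /proc/fs/lustre/obdfilter/lsrza-OST0003/exports/*/export

# On the client under test: the OSC's view of the same grant
lctl get_param osc.lsrza-OST0003*.cur_grant_bytes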
| Comment by Olaf Faaland [ 06/Nov/20 ] |
|
Thanks, Andreas.
Like this? |
| Comment by Olaf Faaland [ 06/Nov/20 ] |
Ah. Next week, the systems are scheduled to get an update with that patch. Specifically, this: |
| Comment by Andreas Dilger [ 06/Nov/20 ] |
|
Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40563 |
| Comment by James A Simmons [ 06/Nov/20 ] |
|
ORNL is also running into this problem and we have the grant overflow fix patch. |
| Comment by Olaf Faaland [ 06/Nov/20 ] |
Thanks James, that's good to know. |
| Comment by Olaf Faaland [ 06/Nov/20 ] |
|
I'd forgotten, the lustre version running on these two machines already has the per-export grant parameters patch.

From the server:
[root@brass8:192.168.128.133@o2ib32]# cat export
726797b8-322a-1989-0cb5-3645daf9a6ce:
name: lsrza-OST0003
client: 192.168.128.133@o2ib32
connect_flags: [ write_grant, server_lock, version, request_portal, truncate_lock, max_byte_per_rpc, early_lock_cancel, adaptive_timeouts, lru_resize, alt_checksum_algorithm, fid_is_enabled, version_recovery, grant_shrink, full20, layout_lock, 64bithash, object_max_bytes, jobstats, einprogress, grant_param, lvb_type, short_io, lfsck, bulk_mbits, second_flags, lockaheadv2 ]
connect_data:
flags: 0xa0425af2e3440478
instance: 42
target_version: 2.12.5.0
initial_grant: 2097152
max_brw_size: 1048576
grant_block_size: 1048576
grant_inode_size: 4096
grant_max_extent_size: 1073741824
grant_extent_tax: 655360
cksum_types: 0xf7
max_object_bytes: 9223372036854775807
export_flags: [ ]
grant:
granted: 1882456063
dirty: 0
pending: 0
From the client:
[root@rztopaz133:toss-4917-grant]# lctl list_nids
192.168.128.133@o2ib32
[root@rztopaz133:toss-4917-grant]# lctl get_param osc.lsrza-OST0003*.cur_grant_bytes
osc.lsrza-OST0003-osc-ffff8b2f37fc1000.cur_grant_bytes=997461 |
| Comment by Andreas Dilger [ 07/Nov/20 ] |
|
So 1882456063 = 0x7033ffff, which means it is very close to overflowing, and likely had overflowed at some point in the past? Would it be possible to check all of the exports on that OSS (and/or other OSSes) and see whether a high granted: value on the OSS and a low cur_grant_bytes on the client correlate with the clients that are having slow writes? In theory the two values should be identical when the client does not have any dirty cached pages. It may not be a 100% correlation, because some clients might not have actually overflowed yet. |
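A rough way to gather that correlation, using the /proc layout shown in this ticket (the OST name is the one from this report):

# OSS side: every export's granted value for the OST, largest first
grep -H "granted:" /proc/fs/lustre/obdfilter/lsrza-OST0003/exports/*/export | sort -t: -k3 -rn | head -n 20

# Then, on each client that shows up near the top, compare with the OSC's view
lctl get_param osc.lsrza-OST0003*.cur_grant_bytes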
| Comment by Olaf Faaland [ 09/Nov/20 ] |
|
Hi Andreas, the high granted: line on the OST does not correlate well with low cur_grant_bytes on the clients. One cluster, rztopaz, has 6 clients with cur_grant_bytes < 1M for OST0003. The OST reports that 238 exports to nodes in that cluster have a granted: value of 1882456063. It's interesting that there are so many nodes with granted = 1882456063.

/proc/fs/lustre/obdfilter/lsrza-OST0003/exports/192.168.128.102@o2ib32/export: granted: 1882456063
/proc/fs/lustre/obdfilter/lsrza-OST0003/exports/192.168.128.104@o2ib32/export: granted: 1882456063
/proc/fs/lustre/obdfilter/lsrza-OST0003/exports/192.168.128.108@o2ib32/export: granted: 1882456063
/proc/fs/lustre/obdfilter/lsrza-OST0003/exports/192.168.128.109@o2ib32/export: granted: 1882456063
/proc/fs/lustre/obdfilter/lsrza-OST0003/exports/192.168.128.110@o2ib32/export: granted: 1882456063

vs.

e102: osc.lsrza-OST0003-osc-ffff9bbe688de000.cur_grant_bytes=82790394
e104: osc.lsrza-OST0003-osc-ffff9b69c9dea000.cur_grant_bytes=1412694016
e108: osc.lsrza-OST0003-osc-ffffa1006893e000.cur_grant_bytes=271487350
e109: osc.lsrza-OST0003-osc-ffff8ac65aa26800.cur_grant_bytes=1072398336
e110: osc.lsrza-OST0003-osc-ffff8f737194f800.cur_grant_bytes=1875640319 |
| Comment by Andreas Dilger [ 09/Nov/20 ] |
|
Olaf, what values do you have on the client(s) for osc.*.max_pages_per_rpc, osc.*.max_rpcs_in_flight, and osc.*.max_dirty_mb? I'm trying to figure out if there is some kind of overflow in the calculation of the max grant. There are several patches in this area that could be related:
but this is just speculation so far. The other question is whether there are any error messages in the client logs like "dirty NNNN > dirty_max MMMM", or related to grant? Some things to try if this is relatively easily reproduced on a client after a fresh mount:
|
| Comment by Olaf Faaland [ 09/Nov/20 ] |
|
Andreas, we have (I just show OST0003 but the same for all OSTs and all clients):
osc.lsrza-OST0003-osc-ffff8b2f37fc1000.max_pages_per_rpc=256
osc.lsrza-OST0003-osc-ffff8b2f37fc1000.max_rpcs_in_flight=8
osc.lsrza-OST0003-osc-ffff8b2f37fc1000.max_dirty_mb=2000

No console log warnings or errors on the clients at all, except for reconnections when we bounced OSTs or MDTs.

The current patch stack does not include the negative grants / direct IO patches. It is:
* 27e1e03 (tag: 2.12.5_5.llnl) log lfs setstripe paths to syslog
* f7a3e6e (tag: 2.12.5_4.llnl) LU-13599 mdt: fix mti_big_lmm buffer usage
* 8179deb LU-13599 mdt: fix logic of skipping local locks in reply_state
* 52a48df (tag: 2.12.5_3.llnl) LU-13709 utils: 'lfs mkdir -i -1' doesn't work
* 81479ad LU-13657 kernel: kernel update RHEL8.2 [4.18.0-193.6.3.el8_2]
* 2744b2f LU-13421 kernel: kernel update RHEL8.1 [4.18.0-147.8.1.el8_1]
* 0aaf383 LU-10395 osd: stop OI at device shutdown
* fb8ae25 LU-13766 obdclass: add grant fields to export procfile
* e6ee866 (tag: 2.12.5_2.chaos) LU-13667 ptlrpc: fix endless loop issue
* 049ed85 LU-11623 llite: hash just created files if lock allows
* e1e865f (tag: 2.12.5_1.llnl, tag: 2.12.5_1.chaos) Don't install lustre init script on systemd systems
* 6eebb51 LLNL build customizations
* 407ac6c TOSS-4431 build: build ldiskfs only for x86_64
* 78d712a (tag: v2_12_5, tag: 2.12.5) New release 2.12.5 |
| Comment by Olaf Faaland [ 09/Nov/20 ] |
|
We have several clusters where we see this occur in fairly short order (4-8 hours) after a reboot. But so far we do not have a procedure for reproducing the issue. |
| Comment by Cameron Harr [ 09/Nov/20 ] |
|
I'm dropping max_dirty_mb to 1024 on Pascal; however, the CZ is relatively clean now after restarting all the OSTs last week (and remounting Lustre on a handful of uncooperative compute nodes), so I don't think we'd be hitting the issue anytime soon anyway. I will also set it on rztopaz which hasn't had the cleanup. |
| Comment by Gerrit Updater [ 11/Nov/20 ] |
|
Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/40615 |
| Comment by Nathan Dauchy (Inactive) [ 12/Nov/20 ] |
|
As another reference point, we seem to be hitting this issue at NOAA as well, with 2.12.3_ddn44 on the clients and lustre-2.12.3_ddn31 on the servers. Unmounting and remounting a client seems to improve performance, at least for a little while. |
| Comment by Nathan Dauchy (Inactive) [ 12/Nov/20 ] |
|
A colleague found a way to force reconnection between client and server while working on a different issue. I tried it in this case and was able to get cur_grant_bytes to renegotiate without the heavy-handed unmount. Is this a dangerous thing to do as a workaround?

[root@Jet:k11 ~]# lctl set_param osc.*.grant_shrink=0
[root@Jet:k11 ~]# lctl get_param osc.*.cur_grant_bytes | sort -nk2 -t = | head -n 5
[root@Jet:k11 ~]# lctl device_list | egrep "lfs1-OST00(02|18)"
[root@Jet:k11 ~]# lctl --device 4 deactivate; sleep 1; lctl --device 4 activate
[root@Jet:k11 ~]# lctl get_param osc.*.cur_grant_bytes | egrep "lfs1-OST00(02|18)" |
| Comment by Andreas Dilger [ 12/Nov/20 ] |
|
Nathan, it should be noted that disconnecting and reconnecting the device like this could cause in-flight RPCs to that OST to be aborted, so it should only be done on quiescent clients, after "lctl set_param ldlm.namespaces.*.lru_size=clear" has been used to flush all dirty data from the client. So it is mostly OK, but not safe to do at random times during the day. |
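Putting the two comments together, a cautious version of the workaround might look like this; the OST name and device index (4) are just the examples from Nathan's session and must be looked up per client:

# Only on a quiescent client: flush dirty/cached data first, per the caution above
lctl set_param ldlm.namespaces.*.lru_size=clear

# Find the OSC device index for the starved OST, then bounce it to renegotiate grant
lctl device_list | egrep "lfs1-OST0002"
lctl --device 4 deactivate; sleep 1; lctl --device 4 activate

# Confirm the grant came back
lctl get_param osc.lfs1-OST0002*.cur_grant_bytes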
| Comment by SC Admin (Inactive) [ 13/Nov/20 ] |
|
Hiya, using the small dd test, before the slow nodes drained of jobs I used lsof and lfs getstripe to find all files using the slow OSTs from those nodes. In all of these cases there were exe's ('txt' in lsof) or .so's ('mem') using the slow OSTs. Sometimes there was a regular file as well, but in most cases the slow OST was only referenced by one exe (e.g. 32 references to a 32-way OpenMP exe) or a few .so's (with e.g. 22 refs each).

(A caveat that this could be a skewed observation for many reasons - unknown node history prior to going slow / 2 users' jobs are on a disproportionate number of these nodes / there can be a lot of .so's open compared to regular files / ... - but it seemed like a pattern.)

So could this issue be something to do with mmap rather than regular I/O? Probably way off-track, but I'm reminded of ...

Also FYI, max_dirty_mb=1000 doesn't seem to work as a workaround. We set that on 2 login nodes after a reboot, and one was broken again within a day or so.

cheers, |
| Comment by Andreas Dilger [ 13/Nov/20 ] |
|
I was looking at where the cl_lost_grant might be coming from. According to a comment in the code:

/**
 * Free grant after IO is finished or canceled.
 *
 * @lost_grant is used to remember how many grants we have allocated but not
 * used, we should return these grants to OST. There're two cases where grants
 * can be lost:
 * 1. truncate;
 * 2. blocksize at OST is less than PAGE_SIZE and a partial page was
 *    written. In this case OST may use less chunks to serve this partial
 *    write. OSTs don't actually know the page size on the client side. so
 *    clients have to calculate lost grant by the blocksize on the OST.
 *    See filter_grant_check() for details.
 */
static void osc_free_grant(struct client_obd *cli, unsigned int nr_pages,
                           unsigned int lost_grant, unsigned int dirty_grant)

so it is possible that the cl_lost_grant overflow problem from Olaf's patch https://review.whamcloud.com/40615 "LU-14125 osc: prevent overflow of o_dropped" happens more frequently for ZFS OSTs. I can't imagine that this would be the case for ldiskfs OSTs.

There definitely still seems to be some kind of grant accounting error. Even on my idle single-client test system, the client "osc.*.cur_grant_bytes" (about 8MB) does not match what the OST "obdfilter.*.tot_granted" thinks it has granted (about 256KB). |
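For reference, the two numbers being compared here are the parameters named above; on an idle single-client setup they can be checked side by side, and should agree when the client has no dirty pages:

# OST side: total grant the target believes it has handed out
lctl get_param obdfilter.*.tot_granted

# Client side: grant each OSC believes it currently holds
lctl get_param osc.*.cur_grant_bytes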
| Comment by Gerrit Updater [ 16/Nov/20 ] |
|
Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/40659 |
| Comment by Olaf Faaland [ 17/Nov/20 ] |
|
Our Lustre 2.12 machines all got updates over the last week, to a new Lustre version that has the two grant patches. Everything started out without the symptom because of the client + server reboots. Still waiting for the issue to reappear. |
| Comment by Gerrit Updater [ 26/Nov/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40563/ |
| Comment by Gerrit Updater [ 26/Nov/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40659/ |
| Comment by Peter Jones [ 26/Nov/20 ] |
|
The patches have landed for 2.14. Let's reopen if the problem still reappears with them in place. |
| Comment by Andreas Dilger [ 29/Nov/20 ] |
|
Olaf, any news on the client grant front? |
| Comment by Nathan Dauchy (Inactive) [ 29/Nov/20 ] |
|
Are the patches to prevent overflow of o_dropped and add debugging information available for 2.12.x? We have a maintenance window coming up, during which we are planning to upgrade clients to 2.12.5+ anyway. |
| Comment by Andreas Dilger [ 30/Nov/20 ] |
|
There is patch https://review.whamcloud.com/40615 |
| Comment by Olaf Faaland [ 30/Nov/20 ] |
|
Andreas, |
| Comment by Olaf Faaland [ 25/Jan/21 ] |
|
Since my last update in November our production machines for the most part have been running Lustre 2.12.5_10.llnl, which has these grant-related patches landed post 2.12.5:
With that patch set, disabling grant_shrink and then umounting and remounting the OSTs not only gets cur_grant_bytes and tot_granted back in sync but also prevents grant starvation. Our next update coming up in a few weeks will be running 2.12.6_3.llnl which also has:
After that update we will re-enable grant_shrink and see if the problem recurs. |
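When that time comes, re-enabling it persistently would presumably be the inverse of the setting described in the original report (run on the MGS; the readback is just a sanity check and assumes grant_shrink is readable on this version):

# On the MGS: re-enable grant shrinking on all clients persistently
lctl set_param -P osc.*.grant_shrink=1

# On a client: confirm the new setting has taken effect
lctl get_param osc.*.grant_shrink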
| Comment by Cory Spitz [ 04/Feb/21 ] |
|
> fb8ae25 |
| Comment by Olaf Faaland [ 04/Feb/21 ] |
|
Hi Cory, that patch (https://review.whamcloud.com/#/c/39324/5) has been on our 2.12 stack at LLNL for several weeks and hasn't caused us any trouble. I'm not sure why it hasn't landed to b2_12. I added the gerrit gatekeeper as a reviewer and posted a query to that effect. |
| Comment by Andreas Dilger [ 05/Feb/21 ] |
|
The landings to b2_12 have been paused while we focus efforts on getting 2.14.0 out the door. There are a number of patches queued up for 2.12.7 once Oleg has cycles to run them through his test rig. |
| Comment by Cory Spitz [ 11/Feb/21 ] |
|
ofaaland, ah, thanks for pointing it out. That patch is from this ticket and not directly from |
| Comment by Olaf Faaland [ 22/Feb/21 ] |
|
My update from Jan 5 said:
Our updates are delayed so it will be another week or two before we can re-enable grant_shrink and then another couple of weeks before we see if the symptoms reappear. |
| Comment by Andreas Dilger [ 25/Feb/21 ] |
|
I'm reopening this, because I don't think it is clear the problem has been fixed. I believe it was just closed because the two LU-14125 patches were landed for 2.14.0. |
| Comment by Andreas Dilger [ 25/Feb/21 ] |
|
Another site reported that the o_dropped fix alone does not seem to have (fully?) resolved the problem, since they are still seeing clients with low grant. I'm still trying to find out whether the site is running with grant_shrink=1 or not. One theory I had is that something in the grant shrink path is causing incremental loss of grant because it is using non-PAGE_SIZE grant amounts. I notice that the clients and servers have values that are not even PAGE_SIZE multiples, so maybe there is some kind of rounding problem between the client and server? |
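A quick client-side check of the rounding theory, assuming 4 KiB pages (for instance, the 997461 bytes from the description is indeed not a multiple of 4096):

# Show each OSC's grant and its remainder modulo the 4 KiB page size
lctl get_param -n osc.*.cur_grant_bytes | awk '{ printf "%d mod 4096 = %d\n", $1, $1 % 4096 }'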
| Comment by Gerrit Updater [ 04/Mar/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/39324/ |
| Comment by Cory Spitz [ 05/Mar/21 ] |
|
FWIW, we have a couple of reports. One customer running 2.12.6 saw trouble with performance. One set of nodes was set to grant_shrink=0, which mitigated the problem. Another customer running ~2.12.4 saw their grant problems disappear with this patch and |
| Comment by Andreas Dilger [ 07/Mar/21 ] |
|
Cory, thanks for the update. You may be conflating two issues here. Separately, tgt_grant_sanity_check() is a server-side verification of the grants given to the clients vs. the total granted counters for the target at disconnect time, which is disabled on systems with more than 100 connected clients because it adds significant overhead at that point (O(n^2) with the number of connected clients). However, that check is not being triggered in this case (AFAIK) because it isn't a problem with the per-export vs. global counters on the OST, but a disconnect between what the client is counting and what the server is counting. |
| Comment by Olaf Faaland [ 09/Mar/21 ] |
At LLNL the clients where we're seeing this problem do connect to more than 100 OSTs (across 3 file systems). |
| Comment by Andreas Dilger [ 09/Mar/21 ] |
|
Olaf, the backported patch https://review.whamcloud.com/40564 " |
| Comment by Gerrit Updater [ 17/Mar/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40615/ |
| Comment by Andreas Dilger [ 30/Mar/21 ] |
|
There is a new patch https://review.whamcloud.com/42129 " |
| Comment by Olaf Faaland [ 27/Apr/21 ] |
|
We've seen the issue at LLNL again:
The same clients mounted another file system which had grant_shrink disabled, and those OSCs did not encounter the issue. Our clients will get patch 40564 ("grant shrink shouldn't account skipped OSC") in the next few weeks, but it typically takes weeks for the issue to become easily detectable.
|