[LU-14124] super slow i/o on client maybe related to low grant Created: 06/Nov/20  Updated: 09/Mar/22  Resolved: 03/Nov/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.5
Fix Version/s: Lustre 2.12.8, Lustre 2.15.0

Type: Bug Priority: Minor
Reporter: SC Admin (Inactive) Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: None
Environment:

centos7, x86_64, OPA, lustre 2.12.5, zfs 0.8.5


Issue Links:
Duplicate
duplicates LU-14125 client starved for grant but OST has ... Reopened
Related
is related to LU-14901 consume grants if async write fallbac... Open
is related to LU-9704 ofd_grant_check() claims GRANT, real ... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Hi,

we're seeing slow i/o (~kB/s) to some OSTs from some clients. this seems to be correlated with low grant on the client.

here is a reproducer that creates one dir per OST (each pinned to that OST with lfs setstripe -i) and then does a small dd into each.

for f in {0..19}; do mkdir $f ; lfs setstripe -i $f $f; done
for f in {0..19}; do echo $f ; dd if=/tmp/50m of=$f/blah9i & done

where /tmp/50m is an almost incompressible file of size 64149552 bytes.

a few seconds later, some dd's have finished quickly, and others will literally take 6 hours

[me@farnarkle2 test]$ ls -l */blah9i | sort -n -k9
-rw-r--r-- 1 me oz997 64149552 Nov  6 18:56 0/blah9i
-rw-r--r-- 1 me oz997   423424 Nov  6 18:58 1/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov  6 18:56 2/blah9i
-rw-r--r-- 1 me oz997   409600 Nov  6 18:58 3/blah9i
-rw-r--r-- 1 me oz997   409088 Nov  6 18:58 4/blah9i
-rw-r--r-- 1 me oz997   253440 Nov  6 18:58 5/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov  6 18:56 6/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov  6 18:56 7/blah9i
-rw-r--r-- 1 me oz997   158720 Nov  6 18:58 8/blah9i
-rw-r--r-- 1 me oz997   570368 Nov  6 18:58 9/blah9i
-rw-r--r-- 1 me oz997   574464 Nov  6 18:58 10/blah9i
-rw-r--r-- 1 me oz997   569344 Nov  6 18:58 11/blah9i
-rw-r--r-- 1 me oz997   379904 Nov  6 18:58 12/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov  6 18:56 13/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov  6 18:56 14/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov  6 18:56 15/blah9i
-rw-r--r-- 1 me oz997    84992 Nov  6 18:58 16/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov  6 18:56 17/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov  6 18:56 18/blah9i
-rw-r--r-- 1 me oz997   389120 Nov  6 18:58 19/blah9i

the slow i/o seems to mostly but not completely correlate with low grant on the client

[me@farnarkle2 test]$ grep -r . /proc/fs/lustre/osc/dagg-OST*/*grant*
/proc/fs/lustre/osc/dagg-OST0000-osc-ffff9f785aa2a800/cur_grant_bytes:3114734
/proc/fs/lustre/osc/dagg-OST0001-osc-ffff9f785aa2a800/cur_grant_bytes:1832833
/proc/fs/lustre/osc/dagg-OST0002-osc-ffff9f785aa2a800/cur_grant_bytes:1881800703
/proc/fs/lustre/osc/dagg-OST0003-osc-ffff9f785aa2a800/cur_grant_bytes:2050095
/proc/fs/lustre/osc/dagg-OST0004-osc-ffff9f785aa2a800/cur_grant_bytes:1902694
/proc/fs/lustre/osc/dagg-OST0005-osc-ffff9f785aa2a800/cur_grant_bytes:1946593
/proc/fs/lustre/osc/dagg-OST0006-osc-ffff9f785aa2a800/cur_grant_bytes:1879834623
/proc/fs/lustre/osc/dagg-OST0007-osc-ffff9f785aa2a800/cur_grant_bytes:1876295679
/proc/fs/lustre/osc/dagg-OST0008-osc-ffff9f785aa2a800/cur_grant_bytes:1867376
/proc/fs/lustre/osc/dagg-OST0009-osc-ffff9f785aa2a800/cur_grant_bytes:2035056
/proc/fs/lustre/osc/dagg-OST000a-osc-ffff9f785aa2a800/cur_grant_bytes:1674002
/proc/fs/lustre/osc/dagg-OST000b-osc-ffff9f785aa2a800/cur_grant_bytes:1790763
/proc/fs/lustre/osc/dagg-OST000c-osc-ffff9f785aa2a800/cur_grant_bytes:1979790
/proc/fs/lustre/osc/dagg-OST000d-osc-ffff9f785aa2a800/cur_grant_bytes:30273
/proc/fs/lustre/osc/dagg-OST000e-osc-ffff9f785aa2a800/cur_grant_bytes:60219801
/proc/fs/lustre/osc/dagg-OST000f-osc-ffff9f785aa2a800/cur_grant_bytes:251557488
/proc/fs/lustre/osc/dagg-OST0010-osc-ffff9f785aa2a800/cur_grant_bytes:1891954
/proc/fs/lustre/osc/dagg-OST0011-osc-ffff9f785aa2a800/cur_grant_bytes:33611568
/proc/fs/lustre/osc/dagg-OST0012-osc-ffff9f785aa2a800/cur_grant_bytes:9298722
/proc/fs/lustre/osc/dagg-OST0013-osc-ffff9f785aa2a800/cur_grant_bytes:1591979

this effect is most noticeable with the above default bs=512 byte dd. the slow files are writing at a few kB/s. the normal OSTs write this small i/o at ballpark 150 MB/s.
with dd bs=1M it's the same slow OSTs, but they run at a minimum of ~1 MB/s. normal OSTs write at ballpark 700 MB/s with 1M i/o's.
so it feels like ops/second or rpcs_in_flight is being limited rather than absolute MB/s, if you know what I mean.
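
if it is RPC-limited rather than bandwidth-limited, the per-OSC rpc stats should show it. a quick way to look (standard osc params; just a sketch of what we'd check):

# on the affected client: concurrency limits and dirty cache cap per OSC
lctl get_param osc.dagg-OST*.max_rpcs_in_flight
lctl get_param osc.dagg-OST*.max_dirty_mb
# rpc_stats shows the distribution of rpcs in flight and pages per rpc;
# a grant-starved OSC would presumably sit at ~1 rpc in flight with tiny rpcs
lctl get_param osc.dagg-OST0001*.rpc_stats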

from other client nodes with larger grant showing, we don't have the same issues.
the above client is a login node, so has a lot of random stuff going on.

rebooting the client, or rebooting the servers with the OSTs involved seems to clear the problem, but I don't have any other workarounds. I can't seem to find any grant tweakables on client or server, or in the manual, but perhaps I'm missing something.
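
for reference, the only grant-related things I can see via lctl look like counters rather than tunables (a sketch; names as on our 2.12 clients/servers):

# client: current grant per OSC and how often the OSC offers grant back
lctl get_param osc.dagg-OST*.cur_grant_bytes
lctl get_param osc.dagg-OST*.grant_shrink_interval
# OSS: totals that each OST thinks it has handed out / has outstanding
lctl get_param obdfilter.dagg-OST*.tot_granted obdfilter.dagg-OST*.tot_pending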

there are ~250 client nodes. OSTs are very large at ~300T. OSS ram is 192g, but we've told zfs ARC to use 75% of it. reducing zfs_arc_max on the fly to try to give more OSS ram to lustre doesn't seem to immediately help.
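
for completeness, this is the sort of thing we're doing to drop the ARC cap on the fly (value is just an example; the ARC seems to shrink lazily rather than immediately):

# on an OSS: cap the ZFS ARC at 64 GiB at runtime
echo $((64 * 1024**3)) > /sys/module/zfs/parameters/zfs_arc_max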

we updated lustre on servers from 2.12.4 + zfs 0.8.3 to 2.12.5 + zfs 0.8.5 this week to see if this issue was resolved in 2.12.5, but the issue still seems to be there. clients have been 2.12.5 for ages.

I forget if lustre grant is related to quota. anyway, this is a group quota'd filesystem and the above oz997 group is nowhere near its limits.

Disk quotas for grp oz997 (gid 10273):
     Filesystem    used   quota   limit   grace   files   quota   limit   grace
          /fred  4.859T      0k     10T       -  609477       0 2000000       -

any ideas how to debug this further or for a less crappy workaround than rebooting things?
thanks.

cheers,
robin



 Comments   
Comment by SC Admin (Inactive) [ 09/Nov/20 ]

Hi,

I looked in lustre-discuss and found something similar from Mar, and a follow up from Oct.

it looks like the current work-around is to set grant_shrink=0 and apply some patches to make this stick.
is that correct?

happy to hold off on this if you have better suggestions, or if e.g. 2.12.6 is imminent with a real fix.
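
if that is the workaround, I'm guessing it would look roughly like this on each client (untested here; whether grant_shrink is settable on our build, and whether it survives reconnects without those patches, I don't know):

# stop the OSCs handing grant back to the servers
lctl set_param osc.dagg-OST*.grant_shrink=0
# reportedly this doesn't stick across reconnects without the LU-14125
# patches, so it would need re-applying (or setting persistently from the
# MGS, e.g. lctl set_param -P osc.*.grant_shrink=0, where supported)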

thanks.

cheers,
robin

Comment by Peter Jones [ 09/Nov/20 ]

Mike

Could you please advise?

Thanks

Peter

Comment by Andreas Dilger [ 09/Nov/20 ]

This sounds very similar to the issues being discussed in LU-14125.

It appears that the problem is caused by the client-side grant amount overflowing, so that the client has a small grant, but the server thinks that it has a large grant, so doesn't give out any more grant to the client, forcing it to issue synchronous writes to the server.
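
A rough way to see this from userspace, using standard parameters (note that tot_granted is the per-OST total across all clients, so this is only an indication rather than exact per-client accounting):

# client: what this client thinks it holds for one OST
lctl get_param osc.dagg-OST0001*.cur_grant_bytes
# OSS: what that OST thinks it has granted out in total; a client stuck at
# a few MB while the server-side total stays pinned high is consistent with
# the mismatch described above
lctl get_param obdfilter.dagg-OST0001.tot_granted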

Comment by Andreas Dilger [ 01/Dec/20 ]

Were you able to test the patches from LU-14125 to see if that addresses the grant mismatch issue?

Comment by SC Admin (Inactive) [ 02/Dec/20 ]

Hi Andreas,

thanks for following up.

we've had 2.12.5 clients patched with "osc: prevent overflow of o_dropped" since approx Nov 15, and since then some of those have hit the slow i/o issue.
so that patch didn't fix the problem for us.

I built the then-current version of the patch, which used CWARN instead of CDEBUG. there have been no o_dropped messages in syslog.
they'd have gone to syslog I presume?

we shut down all clients and servers for a downtime last week. we've been back up for 3 days now. since then the slow i/o issue hasn't re-appeared. we normally see it quite quickly - a couple of nodes a day - so that's a bit odd. perhaps the whole system shutdown (rather than just our usual rolling reboots of clients or servers) has wiped out some hysteresis?

during the downtime we were going to add the "obdclass: add grant fields to export procfile" patch to the servers' 2.12.5, but unfortunately we ran out of time.

cheers,
robin

Comment by SC Admin (Inactive) [ 16/Feb/21 ]

Hiya,

an update.

we turned off grant_shrink on clients 3 weeks ago as per Olaf's workaround in https://jira.whamcloud.com/browse/LU-14125?focusedCommentId=290273&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-290273 and that's working well so far. I haven't detected any slow i/o since then.

we have 2.12.6 on servers and clients now, with the o_dropped patch on client side, and grant_show patch on server side.

the grant numbers on clients in general don't match those on the servers, but they aren't wrong by much. when grants are summed across a subset of 121 clients they differ by about 1% - servers report ~4gb more grant than clients, out of a total of 3tb on one occasion and 4tb on another.
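
for the record, roughly how we pull those sums (a sketch only - the per-export server-side numbers really come from the grant_show patch, and the hostnames/files below are made up):

# client side: sum cur_grant_bytes across the sampled clients
pdsh -w ^clients.txt 'lctl get_param -n osc.dagg-OST*.cur_grant_bytes' 2>/dev/null |
    awk '{sum += $2} END {printf "clients think they hold: %.0f bytes\n", sum}'
# server side: per-OST totals (stock parameter; covers all clients, not just
# the sampled ones, so only roughly comparable to the above)
pdsh -w 'oss[01-10]' 'lctl get_param -n obdfilter.dagg-OST*.tot_granted' 2>/dev/null |
    awk '{sum += $2} END {printf "servers think they granted: %.0f bytes\n", sum}'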

cheers,
robin

Comment by Gerrit Updater [ 19/Apr/21 ]

Vladimir Saveliev (c17830@cray.com) uploaded a new patch: https://review.whamcloud.com/43375
Subject: LU-14124 target: set OBD_MD_FLGRANT in read's reply
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3b5341a2f877d9a75bedb4bb64dff4624c10787c

Comment by Vladimir Saveliev [ 19/Apr/21 ]

The grant accounting imbalance could be caused by tgt_brw_read(), which does not set the OBD_MD_FLGRANT flag in its reply:

        if (body->oa.o_valid & OBD_MD_FLCKSUM) {
..
                repbody->oa.o_valid = OBD_MD_FLCKSUM | OBD_MD_FLFLAGS;
..
        } else {
                repbody->oa.o_valid = 0;
        }

Then osc_update_grant() does not update the client's grant counter:

static void osc_update_grant(struct client_obd *cli, struct ost_body *body)
{
        if (body->oa.o_valid & OBD_MD_FLGRANT) {
                CDEBUG(D_CACHE, "got %llu extra grant\n", body->oa.o_grant);
                __osc_update_grant(cli, body->oa.o_grant);
        }
}

If, on the server side, tgt_grant_shrink() decides not to shrink grants and instead expects the client to restore its grant counter

         if (left_space >= tgd->tgd_tot_granted_clients *
                          TGT_GRANT_SHRINK_LIMIT(exp))
                return;
..

then we get the case where the server did not decrease the export's grant counter, but the client did.

After several shrinks on read, the client may end up thinking it holds only a small amount of grant, while the server assumes the client already holds the maximum and therefore limits further grants to it. That may lead to performance degradation.
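
To observe this on a live client, the D_CACHE message in osc_update_grant() shown above can be watched with the standard debug log (a sketch; the file path is just an example):

# enable cache debugging, clear the log, do some reads against a suspect OST
lctl set_param debug=+cache
lctl clear
dd if=/fred/path/to/some/file of=/dev/null bs=1M
# dump the debug buffer and look for grant coming back in read replies
lctl dk /tmp/lustre-dk.txt
grep "extra grant" /tmp/lustre-dk.txt
# with the bug, read replies never carry OBD_MD_FLGRANT, so no "extra grant"
# lines show up for reads even after the client has shrunk its grant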

Comment by SC Admin (Inactive) [ 29/Apr/21 ]

thanks!
we've applied that patch to a couple of clients and will see how it goes.

cheers,
robin

Comment by SC Admin (Inactive) [ 17/May/21 ]

Hi,

we have a few clients with 2.12.6 +
https://review.whamcloud.com/#/c/40615/ "osc: prevent overflow of o_dropped"
https://review.whamcloud.com/#/c/43375/ "target: set OBD_MD_FLGRANT in read's reply"

hmm, should we also have this one? looks like I missed it...
https://review.whamcloud.com/#/c/42129/ "target: prevent overflowing of tgd->tgd_tot_granted"

servers have 2.12.6 +
https://review.whamcloud.com/#/c/40563/ "obdclass: add grant fields to export procfile"

below is the state of a node (with grant_shrink on) that's had lots of different jobs on it for about 3 days, but is now idle, i.e. no i/o or open files on this filesystem.

columns are the OST, what the server thinks the grant is, what the client thinks the grant is, and server minus client. ideally the last column would be all zeros.

john65 192.168.44.165
   ost     svr_grant    client_grant     svr_grant-client_grant
 OST0000   1879048191   1055293440    823754751
 OST0001   1879048191   1879048191            0
 OST0002      2097152      2097152            0
 OST0003   1881800703   1881800703            0
 OST0004   1881800703   1881800703            0
 OST0005   1879048191    591069696   1287978495
 OST0006   1870790655   1870790655            0
 OST0007      2097152      2097152            0
 OST0008   1884553215   1414397952    470155263
 OST0009   1876295679   1876295679            0
 OST000a      2097152      2097152            0
 OST000b   1876295679   1876295679            0
 OST000c   1884553215   1414791168    469762047
 OST000d   1881800703   1411842048    469958655
 OST000e   1881800703   1411350528    470450175
 OST000f      2097152      2097152            0
 OST0010   1880489983   1880489983            0
 OST0011   1881800703   1411645440    470155263
 OST0012   1884553215    795617280   1088935935
 OST0013   1870790655   1870790655            0

the server vs. client grant imbalance tends to grow over time for each OST as the node runs more jobs, but (AFAICT) it gets reset to zero again each time an OST disconnects from and reconnects to the client (after what looks to be an hour of no i/o to that OST).

the OSTs with 0 in the last column above tend to be the ones that have reconnected most recently.

over May so far, this node has had 82 disconnect/reconnect events across the various OSTs.
server grant - client grant was 0 before the disconnect in 61 of these cases, so grant accounting was correct.

in the other 21 OST disconnect cases, an average of 673223605.9 bytes of grant was lost for each one. I guess this means grant accounting wasn't ok in those cases.
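
for reference, the disconnect/reconnect history can be read straight off the client without trawling syslog (standard params):

# per-OSC import state history: FULL / DISCONN / CONNECTING transitions with
# timestamps, i.e. the points where the grant counters get re-synced
lctl get_param osc.dagg-OST*.state
# current connection details for one OSC
lctl get_param osc.dagg-OST0012*.import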

cheers,
robin

Comment by Gerrit Updater [ 03/Nov/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/43375/
Subject: LU-14124 target: set OBD_MD_FLGRANT in read's reply
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 4894683342d77964daeded9fbc608fc46aa479ee

Comment by Peter Jones [ 03/Nov/21 ]

Landed for 2.15

Comment by Gerrit Updater [ 05/Nov/21 ]

"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45471
Subject: LU-14124 target: set OBD_MD_FLGRANT in read's reply
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 82f5c0b6581f503b0c41e0e24fa135c9c0ce01c8

Comment by Gerrit Updater [ 14/Nov/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45471/
Subject: LU-14124 target: set OBD_MD_FLGRANT in read's reply
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 557d7d35e30b1f602104026ae0b024cee1abb494
