[LU-14124] super slow i/o on client maybe related to low grant Created: 06/Nov/20 Updated: 09/Mar/22 Resolved: 03/Nov/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.5 |
| Fix Version/s: | Lustre 2.12.8, Lustre 2.15.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | SC Admin (Inactive) | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: |
centos7, x86_64, OPA, lustre 2.12.5, zfs 0.8.5 |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Hi, we're seeing slow i/o (~kB/s) to some OSTs from some clients. This seems to be correlated with low grant on the client. Here is a reproducer that sets up striped dirs (one per OST) and then does a small dd to each:

for f in {0..19}; do mkdir $f ; lfs setstripe -i $f $f; done
for f in {0..19}; do echo $f ; dd if=/tmp/50m of=$f/blah9i & done

where /tmp/50m is an almost incompressible file of size 64149552 bytes. A few seconds later, some dd's have finished quickly, and others will literally take 6 hours:

[me@farnarkle2 test]$ ls -l */blah9i | sort -n -k9
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 0/blah9i
-rw-r--r-- 1 me oz997   423424 Nov 6 18:58 1/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 2/blah9i
-rw-r--r-- 1 me oz997   409600 Nov 6 18:58 3/blah9i
-rw-r--r-- 1 me oz997   409088 Nov 6 18:58 4/blah9i
-rw-r--r-- 1 me oz997   253440 Nov 6 18:58 5/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 6/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 7/blah9i
-rw-r--r-- 1 me oz997   158720 Nov 6 18:58 8/blah9i
-rw-r--r-- 1 me oz997   570368 Nov 6 18:58 9/blah9i
-rw-r--r-- 1 me oz997   574464 Nov 6 18:58 10/blah9i
-rw-r--r-- 1 me oz997   569344 Nov 6 18:58 11/blah9i
-rw-r--r-- 1 me oz997   379904 Nov 6 18:58 12/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 13/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 14/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 15/blah9i
-rw-r--r-- 1 me oz997    84992 Nov 6 18:58 16/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 17/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 18/blah9i
-rw-r--r-- 1 me oz997   389120 Nov 6 18:58 19/blah9i

The slow i/o seems to mostly, but not completely, correlate with low grant on the client:

[me@farnarkle2 test]$ grep -r . /proc/fs/lustre/osc/dagg-OST*/*grant*
/proc/fs/lustre/osc/dagg-OST0000-osc-ffff9f785aa2a800/cur_grant_bytes:3114734
/proc/fs/lustre/osc/dagg-OST0001-osc-ffff9f785aa2a800/cur_grant_bytes:1832833
/proc/fs/lustre/osc/dagg-OST0002-osc-ffff9f785aa2a800/cur_grant_bytes:1881800703
/proc/fs/lustre/osc/dagg-OST0003-osc-ffff9f785aa2a800/cur_grant_bytes:2050095
/proc/fs/lustre/osc/dagg-OST0004-osc-ffff9f785aa2a800/cur_grant_bytes:1902694
/proc/fs/lustre/osc/dagg-OST0005-osc-ffff9f785aa2a800/cur_grant_bytes:1946593
/proc/fs/lustre/osc/dagg-OST0006-osc-ffff9f785aa2a800/cur_grant_bytes:1879834623
/proc/fs/lustre/osc/dagg-OST0007-osc-ffff9f785aa2a800/cur_grant_bytes:1876295679
/proc/fs/lustre/osc/dagg-OST0008-osc-ffff9f785aa2a800/cur_grant_bytes:1867376
/proc/fs/lustre/osc/dagg-OST0009-osc-ffff9f785aa2a800/cur_grant_bytes:2035056
/proc/fs/lustre/osc/dagg-OST000a-osc-ffff9f785aa2a800/cur_grant_bytes:1674002
/proc/fs/lustre/osc/dagg-OST000b-osc-ffff9f785aa2a800/cur_grant_bytes:1790763
/proc/fs/lustre/osc/dagg-OST000c-osc-ffff9f785aa2a800/cur_grant_bytes:1979790
/proc/fs/lustre/osc/dagg-OST000d-osc-ffff9f785aa2a800/cur_grant_bytes:30273
/proc/fs/lustre/osc/dagg-OST000e-osc-ffff9f785aa2a800/cur_grant_bytes:60219801
/proc/fs/lustre/osc/dagg-OST000f-osc-ffff9f785aa2a800/cur_grant_bytes:251557488
/proc/fs/lustre/osc/dagg-OST0010-osc-ffff9f785aa2a800/cur_grant_bytes:1891954
/proc/fs/lustre/osc/dagg-OST0011-osc-ffff9f785aa2a800/cur_grant_bytes:33611568
/proc/fs/lustre/osc/dagg-OST0012-osc-ffff9f785aa2a800/cur_grant_bytes:9298722
/proc/fs/lustre/osc/dagg-OST0013-osc-ffff9f785aa2a800/cur_grant_bytes:1591979

This effect is most noticeable with the above default bs=512 byte dd. The slow files are writing at a few kB/s; the normal OSTs write this small i/o at ballpark 150 MB/s. Other client nodes showing larger grant don't have the same issue. Rebooting the client, or rebooting the servers with the OSTs involved, seems to clear the problem, but I don't have any other workarounds. I can't seem to find any grant tweakables on client or server, or in the manual, but perhaps I'm missing something. There are ~250 client nodes. OSTs are very large at ~300T. OSS RAM is 192G, but we've told the zfs ARC to use 75% of it. Reducing zfs_arc_max on the fly to try to give more OSS RAM to lustre doesn't seem to immediately help.

We updated lustre on the servers from 2.12.4 + zfs 0.8.3 to 2.12.5 + zfs 0.8.5 this week to see if this issue was resolved in 2.12.5, but the issue still seems to be there. Clients have been 2.12.5 for ages.

I forget if lustre grant is related to quota. Anyway, this is a group-quota'd filesystem and the above oz997 group is not close to its limit:

Disk quotas for grp oz997 (gid 10273):
     Filesystem    used   quota   limit   grace   files   quota   limit   grace
          /fred  4.859T      0k     10T       -  609477       0 2000000       -

Any ideas how to debug this further, or for a less crappy workaround than rebooting things? cheers, |
| Comments |
| Comment by SC Admin (Inactive) [ 09/Nov/20 ] |
|
Hi, I looked in lustre-discuss and found something similar from Mar, and a follow-up from Oct. It looks like the current workaround is to set grant_shrink=0 and apply some patches to make this stick. Happy to hold off on this if you have better suggestions, or if e.g. 2.12.6 is imminent with a real fix. thanks. cheers, |
| Comment by Peter Jones [ 09/Nov/20 ] |
|
Mike, could you please advise? Thanks, Peter |
| Comment by Andreas Dilger [ 09/Nov/20 ] |
|
This sounds very similar to the issues being discussed in LU-14125. It appears that the problem is caused by the client-side grant amount overflowing: the client is left with a small grant, but the server thinks it holds a large grant, so it doesn't give out any more grant to the client, forcing it to issue synchronous writes to the server. |
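For illustration only - this is not Lustre source, and the names below (grant_view, write_is_async, extra_grant) are hypothetical - here is a minimal C sketch of the failure mode described above: the client's view of its grant is too small to cover even a small write, while the server's inflated view of the same grant stops it from handing out any more.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical model of the accounting only, not Lustre itself. */
struct grant_view {
	uint64_t client_grant;	/* bytes the client believes it may still cache */
	uint64_t server_grant;	/* bytes the server believes this client holds  */
	uint64_t server_cap;	/* per-client grant limit on the server         */
};

/* Client side: a write can go through the cache (fast, async) only if it
 * is covered by grant; otherwise it must be sent as a synchronous RPC. */
static bool write_is_async(const struct grant_view *v, uint64_t bytes)
{
	return v->client_grant >= bytes;
}

/* Server side: extra grant offered back in a reply.  If the server's view
 * is inflated, it thinks the client is at the cap and offers nothing. */
static uint64_t extra_grant(const struct grant_view *v)
{
	return v->server_grant >= v->server_cap ?
		0 : v->server_cap - v->server_grant;
}

int main(void)
{
	/* Numbers shaped like the ticket: ~30 KB seen on the client,
	 * ~1.8 GB recorded on the server for the same export. */
	struct grant_view v = { 30273, 1879048191, 1879048191 };

	printf("1 MB write async? %s, extra grant offered: %llu bytes\n",
	       write_is_async(&v, 1 << 20) ? "yes" : "no",
	       (unsigned long long)extra_grant(&v));
	return 0;
}

With both counters stuck like this, every small write becomes its own synchronous round trip, which matches the few-kB/s rates in the description above.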
| Comment by Andreas Dilger [ 01/Dec/20 ] |
|
Were you able to test the patches from LU-14125 to see if that addresses the grant mismatch issue? |
| Comment by SC Admin (Inactive) [ 02/Dec/20 ] |
|
Hi Andreas, thanks for following up. We've had 2.12.5 clients patched with "osc: prevent overflow of o_dropped" since approx Nov 15, and since then some of those have hit the slow i/o issue. I built the current version of the patch (at the time), which was CWARN instead of CDEBUG. There have been no o_dropped messages in syslog. We shut down all clients and servers for a downtime last week; we've been back up for 3 days now, and since then the slow i/o issue hasn't re-appeared. We normally see it quite quickly - a couple of nodes a day - so that's a bit odd. Perhaps the whole-system shutdown (rather than just our usual rolling reboots of clients or servers) has wiped out some hysteresis? During the downtime we were going to add the "obdclass: add grant fields to export procfile" patch to the servers' 2.12.5, but unfortunately we ran out of time. cheers, |
| Comment by SC Admin (Inactive) [ 16/Feb/21 ] |
|
Hiya, an update. We turned off grant_shrink on clients 3 weeks ago as per Olaf's workaround in https://jira.whamcloud.com/browse/LU-14125?focusedCommentId=290273&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-290273 and that's working well so far; I haven't detected any slow i/o since then. We have 2.12.6 on servers and clients now, with the o_dropped patch on the client side and the grant_show patch on the server side. The grant numbers on clients in general don't match those on the servers, but they aren't wrong by much. When grants are summed across a subset of 121 clients they differ by about 1% - servers report ~4GB more grant than clients, out of a total of 3TB on one occasion and 4TB on another. cheers, |
| Comment by Gerrit Updater [ 19/Apr/21 ] |
|
Vladimir Saveliev (c17830@cray.com) uploaded a new patch: https://review.whamcloud.com/43375 |
| Comment by Vladimir Saveliev [ 19/Apr/21 ] |
|
The grant accounting imbalance could be caused by tgt_brw_read(), which does not set the OBD_MD_FLGRANT flag in its reply:

if (body->oa.o_valid & OBD_MD_FLCKSUM) {
..
	repbody->oa.o_valid = OBD_MD_FLCKSUM | OBD_MD_FLFLAGS;
..
} else {
	repbody->oa.o_valid = 0;
}

Then osc_update_grant() does not update the client's grant counter:

static void osc_update_grant(struct client_obd *cli, struct ost_body *body)
{
	if (body->oa.o_valid & OBD_MD_FLGRANT) {
		CDEBUG(D_CACHE, "got %llu extra grant\n", body->oa.o_grant);
		__osc_update_grant(cli, body->oa.o_grant);
	}
}

If, on the server side, tgt_grant_shrink() decides not to shrink grants and wants the client to restore its grant counter:

	if (left_space >= tgd->tgd_tot_granted_clients *
			  TGT_GRANT_SHRINK_LIMIT(exp))
		return;
..

we get the case where the server did not decrease the export's grant counter, but the client did. After several shrinks on read, the client may think it owns only a few grants, while the server assumes the client already holds the maximum amount of grant and therefore limits new grants to it. That may lead to performance degradation. |
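To make the sequence above concrete, here is a small standalone C toy model (a simplified sketch under assumed behaviour, not Lustre code and not the patch in https://review.whamcloud.com/43375): it runs ten read RPCs on which the client piggybacks a grant shrink that the server declines, and the only difference between the two runs is whether the read reply carries the grant back to the client, as osc_update_grant() would consume when OBD_MD_FLGRANT is set.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct view {
	int64_t client_grant;	/* client's counter */
	int64_t server_grant;	/* server's counter for this export */
};

/* One read RPC on which the client piggybacks a grant shrink that the
 * server declines (plenty of free space, as in tgt_grant_shrink()). */
static void read_with_declined_shrink(struct view *v, int64_t shrink,
				      bool reply_carries_grant)
{
	v->client_grant -= shrink;	/* client assumes the shrink happened */
	/* server declined, so server_grant stays as it was */
	if (reply_carries_grant)
		v->client_grant += shrink;	/* reply hands the grant back,
						 * keeping the two views in sync */
}

int main(void)
{
	struct view without_flag = { 64 << 20, 64 << 20 };
	struct view with_flag    = { 64 << 20, 64 << 20 };

	for (int i = 0; i < 10; i++) {
		read_with_declined_shrink(&without_flag, 4 << 20, false);
		read_with_declined_shrink(&with_flag, 4 << 20, true);
	}
	printf("no grant in read reply: client=%lld server=%lld\n",
	       (long long)without_flag.client_grant,
	       (long long)without_flag.server_grant);
	printf("grant in read reply:    client=%lld server=%lld\n",
	       (long long)with_flag.client_grant,
	       (long long)with_flag.server_grant);
	return 0;
}

After ten such reads the first client's counter has drifted 40 MB below the server's, while the second stays in sync; this is the same direction of mismatch seen in the per-OST table in the 17/May/21 comment below.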
| Comment by SC Admin (Inactive) [ 29/Apr/21 ] |
|
thanks! cheers, |
| Comment by SC Admin (Inactive) [ 17/May/21 ] |
|
Hi, we have a few clients with 2.12.6 + hmm, should we also have this one? looks like I missed it... servers have 2.12.6 +

Below is the state of a node (with grant_shrink on) that's had lots of different jobs on it for about 3 days, but is now idle, i.e. no i/o or open files on this filesystem. Columns are the OST, what the server thinks the grant is, what the client thinks the grant is, and server minus client. Ideally the last column would be all zeros.

john65 192.168.44.165
ost      svr_grant    client_grant  svr_grant-client_grant
OST0000  1879048191   1055293440     823754751
OST0001  1879048191   1879048191             0
OST0002     2097152      2097152             0
OST0003  1881800703   1881800703             0
OST0004  1881800703   1881800703             0
OST0005  1879048191    591069696    1287978495
OST0006  1870790655   1870790655             0
OST0007     2097152      2097152             0
OST0008  1884553215   1414397952     470155263
OST0009  1876295679   1876295679             0
OST000a     2097152      2097152             0
OST000b  1876295679   1876295679             0
OST000c  1884553215   1414791168     469762047
OST000d  1881800703   1411842048     469958655
OST000e  1881800703   1411350528     470450175
OST000f     2097152      2097152             0
OST0010  1880489983   1880489983             0
OST0011  1881800703   1411645440     470155263
OST0012  1884553215    795617280    1088935935
OST0013  1870790655   1870790655             0

The server vs. client grant imbalance tends to grow over time for each OST as the node runs more jobs, but (AFAICT) they get reset to zero again each time an OST disconnects and reconnects to the client (after what looks to be an hour of no i/o to an OST). The OSTs with 0 in the last column above tend to be the ones that have reconnected most recently. Over May so far, this node has had 82 disconnect/reconnect events across the various OSTs. In the other 21 OST disconnect cases, an average of 673223605.9 bytes of grant was lost for each one. I guess this means grant accounting wasn't OK in those cases. cheers, |
| Comment by Gerrit Updater [ 03/Nov/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/43375/ |
| Comment by Peter Jones [ 03/Nov/21 ] |
|
Landed for 2.15 |
| Comment by Gerrit Updater [ 05/Nov/21 ] |
|
"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45471 |
| Comment by Gerrit Updater [ 14/Nov/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45471/ |