Details
Type: Bug
Resolution: Fixed
Priority: Minor
Affects Version/s: Lustre 2.12.5
Fix Version/s: None
Environment: centos7, x86_64, OPA, lustre 2.12.5, zfs 0.8.5
Severity: 3
Description
Hi,
we're seeing slow i/o (~kB/s) to some OSTs from some clients. this seems to be correlated with low grant on the client.
here is a reproducer that sets up striped dirs (one per OST) and then does a small dd to each.
for f in {0..19}; do mkdir $f; lfs setstripe -i $f $f; done
for f in {0..19}; do echo $f; dd if=/tmp/50m of=$f/blah9i & done
where /tmp/50m is an almost incompressible file of size 64149552 bytes.
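for reference, /tmp/50m is just random data, so something like the below would recreate it (a sketch only, the exact command isn't important):

# make an almost incompressible ~61 MiB file of random data
head -c 64149552 /dev/urandom > /tmp/50m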
a few seconds later, some dd's have finished quickly, while others will literally take 6 hours:
[me@farnarkle2 test]$ ls -l */blah9i | sort -n -k9
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 0/blah9i
-rw-r--r-- 1 me oz997   423424 Nov 6 18:58 1/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 2/blah9i
-rw-r--r-- 1 me oz997   409600 Nov 6 18:58 3/blah9i
-rw-r--r-- 1 me oz997   409088 Nov 6 18:58 4/blah9i
-rw-r--r-- 1 me oz997   253440 Nov 6 18:58 5/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 6/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 7/blah9i
-rw-r--r-- 1 me oz997   158720 Nov 6 18:58 8/blah9i
-rw-r--r-- 1 me oz997   570368 Nov 6 18:58 9/blah9i
-rw-r--r-- 1 me oz997   574464 Nov 6 18:58 10/blah9i
-rw-r--r-- 1 me oz997   569344 Nov 6 18:58 11/blah9i
-rw-r--r-- 1 me oz997   379904 Nov 6 18:58 12/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 13/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 14/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 15/blah9i
-rw-r--r-- 1 me oz997    84992 Nov 6 18:58 16/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 17/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 18/blah9i
-rw-r--r-- 1 me oz997   389120 Nov 6 18:58 19/blah9i
the slow i/o seems to mostly, but not completely, correlate with low grant on the client:
[me@farnarkle2 test]$ grep -r . /proc/fs/lustre/osc/dagg-OST*/*grant*
/proc/fs/lustre/osc/dagg-OST0000-osc-ffff9f785aa2a800/cur_grant_bytes:3114734
/proc/fs/lustre/osc/dagg-OST0001-osc-ffff9f785aa2a800/cur_grant_bytes:1832833
/proc/fs/lustre/osc/dagg-OST0002-osc-ffff9f785aa2a800/cur_grant_bytes:1881800703
/proc/fs/lustre/osc/dagg-OST0003-osc-ffff9f785aa2a800/cur_grant_bytes:2050095
/proc/fs/lustre/osc/dagg-OST0004-osc-ffff9f785aa2a800/cur_grant_bytes:1902694
/proc/fs/lustre/osc/dagg-OST0005-osc-ffff9f785aa2a800/cur_grant_bytes:1946593
/proc/fs/lustre/osc/dagg-OST0006-osc-ffff9f785aa2a800/cur_grant_bytes:1879834623
/proc/fs/lustre/osc/dagg-OST0007-osc-ffff9f785aa2a800/cur_grant_bytes:1876295679
/proc/fs/lustre/osc/dagg-OST0008-osc-ffff9f785aa2a800/cur_grant_bytes:1867376
/proc/fs/lustre/osc/dagg-OST0009-osc-ffff9f785aa2a800/cur_grant_bytes:2035056
/proc/fs/lustre/osc/dagg-OST000a-osc-ffff9f785aa2a800/cur_grant_bytes:1674002
/proc/fs/lustre/osc/dagg-OST000b-osc-ffff9f785aa2a800/cur_grant_bytes:1790763
/proc/fs/lustre/osc/dagg-OST000c-osc-ffff9f785aa2a800/cur_grant_bytes:1979790
/proc/fs/lustre/osc/dagg-OST000d-osc-ffff9f785aa2a800/cur_grant_bytes:30273
/proc/fs/lustre/osc/dagg-OST000e-osc-ffff9f785aa2a800/cur_grant_bytes:60219801
/proc/fs/lustre/osc/dagg-OST000f-osc-ffff9f785aa2a800/cur_grant_bytes:251557488
/proc/fs/lustre/osc/dagg-OST0010-osc-ffff9f785aa2a800/cur_grant_bytes:1891954
/proc/fs/lustre/osc/dagg-OST0011-osc-ffff9f785aa2a800/cur_grant_bytes:33611568
/proc/fs/lustre/osc/dagg-OST0012-osc-ffff9f785aa2a800/cur_grant_bytes:9298722
/proc/fs/lustre/osc/dagg-OST0013-osc-ffff9f785aa2a800/cur_grant_bytes:1591979
this effect is most noticeable with the above dd, which uses the default bs of 512 bytes: the slow files are writing at a few kB/s, while the normal OSTs handle this small i/o at ballpark 150 MB/s.
with dd bs=1M it's the same slow OSTs, but they run at a minimum of ~1 MB/s. normal OSTs write at ballpark 700 MB/s with 1M i/o's.
so it kinda feels like ops/second or rpcs_in_flight is being limited rather than absolute MB/s, if you know what I mean.
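fwiw, the rpc-side things I know to look at on the client are below; comparing rpc_stats between a slow OST and a healthy one should show whether the slow ones are stuck issuing tiny rpcs (parameter names are from memory, so apologies if they're slightly off):

# how many concurrent rpcs each osc is allowed
lctl get_param osc.dagg-OST*.max_rpcs_in_flight
# per-osc histograms of rpc sizes and rpcs in flight, e.g. a slow OST (0001) vs a normal one (0002)
lctl get_param osc.dagg-OST0001-osc-*.rpc_stats osc.dagg-OST0002-osc-*.rpc_stats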
other client nodes that show larger grants don't have the same issue.
the above client is a login node, so it has a lot of random stuff going on.
rebooting the client, or rebooting the servers hosting the OSTs involved, seems to clear the problem, but I don't have any other workarounds. I can't seem to find any grant tweakables on the client or server, or in the manual, but perhaps I'm missing something.
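for completeness, the grant-related things I can find to even look at (none of them obviously tunable) are roughly these; the server-side names are from memory so may not be exact:

# client side: per-osc grant and dirty bytes
lctl get_param osc.dagg-OST*.cur_grant_bytes osc.dagg-OST*.cur_dirty_bytes
# oss side: total grant each OST has handed out to all clients, plus dirty/pending totals
lctl get_param obdfilter.dagg-OST*.tot_granted obdfilter.dagg-OST*.tot_dirty obdfilter.dagg-OST*.tot_pending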
there are ~250 client nodes. OSTs are very large at ~300T. OSS ram is 192g, but we've told zfs ARC to use 75% of it. reducing zfs_arc_max on the fly to try to give more OSS ram to lustre doesn't seem to immediately help.
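(by "reducing zfs_arc_max on the fly" I just mean writing a smaller cap to the module parameter on the OSS, roughly as below; the 96g value is only an example:)

# example: drop the arc cap at runtime, e.g. from ~75% of the 192g down to 96g
echo $((96 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max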
we updated lustre on servers from 2.12.4 + zfs 0.8.3 to 2.12.5 + zfs 0.8.5 this week to see if this issue was resolved in 2.12.5, but the issue still seems to be there. clients have been 2.12.5 for ages.
I forget whether lustre grant is related to quota. anyway, this is a group-quota'd filesystem and the above oz997 group is nowhere near its limit:
Disk quotas for grp oz997 (gid 10273):
     Filesystem    used   quota   limit   grace    files   quota    limit   grace
          /fred  4.859T      0k     10T       -   609477       0  2000000       -
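(for reference, that quota output is from something like:)

lfs quota -g oz997 /fred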
any ideas on how to debug this further, or a less crappy workaround than rebooting things?
thanks.
cheers,
robin