Details
Type: Bug
Resolution: Fixed
Priority: Minor
Affects Version/s: Lustre 2.12.5
Fix Version/s: None
Environment: centos7, x86_64, OPA, lustre 2.12.5, zfs 0.8.5
Severity: 3
Description
Hi,
we're seeing slow i/o (~kB/s) to some OSTs from some clients. this seems to be correlated with low grant on the client.
here is a reproducer that sets up striped dirs (one per OST) and then does a small dd to each.
for f in {0..19}; do mkdir $f; lfs setstripe -i $f $f; done
for f in {0..19}; do echo $f; dd if=/tmp/50m of=$f/blah9i & done
where /tmp/50m is an almost incompressible file of size 64149552 bytes.
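for reference, /tmp/50m is just random data, so something like the below would recreate it (a sketch only, the exact command isn't important):

# make an almost incompressible ~61 MiB file of random data
head -c 64149552 /dev/urandom > /tmp/50m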
a few seconds later, some dd's have finished quickly, while others will literally take 6 hours:
[me@farnarkle2 test]$ ls -l */blah9i | sort -n -k9
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 0/blah9i
-rw-r--r-- 1 me oz997   423424 Nov 6 18:58 1/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 2/blah9i
-rw-r--r-- 1 me oz997   409600 Nov 6 18:58 3/blah9i
-rw-r--r-- 1 me oz997   409088 Nov 6 18:58 4/blah9i
-rw-r--r-- 1 me oz997   253440 Nov 6 18:58 5/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 6/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 7/blah9i
-rw-r--r-- 1 me oz997   158720 Nov 6 18:58 8/blah9i
-rw-r--r-- 1 me oz997   570368 Nov 6 18:58 9/blah9i
-rw-r--r-- 1 me oz997   574464 Nov 6 18:58 10/blah9i
-rw-r--r-- 1 me oz997   569344 Nov 6 18:58 11/blah9i
-rw-r--r-- 1 me oz997   379904 Nov 6 18:58 12/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 13/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 14/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 15/blah9i
-rw-r--r-- 1 me oz997    84992 Nov 6 18:58 16/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 17/blah9i
-rw-r--r-- 1 me oz997 64149552 Nov 6 18:56 18/blah9i
-rw-r--r-- 1 me oz997   389120 Nov 6 18:58 19/blah9i
the slow i/o seems to mostly, but not completely, correlate with low grant on the client:
[me@farnarkle2 test]$ grep -r . /proc/fs/lustre/osc/dagg-OST*/*grant*
/proc/fs/lustre/osc/dagg-OST0000-osc-ffff9f785aa2a800/cur_grant_bytes:3114734
/proc/fs/lustre/osc/dagg-OST0001-osc-ffff9f785aa2a800/cur_grant_bytes:1832833
/proc/fs/lustre/osc/dagg-OST0002-osc-ffff9f785aa2a800/cur_grant_bytes:1881800703
/proc/fs/lustre/osc/dagg-OST0003-osc-ffff9f785aa2a800/cur_grant_bytes:2050095
/proc/fs/lustre/osc/dagg-OST0004-osc-ffff9f785aa2a800/cur_grant_bytes:1902694
/proc/fs/lustre/osc/dagg-OST0005-osc-ffff9f785aa2a800/cur_grant_bytes:1946593
/proc/fs/lustre/osc/dagg-OST0006-osc-ffff9f785aa2a800/cur_grant_bytes:1879834623
/proc/fs/lustre/osc/dagg-OST0007-osc-ffff9f785aa2a800/cur_grant_bytes:1876295679
/proc/fs/lustre/osc/dagg-OST0008-osc-ffff9f785aa2a800/cur_grant_bytes:1867376
/proc/fs/lustre/osc/dagg-OST0009-osc-ffff9f785aa2a800/cur_grant_bytes:2035056
/proc/fs/lustre/osc/dagg-OST000a-osc-ffff9f785aa2a800/cur_grant_bytes:1674002
/proc/fs/lustre/osc/dagg-OST000b-osc-ffff9f785aa2a800/cur_grant_bytes:1790763
/proc/fs/lustre/osc/dagg-OST000c-osc-ffff9f785aa2a800/cur_grant_bytes:1979790
/proc/fs/lustre/osc/dagg-OST000d-osc-ffff9f785aa2a800/cur_grant_bytes:30273
/proc/fs/lustre/osc/dagg-OST000e-osc-ffff9f785aa2a800/cur_grant_bytes:60219801
/proc/fs/lustre/osc/dagg-OST000f-osc-ffff9f785aa2a800/cur_grant_bytes:251557488
/proc/fs/lustre/osc/dagg-OST0010-osc-ffff9f785aa2a800/cur_grant_bytes:1891954
/proc/fs/lustre/osc/dagg-OST0011-osc-ffff9f785aa2a800/cur_grant_bytes:33611568
/proc/fs/lustre/osc/dagg-OST0012-osc-ffff9f785aa2a800/cur_grant_bytes:9298722
/proc/fs/lustre/osc/dagg-OST0013-osc-ffff9f785aa2a800/cur_grant_bytes:1591979
this effect is most noticeable with the above dd, which uses the default bs of 512 bytes: the slow files are writing at a few kB/s, while the normal OSTs handle this small i/o at ballpark 150 MB/s.
with dd bs=1M it's the same slow OSTs, but they run at a minimum of ~1 MB/s. normal OSTs write at ballpark 700 MB/s with 1M i/o's.
so it kinda feels like ops/second or rpcs_in_flight is being limited rather than absolute MB/s, if you know what I mean.
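fwiw, the rpc-side things I know to look at on the client are below; comparing rpc_stats between a slow OST and a healthy one should show whether the slow ones are stuck issuing tiny rpcs (parameter names are from memory, so apologies if they're slightly off):

# how many concurrent rpcs each osc is allowed
lctl get_param osc.dagg-OST*.max_rpcs_in_flight
# per-osc histograms of rpc sizes and rpcs in flight, e.g. a slow OST (0001) vs a normal one (0002)
lctl get_param osc.dagg-OST0001-osc-*.rpc_stats osc.dagg-OST0002-osc-*.rpc_stats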
other client nodes that show larger grants don't have the same issue.
the above client is a login node, so it has a lot of random stuff going on.
rebooting the client, or rebooting the servers hosting the OSTs involved, seems to clear the problem, but I don't have any other workarounds. I can't seem to find any grant tweakables on the client or server, or in the manual, but perhaps I'm missing something.
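for completeness, the grant-related things I can find to even look at (none of them obviously tunable) are roughly these; the server-side names are from memory so may not be exact:

# client side: per-osc grant and dirty bytes
lctl get_param osc.dagg-OST*.cur_grant_bytes osc.dagg-OST*.cur_dirty_bytes
# oss side: total grant each OST has handed out to all clients, plus dirty/pending totals
lctl get_param obdfilter.dagg-OST*.tot_granted obdfilter.dagg-OST*.tot_dirty obdfilter.dagg-OST*.tot_pending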
there are ~250 client nodes. OSTs are very large at ~300T. OSS ram is 192g, but we've told zfs ARC to use 75% of it. reducing zfs_arc_max on the fly to try to give more OSS ram to lustre doesn't seem to immediately help.
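(by "reducing zfs_arc_max on the fly" I just mean writing a smaller cap to the module parameter on the OSS, roughly as below; the 96g value is only an example:)

# example: drop the arc cap at runtime, e.g. from ~75% of the 192g down to 96g
echo $((96 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_arc_max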
we updated lustre on servers from 2.12.4 + zfs 0.8.3 to 2.12.5 + zfs 0.8.5 this week to see if this issue was resolved in 2.12.5, but the issue still seems to be there. clients have been 2.12.5 for ages.
I forget whether lustre grant is related to quota. anyway, this is a group-quota'd filesystem and the above oz997 group is nowhere near its limit:
Disk quotas for grp oz997 (gid 10273):
     Filesystem    used   quota   limit   grace    files   quota    limit   grace
          /fred  4.859T      0k     10T       -   609477       0  2000000       -
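(for reference, that quota output is from something like:)

lfs quota -g oz997 /fred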
any ideas on how to debug this further, or a less crappy workaround than rebooting things?
thanks.
cheers,
robin