[LU-11798] cur_grant goes to 0 and never increases with 2.8 client and 2.10 server Created: 17/Dec/18  Updated: 13/Apr/19  Resolved: 01/Apr/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0, Lustre 2.10.5
Fix Version/s: Lustre 2.13.0, Lustre 2.12.1

Type: Bug Priority: Major
Reporter: Olaf Faaland Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

client:
lustre-2.8.2_5.chaos-1.ch6.x86_64

servers:
zfs-0.7.11-4llnl.ch6.x86_64
lustre-2.10.5_2.chaos-3.ch6.x86_64


Attachments: p0 (HTML), toss-4377.catalyst107.dec13.tgz, toss-4377.ost0.dec13.tar.gz
Issue Links:
Related
is related to LU-11663 corrupt data after page-unaligned wri... Resolved
is related to LU-11844 IO pattern causing writes to hang to OST Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

The Lustre 2.10.5 / zfs 0.7.11 OST is not allocating grant to the clients.

I set the +cache debug flag on both the client and the server.

On the client, under /proc/fs/lustre/osc/lustre3-OST....:
cur_dirty_bytes == 0
cur_grant_bytes starts out as 2097152

I then issue
dd if=/dev/urandom of=/p/lustre3/foo/toss-4377/ost0/test.10M bs=5K count=2000

cur_grant_bytes goes down, and never back up, as writes to the OST occur. Eventually it reaches 0 and stays there.

There is no connection issue - there are no log messages indicating lost connections or reconnects, and osc/lustre3-OSTXXXX.*/state shows "current_state: FULL" for all OSTs.



 Comments   
Comment by Olaf Faaland [ 17/Dec/18 ]

Debug log records on the OST show the following:

tgt_grant_check()) lustre3-OST0000: cli 74c91b6f-783d-4891-fd8c-4d39dd24af9b/ffff88e174d62000 granted: 0 ungranted: 1703936 grant: 0 dirty: 0

and no other CDEBUG messages from tgt_grant_check(). These messages always show granted == 0, ungranted > 0, grant == 0, and dirty == 0. The logs contain no messages from tgt_grant_alloc() or tgt_grant_shrink(), which I believe implies OBD_MD_FLGRANT is not set.

The connection flags reported by the client are:
flags=0x20404af0e3440478
write_grant server_lock version request_portal truncate_lock max_byte_per_rpc early_lock_cancel adaptive_timeouts lru_resize alt_checksum_algorithm fid_is_enabled version_recovery full20 layout_lock 64bithash object_max_bytes jobstats einprogress lvb_type lfsck bulk_mbits

so neither OBD_CONNECT_GRANT_SHRINK nor OBD_CONNECT_GRANT_PARAM is set.

Comment by Olaf Faaland [ 17/Dec/18 ]

I'm unable to reproduce this on our test clusters so far, unfortunately. Still working on it.

Comment by Andreas Dilger [ 17/Dec/18 ]

This could be why you were able to trigger LU-11663 fairly easily. When the client has no grant, it will immediately send synchronous writes to the OSS, because it doesn't know whether it is safe to cache a write when there may be no space on the OST.

Comment by Olaf Faaland [ 17/Dec/18 ]

I agree. I forgot to add the link, thanks for adding that.

Comment by Olaf Faaland [ 17/Dec/18 ]

@paf here is the ticket for grant going to 0 with 2.10.5 server.

Comment by Olaf Faaland [ 18/Dec/18 ]

I've uploaded debug logs from OST 0 and from a client.

The client version is 2.8.2_6.chaos plus two patches which add some CDEBUGs to the client; see branch 2.8.2-llnl-lu-11798-troubleshoot in the lustre-release-fe-llnl project. The extra CDEBUGs are not likely to be helpful, but I'm making the branch available to you so the line numbers you see in the client debug logs match up with the code.

toss-4377.catalyst107.dec13.tgz toss-4377.ost0.dec13.tar.gz

Comment by Peter Jones [ 18/Dec/18 ]

Hi Alex

Could you please investigate?

Thanks

Peter

Comment by Alex Zhuravlev [ 18/Dec/18 ]

Did you try to reproduce with a 2.10 client? I'm building 2.8 locally, but asking just in case you've already tried..

 

Comment by Olaf Faaland [ 18/Dec/18 ]

Alex, I wasn't able to test with 2.10 client on the catalyst cluster.

Comment by Alex Zhuravlev [ 18/Dec/18 ]

Thanks.. I just tried to reproduce with 2.10 - it seems to work well so far.

 

Comment by Patrick Farrell (Inactive) [ 19/Dec/18 ]

There's some bug, which I cannot find right now, where the ZFS block size confuses the grant code with older clients... and the server gets confused about how much grant to send to them, or something along those lines. It is resolved by the various grant fixes that enable OBD_GRANT_PARAM and the fixes atop it. I cannot recall a single specific patch.

I talked about this at LUG this year with some of the DDN Lustre guys, because IU (Cray client, DDN server) was having trouble with it.  Perhaps Li Xi remembers better...?  (I believe that's who I talked to, but I'm not 100% sure.)

Comment by Patrick Farrell (Inactive) [ 19/Dec/18 ]

Found some notes. Olaf - what is your record size on ZFS? Reducing it from 1 MiB to 128K made the problem disappear for us, so that would be a good indicator of whether it's the same issue.

Comment by Patrick Farrell (Inactive) [ 19/Dec/18 ]

OK, finally found what I was looking for.

Here are the details of a very similar problem we saw in 2.5 client / 2.10 server interop, where the 2.5 client did not have the OBD_GRANT_PARAM patches. We did not track down the precise fix that resolved the issue - we just had them use Cray's 2.7, which has the OBD_GRANT_PARAM and other related fixes integrated, and that worked fine.

Note that the DDN 2.5 client was OK - I never confirmed, but I suspect DDN had the OBD_GRANT_PARAM fixes integrated into that client.
"

With 1M record sizes, the server is not giving the client any grant:
00002000:00000001:27.0:1522265343.971722:0:109178:0:(tgt_grant.c:909:tgt_grant_alloc()) Process leaving (rc=0 : 0 : 0)

With 128K record sizes, it is:
00002000:00000001:16.0:1522265535.951623:0:34090:0:(tgt_grant.c:951:tgt_grant_alloc()) Process leaving (rc=33554432 : 33554432 : 2000000)

The result of all this is that the client can only send 1 RPC at a time. This works better with 4 MiB RPCs than with 1 MiB RPCs, for obvious reasons, but sending only 1 RPC at a time is the real problem.

There are a few different possible fixes on the client side, but again, it will be interesting to know what DDN has here.

The check which is failing in tgt_grant_alloc appears to be:
/* Grant some fraction of the client's requested grant space so that
 * they are not always waiting for write credits (not all of it to
 * avoid overgranting in face of multiple RPCs in flight). This
 * essentially will be able to control the OSC_MAX_RIF for a client.
 *
 * If we do have a large disparity between what the client thinks it
 * has and what we think it has, don't grant very much and let the
 * client consume its grant first. Either it just has lots of RPCs
 * in flight, or it was evicted and its grants will soon be used up. */
if (curgrant >= want || curgrant >= ted->ted_grant + chunk)
        RETURN(0);

"

(As noted at the top, I didn't actually root-cause the problem - we just noticed that Cray's 2.7 [which likely has other fixes beyond those in your 2.8] worked and moved on.)

Comment by Olaf Faaland [ 19/Dec/18 ]

Thank you Patrick. In this case tgt_grant_alloc is never called (nor tgt_grant_shrink), based on the debug log contents. So I believe OBD_MD_FLGRANT is not set. For some reason that's not the case on another 2.8 client / 2.10 server instance I have for comparison.

Comment by Alex Zhuravlev [ 20/Dec/18 ]

ofaaland please check that I'm using the correct version to reproduce:

commit 5c48e8c507bed4d4c35b10a3c61ae17e57ec1372 (tag: 2.8.2_6.chaos, origin/2.8.2-llnl)

Author: Jian Yu <yujian@whamcloud.com>

Date:   Sun Oct 28 11:28:23 2018 -0700

 

Comment by Alex Zhuravlev [ 20/Dec/18 ]

I just tried with the 2.8.2-chaos branch (client) and master (server) - failed to reproduce. Same with 2.10.5 on the server. Changing the ZFS record size made no difference.

[root@cl4 tests]# ../utils/lctl get_param osc.*.*grant*
osc.lustre-OST0000-osc-ffff8801d7ebe800.cur_grant_bytes=790528
osc.lustre-OST0000-osc-ffff8801d7ebe800.cur_lost_grant_bytes=0
osc.lustre-OST0000-osc-ffff8801d7ebe800.grant_shrink_interval=1200
[root@cl4 tests]# dd if=/dev/urandom of=/mnt/lustre/test.10M bs=5K count=2000 >&/dev/null
[root@cl4 tests]# ../utils/lctl get_param osc.*.*grant*
osc.lustre-OST0000-osc-ffff8801d7ebe800.cur_grant_bytes=12288
osc.lustre-OST0000-osc-ffff8801d7ebe800.cur_lost_grant_bytes=0
osc.lustre-OST0000-osc-ffff8801d7ebe800.grant_shrink_interval=1200
[root@cl4 tests]# sync
[root@cl4 tests]# ../utils/lctl get_param osc.*.*grant*
osc.lustre-OST0000-osc-ffff8801d7ebe800.cur_grant_bytes=761856
osc.lustre-OST0000-osc-ffff8801d7ebe800.cur_lost_grant_bytes=0
osc.lustre-OST0000-osc-ffff8801d7ebe800.grant_shrink_interval=1200
 
Comment by Alex Zhuravlev [ 20/Dec/18 ]

AFAICS, the client was asking for grant on every RPC:

00000008:00000020:1.0:1544742792.404120:0:65915:0:(osc_request.c:611:osc_announce_cached()) dirty: 0 undirty: 33554432 dropped 0 grant: 0

undirty (which is how much the client asks for) is 32 MB.

for old pre-GRANT_PARAM clients we've got the following:

if (!exp_grant_param_supp(exp)) {
        oa->o_undirty = tgt_grant_inflate(tgd, oa->o_undirty); 

static inline u64 tgt_grant_inflate(struct tg_grants_data *tgd, u64 val)
{
	if (tgd->tgd_blockbits > COMPAT_BSIZE_SHIFT)
		/* Client does not support such large block size, grant
		 * is thus inflated. We already significantly overestimate
		 * overhead, no need to add the extent tax in this case */
		return val << (tgd->tgd_blockbits - COMPAT_BSIZE_SHIFT);

#define COMPAT_BSIZE_SHIFT 12

 and finally:

#define o_undirty o_mode 
 __u32 o_mode;

My understanding is that for blocksize=512K (2^19) and larger, tgt_grant_inflate() generates a number that overflows o_undirty, making it 0.
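To make the arithmetic concrete, here's a small standalone illustration (not Lustre code; the 32 MB request and the shift come from the values quoted above):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        /* the client asks for 32 MB of grant, as in the debug line above */
        uint64_t undirty = 32ULL << 20;
        /* a 1 MiB ZFS recordsize gives tgd_blockbits = 20; with
         * COMPAT_BSIZE_SHIFT = 12, tgt_grant_inflate() shifts left by 8 */
        uint64_t inflated = undirty << (20 - 12);       /* 2^33, i.e. 8 GiB */
        /* o_undirty aliases the 32-bit o_mode field (see the #defines above),
         * so the value wraps around and the request becomes 0 */
        uint32_t on_wire = (uint32_t)inflated;

        printf("inflated = %llu, on the wire = %u\n",
               (unsigned long long)inflated, on_wire);
        return 0;
}

Run standalone, this prints inflated = 8589934592, on the wire = 0.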

My previous attempt to reproduce this failed because the block size used for grants is learned at mount time; it doesn't reflect subsequent changes made with zfs set recordsize=.
If the recordsize is set before mount, then it's trivial to reproduce.

Comment by Olaf Faaland [ 20/Dec/18 ]

That's right, recordsize was different on the test system than it was on the production system, and setting recordsize=1M, then stopping and starting the target, triggers the issue on the test system.

Thank you Alex, thank you Patrick!

Comment by Alex Zhuravlev [ 21/Dec/18 ]

I just attached a quick "fix" which essentially limits the client's "wish" on the server side.
This is far from a good solution, I'm afraid, but at least not every 4K IO is synchronous.
With this approach, the larger the ZFS block size, the more sync IO.
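For reference, a minimal sketch of that kind of limit is below - this is only an illustration of the clamping idea, not the attached p0 patch or the fix later pushed to Gerrit, and the helper name and cap are placeholders:

#include <stdint.h>

/* Hypothetical helper: inflate the grant "wish" for a pre-GRANT_PARAM
 * client, but never return more than the 32-bit o_undirty field can
 * carry, so the value cannot wrap to 0. */
static uint64_t inflate_capped(uint64_t val, unsigned int blockbits)
{
        const unsigned int compat_bsize_shift = 12;     /* COMPAT_BSIZE_SHIFT */
        uint64_t inflated = val;

        if (blockbits > compat_bsize_shift)
                inflated = val << (blockbits - compat_bsize_shift);

        if (inflated > UINT32_MAX)      /* cap to the 32-bit wire field */
                inflated = UINT32_MAX;
        return inflated;
}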

Comment by Peter Jones [ 21/Dec/18 ]

ofaaland is this kind of stopgap fix sufficient to allow you to move forward? I know that you are planning to complete your migration to 2.10.x in the new year.

Comment by Olaf Faaland [ 26/Dec/18 ]

Peter, I think this will work.

We have a couple of file systems with ~7000 clients, and so this would mean unused grant would "occupy" about 7TB per OST. The OSTs each have >400TB free right now, so that's OK.

This fix is not ideal, as Alex says, but the existing code is badly broken. Is there a reason not to put this patch through the usual system and get it landed to the 2.10 branch?

Comment by Alex Zhuravlev [ 27/Dec/18 ]

I think the best solution is to upgrade the clients

Comment by Olaf Faaland [ 27/Dec/18 ]

I think the best solution is to upgrade the clients

Alex, I have no argument with that! But a significant number of sites still use Lustre versions older than 2.10, according to the last survey results I saw:
https://www.eofs.eu/_media/events/lad18/02_peter_jones_community_release_update_lad_2018_final.pdf

So this needs to be fixed, or at least made less bad, in master. And since you're about to stop maintaining 2.10, it seems a shame to leave it this broken.

Comment by Olaf Faaland [ 02/Jan/19 ]

So this needs to be fixed, or at least made less bad, in master. And since you're about to stop maintaining 2.10, it seems a shame to leave it this broken.

Happy New Year!   What say you?  Or is there a patch in gerrit I overlooked? Thanks.

Comment by Gerrit Updater [ 02/Jan/19 ]

Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/33948
Subject: LU-11798 grant: prevent overflow of o_undirty
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 0b13ea79c3477ed835f40c12a03e38d92e7b77d4

Comment by Olaf Faaland [ 02/Jan/19 ]

I've pushed p0 as a patch against master to Gerrit, which I will cherry-pick from for our local build.

Comment by Peter Jones [ 02/Jan/19 ]

Olaf

Thanks for pushing the patch to master. Alex is out for the holidays, but I chatted briefly with him earlier and he confirmed that we could land this fix to master (and then to b2_10 and b2_12 afterwards).

Peter

Comment by Olaf Faaland [ 02/Jan/19 ]

Great, thank you both.

Comment by Gerrit Updater [ 03/Jan/19 ]

Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33957
Subject: LU-11798 grant: prevent overflow of o_undirty
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: a8d73df2469a4966f93d61fc4668f581c1dd7422

Comment by Gerrit Updater [ 01/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33948/
Subject: LU-11798 grant: prevent overflow of o_undirty
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d6f5219162116778e49865d83724276b17b9ce3a

Comment by Peter Jones [ 01/Apr/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 01/Apr/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34561
Subject: LU-11798 grant: prevent overflow of o_undirty
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 9786a2a7bf1d27539ea5aaa27200a5f651d9b430

Comment by Gerrit Updater [ 08/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34561/
Subject: LU-11798 grant: prevent overflow of o_undirty
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: e13cc954a776f6dec06f01299a55a55018ae6235
