[LU-11798] cur_grant goes to 0 and never increases with 2.8 client and 2.10 server Created: 17/Dec/18 Updated: 13/Apr/19 Resolved: 01/Apr/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0, Lustre 2.10.5 |
| Fix Version/s: | Lustre 2.13.0, Lustre 2.12.1 |
| Type: | Bug | Priority: | Major |
| Reporter: | Olaf Faaland | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl |
| Environment: | client: servers: |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
The Lustre 2.10.5 / zfs 0.7.11 OST is not allocating grant to the clients. I set the +cache debug flag on both client and server. On the client, under /proc/fs/lustre/osc/lustre3-OST...., I then issue writes, and cur_grant_bytes goes down, and never up, as the writes to the OST occur. Eventually it reaches 0 and then stays == 0. There is no connection issue: there are no log messages indicating lost connections or reconnects, and osc/lustre3-OSTXXXX.*/state shows "current_state: FULL" for all OSTs. |
| Comments |
| Comment by Olaf Faaland [ 17/Dec/18 ] |
|
Debug log records on the OST show the following:

tgt_grant_check()) lustre3-OST0000: cli 74c91b6f-783d-4891-fd8c-4d39dd24af9b/ffff88e174d62000 granted: 0 ungranted: 1703936 grant: 0 dirty: 0

and no other CDEBUG messages from tgt_grant_check(). These messages always show granted == 0, ungranted > 0, grant == 0, and dirty == 0. The logs contain no messages from tgt_grant_alloc() or tgt_grant_shrink(), which I believe implies OBD_MD_FLGRANT is not set. The connection flags reported by the client show that neither OBD_CONNECT_GRANT_SHRINK nor OBD_CONNECT_GRANT_PARAM is set. |
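To make the inference above concrete, here is a minimal, self-contained sketch of the server-side gating being described, assuming (as the logs suggest) that the grant paths are only entered when OBD_MD_FLGRANT is present in o_valid. The bit values and the grant_prepare_write() helper are illustrative stand-ins, not the verbatim lustre/target/tgt_grant.c code.

```c
/* Illustrative sketch, not the verbatim Lustre source: if the client's RPC
 * carries no OBD_MD_FLGRANT bit in o_valid, neither tgt_grant_alloc() nor
 * tgt_grant_shrink() runs -- matching the silent debug logs above. */
#include <stdio.h>
#include <stdint.h>

#define OBD_MD_FLGRANT      (1ULL << 24)  /* illustrative bit value */
#define OBD_FL_SHRINK_GRANT (1U << 5)     /* illustrative bit value */

struct obdo_stub {
	uint64_t o_valid;
	uint32_t o_flags;
};

static void grant_prepare_write(struct obdo_stub *oa)
{
	if (!(oa->o_valid & OBD_MD_FLGRANT)) {
		/* The only path consistent with the logs above. */
		printf("OBD_MD_FLGRANT not set: grant alloc/shrink skipped\n");
		return;
	}
	if (oa->o_flags & OBD_FL_SHRINK_GRANT)
		printf("tgt_grant_shrink() would run and CDEBUG here\n");
	else
		printf("tgt_grant_alloc() would run and CDEBUG here\n");
}

int main(void)
{
	struct obdo_stub oa = { .o_valid = 0, .o_flags = 0 };
	grant_prepare_write(&oa);
	return 0;
}
```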
| Comment by Olaf Faaland [ 17/Dec/18 ] |
|
I'm unable to reproduce this on our test clusters so far, unfortunately. Still working on it. |
| Comment by Andreas Dilger [ 17/Dec/18 ] |
|
This could be why you were able to trigger the issue in the linked ticket. |
| Comment by Olaf Faaland [ 17/Dec/18 ] |
|
I agree. I forgot to add the link, thanks for adding that. |
| Comment by Olaf Faaland [ 17/Dec/18 ] |
|
@paf here is the ticket for grant going to 0 with 2.10.5 server. |
| Comment by Olaf Faaland [ 18/Dec/18 ] |
|
I've uploaded debug logs from OST 0 and from a client. The client version is 2.8.2_6.chaos plus two patches which add some CDEBUGs to the client; see branch 2.8.2-llnl-lu-11798-troubleshoot in the lustre-release-fe-llnl project. The extra CDEBUGs are not likely to be helpful, but I'm making the branch available so that the line numbers you see in the client debug logs match up with the code. |
| Comment by Peter Jones [ 18/Dec/18 ] |
|
Hi Alex, could you please investigate? Thanks, Peter |
| Comment by Alex Zhuravlev [ 18/Dec/18 ] |
|
Did you try to reproduce with a 2.10 client? I'm building 2.8 locally, but asking just in case you already tried.
|
| Comment by Olaf Faaland [ 18/Dec/18 ] |
|
Alex, I wasn't able to test with 2.10 client on the catalyst cluster. |
| Comment by Alex Zhuravlev [ 18/Dec/18 ] |
|
Thanks. I just tried to reproduce with 2.10 - it seems to work well so far.
|
| Comment by Patrick Farrell (Inactive) [ 19/Dec/18 ] |
|
There's a bug, which I cannot find right now, where the ZFS block size confuses the grant code with older clients - the server gets confused about how much grant to send them. It is resolved by the various grant fixes that enable OBD_GRANT_PARAM and the fixes atop it; I cannot recall a single specific patch. I talked about this at LUG this year with some of the DDN Lustre guys, because IU (Cray client, DDN server) was having trouble with it. Perhaps Li Xi remembers better? (I believe that's who I talked to, but I'm not 100% sure.) |
| Comment by Patrick Farrell (Inactive) [ 19/Dec/18 ] |
|
Found some notes. Olaf - what is your recordsize on ZFS? Reducing it from 1 MiB to 128 KiB made the problem disappear for us, so that would be a good indicator of whether it's the same issue. |
| Comment by Patrick Farrell (Inactive) [ 19/Dec/18 ] |
|
OK, I finally found what I was looking for. Here are the details of a very similar problem we saw with 2.5 client / 2.10 server interop, where the 2.5 client did not have the OBD_GRANT_PARAM patches. We did not track down the precise fix that resolved the issue - we just had them use Cray's 2.7, which has OBD_GRANT_PARAM and the other related fixes integrated, and that worked fine. Note that the DDN 2.5 client was OK; I never confirmed it, but I suspect DDN had the OBD_GRANT_PARAM fixes integrated into that client. With 1M record sizes, the server was not giving the client any grant; with 128K record sizes, it was. The result of all this is that the client can only send 1 RPC at a time. This works better with 4 MiB RPCs than with 1 MiB RPCs, for obvious reasons, but sending only 1 RPC at a time is the real problem. There are a few different possible fixes on the client side, but again, it will be interesting to know what DDN has here. The check which is failing in tgt_grant_alloc appears to be the one captured in my notes. (As noted at the top, I didn't actually root-cause the problem - we just noticed that Cray's 2.7 [which likely has fixes other than those in your 2.8] worked, and moved on.) |
| Comment by Olaf Faaland [ 19/Dec/18 ] |
|
Thank you Patrick. In this case tgt_grant_alloc is never called (nor tgt_grant_shrink), based on the debug log contents. So I believe OBD_MD_FLGRANT is not set. For some reason that's not the case on another 2.8 client / 2.10 server instance I have for comparison. |
| Comment by Alex Zhuravlev [ 20/Dec/18 ] |
|
ofaaland, please check that I'm using the correct version to reproduce:

commit 5c48e8c507bed4d4c35b10a3c61ae17e57ec1372 (tag: 2.8.2_6.chaos, origin/2.8.2-llnl)
Author: Jian Yu <yujian@whamcloud.com>
Date: Sun Oct 28 11:28:23 2018 -0700
|
| Comment by Alex Zhuravlev [ 20/Dec/18 ] |
|
I just tried with the 2.8.2-chaos branch (client) and master (server) - failed to reproduce. Same with 2.10.5 on the server. Changing the ZFS record size changed nothing.

[root@cl4 tests]# ../utils/lctl get_param osc.*.*grant*
osc.lustre-OST0000-osc-ffff8801d7ebe800.cur_grant_bytes=790528
osc.lustre-OST0000-osc-ffff8801d7ebe800.cur_lost_grant_bytes=0
osc.lustre-OST0000-osc-ffff8801d7ebe800.grant_shrink_interval=1200
[root@cl4 tests]# dd if=/dev/urandom of=/mnt/lustre/test.10M bs=5K count=2000 >&/dev/null
[root@cl4 tests]# ../utils/lctl get_param osc.*.*grant*
osc.lustre-OST0000-osc-ffff8801d7ebe800.cur_grant_bytes=12288
osc.lustre-OST0000-osc-ffff8801d7ebe800.cur_lost_grant_bytes=0
osc.lustre-OST0000-osc-ffff8801d7ebe800.grant_shrink_interval=1200
[root@cl4 tests]# sync
[root@cl4 tests]# ../utils/lctl get_param osc.*.*grant*
osc.lustre-OST0000-osc-ffff8801d7ebe800.cur_grant_bytes=761856
osc.lustre-OST0000-osc-ffff8801d7ebe800.cur_lost_grant_bytes=0
osc.lustre-OST0000-osc-ffff8801d7ebe800.grant_shrink_interval=1200 |
| Comment by Alex Zhuravlev [ 20/Dec/18 ] |
|
AFAICS, the client was asking for grant on every RPC:

00000008:00000020:1.0:1544742792.404120:0:65915:0:(osc_request.c:611:osc_announce_cached()) dirty: 0 undirty: 33554432 dropped 0 grant: 0

undirty (which is how much the client asks for) is 32 MB. For old pre-GRANT_PARAM clients we've got the following:

	if (!exp_grant_param_supp(exp))
		oa->o_undirty = tgt_grant_inflate(tgd, oa->o_undirty);

	static inline u64 tgt_grant_inflate(struct tg_grants_data *tgd, u64 val)
	{
		if (tgd->tgd_blockbits > COMPAT_BSIZE_SHIFT)
			/* Client does not support such large block size, grant
			 * is thus inflated. We already significantly overestimate
			 * overhead, no need to add the extent tax in this case */
			return val << (tgd->tgd_blockbits - COMPAT_BSIZE_SHIFT);
		return val;
	}

	#define COMPAT_BSIZE_SHIFT 12

and finally:

	#define o_undirty o_mode
	__u32 o_mode;

My understanding is that for blocksize=512K (2^19) and larger, tgt_grant_inflate() generates a number that overflows o_undirty, making it 0. My previous try to reproduce this failed because the blocksize used for grants is learnt at mount; it doesn't reflect subsequent changes made with zfs set recordsize=. |
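To make the overflow arithmetic concrete, here is a minimal self-contained demonstration. The constant and shift are taken from the excerpts above; grant_inflate() is a stand-in for tgt_grant_inflate(), not the Lustre source. A 32 MiB wish, inflated for any recordsize of 512 KiB or more, no longer fits in the __u32 that o_undirty aliases.

```c
/* Demonstration of the overflow described above: the pre-GRANT_PARAM
 * client's 32 MiB wish (o_undirty = 33554432), inflated by the blocksize
 * ratio, truncates to 0 when stored in the __u32 o_mode/o_undirty field. */
#include <stdio.h>
#include <stdint.h>

#define COMPAT_BSIZE_SHIFT 12	/* 4 KiB, from the excerpt above */

/* Stand-in for tgt_grant_inflate(): scale the wish by the blocksize ratio. */
static uint64_t grant_inflate(unsigned int blockbits, uint64_t val)
{
	if (blockbits > COMPAT_BSIZE_SHIFT)
		val <<= blockbits - COMPAT_BSIZE_SHIFT;
	return val;
}

int main(void)
{
	uint64_t wish = 32ULL << 20;	/* 33554432, as in the log above */
	unsigned int bits;

	for (bits = 17; bits <= 20; bits++) {	/* 128 KiB .. 1 MiB recordsize */
		uint64_t inflated = grant_inflate(bits, wish);
		uint32_t on_wire = (uint32_t)inflated;	/* o_undirty is __u32 */

		printf("recordsize 2^%u: inflated=%llu -> __u32 on the wire=%u\n",
		       bits, (unsigned long long)inflated, on_wire);
	}
	/* Output shows 2^19 (512 KiB) and 2^20 (1 MiB) truncate to 0,
	 * matching "blocksize=512K and larger" above. */
	return 0;
}
```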
| Comment by Olaf Faaland [ 20/Dec/18 ] |
|
That's right, recordsize was different on the test system than it was on the production system, and setting recordsize=1M, then stopping and starting the target, triggers the issue on the test system. Thank you Alex, thank you Patrick! |
| Comment by Alex Zhuravlev [ 21/Dec/18 ] |
|
I've just attached a quick "fix" which essentially limits the client's "wish" on the server side. |
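The attachment itself is not reproduced in this ticket, but the shape of such a server-side cap would be roughly as follows. This is a hedged sketch, not the attached patch: it assumes the fix clamps the inflated wish, and the 1 GiB ceiling is back-derived from Olaf's later "~7 TB across ~7000 clients" estimate.

```c
/* Hedged sketch of the stopgap described above -- not the attached patch.
 * Assumption: clamp the inflated per-client wish to a fixed ceiling so it
 * can never overflow the 32-bit o_undirty field. The 1 GiB value is
 * back-derived from the ~7 TB / ~7000 clients estimate later in the ticket. */
#include <stdint.h>

#define COMPAT_BSIZE_SHIFT 12
#define GRANT_WISH_MAX (1ULL << 30)	/* illustrative 1 GiB per-client cap */

static uint32_t grant_inflate_capped(unsigned int blockbits, uint64_t val)
{
	if (blockbits > COMPAT_BSIZE_SHIFT)
		val <<= blockbits - COMPAT_BSIZE_SHIFT;
	if (val > GRANT_WISH_MAX)
		val = GRANT_WISH_MAX;	/* fits safely in a __u32 */
	return (uint32_t)val;
}
```

With a clamp like this, the worked example above would yield 1 GiB instead of 0 for 512 KiB and 1 MiB recordsizes.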
| Comment by Peter Jones [ 21/Dec/18 ] |
|
ofaaland, is this kind of stopgap fix sufficient to allow you to move forward? I know that you are planning to complete your migration to 2.10.x in the new year. |
| Comment by Olaf Faaland [ 26/Dec/18 ] |
|
Peter, I think this will work. We have a couple of file systems with ~7000 clients, so this would mean unused grant would "occupy" about 7 TB per OST (~7000 clients at roughly 1 GB of capped grant apiece). The OSTs each have >400 TB free right now, so that's OK. This fix is not ideal, as Alex says, but the existing code is badly broken. Is there a reason not to put this patch through the usual system and get it landed on the 2.10 branch? |
| Comment by Alex Zhuravlev [ 27/Dec/18 ] |
|
I think the best solution is to upgrade the clients. |
| Comment by Olaf Faaland [ 27/Dec/18 ] |
Alex, I have no argument with that! But a significant number of sites still use lustre versions < 2.10 according to the last survey results I saw: So this needs to be fixed, or at least made less bad, in master. And since you're about to stop maintaining 2.10, it seems a shame to leave it this broken. |
| Comment by Olaf Faaland [ 02/Jan/19 ] |
Happy New Year! What say you? Or is there a patch in gerrit I overlooked? Thanks. |
| Comment by Gerrit Updater [ 02/Jan/19 ] |
|
Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/33948 |
| Comment by Olaf Faaland [ 02/Jan/19 ] |
|
I've pushed p0 to Gerrit as a patch against master, which I will cherry-pick from for our local build. |
| Comment by Peter Jones [ 02/Jan/19 ] |
|
Olaf, thanks for pushing the patch to master. Alex is out for the holidays, but I chatted briefly with him earlier and he confirmed that we can land this fix to master (and then to b2_10 and b2_12 afterwards). Peter |
| Comment by Olaf Faaland [ 02/Jan/19 ] |
|
Great, thank you both. |
| Comment by Gerrit Updater [ 03/Jan/19 ] |
|
Alex Zhuravlev (bzzz@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33957 |
| Comment by Gerrit Updater [ 01/Apr/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33948/ |
| Comment by Peter Jones [ 01/Apr/19 ] |
|
Landed for 2.13 |
| Comment by Gerrit Updater [ 01/Apr/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/34561 |
| Comment by Gerrit Updater [ 08/Apr/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/34561/ |