[LU-9704] ofd_grant_check() claims GRANT, real grant 0 Created: 22/Jun/17 Updated: 16/Mar/22 Resolved: 20/Nov/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | Lustre 2.12.8, Lustre 2.15.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Stephane Thiell | Assignee: | Mikhail Pershin |
| Resolution: | Fixed | Votes: | 3 |
| Labels: | None |
| Environment: |
3.10.0-514.16.1.el7_lustre.x86_64, lustre-2.9.0_srcc7-1.el7.centos.x86_64 |
|
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Hi,

After about two months of production, our new Lustre 2.9-based system (Oak) started to show the following Lustre errors in the logs on our OSS servers, just a few days ago:

Jun 22 10:45:06 oak-io1-s1 kernel: LustreError: 26807:0:(ofd_grant.c:641:ofd_grant_check()) oak-OST0014: cli 1b1c6319-5ec1-dbcc-f6e7-f85575c95c4c claims 131072 GRANT, real grant 0

Attaching an lquota subsystem debug trace from one of the OSSes and a splunk graph showing that these messages are actually new. I first thought that this could be related to

Thanks, |
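For readers unfamiliar with the message: conceptually, the server-side grant check compares the grant a client claims to be consuming in a write RPC against the grant the server has on record for that export, and logs the "claims ... GRANT, real grant ..." error when the claim exceeds the record. A minimal sketch of that idea (hypothetical names, not the actual ofd_grant_check() code):

/*
 * Illustrative sketch only -- NOT the real ofd_grant_check().
 * The struct and field names here are hypothetical.
 */
#include <stdio.h>

struct export_grant {
        unsigned long ted_grant;   /* bytes the server believes it granted */
};

static unsigned long grant_check(struct export_grant *exp,
                                 unsigned long claimed, const char *cli)
{
        if (claimed > exp->ted_grant) {
                /* Client thinks it holds more grant than the server does. */
                fprintf(stderr, "cli %s claims %lu GRANT, real grant %lu\n",
                        cli, claimed, exp->ted_grant);
                claimed = exp->ted_grant;  /* trust only the server's record */
        }
        exp->ted_grant -= claimed;         /* consume the granted space */
        return claimed;
}

In that reading, "real grant 0" would mean the server believes it has granted this client nothing at all, i.e. the two sides' grant accounting has drifted apart.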
| Comments |
| Comment by Peter Jones [ 24/Jun/17 ] |
|
Niu, could you please assist with this one? Thanks, Peter |
| Comment by Niu Yawei (Inactive) [ 26/Jun/17 ] |
|
The log has only one such message, at the very beginning of the log file. Do you have other logs that include more messages before and after this error? What about the disk usage on this OST: is it almost full or quite empty? Did any disconnect/reconnect happen when this error showed up? Thanks. |
| Comment by Stephane Thiell [ 26/Jun/17 ] |
|
Hi Niu, The error is gone, I cannot see it anymore... I found another trace that I took at almost the same time (attaching as oak_dbg_2_oak-io1-s1.log just in case). The disk usage is pretty low on all OSTs, from 1% to 31% (we recently added more OSTs, hence the disparity in OST usage). I will update the case if I see this error again. Thanks! Stephane |
| Comment by Niu Yawei (Inactive) [ 27/Jun/17 ] |
|
Looks like the 'granted' value on the client and the server became inconsistent somehow, and it happened only on one client (with one OST server). I guess such an inconsistency could be introduced by some rare defect in reconnection. Please keep us updated if you see it again; thanks in advance. |
| Comment by Brian Andrus [ 30/Jun/17 ] |
|
I can consistently have this happen with 2.9.59_15_g107b2cb by running

As I increase the size option, the number of skipped messages also increases. |
| Comment by Stephane Thiell [ 11/Aug/17 ] |
|
Hey Niu,

The error messages are back for us too, after having performed OST failovers (that might give some hint...). We're still on 2.9 though.

[2682214.198182] LustreError: 361698:0:(ofd_grant.c:641:ofd_grant_check()) oak-OST0003: cli bf793f06-a9f4-4e4a-85c9-79e10ac88ed7 claims 32768 GRANT, real grant 0
[2682214.214014] LustreError: 361698:0:(ofd_grant.c:641:ofd_grant_check()) Skipped 1647 previous similar messages
[2682817.076669] LustreError: 361674:0:(ofd_grant.c:641:ofd_grant_check()) oak-OST0003: cli bf793f06-a9f4-4e4a-85c9-79e10ac88ed7 claims 40960 GRANT, real grant 0
[2682817.092530] LustreError: 361674:0:(ofd_grant.c:641:ofd_grant_check()) Skipped 748 previous similar messages
[2683417.826667] LustreError: 361684:0:(ofd_grant.c:641:ofd_grant_check()) oak-OST0011: cli 7b5cc9f0-1221-60e2-0871-2105cd878e21 claims 32768 GRANT, real grant 0
[2683417.842515] LustreError: 361684:0:(ofd_grant.c:641:ofd_grant_check()) Skipped 760 previous similar messages
[2684036.203234] LustreError: 156433:0:(ofd_grant.c:641:ofd_grant_check()) oak-OST0011: cli 875f290d-d0a2-a84a-c1f4-05a972b31461 claims 847872 GRANT, real grant 0
[2684036.219161] LustreError: 156433:0:(ofd_grant.c:641:ofd_grant_check()) Skipped 736 previous similar messages
[2684646.921608] LustreError: 156521:0:(ofd_grant.c:641:ofd_grant_check()) oak-OST0025: cli 0e6972c3-d679-e536-aa5f-ceff2d42a2c4 claims 28672 GRANT, real grant 0
[2684646.937477] LustreError: 156521:0:(ofd_grant.c:641:ofd_grant_check()) Skipped 917 previous similar messages |
| Comment by Artem Blagodarenko (Inactive) [ 08/Dec/17 ] |
|
Faced this error message on one of the systems:

claims 4218880 GRANT, real grant 0

Do you have an idea where to start investigating this issue? There are other credit-related messages in the logs:

kernel: LustreError: 28855:0:(ofd_grant.c:358:ofd_grant_space_left()) ... left 2232417341440 < tot_grant 2232417341504 unstable 0 pending 0 dirty 29528064
|
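For what it's worth, the ofd_grant_space_left() message quoted above reports the accounting relation the server expects to hold: the space left on the OST should cover the total grant handed out to all clients. In the log above the two values differ by only 64 bytes, which suggests a small accounting discrepancy rather than a genuinely full OST. A minimal sketch of that check, with hypothetical names (not the actual server code):

/*
 * Illustrative sketch of the relation implied by the message
 * "left X < tot_grant Y ..." -- names are hypothetical.
 */
#include <stdio.h>

static unsigned long long grant_space_left(unsigned long long left,
                                           unsigned long long tot_grant,
                                           unsigned long long unstable,
                                           unsigned long long pending,
                                           unsigned long long dirty)
{
        if (left < tot_grant) {
                /* More grant is outstanding than space available. */
                fprintf(stderr,
                        "left %llu < tot_grant %llu unstable %llu pending %llu dirty %llu\n",
                        left, tot_grant, unstable, pending, dirty);
                return 0;       /* nothing left to grant out */
        }
        return left - tot_grant;
}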
| Comment by Peter Jones [ 08/Dec/17 ] |
|
Mike, can you please advise? Peter |
| Comment by Malcolm Haak - NCI (Inactive) [ 07/Mar/18 ] |
|
We are also seeing this issue on v2_10_2 servers with a v2_7_21 client.

[Fri Mar 2 23:12:13 2018] LustreError: 15798:0:(tgt_grant.c:733:tgt_grant_check()) gdata1b-OST0029: cli 6e512e04-c8a3-bdcc-f61e-bb0e3977633d claims 4849664 GRANT, real grant 0

Can we also get this looked at? We are downgrading the client to our stable 2.5 build so we no longer have the exp_grant_param_supp flag and fall back to the other behaviour of simply expanding the grant to match. This is a workaround, but it looks like the client is having its data dropped:

[Sat Mar 03 00:32:42 2018] LustreError: 11-0: gdata1b-OST000c-osc-ffff8807f41f1c00: operation ost_write to node 10.112.1.53@o2ib8 failed: rc = -107 |
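A rough sketch of the two behaviours Malcolm describes, under the assumption that his description is accurate; only the exp_grant_param_supp flag name comes from the comment, everything else here is made up and is not the actual tgt_grant_check() logic:

/*
 * With grant-parameter support the server complains and trusts only its
 * own record; without it, the server quietly expands its record to match
 * the client's claim.  Illustrative names only.
 */
struct grant_export {
        unsigned long ted_grant;          /* server-side grant record */
        int           grant_param_supp;   /* client advertises grant params */
};

static unsigned long grant_reconcile(struct grant_export *exp,
                                     unsigned long claimed)
{
        if (claimed > exp->ted_grant) {
                if (exp->grant_param_supp)
                        claimed = exp->ted_grant;   /* clamp and log the mismatch */
                else
                        exp->ted_grant = claimed;   /* silently expand to match */
        }
        return claimed;
}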
| Comment by James A Simmons [ 21/Mar/18 ] |
|
I'm seeing this while testing on our ARM system running 2.11 against 2.11 servers on x86 boxes. |
| Comment by Malcolm Haak - NCI (Inactive) [ 21/Mar/18 ] |
|
I'm now seeing this with v2_10_2 servers and a v2_5_43 client. Interestingly, I have the same client version on other nodes that aren't seeing the issue. Could this be MOFED-related? It's one of the few differences between the nodes. |
| Comment by Patrick Farrell (Inactive) [ 18/Apr/18 ] |
|
I'm not sure whether the grant messages and the disconnect (that's what -107 is) are related. I'm curious, has anyone explored a possible connection to

Basically, the client (with or without OBD_GRANT_PARAM, it's a long story) has several grant bugs where it can ask for more than it is allowed, or the grant calculation can overflow the 32-bit integer used to store it, resulting in weirdly small grant numbers showing up at various points. That patch should resolve all of them in a slightly brute-force manner by fixing the arithmetic and capping the resulting request at 2 GiB minus some margin.

If the problem is relatively easy to reproduce, it would be very interesting to see the effect of that patch on the clients. (Note that the patch doesn't change anything on the server: it modifies a server-side file, but only to replace a numeric constant with a #define. No behavioral change to the server code.) It would also be interesting if that patch resolved the grant message but the disconnects and lost data continued. (In fact, that's sort of what I would expect to happen if I'm right.) |
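A toy illustration of the overflow Patrick mentions; the names, cap value, and margin below are illustrative, not the client code or the patch itself. A grant request computed into a 32-bit integer can wrap and arrive at the server as a weirdly small value, which capping the request below 2 GiB avoids:

/* Illustrative sketch only -- hypothetical names and constants. */
#include <stdint.h>

#define GRANT_REQUEST_CAP  ((2ULL << 30) - (64ULL << 20))  /* 2 GiB minus a margin */

static uint64_t grant_request(uint64_t dirty_max, uint64_t per_rpc_extra)
{
        uint64_t want = dirty_max + per_rpc_extra;

        /* Without the cap, storing 'want' in a uint32_t would truncate:
         * e.g. 4 GiB + 128 KiB silently becomes 128 KiB. */
        if (want > GRANT_REQUEST_CAP)
                want = GRANT_REQUEST_CAP;
        return want;
}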
| Comment by Malcolm Haak - NCI (Inactive) [ 19/Apr/18 ] |
|
I had a look at

However, I do think there is value in testing this patch. |
| Comment by nasf (Inactive) [ 21/May/18 ] |
|
We hit similar trouble at the BGI site; the server is 2.10.2 and the client is 2.7.19.10:
|
| Comment by Stephane Thiell [ 25/Mar/19 ] |
|
I still don't know if these messages have a bad effect, but they appeared again for us on a full 2.12.0 config (servers and clients) yesterday (on March 23, 20:11:59) and this is still ongoing. I checked and this is only on one OST, with a single client uuid:

fir-io2-s1: Mar 23 20:11:59 fir-io2-s1 kernel: LustreError: 55051:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST000c: cli b312248d-5ec3-9bfd-dd3f-1b6622ac23e2 claims 2256896 GRANT, real grant 0
fir-io2-s1: Mar 23 20:11:59 fir-io2-s1 kernel: LustreError: 55051:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 68 previous similar messages
fir-io2-s1: Mar 23 20:16:59 fir-io2-s1 kernel: LustreError: 54991:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST000c: cli b312248d-5ec3-9bfd-dd3f-1b6622ac23e2 claims 4218880 GRANT, real grant 0
... ... ...
fir-io2-s1: Mar 24 22:47:52 fir-io2-s1 kernel: LustreError: 55075:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST000c: cli b312248d-5ec3-9bfd-dd3f-1b6622ac23e2 claims 2224128 GRANT, real grant 0
fir-io2-s1: Mar 24 22:47:52 fir-io2-s1 kernel: LustreError: 55075:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 40 previous similar messages

$ cat /proc/fs/lustre/obdfilter/fir-OST000c/exports/10.8.1.31@o2ib6/uuid
b312248d-5ec3-9bfd-dd3f-1b6622ac23e2

The client is sh-01-31, in production, quiet / no kernel logs at all (since Mar 23 02:52:28). On this node, I have found only one job and so a single user that is accessing the filesystem at the moment, and his quota values seem ok, I guess...

$ lfs quota -u ieisenbe /fir
Disk quotas for usr ieisenbe (uid 256460):
Filesystem kbytes quota limit grace files quota limit grace
/fir 6629070564 80530636800 80530636800 - 1844715 75000000 75000000 -
$ lfs quota -g $(id -g ieisenbe) /fir
Disk quotas for grp 254778 (gid 254778):
Filesystem kbytes quota limit grace files quota limit grace
/fir 35467077100 91268055040 91268055040 - 14727670 85000000 85000000 -
|
| Comment by James A Simmons [ 23/Nov/20 ] |
|
Is this still valid? A solution landed in 2.12. |
| Comment by Gerrit Updater [ 26/Oct/21 ] |
|
"Vladimir Saveliev <vlaidimir.saveliev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45371 |
| Comment by Vladimir Saveliev [ 26/Oct/21 ] |
|
The following scenario makes a message like "claims 28672 GRANT, real grant 0" appear:

1. client owns X grants and runs RPCs to shrink part of those |
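The rest of the scenario is truncated above, so the following is only a guess at the general shape of the divergence involved, not the actual scenario or fix; all names are made up. During a grant shrink, the client and the server update their counters at different times, and a write sent in that window can claim grant the server has already reclaimed, producing "claims N GRANT, real grant 0":

/*
 * Heavily hedged illustration: the two sides track grant separately,
 * a shrink updates them at different times, and a write racing with the
 * shrink can claim more than the server still records.
 */
struct grant_pair {
        unsigned long client_grant;   /* what the client believes it holds */
        unsigned long server_grant;   /* what the server has recorded */
};

static void server_handle_shrink(struct grant_pair *g, unsigned long bytes)
{
        g->server_grant -= bytes;     /* server reclaims grant first ... */
}

static unsigned long server_handle_write(struct grant_pair *g,
                                         unsigned long claimed)
{
        /* ... while a write already in flight still claims the old amount,
         * so 'claimed' can exceed server_grant. */
        return claimed > g->server_grant ? g->server_grant : claimed;
}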
| Comment by Gerrit Updater [ 06/Nov/21 ] |
|
"Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45474 |
| Comment by Gerrit Updater [ 17/Nov/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45474/ |
| Comment by Gerrit Updater [ 20/Nov/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45371/ |
| Comment by Peter Jones [ 20/Nov/21 ] |
|
Landed for 2.15 |