Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-9704

ofd_grant_check() claims GRANT, real grant 0

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.12.8, Lustre 2.15.0
    • Lustre 2.9.0
    • None
    • 3.10.0-514.16.1.el7_lustre.x86_64, lustre-2.9.0_srcc7-1.el7.centos.x86_64
    • 3
    • 9223372036854775807

    Description

      Hi,

      After about 2 months of production, our new Lustre 2.9-based system (Oak) started to show the following Lustre errors in the logs on our OSS servers, just a few days ago:

      Jun 22 10:45:06 oak-io1-s1 kernel: LustreError: 26807:0:(ofd_grant.c:641:ofd_grant_check()) oak-OST0014: cli 1b1c6319-5ec1-dbcc-f6e7-f85575c95c4c claims 131072 GRANT, real grant 0
      
      

      Attaching an lquota subsystem debug trace on one of the OSS and a splunk graph showing that these messages are actually new. I first thought that this could be related to LU-9671 but it might not be after all.

      Thanks,
      Stephane

      Attachments

        Issue Links

          Activity

            [LU-9704] ofd_grant_check() claims GRANT, real grant 0
            pjones Peter Jones added a comment -

            Landed for 2.15

            pjones Peter Jones added a comment - Landed for 2.15

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45371/
            Subject: LU-9704 grant: ignore grant info on read resend
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 38c78ac2e390b30106f3e185d8c4d92b8cb19c2b

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45371/ Subject: LU-9704 grant: ignore grant info on read resend Project: fs/lustre-release Branch: master Current Patch Set: Commit: 38c78ac2e390b30106f3e185d8c4d92b8cb19c2b

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45474/
            Subject: LU-9704 grant: ignore grant info on read resend
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: e1d132acf58c3e6a90a527a0a09cdd0fff7fc392

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/45474/ Subject: LU-9704 grant: ignore grant info on read resend Project: fs/lustre-release Branch: b2_12 Current Patch Set: Commit: e1d132acf58c3e6a90a527a0a09cdd0fff7fc392

            "Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45474
            Subject: LU-9704 grant: ignore grant info on read resend
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 4162f0d61b762f70e5eb6d099291fad9f836de9d

            gerrit Gerrit Updater added a comment - "Mike Pershin <mpershin@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/45474 Subject: LU-9704 grant: ignore grant info on read resend Project: fs/lustre-release Branch: b2_12 Current Patch Set: 1 Commit: 4162f0d61b762f70e5eb6d099291fad9f836de9d

            The following scenario makes a message like "claims 28672 GRANT, real
            grant 0" to appear:

            1. client owns X grants and run rpcs to shrink part of those
            2. server fails over so that the shrink rpc is to be resent.
            3. on the clinet reconnect server and client sync on initial amount
            of grants for the client.
            4. shrink rpc is resend, if server disk space is enough, shrink does
            not happen and the client adds amount of grants it was going to
            shrink to its newly initial amount of grants. Now, client thinks that
            it owns more grants than it does from server points of view.
            5. the client consumes grants and sends rpcs to server. Server avoids
            allocating new grants for the client if the current amount of grant
            is big enough:
            static long tgt_grant_alloc(struct obd_export *exp, u64 curgrant,
            ...
            if (curgrant >= want || curgrant >= ted->ted_grant + chunk)
            RETURN(0);
            6. client continues grants consuming which eventually leads to
            complains like "claims 28672 GRANT, real grant 0".

            vsaveliev Vladimir Saveliev added a comment - The following scenario makes a message like "claims 28672 GRANT, real grant 0" to appear: 1. client owns X grants and run rpcs to shrink part of those 2. server fails over so that the shrink rpc is to be resent. 3. on the clinet reconnect server and client sync on initial amount of grants for the client. 4. shrink rpc is resend, if server disk space is enough, shrink does not happen and the client adds amount of grants it was going to shrink to its newly initial amount of grants. Now, client thinks that it owns more grants than it does from server points of view. 5. the client consumes grants and sends rpcs to server. Server avoids allocating new grants for the client if the current amount of grant is big enough: static long tgt_grant_alloc(struct obd_export *exp, u64 curgrant, ... if (curgrant >= want || curgrant >= ted->ted_grant + chunk) RETURN(0); 6. client continues grants consuming which eventually leads to complains like "claims 28672 GRANT, real grant 0".

            "Vladimir Saveliev <vlaidimir.saveliev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45371
            Subject: LU-9704 grant: ignore grant info on read resend
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 18332a73752be688e69ead3e8d5ac8edb7d45c82

            gerrit Gerrit Updater added a comment - "Vladimir Saveliev <vlaidimir.saveliev@hpe.com>" uploaded a new patch: https://review.whamcloud.com/45371 Subject: LU-9704 grant: ignore grant info on read resend Project: fs/lustre-release Branch: master Current Patch Set: 1 Commit: 18332a73752be688e69ead3e8d5ac8edb7d45c82

            Is this still valid? A solution landed to 2.12.

            simmonsja James A Simmons added a comment - Is this still valid? A solution landed to 2.12.

            I still don't know if these messages have a bad effect but they appeared again for us on a full 2.12.0 config (servers and clients) yesterday (on March 23, 20:11:59) and this is still ongoing.

            I checked and this is only on one OST with a single client uuid:

            fir-io2-s1: Mar 23 20:11:59 fir-io2-s1 kernel: LustreError: 55051:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST000c: cli b312248d-5ec3-9bfd-dd3f-1b6622ac23e2 claims 2256896 GRANT, real grant 0
            fir-io2-s1: Mar 23 20:11:59 fir-io2-s1 kernel: LustreError: 55051:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 68 previous similar messages
            fir-io2-s1: Mar 23 20:16:59 fir-io2-s1 kernel: LustreError: 54991:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST000c: cli b312248d-5ec3-9bfd-dd3f-1b6622ac23e2 claims 4218880 GRANT, real grant 0
            ...
            ...
            ...
            fir-io2-s1: Mar 24 22:47:52 fir-io2-s1 kernel: LustreError: 55075:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST000c: cli b312248d-5ec3-9bfd-dd3f-1b6622ac23e2 claims 2224128 GRANT, real grant 0
            fir-io2-s1: Mar 24 22:47:52 fir-io2-s1 kernel: LustreError: 55075:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 40 previous similar messages
            
            $ cat /proc/fs/lustre/obdfilter/fir-OST000c/exports/10.8.1.31@o2ib6/uuid
            b312248d-5ec3-9bfd-dd3f-1b6622ac23e2
            

            client is sh-01-31, in production, quiet / no kernel logs at all (since Mar 23 02:52:28). On this node, I have found only one job and so a single user that is accessing the filesystem at the moment, and his quota values seem ok, I guess...

            $ lfs quota -u ieisenbe /fir
            Disk quotas for usr ieisenbe (uid 256460):
                 Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
                       /fir 6629070564  80530636800 80530636800       - 1844715  75000000 75000000       -
            $ lfs quota -g $(id -g ieisenbe) /fir
            Disk quotas for grp 254778 (gid 254778):
                 Filesystem  kbytes   quota   limit   grace   files   quota   limit   grace
                       /fir 35467077100  91268055040 91268055040       - 14727670  85000000 85000000       -
            

             

            sthiell Stephane Thiell added a comment - I still don't know if these messages have a bad effect but they appeared again for us on a full 2.12.0 config (servers and clients) yesterday (on March 23, 20:11:59) and this is still ongoing. I checked and this is only on one OST with a single client uuid: fir-io2-s1: Mar 23 20:11:59 fir-io2-s1 kernel: LustreError: 55051:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST000c: cli b312248d-5ec3-9bfd-dd3f-1b6622ac23e2 claims 2256896 GRANT, real grant 0 fir-io2-s1: Mar 23 20:11:59 fir-io2-s1 kernel: LustreError: 55051:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 68 previous similar messages fir-io2-s1: Mar 23 20:16:59 fir-io2-s1 kernel: LustreError: 54991:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST000c: cli b312248d-5ec3-9bfd-dd3f-1b6622ac23e2 claims 4218880 GRANT, real grant 0 ... ... ... fir-io2-s1: Mar 24 22:47:52 fir-io2-s1 kernel: LustreError: 55075:0:(tgt_grant.c:742:tgt_grant_check()) fir-OST000c: cli b312248d-5ec3-9bfd-dd3f-1b6622ac23e2 claims 2224128 GRANT, real grant 0 fir-io2-s1: Mar 24 22:47:52 fir-io2-s1 kernel: LustreError: 55075:0:(tgt_grant.c:742:tgt_grant_check()) Skipped 40 previous similar messages $ cat /proc/fs/lustre/obdfilter/fir-OST000c/exports/10.8.1.31@o2ib6/uuid b312248d-5ec3-9bfd-dd3f-1b6622ac23e2 client is sh-01-31 , in production, quiet / no kernel logs at all (since Mar 23 02:52:28). On this node, I have found only one job and so a single user that is accessing the filesystem at the moment, and his quota values seem ok, I guess... $ lfs quota -u ieisenbe /fir Disk quotas for usr ieisenbe (uid 256460): Filesystem kbytes quota limit grace files quota limit grace /fir 6629070564 80530636800 80530636800 - 1844715 75000000 75000000 - $ lfs quota -g $(id -g ieisenbe) /fir Disk quotas for grp 254778 (gid 254778): Filesystem kbytes quota limit grace files quota limit grace /fir 35467077100 91268055040 91268055040 - 14727670 85000000 85000000 -  

            We hit similar trouble on BGI site, server is 2.10.2, client is 2.7.19.10:

            00002000:00020000:9.0:1526810613.427919:0:16020:0:(tgt_grant.c:733:tgt_grant_check()) zfsz2-OST0009: cli ab95b9a6-6564-3cd9-cdc8-75af4a40f6b9 claims 4849664 GRANT, real grant 0
            00002000:00020000:10.0:1526810619.171036:0:16033:0:(tgt_grant.c:733:tgt_grant_check()) zfsz2-OST0009: cli ab95b9a6-6564-3cd9-cdc8-75af4a40f6b9 claims 917504 GRANT, real grant 0
            00002000:00020000:0.0:1526810624.176488:0:16007:0:(tgt_grant.c:733:tgt_grant_check()) zfsz2-OST0009: cli ab95b9a6-6564-3cd9-cdc8-75af4a40f6b9 claims 917504 GRANT, real grant 0
            00002000:00020000:6.0:1526810648.251551:0:16035:0:(tgt_grant.c:733:tgt_grant_check()) zfsz2-OST0009: cli ab95b9a6-6564-3cd9-cdc8-75af4a40f6b9 claims 4849664 GRANT, real grant 0
            00002000:00020000:11.0:1526810649.207223:0:16007:0:(tgt_grant.c:733:tgt_grant_check()) zfsz2-OST0009: cli ab95b9a6-6564-3cd9-cdc8-75af4a40f6b9 claims 786432 GRANT, real grant 0
            00002000:00020000:11.0:1526810649.655100:0:16035:0:(tgt_grant.c:733:tgt_grant_check()) zfsz2-OST0009: cli ab95b9a6-6564-3cd9-cdc8-75af4a40f6b9 claims 4849664 GRANT, real grant 0

            yong.fan nasf (Inactive) added a comment - We hit similar trouble on BGI site, server is 2.10.2, client is 2.7.19.10: 00002000:00020000:9.0:1526810613.427919:0:16020:0:(tgt_grant.c:733:tgt_grant_check()) zfsz2-OST0009: cli ab95b9a6-6564-3cd9-cdc8-75af4a40f6b9 claims 4849664 GRANT, real grant 0 00002000:00020000:10.0:1526810619.171036:0:16033:0:(tgt_grant.c:733:tgt_grant_check()) zfsz2-OST0009: cli ab95b9a6-6564-3cd9-cdc8-75af4a40f6b9 claims 917504 GRANT, real grant 0 00002000:00020000:0.0:1526810624.176488:0:16007:0:(tgt_grant.c:733:tgt_grant_check()) zfsz2-OST0009: cli ab95b9a6-6564-3cd9-cdc8-75af4a40f6b9 claims 917504 GRANT, real grant 0 00002000:00020000:6.0:1526810648.251551:0:16035:0:(tgt_grant.c:733:tgt_grant_check()) zfsz2-OST0009: cli ab95b9a6-6564-3cd9-cdc8-75af4a40f6b9 claims 4849664 GRANT, real grant 0 00002000:00020000:11.0:1526810649.207223:0:16007:0:(tgt_grant.c:733:tgt_grant_check()) zfsz2-OST0009: cli ab95b9a6-6564-3cd9-cdc8-75af4a40f6b9 claims 786432 GRANT, real grant 0 00002000:00020000:11.0:1526810649.655100:0:16035:0:(tgt_grant.c:733:tgt_grant_check()) zfsz2-OST0009: cli ab95b9a6-6564-3cd9-cdc8-75af4a40f6b9 claims 4849664 GRANT, real grant 0

            I had a look at LU-10776. We are using 1MB RPC's and only 8 in flight RPC's. So I'm not sure if that is related.

            However I do think there is value in testing this patch.

            mhaakddn Malcolm Haak - NCI (Inactive) added a comment - I had a look at LU-10776 . We are using 1MB RPC's and only 8 in flight RPC's. So I'm not sure if that is related. However I do think there is value in testing this patch.

            People

              tappro Mikhail Pershin
              sthiell Stephane Thiell
              Votes:
              3 Vote for this issue
              Watchers:
              21 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: