
tgt_grant_check() lsrza-OST000a: cli dfdf1aff-07d9-53b3-5632-c18a78027eb2 claims 1703936 GRANT, real grant 0

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • None
    • Lustre 2.12.4
    • brass
      zfs-0.7.11-9.4llnl.ch6.x86_64
      lustre-2.12.4_6.chaos-1.ch6.x86_64
      (other lustre clusters as well including those at lustre 2.10.8)
    • 3

    Description

      Many thousands of console log messages like this one on the lustre OSS nodes after servers were rebooted while clients stayed up:

      Jun 25 03:45:08 brass21 kernel: LustreError: 27913:0:(tgt_grant.c:758:tgt_grant_check()) lsrza-OST0010: cli ac60c141-9de9-1a2e-5d0d-fd1e525ff506 claims 1703936 GRANT, real grant 0
      Jun 25 03:45:08 brass21 kernel: LustreError: 27913:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 237 previous similar messages
      Jun 25 03:47:35 brass10 kernel: LustreError: 20031:0:(tgt_grant.c:758:tgt_grant_check()) lsrza-OST0005: cli f6897b82-71ad-5bc7-b60d-554c4cbbcdf7 claims 1703936 GRANT, real grant 0
      Jun 25 03:47:35 brass10 kernel: LustreError: 20031:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 433 previous similar messages
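      When counting these messages, the rate-limited "Skipped N previous similar messages" lines have to be tallied separately from the lines that were actually logged. A minimal sketch of such a tally follows; the regular expressions are derived only from the four sample lines above, and the function name count_grant_messages is hypothetical, not part of any Lustre tool.

```python
import re
from collections import Counter

# Matches the tgt_grant_check() console line quoted in this ticket.
GRANT_RE = re.compile(
    r"tgt_grant_check\(\)\) (?P<ost>\S+): cli (?P<client>\S+) "
    r"claims (?P<claimed>\d+) GRANT, real grant (?P<real>\d+)"
)
# Matches the rate-limiting follow-up line.
SKIPPED_RE = re.compile(
    r"tgt_grant_check\(\)\) Skipped (?P<n>\d+) previous similar messages"
)


def count_grant_messages(lines):
    """Return (logged messages per OST, total of the 'Skipped N' counts)."""
    per_ost = Counter()
    skipped = 0
    for line in lines:
        m = GRANT_RE.search(line)
        if m:
            per_ost[m.group("ost")] += 1
            continue
        m = SKIPPED_RE.search(line)
        if m:
            skipped += int(m.group("n"))
    return per_ost, skipped


sample = [
    "Jun 25 03:45:08 brass21 kernel: LustreError: 27913:0:(tgt_grant.c:758:tgt_grant_check()) lsrza-OST0010: cli ac60c141-9de9-1a2e-5d0d-fd1e525ff506 claims 1703936 GRANT, real grant 0",
    "Jun 25 03:45:08 brass21 kernel: LustreError: 27913:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 237 previous similar messages",
    "Jun 25 03:47:35 brass10 kernel: LustreError: 20031:0:(tgt_grant.c:758:tgt_grant_check()) lsrza-OST0005: cli f6897b82-71ad-5bc7-b60d-554c4cbbcdf7 claims 1703936 GRANT, real grant 0",
    "Jun 25 03:47:35 brass10 kernel: LustreError: 20031:0:(tgt_grant.c:758:tgt_grant_check()) Skipped 433 previous similar messages",
]
per_ost, skipped = count_grant_messages(sample)
print(dict(per_ost), skipped)
```

      The "Skipped" lines do not name an OST, so this sketch reports them only as a single total.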
      

      This server cluster has 4 MDTs and 18 OSTs.

      The number of these messages dropped significantly over time. Roughly, in thousands, counts per day for all of brass were:

      2020-06-24 469
      2020-06-25 417
      2020-06-26 39
      2020-06-27 27
      2020-06-28 16
      2020-06-29 19

      From what I can see, under Lustre 2.12.4 (at least) the clients all have some notion of their allocated grant, and when the server is restarted, it loses all record of the grant it allocated. The two sides then appear to sync up as clients issue new writes using grant that they were given but that the server does not know about. Eventually the clients use up that "old grant" and the two are back in sync.
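      The hypothesis above can be illustrated with a toy model. This is only a sketch of the behaviour as described in this ticket, not Lustre's actual grant accounting: the class ToyServer, the function simulate, the client name and the choice of four writes are all hypothetical, and only the 1703936-byte figure comes from the log lines.

```python
class ToyServer:
    """Toy stand-in for an OST that tracks grant per client (hypothetical)."""

    def __init__(self):
        self.grant = {}       # client id -> bytes of grant the server has on record
        self.mismatches = 0   # stands in for "claims N GRANT, real grant 0" messages

    def restart(self):
        # Assumption from the description: all grant records are lost on restart.
        self.grant.clear()

    def write(self, client, claimed):
        real = self.grant.get(client, 0)
        if claimed > real:
            self.mismatches += 1
        else:
            self.grant[client] = real - claimed


def simulate(client_grant, write_size):
    """Count mismatches while a client drains grant it held before the restart."""
    server = ToyServer()
    server.grant["client-a"] = client_grant
    server.restart()
    remaining = client_grant
    while remaining >= write_size:
        server.write("client-a", write_size)
        remaining -= write_size
    return server.mismatches


# Four writes' worth of pre-restart grant yields four mismatch messages.
print(simulate(4 * 1703936, 1703936))
```

      In this model the number of messages per client is bounded by how much pre-restart grant the client still holds, which is why a very large per-client count would be surprising.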

      The pattern above seems consistent with that. But why is the number of such messages so large?

      There are 18 OSTs, and they report 967 exports, so that works out to about (987,000 messages / 18,000 OST_client combinations) = about 55 such messages per OST_client combination. It seems strange it would take 55 writes for the grant to be synced up between an OST and a client after some disturbance like a reboot.
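      A quick check of the per-combination estimate, using only the daily counts and the OST and export figures quoted above. It assumes, as the estimate does, that every export exists on every OST. The result is on the order of tens of messages per OST/client pair.

```python
# Daily message counts for all of brass, in thousands (from the table above).
daily_thousands = {
    "2020-06-24": 469,
    "2020-06-25": 417,
    "2020-06-26": 39,
    "2020-06-27": 27,
    "2020-06-28": 16,
    "2020-06-29": 19,
}

total_messages = sum(daily_thousands.values()) * 1000
osts = 18
exports_per_ost = 967
pairs = osts * exports_per_ost  # OST/client combinations

print(total_messages)                     # 987000
print(pairs)                              # 17406
print(round(total_messages / pairs, 1))   # 56.7 with the exact pair count
print(round(total_messages / 18000, 1))   # 54.8 with the rounded 18,000
```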


          Activity

            [LU-13766] tgt_grant_check() lsrza-OST000a: cli dfdf1aff-07d9-53b3-5632-c18a78027eb2 claims 1703936 GRANT, real grant 0
            ofaaland Olaf Faaland added a comment -

            Thanks Mike


            tappro Mikhail Pershin added a comment -

            Olaf, I've found the reason for the failures; the patch should work now.

            tappro Mikhail Pershin added a comment -

            Olaf, I am working on that right now. It seems that just taking one patch from master was not enough; some other related changes are needed.

            ofaaland Olaf Faaland added a comment -

            Hi Mike,

            Are you able to look at the test failures on https://review.whamcloud.com/#/c/39386/ ?

            thanks
            ofaaland Olaf Faaland added a comment -

            Vladimir,
            No, those OSTs had >350T free each.


            vsaveliev Vladimir Saveliev added a comment -

            Olaf,

            Jun 25 03:45:08 brass21 kernel: LustreError: 27913:0:(tgt_grant.c:758:tgt_grant_check()) lsrza-OST0010: cli ac60c141-9de9-1a2e-5d0d-fd1e525ff506 claims 1703936 GRANT, real grant 0

            Were the OSTs mentioned in such messages running out of space at that time, by any chance?
            ofaaland Olaf Faaland added a comment -

            OK, good. Thanks.


            tappro Mikhail Pershin added a comment -

            I checked the new patch's tests 64e/f locally and they pass; let's see the Maloo test results.
            ofaaland Olaf Faaland added a comment -

            Thanks, Mikhail. I ported that grant patch for direct io also, and in my local test (using FSTYPE=zfs llmount.sh, and dd oflag=direct) it did not work. Unfortunately, I just got that far yesterday before I had to stop, so I don't know yet why. Our backports look the same to me. Did you test it successfully, or are you waiting for auto testing results for that?


            tappro Mikhail Pershin added a comment -

            Olaf, I've ported it to b2_12 if needed: https://review.whamcloud.com/39386

            ofaaland Olaf Faaland added a comment -

            I confirmed that I can reproduce LU-12687 under Lustre 2.12.15 (probably no surprise to you).

            People

              tappro Mikhail Pershin
              ofaaland Olaf Faaland
              Votes: 0
              Watchers: 9
