LU-10055

mdt_fill_lvbo() message spew on MDS console

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.7
    • Affects Version/s: Lustre 2.10.0, Lustre 2.10.1, Lustre 2.11.0
    • Severity: 3

    Description

      While running the (almost) latest version of b2_10 (see LU-9983 for details), we are seeing quite a few of these messages on the MDS console:

      /scratch/logs/syslog/soak-8.log:Oct  1 22:37:25 soak-8 kernel: LustreError: 8097:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0000: expected 944 actual 416.
      /scratch/logs/syslog/soak-9.log:Oct  1 22:42:25 soak-9 kernel: LustreError: 2165:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 6 previous similar messages
      /scratch/logs/syslog/soak-9.log:Oct  1 22:42:25 soak-9 kernel: LustreError: 2165:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0001: expected 872 actual 416.
      /scratch/logs/syslog/soak-10.log:Oct  1 22:42:26 soak-10 kernel: LustreError: 2401:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 10 previous similar messages
      /scratch/logs/syslog/soak-10.log:Oct  1 22:42:26 soak-10 kernel: LustreError: 2401:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 872 actual 416.
      /scratch/logs/syslog/soak-10.log:Oct  1 22:42:26 soak-10 kernel: LustreError: 4181:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 872 actual 416.
      /scratch/logs/syslog/soak-10.log:Oct  1 22:44:04 soak-10 kernel: LustreError: 2351:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 872 actual 416.
      /scratch/logs/syslog/soak-9.log:Oct  1 22:44:04 soak-9 kernel: LustreError: 2351:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0001: expected 848 actual 416.
      /scratch/logs/syslog/soak-10.log:Oct  1 22:57:27 soak-10 kernel: LustreError: 4296:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 8 previous similar messages
      /scratch/logs/syslog/soak-10.log:Oct  1 22:57:27 soak-10 kernel: LustreError: 4296:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 872 actual 416.
      /scratch/logs/syslog/soak-9.log:Oct  1 22:57:27 soak-9 kernel: LustreError: 2329:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 9 previous similar messages
      /scratch/logs/syslog/soak-9.log:Oct  1 22:57:27 soak-9 kernel: LustreError: 2329:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0001: expected 800 actual 416.
      /scratch/logs/syslog/soak-9.log:Oct  1 22:59:06 soak-9 kernel: LustreError: 2357:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0001: expected 776 actual 416.
      

          Activity

            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33133
            Subject: LU-10055 mdt: use max_mdsize in reply for layout intent
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 6e89fbc66336b20821c219fbe38de643ff925afa

            gerrit Gerrit Updater added a comment

            Can we get a backport to 2.10.5? We are seeing this error on our 2.10.5 servers.

            mhanafi Mahmoud Hanafi added a comment
            Landed for 2.11

            pjones Peter Jones added a comment

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30004/
            Subject: LU-10055 mdt: use max_mdsize in reply for layout intent
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 4f27911cadf10d0b2fd6451569e688233eaf50d1

            gerrit Gerrit Updater added a comment

            I don't think we need to change ldlm_handle_enqueue() for this. This problem occurs in two cases:

            1) mdt_max_mdsize is smaller than the layout size and the client packs the request with too small a buffer. In that case there will be a resend with a bigger buffer. This is how the code in mdt_lvbo_fill() was originally intended to work. I don't think this case needs to be fixed; it produces such messages only rarely, when mdt_max_mdsize is not in sync between server and client.

            2) mdt_max_mdsize is already big enough and the client knows it, but mdt_intent_layout() packs the reply buffer with a smaller size. This has nothing to do with max_mdsize on the client and server. The wrong size is packed because it is taken from the current EA size of the file, which is about to be replaced by the new EA, so the packed size is wrong from the beginning in most cases. This is the case that produced a lot of messages in the log, because it happens every time the new EA is bigger than the packed size.

            The patch solves case 2) by setting the reply size to max_mdsize if the layout is going to be updated, and shrinking it later. This is better than intercepting the problem in ldlm_handle_enqueue0() and expanding the buffer, because expanding is a more expensive operation than shrinking. Shrinking is already part of every reply processing, while expanding is an exception for rare cases.

            tappro Mikhail Pershin added a comment
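
            The reserve-then-shrink idea from the comment above can be sketched as a small Python model. This is not Lustre code: the helper names reserve_reply_size and fill_layout are hypothetical, and the 4096-byte max_mdsize is an assumed value chosen only for illustration. The 416 and 944 byte sizes come from the first log line in the description.

```python
# Toy model of the reply sizing discussed above (not Lustre code).
# ASSUMPTION: MAX_MDSIZE is an illustrative value, not taken from the ticket.
MAX_MDSIZE = 4096


def reserve_reply_size(current_ea_size, max_mdsize, layout_will_change):
    """Bytes to reserve for the layout in the reply (hypothetical helper).

    If the layout is about to be updated, the current EA size is stale,
    so reserve the maximum and rely on shrinking afterwards.
    """
    return max_mdsize if layout_will_change else current_ea_size


def fill_layout(reserved, new_ea_size):
    """Return the reply size after shrinking, or None if the layout does not fit.

    The None case models the "expected X actual Y" message.
    """
    if new_ea_size > reserved:
        return None
    return new_ea_size  # shrink the reservation to what was actually used


# Sizing from the current EA (416 bytes) fails once the new layout is 944 bytes.
print(fill_layout(reserve_reply_size(416, MAX_MDSIZE, False), 944))
# Reserving max_mdsize up front lets the layout fit, then shrinks to 944.
print(fill_layout(reserve_reply_size(416, MAX_MDSIZE, True), 944))
```

            In this model the cost of the fix is a temporarily larger reservation, which matches the argument that shrinking is cheap and already happens for every reply.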

            It is fixed in this patch:

            https://review.whamcloud.com/#/c/30004/

            I will change the ticket number if the patch is refreshed.

            tappro Mikhail Pershin added a comment
            I think the major change would be in ldlm_handle_enqueue0(), where it should expand the reply buffer if it is found to be too small.

            Actually, I tend to think it has nothing to do with the client. If the reply buffer turns out to be too small on the client, the reply will be truncated and the client should be able to resend the RPC with a bigger reply buffer.

            jay Jinshan Xiong (Inactive) added a comment (edited)

            It would make sense for clients to just assume enough space for a PFL file to begin with, maybe 3-4 component headers in addition to the stripes in the file. That would quiet the errors on the MDS.

            Jinshan, any idea what code path this is affecting? Layout return in LVB for lock enqueue? It doesn't appear to be causing visible errors, but I'm not sure what application Cliff is running that generates this, or whether it is checking for correctness.

            adilger Andreas Dilger added a comment

            This was probably introduced by PFL, where the layout becomes larger than the size reserved for it because of component instantiation. This problem can be solved by extending the reply buffer, as we discussed before.

            jay Jinshan Xiong (Inactive) added a comment
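
            As a rough check of the "layout grew after instantiation" explanation, the shortfalls in the logged messages can be compared against a per-stripe entry size. The sketch below is only arithmetic on the numbers from the description. The 24-byte entry size is an assumption about the on-disk format that is not stated anywhere in this ticket, and the function name is hypothetical.

```python
# (expected, actual) byte counts copied from the log lines in the description.
LOGGED = [(944, 416), (872, 416), (848, 416), (800, 416), (776, 416)]

# ASSUMPTION: one per-stripe object entry occupies 24 bytes in the layout EA.
STRIPE_ENTRY_BYTES = 24


def extra_stripe_entries(expected, actual, entry_bytes=STRIPE_ENTRY_BYTES):
    """Return how many whole entries make up the shortfall.

    Returns None if the shortfall is negative or not a multiple of entry_bytes.
    """
    shortfall = expected - actual
    if shortfall < 0 or shortfall % entry_bytes:
        return None
    return shortfall // entry_bytes


for expected, actual in LOGGED:
    print(expected, actual, extra_stripe_entries(expected, actual))
```

            Under that assumption every logged shortfall is a whole number of entries, which is consistent with stripe objects being added when components are instantiated, although it does not prove it.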

            Initially I thought this was a harmless message caused by the layout xattr being smaller than expected, but in fact it is the reverse. The LVB buffer is not large enough for the xattr being read from the file.

            I suspect that this is caused by PFL and doesn't [...], as it has also been seen in previous testing (LU-9825).

            adilger Andreas Dilger added a comment

            People

              Assignee: tappro Mikhail Pershin
              Reporter: adilger Andreas Dilger