LU-10055: mdt_fill_lvbo() message spew on MDS console

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.11.0, Lustre 2.10.7
    • Affects Version/s: Lustre 2.10.0, Lustre 2.10.1, Lustre 2.11.0
    • None

    Description

      Running the (almost) latest version of b2_10 (see LU-9983 for details), we are seeing quite a few of these on the MDS console:

      /scratch/logs/syslog/soak-8.log:Oct  1 22:37:25 soak-8 kernel: LustreError: 8097:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0000: expected 944 actual 416.
      /scratch/logs/syslog/soak-9.log:Oct  1 22:42:25 soak-9 kernel: LustreError: 2165:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 6 previous similar messages
      /scratch/logs/syslog/soak-9.log:Oct  1 22:42:25 soak-9 kernel: LustreError: 2165:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0001: expected 872 actual 416.
      /scratch/logs/syslog/soak-10.log:Oct  1 22:42:26 soak-10 kernel: LustreError: 2401:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 10 previous similar messages
      /scratch/logs/syslog/soak-10.log:Oct  1 22:42:26 soak-10 kernel: LustreError: 2401:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 872 actual 416.
      /scratch/logs/syslog/soak-10.log:Oct  1 22:42:26 soak-10 kernel: LustreError: 4181:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 872 actual 416.
      /scratch/logs/syslog/soak-10.log:Oct  1 22:44:04 soak-10 kernel: LustreError: 2351:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 872 actual 416.
      /scratch/logs/syslog/soak-9.log:Oct  1 22:44:04 soak-9 kernel: LustreError: 2351:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0001: expected 848 actual 416.
      /scratch/logs/syslog/soak-10.log:Oct  1 22:57:27 soak-10 kernel: LustreError: 4296:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 8 previous similar messages
      /scratch/logs/syslog/soak-10.log:Oct  1 22:57:27 soak-10 kernel: LustreError: 4296:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 872 actual 416.
      /scratch/logs/syslog/soak-9.log:Oct  1 22:57:27 soak-9 kernel: LustreError: 2329:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 9 previous similar messages
      /scratch/logs/syslog/soak-9.log:Oct  1 22:57:27 soak-9 kernel: LustreError: 2329:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0001: expected 800 actual 416.
      /scratch/logs/syslog/soak-9.log:Oct  1 22:59:06 soak-9 kernel: LustreError: 2357:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0001: expected 776 actual 416.
      

      Activity


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33133/
            Subject: LU-10055 mdt: use max_mdsize in reply for layout intent
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set:
            Commit: 9c57139f261292bbd20e5f304beba4e209a54a40
            sarah Sarah Liu added a comment - edited

            Hit the problem on 2.10.6-RC3 EL7.6 on soak. Most of the IOR-fpp and IOR-ssf cases failed (only these two kinds of tests were run); only about 10% passed.

            The following messages were seen on the 4 MDS nodes:

            [16142.979907] Lustre: Skipped 1 previous similar message
            [16161.065259] LustreError: 11-0: soaked-MDT0000-osp-MDT0002: operation out_update to node 192.168.1.109@o2ib failed: rc = -107
            [16161.080956] Lustre: soaked-MDT0000-osp-MDT0002: Connection to soaked-MDT0000 (at 192.168.1.109@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            [16167.921747] LustreError: 11-0: soaked-MDT0000-lwp-MDT0002: operation obd_ping to node 192.168.1.109@o2ib failed: rc = -107
            [16167.937337] Lustre: soaked-MDT0000-lwp-MDT0002: Connection to soaked-MDT0000 (at 192.168.1.109@o2ib) was lost; in progress operations using this service will wait for recovery to complete
            [16199.921160] Lustre: 12223:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1544058876/real 1544058876]  req@ffff8ac261502a00 x1619047196845600/t0(0) o400->MGC192.168.1.108@o2ib@192.168.1.109@o2ib:26/25 lens 224/224 e 0 to 1 dl 1544058883 ref 1 fl Rpc:XN/0/ffffffff rc 0/-1
            [16199.962451] Lustre: 12223:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
            [16199.976625] LustreError: 166-1: MGC192.168.1.108@o2ib: Connection to MGS (at 192.168.1.109@o2ib) was lost; in progress operations using this service will fail
            [16199.999063] LustreError: 167-0: soaked-MDT0000-lwp-MDT0002: This client was evicted by soaked-MDT0000; in progress operations using this service will fail.
            [16200.024960] Lustre: soaked-MDT0000-lwp-MDT0002: Connection restored to 192.168.1.108@o2ib (at 192.168.1.108@o2ib)
            [16200.039830] Lustre: Skipped 1 previous similar message
            [16224.998656] Lustre: Evicted from MGS (at MGC192.168.1.108@o2ib_0) after server handle changed from 0xbd1dde8c58853e94 to 0xec588ff078b4018
            [16250.512699] LustreError: 12282:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 416 actual 344.
            [16250.526220] LustreError: 12282:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 33 previous similar messages
            [17087.376275] LustreError: 12385:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 416 actual 344.
            [91933.141411] LustreError: 17081:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 416 actual 344.
            [91933.155290] LustreError: 17081:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 31 previous similar messages
            [92285.346959] LustreError: 12401:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 416 actual 344.
            [92285.360864] LustreError: 12401:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 2 previous similar messages
            [93937.039221] LustreError: 17076:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 416 actual 344.
            [93937.053175] LustreError: 17076:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 6 previous similar messages
            [94742.219984] LustreError: 12282:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 416 actual 344.
            [94742.233998] LustreError: 12282:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 1 previous similar message
            [96280.081566] LustreError: 12286:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 416 actual 344.
            [96280.095067] LustreError: 12286:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 1 previous similar message
            [96281.231050] LustreError: 12407:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 416 actual 344.
            [96281.244965] LustreError: 12407:0:(mdt_lvb.c:163:mdt_lvbo_fill()) Skipped 11 previous similar messages
            [96284.246271] LustreError: 12286:0:(mdt_lvb.c:163:mdt_lvbo_fill()) soaked-MDT0002: expected 776 actual 416.
            
            gerrit Gerrit Updater added a comment - edited

            deleted unused patch

            gerrit Gerrit Updater added a comment -

            Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33133
            Subject: LU-10055 mdt: use max_mdsize in reply for layout intent
            Project: fs/lustre-release
            Branch: b2_10
            Current Patch Set: 1
            Commit: 6e89fbc66336b20821c219fbe38de643ff925afa

            mhanafi Mahmoud Hanafi added a comment -

            Can we get a backport to 2.10.5? We are seeing this error on our 2.10.5 servers.
            pjones Peter Jones added a comment -

            Landed for 2.11


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/30004/
            Subject: LU-10055 mdt: use max_mdsize in reply for layout intent
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 4f27911cadf10d0b2fd6451569e688233eaf50d1

            tappro Mikhail Pershin added a comment -

            I don't think we need to change ldlm_handle_enqueue() for this. The problem occurs in two cases:

            1) mdt_max_mdsize is smaller than the layout size and the client packs its request without enough buffer space; in that case the request is resent with a bigger buffer. This is how the code in mdt_lvbo_fill() was originally intended to work. I don't think this case needs fixing; it produces such messages quite rarely, and only when mdt_max_mdsize is not in sync between server and client.

            2) mdt_max_mdsize is already big enough and the client knows it, but mdt_intent_layout() packs the reply buffer with a smaller size. This has nothing to do with max_mdsize on the client or server; the packed size is simply wrong, because it uses the file's current EA size even though the EA is about to be updated to a new, larger one. It is this case that produced the flood of messages in the log, because it happens every time the new EA size is bigger than the packed size.

            The patch solves case 2) by setting the reply size to max_mdsize when the layout is going to be updated, and shrinking it later. This is better than intercepting the condition in ldlm_handle_enqueue0() and expanding the buffer, because expanding is a more expensive operation than shrinking: shrinking is already part of every reply's processing, while expanding would be an exception for rare cases.
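The reserve-at-maximum-then-shrink pattern Pershin describes can be sketched in isolation. This is a minimal stand-alone model, not the actual Lustre code: `reply_buf`, `reply_reserve`, `reply_fill`, and `reply_shrink` are hypothetical names standing in for the MDT reply-buffer and mdt_lvbo_fill() logic.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an MDT reply buffer: the layout field is
 * reserved at the worst-case size up front and shrunk to fit later. */
struct reply_buf {
	size_t reserved;	/* space set aside for the layout EA */
	size_t used;		/* bytes actually filled */
	char  *data;
};

/* Reserve at max_mdsize so any layout the intent produces will fit. */
static int reply_reserve(struct reply_buf *rb, size_t max_mdsize)
{
	rb->data = malloc(max_mdsize);
	if (rb->data == NULL)
		return -1;
	rb->reserved = max_mdsize;
	rb->used = 0;
	return 0;
}

/* Analogue of the mdt_lvbo_fill() check: copy the layout if it fits,
 * otherwise report the required size (the "expected N actual M" case,
 * which forces the client to resend with a bigger buffer). */
static long reply_fill(struct reply_buf *rb, const void *lvb, size_t lvb_len)
{
	if (lvb_len > rb->reserved)
		return -(long)lvb_len;
	memcpy(rb->data, lvb, lvb_len);
	rb->used = lvb_len;
	return (long)lvb_len;
}

/* Shrink the reservation down to what was actually used; in Lustre the
 * generic reply-shrink path does this on every reply, which is why
 * over-reserving and shrinking is cheaper than growing on demand. */
static void reply_shrink(struct reply_buf *rb)
{
	rb->reserved = rb->used;
}
```

With a reservation of max_mdsize, a 416-byte layout fills and shrinks cleanly; with a too-small reservation the fill fails with the required size, which mirrors the log spew above.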

            tappro Mikhail Pershin added a comment -

            This is fixed in the following patch:

            https://review.whamcloud.com/#/c/30004/

            I will change the ticket number if the patch is refreshed.
            jay Jinshan Xiong (Inactive) added a comment - edited

            I think the major change would be in ldlm_handle_enqueue0(), where it should expand the reply buffer if it is found to be too small.

            Actually, I tend to think this has nothing to do with the client. If the reply buffer turns out to be too small on the client, the reply will be truncated and the client should be able to resend the RPC with a bigger reply buffer.
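The client-side resend behaviour Jinshan describes can be sketched as a simple retry: if the server reports that the reply did not fit, re-issue the request with the size the server asked for. These are hypothetical names, not the actual ptlrpc client code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical server: returns the reply size on success, or the
 * negated required size when the client's reply buffer is too small. */
static long server_enqueue(size_t client_buf_len, size_t layout_len)
{
	if (layout_len > client_buf_len)
		return -(long)layout_len;
	return (long)layout_len;
}

/* Hypothetical client: on a truncated reply, grow the buffer to the
 * size the server reported and resend the request once. */
static long client_enqueue(size_t buf_len, size_t layout_len)
{
	long rc = server_enqueue(buf_len, layout_len);

	if (rc < 0)
		rc = server_enqueue((size_t)-rc, layout_len);
	return rc;
}
```

The thread's eventual fix avoids this round trip in the common case by reserving the reply at max_mdsize up front; the resend path remains as a fallback for genuinely out-of-sync clients.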

            People

              Assignee: tappro Mikhail Pershin
              Reporter: adilger Andreas Dilger
              Votes: 0
              Watchers: 11