Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-5803

This server is not able to keep up with request traffic

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • Lustre 2.5.4
    • Lustre 2.5.3
    • 3
    • 16270

    Description

      The following was with our local lustre version 2.5.3-1chaos (see github.com/chaos/lustre), using the zfs osd.

      I am noticing that with this 2.5.3 release that every time I boot the entire server cluster, the OSS nodes all have the following block of console noise. We are always past the deadline by "-6".

      I hadn't seen this noise with the 2.4 branch.

      2014-10-23 18:36:30 LustreError: 13a-8: Failed to get MGS log params and no local copy.
      2014-10-23 18:36:41 LustreError: 7401:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff8810142b3800 x1482806576283684/t0(0) o253->MGC10.1.1.169@o2ib9@10.1.1.1
      69@o2ib9:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      2014-10-23 18:36:41 LustreError: 7401:0:(client.c:1083:ptlrpc_import_delay_req()) Skipped 1 previous similar message
      2014-10-23 18:36:42 Lustre: lcy-OST0001: Will be in recovery for at least 5:00, or until 135 clients reconnect.
      2014-10-23 18:37:01 LustreError: 137-5: lcy-OST0000_UUID: not available for connect from 192.168.121.102@o2ib2 (no target). If you are running an HA pair check that the target is mounted
       on the other server.
      2014-10-23 18:37:01 LustreError: Skipped 27 previous similar messages
      2014-10-23 18:37:33 Lustre: ost: This server is not able to keep up with request traffic (cpu-bound).
      2014-10-23 18:37:33 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) earlyQ=2 reqQ=0 recA=0, svcEst=45, delay=0(jiff)
      2014-10-23 18:37:33 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (10)?  req@
      ffff88081a1cb850 x1482683500276896/t0(0) o400->c6423062-7cd6-6647-06fc-f8be419a9edc@192.168.121.111@o2ib2:0/0 lens 224/0 e 1 to 0 dl 1414114647 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
      2014-10-23 18:37:34 Lustre: ost: This server is not able to keep up with request traffic (cpu-bound).
      2014-10-23 18:37:34 Lustre: 7599:0:(service.c:1512:ptlrpc_at_check_timed()) earlyQ=1 reqQ=0 recA=0, svcEst=45, delay=0(jiff)
      2014-10-23 18:37:34 Lustre: 7599:0:(service.c:1309:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (10)?  req@
      ffff881013dab850 x1482683530936232/t0(0) o400->393de519-5053-8c5e-d1f5-ab0fa701f469@192.168.121.140@o2ib2:0/0 lens 224/0 e 1 to 0 dl 1414114648 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
      2014-10-23 18:37:34 Lustre: 7599:0:(service.c:1309:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
      2014-10-23 18:37:36 Lustre: ost: This server is not able to keep up with request traffic (cpu-bound).
      2014-10-23 18:37:36 Lustre: Skipped 1 previous similar message
      2014-10-23 18:37:36 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) earlyQ=1 reqQ=0 recA=0, svcEst=45, delay=0(jiff)
      2014-10-23 18:37:36 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) Skipped 1 previous similar message
      2014-10-23 18:37:36 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (10)?  req@
      ffff88081a1c0050 x1482683492672308/t0(0) o400->c3d0ea3a-8781-d75b-9934-fd211665f796@192.168.121.19@o2ib2:0/0 lens 224/0 e 1 to 0 dl 1414114650 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
      2014-10-23 18:37:36 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
      2014-10-23 18:37:39 Lustre: ost: This server is not able to keep up with request traffic (cpu-bound).
      2014-10-23 18:37:39 Lustre: Skipped 2 previous similar messages
      2014-10-23 18:37:39 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) earlyQ=1 reqQ=0 recA=0, svcEst=45, delay=0(jiff)
      2014-10-23 18:37:39 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) Skipped 2 previous similar messages
      2014-10-23 18:37:39 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (10)?  req@
      ffff88081a1b5050 x1482683502391272/t0(0) o400->ea71043b-0040-df9e-f053-fc396ff5a07b@192.168.121.137@o2ib2:0/0 lens 224/0 e 1 to 0 dl 1414114653 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
      2014-10-23 18:37:39 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) Skipped 2 previous similar messages
      2014-10-23 18:37:43 Lustre: ost: This server is not able to keep up with request traffic (cpu-bound).
      2014-10-23 18:37:43 Lustre: Skipped 5 previous similar messages
      2014-10-23 18:37:43 Lustre: 7599:0:(service.c:1512:ptlrpc_at_check_timed()) earlyQ=1 reqQ=0 recA=0, svcEst=45, delay=0(jiff)
      2014-10-23 18:37:43 Lustre: 7599:0:(service.c:1512:ptlrpc_at_check_timed()) Skipped 5 previous similar messages
      2014-10-23 18:37:43 Lustre: 7599:0:(service.c:1309:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (10)?  req@
      ffff881013d8a850 x1482683531137984/t0(0) o400->84c85d3d-b1c2-dfa0-5dad-84193a0250a9@192.168.121.151@o2ib2:0/0 lens 224/0 e 1 to 0 dl 1414114657 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
      2014-10-23 18:37:43 Lustre: 7599:0:(service.c:1309:ptlrpc_at_send_early_reply()) Skipped 7 previous similar messages
      2014-10-23 18:37:52 Lustre: ost: This server is not able to keep up with request traffic (cpu-bound).
      2014-10-23 18:37:52 Lustre: Skipped 28 previous similar messages
      2014-10-23 18:37:52 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) earlyQ=5 reqQ=0 recA=0, svcEst=45, delay=0(jiff)
      2014-10-23 18:37:52 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) Skipped 28 previous similar messages
      2014-10-23 18:37:52 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (10)?  req@
      ffff88081a20e050 x1482683498963872/t0(0) o400->770acbfa-1b07-454e-d9b5-c1becbe0bcee@192.168.121.87@o2ib2:0/0 lens 224/0 e 1 to 0 dl 1414114666 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
      2014-10-23 18:37:52 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) Skipped 84 previous similar messages
      

      Attachments

        Issue Links

          Activity

            [LU-5803] This server is not able to keep up with request traffic

            I tried the http://review.whamcloud.com/12365 patch. It looks like we swapped one set of noisy console messages for another. See experience in LU-5805.

            morrone Christopher Morrone (Inactive) added a comment - I tried the http://review.whamcloud.com/12365 patch. It looks like we swapped one set of noisy console messages for another. See experience in LU-5805 .

            So if I understand correctly, we are talking about this patch:

            master: http://review.whamcloud.com/11213/
            b2_5: http://review.whamcloud.com/12365

            But at least the b2_5 version has been identified as causing a new regression test failure.

            I am wondering if reverting 9298b92e69d095b3e7809f26d47ae22e646d55c2 , "LU-4578 ptlrpc: Early replies need to honor at_max", might not be the better way to go.

            morrone Christopher Morrone (Inactive) added a comment - So if I understand correctly, we are talking about this patch: master: http://review.whamcloud.com/11213/ b2_5: http://review.whamcloud.com/12365 But at least the b2_5 version has been identified as causing a new regression test failure. I am wondering if reverting 9298b92e69d095b3e7809f26d47ae22e646d55c2 , " LU-4578 ptlrpc: Early replies need to honor at_max", might not be the better way to go.

            I was also going to suggest LU-5079, since that fixes a busy-wait loop on the MDS and OSS during recovery that also causes a large amount of memory consumption (LU-5077).

            adilger Andreas Dilger added a comment - I was also going to suggest LU-5079 , since that fixes a busy-wait loop on the MDS and OSS during recovery that also causes a large amount of memory consumption ( LU-5077 ).
            pjones Peter Jones added a comment -

            Yu, Jian

            Do you agree that these failures seem consistent with those seen with LU-5079?

            Peter

            pjones Peter Jones added a comment - Yu, Jian Do you agree that these failures seem consistent with those seen with LU-5079 ? Peter

            Thanks. I'll should have time to give that a try tomorrow.

            morrone Christopher Morrone (Inactive) added a comment - Thanks. I'll should have time to give that a try tomorrow.
            ezell Matt Ezell added a comment -

            Chris-

            We've seen a lot of issues with recovery in 2.5.3 in production. Check out LU-5724. In our testbed, we pulled in the patch from LU-5079 and have seen better results, but we aren't running that in production yet. You might give it a try to see if the problem goes away. We'd be interested in seeing your results.

            ezell Matt Ezell added a comment - Chris- We've seen a lot of issues with recovery in 2.5.3 in production. Check out LU-5724 . In our testbed, we pulled in the patch from LU-5079 and have seen better results, but we aren't running that in production yet. You might give it a try to see if the problem goes away. We'd be interested in seeing your results.

            People

              yujian Jian Yu
              morrone Christopher Morrone (Inactive)
              Votes:
              1 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: