[LU-5803] This server is not able to keep up with request traffic Created: 24/Oct/14  Updated: 13/Oct/21  Resolved: 13/Oct/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.3
Fix Version/s: Lustre 2.5.4

Type: Bug Priority: Minor
Reporter: Christopher Morrone Assignee: Jian Yu
Resolution: Cannot Reproduce Votes: 1
Labels: llnl

Issue Links:
Related
is related to LU-5079 conf-sanity test_47 timeout Resolved
Severity: 3
Rank (Obsolete): 16270

 Description   

The following was with our local lustre version 2.5.3-1chaos (see github.com/chaos/lustre), using the zfs osd.

I am noticing that with this 2.5.3 release that every time I boot the entire server cluster, the OSS nodes all have the following block of console noise. We are always past the deadline by "-6".

I hadn't seen this noise with the 2.4 branch.

2014-10-23 18:36:30 LustreError: 13a-8: Failed to get MGS log params and no local copy.
2014-10-23 18:36:41 LustreError: 7401:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff8810142b3800 x1482806576283684/t0(0) o253->MGC10.1.1.169@o2ib9@10.1.1.1
69@o2ib9:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
2014-10-23 18:36:41 LustreError: 7401:0:(client.c:1083:ptlrpc_import_delay_req()) Skipped 1 previous similar message
2014-10-23 18:36:42 Lustre: lcy-OST0001: Will be in recovery for at least 5:00, or until 135 clients reconnect.
2014-10-23 18:37:01 LustreError: 137-5: lcy-OST0000_UUID: not available for connect from 192.168.121.102@o2ib2 (no target). If you are running an HA pair check that the target is mounted
 on the other server.
2014-10-23 18:37:01 LustreError: Skipped 27 previous similar messages
2014-10-23 18:37:33 Lustre: ost: This server is not able to keep up with request traffic (cpu-bound).
2014-10-23 18:37:33 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) earlyQ=2 reqQ=0 recA=0, svcEst=45, delay=0(jiff)
2014-10-23 18:37:33 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (10)?  req@
ffff88081a1cb850 x1482683500276896/t0(0) o400->c6423062-7cd6-6647-06fc-f8be419a9edc@192.168.121.111@o2ib2:0/0 lens 224/0 e 1 to 0 dl 1414114647 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
2014-10-23 18:37:34 Lustre: ost: This server is not able to keep up with request traffic (cpu-bound).
2014-10-23 18:37:34 Lustre: 7599:0:(service.c:1512:ptlrpc_at_check_timed()) earlyQ=1 reqQ=0 recA=0, svcEst=45, delay=0(jiff)
2014-10-23 18:37:34 Lustre: 7599:0:(service.c:1309:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (10)?  req@
ffff881013dab850 x1482683530936232/t0(0) o400->393de519-5053-8c5e-d1f5-ab0fa701f469@192.168.121.140@o2ib2:0/0 lens 224/0 e 1 to 0 dl 1414114648 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
2014-10-23 18:37:34 Lustre: 7599:0:(service.c:1309:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
2014-10-23 18:37:36 Lustre: ost: This server is not able to keep up with request traffic (cpu-bound).
2014-10-23 18:37:36 Lustre: Skipped 1 previous similar message
2014-10-23 18:37:36 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) earlyQ=1 reqQ=0 recA=0, svcEst=45, delay=0(jiff)
2014-10-23 18:37:36 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) Skipped 1 previous similar message
2014-10-23 18:37:36 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (10)?  req@
ffff88081a1c0050 x1482683492672308/t0(0) o400->c3d0ea3a-8781-d75b-9934-fd211665f796@192.168.121.19@o2ib2:0/0 lens 224/0 e 1 to 0 dl 1414114650 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
2014-10-23 18:37:36 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message
2014-10-23 18:37:39 Lustre: ost: This server is not able to keep up with request traffic (cpu-bound).
2014-10-23 18:37:39 Lustre: Skipped 2 previous similar messages
2014-10-23 18:37:39 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) earlyQ=1 reqQ=0 recA=0, svcEst=45, delay=0(jiff)
2014-10-23 18:37:39 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) Skipped 2 previous similar messages
2014-10-23 18:37:39 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (10)?  req@
ffff88081a1b5050 x1482683502391272/t0(0) o400->ea71043b-0040-df9e-f053-fc396ff5a07b@192.168.121.137@o2ib2:0/0 lens 224/0 e 1 to 0 dl 1414114653 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
2014-10-23 18:37:39 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) Skipped 2 previous similar messages
2014-10-23 18:37:43 Lustre: ost: This server is not able to keep up with request traffic (cpu-bound).
2014-10-23 18:37:43 Lustre: Skipped 5 previous similar messages
2014-10-23 18:37:43 Lustre: 7599:0:(service.c:1512:ptlrpc_at_check_timed()) earlyQ=1 reqQ=0 recA=0, svcEst=45, delay=0(jiff)
2014-10-23 18:37:43 Lustre: 7599:0:(service.c:1512:ptlrpc_at_check_timed()) Skipped 5 previous similar messages
2014-10-23 18:37:43 Lustre: 7599:0:(service.c:1309:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (10)?  req@
ffff881013d8a850 x1482683531137984/t0(0) o400->84c85d3d-b1c2-dfa0-5dad-84193a0250a9@192.168.121.151@o2ib2:0/0 lens 224/0 e 1 to 0 dl 1414114657 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
2014-10-23 18:37:43 Lustre: 7599:0:(service.c:1309:ptlrpc_at_send_early_reply()) Skipped 7 previous similar messages
2014-10-23 18:37:52 Lustre: ost: This server is not able to keep up with request traffic (cpu-bound).
2014-10-23 18:37:52 Lustre: Skipped 28 previous similar messages
2014-10-23 18:37:52 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) earlyQ=5 reqQ=0 recA=0, svcEst=45, delay=0(jiff)
2014-10-23 18:37:52 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) Skipped 28 previous similar messages
2014-10-23 18:37:52 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (10)?  req@
ffff88081a20e050 x1482683498963872/t0(0) o400->770acbfa-1b07-454e-d9b5-c1becbe0bcee@192.168.121.87@o2ib2:0/0 lens 224/0 e 1 to 0 dl 1414114666 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
2014-10-23 18:37:52 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) Skipped 84 previous similar messages


 Comments   
Comment by Matt Ezell [ 24/Oct/14 ]

Chris-

We've seen a lot of issues with recovery in 2.5.3 in production. Check out LU-5724. In our testbed, we pulled in the patch from LU-5079 and have seen better results, but we aren't running that in production yet. You might give it a try to see if the problem goes away. We'd be interested in seeing your results.

Comment by Christopher Morrone [ 24/Oct/14 ]

Thanks. I'll should have time to give that a try tomorrow.

Comment by Peter Jones [ 24/Oct/14 ]

Yu, Jian

Do you agree that these failures seem consistent with those seen with LU-5079?

Peter

Comment by Andreas Dilger [ 24/Oct/14 ]

I was also going to suggest LU-5079, since that fixes a busy-wait loop on the MDS and OSS during recovery that also causes a large amount of memory consumption (LU-5077).

Comment by Christopher Morrone [ 24/Oct/14 ]

So if I understand correctly, we are talking about this patch:

master: http://review.whamcloud.com/11213/
b2_5: http://review.whamcloud.com/12365

But at least the b2_5 version has been identified as causing a new regression test failure.

I am wondering if reverting 9298b92e69d095b3e7809f26d47ae22e646d55c2 , "LU-4578 ptlrpc: Early replies need to honor at_max", might not be the better way to go.

Comment by Christopher Morrone [ 24/Oct/14 ]

I tried the http://review.whamcloud.com/12365 patch. It looks like we swapped one set of noisy console messages for another. See experience in LU-5805.

Generated at Sat Feb 10 01:54:39 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.