[LU-5803] This server is not able to keep up with request traffic Created: 24/Oct/14 Updated: 13/Oct/21 Resolved: 13/Oct/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.3 |
| Fix Version/s: | Lustre 2.5.4 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Christopher Morrone | Assignee: | Jian Yu |
| Resolution: | Cannot Reproduce | Votes: | 1 |
| Labels: | llnl | ||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 16270 | ||||||||
| Description |
|
The following was with our local lustre version 2.5.3-1chaos (see github.com/chaos/lustre), using the zfs osd. I am noticing that with this 2.5.3 release that every time I boot the entire server cluster, the OSS nodes all have the following block of console noise. We are always past the deadline by "-6". I hadn't seen this noise with the 2.4 branch. 2014-10-23 18:36:30 LustreError: 13a-8: Failed to get MGS log params and no local copy. 2014-10-23 18:36:41 LustreError: 7401:0:(client.c:1083:ptlrpc_import_delay_req()) @@@ send limit expired req@ffff8810142b3800 x1482806576283684/t0(0) o253->MGC10.1.1.169@o2ib9@10.1.1.1 69@o2ib9:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1 2014-10-23 18:36:41 LustreError: 7401:0:(client.c:1083:ptlrpc_import_delay_req()) Skipped 1 previous similar message 2014-10-23 18:36:42 Lustre: lcy-OST0001: Will be in recovery for at least 5:00, or until 135 clients reconnect. 2014-10-23 18:37:01 LustreError: 137-5: lcy-OST0000_UUID: not available for connect from 192.168.121.102@o2ib2 (no target). If you are running an HA pair check that the target is mounted on the other server. 2014-10-23 18:37:01 LustreError: Skipped 27 previous similar messages 2014-10-23 18:37:33 Lustre: ost: This server is not able to keep up with request traffic (cpu-bound). 2014-10-23 18:37:33 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) earlyQ=2 reqQ=0 recA=0, svcEst=45, delay=0(jiff) 2014-10-23 18:37:33 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (10)? req@ ffff88081a1cb850 x1482683500276896/t0(0) o400->c6423062-7cd6-6647-06fc-f8be419a9edc@192.168.121.111@o2ib2:0/0 lens 224/0 e 1 to 0 dl 1414114647 ref 2 fl Complete:H/c0/ffffffff rc 0/-1 2014-10-23 18:37:34 Lustre: ost: This server is not able to keep up with request traffic (cpu-bound). 2014-10-23 18:37:34 Lustre: 7599:0:(service.c:1512:ptlrpc_at_check_timed()) earlyQ=1 reqQ=0 recA=0, svcEst=45, delay=0(jiff) 2014-10-23 18:37:34 Lustre: 7599:0:(service.c:1309:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (10)? req@ ffff881013dab850 x1482683530936232/t0(0) o400->393de519-5053-8c5e-d1f5-ab0fa701f469@192.168.121.140@o2ib2:0/0 lens 224/0 e 1 to 0 dl 1414114648 ref 2 fl Complete:H/c0/ffffffff rc 0/-1 2014-10-23 18:37:34 Lustre: 7599:0:(service.c:1309:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message 2014-10-23 18:37:36 Lustre: ost: This server is not able to keep up with request traffic (cpu-bound). 2014-10-23 18:37:36 Lustre: Skipped 1 previous similar message 2014-10-23 18:37:36 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) earlyQ=1 reqQ=0 recA=0, svcEst=45, delay=0(jiff) 2014-10-23 18:37:36 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) Skipped 1 previous similar message 2014-10-23 18:37:36 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (10)? req@ ffff88081a1c0050 x1482683492672308/t0(0) o400->c3d0ea3a-8781-d75b-9934-fd211665f796@192.168.121.19@o2ib2:0/0 lens 224/0 e 1 to 0 dl 1414114650 ref 2 fl Complete:H/c0/ffffffff rc 0/-1 2014-10-23 18:37:36 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) Skipped 1 previous similar message 2014-10-23 18:37:39 Lustre: ost: This server is not able to keep up with request traffic (cpu-bound). 2014-10-23 18:37:39 Lustre: Skipped 2 previous similar messages 2014-10-23 18:37:39 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) earlyQ=1 reqQ=0 recA=0, svcEst=45, delay=0(jiff) 2014-10-23 18:37:39 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) Skipped 2 previous similar messages 2014-10-23 18:37:39 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (10)? req@ ffff88081a1b5050 x1482683502391272/t0(0) o400->ea71043b-0040-df9e-f053-fc396ff5a07b@192.168.121.137@o2ib2:0/0 lens 224/0 e 1 to 0 dl 1414114653 ref 2 fl Complete:H/c0/ffffffff rc 0/-1 2014-10-23 18:37:39 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) Skipped 2 previous similar messages 2014-10-23 18:37:43 Lustre: ost: This server is not able to keep up with request traffic (cpu-bound). 2014-10-23 18:37:43 Lustre: Skipped 5 previous similar messages 2014-10-23 18:37:43 Lustre: 7599:0:(service.c:1512:ptlrpc_at_check_timed()) earlyQ=1 reqQ=0 recA=0, svcEst=45, delay=0(jiff) 2014-10-23 18:37:43 Lustre: 7599:0:(service.c:1512:ptlrpc_at_check_timed()) Skipped 5 previous similar messages 2014-10-23 18:37:43 Lustre: 7599:0:(service.c:1309:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (10)? req@ ffff881013d8a850 x1482683531137984/t0(0) o400->84c85d3d-b1c2-dfa0-5dad-84193a0250a9@192.168.121.151@o2ib2:0/0 lens 224/0 e 1 to 0 dl 1414114657 ref 2 fl Complete:H/c0/ffffffff rc 0/-1 2014-10-23 18:37:43 Lustre: 7599:0:(service.c:1309:ptlrpc_at_send_early_reply()) Skipped 7 previous similar messages 2014-10-23 18:37:52 Lustre: ost: This server is not able to keep up with request traffic (cpu-bound). 2014-10-23 18:37:52 Lustre: Skipped 28 previous similar messages 2014-10-23 18:37:52 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) earlyQ=5 reqQ=0 recA=0, svcEst=45, delay=0(jiff) 2014-10-23 18:37:52 Lustre: 7555:0:(service.c:1512:ptlrpc_at_check_timed()) Skipped 28 previous similar messages 2014-10-23 18:37:52 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-6s), not sending early reply. Consider increasing at_early_margin (10)? req@ ffff88081a20e050 x1482683498963872/t0(0) o400->770acbfa-1b07-454e-d9b5-c1becbe0bcee@192.168.121.87@o2ib2:0/0 lens 224/0 e 1 to 0 dl 1414114666 ref 2 fl Complete:H/c0/ffffffff rc 0/-1 2014-10-23 18:37:52 Lustre: 7555:0:(service.c:1309:ptlrpc_at_send_early_reply()) Skipped 84 previous similar messages |
| Comments |
| Comment by Matt Ezell [ 24/Oct/14 ] |
|
Chris- We've seen a lot of issues with recovery in 2.5.3 in production. Check out |
| Comment by Christopher Morrone [ 24/Oct/14 ] |
|
Thanks. I'll should have time to give that a try tomorrow. |
| Comment by Peter Jones [ 24/Oct/14 ] |
|
Yu, Jian Do you agree that these failures seem consistent with those seen with Peter |
| Comment by Andreas Dilger [ 24/Oct/14 ] |
|
I was also going to suggest |
| Comment by Christopher Morrone [ 24/Oct/14 ] |
|
So if I understand correctly, we are talking about this patch: master: http://review.whamcloud.com/11213/ But at least the b2_5 version has been identified as causing a new regression test failure. I am wondering if reverting 9298b92e69d095b3e7809f26d47ae22e646d55c2 , " |
| Comment by Christopher Morrone [ 24/Oct/14 ] |
|
I tried the http://review.whamcloud.com/12365 patch. It looks like we swapped one set of noisy console messages for another. See experience in |