Details
Bug
Status: Resolved
Minor
Resolution: Fixed
Lustre 2.7.0
None
3
15412
Description
The following assertion failure was seen on an OSS:
Aug 19 17:32:08 lola-2 kernel: Lustre: ost: This server is not able to keep up with request traffic (cpu-bound).
Aug 19 17:32:08 lola-2 kernel: Lustre: 5309:0:(service.c:1509:ptlrpc_at_check_timed()) earlyQ=1 reqQ=0 recA=0, svcEst=30, delay=0(jiff)
Aug 19 17:32:08 lola-2 kernel: Lustre: 5309:0:(service.c:1306:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-1s), not sending early reply. Consider increasing at_early_margin (5)? req@ffff880415a7b050 x1476487418415744/t0(0) o400->d8ca812e-ca2b-b357-39ed-b1b134fb6dbd@192.168.1.126@o2ib1:0/0 lens 224/0 e 586846 to 0 dl 1408494727 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
Aug 19 17:32:09 lola-2 kernel: Lustre: soaked-OST0000: Client 87e86655-cbf2-ba09-92c2-7853a9b2c942 (at 192.168.1.119@o2ib1) reconnecting, waiting for 14 clients in recovery for 1:27
Aug 19 17:32:09 lola-2 kernel: LustreError: 5366:0:(ldlm_lib.c:2689:target_bulk_io()) @@@ timeout on bulk GET after 0+0s req@ffff88083a61b400 x1476486691018500/t0(4300509964) o4->8dda3382-83f8-6445-5eea-828fd59e4a06@192.168.1.116@o2ib1:0/0 lens 504/448 e 391470 to 0 dl 1408494729 ref 2 fl Complete:/4/0 rc 0/0
Aug 19 17:32:09 lola-2 kernel: LustreError: 5432:0:(niobuf.c:550:ptlrpc_send_reply()) ASSERTION( req->rq_no_reply == 0 ) failed:
Aug 19 17:32:09 lola-2 kernel: Lustre: soaked-OST0000: Bulk IO write error with 8dda3382-83f8-6445-5eea-828fd59e4a06 (at 192.168.1.116@o2ib1), client will retry: rc -110
Aug 19 17:32:09 lola-2 kernel: LustreError: 5432:0:(niobuf.c:550:ptlrpc_send_reply()) LBUG
Aug 19 17:32:09 lola-2 kernel: Pid: 5432, comm: ll_ost_io03_003
Aug 19 17:32:09 lola-2 kernel:
Aug 19 17:32:09 lola-2 kernel: Call Trace:
Aug 19 17:32:09 lola-2 kernel: [<ffffffffa0641895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Aug 19 17:32:09 lola-2 kernel: [<ffffffffa0641e97>] lbug_with_loc+0x47/0xb0 [libcfs]
Aug 19 17:32:09 lola-2 kernel: [<ffffffffa09cda4c>] ptlrpc_send_reply+0x4ec/0x7f0 [ptlrpc]
Aug 19 17:32:09 lola-2 kernel: [<ffffffffa09d4aae>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
Aug 19 17:32:09 lola-2 kernel: [<ffffffffa09e4d75>] ptlrpc_at_check_timed+0xcd5/0x1370 [ptlrpc]
Aug 19 17:32:09 lola-2 kernel: [<ffffffffa09dc1e9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
Aug 19 17:32:09 lola-2 kernel: [<ffffffffa09e66f8>] ptlrpc_main+0x12e8/0x1990 [ptlrpc]
Aug 19 17:32:09 lola-2 kernel: [<ffffffff81069290>] ? pick_next_task_fair+0xd0/0x130
Aug 19 17:32:09 lola-2 kernel: [<ffffffff81529246>] ? schedule+0x176/0x3b0
Aug 19 17:32:09 lola-2 kernel: [<ffffffffa09e5410>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
Aug 19 17:32:09 lola-2 kernel: [<ffffffff8109abf6>] kthread+0x96/0xa0
Aug 19 17:32:09 lola-2 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Aug 19 17:32:09 lola-2 kernel: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
Aug 19 17:32:09 lola-2 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
It appears to be a race between a BRW timeout and an attempt to send an early reply.
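To illustrate the suspected race, here is a minimal userspace sketch. It is not Lustre code and not necessarily the fix that landed: `toy_req`, `toy_bulk_timeout()` and `toy_send_early_reply()` are hypothetical names, and the `no_reply` field only stands in for `rq_no_reply`. The assumption is that the bulk timeout path marks the request as "no reply" while another service thread is about to send an early reply for it. In the sketch the early-reply path checks the flag under the same lock and skips the send rather than asserting.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a request; not the real ptlrpc struct. */
struct toy_req {
    pthread_mutex_t lock;          /* protects the fields below */
    bool            no_reply;      /* models rq_no_reply */
    int             early_replies; /* number of early replies sent */
};

/* Models the bulk I/O timeout path: once it runs, no reply may be sent. */
void toy_bulk_timeout(struct toy_req *req)
{
    pthread_mutex_lock(&req->lock);
    req->no_reply = true;
    pthread_mutex_unlock(&req->lock);
}

/*
 * Models the early-reply path. Rather than asserting that no_reply is
 * clear, it tests the flag under the lock and skips the send when the
 * timeout path got there first. Returns true if an early reply was sent.
 */
bool toy_send_early_reply(struct toy_req *req)
{
    bool sent = false;

    pthread_mutex_lock(&req->lock);
    if (!req->no_reply) {
        req->early_replies++;
        sent = true;
    }
    pthread_mutex_unlock(&req->lock);
    return sent;
}
```

With this ordering, an early reply requested after the timeout is dropped instead of tripping an assertion; an early reply that wins the race is sent as usual.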