[LU-5537] ptlrpc_send_reply(): ASSERTION( req->rq_no_reply == 0 ) failed
Created: 22/Aug/14  Updated: 05/Jun/15  Resolved: 25/Nov/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.7.0

Type: Bug
Priority: Minor
Reporter: Li Wei (Inactive)
Assignee: Li Wei (Inactive)
Resolution: Fixed
Votes: 0
Labels: None

Severity: 3
Rank (Obsolete): 15412

 Description   

The following assertion failure was seen on an OSS:

Aug 19 17:32:08 lola-2 kernel: Lustre: ost: This server is not able to keep up with request traffic (cpu-bound).
Aug 19 17:32:08 lola-2 kernel: Lustre: 5309:0:(service.c:1509:ptlrpc_at_check_timed()) earlyQ=1 reqQ=0 recA=0, svcEst=30, delay=0(jiff)
Aug 19 17:32:08 lola-2 kernel: Lustre: 5309:0:(service.c:1306:ptlrpc_at_send_early_reply()) @@@ Already past deadline (-1s), not sending early reply. Consider increasing at_early_margin (5)?  req@ffff880415a7b050 x1476487418415744/t0(0) o400->d8ca812e-ca2b-b357-39ed-b1b134fb6dbd@192.168.1.126@o2ib1:0/0 lens 224/0 e 586846 to 0 dl 1408494727 ref 2 fl Complete:H/c0/ffffffff rc 0/-1
Aug 19 17:32:09 lola-2 kernel: Lustre: soaked-OST0000: Client 87e86655-cbf2-ba09-92c2-7853a9b2c942 (at 192.168.1.119@o2ib1) reconnecting, waiting for 14 clients in recovery for 1:27
Aug 19 17:32:09 lola-2 kernel: LustreError: 5366:0:(ldlm_lib.c:2689:target_bulk_io()) @@@ timeout on bulk GET after 0+0s  req@ffff88083a61b400 x1476486691018500/t0(4300509964) o4->8dda3382-83f8-6445-5eea-828fd59e4a06@192.168.1.116@o2ib1:0/0 lens 504/448 e 391470 to 0 dl 1408494729 ref 2 fl Complete:/4/0 rc 0/0
Aug 19 17:32:09 lola-2 kernel: LustreError: 5432:0:(niobuf.c:550:ptlrpc_send_reply()) ASSERTION( req->rq_no_reply == 0 ) failed:
Aug 19 17:32:09 lola-2 kernel: Lustre: soaked-OST0000: Bulk IO write error with 8dda3382-83f8-6445-5eea-828fd59e4a06 (at 192.168.1.116@o2ib1), client will retry: rc -110
Aug 19 17:32:09 lola-2 kernel: LustreError: 5432:0:(niobuf.c:550:ptlrpc_send_reply()) LBUG
Aug 19 17:32:09 lola-2 kernel: Pid: 5432, comm: ll_ost_io03_003
Aug 19 17:32:09 lola-2 kernel:
Aug 19 17:32:09 lola-2 kernel: Call Trace:
Aug 19 17:32:09 lola-2 kernel: [<ffffffffa0641895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
Aug 19 17:32:09 lola-2 kernel: [<ffffffffa0641e97>] lbug_with_loc+0x47/0xb0 [libcfs]
Aug 19 17:32:09 lola-2 kernel: [<ffffffffa09cda4c>] ptlrpc_send_reply+0x4ec/0x7f0 [ptlrpc]
Aug 19 17:32:09 lola-2 kernel: [<ffffffffa09d4aae>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
Aug 19 17:32:09 lola-2 kernel: [<ffffffffa09e4d75>] ptlrpc_at_check_timed+0xcd5/0x1370 [ptlrpc]
Aug 19 17:32:09 lola-2 kernel: [<ffffffffa09dc1e9>] ? ptlrpc_wait_event+0xa9/0x2d0 [ptlrpc]
Aug 19 17:32:09 lola-2 kernel: [<ffffffffa09e66f8>] ptlrpc_main+0x12e8/0x1990 [ptlrpc]
Aug 19 17:32:09 lola-2 kernel: [<ffffffff81069290>] ? pick_next_task_fair+0xd0/0x130
Aug 19 17:32:09 lola-2 kernel: [<ffffffff81529246>] ? schedule+0x176/0x3b0
Aug 19 17:32:09 lola-2 kernel: [<ffffffffa09e5410>] ? ptlrpc_main+0x0/0x1990 [ptlrpc]
Aug 19 17:32:09 lola-2 kernel: [<ffffffff8109abf6>] kthread+0x96/0xa0
Aug 19 17:32:09 lola-2 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Aug 19 17:32:09 lola-2 kernel: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
Aug 19 17:32:09 lola-2 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20

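This appears to be a race between a BRW (bulk read/write) timeout, reported by target_bulk_io() in the log above, and an attempt to send an early reply from ptlrpc_at_check_timed(), which is the caller of ptlrpc_send_reply() in the call trace.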
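
To illustrate the kind of interaction suspected here, the following is a minimal, single-threaded sketch in C. It is not Lustre code and it is not the patch from review 11740: the struct and every sketch_* function are hypothetical names invented for this illustration, and the only thing taken from the ticket is the invariant in the assertion message (a reply must not be sent once the no-reply flag is set). It assumes that one path marks a request as "no reply" after a bulk timeout while another path may still try to send an early reply for it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a request. Only the flag named in
 * the assertion message is modeled; real requests carry far more state. */
struct sketch_req {
	bool no_reply;     /* models req->rq_no_reply */
	int  replies_sent; /* counts replies, for illustration only */
};

/* Models the reply path. Like the assertion in the log, it must never be
 * entered for a request that has already been marked "no reply". */
static void sketch_send_reply(struct sketch_req *req)
{
	assert(!req->no_reply);
	req->replies_sent++;
}

/* Models a bulk I/O handler giving up on the request after a timeout. */
static void sketch_bulk_timeout(struct sketch_req *req)
{
	req->no_reply = true;
}

/* Models an early-reply attempt that checks the flag instead of assuming
 * it is clear. Returns true if a reply was sent, false if it was skipped.
 * In concurrent code this check and the send would also need to be
 * serialized against sketch_bulk_timeout(), for example under a lock;
 * that part is deliberately left out of this sketch. */
static bool sketch_send_early_reply(struct sketch_req *req)
{
	if (req->no_reply)
		return false;
	sketch_send_reply(req);
	return true;
}
```

With this model, calling sketch_send_reply() directly after sketch_bulk_timeout() trips the assert, which mirrors the failure reported above, while sketch_send_early_reply() simply skips the reply. How the real fix handles the flag is described only by the patch itself.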



 Comments   
Comment by Li Wei (Inactive) [ 03/Sep/14 ]

http://review.whamcloud.com/11740

Comment by Gerrit Updater [ 20/Nov/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/11740/
Subject: LU-5537 ptlrpc: Fix an rq_no_reply assertion failure
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: a8d448e4cd5978c546911f98067232bcdd30b651

Comment by Li Wei (Inactive) [ 24/Nov/14 ]

The patch has landed on master; resolving the issue.

Comment by Jian Yu [ 26/Nov/14 ]

Hi Li Wei,

Could you please check whether this issue exists on Lustre b2_5 or not? And if yes, could you please back-port the patch? Thank you!

Comment by Li Wei (Inactive) [ 27/Nov/14 ]

I took a closer look at b2_5 and realized the problem does not exist there.

Comment by Andriy Skulysh [ 05/Jun/15 ]

This can also happen on b2_5:

[2319692.184264] LustreError: 3415:0:(ldlm_lib.c:2724:target_bulk_io()) @@@ Reconnect on bulk PUT  req@ffff88043ea04c00 x1499139419601196/t0(0) o3->85f63be7-8ccc-f8bf-ce43-5c0b15598965@273@gni1:0/0 lens 488/432 e 0 to 0 dl 1431173956 ref 1 fl Interpret:/0/0 rc 0/0
[2319692.209262] LustreError: 3415:0:(ldlm_lib.c:2724:target_bulk_io()) Skipped 4 previous similar messages
[2319692.219701] Lustre: snx11128-OST003c: Bulk IO read error with 85f63be7-8ccc-f8bf-ce43-5c0b15598965 (at 273@gni1), client will retry: rc -110
[2319692.548327] LustreError: 65406:0:(niobuf.c:545:ptlrpc_send_reply()) ASSERTION( req->rq_no_reply == 0 ) failed:
[2319692.559511] LustreError: 65406:0:(niobuf.c:545:ptlrpc_send_reply()) LBUG
[2319692.566910] Pid: 65406, comm: ll_ost_io02_008