[LU-3698] sec.c:1060:sptlrpc_cli_unwrap_reply() ASSERTION( req->rq_repdata == ((void *)0) ) failed Created: 05/Aug/13 Updated: 24/Sep/13 Resolved: 24/Sep/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.5.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Keith Mannthey (Inactive) | Assignee: | Keith Mannthey (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9544 |
| Description |
|
As seen on a Hyperion client, iwc1. There was serious memory pressure; lots of allocations were failing.

Build: Lustre: Build Version: jenkins-arch=x86_64,build_type=client,distro=el6,ib_stack=inkernel-1594-gbdf591f-PRISTINE-2.6.32-358.11.1.el6.x86_64

2013-08-05 10:27:53 LustreError: 84566:0:(osc_request.c:2161:osc_build_rpc()) prep_req failed: -12
2013-08-05 10:27:53 LustreError: 84566:0:(osc_request.c:2161:osc_build_rpc()) Skipped 4 previous similar messages
2013-08-05 10:27:53 LustreError: 84566:0:(osc_cache.c:2091:osc_check_rpcs()) Write request failed with -12
2013-08-05 10:27:53 LustreError: 84566:0:(osc_cache.c:2091:osc_check_rpcs()) Skipped 5 previous similar messages
2013-08-05 10:27:53 LustreError: 84566:0:(sec.c:1060:sptlrpc_cli_unwrap_reply()) ASSERTION( req->rq_repdata == ((void *)0) ) failed:
2013-08-05 10:27:53 LustreError: 84566:0:(sec.c:1060:sptlrpc_cli_unwrap_reply()) LBUG
2013-08-05 10:27:53 Pid: 84566, comm: ptlrpcd_7
2013-08-05 10:27:53
2013-08-05 10:27:53 Call Trace:
2013-08-05 10:27:53 [<ffffffffa056e895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2013-08-05 10:27:53 [<ffffffffa056ee97>] lbug_with_loc+0x47/0xb0 [libcfs]
2013-08-05 10:27:53 [<ffffffffa0c52256>] sptlrpc_cli_unwrap_reply+0x1d6/0x240 [ptlrpc]
2013-08-05 10:27:53 [<ffffffffa0c1735f>] after_reply+0x6f/0xd90 [ptlrpc]
2013-08-05 10:27:53 [<ffffffffa057efb1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
2013-08-05 10:27:53 [<ffffffffa0c1c941>] ptlrpc_check_set+0xfd1/0x1b20 [ptlrpc]
2013-08-05 10:27:53 [<ffffffffa0c4752b>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
2013-08-05 10:27:53 [<ffffffffa0c47a43>] ptlrpcd+0x223/0x380 [ptlrpc]
2013-08-05 10:27:53 [<ffffffff81063310>] ? default_wake_function+0x0/0x20
2013-08-05 10:27:53 [<ffffffffa0c47820>] ? ptlrpcd+0x0/0x380 [ptlrpc]
2013-08-05 10:27:53 [<ffffffff81096936>] kthread+0x96/0xa0
2013-08-05 10:27:53 [<ffffffff8100c0ca>] child_rip+0xa/0x20
2013-08-05 10:27:53 [<ffffffff810968a0>] ? kthread+0x0/0xa0
2013-08-05 10:27:53 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

There should be a crashdump for this that I will take a look at if needed. I will do a little more digging and further update this LU; it looks like we are not handling the ENOMEM (-12) quite right in this code path.
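For context, the assertion in the trace is a precondition check on entry to reply unwrapping: rq_repdata must still be NULL, meaning the reply has not been processed before. Below is a minimal userspace sketch of that invariant; the struct and function names are hypothetical stand-ins, not the real ptlrpc code.

#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the request fields involved in the LBUG. */
struct req_model {
        void *rq_repdata;       /* parsed reply; NULL until first unwrap */
};

/* Models the LASSERT at sec.c:1060: unwrapping a reply is a once-only
 * step, so entering with rq_repdata already set means the same request
 * is being processed twice. */
static void unwrap_reply_model(struct req_model *req, void *raw_reply)
{
        assert(req->rq_repdata == NULL);
        req->rq_repdata = raw_reply;
}

int main(void)
{
        int raw;
        struct req_model req = { NULL };

        unwrap_reply_model(&req, &raw);  /* first unwrap: fine */
        /* a second unwrap_reply_model(&req, &raw) would trip the assert */
        return 0;
}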
| Comments |
| Comment by Keith Mannthey (Inactive) [ 06/Aug/13 ] |
|
The kernel log from the boot. |
| Comment by Keith Mannthey (Inactive) [ 06/Aug/13 ] |
|
Well, I took a quick look at the code and I think this may be a little more complicated than a usual out-of-memory issue. osc_check_rpcs() has the following code:

osc_object_lock(osc);
if (osc_makes_rpc(cli, osc, OBD_BRW_WRITE)) {
rc = osc_send_write_rpc(env, cli, osc, pol);
if (rc < 0) {
CERROR("Write request failed with %d\n", rc);
/* osc_send_write_rpc failed, mostly because of
* memory pressure.
*
* It can't break here, because if:
* - a page was submitted by osc_io_submit, so
* page locked;
* - no request in flight
* - no subsequent request
* The system will be in live-lock state,
* because there is no chance to call
* osc_io_unplug() and osc_check_rpcs() any
* more. pdflush can't help in this case,
* because it might be blocked at grabbing
* the page lock as we mentioned.
*
* Anyway, continue to drain pages. */
/* break; */
}
}
So it seems we don't handle the ENOMEM on write? This assertion failure might be the outcome of that decision.
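To illustrate the trade-off that comment describes, here is a toy userspace model (names and structure are hypothetical, not the Lustre API): breaking out of the loop on the first -ENOMEM would strand the remaining objects, while continuing at least keeps draining them, at the cost of pushing on with requests whose setup already failed.

#include <errno.h>
#include <stdio.h>

/* Hypothetical object with dirty pages queued for writeback. */
struct obj_model {
        int id;
};

static int send_write_rpc_model(struct obj_model *o)
{
        (void)o;
        return -ENOMEM;  /* simulate allocation failure under pressure */
}

static void check_rpcs_model(struct obj_model *objs, int n)
{
        for (int i = 0; i < n; i++) {
                int rc = send_write_rpc_model(&objs[i]);

                if (rc < 0) {
                        fprintf(stderr, "write on obj %d failed: %d\n",
                                objs[i].id, rc);
                        /* No break: bailing out here could leave locked
                         * pages with no in-flight RPC whose completion
                         * would re-enter this loop, i.e. the live-lock
                         * the comment warns about. */
                }
        }
}

int main(void)
{
        struct obj_model objs[] = { { 1 }, { 2 }, { 3 } };

        check_rpcs_model(objs, 3);
        return 0;
}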
| Comment by Doug Oucharek (Inactive) [ 08/Aug/13 ] |
|
Jinshan: can you comment on Keith's question?
| Comment by Peng Tao [ 13/Aug/13 ] |
|
We just saw this on one of our test machines with b2_4 HEAD as well. I took a look at the related code, and it seems the problem is that when ptl_send_rpc() fails to send the req, the req's flags are not reset; the reply-flag check in ptlrpc_check_set() then thinks the req has received a reply, goes on to process it, and hits the LBUG. I guess something like the following should fix the LBUG, but I'm not familiar with the ptlrpc code, so I'm not entirely sure.
diff --git a/lustre/ptlrpc/client.c b/lustre/ptlrpc/client.c
index 8b96c5d..9008e62 100644
--- a/lustre/ptlrpc/client.c
+++ b/lustre/ptlrpc/client.c
@@ -1737,6 +1737,7 @@ int ptlrpc_check_set(const struct lu_env *env, struct ptlrpc_request_set *set)
spin_lock(&req->rq_lock);
req->rq_net_err = 1;
spin_unlock(&req->rq_lock);
+ continue;
}
/* need to reset the timeout */
force_timer_recalc = 1;
|
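To make the effect of the one-line fix concrete, here is a toy model of the control flow (hypothetical names, not the real ptlrpc_check_set()): without the continue, a request whose resend just failed falls through to the reply check in the same loop pass, and if its flags still claim a reply arrived from an earlier attempt, the reply is unwrapped a second time and the rq_repdata assertion fires.

#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request state; only the fields discussed above. */
struct req_model {
        bool rq_replied;        /* stale flag from an earlier attempt */
        bool rq_net_err;
        void *rq_repdata;       /* non-NULL once a reply was unwrapped */
};

static int send_rpc_model(struct req_model *req)
{
        (void)req;
        return -ENOMEM;         /* resend fails under memory pressure */
}

static void check_set_model(struct req_model *req)
{
        if (send_rpc_model(req) != 0) {
                req->rq_net_err = true;
                return;         /* the patch's `continue;`: skip reply
                                 * handling for this request this pass */
        }
        if (req->rq_replied) {
                /* after_reply() -> sptlrpc_cli_unwrap_reply() path:
                 * unwrapping must be a first-time event */
                assert(req->rq_repdata == NULL);
        }
}

int main(void)
{
        struct req_model req = {
                .rq_replied = true,     /* stale state from a prior pass */
                .rq_repdata = &req,     /* already unwrapped once */
        };

        check_set_model(&req);  /* with the early return, no assert */
        return 0;
}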
| Comment by Keith Mannthey (Inactive) [ 21/Aug/13 ] |
|
Peng Tao, I agree with your idea and have created a patch for it: http://review.whamcloud.com/7411

An RPC expert will need to take a good look at this idea.
| Comment by Peter Jones [ 24/Sep/13 ] |
|
Landed for 2.5.0 |