[LU-3698] sec.c:1060:sptlrpc_cli_unwrap_reply() ASSERTION( req->rq_repdata == ((void *)0) ) failed Created: 05/Aug/13  Updated: 24/Sep/13  Resolved: 24/Sep/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.0
Fix Version/s: Lustre 2.5.0

Type: Bug Priority: Critical
Reporter: Keith Mannthey (Inactive) Assignee: Keith Mannthey (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Attachments: File iwc1-console    
Severity: 3
Rank (Obsolete): 9544

 Description   

As seen on a Hyperion client iwc1

There was serious memory pressure; lots of allocations were failing.

Build:
http://build.whamcloud.com/job/lustre-master/1594/

Lustre: Lustre: Build Version: jenkins-arch=x86_64,build_type=client,distro=el6,ib_stack=inkernel-1594-gbdf591f-PRISTINE-2.6.32-358.11.1.el6.x86_64
2013-08-05 10:27:53 LustreError: 84566:0:(osc_request.c:2161:osc_build_rpc()) prep_req failed: -12
2013-08-05 10:27:53 LustreError: 84566:0:(osc_request.c:2161:osc_build_rpc()) Skipped 4 previous similar messages
2013-08-05 10:27:53 LustreError: 84566:0:(osc_cache.c:2091:osc_check_rpcs()) Write request failed with -12
2013-08-05 10:27:53 LustreError: 84566:0:(osc_cache.c:2091:osc_check_rpcs()) Skipped 5 previous similar messages
2013-08-05 10:27:53 LustreError: 84566:0:(sec.c:1060:sptlrpc_cli_unwrap_reply()) ASSERTION( req->rq_repdata == ((void *)0) ) failed:
2013-08-05 10:27:53 LustreError: 84566:0:(sec.c:1060:sptlrpc_cli_unwrap_reply()) LBUG
2013-08-05 10:27:53 Pid: 84566, comm: ptlrpcd_7
2013-08-05 10:27:53
2013-08-05 10:27:53 Call Trace:
2013-08-05 10:27:53  [<ffffffffa056e895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
2013-08-05 10:27:53  [<ffffffffa056ee97>] lbug_with_loc+0x47/0xb0 [libcfs]
2013-08-05 10:27:53  [<ffffffffa0c52256>] sptlrpc_cli_unwrap_reply+0x1d6/0x240 [ptlrpc]
2013-08-05 10:27:53  [<ffffffffa0c1735f>] after_reply+0x6f/0xd90 [ptlrpc]
2013-08-05 10:27:53  [<ffffffffa057efb1>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
2013-08-05 10:27:53  [<ffffffffa0c1c941>] ptlrpc_check_set+0xfd1/0x1b20 [ptlrpc]
2013-08-05 10:27:53  [<ffffffffa0c4752b>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
2013-08-05 10:27:53  [<ffffffffa0c47a43>] ptlrpcd+0x223/0x380 [ptlrpc]
2013-08-05 10:27:53  [<ffffffff81063310>] ? default_wake_function+0x0/0x20
2013-08-05 10:27:53  [<ffffffffa0c47820>] ? ptlrpcd+0x0/0x380 [ptlrpc]
2013-08-05 10:27:53  [<ffffffff81096936>] kthread+0x96/0xa0
2013-08-05 10:27:53  [<ffffffff8100c0ca>] child_rip+0xa/0x20
2013-08-05 10:27:53  [<ffffffff810968a0>] ? kthread+0x0/0xa0
2013-08-05 10:27:53  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

There should be a crashdump for this, which I will take a look at if needed.

I will do a little more digging and further update this LU; it looks like we are not handling the ENOMEM (-12) quite right in this code path.



 Comments   
Comment by Keith Mannthey (Inactive) [ 06/Aug/13 ]

Attached iwc1-console, the kernel log from the boot.

Comment by Keith Mannthey (Inactive) [ 06/Aug/13 ]

Well, I took a quick look at the code, and I think this may be a little more complicated than a usual out-of-memory issue.

osc_check_rpcs has the following code:

                osc_object_lock(osc);
                if (osc_makes_rpc(cli, osc, OBD_BRW_WRITE)) {
                        rc = osc_send_write_rpc(env, cli, osc, pol);
                        if (rc < 0) {
                                CERROR("Write request failed with %d\n", rc);

                                /* osc_send_write_rpc failed, mostly because of
                                 * memory pressure.
                                 *
                                 * It can't break here, because if:
                                 *  - a page was submitted by osc_io_submit, so
                                 *    page locked;
                                 *  - no request in flight
                                 *  - no subsequent request
                                 * The system will be in live-lock state,
                                 * because there is no chance to call
                                 * osc_io_unplug() and osc_check_rpcs() any
                                 * more. pdflush can't help in this case,
                                 * because it might be blocked at grabbing
                                 * the page lock as we mentioned.
                                 *
                                 * Anyway, continue to drain pages. */
                                /* break; */
                        }
                }

So it seems we don't handle the ENOMEM on write?

It seems like this LBUG might be the outcome of that decision.

Comment by Doug Oucharek (Inactive) [ 08/Aug/13 ]

Jinshan: can you comment on Keith's question.

Comment by Peng Tao [ 13/Aug/13 ]

We just saw this on one of our test machines with b2_4 HEAD as well.

I took a look at the related code, and it seems the problem is that when ptl_send_rpc() fails, the req is not actually sent and the req flags are not reset, so the reply-flag check in ptlrpc_check_set() thinks the req has received a reply and goes on to process it, hitting the LBUG.

I guess something like the following should fix the LBUG, but I'm not familiar with the ptlrpc code, so I'm not entirely sure.

diff --git a/lustre/ptlrpc/client.c b/lustre/ptlrpc/client.c
index 8b96c5d..9008e62 100644
--- a/lustre/ptlrpc/client.c
+++ b/lustre/ptlrpc/client.c
@@ -1737,6 +1737,7 @@ int ptlrpc_check_set(const struct lu_env *env, struct ptlrpc_request_set *set)
                                        spin_lock(&req->rq_lock);
                                        req->rq_net_err = 1;
                                        spin_unlock(&req->rq_lock);
+                                       continue;
                                }
                                /* need to reset the timeout */
                                force_timer_recalc = 1;
Comment by Keith Mannthey (Inactive) [ 21/Aug/13 ]

Peng Tao, I agree with your idea and have created a patch for it. http://review.whamcloud.com/7411

An RPC expert will need to take a good look at this idea.

Comment by Peter Jones [ 24/Sep/13 ]

Landed for 2.5.0
