[LU-782] sanity.sh subtest test_220 failed with 3, error -107 -ENOTCONN Created: 21/Oct/11  Updated: 27/Feb/17  Resolved: 27/Feb/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Zhenyu Xu
Resolution: Cannot Reproduce Votes: 0
Labels: None

Attachments: debuglog-LU-782.tar.xz, issue-20120702.tar.gz
Severity: 3
Rank (Obsolete): 4352

 Description   

This issue was created by maloo for Chris Gearing <chris@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/af3669de-f9cd-11e0-b486-52540025f9af.

The sub-test test_220 failed with the following error:

test_220 failed with 3

Info required for matching: sanity 220



 Comments   
Comment by Peter Jones [ 31/Oct/11 ]

Bobijam

Could you please look into this regular failure? Note the comments from Tappro on LU-748 and the work done in this area by Andreas on LU-614.

Thanks

Peter

Comment by Zhenyu Xu [ 01/Nov/11 ]

Another hit: https://maloo.whamcloud.com/sub_tests/ec9a1b44-f451-11e0-908b-52540025f9af

For an unknown reason, while creating an object, the MDS reports ENOTCONN:

(lov_request.c:569:lov_update_create_set()) error creating fid 0x78d0 sub-object on OST idx 0/1: rc = -107

But I haven't found any evidence of a network error between MDS0 and OST0.

Comment by Zhenyu Xu [ 02/Nov/11 ]

Couldn't find the cause of this -107 error; a -1 debug log is needed to reveal more details.

Comment by Peter Jones [ 25/Nov/11 ]

Sarah

Could you please try and repeat this test manually and gather the -1 debug log?

Thanks

Peter
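
For reference, a minimal sketch of how such a -1 debug log could be gathered while re-running the subtest manually (the test-framework invocation and log paths below are assumptions; adjust to the local setup):

# On each node involved (client, MDS, OSS): enable full debug and clear old entries
lctl set_param debug=-1
lctl set_param debug_mb=512          # enlarge the debug buffer so entries are not lost
lctl clear

# On the client: re-run just this subtest from the Lustre test framework
ONLY=220 sh sanity.sh

# Afterwards, dump the debug buffer on each node
lctl dk > /tmp/lustre-debug.$(hostname).log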

Comment by Sarah Liu [ 27/Nov/11 ]

Sure, will keep you updated.

Comment by nasf (Inactive) [ 22/Dec/11 ]

Another failure instance:

https://maloo.whamcloud.com/test_sets/7d12e158-2c63-11e1-ab2e-5254004bbbd3

Comment by Sebastien Piechurski [ 02/Jul/12 ]

We seem to hit the same issue.
File creation onto specific OSTs fails after 50 seconds. The next available OST is then assigned to the file, with the MDS reporting the "error creating fid 0xXX sub-object on OST idx XX/1: rc = -107" message.
The failing OSTs are not always the same, but the condition persists long enough to reproduce the issue consistently for several minutes.
I will soon attach full debug logs from the client, the MDS and the OSS serving the failing OST during a 'lfs setstripe -c 1 -o 3 toto' which shows the problem.
This looks like a communication problem between the MDS and OSS, but communication with other OSTs on the same OSS does not cause any problem.

In the MDS debug log, Request x1405820398383011 is never seen as "Sent" and also never appears in the OSS debug logs.
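
A hedged sketch of how the MDS-to-OST connection state could be checked while the problem is occurring (the device name and NID below are placeholders, not confirmed values from this report, and the parameter names assume a 2.x MDS):

# LNET-level reachability of the OSS from the MDS
lctl ping 10.4.72.8@o2ib

# Import state of the failing OST as seen by the MDS-side OSC device
lctl get_param osc.scratch-OST0003-osc-MDT0000.import
lctl get_param osc.scratch-OST0003-osc-MDT0000.state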

Comment by Sebastien Piechurski [ 02/Jul/12 ]

Full debug log of lfs setstripe -c 1 -o 3 toto on client, MDS and OSS.

Comment by Sebastien Piechurski [ 06/Jul/12 ]

A crashdump of the MDS in this state is now available if needed.

Comment by Zhenyu Xu [ 08/Jul/12 ]

Is it easy to reproduce? If so, would you mind collecting a -1 debug log for it?

Comment by Sebastien Piechurski [ 09/Jul/12 ]

This is easy to reproduce in the current state of the cluster.
The tarball I uploaded earlier (issue-20120702.tar.gz) already contains a -1 debug log for it from the client, MDS and OSS.
The debug log on the MDS mentions an RPC which times out after 50s, but this RPC number was never logged as "Sent", and the OSS log never mentions it as "Received".
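
A hedged sketch of one way to confirm that from the captured logs (the file names are hypothetical; the xid is the one quoted earlier in this ticket):

# Count and show occurrences of the stuck xid in each node's -1 debug dump
for f in lustre-debug.client lustre-debug.mds lustre-debug.oss; do
    echo "== $f =="
    grep -c "x1405820398383011" "$f"
    grep "x1405820398383011" "$f" | head -n 5
done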

Comment by Zhenyu Xu [ 09/Jul/12 ]

I think the log does not cover the life cycle of request x1405820398383011: its first appearance on the MDS is its timeout event. The log shows it timed out, and it appears to be a no-resend request, so it most likely goes through the following path.

ptlrpc_check_set():

                if (ptlrpc_import_delay_req(imp, req, &status)) {
                        /* put on delay list - only if we wait
                         * recovery finished - before send */
                        cfs_list_del_init(&req->rq_list);
                        cfs_list_add_tail(&req->rq_list,
                                          &imp->imp_delayed_list);
                        cfs_spin_unlock(&imp->imp_lock);
                        continue;
                }

                if (status != 0) {
                        req->rq_status = status;
                        ptlrpc_rqphase_move(req, RQ_PHASE_INTERPRET);
                        cfs_spin_unlock(&imp->imp_lock);
                        GOTO(interpret, req->rq_status);
                }

                if (ptlrpc_no_resend(req) && !req->rq_wait_ctx) {        /* <-- HERE */
                        req->rq_status = -ENOTCONN;
                        ptlrpc_rqphase_move(req, RQ_PHASE_INTERPRET);
                        cfs_spin_unlock(&imp->imp_lock);
                        GOTO(interpret, req->rq_status);
                }

So the request never got sent. Is it possible to get a log covering the life cycle of the -107 request?

P.S. What Lustre version are you using?

Comment by Sebastien Piechurski [ 09/Jul/12 ]

Here is a new debug log session.
The MDS is helios17, the OST used is scratch-OST000d (mounted on helios26), and the client is helios86.
Their NIDs are respectively 10.4.72.51@o2ib, 10.4.72.8@o2ib and 10.4.67.2@o2ib.

The request number leading to the -107 is x1406627952493697, and once again it only appears once it times out on the MDS. I left 5 seconds between activating the debug trace and triggering the issue, so the whole life cycle of the RPC should appear.

We are using our internal Bull version Bull.2.213, which is based on 2.1.1 plus a few patches, including ORNL-22 and LU-1144.
The introduced ptlrpc_bind_policy parameter is set to 3.

We should soon test the behavior with ptlrpc_bind_policy=1 to "revert" to the old code path.
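
A hedged sketch of how that could be toggled, assuming the patch exposes ptlrpc_bind_policy as an ordinary module parameter (the exact path depends on the ORNL-22 patch and may differ):

# Current value (3 on this system, per the comment above)
cat /sys/module/ptlrpc/parameters/ptlrpc_bind_policy

# Fall back to the old code path for testing
echo 1 > /sys/module/ptlrpc/parameters/ptlrpc_bind_policy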

Comment by Zhenyu Xu [ 09/Jul/12 ]

Strangely, the birth of the request still escaped the trace:

$ grep ffff880bcb72f800 lustre-debug-20120709.helios17

154252:00000100:00000040:11.0:1341848952.925668:0:5006:0:(client.c:1794:ptlrpc_expire_one_request()) @@@ Request x1406627952493697 sent from scratch-OST000d-osc-MDT0000 to NID 10.4.72.8@o2ib has timed out for sent delay: [sent 1341848902] [real_sent 0] [current 1341848952] [deadline 50s] [delay 0s] req@ffff880bcb72f800 x1406627952493697/t0(0) o-1->scratch-OST000d_UUID@10.4.72.8@o2ib:0/0 lens 0/0 e 0 to 1 dl 1341848952 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1

154265:00000100:00000040:11.0:1341848952.925691:0:5006:0:(lustre_net.h:1675:ptlrpc_rqphase_move()) @@@ move req "Rpc" -> "Interpret" req@ffff880bcb72f800 x1406627952493697/t0(0) o-1->scratch-OST000d_UUID@10.4.72.8@o2ib:0/0 lens 0/0 e 0 to 1 dl 1341848952 ref 1 fl Rpc:XN/ffffffff/ffffffff rc -107/-1

154276:00000100:00000001:11.0:1341848952.937088:0:5006:0:(client.c:2440:ptlrpc_request_addref()) Process leaving (rc=18446612182972168192 : -131890737383424 : ffff880bcb72f800)

154277:00000100:00000040:11.0:1341848952.937089:0:5006:0:(lustre_net.h:1675:ptlrpc_rqphase_move()) @@@ move req "Interpret" -> "Complete" req@ffff880bcb72f800 x1406627952493697/t0(0) o-1->scratch-OST000d_UUID@10.4.72.8@o2ib:0/0 lens 0/0 e 0 to 1 dl 1341848952 ref 2 fl Interpret:XN/ffffffff/ffffffff rc -107/-1

154279:00000100:00000040:11.0:1341848952.937093:0:5006:0:(client.c:2201:__ptlrpc_req_finished()) @@@ refcount now 1 req@ffff880bcb72f800 x1406627952493697/t0(0) o-1->scratch-OST000d_UUID@10.4.72.8@o2ib:0/0 lens 0/0 e 0 to 1 dl 1341848952 ref 2 fl Complete:XN/ffffffff/ffffffff rc -107/-1

154285:00000100:00000040:11.0:1341848952.937101:0:5006:0:(client.c:2201:__ptlrpc_req_finished()) @@@ refcount now 0 req@ffff880bcb72f800 x1406627952493697/t0(0) o-1->scratch-OST000d_UUID@10.4.72.8@o2ib:0/0 lens 0/0 e 0 to 1 dl 1341848952 ref 1 fl Complete:XN/ffffffff/ffffffff rc -107/-1

154290:00000100:00000010:11.0:1341848952.937107:0:5006:0:(client.c:2166:__ptlrpc_free_req()) kfreed 'request': 968 at ffff880bcb72f800.

...

Comment by Andreas Dilger [ 27/Feb/17 ]

Close old bug.
