[LU-782] sanity.sh subtest test_220 failed with 3, error -107 -ENOTCONN
Created: 21/Oct/11 | Updated: 27/Feb/17 | Resolved: 27/Feb/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Zhenyu Xu |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 4352 |
| Description |
|
This issue was created by maloo for Chris Gearing <chris@whamcloud.com>.

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/af3669de-f9cd-11e0-b486-52540025f9af. The sub-test test_220 failed with the following error:

Info required for matching: sanity 220 |
| Comments |
| Comment by Peter Jones [ 31/Oct/11 ] |
|
Bobijam, could you please look into this regular failure? Note the comments from Tappro on LU-748 and the work in this area done by Andreas on LU-614. Thanks. Peter |
| Comment by Zhenyu Xu [ 01/Nov/11 ] |
|
Another hit: https://maloo.whamcloud.com/sub_tests/ec9a1b44-f451-11e0-908b-52540025f9af

For an unknown reason, the MDS reports ENOTCONN while creating the object:

(lov_request.c:569:lov_update_create_set()) error creating fid 0x78d0 sub-object on OST idx 0/1: rc = -107

But I haven't found any evidence of a network error between MDS0 and OST0. |
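For reference, a quick sanity check of the MDS-to-OST connection in a case like this might look as follows (a sketch; the NID and OSC device pattern are illustrative, not taken from this failure):

$ lctl ping 10.4.72.8@o2ib            # LNet-level ping of the OST's NID from the MDS
$ lctl get_param osc.*OST0000*.state  # import state of the MDS-side OSC device; a healthy import shows FULL

If both look clean, as reported above, the -107 is more likely being generated locally on the MDS than by a real network failure.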
| Comment by Zhenyu Xu [ 02/Nov/11 ] |
|
I couldn't find the cause of this -107 error; a full (-1) debug log is needed to reveal more details. |
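For reference, a full (-1) debug log is typically gathered along these lines (a sketch; the buffer size and file path are illustrative):

$ lctl set_param debug=-1        # enable every debug flag
$ lctl set_param debug_mb=1024   # enlarge the trace buffer so lines are not dropped
$ lctl clear                     # empty the buffer just before reproducing
  ... reproduce the failure (e.g. run sanity.sh test_220) ...
$ lctl dk /tmp/lustre-debug.log  # dump the kernel debug buffer to a file

The same needs to be done on every node involved (client, MDS, OSS), since the interesting events can appear on any of them.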
| Comment by Peter Jones [ 25/Nov/11 ] |
|
Sarah, could you please try to repeat this test manually and gather the -1 debug log? Thanks. Peter |
| Comment by Sarah Liu [ 27/Nov/11 ] |
|
Sure, will keep you updated. |
| Comment by nasf (Inactive) [ 22/Dec/11 ] |
|
Another failure instance: https://maloo.whamcloud.com/test_sets/7d12e158-2c63-11e1-ab2e-5254004bbbd3 |
| Comment by Sebastien Piechurski [ 02/Jul/12 ] |
|
We seem to hit the same issue. In the MDS debug log, Request x1405820398383011 is never seen as "Sent" and also never appears in the OSS debug logs. |
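For reference, one way to verify that from the debug logs (a sketch; the log file names are illustrative):

$ grep x1405820398383011 mds-debug.log  # full life cycle on the MDS; a request that actually reached the wire typically leaves a "Sending RPC ..." rpctrace line
$ grep x1405820398383011 oss-debug.log  # whether the OSS ever saw this xid at all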
| Comment by Sebastien Piechurski [ 02/Jul/12 ] |
|
Attached: full debug log of "lfs setstripe -c 1 -o 3 toto" on the client, MDS and OSS. |
| Comment by Sebastien Piechurski [ 06/Jul/12 ] |
|
A crashdump of the MDS in this state is now available if needed. |
| Comment by Zhenyu Xu [ 08/Jul/12 ] |
|
Is it easy to reproduce? If so, would you mind collecting a -1 debug log for it? |
| Comment by Sebastien Piechurski [ 09/Jul/12 ] |
|
This is easy to reproduce in the current state of the cluster. |
| Comment by Zhenyu Xu [ 09/Jul/12 ] |
|
I think the log does not cover the life cycle of request x1405820398383011; its first appearance on the MDS is its timeout event. The log shows it timed out, and it seems to be a no-resend request, which mostly goes through the following path in ptlrpc_check_set():

        if (ptlrpc_import_delay_req(imp, req, &status)) {
                /* put on delay list - only if we wait
                 * recovery finished - before send */
                cfs_list_del_init(&req->rq_list);
                cfs_list_add_tail(&req->rq_list,
                                  &imp->imp_delayed_list);
                cfs_spin_unlock(&imp->imp_lock);
                continue;
        }

        if (status != 0) {
                req->rq_status = status;
                ptlrpc_rqphase_move(req, RQ_PHASE_INTERPRET);
                cfs_spin_unlock(&imp->imp_lock);
                GOTO(interpret, req->rq_status);
        }

        if (ptlrpc_no_resend(req) && !req->rq_wait_ctx) { /* <-- HERE */
                req->rq_status = -ENOTCONN;
                ptlrpc_rqphase_move(req, RQ_PHASE_INTERPRET);
                cfs_spin_unlock(&imp->imp_lock);
                GOTO(interpret, req->rq_status);
        }

In other words, the request sat waiting until its deadline expired, and because it is marked no-resend it was completed locally with -ENOTCONN rather than being resent; it was never actually sent. Is it possible to get a log covering the whole life cycle of the -107 request?

P.S. What Lustre version do you use? |
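For reference, a sketch of how the whole life cycle of one request could be captured and isolated (the xid and file name are illustrative):

$ lctl set_param debug=-1 && lctl clear       # before the request is born
  ... reproduce ...
$ lctl dk /tmp/mds-debug.log
$ grep x1405820398383011 /tmp/mds-debug.log   # every event logged for this xid, from creation to free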
| Comment by Sebastien Piechurski [ 09/Jul/12 ] |
|
Here is a new debug log session. The request number leading to the -107 is 1406627952493697, and once again it appears only when it times out on the MDS. I left 5 seconds between the activation of debug and the trigger of the issue, so the whole life cycle of the RPC should appear. We are using our internal Bull version Bull.2.213, which is based on 2.1.1 plus a few patches, including ORNL-22. We should soon test the behavior with ptlrpc_bind_policy=1 to "revert" to the old code path. |
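For reference, a module option of that kind is normally made persistent via modprobe configuration (a sketch; the parameter name is taken from the comment above, and the module and file names are assumptions):

# /etc/modprobe.d/lustre.conf -- must be in place before the Lustre modules load
options ptlrpc ptlrpc_bind_policy=1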
| Comment by Zhenyu Xu [ 09/Jul/12 ] |
|
Strangely, the birth of the request still escaped our monitoring:

$ grep ffff880bcb72f800 lustre-debug-20120709.helios17
154252:00000100:00000040:11.0:1341848952.925668:0:5006:0:(client.c:1794:ptlrpc_expire_one_request()) @@@ Request x1406627952493697 sent from scratch-OST000d-osc-MDT0000 to NID 10.4.72.8@o2ib has timed out for sent delay: [sent 1341848902] [real_sent 0] [current 1341848952] [deadline 50s] [delay 0s] req@ffff880bcb72f800 x1406627952493697/t0(0) o-1->scratch-OST000d_UUID@10.4.72.8@o2ib:0/0 lens 0/0 e 0 to 1 dl 1341848952 ref 1 fl Rpc:XN/ffffffff/ffffffff rc 0/-1
154265:00000100:00000040:11.0:1341848952.925691:0:5006:0:(lustre_net.h:1675:ptlrpc_rqphase_move()) @@@ move req "Rpc"
154276:00000100:00000001:11.0:1341848952.937088:0:5006:0:(client.c:2440:ptlrpc_request_addref()) Process leaving (rc=18446612182972168192 : -131890737383424 : ffff880bcb72f800)
154277:00000100:00000040:11.0:1341848952.937089:0:5006:0:(lustre_net.h:1675:ptlrpc_rqphase_move()) @@@ move req "Interpret"
154279:00000100:00000040:11.0:1341848952.937093:0:5006:0:(client.c:2201:__ptlrpc_req_finished()) @@@ refcount now 1 req@ffff880bcb72f800 x1406627952493697/t0(0) o-1->scratch-OST000d_UUID@10.4.72.8@o2ib:0/0 lens 0/0 e 0 to 1 dl 1341848952 ref 2 fl Complete:XN/ffffffff/ffffffff rc -107/-1
154285:00000100:00000040:11.0:1341848952.937101:0:5006:0:(client.c:2201:__ptlrpc_req_finished()) @@@ refcount now 0 req@ffff880bcb72f800 x1406627952493697/t0(0) o-1->scratch-OST000d_UUID@10.4.72.8@o2ib:0/0 lens 0/0 e 0 to 1 dl 1341848952 ref 1 fl Complete:XN/ffffffff/ffffffff rc -107/-1
154290:00000100:00000010:11.0:1341848952.937107:0:5006:0:(client.c:2166:__ptlrpc_free_req()) kfreed 'request': 968 at ffff880bcb72f800.
... |
| Comment by Andreas Dilger [ 27/Feb/17 ] |
|
Close old bug. |