[LU-9793] sanity test 244 fail Created: 24/Jul/17 Updated: 13/Feb/19 Resolved: 21/Nov/18 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0, Lustre 2.12.0 |
| Fix Version/s: | Lustre 2.12.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | ZhangWei | Assignee: | Zhenyu Xu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 7.3 |
||
| Issue Links: |
|
| Epic/Theme: | ldlm, test |
| Severity: | 2 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Sanity test 244 fail, the fail log is: sendfile_grouplock: sendfile_grouplock.c:257: sendfile_copy: assertion 'sret > 0' failed: sendfile failed: Input/output error
sanity test_244: @@@@@@ FAIL: sendfile+grouplock failed
Trace dump:
= /lib64/lustre/tests/test-framework.sh:4851:error()
= /lib64/lustre/tests/sanity.sh:13830:test_244()
......
In the client kernel log there is an LDLM error:
LustreError: 11-0: lustre-OST0000-osc-ffff88085169e000: operation ldlm_enqueue to node 10.xxx.x.201@o2ib failed: rc = -107
The OST kernel log shows the following error:
[598085.271427] LustreError: 49547:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.xxx.x.192@o2ib) returned error from glimpse AST (req@ffff801fbf2e7800 x1573189961819552 status -5 rc -5), evict it ns: filter-lustre-OST0000_UUID lock: ffff801fa0d07c00/0xb727c060c669bd4b lrc: 4/0,0 mode: GROUP/GROUP res: [0x280000400:0x282:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x40000000020000 nid: 10.xxx.x.192@o2ib remote: 0xdf79275a4bf506c3 expref: 8 pid: 49675 timeout: 0 lvb_type: 0
Can someone help with this issue? This error looks similar to: https://jira.hpdd.intel.com/browse/LU-9632?jql=text%20~%20%22sanity%20244%22 |
| Comments |
| Comment by ZhangWei [ 30/Aug/17 ] |
|
I have fixed this issue. Anyone who needs help can contact me by e-mail. |
| Comment by Peter Jones [ 30/Aug/17 ] |
|
Why don't you push your patch into gerrit so that it can be landed into future Lustre releases without anyone needing to contact you? |
| Comment by ZhangWei [ 31/Aug/17 ] |
|
Hi Peter,

unsigned int lustre_errno_hton(unsigned int h)
{
	unsigned int n;

	if (h == 0) {
		n = 0;
	} else if (h < ARRAY_SIZE(lustre_errno_hton_mapping)) {
		n = lustre_errno_hton_mapping[h];
		if (n == 0)
			goto generic;
	} else {
generic:
		/*
		 * A generic errno is better than the unknown one that could
		 * mean anything to a different host.
		 */
		n = LUSTRE_EIO;
	}

	return n;
}

In this function, 303 (ELDLM_NO_LOCK_DATA) cannot be found in lustre_errno_hton_mapping[], so 303 is changed to 5 (LUSTRE_EIO). I also wonder whether there are test cases that can simulate 300 (ELDLM_LOCK_CHANGED), 301 (ELDLM_LOCK_ABORTED), 302 (ELDLM_LOCK_REPLACED), 304 (ELDLM_LOCK_WOULDBLOCK), 400 (ELDLM_NAMESPACE_EXISTS), and 401 (ELDLM_BAD_NAMESPACE), so that I can test my patch completely. Thanks! |
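The fallback behavior described above can be reproduced with a minimal sketch. The table contents and names below (demo_errno_hton, demo_hton_mapping) are illustrative stand-ins, not the real Lustre table; only the control flow mirrors lustre_errno_hton(): an errno past the end of the mapping table, or one with no entry, collapses to the generic LUSTRE_EIO wire value.

```c
#include <stddef.h>

#define LUSTRE_EIO 5U
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Tiny stand-in for the real mapping table: EIO (5) has a wire
 * encoding; 303 (ELDLM_NO_LOCK_DATA) is past the end of the table,
 * so lookup falls back to the generic value. */
static const unsigned int demo_hton_mapping[] = {
	[0] = 0,
	[5] = LUSTRE_EIO,
};

unsigned int demo_errno_hton(unsigned int h)
{
	unsigned int n;

	if (h == 0) {
		n = 0;
	} else if (h < ARRAY_SIZE(demo_hton_mapping)) {
		n = demo_hton_mapping[h];
		if (n == 0)
			goto generic;
	} else {
generic:
		/* Unknown errno: degrade to the generic LUSTRE_EIO. */
		n = LUSTRE_EIO;
	}

	return n;
}
```

Calling demo_errno_hton(303) returns 5, which is exactly the -303 to -5 collapse seen in the test failure.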
| Comment by Andreas Dilger [ 31/Aug/17 ] |
|
I don't think the error is in the error translation, but rather that the client shouldn't be returning the ELDLM_NO_LOCK_DATA error to userspace. |
| Comment by ZhangWei [ 02/Sep/17 ] |
|
The client returns ELDLM_NO_LOCK_DATA because dlmlock->l_ast_data in osc_ldlm_glimpse_ast(...) is NULL. At the moment, it seems this issue is only triggered when a file protected by a group lock is accessed with fstat from the client. It would be very helpful if someone could look into this! |
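The client-side check described above can be sketched as follows. The struct and function names here are invented for illustration; only the l_ast_data NULL test and the -ELDLM_NO_LOCK_DATA return mirror what osc_ldlm_glimpse_ast() is described as doing.

```c
#include <stddef.h>

#define ELDLM_NO_LOCK_DATA 303

/* Hypothetical miniature of a DLM lock: l_ast_data normally points
 * at the object (inode) the lock protects. */
struct demo_lock {
	void *l_ast_data;
};

int demo_glimpse_ast(const struct demo_lock *lock)
{
	if (lock->l_ast_data == NULL)
		return -ELDLM_NO_LOCK_DATA; /* no object behind this lock */
	return 0; /* would fill in the LVB from the object here */
}
```

With l_ast_data cleared (as happens in the fstat-under-group-lock case), the AST returns -303, which then crosses the wire back to the server.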
| Comment by ZhangWei [ 03/Sep/17 ] |
|
Hi Andreas, |
| Comment by Ann Koehler (Inactive) [ 01/Jun/18 ] |
|
There's an inconsistency (i.e. a bug) in the errno returned to the server for a glimpse lock callback when the lock has no data. On x86_64 and i386 architectures, -ELDLM_NO_LOCK_DATA is returned. All other architectures pass the -ELDLM_NO_LOCK_DATA errno through lustre_errno_hton and so return -EIO. The server does not recognize the -EIO as a legitimate return status and incorrectly evicts the client. Here's the trace through logs and code that shows this behavior:
Client gets glimpse lock callback and returns -303
> 00000100:00100000:48.0:1527716735.853979:1296:6674:0:(service.c:2073:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb00_000:LOV_OSC_UUID+4:76239:x1594741124538788:12345-10.10.100.11@o2ib4002:106
> 00000100:00000200:48.0:1527716735.853981:1296:6674:0:(service.c:2078:ptlrpc_server_handle_request()) got req 1594741124538788
> 00010000:00000001:48.0:1527716735.853981:1520:6674:0:(ldlm_lockd.c:2217:ldlm_callback_handler()) Process entered
> 00010000:00000002:48.0:1527716735.853988:1520:6674:0:(ldlm_lockd.c:2396:ldlm_callback_handler()) glimpse ast
> 00010000:00000001:48.0:1527716735.853988:1520:6674:0:(ldlm_lockd.c:1975:ldlm_handle_gl_callback()) Process entered
...
> 00010000:00010000:48.0:1527716735.853991:1728:6674:0:(ldlm_lockd.c:1977:ldlm_handle_gl_callback()) ### client glimpse AST callback handler ns: snx11260-OST0006-osc-ffff807d9570d000 lock: ffff80258c504400/0x7759ba9bc33b6e92 lrc: 6/0,0 mode: PW/PW res: [0x1b8b432:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x429400020000 nid: local remote: 0x43e2b0c6cc72764c expref: -99 pid: 34318 timeout: 0 lvb_type: 1
> 00000008:00000001:48.0:1527716735.853997:1616:6674:0:(osc_lock.c:595:osc_ldlm_glimpse_ast()) Process entered
> 00000008:00000001:48.0:1527716735.854007:1648:6674:0:(osc_lock.c:645:osc_ldlm_glimpse_ast()) Process leaving (rc=18446744073709551313 : -303 : fffffffffffffed1)
> 00010000:00000001:48.0:1527716735.854062:1552:6674:0:(ldlm_lockd.c:2404:ldlm_callback_handler()) Process leaving (rc=0 : 0 : 0)
...
> 00000100:00000040:48.0:1527716735.854064:1792:6674:0:(lustre_net.h:3478:ptlrpc_rqphase_move()) @@@ move req "Interpret" -> "Complete" req@ffff8025831dc6c0 x1594741124538788/t0(0) o106->LOV_OSC_UUID@10.10.100.11@o2ib4002:-1/-1 lens 296/192 e 0 to 0 dl 1527716790 ref 1 fl Interpret:/0/0 rc -303/-5
> 00000100:00100000:48.0:1527716735.854068:1296:6674:0:(service.c:2123:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb00_000:LOV_OSC_UUID+4:76239:x1594741124538788:12345-10.10.100.11@o2ib4002:106 Request procesed in 90us (138us total) trans 0 rc -303/-5
OST receives -5 errno from client and incorrectly evicts client
> May 30 16:45:35 snx11260n010 kernel: LustreError: 76239:0:(ldlm_lockd.c:679:ldlm_handle_ast_error()) ### client (nid 0@gni) returned error from glimpse AST (req status -5 rc -5), evict it ns: filter-snx11260-OST0006_UUID lock: ffff8801fe7562c0/0x43e2b0c6cc72764c lrc: 5/0,0 mode: PW/PW res: [0x1b8b432:0x0:0x0].0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000020020 nid: 0@gni remote: 0x7759ba9bc33b6e92 cl: 0108e53d-1d8e-11d7-4535-96704a86073a expref: 7 pid: 77697 timeout: 11151370288 lvb_type: 0
> May 30 16:45:35 snx11260n010 kernel: LustreError: 138-a: snx11260-OST0006: A client on nid 0@gni was evicted due to a lock glimpse callback time out: rc -5
When client sends reply, the errno in rq_status is translated with the call to ptlrpc_status_hton
> 543 int ptlrpc_send_reply(struct ptlrpc_request *req, int flags)
> 544 {
...
> 590 lustre_msg_set_status(req->rq_repmsg,
> 591 ptlrpc_status_hton(req->rq_status));
...
ELDLM_NO_LOCK_DATA is not included in the mapping table so gets converted to LUSTRE_EIO.
> 337 unsigned int lustre_errno_hton(unsigned int h)
...
> 348 generic:
> 349 /*
> 350 * A generic errno is better than the unknown one that could
> 351 * mean anything to a different host.
> 352 */
> 353 n = LUSTRE_EIO;
> 354 }
The errno translation function is not called for x86 architectures but is for ARM. This results in different errnos being returned for the glimpse callback when there is no inode.
> 206 #if !defined(__x86_64__) && !defined(__i386__)
> 207 #define LUSTRE_TRANSLATE_ERRNOS
> 208 #endif
> 209
> 210 #ifdef LUSTRE_TRANSLATE_ERRNOS
> 211 unsigned int lustre_errno_hton(unsigned int h);
> 212 unsigned int lustre_errno_ntoh(unsigned int n);
> 213 #else
> 214 #define lustre_errno_hton(h) (h)
> 215 #define lustre_errno_ntoh(n) (n)
> 216 #endif
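The effect of this conditional can be shown with a small sketch. The names demo_wire_status and translate_errnos are assumptions standing in for the real compile-time #if; the point is only that the same host errno produces different wire values depending on whether translation runs.

```c
#define LUSTRE_EIO 5U

/* Stand-in for lustre_errno_hton(): pretend 303 has no entry in the
 * mapping table, so it collapses to the generic LUSTRE_EIO. */
static unsigned int demo_translate(unsigned int h)
{
	return (h == 0 || h == 5) ? h : LUSTRE_EIO;
}

/* On x86_64/i386 the hton macro is identity (translate_errnos == 0);
 * on other architectures the translation runs (translate_errnos == 1). */
unsigned int demo_wire_status(unsigned int h, int translate_errnos)
{
	return translate_errnos ? demo_translate(h) : h;
}
```

So an x86_64 client sends 303 over the wire while an ARM client sends 5, which is the architecture-dependent inconsistency described above.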
The server should take the true branch of the first if statement; instead it executes the else if (rc != 0) branch.
> 728 static int ldlm_cb_interpret(const struct lu_env *env,
> 729 struct ptlrpc_request *req, void *data, int rc)
> 739 case LDLM_GL_CALLBACK:
> 740 /* Update the LVB from disk if the AST failed
> 741 * (this is a legal race)
> 742 *
> 743 * - Glimpse callback of local lock just returns
> 744 * -ELDLM_NO_LOCK_DATA.
> 745 * - Glimpse callback of remote lock might return
> 746 * -ELDLM_NO_LOCK_DATA when inode is cleared. LU-274
> 747 */
> 748 if (rc == -ELDLM_NO_LOCK_DATA) {
> 749 LDLM_DEBUG(lock, "lost race - client has a lock but no "
> 750 "inode");
> 751 ldlm_res_lvbo_update(lock->l_resource, NULL, 1);
> 752 } else if (rc != 0) {
> 753 rc = ldlm_handle_ast_error(lock, req, rc, "glimpse");
> 754 } else {
> 755 rc = ldlm_res_lvbo_update(lock->l_resource, req, 1);
> 756 }
> 757 break;
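The server-side decision above depends only on the rc received from the client, which can be condensed into a standalone sketch. The enum and function names are invented for this demo; only -ELDLM_NO_LOCK_DATA (-303) and -EIO (-5) match the source.

```c
#define ELDLM_NO_LOCK_DATA 303
#define DEMO_EIO 5

enum gl_outcome {
	GL_LVBO_UPDATE_RACE, /* legal race: refresh LVB from disk */
	GL_EVICT_CLIENT,     /* unexpected error: evict the client */
	GL_LVBO_UPDATE_OK,   /* success: update LVB from the reply */
};

enum gl_outcome demo_gl_interpret(int rc)
{
	if (rc == -ELDLM_NO_LOCK_DATA)
		return GL_LVBO_UPDATE_RACE;
	else if (rc != 0)
		return GL_EVICT_CLIENT;
	else
		return GL_LVBO_UPDATE_OK;
}
```

Feeding it the translated status shows the bug directly: -303 lands in the benign race branch, while the -5 produced by lustre_errno_hton lands in the eviction branch.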
The obvious solutions are: I plan to develop a patch based on option 2 unless there are any objections.
|
| Comment by Andreas Dilger [ 15/Oct/18 ] |
|
Ann, did you make any progress with a patch for this? |
| Comment by Gerrit Updater [ 15/Oct/18 ] |
|
Ann Koehler (amk@cray.com) uploaded a new patch: https://review.whamcloud.com/33374 |
| Comment by Ann Koehler (Inactive) [ 15/Oct/18 ] |
|
Sorry for the delay in getting the patch uploaded. We've been running this patch on in-house systems for at least 3 months. It resolves the problems we were seeing on ARM systems. |
| Comment by James A Simmons [ 16/Oct/18 ] |
|
Will give it a try. |
| Comment by Gerrit Updater [ 24/Oct/18 ] |
|
Ann Koehler (amk@cray.com) uploaded a new patch: https://review.whamcloud.com/33471 |
| Comment by Gerrit Updater [ 21/Nov/18 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33471/ |
| Comment by Peter Jones [ 21/Nov/18 ] |
|
Landed for 2.12 |
| Comment by Alex Zhuravlev [ 24/Jan/19 ] |
|
https://testing.whamcloud.com/test_sets/b636742e-1e76-11e9-b7d4-52540065bddc |