[LU-9793] sanity test 244 fail Created: 24/Jul/17  Updated: 13/Feb/19  Resolved: 21/Nov/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0, Lustre 2.12.0
Fix Version/s: Lustre 2.12.0

Type: Bug Priority: Minor
Reporter: ZhangWei Assignee: Zhenyu Xu
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS 7.3


Issue Links:
Duplicate
Related
is related to LU-9429 parallel-scale test_parallel_grouploc... Open
is related to LU-9479 sanity test 184d 244: don't instantia... Open
is related to LU-11200 Centos 8 arm64 server support Resolved
is related to LU-9632 sanity test_244: FAIL: sendfile+group... Open
is related to LU-11958 sanity test_244: FAIL: sendfile+group... Closed
Epic/Theme: ldlm, test
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

Sanity test 244 fails; the failure log is:

sendfile_grouplock: sendfile_grouplock.c:257: sendfile_copy: assertion 'sret > 0' failed: sendfile failed: Input/output error
 sanity test_244: @@@@@@ FAIL: sendfile+grouplock failed 
  Trace dump:
  = /lib64/lustre/tests/test-framework.sh:4851:error()
  = /lib64/lustre/tests/sanity.sh:13830:test_244()
  ......

In the kernel log on the client there is an LDLM error:

LustreError: 11-0: lustre-OST0000-osc-ffff88085169e000: operation ldlm_enqueue to node 10.xxx.x.201@o2ib failed: rc = -107

and the kernel log on the OST shows the following error:

[598085.271427] LustreError: 49547:0:(ldlm_lockd.c:682:ldlm_handle_ast_error()) ### client (nid 10.xxx.x.192@o2ib) returned error from glimpse AST (req@ffff801fbf2e7800 x1573189961819552 status -5 rc -5), evict it ns: filter-lustre-OST0000_UUID lock: ffff801fa0d07c00/0xb727c060c669bd4b lrc: 4/0,0 mode: GROUP/GROUP res: [0x280000400:0x282:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x40000000020000 nid: 10.xxx.x.192@o2ib remote: 0xdf79275a4bf506c3 expref: 8 pid: 49675 timeout: 0 lvb_type: 0

Can someone help with this issue? The error looks similar to: https://jira.hpdd.intel.com/browse/LU-9632?jql=text%20~%20%22sanity%20244%22



 Comments   
Comment by ZhangWei [ 30/Aug/17 ]

I have fixed this issue. Anyone who needs help can contact me by e-mail.

Comment by Peter Jones [ 30/Aug/17 ]

Why don't you push your patch into gerrit so that it can be landed into future Lustre releases without anyone needing to contact you?

Comment by ZhangWei [ 31/Aug/17 ]

Hi Peter,
This issue is related with the following code:

unsigned int lustre_errno_hton(unsigned int h)
{
	unsigned int n;

	if (h == 0) {
		n = 0;
	} else if (h < ARRAY_SIZE(lustre_errno_hton_mapping)) {
		n = lustre_errno_hton_mapping[h];
		if (n == 0)
			goto generic;
	} else {
generic:
		/*
		 * A generic errno is better than the unknown one that could
		 * mean anything to a different host.
		 */
		n = LUSTRE_EIO;
	}

	return n;
}

In this function, 303 (ELDLM_NO_LOCK_DATA) cannot be found in lustre_errno_hton_mapping[], so 303 is rewritten to 5 (LUSTRE_EIO).

I am also wondering whether there are test cases that can simulate 300 (ELDLM_LOCK_CHANGED), 301 (ELDLM_LOCK_ABORTED), 302 (ELDLM_LOCK_REPLACED), 304 (ELDLM_LOCK_WOULDBLOCK), 400 (ELDLM_NAMESPACE_EXISTS), and 401 (ELDLM_BAD_NAMESPACE), so that I can test my patch completely. Thanks!

Comment by Andreas Dilger [ 31/Aug/17 ]

I don't think the error is in the error translation, but rather that the client shouldn't be returning the ELDLM_NO_LOCK_DATA error to userspace.

Comment by ZhangWei [ 02/Sep/17 ]

The client returns ELDLM_NO_LOCK_DATA because dlmlock->l_ast_data in osc_ldlm_glimpse_ast(...) is NULL. So far, the issue seems to be triggered only when a file protected by a group lock is accessed via fstat from a client. It would be very helpful if someone could look into this issue!

Comment by ZhangWei [ 03/Sep/17 ]

Hi Andreas,
Can you provide a patch that fixes this issue in Lustre 2.11.0?

Comment by Ann Koehler (Inactive) [ 01/Jun/18 ]

There's an inconsistency (i.e. a bug) in the errno returned to the server for a glimpse lock callback when the lock has no data. On x86_64 and i386 architectures, -ELDLM_NO_LOCK_DATA is returned. All other architectures pass the -ELDLM_NO_LOCK_DATA errno through lustre_errno_hton() and so return -EIO. The server does not recognize -EIO as a legitimate return status and incorrectly evicts the client.

Here's the trace through logs and code that shows this behavior:

Client gets glimpse lock callback and returns -303
> 00000100:00100000:48.0:1527716735.853979:1296:6674:0:(service.c:2073:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb00_000:LOV_OSC_UUID+4:76239:x1594741124538788:12345-10.10.100.11@o2ib4002:106
> 00000100:00000200:48.0:1527716735.853981:1296:6674:0:(service.c:2078:ptlrpc_server_handle_request()) got req 1594741124538788
> 00010000:00000001:48.0:1527716735.853981:1520:6674:0:(ldlm_lockd.c:2217:ldlm_callback_handler()) Process entered
> 00010000:00000002:48.0:1527716735.853988:1520:6674:0:(ldlm_lockd.c:2396:ldlm_callback_handler()) glimpse ast
> 00010000:00000001:48.0:1527716735.853988:1520:6674:0:(ldlm_lockd.c:1975:ldlm_handle_gl_callback()) Process entered
...
> 00010000:00010000:48.0:1527716735.853991:1728:6674:0:(ldlm_lockd.c:1977:ldlm_handle_gl_callback()) ### client glimpse AST callback handler ns: snx11260-OST0006-osc-ffff807d9570d000 lock: ffff80258c504400/0x7759ba9bc33b6e92 lrc: 6/0,0 mode: PW/PW res: [0x1b8b432:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x429400020000 nid: local remote: 0x43e2b0c6cc72764c expref: -99 pid: 34318 timeout: 0 lvb_type: 1
> 00000008:00000001:48.0:1527716735.853997:1616:6674:0:(osc_lock.c:595:osc_ldlm_glimpse_ast()) Process entered
> 00000008:00000001:48.0:1527716735.854007:1648:6674:0:(osc_lock.c:645:osc_ldlm_glimpse_ast()) Process leaving (rc=18446744073709551313 : -303 : fffffffffffffed1)
> 00010000:00000001:48.0:1527716735.854062:1552:6674:0:(ldlm_lockd.c:2404:ldlm_callback_handler()) Process leaving (rc=0 : 0 : 0)
...
> 00000100:00000040:48.0:1527716735.854064:1792:6674:0:(lustre_net.h:3478:ptlrpc_rqphase_move()) @@@ move req "Interpret" -> "Complete"  req@ffff8025831dc6c0 x1594741124538788/t0(0) o106->LOV_OSC_UUID@10.10.100.11@o2ib4002:-1/-1 lens 296/192 e 0 to 0 dl 1527716790 ref 1 fl Interpret:/0/0 rc -303/-5
> 00000100:00100000:48.0:1527716735.854068:1296:6674:0:(service.c:2123:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ldlm_cb00_000:LOV_OSC_UUID+4:76239:x1594741124538788:12345-10.10.100.11@o2ib4002:106 Request procesed in 90us (138us total) trans 0 rc -303/-5

OST receives -5 errno from client and incorrectly evicts client
> May 30 16:45:35 snx11260n010 kernel: LustreError: 76239:0:(ldlm_lockd.c:679:ldlm_handle_ast_error()) ### client (nid 0@gni) returned error from glimpse AST (req status -5 rc -5), evict it ns: filter-snx11260-OST0006_UUID lock: ffff8801fe7562c0/0x43e2b0c6cc72764c lrc: 5/0,0 mode: PW/PW res: [0x1b8b432:0x0:0x0].0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x60000000020020 nid: 0@gni remote: 0x7759ba9bc33b6e92 cl: 0108e53d-1d8e-11d7-4535-96704a86073a expref: 7 pid: 77697 timeout: 11151370288 lvb_type: 0
> May 30 16:45:35 snx11260n010 kernel: LustreError: 138-a: snx11260-OST0006: A client on nid 0@gni was evicted due to a lock glimpse callback time out: rc -5

When client sends reply, the errno in rq_status is translated with the call to ptlrpc_status_hton
> 543 int ptlrpc_send_reply(struct ptlrpc_request *req, int flags)
> 544 {
...
> 590         lustre_msg_set_status(req->rq_repmsg,
> 591                               ptlrpc_status_hton(req->rq_status));
...

ELDLM_NO_LOCK_DATA is not included in the mapping table so gets converted to LUSTRE_EIO.
> 337 unsigned int lustre_errno_hton(unsigned int h)
...
> 348 generic:
> 349                 /*
> 350                  * A generic errno is better than the unknown one that could
> 351                  * mean anything to a different host.
> 352                  */
> 353                 n = LUSTRE_EIO;
> 354         }

The errno translation is a no-op on x86_64 and i386 architectures but is performed on others, such as ARM. This results in different errnos being returned for the glimpse callback when there is no inode.
> 206 #if !defined(__x86_64__) && !defined(__i386__)
> 207 #define LUSTRE_TRANSLATE_ERRNOS 
> 208 #endif
> 209 
> 210 #ifdef LUSTRE_TRANSLATE_ERRNOS
> 211 unsigned int lustre_errno_hton(unsigned int h);
> 212 unsigned int lustre_errno_ntoh(unsigned int n);
> 213 #else
> 214 #define lustre_errno_hton(h) (h)
> 215 #define lustre_errno_ntoh(n) (n)
> 216 #endif

The server should take the first branch of the if statement below, but because it receives -EIO instead of -ELDLM_NO_LOCK_DATA it executes the "else if (rc != 0)" branch.
>  728 static int ldlm_cb_interpret(const struct lu_env *env,
>  729                              struct ptlrpc_request *req, void *data, int rc)
>  739         case LDLM_GL_CALLBACK:
>  740                 /* Update the LVB from disk if the AST failed
>  741                  * (this is a legal race)
>  742                  *   
>  743                  * - Glimpse callback of local lock just returns
>  744                  *   -ELDLM_NO_LOCK_DATA.
>  745                  * - Glimpse callback of remote lock might return
>  746                  *   -ELDLM_NO_LOCK_DATA when inode is cleared. LU-274
>  747                  */
>  748                 if (rc == -ELDLM_NO_LOCK_DATA) {
>  749                         LDLM_DEBUG(lock, "lost race - client has a lock but no "
>  750                                    "inode");
>  751                         ldlm_res_lvbo_update(lock->l_resource, NULL, 1);
>  752                 } else if (rc != 0) {
>  753                         rc = ldlm_handle_ast_error(lock, req, rc, "glimpse");
>  754                 } else {
>  755                         rc = ldlm_res_lvbo_update(lock->l_resource, req, 1);
>  756                 }
>  757                 break;

The obvious solutions are:
1. Add the ELDLM errors to the lustre_errno_hton_mapping table leaving the values unchanged.
lustre_errno_hton() returns architecture agnostic errnos so adding a few more doesn't seem to violate any design principles.
2. Change lustre_errno_hton() to pass through unrecognized errnos without change.
This second approach is effectively the solution now used for x86 and i386 architectures.
3. Translate the ELDLM errnos to Linux errnos.
Code comments and the ldlm_error2errno()/ldlm_errno2error() functions suggest that this may have been the original design intent. However, changing the errno returned for the NO LOCK DATA case would cause backward-compatibility and interoperability issues that don't seem worth solving.

I plan to develop a patch based on option 2 unless there are any objections.

Comment by Andreas Dilger [ 15/Oct/18 ]

Ann, did you make any progress with a patch for this?

Comment by Gerrit Updater [ 15/Oct/18 ]

Ann Koehler (amk@cray.com) uploaded a new patch: https://review.whamcloud.com/33374
Subject: LU-9793 ptlrpc: Do not map unrecognized ELDLM errnos to EIO
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 623b7dc9a2143ed6da250141ebf13111aec6ffe9

Comment by Ann Koehler (Inactive) [ 15/Oct/18 ]

Sorry for the delay in getting the patch uploaded. We've been running this patch on in-house systems for at least 3 months. It resolves the problems we were seeing on ARM systems.

Comment by James A Simmons [ 16/Oct/18 ]

Will give it a try.

Comment by Gerrit Updater [ 24/Oct/18 ]

Ann Koehler (amk@cray.com) uploaded a new patch: https://review.whamcloud.com/33471
Subject: LU-9793 ptlrpc: Do not map unrecognized ELDLM errnos to EIO
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9a29b3e2e97e9cf40aa5d04292d785c7b7f85de3

Comment by Gerrit Updater [ 21/Nov/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33471/
Subject: LU-9793 ptlrpc: Do not map unrecognized ELDLM errnos to EIO
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 641e1d546742aa355ef51bee6359ee82994d5735

Comment by Peter Jones [ 21/Nov/18 ]

Landed for 2.12

Comment by Alex Zhuravlev [ 24/Jan/19 ]

https://testing.whamcloud.com/test_sets/b636742e-1e76-11e9-b7d4-52540065bddc

Generated at Sat Feb 10 02:29:20 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.