[LU-9066] ior ERROR: read() failed, Input/output error; client was evicted after OST failover Created: 31/Jan/17 Updated: 14/Jun/17 Resolved: 23/Mar/17 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0, Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | nasf (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
During the OST failover, the following was reported:

2016-12-05 02:42:07 [115035.688184] LustreError: 138-a: fs1-OST0001: A client on nid 172.18.1.103@o2ib was evicted due to a lock completion callback time out: rc -19

On the client side, IOR got an I/O failure:

bluepill-client03: IOR-3.0.1: MPI Coordinated Test of Parallel I/O
bluepill-client03:
bluepill-client03: Began: Sun Dec 4 18:28:38 2016
bluepill-client03: Command line used: /usr/local/bin/IOR -o /mnt/fs1//ha.sh-111769/bluepill-client03-ior/f.ior -f /test-tools/grev/Cray/2016_snx2k_fvt.ior.shared_file.p
...
bluepill-client03: clients = 88 (8 per node)
...
bluepill-client03: Commencing read performance test: Mon Dec 5 02:34:58 2016
bluepill-client03: ior ERROR: read() failed, errno 5, Input/output error (aiori-POSIX.c:250) |
| Comments |
| Comment by Gerrit Updater [ 31/Jan/17 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/25173 |
| Comment by Mikhail Pershin [ 06/Feb/17 ] |
|
It seems this is the same as LU-8359, isn't it? |
| Comment by nasf (Inactive) [ 06/Feb/17 ] |
|
The issue will be handled in |
| Comment by nasf (Inactive) [ 06/Feb/17 ] |
I think your patch https://review.whamcloud.com/#/c/23921 has already handled this case. |
| Comment by Mikhail Pershin [ 06/Feb/17 ] |
|
Reopening to keep patch tracking under this ticket. |
| Comment by Gerrit Updater [ 23/Mar/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/23921/ |
| Comment by Andriy Skulysh [ 13/Jun/17 ] |
|
We observe the same error in 2.7 with https://review.whamcloud.com/23921/ applied:

2017-06-09 14:53:58 [77116.822290] LustreError: 138-a: fs1-OST0000: A client on nid 172.18.1.104@o2ib was evicted due to a lock blocking callback time out: rc -19

With the patch from LU-8359 it isn't reproducible. |
| Comment by nasf (Inactive) [ 13/Jun/17 ] |
|
Sorry, I cannot see how this message could be printed with https://review.whamcloud.com/23921/ applied. With that patch, the logic of ldlm_handle_ast_error() is as follows:

static int ldlm_handle_ast_error(struct ldlm_lock *lock,
                                 struct ptlrpc_request *req, int rc,
                                 const char *ast_type)
{
        ...
        } else if (rc == -ENODEV || rc == -ESHUTDOWN ||
                   (rc == -EIO &&
                    req->rq_import->imp_state == LUSTRE_IMP_CLOSED)) {
                /* Upon umount process the AST fails because cannot be
                 * sent. This shouldn't lead to the client eviction.
                 * -ENODEV error is returned by ptl_send_rpc() for
                 * new request in such import.
                 * -SHUTDOWN is returned by ptlrpc_import_delay_req()
                 * if imp_invalid is set or obd_no_recov.
                 * Meanwhile there is also check for LUSTRE_IMP_CLOSED
                 * in ptlrpc_import_delay_req() as well with -EIO code.
                 * In all such cases errors are ignored.
                 */
                LDLM_DEBUG(lock, "%s AST can't be sent due to a server"
                           " %s failure or umount process: rc = %d\n",
                           ast_type,
                           req->rq_import->imp_obd->obd_name, rc);
        } else {
                LDLM_ERROR(lock,
                           "client (nid %s) %s %s AST (req@%p x%llu status %d rc %d), evict it",
                           libcfs_nid2str(peer.nid),
                           req->rq_replied ? "returned error from" :
                                             "failed to reply to",
                           ast_type, req, req->rq_xid,
                           (req->rq_repmsg != NULL) ?
                           lustre_msg_get_status(req->rq_repmsg) : 0,
                           rc);
                ldlm_failed_ast(lock, rc, ast_type);
        }
        return rc;
        ...
}
Please note that when rc == -19 (-ENODEV), it will take the LDLM_DEBUG() branch, not the ldlm_failed_ast() branch. Please correct me if I missed anything. |
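The branch selection described above can be isolated in a small standalone sketch. This is not the Lustre implementation: the names sketch_imp_state and sketch_ast_error_evicts are hypothetical, the import state is reduced to a two-value enum, and the earlier branches elided by "..." in the excerpt are not modelled. It only mirrors the two branches quoted in the comment.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the import state; only "closed" matters here. */
enum sketch_imp_state {
	SKETCH_IMP_CLOSED,
	SKETCH_IMP_FULL,
};

/*
 * Returns true if an AST failure with this rc would fall through to the
 * final "evict it" branch of the excerpt, and false if it matches one of
 * the ignored cases (-ENODEV, -ESHUTDOWN, or -EIO on a closed import).
 */
bool sketch_ast_error_evicts(int rc, enum sketch_imp_state state)
{
	if (rc == -ENODEV || rc == -ESHUTDOWN ||
	    (rc == -EIO && state == SKETCH_IMP_CLOSED))
		return false;	/* LDLM_DEBUG() branch: error is ignored */

	return true;		/* LDLM_ERROR() + ldlm_failed_ast() branch */
}

/* Small self-check covering the case discussed in this ticket. */
void sketch_self_check(void)
{
	/* -ENODEV (rc -19 in the logs above) must not reach the evict branch. */
	assert(!sketch_ast_error_evicts(-ENODEV, SKETCH_IMP_FULL));
	/* -EIO is only ignored when the import is closed. */
	assert(sketch_ast_error_evicts(-EIO, SKETCH_IMP_FULL));
}
```

Under these assumptions, an rc of -ENODEV never selects the eviction branch regardless of import state, which matches the reasoning in the comment above.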
| Comment by Andriy Skulysh [ 14/Jun/17 ] |
|
Ah, in fact the fix wasn't applied during our testing. Sorry for the confusion. |