[LU-9066] ior ERROR: read() failed, Input/output error; client was evicted after OST failover Created: 31/Jan/17  Updated: 14/Jun/17  Resolved: 23/Mar/17

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0, Lustre 2.10.0
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Minor
Reporter: nasf (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates LU-8860 lock callback errors after client umount Closed
Related
is related to LU-8359 Wrong evict during failover Reopened
is related to LU-8860 lock callback errors after client umount Closed
Severity: 3

 Description   

During OST failover, the server reported:

2016-12-05 02:42:07 [115035.688184] LustreError: 138-a: fs1-OST0001: A client on nid 172.18.1.103@o2ib was evicted due to a lock completion callback time out: rc -19

On the client side, IOR hit an I/O failure:

bluepill-client03: IOR-3.0.1: MPI Coordinated Test of Parallel I/O
bluepill-client03: 
bluepill-client03: Began: Sun Dec  4 18:28:38 2016
bluepill-client03: Command line used: /usr/local/bin/IOR -o /mnt/fs1//ha.sh-111769/bluepill-client03-ior/f.ior -f /test-tools/grev/Cray/2016_snx2k_fvt.ior.shared_file.p
...
bluepill-client03: 	clients            = 88 (8 per node)
...
bluepill-client03: Commencing read performance test: Mon Dec  5 02:34:58 2016
bluepill-client03: ior ERROR: read() failed, errno 5, Input/output error (aiori-POSIX.c:250)


 Comments   
Comment by Gerrit Updater [ 31/Jan/17 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: https://review.whamcloud.com/25173
Subject: LU-9066 ldlm: NOT evict client when target stopping
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1495b69d8bf5d901c372ec91fa316e9649b31866

Comment by Mikhail Pershin [ 06/Feb/17 ]

It seems this is the same as LU-8359, isn't it? 

Comment by nasf (Inactive) [ 06/Feb/17 ]

The issue will be handled in LU-8860.

Comment by nasf (Inactive) [ 06/Feb/17 ]

"It seems this is the same as LU-8359, isn't it?"

I think your patch https://review.whamcloud.com/#/c/23921 has already handled this case.

Comment by Mikhail Pershin [ 06/Feb/17 ]

Reopening to keep patch tracking under this ticket.

Comment by Gerrit Updater [ 23/Mar/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/23921/
Subject: LU-9066 ldlm: don't evict client on umount if AST fails
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: eda984e7cb4e6a97310ed0f5e81f398dc48b56bf

Comment by Andriy Skulysh [ 13/Jun/17 ]

We observe the same error in 2.7 with https://review.whamcloud.com/23921/ applied:

2017-06-09 14:53:58 [77116.822290] LustreError: 138-a: fs1-OST0000: A client on nid 172.18.1.104@o2ib was evicted due to a lock blocking callback time out: rc -19

With the patch from LU-8359 it isn't reproducible.

Comment by nasf (Inactive) [ 13/Jun/17 ]

Sorry, I cannot imagine how this message could be printed with https://review.whamcloud.com/23921/ applied. With that patch, the logic of ldlm_handle_ast_error() is as follows:

static int ldlm_handle_ast_error(struct ldlm_lock *lock,
                                 struct ptlrpc_request *req, int rc,
                                 const char *ast_type)
{
...
                } else if (rc == -ENODEV || rc == -ESHUTDOWN ||
                           (rc == -EIO &&
                            req->rq_import->imp_state == LUSTRE_IMP_CLOSED)) {
                        /* During umount the AST fails because it cannot be
                         * sent. This shouldn't lead to client eviction.
                         * -ENODEV is returned by ptl_send_rpc() for a
                         *  new request on such an import.
                         * -ESHUTDOWN is returned by ptlrpc_import_delay_req()
                         *  if imp_invalid or obd_no_recov is set.
                         * ptlrpc_import_delay_req() also checks for
                         * LUSTRE_IMP_CLOSED and returns -EIO in that case.
                         * In all such cases the error is ignored.
                         */
                        LDLM_DEBUG(lock, "%s AST can't be sent due to a server"
                                         " %s failure or umount process: rc = %d\n",
                                         ast_type,
                                         req->rq_import->imp_obd->obd_name, rc);
                } else {
                        LDLM_ERROR(lock,
                                   "client (nid %s) %s %s AST (req@%p x%llu status %d rc %d), evict it",
                                   libcfs_nid2str(peer.nid),
                                   req->rq_replied ? "returned error from" :
                                   "failed to reply to",
                                   ast_type, req, req->rq_xid,
                                   (req->rq_repmsg != NULL) ?
                                   lustre_msg_get_status(req->rq_repmsg) : 0,
                                   rc);
                        ldlm_failed_ast(lock, rc, ast_type);
                }
                return rc;
...
}

Please note that when rc == -19 (-ENODEV), it will go to the LDLM_DEBUG() branch, not the ldlm_failed_ast() branch. Please correct me if I missed anything.
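
To make the branch selection easier to check by hand, here is a minimal standalone sketch of only the two branches visible in the excerpt above. It is not the Lustre code: the enum sketch_imp_state and the function ast_error_evicts() are hypothetical names introduced for illustration, the conditions elided by "..." in the excerpt are not modeled, and it assumes Linux errno values (where ENODEV is 19).

```c
#include <assert.h>   /* only needed if you assert on the results */
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the import state; only "closed" matters here. */
enum sketch_imp_state {
	SKETCH_IMP_CLOSED,
	SKETCH_IMP_FULL
};

/*
 * Returns true if, under the quoted condition, an AST send error would
 * fall through to the eviction branch (ldlm_failed_ast() in the excerpt),
 * and false if it would only be logged via the LDLM_DEBUG() branch.
 *
 * Assumption: the earlier branches elided in the excerpt did not match.
 */
static bool ast_error_evicts(int rc, enum sketch_imp_state state)
{
	if (rc == -ENODEV || rc == -ESHUTDOWN ||
	    (rc == -EIO && state == SKETCH_IMP_CLOSED)) {
		/* AST could not be sent (umount or target stopping): ignore. */
		return false;
	}

	/* Any other error reaches the eviction branch. */
	return true;
}
```

Under these assumptions, ast_error_evicts(-ENODEV, SKETCH_IMP_FULL) returns false, so rc -19 would not reach the eviction branch, whereas ast_error_evicts(-EIO, SKETCH_IMP_FULL) returns true because -EIO is only ignored when the import is closed.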

Comment by Andriy Skulysh [ 14/Jun/17 ]

Ah, in fact the fix wasn't applied during our testing. Sorry for the confusion.
