[LU-11601] IR doesn't handle EAGAIN after initial connect when pinger_recov is 0
Created: 02/Nov/18  Updated: 26/Aug/22  Resolved: 13/Apr/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.13.0

Type: Bug
Priority: Minor
Reporter: Sergey Cheremencev
Assignee: Sergey Cheremencev
Resolution: Fixed
Votes: 0
Labels: patch

Issue Links:
Related
is related to LU-9784 restore pinger_recov on error Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

Description

A client may try to connect to an OST before recovery, while the OST is not yet configured. In that case the OST refuses the connection in target_handle_connect() and returns -EAGAIN:

        if (target->obd_no_conn) {
                spin_unlock(&target->obd_dev_lock);

                CDEBUG(D_INFO, "%s: Temporarily refusing client connection "
                               "from %s\n", target->obd_name,
                               libcfs_nid2str(req->rq_peer.nid));
                GOTO(out, rc = -EAGAIN);
        }

This is harmless when pinger_recov is enabled, because ptlrpc_pinger_main will reconnect later.
When pinger_recov is 0, however, the client never reconnects.
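
The difference between the two settings can be sketched as a small decision function. This is a simplified model under stated assumptions, not the code from the patch: struct model_import, model_will_retry() and the retry_on_eagain flag are hypothetical names introduced for illustration. Only pinger_recov and -EAGAIN come from the report.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a client import.
 * It is NOT the Lustre struct obd_import. */
struct model_import {
        bool pinger_recov;      /* models the pinger_recov setting */
};

/*
 * Returns true if a failed connect attempt would be retried.
 *
 * rc              - result of the connect attempt (0 or a negative errno)
 * retry_on_eagain - hypothetical switch modelling a client that treats
 *                   -EAGAIN as retryable even when the pinger does not
 *                   drive recovery (an assumption about the fix)
 */
static bool model_will_retry(const struct model_import *imp, int rc,
                             bool retry_on_eagain)
{
        if (rc == 0)
                return false;   /* connected, nothing to retry */

        if (imp->pinger_recov)
                return true;    /* the pinger reconnects later */

        /* Without the pinger, only an explicit -EAGAIN retry helps. */
        return retry_on_eagain && rc == -EAGAIN;
}
```

With pinger_recov set to 0 and retry_on_eagain false, the function returns false for -EAGAIN, which corresponds to the behaviour reported here.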

00002000:00000001:0.0:1459250035.710100:0:56316:0:(ofd_dev.c:2083:ofd_init0()) Process entered
00002000:00000001:0.0:1459250035.772688:0:56316:0:(ofd_dev.c:2221:ofd_init0()) Process leaving (rc=0 : 0 : 0)
00010000:00000001:2.0:1459250035.813892:0:34564:0:(ldlm_lib.c:944:target_handle_connect()) Process leaving via out (rc=18446744073709551605 : -11 : 0xfffffffffffffff5)
00002000:00000001:3.0:1459250035.820015:0:56305:0:(ofd_dev.c:416:ofd_prepare()) Process entered
00002000:00000001:3.0:1459250035.822878:0:56305:0:(ofd_dev.c:452:ofd_prepare()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:1.0:1459250035.820231:0:33635:0:(import.c:985:ptlrpc_connect_interpret()) Process leaving via out (rc=18446744073709551605 : -11 : 0xfffffffffffffff5)
00000100:00080000:1.0:1459250035.820232:0:33635:0:(import.c:1217:ptlrpc_connect_interpret()) ffff88004003d800 lustre-OST0000_UUID: changing import state from CONNECTING to DISCONN
00000100:00080000:1.0:1459250035.820233:0:33635:0:(import.c:1263:ptlrpc_connect_interpret()) recovery of lustre-OST0000_UUID on 192.168.1.34@tcp failed (-11)
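
The rc values in the log are printed three ways: as an unsigned 64-bit number, as a signed number, and in hex. The short sketch below shows that 18446744073709551605 and 0xfffffffffffffff5 are both -11, which is -EAGAIN on Linux. The helper name rc_from_unsigned() is made up for this example.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/*
 * Reinterpret the unsigned 64-bit value printed in the debug log as the
 * signed return code it represents. memcpy keeps the bit pattern and
 * avoids relying on implementation-defined integer conversion.
 */
static int64_t rc_from_unsigned(uint64_t raw)
{
        int64_t rc;

        memcpy(&rc, &raw, sizeof(rc));
        return rc;
}
```

For example, rc_from_unsigned(18446744073709551605ULL) yields -11, matching the "(-11)" in the "recovery of lustre-OST0000_UUID ... failed" line.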


Comments
Comment by Gerrit Updater [ 02/Nov/18 ]

Sergey Cheremencev (c17829@cray.com) uploaded a new patch: https://review.whamcloud.com/33557
Subject: LU-11601 ptlrpc: IR doesn't reconnect after EAGAIN
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e199d5446d5d9beb5bc3686386ad4d12bf78b74d

Comment by Gerrit Updater [ 13/Apr/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33557/
Subject: LU-11601 ptlrpc: IR doesn't reconnect after EAGAIN
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 3341c8c31871ad5bcea914260643bf164194ee9a

Comment by Peter Jones [ 13/Apr/19 ]

Landed for 2.13
