[LU-11476] Account for -ECONNRESET in ksocknal_txlist_done() Created: 04/Oct/18  Updated: 29/Oct/18  Resolved: 29/Oct/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0
Fix Version/s: Lustre 2.12.0

Type: Bug Priority: Minor
Reporter: Sonia Sharma (Inactive) Assignee: Sonia Sharma (Inactive)
Resolution: Fixed Votes: 0
Labels: lnet, lnet-health

Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

ksocknal_txlist_done() does not account for the -ECONNRESET error. It should be handled as a remote failure case.

 

if (tx->tx_hstatus == LNET_MSG_STATUS_OK) {
	if (error == -ETIMEDOUT)
		tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_TIMEOUT;
	else if (error == -ENETDOWN ||
		 error == -EHOSTUNREACH ||
		 error == -ENETUNREACH)
		tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_DROPPED;
	/*
	 * for all other errors we don't want to
	 * retransmit
	 */
	else if (error)
		tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_ERROR;
}

 
As a result, when an interface is brought down on a node that is configured as a multi-rail (MR) peer on another node, lnetctl ping to the down interface fails. With the health feature, the down interface should ideally not be used, and the message should go to the other interface, which is still up.

Accounting for ECONNRESET and updating the tx health status to LNET_MSG_STATUS_REMOTE_DROPPED corrects this behaviour.
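
The proposed mapping can be sketched as a standalone function. This is only an illustration of the idea described above, not the patch that landed: the merged change is not shown in this ticket. The enum hstatus values are hypothetical stand-ins for the LNET_MSG_STATUS_* constants, classify_tx_error() is an invented helper name, and placing the -ECONNRESET check before the catch-all branch is an assumption.

```c
#include <errno.h>

/* Hypothetical stand-ins for the LNET_MSG_STATUS_* values named in the ticket. */
enum hstatus {
	HSTATUS_OK = 0,
	HSTATUS_LOCAL_TIMEOUT,
	HSTATUS_LOCAL_DROPPED,
	HSTATUS_LOCAL_ERROR,
	HSTATUS_REMOTE_DROPPED,
};

/*
 * Sketch of the error-to-health-status mapping with -ECONNRESET added.
 * A status that is already set (not OK) is left untouched, as in the
 * snippet quoted in the description.
 */
static enum hstatus classify_tx_error(enum hstatus current, int error)
{
	if (current != HSTATUS_OK)
		return current;

	if (error == -ETIMEDOUT)
		return HSTATUS_LOCAL_TIMEOUT;

	if (error == -ENETDOWN ||
	    error == -EHOSTUNREACH ||
	    error == -ENETUNREACH)
		return HSTATUS_LOCAL_DROPPED;

	/* Assumed new case: a reset connection counts as a remote failure. */
	if (error == -ECONNRESET)
		return HSTATUS_REMOTE_DROPPED;

	/* For all other errors we don't want to retransmit. */
	if (error)
		return HSTATUS_LOCAL_ERROR;

	return current;
}
```

With this mapping, classify_tx_error(HSTATUS_OK, -ECONNRESET) yields HSTATUS_REMOTE_DROPPED instead of falling through to HSTATUS_LOCAL_ERROR, which is the distinction the health feature needs in order to stop using the down interface.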



 Comments   
Comment by Gerrit Updater [ 04/Oct/18 ]

Sonia Sharma (sharmaso@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33289
Subject: LU-11476 lnd: Update health status for ECONNRESET
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 59b78dd6e89077f6f73751ffb45ea2e5f0ca35af

 

Patch abandoned.

Comment by Gerrit Updater [ 05/Oct/18 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33294
Subject: LU-11476 lnet: set the health status correctly
Project: fs/lustre-release
Branch: multi-rail
Current Patch Set: 1
Commit: 77bca66ffe3e4f30ee39d32b6b8c2c129aa6a550

Comment by Gerrit Updater [ 05/Oct/18 ]

Amir Shehata (ashehata@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33307
Subject: LU-11476 lnet: set the health status correctly
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: ada66907566fd8e595720e3c7b721886dc84833d

Comment by Gerrit Updater [ 29/Oct/18 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33307/
Subject: LU-11476 lnet: set the health status correctly
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 5d77f0d8dc74c752032e449687090ff1360cd32e

Comment by Peter Jones [ 29/Oct/18 ]

Landed for 2.12
