[LU-11476] Account for -ECONNRESET in ksocknal_txlist_done() - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Minor
Fix Version/s: Lustre 2.12.0
Affects Version/s: Lustre 2.12.0
Labels:
- lnet
- lnet-health

Severity: 3
Rank (Obsolete): 9223372036854775807

Description

ksocknal_txlist_done() does not account for the -ECONNRESET error. It should be handled there as a remote failure case. The current error handling is:

	if (tx->tx_hstatus == LNET_MSG_STATUS_OK) {
		if (error == -ETIMEDOUT)
			tx->tx_hstatus =
			  LNET_MSG_STATUS_LOCAL_TIMEOUT;
		else if (error == -ENETDOWN ||
			 error == -EHOSTUNREACH ||
			 error == -ENETUNREACH)
			tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_DROPPED;
		/*
		 * for all other errors we don't want to
		 * retransmit
		 */
		else if (error)
			tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_ERROR;
	}

As a result, when an interface is brought down on a node that is configured as an MR peer on another node, an lnetctl ping to the down interface fails. With the health feature, the down interface should not be used, and the message should instead go to the other interface, which is still up.

Accounting for -ECONNRESET and setting the tx health status to LNET_MSG_STATUS_REMOTE_DROPPED corrects this behaviour.
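The proposed change can be sketched as a standalone function. This is a minimal illustration based only on the description above, not the patch that was landed: the enum is a stand-in whose numeric values are placeholders rather than LNet's real ones, and `sketch_classify_error()` is a hypothetical helper that mirrors the if/else chain in ksocknal_txlist_done() with one extra branch for -ECONNRESET.

```c
#include <errno.h>

/*
 * Stand-in for the LNet health status values named in this ticket.
 * The numeric values here are placeholders, not LNet's definitions.
 */
enum sketch_hstatus {
	LNET_MSG_STATUS_OK = 0,
	LNET_MSG_STATUS_LOCAL_TIMEOUT,
	LNET_MSG_STATUS_LOCAL_DROPPED,
	LNET_MSG_STATUS_LOCAL_ERROR,
	LNET_MSG_STATUS_REMOTE_DROPPED,
};

/*
 * Hypothetical helper: returns the health status a tx would end up
 * with, given its current status and the (negative errno) error.
 */
static enum sketch_hstatus
sketch_classify_error(enum sketch_hstatus current, int error)
{
	/* A status that was already set is left untouched. */
	if (current != LNET_MSG_STATUS_OK)
		return current;

	if (error == -ETIMEDOUT)
		return LNET_MSG_STATUS_LOCAL_TIMEOUT;

	if (error == -ENETDOWN ||
	    error == -EHOSTUNREACH ||
	    error == -ENETUNREACH)
		return LNET_MSG_STATUS_LOCAL_DROPPED;

	/* New branch: a reset connection is treated as a remote failure. */
	if (error == -ECONNRESET)
		return LNET_MSG_STATUS_REMOTE_DROPPED;

	/* For all other errors we don't want to retransmit. */
	if (error)
		return LNET_MSG_STATUS_LOCAL_ERROR;

	return LNET_MSG_STATUS_OK;
}
```

In this sketch the -ECONNRESET check sits before the generic `error` branch, so a reset connection no longer falls through to LNET_MSG_STATUS_LOCAL_ERROR, the status the existing comment associates with not retransmitting.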

People

Assignee: Sonia Sharma (Inactive)

Reporter: Sonia Sharma (Inactive)

Votes: 0

Watchers: 4

Dates

Created: 04/Oct/18 9:24 PM

Updated: 29/Oct/18 4:23 PM

Resolved: 29/Oct/18 4:22 PM