Details
Bug
Resolution: Fixed
Minor
Lustre 2.12.0
3
Description
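In ksocknal_txlist_done(), the ECONNRESET error is not accounted for, so it falls through to the generic error branch and is recorded as LNET_MSG_STATUS_LOCAL_ERROR. It should be handled explicitly as a remote failure.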
	if (tx->tx_hstatus == LNET_MSG_STATUS_OK) {
		if (error == -ETIMEDOUT)
			tx->tx_hstatus =
			  LNET_MSG_STATUS_LOCAL_TIMEOUT;
		else if (error == -ENETDOWN ||
			 error == -EHOSTUNREACH ||
			 error == -ENETUNREACH)
			tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_DROPPED;
		/*
		 * for all other errors we don't want to
		 * retransmit
		 */
		else if (error)
			tx->tx_hstatus = LNET_MSG_STATUS_LOCAL_ERROR;
	}
As a result, when an interface is brought down on a node that is configured as an MR peer on another node, lnetctl ping to the down interface fails. With the health feature, the down interface should ideally be avoided and the message sent to the peer's other interface, which is still up.
Accounting for ECONNRESET and updating the tx health status to LNET_MSG_STATUS_REMOTE_DROPPED corrects this behaviour.
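The proposed mapping can be sketched as a standalone function. This is only an illustration of the idea, not the actual patch: the enum below is a stand-in with placeholder values, and classify_tx_error() is a hypothetical helper that does not exist in the Lustre tree. It assumes ECONNRESET is checked before the catch-all branch.

```c
#include <errno.h>

/*
 * Stand-in for the LNet health status values named in this ticket.
 * The numeric values are placeholders, not those used in Lustre.
 */
enum hstatus_sketch {
	LNET_MSG_STATUS_OK = 0,
	LNET_MSG_STATUS_LOCAL_DROPPED,
	LNET_MSG_STATUS_LOCAL_ERROR,
	LNET_MSG_STATUS_LOCAL_TIMEOUT,
	LNET_MSG_STATUS_REMOTE_DROPPED,
};

/*
 * Hypothetical helper: returns the health status a tx would end up
 * with for a given (negative) errno, mirroring the branch order of
 * the snippet above with one extra case for ECONNRESET.
 */
static enum hstatus_sketch
classify_tx_error(enum hstatus_sketch cur, int error)
{
	/* a status that was already set is left untouched */
	if (cur != LNET_MSG_STATUS_OK)
		return cur;

	if (error == -ETIMEDOUT)
		return LNET_MSG_STATUS_LOCAL_TIMEOUT;

	if (error == -ENETDOWN ||
	    error == -EHOSTUNREACH ||
	    error == -ENETUNREACH)
		return LNET_MSG_STATUS_LOCAL_DROPPED;

	/* assumed new case: connection reset is a remote failure */
	if (error == -ECONNRESET)
		return LNET_MSG_STATUS_REMOTE_DROPPED;

	/* for all other errors we don't want to retransmit */
	if (error)
		return LNET_MSG_STATUS_LOCAL_ERROR;

	return cur;
}
```

With this ordering, classify_tx_error(LNET_MSG_STATUS_OK, -ECONNRESET) yields LNET_MSG_STATUS_REMOTE_DROPPED instead of falling into the catch-all LNET_MSG_STATUS_LOCAL_ERROR branch, while the existing cases keep their current results.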