Details
-
Bug
-
Resolution: Unresolved
-
Minor
Description
I noticed that after the MDS hit Inbound wait, it was failing transactions with REMOTE_TIMEOUT health status:
[Wed Mar 8 11:26:29 2023] LNet: 143125:0:(lib-msg.c:814:lnet_health_check()) Message from 17@kfi to 0@kfi exceeded message deadline by 2 seconds rc=-110 hstatus=REMOTE_TIMEOUT
This status comes from the following code:
static int kfilnd_tn_state_wait_timeout_tag_comp(struct kfilnd_transaction *tn,
                                                 enum tn_events event,
                                                 int status, bool *tn_released)
{
        KFILND_TN_DEBUG(tn, "%s event status %d", tn_event_to_str(event),
                        status);
        switch (event) {
        case TN_EVENT_TAG_RX_CANCEL:
                kfilnd_tn_status_update(tn, -ETIMEDOUT,
                                        LNET_MSG_STATUS_REMOTE_TIMEOUT);
                kfilnd_peer_tn_failed(tn->tn_kp, -ETIMEDOUT);
                break;
Because the timeout is reported as REMOTE_TIMEOUT, the server never realizes that its local NIC is not healthy:
[root@s-lmo-gaz38a ~]# lnetctl net show -v 2 | grep -e nid -e health
- nid: 0@lo
  health stats:
      health value: 1000
- nid: 172.18.2.3@tcp
  health stats:
      health value: 1000
- nid: 17@kfi
  health stats:
      health value: 1000
[root@s-lmo-gaz38a ~]#
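To make this check easier to repeat, the grep above could be turned into a small filter that prints only the NIs whose health value has dropped. This is a rough sketch, not part of Lustre: the function name unhealthy_nids is hypothetical, and the default threshold of 1000 is assumed from the healthy values shown above. It expects `lnetctl net show -v 2` style output on stdin.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Lustre): print "<nid> <health>" for
# every NI whose health value is below the threshold.
# $1 is the threshold; it defaults to 1000, assumed from the output above.
# Input is read from stdin, e.g. the output of `lnetctl net show -v 2`.
unhealthy_nids() {
    awk -v max="${1:-1000}" '
        # Remember the most recent NID line ("- nid: 17@kfi").
        $1 == "-" && $2 == "nid:" { nid = $3 }
        # Report the NID if its "health value:" is below the threshold.
        $1 == "health" && $2 == "value:" && nid != "" {
            if ($3 + 0 < max + 0) print nid, $3
        }
    '
}
```

Usage would be `lnetctl net show -v 2 | unhealthy_nids`. On the server above it would print nothing, since every NI, including 17@kfi, still reports a health value of 1000.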