[LU-16990] Use NETWORK_TIMEOUT message status for TN_EVENT_TAG_RX_CANCEL Created: 27/Jul/23 Updated: 22/Aug/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Chris Horn | Assignee: | Chris Horn |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I noticed that after the MDS hit Inbound wait, it was failing transactions with the REMOTE_TIMEOUT health status:
[Wed Mar 8 11:26:29 2023] LNet: 143125:0:(lib-msg.c:814:lnet_health_check()) Message from 17@kfi to 0@kfi exceeded message deadline by 2 seconds rc=-110 hstatus=REMOTE_TIMEOUT
This comes from the following code:
static int kfilnd_tn_state_wait_timeout_tag_comp(struct kfilnd_transaction *tn,
                                                 enum tn_events event,
                                                 int status, bool *tn_released)
{
        KFILND_TN_DEBUG(tn, "%s event status %d", tn_event_to_str(event),
                        status);
        switch (event) {
        case TN_EVENT_TAG_RX_CANCEL:
                kfilnd_tn_status_update(tn, -ETIMEDOUT,
                                        LNET_MSG_STATUS_REMOTE_TIMEOUT);
                kfilnd_peer_tn_failed(tn->tn_kp, -ETIMEDOUT);
                break;
With the current behavior, the server never realizes that its local NIC is unhealthy:
[root@s-lmo-gaz38a ~]# lnetctl net show -v 2 | grep -e nid -e health
- nid: 0@lo
health stats:
health value: 1000
- nid: 172.18.2.3@tcp
health stats:
health value: 1000
- nid: 17@kfi
health stats:
health value: 1000
[root@s-lmo-gaz38a ~]#
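To illustrate why the choice of status matters, here is a minimal sketch (not the LNet code) of how a health check might decide which health value to lower. It assumes that a REMOTE_TIMEOUT status penalizes only the peer NI, while a NETWORK_TIMEOUT status penalizes both the local NI and the peer NI. All sketch_* names and the penalty of 100 are hypothetical and chosen only for illustration.

```c
#include <assert.h>

/* Hypothetical stand-ins for the LNet health statuses; not the real enum. */
enum sketch_hstatus {
	SKETCH_OK,
	SKETCH_LOCAL_TIMEOUT,
	SKETCH_REMOTE_TIMEOUT,
	SKETCH_NETWORK_TIMEOUT,
};

/* Assumed values, for illustration only. */
#define SKETCH_MAX_HEALTH 1000
#define SKETCH_PENALTY    100

struct sketch_health {
	int local_ni;	/* health value of the local interface */
	int peer_ni;	/* health value of the remote peer interface */
};

/* Lower a health value by one penalty step, never going below zero. */
static int sketch_penalize(int health)
{
	return health > SKETCH_PENALTY ? health - SKETCH_PENALTY : 0;
}

/* Apply the assumed policy: the status decides which side is penalized. */
static void sketch_health_check(struct sketch_health *h,
				enum sketch_hstatus hstatus)
{
	assert(h);

	switch (hstatus) {
	case SKETCH_LOCAL_TIMEOUT:
		/* Failure attributed to the local interface only. */
		h->local_ni = sketch_penalize(h->local_ni);
		break;
	case SKETCH_REMOTE_TIMEOUT:
		/* Failure attributed to the peer only; local NI untouched. */
		h->peer_ni = sketch_penalize(h->peer_ni);
		break;
	case SKETCH_NETWORK_TIMEOUT:
		/* Cause unknown: penalize both sides. */
		h->local_ni = sketch_penalize(h->local_ni);
		h->peer_ni = sketch_penalize(h->peer_ni);
		break;
	case SKETCH_OK:
		break;
	}
}
```

Under these assumptions, reporting REMOTE_TIMEOUT for TN_EVENT_TAG_RX_CANCEL leaves the local NI health value at its starting point, which is consistent with the unchanged values in the lnetctl output above. Reporting NETWORK_TIMEOUT would lower the local NI health value as well.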
|
| Comments |
| Comment by Gerrit Updater [ 27/Jul/23 ] |
|
"Chris Horn <chris.horn@hpe.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/51782 |
| Comment by Gerrit Updater [ 22/Aug/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/51782/ |