[LU-14114] LNet: print device status in net show command Created: 04/Nov/20  Updated: 30/Jul/21  Resolved: 22/Jul/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.15.0

Type: Bug Priority: Minor
Reporter: Amir Shehata (Inactive) Assignee: Cyril Bordage
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807
Epic Link: unlabelled-LU-13422

 Description   

A device can enter a fatal state if the cable is disconnected or the port is brought down on the switch side. In these cases the LND (o2iblnd for now) flags the device as being in a fatal state, and the device is not used any further. However, its health value is not decremented, which causes confusion when examining the state of the node.

It is better to print the device status in the output of the lnetctl net show command.

Basically we need to propagate this value:

 lnet_ni->ni_fatal_error_on

up to user space when we show the net.

lustre_lnet_show_net() and lnet_get_ni_config() need to be modified to propagate and print this value.

There is space in lnet_ioctl_config_ni to add this value without having to add more data structures to the IOCTL API.
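
The propagation described above can be sketched roughly as follows. This is a simplified, self-contained illustration and not the actual Lustre code: only ni_fatal_error_on comes from this ticket, while the sketch_* structs, fields, functions, and the output labels are hypothetical stand-ins for lnet_ni, lnet_ioctl_config_ni, lnet_get_ni_config() and lustre_lnet_show_net().

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the kernel-side NI. Only ni_fatal_error_on
 * is named in the ticket; ni_health is an assumed field name. */
struct sketch_ni {
	uint32_t ni_fatal_error_on; /* set by the LND when the device is unusable */
	int32_t  ni_health;         /* tracks intermittent errors only */
};

/* Hypothetical stand-in for the ioctl payload returned to user space.
 * dev_status models a value placed in already-available space, so no
 * new data structure is added to the IOCTL API. */
struct sketch_ioctl_config_ni {
	char     intf_name[16];
	int32_t  health_value;
	uint32_t dev_status;
};

enum { SKETCH_DEV_UP = 0, SKETCH_DEV_FATAL = 1 };

/* Kernel side (sketch): copy the fatal flag into the ioctl payload. */
static void sketch_get_ni_config(const struct sketch_ni *ni, const char *name,
				 struct sketch_ioctl_config_ni *cfg)
{
	assert(ni != NULL && name != NULL && cfg != NULL);

	memset(cfg, 0, sizeof(*cfg));
	strncpy(cfg->intf_name, name, sizeof(cfg->intf_name) - 1);
	cfg->health_value = ni->ni_health;
	cfg->dev_status = ni->ni_fatal_error_on ? SKETCH_DEV_FATAL
						: SKETCH_DEV_UP;
}

/* User side (sketch): format the payload, including the device status.
 * The labels used here are illustrative, not the real lnetctl output. */
static int sketch_show_net(const struct sketch_ioctl_config_ni *cfg,
			   char *buf, size_t len)
{
	assert(cfg != NULL && buf != NULL);

	return snprintf(buf, len,
			"interface: %s\nhealth value: %d\ndev status: %s\n",
			cfg->intf_name, (int)cfg->health_value,
			cfg->dev_status == SKETCH_DEV_FATAL ? "fatal" : "up");
}
```

In this sketch the health value is passed through unchanged, so a device in fatal state can still report full health while the separate status field shows that it is unusable.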



 Comments   
Comment by Andreas Dilger [ 04/Nov/20 ]

Does it make sense from a usability POV to decrement or immediately set the health down to 0 in this case, so that there is no confusion?

Comment by Amir Shehata (Inactive) [ 04/Nov/20 ]

Health value was meant to track intermittent errors. Fatal is meant to track the case when the HW tells us that the HCA is not usable. I think it's important to maintain the distinction in the code. I understand though that it might be clearer from a user's perspective. So I don't mind setting that value to 0 when we display it in lnetctl/liblnetconfig user space. But I would rather maintain the distinction in the LNet code. Also it will still be ambiguous if we only display health as 0. Does it mean we've had a series of intermittent failures on that NI and we're currently recovering it? Or does it mean it is in fatal state? If we are in fatal state we don't attempt to recover the NI, because we rely on the HW to tell us when it's up again.

Printing out the state of the device in the show output breaks this ambiguity.
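
To illustrate the ambiguity discussed in this comment, here is a minimal sketch of how a display layer could keep the two concepts apart. All names are hypothetical, and the maximum health value of 1000 is an assumption of the sketch rather than something stated in this ticket.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed upper bound for the health value; hypothetical constant. */
#define SKETCH_MAX_HEALTH 1000

enum sketch_ni_state {
	SKETCH_NI_HEALTHY,    /* full health, device usable */
	SKETCH_NI_RECOVERING, /* intermittent failures, recovery in progress */
	SKETCH_NI_FATAL       /* HW reported the device unusable, no recovery */
};

/* Classify an NI from the two independent inputs. The fatal flag wins,
 * because health alone cannot distinguish the two failure cases. */
static enum sketch_ni_state sketch_classify(bool fatal, int health)
{
	assert(health >= 0 && health <= SKETCH_MAX_HEALTH);

	if (fatal)
		return SKETCH_NI_FATAL;
	if (health < SKETCH_MAX_HEALTH)
		return SKETCH_NI_RECOVERING;
	return SKETCH_NI_HEALTHY;
}

/* Optional user-space presentation: show 0 for a fatal device while
 * leaving the stored health value untouched. */
static int sketch_displayed_health(bool fatal, int health)
{
	return fatal ? 0 : health;
}
```

With only sketch_displayed_health(), a fatal device and a fully degraded but recovering NI would both show 0; the separate state from sketch_classify() is what tells them apart.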

Comment by Gerrit Updater [ 07/Jul/21 ]

Cyril Bordage (cbordage@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44169
Subject: LU-14114 lnet: print device status in net show command
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 1807c9fd9ce9d72028c6624ff6a9f0f9c1d1d919

Comment by Gerrit Updater [ 22/Jul/21 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/44169/
Subject: LU-14114 lnet: print device status in net show command
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: f75ff33d9fbefd6995a26693032a32a0ba211b51

Comment by Peter Jones [ 22/Jul/21 ]

Landed for 2.15

Generated at Sat Feb 10 03:06:57 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.