[LU-14114] LNet: print device status in net show command Created: 04/Nov/20 Updated: 30/Jul/21 Resolved: 22/Jul/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.15.0 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Amir Shehata (Inactive) | Assignee: | Cyril Bordage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 | ||||
| Rank (Obsolete): | 9223372036854775807 | ||||
| Epic Link: | unlabelled-LU-13422 | ||||
| Description |
|
A device can enter a fatal state if the cable is disconnected or the port is brought down on the switch side. In these cases, the LND (o2iblnd for now) flags the device as being in a fatal state, and the device is not used any further. However, its health is not decremented, which causes confusion when examining the state of the node. It is better to print the device status in the output of the lnetctl net show command. To do so, the value of lnet_ni->ni_fatal_error_on must be propagated up to user space when the net is shown: lustre_lnet_show_net() and lnet_get_ni_config() need to be modified to propagate and print this value. There is space in lnet_ioctl_config_ni to add this value without having to add more data structures to the IOCTL API. |
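The propagation path described above can be sketched in plain C. This is only an illustrative model, not the code from the landed patch: every `sketch_`-prefixed struct, field, function, and the `device status` label are hypothetical stand-ins, and the real layouts of lnet_ni and lnet_ioctl_config_ni are not reproduced here. The sketch assumes a spare padding byte in the ioctl payload can carry the flag.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel-side NI; only the two fields
 * relevant to this ticket are modelled. */
struct sketch_ni {
	int ni_fatal_error_on;	/* nonzero once the LND flags the device as fatal */
	int ni_healthv;		/* health value; not touched by the fatal flag */
};

/* Hypothetical stand-in for the ioctl payload. The spare padding bytes
 * are an assumption made for this sketch. */
struct sketch_ioctl_config_ni {
	uint32_t cfg_status;
	uint8_t cfg_pad[4];
};

/* Index of the (assumed free) padding byte reused for the device status. */
#define SKETCH_DEV_STATUS_IDX 0

/* Kernel side: copy the fatal flag into the spare byte of the payload. */
static void sketch_get_ni_config(const struct sketch_ni *ni,
				 struct sketch_ioctl_config_ni *cfg)
{
	cfg->cfg_pad[SKETCH_DEV_STATUS_IDX] = ni->ni_fatal_error_on ? 1 : 0;
}

/* User space: translate the byte into a printable label. */
static const char *sketch_dev_status_str(const struct sketch_ioctl_config_ni *cfg)
{
	return cfg->cfg_pad[SKETCH_DEV_STATUS_IDX] ? "fatal" : "up";
}

/* User space: format the line that a show command could print. */
static int sketch_show_net(const struct sketch_ioctl_config_ni *cfg,
			   char *buf, size_t len)
{
	return snprintf(buf, len, "device status: %s\n",
			sketch_dev_status_str(cfg));
}
```

Because the flag travels in an existing spare byte, the size of the payload does not change, which is the property the description relies on when it says no new data structures are needed in the IOCTL API.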
| Comments |
| Comment by Andreas Dilger [ 04/Nov/20 ] |
|
Does it make sense from a usability POV to decrement or immediately set the health down to 0 in this case, so that there is no confusion? |
| Comment by Amir Shehata (Inactive) [ 04/Nov/20 ] |
|
The health value was meant to track intermittent errors. Fatal is meant to track the case when the HW tells us that the HCA is not usable. I think it's important to maintain the distinction in the code. I understand, though, that it might be clearer from a user's perspective. So I don't mind setting that value to 0 when we display it in lnetctl/liblnetconfig user space, but I would rather maintain the distinction in the LNet code. Also, it will still be ambiguous if we only display health as 0. Does it mean we've had a series of intermittent failures on that NI and we're currently recovering it? Or does it mean it is in fatal state? If we are in fatal state we don't attempt to recover the NI, because we rely on the HW to tell us when it's up again. Printing out the state of the device in the show output resolves this ambiguity. |
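The ambiguity discussed in this comment can be illustrated with a small C sketch. It is a hypothetical model of the user-space display logic only: the `sketch_` names, the state labels, and the maximum health of 1000 are assumptions made for illustration, not the liblnetconfig implementation.

```c
#include <stdbool.h>

/* Assumed maximum health value, used only for this sketch. */
#define SKETCH_MAX_HEALTH 1000

/* Hypothetical user-space view of one NI. */
struct sketch_ni_view {
	int health;	/* tracks intermittent errors */
	bool fatal;	/* set when the HW reports the device unusable */
};

/* Option raised in the discussion: show health as 0 when the device is
 * fatal, without changing the value kept inside LNet. */
static int sketch_displayed_health(const struct sketch_ni_view *v)
{
	return v->fatal ? 0 : v->health;
}

/* A separate status label distinguishes the two cases that would
 * otherwise both be displayed as health 0. */
static const char *sketch_ni_state(const struct sketch_ni_view *v)
{
	if (v->fatal)
		return "fatal";		/* no recovery attempted; wait for the HW */
	if (v->health < SKETCH_MAX_HEALTH)
		return "recovering";	/* intermittent failures, recovery in progress */
	return "healthy";
}
```

With this model, an NI in fatal state and an NI whose health has dropped to 0 after repeated intermittent failures both display a health of 0, but their status labels differ, which is the distinction the comment asks to keep visible.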
| Comment by Gerrit Updater [ 07/Jul/21 ] |
|
Cyril Bordage (cbordage@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44169 |
| Comment by Gerrit Updater [ 22/Jul/21 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/44169/ |
| Comment by Peter Jones [ 22/Jul/21 ] |
|
Landed for 2.15 |