
LNet: print device status in net show command

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.15.0
    • None
    • None

    Description

      A device can enter a fatal state if the cable is disconnected or the port is brought down on the switch side. In these cases the LND (o2iblnd for now) flags the device as being in a fatal state, and the device is not used any further. However, its health value is not decremented, which causes confusion when examining the state of the node.

      It is better to print the device status in the output of the lnetctl net show command.

      Basically we need to propagate this value:

       lnet_ni->ni_fatal_error_on

      up to user space when we show the net.

      lustre_lnet_show_net() and lnet_get_ni_config() need to be modified to propagate and print this value.

      There is space in lnet_ioctl_config_ni to add this value without having to add more data structures to the IOCTL API.
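
      The propagation path described above can be sketched roughly as follows. This is a minimal illustration, not the landed patch: the struct layouts, the `dev_fatal` field name, the `mock_` function names and the printed layout are all assumptions made for the example. Only `ni_fatal_error_on` and the roles of lnet_get_ni_config() and lustre_lnet_show_net() come from the ticket.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the kernel-side NI; only the flag named in
 * the ticket is modeled. */
struct mock_lnet_ni {
	char nid[32];
	bool ni_fatal_error_on;
};

/* Hypothetical stand-in for the ioctl payload. The real
 * lnet_ioctl_config_ni layout is not reproduced here; dev_fatal is an
 * assumed name for the spare space the ticket mentions. */
struct mock_ioctl_config_ni {
	char nid[32];
	unsigned int dev_fatal;
};

/* Kernel-side step (the role lnet_get_ni_config() plays in the ticket):
 * copy the fatal flag into the payload handed to user space. */
void mock_get_ni_config(const struct mock_lnet_ni *ni,
			struct mock_ioctl_config_ni *cfg)
{
	assert(ni != NULL && cfg != NULL);
	memset(cfg, 0, sizeof(*cfg));
	snprintf(cfg->nid, sizeof(cfg->nid), "%s", ni->nid);
	cfg->dev_fatal = ni->ni_fatal_error_on ? 1 : 0;
}

/* User-space step (the role lustre_lnet_show_net() plays in the ticket):
 * render the value. The layout is illustrative only and is not the
 * actual lnetctl output. Returns the snprintf() result. */
int mock_show_net(const struct mock_ioctl_config_ni *cfg,
		  char *buf, size_t len)
{
	assert(cfg != NULL && buf != NULL);
	return snprintf(buf, len, "- nid: %s\n  fatal: %s\n",
			cfg->nid, cfg->dev_fatal ? "yes" : "no");
}
```

      Reusing spare space in the existing payload, as the ticket suggests, means the ioctl interface keeps its shape and no new data structure is needed.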

        Activity

          [LU-14114] LNet: print device status in net show command
          pjones Peter Jones added a comment -

          Landed for 2.15

          gerrit Gerrit Updater added a comment -

          Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/44169/
          Subject: LU-14114 lnet: print device status in net show command
          Project: fs/lustre-release
          Branch: master
          Current Patch Set:
          Commit: f75ff33d9fbefd6995a26693032a32a0ba211b51

          gerrit Gerrit Updater added a comment -

          Cyril Bordage (cbordage@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/44169
          Subject: LU-14114 lnet: print device status in net show command
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 1807c9fd9ce9d72028c6624ff6a9f0f9c1d1d919

          ashehata Amir Shehata (Inactive) added a comment -

          The health value was meant to track intermittent errors. Fatal is meant to track the case where the HW tells us that the HCA is not usable. I think it's important to maintain the distinction in the code. I understand, though, that it might be clearer from a user's perspective, so I don't mind setting that value to 0 when we display it in lnetctl/liblnetconfig user space. But I would rather maintain the distinction in the LNet code. Also, it will still be ambiguous if we only display health as 0. Does it mean we've had a series of intermittent failures on that NI and we're currently recovering it? Or does it mean it is in a fatal state? If we are in a fatal state we don't attempt to recover the NI, because we rely on the HW to tell us when it's up again.

          Printing out the state of the device in the show output resolves this ambiguity.
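
          The distinction drawn in this comment can be illustrated with a small sketch. It is not LNet code: the enum, the function names and the `MOCK_MAX_HEALTH` ceiling of 1000 are assumptions chosen for the example. It only shows why a health number alone cannot tell a fatal device apart from one recovering from intermittent errors, while a separate device flag can.

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_MAX_HEALTH 1000	/* assumed ceiling, for illustration only */

/* Hypothetical states a show command could report for an NI. */
enum mock_ni_state {
	MOCK_NI_HEALTHY,	/* full health, device usable */
	MOCK_NI_RECOVERING,	/* health lowered by intermittent errors */
	MOCK_NI_FATAL		/* HW reported the device as unusable */
};

/* Classify an NI from its health value and its fatal flag. The fatal
 * flag is checked first because, per the ticket, a fatal device keeps
 * whatever health value it had, so health alone would be misleading. */
enum mock_ni_state mock_classify_ni(int health, bool fatal)
{
	assert(health >= 0 && health <= MOCK_MAX_HEALTH);
	if (fatal)
		return MOCK_NI_FATAL;
	if (health < MOCK_MAX_HEALTH)
		return MOCK_NI_RECOVERING;
	return MOCK_NI_HEALTHY;
}

/* Map a state to a printable label (labels are hypothetical). */
const char *mock_ni_state_name(enum mock_ni_state state)
{
	switch (state) {
	case MOCK_NI_FATAL:
		return "fatal";
	case MOCK_NI_RECOVERING:
		return "recovering";
	default:
		return "healthy";
	}
}
```

          With this shape, an NI at full health whose device is fatal is still reported as fatal, and an NI at health 0 without the flag is reported as recovering, which are the two cases the comment says a bare health value would conflate.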

          adilger Andreas Dilger added a comment -

          Does it make sense from a usability POV to decrement or immediately set the health down to 0 in this case, so that there is no confusion?

          People

            cbordage Cyril Bordage
            ashehata Amir Shehata (Inactive)