  1. Lustre
  2. LU-16990

Use NETWORK_TIMEOUT message status for TN_EVENT_TAG_RX_CANCEL

Details

    • Bug
    • Resolution: Unresolved
    • Minor

    Description

      I noticed that after the MDS hit Inbound wait, it was failing transactions with the REMOTE_TIMEOUT health status:

      [Wed Mar  8 11:26:29 2023] LNet: 143125:0:(lib-msg.c:814:lnet_health_check()) Message from 17@kfi to 0@kfi exceeded message deadline by 2 seconds rc=-110 hstatus=REMOTE_TIMEOUT
      

      This status is set by the following code:

      static int kfilnd_tn_state_wait_timeout_tag_comp(struct kfilnd_transaction *tn,
                                                       enum tn_events event,
                                                       int status, bool *tn_released)
      {
              KFILND_TN_DEBUG(tn, "%s event status %d", tn_event_to_str(event),
                              status);
      
              switch (event) {
              case TN_EVENT_TAG_RX_CANCEL:
                      kfilnd_tn_status_update(tn, -ETIMEDOUT,
                                              LNET_MSG_STATUS_REMOTE_TIMEOUT);
                      kfilnd_peer_tn_failed(tn->tn_kp, -ETIMEDOUT);
                      break;
      

      With the current behavior, the server never realizes that its local NIC is unhealthy. The health value of 17@kfi stays at 1000:

      [root@s-lmo-gaz38a ~]# lnetctl net show -v 2 | grep -e nid -e health
              - nid: 0@lo
                health stats:
                    health value: 1000
              - nid: 172.18.2.3@tcp
                health stats:
                    health value: 1000
              - nid: 17@kfi
                health stats:
                    health value: 1000
      [root@s-lmo-gaz38a ~]#
      
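To illustrate why the status choice matters, here is a minimal, self-contained toy model. It is not the LNet implementation. It assumes that a REMOTE_TIMEOUT status penalizes only the peer interface, while a NETWORK_TIMEOUT status penalizes both the local and the peer interface, which is the behavior the issue title is relying on. All `model_*` and `MODEL_*` names are hypothetical, and the penalty of 100 is an arbitrary illustrative value.

```c
/* Toy model only: every name below is hypothetical, not an LNet symbol. */

#define MODEL_MAX_HEALTH 1000 /* same starting value as "health value: 1000" */
#define MODEL_PENALTY    100  /* arbitrary illustrative decrement */

enum model_hstatus {
	MODEL_STATUS_REMOTE_TIMEOUT,  /* assumed: blames the peer only */
	MODEL_STATUS_NETWORK_TIMEOUT, /* assumed: blames local NI and peer */
};

struct model_health {
	int local_ni; /* health of the local interface, e.g. 17@kfi */
	int peer_ni;  /* health of the remote peer interface */
};

/* Lower a health value by one penalty step, never going below zero. */
int model_decrement(int health)
{
	health -= MODEL_PENALTY;
	return health < 0 ? 0 : health;
}

/* Apply the assumed health accounting for one failed message. */
void model_health_check(struct model_health *h, enum model_hstatus status)
{
	switch (status) {
	case MODEL_STATUS_REMOTE_TIMEOUT:
		/* Local health is untouched, so a bad local NIC goes unnoticed. */
		h->peer_ni = model_decrement(h->peer_ni);
		break;
	case MODEL_STATUS_NETWORK_TIMEOUT:
		/* Either side may be at fault, so both are penalized. */
		h->local_ni = model_decrement(h->local_ni);
		h->peer_ni = model_decrement(h->peer_ni);
		break;
	}
}
```

Under these assumptions, starting from a `struct model_health` with both fields at `MODEL_MAX_HEALTH`, a call with `MODEL_STATUS_REMOTE_TIMEOUT` leaves `local_ni` at 1000, which matches the unchanged health value in the `lnetctl` output above. A call with `MODEL_STATUS_NETWORK_TIMEOUT` lowers `local_ni` as well.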

          People

            hornc Chris Horn
