[LU-17729] LNET goes down often in AKS deployments

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Fix Version: Lustre 2.17.0
    • None
    • 3

    Description

      LNET goes down every time network interfaces change. With IPv6, this happens whenever a pod is added to or removed from the system while the kubenet networking plugin is in use.

      The dmesg output looks like this:
      [68345.628473] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
      [68345.628503] IPv6: ADDRCONF(NETDEV_CHANGE): vethde7a610d: link becomes ready
      [68345.628535] cbr0: port 1(vethde7a610d) entered blocking state
      [68345.628536] cbr0: port 1(vethde7a610d) entered forwarding state
      [68433.354593] cbr0: port 1(vethde7a610d) entered disabled state
      [68433.357102] device vethde7a610d left promiscuous mode
      [68433.357112] cbr0: port 1(vethde7a610d) entered disabled state
      [68518.266748] LNet: Added LNI 10.224.0.5@tcp [8/256/0/180]    **LNET ACTIVE HERE**
      [68518.266823] LNet: Accept secure, port 988
      [68536.743566] cbr0: port 1(veth3c5b73b5) entered blocking state  **POD CHANGE**
      [68536.743568] cbr0: port 1(veth3c5b73b5) entered disabled state
      [68536.743800] device veth3c5b73b5 entered promiscuous mode
      [68536.749545] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready  **POD CHANGE**
      [68536.749570] IPv6: ADDRCONF(NETDEV_CHANGE): veth3c5b73b5: link becomes ready
      [68536.749593] cbr0: port 1(veth3c5b73b5) entered blocking state
      [68536.749595] cbr0: port 1(veth3c5b73b5) entered forwarding state
      [68767.344431] cbr0: port 1(veth3c5b73b5) entered disabled state
      [68767.349742] device veth3c5b73b5 left promiscuous mode
      [68767.349751] cbr0: port 1(veth3c5b73b5) entered disabled state

      This is the result of ksocknal_handle_link_state_change and ksocknal_handle_inetaddr_change not being network-namespace aware.
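
      To illustrate why namespace awareness matters for handlers like these, here is a minimal userspace sketch. It is a hypothetical model, not the Lustre code and not the patch for this ticket: every type and function name in it (model_netns, model_event, model_ni, handle_event_by_name, handle_event_ns_aware) is invented for illustration. It assumes the failure mode is that an event for a pod's eth0, raised in another network namespace, is matched against LNet's own eth0 by interface name alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical userspace model. None of these types exist in Lustre
 * or in the kernel; they only mimic the shape of the problem. */
struct model_netns {
    int id;                         /* stand-in for a network namespace */
};

struct model_event {
    const struct model_netns *ns;   /* namespace the event originated in */
    const char *ifname;             /* interface name, e.g. "eth0" */
    bool link_up;                   /* new link state carried by the event */
};

struct model_ni {
    const struct model_netns *ns;   /* namespace the LNet interface lives in */
    const char *ifname;             /* interface the NI is bound to */
    bool up;                        /* current NI state */
};

/* Name-only matching: an event from ANY namespace whose interface
 * name matches is applied to the NI. Returns true if it was applied. */
bool handle_event_by_name(struct model_ni *ni, const struct model_event *ev)
{
    assert(ni != NULL && ev != NULL);
    if (strcmp(ni->ifname, ev->ifname) != 0)
        return false;
    ni->up = ev->link_up;
    return true;
}

/* Namespace-aware matching: events from a different namespace are
 * ignored before the interface name is even compared. */
bool handle_event_ns_aware(struct model_ni *ni, const struct model_event *ev)
{
    assert(ni != NULL && ev != NULL);
    if (ni->ns != ev->ns)
        return false;
    if (strcmp(ni->ifname, ev->ifname) != 0)
        return false;
    ni->up = ev->link_up;
    return true;
}
```

      In this model, an event for "eth0" from a pod namespace changes the host NI's state under handle_event_by_name but is ignored by handle_event_ns_aware. In kernel code the analogous comparison would typically be made with dev_net() and net_eq(); whether the actual fix takes that form is not stated in this ticket.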

      Patch to be sent shortly.

        Activity

          elliswilson Ellis Wilson added a comment -

          Superseded by the work accomplished in LU-18644, so closing this as a dupe.

          pjones Peter Jones added a comment -

          As per discussion on the LWG call today, moving tickets that do not appear to be essential to fix version 2.17. If the fix lands before code freeze we will update the fix version to reflect that but we want to focus on activities on the critical path. Please speak up if you think that this issue definitely needs to be fixed before we could issue a 2.16 release.

          gerrit Gerrit Updater added a comment -

          "Ellis Wilson <elliswilson@microsoft.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/54758
          Subject: LU-17729 lnet: LNET goes down often in AKS deployments
          Project: fs/lustre-release
          Branch: master
          Current Patch Set: 1
          Commit: 5f4770f339cc2ec4958bb0a263624ff4b0daa653

          People

            Assignee: elliswilson Ellis Wilson
            Reporter: elliswilson Ellis Wilson