Lustre / LU-12493

Bug when importing peer yaml; Panic/hang on cleanup afterwards


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.13.0
    • Labels: None
    • Severity: 3

    Description

      I see errors that I don't understand when importing peer YAML.

      Here's the test:

      sles15build01:~ # bash -x /bin/clean.sh
      + LUSTRE=/home/hornc/lustre-filesystem
      + LNETCTL=/home/hornc/lustre-filesystem/lnet/utils/lnetctl
      + /home/hornc/lustre-filesystem/lnet/utils/lnetctl lnet unconfigure
      opening /dev/lnet failed: No such file or directory
      hint: the kernel modules may not be loaded
      unconfigure:
          - lnet:
                errno: -2
                descr: "LNet unconfigure error: No such file or directory"
      + rmmod /home/hornc/lustre-filesystem/lnet/klnds/socklnd/ksocklnd.ko
      rmmod: ERROR: Module ksocklnd is not currently loaded
      + rmmod /home/hornc/lustre-filesystem/lnet/lnet/lnet.ko
      rmmod: ERROR: Module lnet is not currently loaded
      + rmmod /home/hornc/lustre-filesystem/libcfs/libcfs/libcfs.ko
      rmmod: ERROR: Module libcfs is not currently loaded
      sles15build01:~ # bash -x /bin/start.sh
      + LUSTRE=/home/hornc/lustre-filesystem
      + LNETCTL=/home/hornc/lustre-filesystem/lnet/utils/lnetctl
      + insmod /home/hornc/lustre-filesystem/libcfs/libcfs/libcfs.ko
      + insmod /home/hornc/lustre-filesystem/lnet/lnet/lnet.ko
      + insmod /home/hornc/lustre-filesystem/lnet/klnds/socklnd/ksocklnd.ko
      + /home/hornc/lustre-filesystem/lnet/utils/lnetctl lnet configure
      sles15build01:~ # cat /tmp/t.txt
      peer:
          - primary nid: 1.1.1.1@o2ib
            Multi-Rail: False
            peer ni:
              - nid: 1.1.1.1@o2ib
              - nid: 2.2.2.2@tcp
              - nid: 3.3.3.3@tcp
              - nid: 4.4.4.4@o2ib
              - nid: 5.5.5.5@o2ib
              - nid: 6.6.6.6@tcp
              - nid: 7.7.7.7@tcp
              - nid: 8.8.8.8@o2ib
          - primary nid: 9.9.9.9@o2ib
            Multi-Rail: True
            peer ni:
              - nid: 9.9.9.9@o2ib
              - nid: 10.10.10.10@o2ib
              - nid: 11.11.11.11@o2ib
              - nid: 12.12.12.12@o2ib
              - nid: 13.13.13.13@tcp
              - nid: 14.14.14.14@o2ib
              - nid: 15.15.15.15@tcp
              - nid: 16.16.16.16@o2ib
          - primary nid: 17.17.17.17@o2ib
            Multi-Rail: True
            peer ni:
              - nid: 17.17.17.17@o2ib
              - nid: 18.18.18.18@o2ib
              - nid: 19.19.19.19@o2ib
              - nid: 20.20.20.20@tcp
              - nid: 21.21.21.21@o2ib
              - nid: 22.22.22.22@o2ib
              - nid: 23.23.23.23@tcp
              - nid: 24.24.24.24@o2ib
          - primary nid: 25.25.25.25@tcp
            Multi-Rail: False
            peer ni:
              - nid: 25.25.25.25@tcp
              - nid: 26.26.26.26@o2ib
              - nid: 27.27.27.27@o2ib
              - nid: 28.28.28.28@tcp
              - nid: 29.29.29.29@o2ib
              - nid: 30.30.30.30@tcp
              - nid: 31.31.31.31@o2ib
              - nid: 32.32.32.32@tcp
          - primary nid: 33.33.33.33@tcp
            Multi-Rail: False
            peer ni:
              - nid: 33.33.33.33@tcp
              - nid: 34.34.34.34@o2ib
              - nid: 35.35.35.35@tcp
              - nid: 36.36.36.36@o2ib
              - nid: 37.37.37.37@o2ib
              - nid: 38.38.38.38@tcp
              - nid: 39.39.39.39@o2ib
              - nid: 40.40.40.40@o2ib
      sles15build01:~ # /home/hornc/lustre-filesystem/lnet/utils/lnetctl import < /tmp/t.txt
      add:
          - peer_ni:
                errno: -76
                descr: "cannot add peer ni: Name not unique on network"
          - peer_ni:
                errno: 0
                descr: "Success"
          - peer_ni:
                errno: 0
                descr: "Success"
          - peer_ni:
                errno: -76
                descr: "cannot add peer ni: Name not unique on network"
          - peer_ni:
                errno: -76
                descr: "cannot add peer ni: Name not unique on network"
      sles15build01:~ #
      

      Every nid in the file is unique.
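
      The uniqueness claim can be checked mechanically. The following is a minimal sketch, not part of lnetctl: the helper names `summarize_peers` and `duplicate_nids` are hypothetical, the parser is a line-based regex match that only understands the layout of the file shown above, and `SAMPLE` is an abbreviated copy of /tmp/t.txt. It lists any NID that appears more than once under "peer ni" and prints each peer's Multi-Rail setting. Read against the import output above, and assuming lnetctl reports results in file order (not confirmed here), the three -76 results line up with the three "Multi-Rail: False" peers and the two successes with the "Multi-Rail: True" peers, which may be a useful lead.

```python
import re
import sys
from collections import Counter

# Abbreviated copy of /tmp/t.txt from the report (first two peers, two NIs each).
SAMPLE = """\
peer:
    - primary nid: 1.1.1.1@o2ib
      Multi-Rail: False
      peer ni:
        - nid: 1.1.1.1@o2ib
        - nid: 2.2.2.2@tcp
    - primary nid: 9.9.9.9@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 9.9.9.9@o2ib
        - nid: 10.10.10.10@o2ib
"""


def summarize_peers(text):
    """Return one dict per peer: primary NID, Multi-Rail flag, peer NI list."""
    peers = []
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"- primary nid:\s*(\S+)", line)
        if m:
            peers.append({"primary": m.group(1), "multi_rail": None, "nids": []})
            continue
        if not peers:
            continue  # skip everything before the first peer entry
        m = re.match(r"Multi-Rail:\s*(\S+)", line)
        if m:
            peers[-1]["multi_rail"] = m.group(1).lower() == "true"
            continue
        m = re.match(r"- nid:\s*(\S+)", line)
        if m:
            peers[-1]["nids"].append(m.group(1))
    return peers


def duplicate_nids(peers):
    """Return the sorted NIDs that occur more than once across all peer NI lists."""
    counts = Counter(nid for p in peers for nid in p["nids"])
    return sorted(nid for nid, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Pass a path (e.g. /tmp/t.txt) to check a real file; otherwise use SAMPLE.
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as f:
            text = f.read()
    else:
        text = SAMPLE
    peers = summarize_peers(text)
    print("duplicate NIDs:", duplicate_nids(peers) or "none")
    for p in peers:
        print(p["primary"], "Multi-Rail:", p["multi_rail"], "peer NIs:", len(p["nids"]))
```

      An empty duplicate list for the full file would support the statement that -76 ("Name not unique on network") is not caused by a repeated NID in the input.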

      When I try to clean up after doing the above, the node either hangs or crashes:

      sles15build01:~ # bash -x /bin/clean.sh
      + LUSTRE=/home/hornc/lustre-filesystem
      + LNETCTL=/home/hornc/lustre-filesystem/lnet/utils/lnetctl
      + /home/hornc/lustre-filesystem/lnet/utils/lnetctl lnet unconfigure
      ^^Hangs
      

      I haven't been able to get a crash dump.

      Seen with 2.13, but I'd bet money it affects 2.10+.


          People

            Assignee: WC Triage (wc-triage)
            Reporter: Chris Horn (hornc)