[LU-12493] Bug when importing peer yaml; Panic/hang on cleanup afterwards Created: 29/Jun/19  Updated: 29/Jun/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Chris Horn Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: None

Severity: 3

 Description   

I see errors that I don't understand when importing peer YAML.

Here's the test:

sles15build01:~ # bash -x /bin/clean.sh
+ LUSTRE=/home/hornc/lustre-filesystem
+ LNETCTL=/home/hornc/lustre-filesystem/lnet/utils/lnetctl
+ /home/hornc/lustre-filesystem/lnet/utils/lnetctl lnet unconfigure
opening /dev/lnet failed: No such file or directory
hint: the kernel modules may not be loaded
unconfigure:
    - lnet:
          errno: -2
          descr: "LNet unconfigure error: No such file or directory"
+ rmmod /home/hornc/lustre-filesystem/lnet/klnds/socklnd/ksocklnd.ko
rmmod: ERROR: Module ksocklnd is not currently loaded
+ rmmod /home/hornc/lustre-filesystem/lnet/lnet/lnet.ko
rmmod: ERROR: Module lnet is not currently loaded
+ rmmod /home/hornc/lustre-filesystem/libcfs/libcfs/libcfs.ko
rmmod: ERROR: Module libcfs is not currently loaded
sles15build01:~ # bash -x /bin/start.sh
+ LUSTRE=/home/hornc/lustre-filesystem
+ LNETCTL=/home/hornc/lustre-filesystem/lnet/utils/lnetctl
+ insmod /home/hornc/lustre-filesystem/libcfs/libcfs/libcfs.ko
+ insmod /home/hornc/lustre-filesystem/lnet/lnet/lnet.ko
+ insmod /home/hornc/lustre-filesystem/lnet/klnds/socklnd/ksocklnd.ko
+ /home/hornc/lustre-filesystem/lnet/utils/lnetctl lnet configure
sles15build01:~ # cat /tmp/t.txt
peer:
    - primary nid: 1.1.1.1@o2ib
      Multi-Rail: False
      peer ni:
        - nid: 1.1.1.1@o2ib
        - nid: 2.2.2.2@tcp
        - nid: 3.3.3.3@tcp
        - nid: 4.4.4.4@o2ib
        - nid: 5.5.5.5@o2ib
        - nid: 6.6.6.6@tcp
        - nid: 7.7.7.7@tcp
        - nid: 8.8.8.8@o2ib
    - primary nid: 9.9.9.9@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 9.9.9.9@o2ib
        - nid: 10.10.10.10@o2ib
        - nid: 11.11.11.11@o2ib
        - nid: 12.12.12.12@o2ib
        - nid: 13.13.13.13@tcp
        - nid: 14.14.14.14@o2ib
        - nid: 15.15.15.15@tcp
        - nid: 16.16.16.16@o2ib
    - primary nid: 17.17.17.17@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 17.17.17.17@o2ib
        - nid: 18.18.18.18@o2ib
        - nid: 19.19.19.19@o2ib
        - nid: 20.20.20.20@tcp
        - nid: 21.21.21.21@o2ib
        - nid: 22.22.22.22@o2ib
        - nid: 23.23.23.23@tcp
        - nid: 24.24.24.24@o2ib
    - primary nid: 25.25.25.25@tcp
      Multi-Rail: False
      peer ni:
        - nid: 25.25.25.25@tcp
        - nid: 26.26.26.26@o2ib
        - nid: 27.27.27.27@o2ib
        - nid: 28.28.28.28@tcp
        - nid: 29.29.29.29@o2ib
        - nid: 30.30.30.30@tcp
        - nid: 31.31.31.31@o2ib
        - nid: 32.32.32.32@tcp
    - primary nid: 33.33.33.33@tcp
      Multi-Rail: False
      peer ni:
        - nid: 33.33.33.33@tcp
        - nid: 34.34.34.34@o2ib
        - nid: 35.35.35.35@tcp
        - nid: 36.36.36.36@o2ib
        - nid: 37.37.37.37@o2ib
        - nid: 38.38.38.38@tcp
        - nid: 39.39.39.39@o2ib
        - nid: 40.40.40.40@o2ib
sles15build01:~ # /home/hornc/lustre-filesystem/lnet/utils/lnetctl import < /tmp/t.txt
add:
    - peer_ni:
          errno: -76
          descr: "cannot add peer ni: Name not unique on network"
    - peer_ni:
          errno: 0
          descr: "Success"
    - peer_ni:
          errno: 0
          descr: "Success"
    - peer_ni:
          errno: -76
          descr: "cannot add peer ni: Name not unique on network"
    - peer_ni:
          errno: -76
          descr: "cannot add peer ni: Name not unique on network"
sles15build01:~ #

Every nid in the file is unique.
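
The uniqueness claim can be checked mechanically. The following is a minimal sketch, assuming the file keeps the layout shown above; the helper names `peer_ni_nids` and `duplicate_nids` are made up for illustration, and `SAMPLE` is only the first peer from /tmp/t.txt (in practice, read the whole file instead).

```python
import re
from collections import Counter

# First peer from /tmp/t.txt, copied from the report; read the real file in practice.
SAMPLE = """\
peer:
    - primary nid: 1.1.1.1@o2ib
      Multi-Rail: False
      peer ni:
        - nid: 1.1.1.1@o2ib
        - nid: 2.2.2.2@tcp
        - nid: 3.3.3.3@tcp
        - nid: 4.4.4.4@o2ib
        - nid: 5.5.5.5@o2ib
        - nid: 6.6.6.6@tcp
        - nid: 7.7.7.7@tcp
        - nid: 8.8.8.8@o2ib
"""


def peer_ni_nids(text):
    """Return the value of every '- nid:' list entry, in file order.

    'primary nid:' lines are not matched, so each peer's primary NID is
    counted once (through its own '- nid:' entry).
    """
    return re.findall(r"^\s*-\s+nid:\s*(\S+)\s*$", text, flags=re.MULTILINE)


def duplicate_nids(text):
    """Return the sorted list of NIDs that appear more than once."""
    counts = Counter(peer_ni_nids(text))
    return sorted(nid for nid, n in counts.items() if n > 1)


if __name__ == "__main__":
    # An empty list means no NID is repeated in the input.
    print(duplicate_nids(SAMPLE))
```

An empty result only shows that no NID string is repeated; it says nothing about how many NIDs share the same network.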

When I try to clean up after doing the above, the node either hangs or crashes:

sles15build01:~ # bash -x /bin/clean.sh
+ LUSTRE=/home/hornc/lustre-filesystem
+ LNETCTL=/home/hornc/lustre-filesystem/lnet/utils/lnetctl
+ /home/hornc/lustre-filesystem/lnet/utils/lnetctl lnet unconfigure
^^Hangs

I haven't been able to get a crash dump.

Seen with 2.13, but I'd bet money it affects 2.10+ as well.



 Comments   
Comment by Chris Horn [ 29/Jun/19 ]

Ah, we're hitting this code:

        /*
         * Get the peer_net. Check that we're not adding a second
         * peer_ni on a peer_net of a non-multi-rail peer.
         */
        lpn = lnet_peer_get_net_locked(lp, LNET_NIDNET(nid));
        if (!lpn) {
                lpn = lnet_peer_net_alloc(LNET_NIDNET(nid));
                if (!lpn) {
                        rc = -ENOMEM;
                        goto out_free_lpni;
                }
        } else if (!(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
                rc = -ENOTUNIQ;
                goto out_free_lpni;
        }

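It would be nice to have a better error message: -ENOTUNIQ is reported as "Name not unique on network", which reads as though a nid were duplicated. I'm still not sure why we're hanging/crashing on cleanup.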
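
To see what the check quoted above does to the first peer in /tmp/t.txt, here is a toy userspace model of it. This is a sketch, not LNet code: `Peer`, `nid_net`, `add_peer_ni` and `import_peer` are hypothetical Python stand-ins, the network is taken to be whatever follows the `@` in the NID string, and skipping the primary NID when it reappears in the `peer ni` list is an assumption about how the import behaves.

```python
ENOTUNIQ = 76  # Linux errno value; matches "errno: -76" in the lnetctl output


def nid_net(nid):
    """Network part of a NID string, e.g. '2.2.2.2@tcp' -> 'tcp'."""
    return nid.split("@", 1)[1]


class Peer:
    """Toy stand-in for a peer: NIDs grouped by network, plus the MR flag."""

    def __init__(self, primary_nid, multi_rail):
        self.multi_rail = multi_rail
        # The primary NID creates the first peer_net.
        self.nets = {nid_net(primary_nid): [primary_nid]}


def add_peer_ni(peer, nid):
    """Mirror the quoted branch: 0 on success, -ENOTUNIQ if refused."""
    net = nid_net(nid)
    if net not in peer.nets:
        peer.nets[net] = []        # new peer_net, as with lnet_peer_net_alloc()
    elif not peer.multi_rail:
        return -ENOTUNIQ           # second peer_ni on a peer_net of a non-MR peer
    peer.nets[net].append(nid)
    return 0


def import_peer(primary_nid, multi_rail, nids):
    """Add every non-primary NID in order and record each return code."""
    peer = Peer(primary_nid, multi_rail)
    results = [(nid, add_peer_ni(peer, nid)) for nid in nids if nid != primary_nid]
    return peer, results


if __name__ == "__main__":
    first_peer = ["1.1.1.1@o2ib", "2.2.2.2@tcp", "3.3.3.3@tcp", "4.4.4.4@o2ib",
                  "5.5.5.5@o2ib", "6.6.6.6@tcp", "7.7.7.7@tcp", "8.8.8.8@o2ib"]
    peer, results = import_peer("1.1.1.1@o2ib", False, first_peer)
    for nid, rc in results:
        print(nid, rc)
    print(peer.nets)
```

In this model the non-MR peer accepts 2.2.2.2@tcp, because it opens a new peer_net, and refuses 3.3.3.3@tcp and every later NID, because both networks are already populated. So the -76 is about sharing a network within a non-MR peer, not about a duplicated NID.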

Comment by Chris Horn [ 29/Jun/19 ]

So if we're supposed to be prevented from adding secondary nids to non-MR peers, then that doesn't seem to be working correctly. After the same import, lnetctl peer show lists one secondary nid on each non-MR peer:

sles15build01:/home/hornc/lustre-filesystem/lustre/tests # cat /tmp/t.txt
peer:
    - primary nid: 1.1.1.1@o2ib
      Multi-Rail: False
      peer ni:
        - nid: 1.1.1.1@o2ib
        - nid: 2.2.2.2@tcp
        - nid: 3.3.3.3@tcp
        - nid: 4.4.4.4@o2ib
        - nid: 5.5.5.5@o2ib
        - nid: 6.6.6.6@tcp
        - nid: 7.7.7.7@tcp
        - nid: 8.8.8.8@o2ib
    - primary nid: 9.9.9.9@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 9.9.9.9@o2ib
        - nid: 10.10.10.10@o2ib
        - nid: 11.11.11.11@o2ib
        - nid: 12.12.12.12@o2ib
        - nid: 13.13.13.13@tcp
        - nid: 14.14.14.14@o2ib
        - nid: 15.15.15.15@tcp
        - nid: 16.16.16.16@o2ib
    - primary nid: 17.17.17.17@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 17.17.17.17@o2ib
        - nid: 18.18.18.18@o2ib
        - nid: 19.19.19.19@o2ib
        - nid: 20.20.20.20@tcp
        - nid: 21.21.21.21@o2ib
        - nid: 22.22.22.22@o2ib
        - nid: 23.23.23.23@tcp
        - nid: 24.24.24.24@o2ib
    - primary nid: 25.25.25.25@tcp
      Multi-Rail: False
      peer ni:
        - nid: 25.25.25.25@tcp
        - nid: 26.26.26.26@o2ib
        - nid: 27.27.27.27@o2ib
        - nid: 28.28.28.28@tcp
        - nid: 29.29.29.29@o2ib
        - nid: 30.30.30.30@tcp
        - nid: 31.31.31.31@o2ib
        - nid: 32.32.32.32@tcp
    - primary nid: 33.33.33.33@tcp
      Multi-Rail: False
      peer ni:
        - nid: 33.33.33.33@tcp
        - nid: 34.34.34.34@o2ib
        - nid: 35.35.35.35@tcp
        - nid: 36.36.36.36@o2ib
        - nid: 37.37.37.37@o2ib
        - nid: 38.38.38.38@tcp
        - nid: 39.39.39.39@o2ib
        - nid: 40.40.40.40@o2ib
sles15build01:/home/hornc/lustre-filesystem/lustre/tests # lnetctl import < /tmp/t.txt
add:
    - peer_ni:
          errno: -76
          descr: "cannot add peer ni: Name not unique on network"
    - peer_ni:
          errno: 0
          descr: "Success"
    - peer_ni:
          errno: 0
          descr: "Success"
    - peer_ni:
          errno: -76
          descr: "cannot add peer ni: Name not unique on network"
    - peer_ni:
          errno: -76
          descr: "cannot add peer ni: Name not unique on network"
sles15build01:/home/hornc/lustre-filesystem/lustre/tests # lnetctl peer show
peer:
    - primary nid: 25.25.25.25@tcp
      Multi-Rail: False
      peer ni:
        - nid: 25.25.25.25@tcp
          state: NA
        - nid: 26.26.26.26@o2ib
          state: NA
    - primary nid: 9.9.9.9@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 9.9.9.9@o2ib
          state: NA
        - nid: 10.10.10.10@o2ib
          state: NA
        - nid: 11.11.11.11@o2ib
          state: NA
        - nid: 12.12.12.12@o2ib
          state: NA
        - nid: 14.14.14.14@o2ib
          state: NA
        - nid: 16.16.16.16@o2ib
          state: NA
        - nid: 13.13.13.13@tcp
          state: NA
        - nid: 15.15.15.15@tcp
          state: NA
    - primary nid: 1.1.1.1@o2ib
      Multi-Rail: False
      peer ni:
        - nid: 1.1.1.1@o2ib
          state: NA
        - nid: 2.2.2.2@tcp
          state: NA
    - primary nid: 17.17.17.17@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 17.17.17.17@o2ib
          state: NA
        - nid: 18.18.18.18@o2ib
          state: NA
        - nid: 19.19.19.19@o2ib
          state: NA
        - nid: 21.21.21.21@o2ib
          state: NA
        - nid: 22.22.22.22@o2ib
          state: NA
        - nid: 24.24.24.24@o2ib
          state: NA
        - nid: 20.20.20.20@tcp
          state: NA
        - nid: 23.23.23.23@tcp
          state: NA
    - primary nid: 33.33.33.33@tcp
      Multi-Rail: False
      peer ni:
        - nid: 33.33.33.33@tcp
          state: NA
        - nid: 34.34.34.34@o2ib
          state: NA
sles15build01:/home/hornc/lustre-filesystem/lustre/tests #