Details
Type: Bug
Resolution: Unresolved
Priority: Major
Affects Version: Lustre 2.13.0
Severity: 3
Description
I see errors that I don't understand when importing peer YAML.
Here's the test:
sles15build01:~ # bash -x /bin/clean.sh
+ LUSTRE=/home/hornc/lustre-filesystem
+ LNETCTL=/home/hornc/lustre-filesystem/lnet/utils/lnetctl
+ /home/hornc/lustre-filesystem/lnet/utils/lnetctl lnet unconfigure
opening /dev/lnet failed: No such file or directory
hint: the kernel modules may not be loaded
unconfigure:
    - lnet:
          errno: -2
          descr: "LNet unconfigure error: No such file or directory"
+ rmmod /home/hornc/lustre-filesystem/lnet/klnds/socklnd/ksocklnd.ko
rmmod: ERROR: Module ksocklnd is not currently loaded
+ rmmod /home/hornc/lustre-filesystem/lnet/lnet/lnet.ko
rmmod: ERROR: Module lnet is not currently loaded
+ rmmod /home/hornc/lustre-filesystem/libcfs/libcfs/libcfs.ko
rmmod: ERROR: Module libcfs is not currently loaded
sles15build01:~ # bash -x /bin/start.sh
+ LUSTRE=/home/hornc/lustre-filesystem
+ LNETCTL=/home/hornc/lustre-filesystem/lnet/utils/lnetctl
+ insmod /home/hornc/lustre-filesystem/libcfs/libcfs/libcfs.ko
+ insmod /home/hornc/lustre-filesystem/lnet/lnet/lnet.ko
+ insmod /home/hornc/lustre-filesystem/lnet/klnds/socklnd/ksocklnd.ko
+ /home/hornc/lustre-filesystem/lnet/utils/lnetctl lnet configure
sles15build01:~ # cat /tmp/t.txt
peer:
    - primary nid: 1.1.1.1@o2ib
      Multi-Rail: False
      peer ni:
        - nid: 1.1.1.1@o2ib
        - nid: 2.2.2.2@tcp
        - nid: 3.3.3.3@tcp
        - nid: 4.4.4.4@o2ib
        - nid: 5.5.5.5@o2ib
        - nid: 6.6.6.6@tcp
        - nid: 7.7.7.7@tcp
        - nid: 8.8.8.8@o2ib
    - primary nid: 9.9.9.9@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 9.9.9.9@o2ib
        - nid: 10.10.10.10@o2ib
        - nid: 11.11.11.11@o2ib
        - nid: 12.12.12.12@o2ib
        - nid: 13.13.13.13@tcp
        - nid: 14.14.14.14@o2ib
        - nid: 15.15.15.15@tcp
        - nid: 16.16.16.16@o2ib
    - primary nid: 17.17.17.17@o2ib
      Multi-Rail: True
      peer ni:
        - nid: 17.17.17.17@o2ib
        - nid: 18.18.18.18@o2ib
        - nid: 19.19.19.19@o2ib
        - nid: 20.20.20.20@tcp
        - nid: 21.21.21.21@o2ib
        - nid: 22.22.22.22@o2ib
        - nid: 23.23.23.23@tcp
        - nid: 24.24.24.24@o2ib
    - primary nid: 25.25.25.25@tcp
      Multi-Rail: False
      peer ni:
        - nid: 25.25.25.25@tcp
        - nid: 26.26.26.26@o2ib
        - nid: 27.27.27.27@o2ib
        - nid: 28.28.28.28@tcp
        - nid: 29.29.29.29@o2ib
        - nid: 30.30.30.30@tcp
        - nid: 31.31.31.31@o2ib
        - nid: 32.32.32.32@tcp
    - primary nid: 33.33.33.33@tcp
      Multi-Rail: False
      peer ni:
        - nid: 33.33.33.33@tcp
        - nid: 34.34.34.34@o2ib
        - nid: 35.35.35.35@tcp
        - nid: 36.36.36.36@o2ib
        - nid: 37.37.37.37@o2ib
        - nid: 38.38.38.38@tcp
        - nid: 39.39.39.39@o2ib
        - nid: 40.40.40.40@o2ib
sles15build01:~ # /home/hornc/lustre-filesystem/lnet/utils/lnetctl import < /tmp/t.txt
add:
    - peer_ni:
          errno: -76
          descr: "cannot add peer ni: Name not unique on network"
    - peer_ni:
          errno: 0
          descr: "Success"
    - peer_ni:
          errno: 0
          descr: "Success"
    - peer_ni:
          errno: -76
          descr: "cannot add peer ni: Name not unique on network"
    - peer_ni:
          errno: -76
          descr: "cannot add peer ni: Name not unique on network"
sles15build01:~ #
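For reference, the negative errno values in the output above can be mapped back to their symbolic names. The following is a minimal sketch, assuming it runs on a Linux host, where 2 is ENOENT and 76 is ENOTUNIQ (whose standard message is "Name not unique on network"). The helper name `decode_errno` is made up for illustration and is not part of lnetctl.

```python
import errno
import os


def decode_errno(value):
    """Map a (possibly negative) errno value to (symbolic name, message).

    lnetctl reports kernel-style negative values, so the sign is dropped
    before the lookup. The numeric-to-name mapping is platform specific.
    """
    code = abs(value)
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


# The two non-zero values that appear in the sessions above.
for reported in (-2, -76):
    name, message = decode_errno(reported)
    print(reported, name, message)
```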
Every nid in the file is unique.
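The uniqueness claim can be checked mechanically. Below is a minimal sketch that counts only the `- nid:` entries, since each primary nid is also listed once under its own `peer ni:` list. The function name `duplicate_peer_nids` and the inline `SAMPLE` (an excerpt of the first peer, with indentation assumed) are illustrative only; the parsing is plain string matching, not a real YAML parser.

```python
from collections import Counter

# Excerpt of the first peer from /tmp/t.txt (indentation is an assumption).
SAMPLE = """\
peer:
    - primary nid: 1.1.1.1@o2ib
      Multi-Rail: False
      peer ni:
        - nid: 1.1.1.1@o2ib
        - nid: 2.2.2.2@tcp
        - nid: 3.3.3.3@tcp
"""


def duplicate_peer_nids(yaml_text):
    """Return the sorted NIDs that occur more than once as a '- nid:' entry."""
    nids = []
    for line in yaml_text.splitlines():
        stripped = line.strip()
        # '- primary nid:' lines do not match this prefix and are skipped.
        if stripped.startswith("- nid:"):
            nids.append(stripped.split(":", 1)[1].strip())
    counts = Counter(nids)
    return sorted(nid for nid, seen in counts.items() if seen > 1)


print("duplicate peer ni NIDs:", duplicate_peer_nids(SAMPLE) or "none")
```

To check the real file, pass its contents instead of `SAMPLE`, for example `duplicate_peer_nids(open("/tmp/t.txt").read())`.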
When I try to clean up after doing the above, the node either hangs or crashes:
sles15build01:~ # bash -x /bin/clean.sh
+ LUSTRE=/home/hornc/lustre-filesystem
+ LNETCTL=/home/hornc/lustre-filesystem/lnet/utils/lnetctl
+ /home/hornc/lustre-filesystem/lnet/utils/lnetctl lnet unconfigure
^^Hangs
I haven't been able to get a crash dump.
Seen with 2.13, but I'd bet money it affects 2.10+.