[LU-12493] Bug when importing peer yaml; Panic/hang on cleanup afterwards Created: 29/Jun/19 Updated: 29/Jun/19 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.13.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Chris Horn | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I see errors when importing peer yaml that I don't understand. Here's the test:
sles15build01:~ # bash -x /bin/clean.sh
+ LUSTRE=/home/hornc/lustre-filesystem
+ LNETCTL=/home/hornc/lustre-filesystem/lnet/utils/lnetctl
+ /home/hornc/lustre-filesystem/lnet/utils/lnetctl lnet unconfigure
opening /dev/lnet failed: No such file or directory
hint: the kernel modules may not be loaded
unconfigure:
- lnet:
errno: -2
descr: "LNet unconfigure error: No such file or directory"
+ rmmod /home/hornc/lustre-filesystem/lnet/klnds/socklnd/ksocklnd.ko
rmmod: ERROR: Module ksocklnd is not currently loaded
+ rmmod /home/hornc/lustre-filesystem/lnet/lnet/lnet.ko
rmmod: ERROR: Module lnet is not currently loaded
+ rmmod /home/hornc/lustre-filesystem/libcfs/libcfs/libcfs.ko
rmmod: ERROR: Module libcfs is not currently loaded
sles15build01:~ # bash -x /bin/start.sh
+ LUSTRE=/home/hornc/lustre-filesystem
+ LNETCTL=/home/hornc/lustre-filesystem/lnet/utils/lnetctl
+ insmod /home/hornc/lustre-filesystem/libcfs/libcfs/libcfs.ko
+ insmod /home/hornc/lustre-filesystem/lnet/lnet/lnet.ko
+ insmod /home/hornc/lustre-filesystem/lnet/klnds/socklnd/ksocklnd.ko
+ /home/hornc/lustre-filesystem/lnet/utils/lnetctl lnet configure
sles15build01:~ # cat /tmp/t.txt
peer:
- primary nid: 1.1.1.1@o2ib
Multi-Rail: False
peer ni:
- nid: 1.1.1.1@o2ib
- nid: 2.2.2.2@tcp
- nid: 3.3.3.3@tcp
- nid: 4.4.4.4@o2ib
- nid: 5.5.5.5@o2ib
- nid: 6.6.6.6@tcp
- nid: 7.7.7.7@tcp
- nid: 8.8.8.8@o2ib
- primary nid: 9.9.9.9@o2ib
Multi-Rail: True
peer ni:
- nid: 9.9.9.9@o2ib
- nid: 10.10.10.10@o2ib
- nid: 11.11.11.11@o2ib
- nid: 12.12.12.12@o2ib
- nid: 13.13.13.13@tcp
- nid: 14.14.14.14@o2ib
- nid: 15.15.15.15@tcp
- nid: 16.16.16.16@o2ib
- primary nid: 17.17.17.17@o2ib
Multi-Rail: True
peer ni:
- nid: 17.17.17.17@o2ib
- nid: 18.18.18.18@o2ib
- nid: 19.19.19.19@o2ib
- nid: 20.20.20.20@tcp
- nid: 21.21.21.21@o2ib
- nid: 22.22.22.22@o2ib
- nid: 23.23.23.23@tcp
- nid: 24.24.24.24@o2ib
- primary nid: 25.25.25.25@tcp
Multi-Rail: False
peer ni:
- nid: 25.25.25.25@tcp
- nid: 26.26.26.26@o2ib
- nid: 27.27.27.27@o2ib
- nid: 28.28.28.28@tcp
- nid: 29.29.29.29@o2ib
- nid: 30.30.30.30@tcp
- nid: 31.31.31.31@o2ib
- nid: 32.32.32.32@tcp
- primary nid: 33.33.33.33@tcp
Multi-Rail: False
peer ni:
- nid: 33.33.33.33@tcp
- nid: 34.34.34.34@o2ib
- nid: 35.35.35.35@tcp
- nid: 36.36.36.36@o2ib
- nid: 37.37.37.37@o2ib
- nid: 38.38.38.38@tcp
- nid: 39.39.39.39@o2ib
- nid: 40.40.40.40@o2ib
sles15build01:~ # /home/hornc/lustre-filesystem/lnet/utils/lnetctl import < /tmp/t.txt
add:
- peer_ni:
errno: -76
descr: "cannot add peer ni: Name not unique on network"
- peer_ni:
errno: 0
descr: "Success"
- peer_ni:
errno: 0
descr: "Success"
- peer_ni:
errno: -76
descr: "cannot add peer ni: Name not unique on network"
- peer_ni:
errno: -76
descr: "cannot add peer ni: Name not unique on network"
sles15build01:~ #
Every nid in the file is unique. When I try to clean up after doing the above, the node either hangs or crashes:
sles15build01:~ # bash -x /bin/clean.sh
+ LUSTRE=/home/hornc/lustre-filesystem
+ LNETCTL=/home/hornc/lustre-filesystem/lnet/utils/lnetctl
+ /home/hornc/lustre-filesystem/lnet/utils/lnetctl lnet unconfigure
^^Hangs
I haven't been able to get a crash dump. Seen with 2.13, but I'd bet money it affects 2.10+. |
| Comments |
| Comment by Chris Horn [ 29/Jun/19 ] |
|
Ah, we're hitting this code:
	/*
	 * Get the peer_net. Check that we're not adding a second
	 * peer_ni on a peer_net of a non-multi-rail peer.
	 */
	lpn = lnet_peer_get_net_locked(lp, LNET_NIDNET(nid));
	if (!lpn) {
		lpn = lnet_peer_net_alloc(LNET_NIDNET(nid));
		if (!lpn) {
			rc = -ENOMEM;
			goto out_free_lpni;
		}
	} else if (!(lp->lp_state & LNET_PEER_MULTI_RAIL)) {
		rc = -ENOTUNIQ;
		goto out_free_lpni;
	}
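As a rough illustration of the check quoted above, here is a small user-space model in Python. It is only a sketch, not LNet code: the function names `nid_net` and `add_peer` are made up for this example, the value 76 is the Linux `ENOTUNIQ` number matching the `errno: -76` in the import output, and it assumes that the primary nid is already attached to the peer and that processing of a peer stops at the first rejected nid (the single result per peer in the import output suggests this, but it is not confirmed here). Only the first four nids of each peer from /tmp/t.txt are used.

```python
ENOTUNIQ = 76  # Linux errno value; the import output reports errno: -76


def nid_net(nid):
    """Return the network part of a NID, e.g. 'o2ib' for '1.1.1.1@o2ib'."""
    return nid.split("@", 1)[1]


def add_peer(primary, multi_rail, nids):
    """Model adding nids to one peer.

    Returns (accepted_nids, rc, rejected_nid). Assumption: the primary nid
    is already on the peer, and we stop at the first rejected nid.
    """
    accepted = [primary]
    nets = {nid_net(primary)}
    for nid in nids:
        if nid == primary:
            continue  # already present as the primary nid
        net = nid_net(nid)
        # Mirrors the quoted branch: the peer already has a peer_net for
        # this network and the peer is not multi-rail.
        if net in nets and not multi_rail:
            return accepted, -ENOTUNIQ, nid
        nets.add(net)
        accepted.append(nid)
    return accepted, 0, None


if __name__ == "__main__":
    # First four nids of each peer in /tmp/t.txt, in file order.
    peers = [
        ("1.1.1.1@o2ib", False,
         ["1.1.1.1@o2ib", "2.2.2.2@tcp", "3.3.3.3@tcp", "4.4.4.4@o2ib"]),
        ("9.9.9.9@o2ib", True,
         ["9.9.9.9@o2ib", "10.10.10.10@o2ib", "11.11.11.11@o2ib",
          "12.12.12.12@o2ib"]),
        ("17.17.17.17@o2ib", True,
         ["17.17.17.17@o2ib", "18.18.18.18@o2ib", "19.19.19.19@o2ib",
          "20.20.20.20@tcp"]),
        ("25.25.25.25@tcp", False,
         ["25.25.25.25@tcp", "26.26.26.26@o2ib", "27.27.27.27@o2ib",
          "28.28.28.28@tcp"]),
        ("33.33.33.33@tcp", False,
         ["33.33.33.33@tcp", "34.34.34.34@o2ib", "35.35.35.35@tcp",
          "36.36.36.36@o2ib"]),
    ]
    for primary, multi_rail, nids in peers:
        accepted, rc, rejected = add_peer(primary, multi_rail, nids)
        print(primary, "rc =", rc, "rejected =", rejected,
              "accepted =", accepted)
```

Under these assumptions the model rejects 3.3.3.3@tcp, 27.27.27.27@o2ib and 35.35.35.35@tcp, the first nid that lands on a network each non-MR peer already has, and accepts everything for the two Multi-Rail peers. That gives the same fail, success, success, fail, fail pattern as the import output, if the results are reported in file order.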
Would be nice to have a better error message. Still not sure why we're hanging/crashing on cleanup. |
| Comment by Chris Horn [ 29/Jun/19 ] |
|
So if we're supposed to be prevented from adding secondary nids to non-MR peers, then that doesn't seem to be working correctly:
sles15build01:/home/hornc/lustre-filesystem/lustre/tests # cat /tmp/t.txt
peer:
- primary nid: 1.1.1.1@o2ib
Multi-Rail: False
peer ni:
- nid: 1.1.1.1@o2ib
- nid: 2.2.2.2@tcp
- nid: 3.3.3.3@tcp
- nid: 4.4.4.4@o2ib
- nid: 5.5.5.5@o2ib
- nid: 6.6.6.6@tcp
- nid: 7.7.7.7@tcp
- nid: 8.8.8.8@o2ib
- primary nid: 9.9.9.9@o2ib
Multi-Rail: True
peer ni:
- nid: 9.9.9.9@o2ib
- nid: 10.10.10.10@o2ib
- nid: 11.11.11.11@o2ib
- nid: 12.12.12.12@o2ib
- nid: 13.13.13.13@tcp
- nid: 14.14.14.14@o2ib
- nid: 15.15.15.15@tcp
- nid: 16.16.16.16@o2ib
- primary nid: 17.17.17.17@o2ib
Multi-Rail: True
peer ni:
- nid: 17.17.17.17@o2ib
- nid: 18.18.18.18@o2ib
- nid: 19.19.19.19@o2ib
- nid: 20.20.20.20@tcp
- nid: 21.21.21.21@o2ib
- nid: 22.22.22.22@o2ib
- nid: 23.23.23.23@tcp
- nid: 24.24.24.24@o2ib
- primary nid: 25.25.25.25@tcp
Multi-Rail: False
peer ni:
- nid: 25.25.25.25@tcp
- nid: 26.26.26.26@o2ib
- nid: 27.27.27.27@o2ib
- nid: 28.28.28.28@tcp
- nid: 29.29.29.29@o2ib
- nid: 30.30.30.30@tcp
- nid: 31.31.31.31@o2ib
- nid: 32.32.32.32@tcp
- primary nid: 33.33.33.33@tcp
Multi-Rail: False
peer ni:
- nid: 33.33.33.33@tcp
- nid: 34.34.34.34@o2ib
- nid: 35.35.35.35@tcp
- nid: 36.36.36.36@o2ib
- nid: 37.37.37.37@o2ib
- nid: 38.38.38.38@tcp
- nid: 39.39.39.39@o2ib
- nid: 40.40.40.40@o2ib
sles15build01:/home/hornc/lustre-filesystem/lustre/tests # lnetctl import < /tmp/t.txt
add:
- peer_ni:
errno: -76
descr: "cannot add peer ni: Name not unique on network"
- peer_ni:
errno: 0
descr: "Success"
- peer_ni:
errno: 0
descr: "Success"
- peer_ni:
errno: -76
descr: "cannot add peer ni: Name not unique on network"
- peer_ni:
errno: -76
descr: "cannot add peer ni: Name not unique on network"
sles15build01:/home/hornc/lustre-filesystem/lustre/tests # lnetctl peer show
peer:
- primary nid: 25.25.25.25@tcp
Multi-Rail: False
peer ni:
- nid: 25.25.25.25@tcp
state: NA
- nid: 26.26.26.26@o2ib
state: NA
- primary nid: 9.9.9.9@o2ib
Multi-Rail: True
peer ni:
- nid: 9.9.9.9@o2ib
state: NA
- nid: 10.10.10.10@o2ib
state: NA
- nid: 11.11.11.11@o2ib
state: NA
- nid: 12.12.12.12@o2ib
state: NA
- nid: 14.14.14.14@o2ib
state: NA
- nid: 16.16.16.16@o2ib
state: NA
- nid: 13.13.13.13@tcp
state: NA
- nid: 15.15.15.15@tcp
state: NA
- primary nid: 1.1.1.1@o2ib
Multi-Rail: False
peer ni:
- nid: 1.1.1.1@o2ib
state: NA
- nid: 2.2.2.2@tcp
state: NA
- primary nid: 17.17.17.17@o2ib
Multi-Rail: True
peer ni:
- nid: 17.17.17.17@o2ib
state: NA
- nid: 18.18.18.18@o2ib
state: NA
- nid: 19.19.19.19@o2ib
state: NA
- nid: 21.21.21.21@o2ib
state: NA
- nid: 22.22.22.22@o2ib
state: NA
- nid: 24.24.24.24@o2ib
state: NA
- nid: 20.20.20.20@tcp
state: NA
- nid: 23.23.23.23@tcp
state: NA
- primary nid: 33.33.33.33@tcp
Multi-Rail: False
peer ni:
- nid: 33.33.33.33@tcp
state: NA
- nid: 34.34.34.34@o2ib
state: NA
sles15build01:/home/hornc/lustre-filesystem/lustre/tests #
|