Details
Type: Bug
Resolution: Duplicate
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.15.1
Environment: lustre-2.15.1_7.llnl, kernel 4.18.0-372.26.1.1toss.t4.x86_64, omnipath
Severity: 3
Rank (Obsolete): 9223372036854775807
Description
(Update: this ticket was created while investigating another issue which we think we've reproduced.)
After running a test in which a router (mutt2) was powered off during an ior run and then brought back up, we observe problems in the logs: one of the client nodes involved in the ior (mutt18) keeps reporting that its route through mutt2 is going up and down.
After mutt2 came back up, we observe lctl ping to it failing intermittently. Similarly, we see lnet_selftest fail between mutt2 and mutt8.
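For reference, a minimal sketch of how the intermittent ping failures can be observed from a client; the router NID is the one from the logs below, and the iteration count and sleep interval are arbitrary:

# Repeatedly ping the rebooted router's NID; while the problem is present,
# some iterations fail while others succeed.
for i in $(seq 1 20); do
    lctl ping 192.168.128.2@o2ib44 || echo "iteration $i: ping failed"
    sleep 5
done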
After mutt2 is back up, we start seeing LNet and LNetError console log messages (a sketch for grepping these patterns out of saved console logs follows this list):
- On the router node that was rebooted, mutt2, console log messages like these with the NID of a compute node:
  - "LNetError.*kiblnd_cm_callback.*DISCONNECTED"
  - "LNetError.*kiblnd_reconnect_peer.*Abort reconnection of"
- On the router nodes that were not rebooted, console log messages like these:
  - "LNet:.*kiblnd_handle_rx.*PUT_NACK" with the NID of a garter node (an OSS)
- On the compute node mutt18, console log messages like these with the NID of the router that was power cycled:
  - "LNetError.*kiblnd_post_locked.*Error -22 posting transmit"
  - "LNetError.*lnet_set_route_aliveness.*route to ... has gone from up to down"
From the mutt18 console logs (192.168.128.2@o2ib44 is mutt2):
2022-10-27 09:40:42 [577896.982939] LNetError: 201529:0:(lib-lnet.h:1241:lnet_set_route_aliveness()) route to o2ib100 through 192.168.128.2@o2ib44 has gone from down to up
2022-10-27 11:42:44 [585218.762479] LNetError: 201549:0:(lib-lnet.h:1241:lnet_set_route_aliveness()) route to o2ib100 through 192.168.128.2@o2ib44 has gone from up to down
2022-10-27 11:42:48 [585222.858555] LNetError: 201531:0:(lib-lnet.h:1241:lnet_set_route_aliveness()) route to o2ib100 through 192.168.128.2@o2ib44 has gone from down to up
2022-10-27 12:27:25 [587899.915999] LNetError: 201515:0:(lib-lnet.h:1241:lnet_set_route_aliveness()) route to o2ib100 through 192.168.128.2@o2ib44 has gone from up to down
2022-10-27 12:27:48 [587923.166182] LNetError: 201529:0:(lib-lnet.h:1241:lnet_set_route_aliveness()) route to o2ib100 through 192.168.128.2@o2ib44 has gone from down to up
2022-10-27 12:44:56 [588951.269523] LNetError: 201549:0:(lib-lnet.h:1241:lnet_set_route_aliveness()) route to o2ib100 through 192.168.128.2@o2ib44 has gone from up to down
2022-10-27 12:44:57 [588952.294216] LNetError: 201529:0:(lib-lnet.h:1241:lnet_set_route_aliveness()) route to o2ib100 through 192.168.128.2@o2ib44 has gone from down to up
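For completeness, a sketch of how the route state those messages refer to can be inspected directly on mutt18; these are standard lnetctl commands, though the exact output fields depend on the Lustre version:

# Show the routes to the remote network, including their current up/down state.
lnetctl route show --net o2ib100 --verbose
# Show health/state details for the router peer that keeps flapping.
lnetctl peer show --nid 192.168.128.2@o2ib44 --verbose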
Original description:
An LNet selftest session cannot be created because "lst add_group" fails:
# lst add_group anodes mutt110@o2ib44
create session RPC failed on 12345-192.168.128.110@o2ib44: Unknown error -22
No nodes added successfully, deleting group anodes
Group is deleted
# lctl ping mutt110@o2ib44
12345-0@lo
12345-192.168.128.110@o2ib44
## I had already loaded the lnet_selftest module on mutt110, but the error is the same either way
# pdsh -w emutt110 lsmod | grep lnet_selftest
emutt110: lnet_selftest 270336 0
emutt110: lnet 704512 9 osc,ko2iblnd,obdclass,ptlrpc,lnet_selftest,mgc,lmv,lustre
emutt110: libcfs 266240 13 fld,lnet,osc,fid,ko2iblnd,obdclass,ptlrpc,lnet_selftest,mgc,lov,mdc,lmv,lustre
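For context, a minimal sketch of the kind of selftest session this add_group was part of; the second group's node (mutt18@o2ib44) and the brw test parameters are illustrative, not taken from the failing run:

# lnet_selftest must be loaded on the console node and on every test node.
modprobe lnet_selftest
export LST_SESSION=$$                    # lst utilities locate the session via this variable
lst new_session router_test
lst add_group anodes mutt110@o2ib44      # this is the step that failed with -22 above
lst add_group bnodes mutt18@o2ib44       # illustrative peer group
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from anodes --to bnodes brw read size=1M
lst run bulk_rw
lst stat anodes bnodes                   # Ctrl-C to stop the statistics display
lst end_session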
In addition to the lnet_selftest failure, we see intermittent ping failures. So the issue is not lnet_selftest itself (as I first believed) but a more general problem.
For my tracking purposes, our local ticket is TOSS5812
Issue Links
- duplicates LU-16349: Excessive number of OPA disconnects / LNET network errors in cluster (Resolved)