Details
Type: Bug
Resolution: Unresolved
Priority: Medium
Description
Lustre adds LNet peers by NID, taken from the config or from IR, using LNetAddPeer() and relies on their existence from then on. Meanwhile, LNet discovery may silently remove a peer if some of its parameters conflict with those of the remote peer:
00000400:00000200:0.0:1758842132.678583:0:6112:0:(peer.c:1524:lnet_peer_attach_peer_ni()) peer 10.240.43.85@tcp NID 10.240.43.85@tcp flags 0x100001
00000100:00000040:0.0:1758842132.678585:0:6112:0:(lustre_peer.c:139:class_add_uuid()) Add peer 10.240.43.85@tcp rc = 0
--- so peer was added by Lustre and at the moment discovery is ON on client
00000400:00000200:0.0:1758842132.683498:0:6112:0:(peer.c:2299:lnet_peer_queue_for_discovery()) Queue peer 10.240.43.85@tcp: 0
...
00000400:00000200:0.0:1758842132.685154:0:5942:0:(peer.c:2749:lnet_discovery_event_reply()) Peer 10.240.43.85@tcp has discovery disabled
00000400:00000200:0.0:1758842132.685156:0:5942:0:(peer.c:2769:lnet_discovery_event_reply()) Marking 10.240.43.85@tcp:0x100241 for deletion
...
00000400:00000200:0.0:1758842132.685175:0:5946:0:(peer.c:2061:lnet_destroy_peer_ni_locked()) 000000006a68d27e nid 10.240.43.85@tcp
--- and finally peer is deleted as result of discovery
This happens when LNet discovery is enabled on the client but disabled on the server.
As a result, the peer is deleted and can no longer be found by any of its NIDs, but Lustre keeps trying to use it, so every attempt to send a request fails immediately:
00000100:00080000:0.0:1758576480.895954:0:24:0:(import.c:537:import_select_connection()) MGC10.240.43.85@tcp: connect to NID 10.240.43.85@tcp last attempt 148
00000100:00080000:0.0:1758576480.895957:0:24:0:(import.c:553:import_select_connection()) MGC10.240.43.85@tcp: skip NID 10.240.43.85@tcp as unreachable
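To make the sequence in the logs easier to follow, here is a minimal toy model of the reported behaviour. It is only a sketch under stated assumptions: ToyLNet, ToyImport and run_scenario are hypothetical names invented for illustration, not Lustre or LNet code, and the model reduces "conflicting parameters" to the single case described in this ticket (discovery enabled locally, disabled on the remote side).

```python
# Toy model only: none of these classes or methods are Lustre or LNet APIs.

class ToyLNet:
    """Hypothetical stand-in for the local LNet peer table."""

    def __init__(self, discovery_enabled):
        self.discovery_enabled = discovery_enabled
        self.peers = set()  # NIDs currently known to this toy LNet

    def add_peer(self, nid):
        """Models Lustre registering a peer NID (the role LNetAddPeer() plays)."""
        self.peers.add(nid)

    def discover(self, nid, remote_discovery_enabled):
        """Models a discovery reply; a mismatch drops the peer silently."""
        if not self.discovery_enabled or nid not in self.peers:
            return  # discovery is off locally, or there is nothing to discover
        if not remote_discovery_enabled:
            self.peers.discard(nid)  # the caller is never told about this

    def is_reachable(self, nid):
        return nid in self.peers


class ToyImport:
    """Hypothetical stand-in for an import that registered its NID once."""

    def __init__(self, lnet, nid):
        self.lnet = lnet
        self.nid = nid
        lnet.add_peer(nid)  # added once, then assumed to exist forever

    def connect(self):
        if not self.lnet.is_reachable(self.nid):
            return "skip NID %s as unreachable" % self.nid
        return "connect to NID %s" % self.nid


def run_scenario(client_discovery, server_discovery, nid="10.240.43.85@tcp"):
    """Add the peer, let discovery run once, then try to connect."""
    lnet = ToyLNet(client_discovery)
    imp = ToyImport(lnet, nid)
    lnet.discover(nid, server_discovery)
    return imp.connect()
```

In this toy model, run_scenario(True, False) ends in the "skip NID ... as unreachable" branch because nothing re-adds the peer after discovery drops it, while the other combinations connect normally. Whether the real fix belongs in LNet (not deleting peers that were added explicitly) or in Lustre (re-adding the peer before use) is outside what the sketch can show.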