Lustre / LU-19437

LNet discovery may remove peer added by Lustre


Details

    • Bug
    • Resolution: Unresolved
    • Medium

    Description

      Lustre adds LNet peers by NID, taken from the config or from IR, using LNetAddPeer() and relies on their existence after that. Meanwhile, LNet discovery may silently remove a peer if some of its parameters conflict with those of the remote peer:

      00000400:00000200:0.0:1758842132.678583:0:6112:0:(peer.c:1524:lnet_peer_attach_peer_ni()) peer 10.240.43.85@tcp NID 10.240.43.85@tcp flags 0x100001
      00000100:00000040:0.0:1758842132.678585:0:6112:0:(lustre_peer.c:139:class_add_uuid()) Add peer 10.240.43.85@tcp rc = 0
      --- so the peer was added by Lustre, and at this moment discovery is ON on the client
      
      00000400:00000200:0.0:1758842132.683498:0:6112:0:(peer.c:2299:lnet_peer_queue_for_discovery()) Queue peer 10.240.43.85@tcp: 0
      ...
      00000400:00000200:0.0:1758842132.685154:0:5942:0:(peer.c:2749:lnet_discovery_event_reply()) Peer 10.240.43.85@tcp has discovery disabled
      00000400:00000200:0.0:1758842132.685156:0:5942:0:(peer.c:2769:lnet_discovery_event_reply()) Marking 10.240.43.85@tcp:0x100241 for deletion
      ...
      00000400:00000200:0.0:1758842132.685175:0:5946:0:(peer.c:2061:lnet_destroy_peer_ni_locked()) 000000006a68d27e nid 10.240.43.85@tcp
      
      --- and finally the peer is deleted as a result of discovery

      This happens when LNet discovery is enabled on the client but disabled on the server.

      As a result, the peer is deleted and can no longer be found by any of its NIDs, but Lustre keeps trying to use it, so every attempt to send a request fails immediately:

      00000100:00080000:0.0:1758576480.895954:0:24:0:(import.c:537:import_select_connection()) MGC10.240.43.85@tcp: connect to NID 10.240.43.85@tcp last attempt 148
      00000100:00080000:0.0:1758576480.895957:0:24:0:(import.c:553:import_select_connection()) MGC10.240.43.85@tcp: skip NID 10.240.43.85@tcp as unreachable
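
      To check whether a given debug log shows the same pattern, the sequence above can be searched for mechanically. The following is only a rough sketch, not part of Lustre: the helper names (peer_events, removed_after_add) are hypothetical, the message patterns are copied from the excerpts in this ticket and may differ between Lustre versions, and the log lines are assumed to be in chronological order.

```python
import re

# Prefix of a Lustre debug log line as it appears in this ticket:
# subsys:mask:cpu:timestamp:?:pid:?:(file:line:function()) message
LINE_RE = re.compile(
    r"^\s*[0-9a-f]{8}:[0-9a-f]{8}:[\d.]+:(?P<ts>\d+\.\d+):\d+:(?P<pid>\d+):\d+:"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s*(?P<msg>.*)$"
)

# (function name, message pattern, event label); patterns are taken from
# the excerpts above and are an assumption for any other Lustre version.
EVENT_PATTERNS = [
    ("class_add_uuid", re.compile(r"Add peer (\S+@\w+) rc = 0"), "added"),
    ("lnet_discovery_event_reply",
     re.compile(r"Marking (\S+@\w+):0x[0-9a-f]+ for deletion"), "marked"),
    ("lnet_destroy_peer_ni_locked", re.compile(r"nid (\S+@\w+)$"), "destroyed"),
    ("import_select_connection",
     re.compile(r"skip NID (\S+@\w+) as unreachable"), "skipped"),
]


def peer_events(lines):
    """Return (timestamp, nid, event) tuples in input order."""
    events = []
    for raw in lines:
        m = LINE_RE.match(raw)
        if not m:
            continue  # annotations, "..." and blank lines are ignored
        for func, pattern, name in EVENT_PATTERNS:
            if m["func"] != func:
                continue
            hit = pattern.search(m["msg"].strip())
            if hit:
                events.append((m["ts"], hit.group(1), name))
    return events


def removed_after_add(events):
    """NIDs that Lustre added and that were later destroyed by LNet."""
    added, removed = set(), set()
    for _ts, nid, name in events:
        if name == "added":
            added.add(nid)
            removed.discard(nid)  # a later re-add clears the finding
        elif name == "destroyed" and nid in added:
            removed.add(nid)
    return removed


SAMPLE = """\
00000100:00000040:0.0:1758842132.678585:0:6112:0:(lustre_peer.c:139:class_add_uuid()) Add peer 10.240.43.85@tcp rc = 0
00000400:00000200:0.0:1758842132.685154:0:5942:0:(peer.c:2749:lnet_discovery_event_reply()) Peer 10.240.43.85@tcp has discovery disabled
00000400:00000200:0.0:1758842132.685156:0:5942:0:(peer.c:2769:lnet_discovery_event_reply()) Marking 10.240.43.85@tcp:0x100241 for deletion
00000400:00000200:0.0:1758842132.685175:0:5946:0:(peer.c:2061:lnet_destroy_peer_ni_locked()) 000000006a68d27e nid 10.240.43.85@tcp
""".splitlines()

for ts, nid, name in peer_events(SAMPLE):
    print(ts, nid, name)
print("removed after add:", removed_after_add(peer_events(SAMPLE)))
```

      A NID reported by removed_after_add() that also shows up in "skipped" events from import_select_connection() would match the failure described here.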

          People

            tappro Mikhail Pershin