Lustre / LU-17700

lnet ping: failed to ping NID: Protocol error

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.16.0
    • None
    • None
    • EL9.3
    • 3
    • 9223372036854775807

    Description

      I was testing today's master branch to get Lustre servers running on EL9.3 for a new project. It looks like a regression was introduced: I can no longer ping the local o2ib NID, with either lctl ping or lnetctl ping.

       Lustre 2.15.59_32 works as expected:

      [root@elm-rcf-io2-s2 ~]# rpm -q lustre
      lustre-2.15.59_32_g1bb972b-1.el9.x86_64
      
      [root@elm-rcf-io2-s2 ~]# ibstat
      CA 'mlx5_0'
      	CA type: MT4125
      	Number of ports: 1
      	Firmware version: 22.38.1002
      	Hardware version: 0
      	Node GUID: 0x58a2e103003e5238
      	System image GUID: 0x58a2e103003e5238
      	Port 1:
      		State: Active
      		Physical state: LinkUp
      		Rate: 100
      		Base lid: 0
      		LMC: 0
      		SM lid: 0
      		Capability mask: 0x00010000
      		Port GUID: 0x5aa2e1fffe3e5238
      		Link layer: Ethernet
      CA 'mlx5_1'
      	CA type: MT4125
      	Number of ports: 1
      	Firmware version: 22.38.1002
      	Hardware version: 0
      	Node GUID: 0x58a2e103003e5239
      	System image GUID: 0x58a2e103003e5238
      	Port 1:
      		State: Down
      		Physical state: Disabled
      		Rate: 40
      		Base lid: 0
      		LMC: 0
      		SM lid: 0
      		Capability mask: 0x00010000
      		Port GUID: 0x5aa2e1fffe3e5239
      		Link layer: Ethernet
      
      [root@elm-rcf-io2-s2 ~]# lnetctl net show
      net:
      -     net type: lo
            local NI(s):
            -     nid: 0@lo
                  status: up
      -     net type: o2ib9
            local NI(s):
            -     nid: 10.4.0.24@o2ib9
                  status: up
                  interfaces:
                        0: ens2f0np0
      
      [root@elm-rcf-io2-s2 ~]# lctl list_nids
      10.4.0.24@o2ib9
      
      [root@elm-rcf-io2-s2 ~]# lctl ping 10.4.0.24@o2ib9
      12345-0@lo
      12345-10.4.0.24@o2ib9
      
      [root@elm-rcf-io2-s2 ~]# lnetctl ping 10.4.0.24@o2ib9
      ping:
          - primary nid: 10.4.0.24@o2ib9
            Multi-Rail: False
            peer ni:
              - nid: 10.4.0.24@o2ib9
      
      [root@elm-rcf-io2-s2 ~]# lnetctl ping 0@lo
      ping:
          - primary nid: 0@lo
            Multi-Rail: False
            peer ni:
              - nid: 10.4.0.24@o2ib9
      

       

      But Lustre 2.15.61_225 is now broken:

      [root@elm-rcf-io2-s2 ~]# rpm -q lustre
      lustre-2.15.61_225_gbb6a2d2-1.el9.x86_64
      
      [root@elm-rcf-io2-s2 ~]# lnetctl net show
      net:
      -     net type: lo
            local NI(s):
            -     nid: 0@lo
                  status: up
      -     net type: o2ib9
            local NI(s):
            -     nid: 10.4.0.24@o2ib9
                  status: up
                  interfaces:
                        0: ens2f0np0
      
      [root@elm-rcf-io2-s2 ~]# ibstat
      CA 'mlx5_0'
      	CA type: MT4125
      	Number of ports: 1
      	Firmware version: 22.38.1002
      	Hardware version: 0
      	Node GUID: 0x58a2e103003e5238
      	System image GUID: 0x58a2e103003e5238
      	Port 1:
      		State: Active
      		Physical state: LinkUp
      		Rate: 100
      		Base lid: 0
      		LMC: 0
      		SM lid: 0
      		Capability mask: 0x00010000
      		Port GUID: 0x5aa2e1fffe3e5238
      		Link layer: Ethernet
      CA 'mlx5_1'
      	CA type: MT4125
      	Number of ports: 1
      	Firmware version: 22.38.1002
      	Hardware version: 0
      	Node GUID: 0x58a2e103003e5239
      	System image GUID: 0x58a2e103003e5238
      	Port 1:
      		State: Down
      		Physical state: Disabled
      		Rate: 40
      		Base lid: 0
      		LMC: 0
      		SM lid: 0
      		Capability mask: 0x00010000
      		Port GUID: 0x5aa2e1fffe3e5239
      		Link layer: Ethernet
      
      [root@elm-rcf-io2-s2 ~]# lctl list_nids
      10.4.0.24@o2ib9
      
      [root@elm-rcf-io2-s2 ~]# lctl ping 10.4.0.24@o2ib9
      failed to ping 10.4.0.24@o2ib9: Protocol error
      
      [root@elm-rcf-io2-s2 ~]# lnetctl ping 10.4.0.24@o2ib9
      manage:
      - ping:
        errno: -71
        descr: ! 'failed to ping 10.4.0.24@o2ib9: Protocol error'
      
      [root@elm-rcf-io2-s2 ~]# lnetctl ping 0@lo
      ping:
      - primary nid: 0@lo
        Multi-Rail: false
        peer_ni:
        - nid: 10.4.0.24@o2ib9
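
For anyone scripting a check for this regression, the error block printed by the failing lnetctl ping above can be picked apart as follows. This is only a minimal sketch: `parse_lnetctl_error` and `SAMPLE` are hypothetical names, the sample text is copied from this report, and the regular expressions assume the layout shown here rather than any documented lnetctl output format. On Linux, errno 71 is EPROTO, which matches the "Protocol error" text.

```python
import errno
import os
import re

# Sample copied from the failing `lnetctl ping` output in this report.
SAMPLE = """\
manage:
- ping:
  errno: -71
  descr: ! 'failed to ping 10.4.0.24@o2ib9: Protocol error'
"""


def parse_lnetctl_error(text):
    """Return (errno_value, description) from an lnetctl error block, or None.

    Assumes the 'errno:' / 'descr:' layout shown in this report.
    """
    err = re.search(r"^\s*errno:\s*(-?\d+)\s*$", text, re.MULTILINE)
    if err is None:
        return None  # no error block, e.g. a successful ping
    descr = re.search(
        r"^\s*descr:\s*(?:!\s*)?['\"]?(.*?)['\"]?\s*$", text, re.MULTILINE
    )
    return int(err.group(1)), (descr.group(1) if descr else "")


if __name__ == "__main__":
    value, description = parse_lnetctl_error(SAMPLE)
    # Map the numeric value to its symbolic name on the local platform.
    name = errno.errorcode.get(abs(value), "unknown")
    print(value, name, os.strerror(abs(value)))
    print(description)
```

A successful ping has no `errno:` line, so the function returns `None` in that case.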
      

       

      Lustre debug shows:

      00000400:00020000:4.0:1712120315.285318:0:101829:0:(api-ni.c:9146:lnet_ping()) 12345-10.4.0.24@o2ib9: Unexpected magic 00000000
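
To make the debug line above easier to read, here is a rough sketch that splits it into fields. It assumes the usual Lustre debug log layout (subsystem mask, debug mask, CPU, timestamp, stack size, PID, extended PID, then file, line and function in parentheses). `parse_debug_line` is a hypothetical helper, and the S_LNET and D_ERROR mask values in the comments are from my reading of libcfs, so please verify them against your tree.

```python
import re
from datetime import datetime, timezone

# Debug line copied from this report.
SAMPLE_LINE = (
    "00000400:00020000:4.0:1712120315.285318:0:101829:0:"
    "(api-ni.c:9146:lnet_ping()) 12345-10.4.0.24@o2ib9: Unexpected magic 00000000"
)

LINE_RE = re.compile(
    r"^(?P<subsys>[0-9a-f]{8}):(?P<mask>[0-9a-f]{8}):(?P<cpu>[^:]+):"
    r"(?P<time>\d+\.\d+):(?P<stack>\d+):(?P<pid>\d+):(?P<extpid>\d+):"
    r"\((?P<file>[^:]+):(?P<line>\d+):(?P<func>\w+)\(\)\)\s(?P<msg>.*)$"
)


def parse_debug_line(line):
    """Split one Lustre debug log line into a dict, or return None."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["subsys"] = int(rec["subsys"], 16)  # 0x400 is S_LNET (assumption)
    rec["mask"] = int(rec["mask"], 16)      # 0x20000 is D_ERROR (assumption)
    rec["time"] = float(rec["time"])        # seconds since the epoch
    for key in ("stack", "pid", "extpid", "line"):
        rec[key] = int(rec[key])
    return rec


if __name__ == "__main__":
    rec = parse_debug_line(SAMPLE_LINE)
    when = datetime.fromtimestamp(rec["time"], tz=timezone.utc)
    print(when.isoformat(), rec["file"], rec["line"], rec["func"])
    print(rec["msg"])
```

Read this way, the line says that lnet_ping() in api-ni.c logged an error: the ping reply from 10.4.0.24@o2ib9 had a magic value of 00000000.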
      

       

            People

              ssmirnov Serguei Smirnov
              sthiell Stephane Thiell