Lustre / LU-17719

2.15.62: lnet broken with errno: -22 descr: ! "unsupported NID"


Details

    • Bug
    • Resolution: Not a Bug
    • Major
    • Lustre 2.16.0
    • None
    • None
    • Lustre tagged 2.15.62
    • 3

    Description

      We tried the latest tag, 2.15.62, on EL9.3 today and LNet o2ib is broken: `lnetctl import` fails with "unsupported NID". This is not related to LU-17700, which I tested separately.

      This appears to be a regression introduced somewhere in the following commit range (newest first):

      9b2cb4e208 New tag 2.15.62      <- TESTED: lnet o2ib broken
      b53aa99294 LU-17690 quota: uninitialized var in qmt_lgd_extend_cb()
      fac57894c7 LU-17683 lnet: ksocknal_startup() leaks iface table
      53c3585ac3 LU-17673 obdclass: properly free opts string
      37e1316050 LU-17685 utils: Allow nocompr flag in lfs mirror extend
      b4d2566ea7 LU-930 ofd: improve orphan cleaning message
      9f7b60739e LU-8191 selftest: restore BUILD_BUGs
      02aa540250 LU-17678 quota: fix memleak in qmt_setup_lqe_gd()
      552a813a22 LU-17673 llite: ll_options() to release the string
      4ae823762d LU-17261 lov: unlink can handle bogus striping
      90ec7361b7 LU-17665 lnet: lock primary NID only on lustre-built peer
      94d05d0737 LU-17379 mgc: try MGS nodes faster
      1ec1858f64 LU-16724 ptlrpc: refactor page pools patch 1
      c660bdef1d LU-16694 tests: rewrite socket client, server in python
      1e9a36b00c LU-15981 tests: add missing close in cascading_rw
      7bdbe5d8e3 LU-17680 ldlm: fix ldlm_res_hop_hash() argument
      a9704ed2d8 LU-17675 tests: sanity-flr/61a set atime_diff=1 for statx
      6b4ef5cab9 LU-17674 build: use nop_mnt_idmap in inode_owner_or_capable
      fa08092d9a LU-6142 llite: Fix style issues under lustre/llite
      4c809f7621 LU-12452 o2iblnd: allow setting IP ToS value (RoCE)
      56af81e1aa LU-10391 lnet: update Netlink commands functionality
      7151881aa3 LU-13814 clio: remove cp_state usage for DIO pages
      9ef186b71b LU-16692 tests: remove force_new_seq from some test suites
      f00d2467fc LU-16692 osp: do not assert on seq got over network
      bb6a2d2e80 LU-17053 libcfs: make a debugfs equivalent for markers  <- TESTED: lnet o2ib loads
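
      To narrow the range above before bisecting, the listing can be filtered by the component prefix in each commit subject. The following is only a sketch: the choice of `lnet` and `o2iblnd` as the "LNet-related" prefixes is an assumption made for illustration, the names `SUSPECTS`, `LINE_RE` and `LNET_COMPONENTS` are hypothetical, and a match is a place to look first, not a diagnosis. It also estimates how many build-and-test rounds a bisection over the range would need.

```python
import math
import re

# Commits copied from the listing above, newest first. The known-good
# commit (bb6a2d2e80) is excluded; every entry here could be the first bad one.
SUSPECTS = """\
9b2cb4e208 New tag 2.15.62
b53aa99294 LU-17690 quota: uninitialized var in qmt_lgd_extend_cb()
fac57894c7 LU-17683 lnet: ksocknal_startup() leaks iface table
53c3585ac3 LU-17673 obdclass: properly free opts string
37e1316050 LU-17685 utils: Allow nocompr flag in lfs mirror extend
b4d2566ea7 LU-930 ofd: improve orphan cleaning message
9f7b60739e LU-8191 selftest: restore BUILD_BUGs
02aa540250 LU-17678 quota: fix memleak in qmt_setup_lqe_gd()
552a813a22 LU-17673 llite: ll_options() to release the string
4ae823762d LU-17261 lov: unlink can handle bogus striping
90ec7361b7 LU-17665 lnet: lock primary NID only on lustre-built peer
94d05d0737 LU-17379 mgc: try MGS nodes faster
1ec1858f64 LU-16724 ptlrpc: refactor page pools patch 1
c660bdef1d LU-16694 tests: rewrite socket client, server in python
1e9a36b00c LU-15981 tests: add missing close in cascading_rw
7bdbe5d8e3 LU-17680 ldlm: fix ldlm_res_hop_hash() argument
a9704ed2d8 LU-17675 tests: sanity-flr/61a set atime_diff=1 for statx
6b4ef5cab9 LU-17674 build: use nop_mnt_idmap in inode_owner_or_capable
fa08092d9a LU-6142 llite: Fix style issues under lustre/llite
4c809f7621 LU-12452 o2iblnd: allow setting IP ToS value (RoCE)
56af81e1aa LU-10391 lnet: update Netlink commands functionality
7151881aa3 LU-13814 clio: remove cp_state usage for DIO pages
9ef186b71b LU-16692 tests: remove force_new_seq from some test suites
f00d2467fc LU-16692 osp: do not assert on seq got over network
"""

# "<sha> LU-<n> <component>: <subject>"; the tag commit has no LU prefix.
LINE_RE = re.compile(
    r"^(?P<sha>[0-9a-f]{10}) (?:LU-\d+ (?P<component>[\w-]+): )?(?P<subject>.*)$"
)

# Assumption: these subject prefixes mark changes to LNet or the o2ib LND.
LNET_COMPONENTS = {"lnet", "o2iblnd"}

commits = [LINE_RE.match(line).groupdict() for line in SUSPECTS.splitlines()]
lnet_commits = [c for c in commits if c["component"] in LNET_COMPONENTS]

# A binary search over N candidates needs about ceil(log2(N)) test rounds.
bisect_steps = math.ceil(math.log2(len(commits)))

for c in lnet_commits:
    print(c["sha"], c["component"], c["subject"])
print(f"{len(commits)} candidates, roughly {bisect_steps} bisect rounds")
```

      Filtering by subject prefix is cheap but lossy: a change tagged `build` or `libcfs` could still affect LNet, so a full bisection between bb6a2d2e80 and 9b2cb4e208 remains the reliable way to find the first bad commit.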
      

      What happens with 2.15.62 is this (tested on ConnectX-6 RoCE):

       [root@elm-rcf-md1-s2 ~]# cat /etc/lnet.conf
        # lnet.conf - configuration file for lnet routes to be imported by lnetctl
        
        net:
            - net type: o2ib9
              local NI(s):
                - nid:
                  interfaces:
                      0: eno12399np0
      
        [root@elm-rcf-md1-s2 ~]# systemctl status lnet
        × lnet.service - lnet management
             Loaded: loaded (/usr/lib/systemd/system/lnet.service; enabled; preset: disabled)
             Active: failed (Result: exit-code) since Tue 2024-04-09 14:40:39 PDT; 12s ago
            Process: 3239 ExecStart=/sbin/modprobe lnet (code=exited, status=0/SUCCESS)
            Process: 3301 ExecStart=/usr/sbin/lnetctl lnet configure (code=exited, status=0/SUCCESS)
            Process: 3304 ExecStart=/usr/sbin/lnetctl import /etc/lnet.conf (code=exited, status=234)
           Main PID: 3304 (code=exited, status=234)
                CPU: 789ms
        
        Apr 09 14:40:38 elm-rcf-md1-s2 systemd[1]: Starting lnet management...
        Apr 09 14:40:39 elm-rcf-md1-s2 lnetctl[3304]: ---
        Apr 09 14:40:39 elm-rcf-md1-s2 lnetctl[3304]: add:
        Apr 09 14:40:39 elm-rcf-md1-s2 lnetctl[3304]: -     import:
        Apr 09 14:40:39 elm-rcf-md1-s2 lnetctl[3304]:       errno: -22
        Apr 09 14:40:39 elm-rcf-md1-s2 lnetctl[3304]:       descr: ! "unsupported NID"
        Apr 09 14:40:39 elm-rcf-md1-s2 lnetctl[3304]: ...
        Apr 09 14:40:39 elm-rcf-md1-s2 systemd[1]: lnet.service: Main process exited, code=exited, status=234/n/a
        Apr 09 14:40:39 elm-rcf-md1-s2 systemd[1]: lnet.service: Failed with result 'exit-code'.
        Apr 09 14:40:39 elm-rcf-md1-s2 systemd[1]: Failed to start lnet management.
      
        [root@elm-rcf-md1-s2 ~]# ibstat
        CA 'mlx5_0'
        	CA type: MT4127
        	Number of ports: 1
        	Firmware version: 26.38.1002
        	Hardware version: 0
        	Node GUID: 0x946dae03006d972a
        	System image GUID: 0x946dae03006d972a
        	Port 1:
        		State: Active
        		Physical state: LinkUp
        		Rate: 25
        		Base lid: 0
        		LMC: 0
        		SM lid: 0
        		Capability mask: 0x00010000
        		Port GUID: 0x966daefffe6d972a
        		Link layer: Ethernet
      

          People

            simmonsja James A Simmons
            sthiell Stephane Thiell
