LU-11257

RHEL/CentOS 3.10.0-862.11.6.el7.x86_64 kernel breaks LNet

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: Lustre 2.10.4
    • Fix Version/s: None
    • Labels:
      None
    • Environment:
      CentOS 7.5, x86_64
    • Severity:
      3
    • Rank (Obsolete):
      9223372036854775807

      Description

      It looks like the latest kernel update from CentOS/Red Hat prevents LNet from working on InfiniBand interfaces (mlx5).

      Symptoms

      There is no LNet communication; even a self-ping fails:

      # lctl list_nids
      10.9.101.60@o2ib4
      # lctl ping 10.9.101.60@o2ib4
      failed to ping 10.9.101.60@o2ib4: Input/output error

      Communicating with other nodes is impossible, as is mounting filesystems.
      The exact same node with the exact same configuration works flawlessly with kernel 3.10.0-862.9.1.el7.x86_64.
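The self-ping check shown above can be scripted across every local NID. This is a minimal sketch and not part of the original report: the `LCTL` variable and the `self_ping_all` function are hypothetical names introduced here for illustration, and only `lctl list_nids` and `lctl ping` are taken from the report.

```shell
#!/bin/sh
# Sketch: ping each local NID and report which ones fail.
# LCTL is a hypothetical override so the command can be swapped out for testing.
LCTL="${LCTL:-lctl}"

self_ping_all() {
    failed=0
    # "lctl list_nids" prints one NID per line, e.g. 10.9.101.60@o2ib4
    for nid in $("$LCTL" list_nids); do
        if "$LCTL" ping "$nid" >/dev/null 2>&1; then
            echo "OK   $nid"
        else
            echo "FAIL $nid"
            failed=1
        fi
    done
    # Non-zero exit status if any self-ping failed
    return "$failed"
}
```

Calling `self_ping_all` as root on an affected node would be expected to print a `FAIL` line for the o2ib NID and return a non-zero status.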

      Versions

      # uname -r
      3.10.0-862.11.6.el7.x86_64
      # cat /sys/fs/lustre/version
      2.10.4
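To check whether a node runs one of the two kernels named in this report, a small helper can classify the output of `uname -r`. This is only a sketch: `classify_kernel`, `KNOWN_BAD`, and `KNOWN_GOOD` are hypothetical names, and the two release strings are the only ones this report gives evidence for. Any other kernel is reported as unknown rather than guessed.

```shell
#!/bin/sh
# Kernel releases taken from this report only; no other releases are assumed.
KNOWN_BAD="3.10.0-862.11.6.el7.x86_64"
KNOWN_GOOD="3.10.0-862.9.1.el7.x86_64"

# classify_kernel RELEASE: print affected, known-good, or unknown
classify_kernel() {
    case "$1" in
        "$KNOWN_BAD")  echo "affected" ;;
        "$KNOWN_GOOD") echo "known-good" ;;
        *)             echo "unknown" ;;
    esac
}
```

On a live node it could be invoked as `classify_kernel "$(uname -r)"`.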

      Hardware

      # ibstat
      CA 'mlx5_0'
              CA type: MT4115
              Number of ports: 1
              Firmware version: 12.21.3012
              Hardware version: 0
              Node GUID: 0x7cfe900300268c04
              System image GUID: 0x7cfe900300268c04
              Port 1:
                      State: Active
                      Physical state: LinkUp
                      Rate: 100
                      Base lid: 72
                      LMC: 0
                      SM lid: 6
                      Capability mask: 0x2651e848
                      Port GUID: 0x7cfe900300268c04
                      Link layer: InfiniBand

       

      Kernel logs

      [ 1185.337098] LNetError: 22109:0:(o2iblnd_cb.c:2513:kiblnd_passive_connect()) Can't accept 10.9.101.60@o2ib4: -22 
      [ 1185.348376] LNet: 22109:0:(o2iblnd_cb.c:2212:kiblnd_reject()) Error -22 sending reject 
      [ 1185.357473] LNetError: 22109:0:(o2iblnd_cb.c:2721:kiblnd_rejected()) 10.9.101.60@o2ib4 rejected: consumer defined fatal error
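The -22 in these messages corresponds to -EINVAL (invalid argument) on Linux. To see which o2iblnd functions are reporting errors, the function names can be pulled out of the kernel log. The following is a rough sketch that assumes the log line layout shown above; `o2ib_functions` is a hypothetical helper name, not a Lustre tool.

```shell
#!/bin/sh
# Read kernel log lines on stdin and print the o2iblnd_cb.c function name
# from each matching line; lines that do not match are dropped.
o2ib_functions() {
    sed -n 's/.*(o2iblnd_cb\.c:[0-9]*:\([A-Za-z0-9_]*\)()).*/\1/p'
}
```

For example, `dmesg | o2ib_functions | sort | uniq -c` would count how often each function appears in the log.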

              People

              • Assignee: Peter Jones (pjones)
              • Reporter: Stanford Research Computing Center (srcc)
              • Votes: 0
              • Watchers: 11
