[LU-11257] RHEL/CentOS 3.10.0-862.11.6.el7.x86_64 kernel breaks LNet

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.4
    • Labels: None
    • Environment: CentOS 7.5, x86_64
    • Severity: 3

    Description

      It looks like the latest kernel update from CentOS/Red Hat prevents LNet from working on InfiniBand interfaces (mlx5).

      Symptoms

      No LNet communication is possible; even a self-ping fails:

      # lctl list_nids
      10.9.101.60@o2ib4
      # lctl ping 10.9.101.60@o2ib4
      failed to ping 10.9.101.60@o2ib4: Input/output error

      Communicating with other nodes is impossible, as is mounting filesystems.
      The exact same node with the exact same configuration works flawlessly with kernel 3.10.0-862.9.1.el7.x86_64.
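
The self-ping check above can be scripted across every local NID. This is a minimal sketch and not part of the original report: the self_ping_all helper and the LCTL override variable are hypothetical names, and it assumes that lctl ping exits non-zero when the ping fails.

```shell
# Hypothetical helper: self-ping every NID reported by "lctl list_nids".
# LCTL can be overridden (an assumption of this sketch) so the loop can be
# exercised on a machine without Lustre installed.
LCTL="${LCTL:-lctl}"

self_ping_all() {
    failed=0
    for nid in $("$LCTL" list_nids); do
        # Assumes "lctl ping" returns a non-zero status on failure.
        if "$LCTL" ping "$nid" >/dev/null 2>&1; then
            echo "OK   $nid"
        else
            echo "FAIL $nid"
            failed=1
        fi
    done
    # 0 if every self-ping succeeded, 1 if any failed.
    return "$failed"
}
```

On a node showing the symptom described here, running self_ping_all as root would be expected to print a FAIL line for each o2ib NID and return 1.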

      Versions

      # uname -r
      3.10.0-862.11.6.el7.x86_64
      # cat /sys/fs/lustre/version
      2.10.4
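
For checking many nodes, the kernel release can be classified against the versions named in this ticket. This is only a sketch: kernel_status is a hypothetical helper, and it knows nothing beyond the three releases mentioned in the description and comments (862.11.6 breaks LNet, 862.9.1 works, 862.14.4 is reported to contain the fix).

```shell
# Hypothetical helper: classify a kernel release string using only the
# versions mentioned in this report. Anything else is reported as unknown.
kernel_status() {
    case "$1" in
        3.10.0-862.11.6.el7*) echo "affected" ;;    # breaks LNet over o2ib
        3.10.0-862.9.1.el7*)  echo "known-good" ;;  # works per the description
        3.10.0-862.14.4.el7*) echo "fixed" ;;       # fix reported in the comments
        *)                    echo "unknown" ;;
    esac
}
```

Typical use would be kernel_status "$(uname -r)" on each node.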

      HW

      # ibstat
      CA 'mlx5_0'
              CA type: MT4115
              Number of ports: 1
              Firmware version: 12.21.3012
              Hardware version: 0
              Node GUID: 0x7cfe900300268c04
              System image GUID: 0x7cfe900300268c04
              Port 1:
                      State: Active
                      Physical state: LinkUp
                      Rate: 100
                      Base lid: 72
                      LMC: 0
                      SM lid: 6
                      Capability mask: 0x2651e848
                      Port GUID: 0x7cfe900300268c04
                      Link layer: InfiniBand

      Kernel logs

      [ 1185.337098] LNetError: 22109:0:(o2iblnd_cb.c:2513:kiblnd_passive_connect()) Can't accept 10.9.101.60@o2ib4: -22 
      [ 1185.348376] LNet: 22109:0:(o2iblnd_cb.c:2212:kiblnd_reject()) Error -22 sending reject 
      [ 1185.357473] LNetError: 22109:0:(o2iblnd_cb.c:2721:kiblnd_rejected()) 10.9.101.60@o2ib4 rejected: consumer defined fatal error
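
To spot the same failure on other nodes, the kernel log can be searched for the kiblnd_passive_connect line shown above. The sketch below is illustrative only: count_o2ib_rejects is a hypothetical helper, and the pattern is taken from the single log excerpt in this report, so other Lustre versions may word the message differently.

```shell
# Hypothetical helper: count kernel log lines that match the o2iblnd
# "Can't accept <nid>: -22" failure from this report.
# Reads log text on stdin, for example: dmesg | count_o2ib_rejects
count_o2ib_rejects() {
    grep -c -E "kiblnd_passive_connect\(\)\) Can't accept .*: -22"
}
```

The function prints the number of matching lines; grep exits non-zero when that count is 0.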

          Activity

            pjones Peter Jones added a comment -

            It seems like this was fixed in the next RHEL/CentOS update

            yujian Jian Yu added a comment -

            RHEL 7.5 kernel update to 3.10.0-862.14.4.el7 is tracked in LU-11448.

            boggl Bob Glossman added a comment -

            The kernel update to 3.10.0-862.14.4 is now available on CentOS mirrors.

            boggl Bob Glossman added a comment -

            The fix has also been noted in the CentOS bug report: https://bugs.centos.org/view.php?id=15193. The update .rpm isn't available on CentOS mirrors yet, though.

            srcc Stanford Research Computing Center added a comment -

            Kernel 3.10.0-862.14 has been released, which fixes the issue: https://access.redhat.com/downloads/content/rhel---7/x86_64/2456/kernel/3.10.0-862.14.4.el7/x86_64/fd431d51/package

            scadmin SC Admin added a comment -

            yeah, we gave up waiting and just built our own ib_core.ko module with the 1-character patch from centos.
            works fine now.

            cheers,
            robin

            mdiep Minh Diep added a comment -

            FYI https://bugs.centos.org/view.php?id=15193

            jfilizetti Jeremy Filizetti added a comment -

            I wish I had looked here first when digging into the same thing, instead of wasting a day trying to figure out the culprit. For now I've opened another Red Hat bug, since I didn't come across anything when searching their Bugzilla:

            https://bugzilla.redhat.com/show_bug.cgi?id=1625620

            People

              pjones Peter Jones
              srcc Stanford Research Computing Center
              Votes: 0
              Watchers: 11
