[LU-11257] RHEL/CentOS 3.10.0-862.11.6.el7.x86_64 kernel breaks LNet

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.4
    • Labels: None
    • Environment: CentOS 7.5, x86_64
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      It looks like the latest kernel update from CentOS/Red Hat prevents LNet from working on InfiniBand interfaces (mlx5).

      Symptoms

      No LNet communication, self-ping doesn't work:

      # lctl list_nids
      10.9.101.60@o2ib4
      # lctl ping 10.9.101.60@o2ib4
      failed to ping 10.9.101.60@o2ib4: Input/output error

      Communicating with other nodes is impossible, as is mounting filesystems.
      The exact same node with the exact same configuration works flawlessly with kernel 3.10.0-862.9.1.el7.x86_64
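
      The self-ping check above can be repeated for every local NID with a small helper. This is only a sketch: the function name lnet_self_ping is made up for illustration, and it assumes that lctl ping exits non-zero when the ping fails.

```shell
# Sketch: ping every NID reported by "lctl list_nids" and print one
# result line per NID. lnet_self_ping is a hypothetical helper name;
# only "lctl list_nids" and "lctl ping" come from the report above.
# Assumption: "lctl ping" exits non-zero when the ping fails.
lnet_self_ping() {
    # Exit status 2 means the NID list could not be read or was empty.
    nids=$(lctl list_nids) || return 2
    [ -n "$nids" ] || return 2

    failed=0
    for nid in $nids; do
        if lctl ping "$nid" >/dev/null 2>&1; then
            echo "OK $nid"
        else
            echo "FAIL $nid"
            failed=1
        fi
    done
    # 0 if every self-ping succeeded, 1 if at least one failed.
    return "$failed"
}
```

      Calling lnet_self_ping on the affected node would be expected to print a FAIL line for 10.9.101.60@o2ib4 and return 1.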

      Versions

      # uname -r
      3.10.0-862.11.6.el7.x86_64
      # cat /sys/fs/lustre/version
      2.10.4
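
      To check a node against the kernel versions mentioned in this issue, something like the following could be used. It is a sketch that only knows the three exact release strings named in the description and comments; the helper name classify_kernel and the result labels are assumptions made for illustration, and any other kernel is reported as unknown rather than guessed.

```shell
# Sketch: classify a kernel release string using only the versions
# named in this issue. classify_kernel and its output labels are
# hypothetical; nothing is inferred about other kernel releases.
classify_kernel() {
    case "$1" in
        3.10.0-862.11.6.el7.x86_64)
            echo "affected" ;;    # kernel this report was filed against
        3.10.0-862.9.1.el7.x86_64)
            echo "known-good" ;;  # same node and configuration reported working
        3.10.0-862.14.4.el7.x86_64)
            echo "fixed" ;;       # update that the comments say fixes the issue
        *)
            echo "unknown" ;;     # not covered by this issue
    esac
}
```

      A typical call would be: classify_kernel "$(uname -r)"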

      HW

      # ibstat
      CA 'mlx5_0'
              CA type: MT4115
              Number of ports: 1
              Firmware version: 12.21.3012
              Hardware version: 0
              Node GUID: 0x7cfe900300268c04
              System image GUID: 0x7cfe900300268c04
              Port 1:
                      State: Active
                      Physical state: LinkUp
                      Rate: 100
                      Base lid: 72
                      LMC: 0
                      SM lid: 6
                      Capability mask: 0x2651e848
                      Port GUID: 0x7cfe900300268c04
                      Link layer: InfiniBand

      Kernel logs

      [ 1185.337098] LNetError: 22109:0:(o2iblnd_cb.c:2513:kiblnd_passive_connect()) Can't accept 10.9.101.60@o2ib4: -22 
      [ 1185.348376] LNet: 22109:0:(o2iblnd_cb.c:2212:kiblnd_reject()) Error -22 sending reject 
      [ 1185.357473] LNetError: 22109:0:(o2iblnd_cb.c:2721:kiblnd_rejected()) 10.9.101.60@o2ib4 rejected: consumer defined fatal error
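
      The first of these messages can serve as a signature when checking other nodes. The filter below is a sketch: the function name has_o2ib_accept_error is hypothetical, and the pattern deliberately ignores the source line number (2513 here), on the assumption that it may differ between Lustre versions.

```shell
# Sketch: read kernel log text on stdin and succeed if it contains the
# "Can't accept <nid>: -22" error from kiblnd_passive_connect() shown
# above. has_o2ib_accept_error is a hypothetical helper name.
# -22 is -EINVAL. The pattern does not match on the source line number.
has_o2ib_accept_error() {
    grep -Eq "kiblnd_passive_connect\(\)\) Can't accept .*: -22"
}
```

      For example: dmesg | has_o2ib_accept_error && echo "o2iblnd accept failure found"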

          Activity

            pjones Peter Jones added a comment -

            It seems like this was fixed in the next RHEL/CentOS update

            yujian Jian Yu added a comment -

            RHEL 7.5 kernel update to 3.10.0-862.14.4.el7 is tracked in LU-11448.

            boggl Bob Glossman added a comment -

            The kernel update to 3.10.0-862.14.4 is now available on CentOS mirrors.

            boggl Bob Glossman added a comment -

            The fix has also been noted in the CentOS bug report: https://bugs.centos.org/view.php?id=15193. The updated .rpm isn't available on CentOS mirrors yet, though.

            srcc Stanford Research Computing Center added a comment -

            Kernel 3.10.0-862.14 has been released, which fixes the issue:
            https://access.redhat.com/downloads/content/rhel---7/x86_64/2456/kernel/3.10.0-862.14.4.el7/x86_64/fd431d51/package

            scadmin SC Admin added a comment -

            yeah, we gave up waiting and just built our own ib_core.ko module with the 1-character patch from centos.
            works fine now.

            cheers,
            robin

            mdiep Minh Diep added a comment -

            FYI https://bugs.centos.org/view.php?id=15193

            jfilizetti Jeremy Filizetti added a comment -

            I wish I had looked here first when digging into the same thing, instead of wasting a day trying to figure out the culprit. For now I've opened another Red Hat bug, since I didn't come across anything when searching their Bugzilla:

            https://bugzilla.redhat.com/show_bug.cgi?id=1625620

            scadmin SC Admin added a comment -

            hmm. comments attached to that article point to a fix in centos - potentially just a misplaced semi-colon. but OTOH the centos bug seems to be talking about IPoIB and that works fine. perhaps the fix is right and the bug report is wrong?

            cheers,
            robin


            srcc Stanford Research Computing Center added a comment -

            Still no update from Red Hat.

            We're getting more info via The Register:
            https://www.theregister.co.uk/2018/08/21/fix_for_julys_spectrelike_bug_is_breaking_some_supers/

            “The problem will be fixed in kernel-3.10.0-862.13.1 which is currently being reviewed by Red Hat Enterprise Linux Engineering.”

            But no ETA yet.

            People

              Assignee: pjones Peter Jones
              Reporter: srcc Stanford Research Computing Center
              Votes: 0
              Watchers: 11
