Lustre / LU-11257

RHEL/CentOS 3.10.0-862.11.6.el7.x86_64 kernel breaks LNet

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.10.4
    • Labels: None
    • Environment: CentOS 7.5, x86_64
    • Severity: 3

    Description

      It looks like the latest kernel update from CentOS/Red Hat prevents LNet from working on InfiniBand interfaces (mlx5).

      Symptoms

      No LNet communication, self-ping doesn't work:

      # lctl list_nids
      10.9.101.60@o2ib4
      # lctl ping 10.9.101.60@o2ib4
      failed to ping 10.9.101.60@o2ib4: Input/output error

      Communicating with other nodes is impossible, as is mounting filesystems.
      The exact same node with the exact same configuration works flawlessly with kernel 3.10.0-862.9.1.el7.x86_64
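
The self-ping check above can be scripted so that every local NID is tested in one pass. This is only a sketch: it relies on the two `lctl` subcommands shown in this report (`list_nids` and `ping`), while the function name `lnet_self_ping` and its return codes (1 for a failed ping, 2 for no NIDs) are conventions made up for this example.

```shell
# Ping every local NID reported by lctl and print one status line per NID.
# Returns 0 if all pings succeed, 1 if any ping fails, 2 if no NID is found.
lnet_self_ping() {
    nids=$(lctl list_nids) || return 2
    if [ -z "$nids" ]; then
        echo "no NIDs configured" >&2
        return 2
    fi
    status=0
    for nid in $nids; do
        if lctl ping "$nid" >/dev/null 2>&1; then
            echo "OK   $nid"
        else
            echo "FAIL $nid"
            status=1
        fi
    done
    return "$status"
}
```

Calling `lnet_self_ping` as root on an affected node would be expected to print a `FAIL` line for `10.9.101.60@o2ib4`, matching the Input/output error above.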

      Versions

      # uname -r
      3.10.0-862.11.6.el7.x86_64
      # cat /sys/fs/lustre/version
      2.10.4
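
For sites that want to catch this before a node goes into production, a guard along these lines could run at boot or in a health check. It is a minimal sketch that only compares against the one kernel release named in this report. The variable `BAD_KERNEL` and the function `is_affected_kernel` are hypothetical names, and the check makes no claim about any other kernel release.

```shell
# Kernel release reported as broken in this ticket.
BAD_KERNEL="3.10.0-862.11.6.el7.x86_64"

# is_affected_kernel RELEASE
# Succeeds (exit status 0) only when RELEASE equals the known-bad release.
is_affected_kernel() {
    [ "$1" = "$BAD_KERNEL" ]
}

# Warn on the running kernel; prints nothing on any other release.
if is_affected_kernel "$(uname -r)"; then
    echo "WARNING: kernel $BAD_KERNEL is reported to break LNet over o2ib (LU-11257)" >&2
fi
```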

      HW

      # ibstat
      CA 'mlx5_0'
              CA type: MT4115
              Number of ports: 1
              Firmware version: 12.21.3012
              Hardware version: 0
              Node GUID: 0x7cfe900300268c04
              System image GUID: 0x7cfe900300268c04
              Port 1:
                      State: Active
                      Physical state: LinkUp
                      Rate: 100
                      Base lid: 72
                      LMC: 0
                      SM lid: 6
                      Capability mask: 0x2651e848
                      Port GUID: 0x7cfe900300268c04
                      Link layer: InfiniBand

       

      Kernel logs

      [ 1185.337098] LNetError: 22109:0:(o2iblnd_cb.c:2513:kiblnd_passive_connect()) Can't accept 10.9.101.60@o2ib4: -22 
      [ 1185.348376] LNet: 22109:0:(o2iblnd_cb.c:2212:kiblnd_reject()) Error -22 sending reject 
      [ 1185.357473] LNetError: 22109:0:(o2iblnd_cb.c:2721:kiblnd_rejected()) 10.9.101.60@o2ib4 rejected: consumer defined fatal error
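
The `-22` in these messages is the kernel's `-EINVAL` (invalid argument). To check other nodes for the same signature, a filter like the one below can be fed from `dmesg`. It is a sketch: the function name `find_o2iblnd_rejects` is made up for this example, and the pattern is built only from the three log lines shown here, so it may miss differently worded messages in other Lustre versions.

```shell
# Read kernel log text on stdin and print only the o2iblnd connection
# accept/reject messages of the kind shown in this report.
find_o2iblnd_rejects() {
    grep -E 'LNet(Error)?: .*\(o2iblnd_cb\.c:[0-9]+:kiblnd_(passive_connect|reject|rejected)\(\)\)'
}
```

Typical use would be `dmesg | find_o2iblnd_rejects`, which prints nothing on a node that has not hit the problem.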

          Activity

            scadmin SC Admin added a comment -

            yeah, we gave up waiting and just built our own ib_core.ko module with the 1-character patch from centos.
            works fine now.

            cheers,
            robin

            mdiep Minh Diep added a comment -

            FYI https://bugs.centos.org/view.php?id=15193

            jfilizetti Jeremy Filizetti added a comment -

            I wish I had looked here first when digging into the same thing, instead of wasting a day trying to figure out the culprit. For now I've opened another Red Hat bug, since I didn't come across anything when searching their Bugzilla:

            https://bugzilla.redhat.com/show_bug.cgi?id=1625620

            scadmin SC Admin added a comment -

            hmm. comments attached to that article point to a fix in centos - potentially just a misplaced semi-colon. but OTOH the centos bug seems to be talking about IPoIB and that works fine. perhaps the fix is right and the bug report is wrong?

            cheers,
            robin


            srcc Stanford Research Computing Center added a comment -

            Still no update from Red Hat.

            We're getting more info via The Register:
            https://www.theregister.co.uk/2018/08/21/fix_for_julys_spectrelike_bug_is_breaking_some_supers/

            “The problem will be fixed in kernel-3.10.0-862.13.1 which is currently being reviewed by Red Hat Enterprise Linux Engineering.”

            But no ETA yet.

            pjones Peter Jones added a comment -

            Thanks for the info!


            srcc Stanford Research Computing Center added a comment - edited

            RHEL's reply:

            https://bugzilla.redhat.com/show_bug.cgi?id=1618452

            --- Comment #3 from Don Dutile <ddutile@redhat.com> ---
            Already reported and being actively fixed.

            Cannot make this public, as the patch that caused it was due to embargo'd
            security fix.

            This issue has highest priority for resolution.
            Revert to 3.10.0-862.11.5.el7 in the mean time.

            This bug has been marked as a duplicate of bug 1616346

            pjones Peter Jones added a comment -

            I think that you have to request for them to open it up.


            srcc Stanford Research Computing Center added a comment - edited

            I submitted RHEL bug #1618452 to report the issue:
            https://bugzilla.redhat.com/show_bug.cgi?id=1618452

            It seems to have been marked "private" by the Red Hat Bugzilla, and I have no way to mark it "public" on my end.

            srcc Stanford Research Computing Center added a comment -

            Good observation, indeed: perf tests such as ib_{read,send}_bw and ibv_rc_pingpong fail with errors like:

            Failed to modify QP to RTR
            Couldn't connect to remote QP

            or

            Failed to modify QP 386 to RTR
            Unable to Connect the HCA's through the link

            scadmin SC Admin added a comment -

            actually, ib_send_bw and ibv_rc_pingpong don't seem to work on this new kernel, so I suspect RHEL have broken all IB RDMA?

            do they work for you?

            IPoIB works ok.

            cheers,
            robin


            People

              pjones Peter Jones
              srcc Stanford Research Computing Center