RHEL/CentOS 3.10.0-862.11.6.el7.x86_64 kernel breaks LNet

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • Lustre 2.10.4
    • None
    • CentOS 7.5, x86_64
    • 3

    Description

      It looks like the latest kernel update from CentOS/Red Hat prevents LNet from working on InfiniBand interfaces (mlx5).

      Symptoms

      No LNet communication, self-ping doesn't work:

      # lctl list_nids
      10.9.101.60@o2ib4
      # lctl ping 10.9.101.60@o2ib4
      failed to ping 10.9.101.60@o2ib4: Input/output error

      Communicating with other nodes is impossible, as is mounting filesystems.
      The exact same node with the exact same configuration works flawlessly with kernel 3.10.0-862.9.1.el7.x86_64
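
The NID printed by `lctl list_nids` has the form address@network, and the network prefix indicates which LNet driver carries the traffic. A minimal sketch for splitting such a NID follows; the `Nid` type and the `parse_nid` and `lnd_module` helpers are hypothetical names for illustration, and the module table only covers the two drivers named in this ticket (ko2iblnd and ksocklnd).

```python
import re
from typing import NamedTuple, Optional


class Nid(NamedTuple):
    address: str   # host part, e.g. "10.9.101.60"
    net_type: str  # e.g. "o2ib" or "tcp"
    net_num: int   # e.g. 4 for "o2ib4"; 0 when no number is given


# Network type -> kernel module. Deliberately limited to the two LNDs
# mentioned in this ticket; other network types map to None.
LND_MODULES = {"o2ib": "ko2iblnd", "tcp": "ksocklnd"}

# The type must end in a letter so that the "2" in "o2ib" stays in the
# type while the trailing digits become the network number.
_NID_RE = re.compile(r"^(?P<addr>[^@\s]+)@(?P<type>[a-z0-9]*[a-z])(?P<num>\d*)$")


def parse_nid(text: str) -> Nid:
    """Split 'address@network' into its three parts."""
    m = _NID_RE.match(text.strip())
    if m is None:
        raise ValueError(f"not a NID: {text!r}")
    return Nid(m["addr"], m["type"], int(m["num"] or 0))


def lnd_module(nid: Nid) -> Optional[str]:
    """Return the driver module for a NID, or None if not in the table."""
    return LND_MODULES.get(nid.net_type)


if __name__ == "__main__":
    nid = parse_nid("10.9.101.60@o2ib4")
    print(nid, lnd_module(nid))
```

For the NID in this report the network type is o2ib, so the failing self-ping goes through ko2iblnd rather than the TCP driver.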

       Versions

      # uname -r
      3.10.0-862.11.6.el7.x86_64
      # cat /sys/fs/lustre/version
      2.10.4
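
As a quick guard before bringing up LNet, the `uname -r` string can be compared against the two kernel builds named above. This is only a sketch: `classify_kernel` and `strip_arch` are hypothetical helpers, and the two sets contain nothing beyond what this description reports, so any other build is classified as unknown rather than guessed at.

```python
import re

# Builds taken from this report only.
KNOWN_BROKEN = {"3.10.0-862.11.6.el7"}
KNOWN_WORKING = {"3.10.0-862.9.1.el7"}


def strip_arch(release: str) -> str:
    """Drop a trailing architecture suffix such as '.x86_64'."""
    return re.sub(r"\.(x86_64|aarch64|ppc64le)$", "", release.strip())


def classify_kernel(release: str) -> str:
    """Return 'broken', 'working' or 'unknown' for a `uname -r` string."""
    base = strip_arch(release)
    if base in KNOWN_BROKEN:
        return "broken"
    if base in KNOWN_WORKING:
        return "working"
    return "unknown"


if __name__ == "__main__":
    import platform

    running = platform.release()
    print(running, "->", classify_kernel(running))
```

An exact-match table is used on purpose: the report does not establish which other builds are affected, so no version range is assumed.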

      HW

       

      # ibstat
      CA 'mlx5_0'
              CA type: MT4115
              Number of ports: 1
              Firmware version: 12.21.3012
              Hardware version: 0
              Node GUID: 0x7cfe900300268c04
              System image GUID: 0x7cfe900300268c04
              Port 1:
                      State: Active
                      Physical state: LinkUp
                      Rate: 100
                      Base lid: 72
                      LMC: 0
                      SM lid: 6
                      Capability mask: 0x2651e848
                      Port GUID: 0x7cfe900300268c04
                      Link layer: InfiniBand

       

      Kernel logs

      [ 1185.337098] LNetError: 22109:0:(o2iblnd_cb.c:2513:kiblnd_passive_connect()) Can't accept 10.9.101.60@o2ib4: -22 
      [ 1185.348376] LNet: 22109:0:(o2iblnd_cb.c:2212:kiblnd_reject()) Error -22 sending reject 
      [ 1185.357473] LNetError: 22109:0:(o2iblnd_cb.c:2721:kiblnd_rejected()) 10.9.101.60@o2ib4 rejected: consumer defined fatal error
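
The log lines above share a fixed layout: level, two numeric fields, source file, line number, function name, then the message. The sketch below pulls those fields apart and translates a trailing negative code such as -22 into its errno name (EINVAL). `parse_lnet_line` is a hypothetical helper, the regular expression is derived only from the lines quoted in this ticket, and labelling the first numeric field as a PID is an assumption.

```python
import errno
import re
from typing import Optional

# Layout inferred from the lines quoted in this ticket, e.g.
#   LNetError: 22109:0:(o2iblnd_cb.c:2513:kiblnd_passive_connect()) msg
_LNET_RE = re.compile(
    r"(?P<level>LNet(?:Error)?): "
    r"(?P<pid>\d+):\d+:"  # first field assumed to be a PID; second ignored
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) "
    r"(?P<msg>.*)"
)

# A standalone negative integer in the message, e.g. "-22".
_ERRNO_RE = re.compile(r"(?<!\S)-(\d+)(?!\S)")


def parse_lnet_line(line: str) -> Optional[dict]:
    """Parse one LNet log line; return None if it has no source location."""
    m = _LNET_RE.search(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["msg"] = rec["msg"].strip()
    rec["line"] = int(rec["line"])
    code = _ERRNO_RE.search(rec["msg"])
    # Map e.g. 22 -> "EINVAL"; None when the message carries no code.
    rec["errno"] = errno.errorcode.get(int(code.group(1))) if code else None
    return rec


if __name__ == "__main__":
    sample = (
        "[ 1185.337098] LNetError: 22109:0:(o2iblnd_cb.c:2513:"
        "kiblnd_passive_connect()) Can't accept 10.9.101.60@o2ib4: -22"
    )
    print(parse_lnet_line(sample))
```

Read this way, the first two lines both report EINVAL from the o2iblnd connection path, while the third carries no numeric code.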

          Activity

            [LU-11257] RHEL/CentOS 3.10.0-862.11.6.el7.x86_64 kernel breaks LNet

            srcc Stanford Research Computing Center added a comment -

            Still no update from Red Hat.

            We're getting more info via The Register:
            https://www.theregister.co.uk/2018/08/21/fix_for_julys_spectrelike_bug_is_breaking_some_supers/

            “The problem will be fixed in kernel-3.10.0-862.13.1 which is currently being reviewed by Red Hat Enterprise Linux Engineering.”

            But no ETA yet.

            pjones Peter Jones added a comment -

            Thanks for the info!


            srcc Stanford Research Computing Center added a comment - edited

            RHEL's reply:

            https://bugzilla.redhat.com/show_bug.cgi?id=1618452

            --- Comment #3 from Don Dutile <ddutile@redhat.com> ---
            Already reported and being actively fixed.

            Cannot make this public, as the patch that caused it was due to embargo'd
            security fix.

            This issue has highest priority for resolution.
            Revert to 3.10.0-862.11.5.el7 in the mean time.

            This bug has been marked as a duplicate of bug 1616346

            pjones Peter Jones added a comment -

            I think that you have to request for them to open it up.


            srcc Stanford Research Computing Center added a comment - edited

            I submitted RHEL bug #1618452 to report the issue:
            https://bugzilla.redhat.com/show_bug.cgi?id=1618452

            It seems to be marked "private" by the Red Hat Bugzilla, and I have no way to mark it "public" on my end.


            srcc Stanford Research Computing Center added a comment -

            Good observation, indeed: perf tests such as ib_{read,send}_bw and ibv_rc_pingpong fail with errors like:

            Failed to modify QP to RTR
            Couldn't connect to remote QP

            or

            Failed to modify QP 386 to RTR
            Unable to Connect the HCA's through the link

            scadmin SC Admin added a comment -

            actually, ib_send_bw and ibv_rc_pingpong don't seem to work on this new kernel, so I suspect RHEL has broken all IB RDMA?

            do they work for you?

            IPoIB works ok.

            cheers,
            robin

            scadmin SC Admin added a comment -

            we see the same thing on our OPA network.

            ksocklnd reportedly seems ok with this kernel on our TCP networks (in VMs mostly), so I suspect it's ko2iblnd related.

            below is syslog from
            john98 # lctl ping warble@o2ib44

            Aug 17 01:58:10 john98 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 36, npartitions: 2
            Aug 17 01:58:10 john98 kernel: alg: No test for adler32 (adler32-zlib)
            Aug 17 01:58:11 john98 kernel: Lustre: Lustre: Build Version: 2.10.4
            Aug 17 01:58:11 john98 kernel: LNet: Using FMR for registration
            Aug 17 01:58:11 john98 kernel: LNet: Added LNI 192.168.44.198@o2ib44 [128/2048/0/180]
            Aug 17 01:58:36 warble1 kernel: LNetError: 103:0:(o2iblnd_cb.c:3061:kiblnd_cm_callback()) 192.168.44.198@o2ib44: REJECTED 28
            Aug 17 01:58:36 warble1 kernel: LNetError: 103:0:(o2iblnd_cb.c:3061:kiblnd_cm_callback()) Skipped 3 previous similar messages
            Aug 17 02:06:08 john98 kernel: LNetError: 204:0:(o2iblnd_cb.c:2513:kiblnd_passive_connect()) Can't accept 192.168.44.198@o2ib44: -22
            Aug 17 02:06:08 john98 kernel: LNet: 204:0:(o2iblnd_cb.c:2212:kiblnd_reject()) Error -22 sending reject
            Aug 17 02:06:08 john98 kernel: LNetError: 204:0:(o2iblnd_cb.c:2721:kiblnd_rejected()) 192.168.44.198@o2ib44 rejected: consumer defined fatal error
            

            2.10.4 was dkms rebuilt for this kernel.

            cheers,
            robin

            srcc Stanford Research Computing Center added a comment -

            A more detailed changelog about that kernel is at https://access.redhat.com/downloads/content/rhel---7/x86_64/2456/kernel/3.10.0-862.11.6.el7/x86_64/fd431d51/package-changelog, if that's of any help.

            People

              pjones Peter Jones
              srcc Stanford Research Computing Center
              Votes:
              0
              Watchers:
              11
