ko2iblnd receiving IB_WC_MW_BIND_ERR errors.

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Affects Version/s: Lustre 2.8.0, Lustre 2.9.0
    • Environment: Power8 running RHEL with a MOFED 3.3 stack.
    • Severity: 3

    Description

      After moving to our production Power8 system running a MOFED stack, we are seeing a new IB error in ko2iblnd that we had not encountered before.

      [ 170.597561] mlx5_warn:mlx5_0:dump_cqe:257:(pid 8738): dump error cqe
      [ 170.597620] mlx5_warn:mlx5_0:dump_cqe:257:(pid 8714): dump error cqe
      [ 170.597622] 00000000 00000000 00000000 00000000
      [ 170.597623] 00000000 00000000 00000000 00000000
      [ 170.597625] 00000000 00000000 00000000 00000000
      [ 170.597626] 00000000 08007806 25000039 0642b3d2
      [ 170.597651] LNet: 8714:0:(o2iblnd_cb.c:3433:kiblnd_complete()) FastReg failed: 6
      [ 170.597728] LNet: 8713:0:(o2iblnd_cb.c:3444:kiblnd_complete()) RDMA (tx: c000003c6a78c5a8) failed: 5
      [ 170.598355] 00000000 00000000 00000000 00000000
      [ 170.598403] 00000000 00000000 00000000 00000000
      [ 170.599245] powernv-cpufreq: CPU 104 on Chip 1 has Pmax restored to 0
      [ 170.599647] LNet: 8714:0:(o2iblnd_cb.c:990:kiblnd_tx_complete()) Tx -> 10.39.232.11@o2ib6 cookie 0x63e sending 1 waiting 0: failed 5
      [ 170.599651] LNet: 8714:0:(o2iblnd_cb.c:990:kiblnd_tx_complete()) Skipped 2 previous similar messages
      [ 170.599654] LNet: 8713:0:(o2iblnd_cb.c:1934:kiblnd_close_conn_locked()) Closing conn to 10.39.232.11@o2ib6: error -5(waiting)
      [ 170.599669] LustreError: 8714:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc c000003c62cf5c00
      [ 170.599675] Lustre: 8896:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1476124274/real 1476124274] req@c000003c4e340000 x1547828424878916/t0(0) o4->atlastds-OST0035-osc-c000001fc5b75000@10.36.226.69@o2ib:6/4 lens 608/448 e 0 to 1 dl 1476124841 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
      [ 170.599681] Lustre: atlastds-OST0035-osc-c000001fc5b75000: Connection to atlastds-OST0035 (at 10.36.226.69@o2ib) was lost; in progress operations using this service will wait for recovery to complete
      [ 170.611219] 00000000 00000000 00000000 00000000
      [ 170.612270] 00000000 08007806 2500003a 06789cd2
      [ 170.613866] LustreError: 8737:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc c000001fb98c0400

          Activity

            [LU-8693] ko2iblnd receiving IB_WC_MW_BIND_ERR errors.

            doug Doug Oucharek (Inactive) added a comment - I believe this bug is addressed by the fix to LU-8752.

            simmonsja James A Simmons added a comment - We only use OFED 3.12 in our production systems. Also, on our Cray systems we don't enable map_on_demand, so we don't see any problems.

            bhoagland Brad Hoagland (Inactive) added a comment - Hi simmonsja, any thoughts on Doug and Mahmoud's OFED query?

            doug Doug Oucharek (Inactive) added a comment - That's a good question. James? Have you tried the upstream OFED for this?

            mhanafi Mahmoud Hanafi added a comment - Does OFED reproduce this error?

            doug Doug Oucharek (Inactive) added a comment - The only reference I can find to IB_WC_MW_BIND_ERR in the upstream OFED code is in Linux/drivers/infiniband/hw/mlx5/cq.c, in the routine mlx5_handle_error_cqe():

            ...
            switch (cqe->syndrome) {
            ...
                    case MLX5_CQE_SYNDROME_MW_BIND_ERR:
                            wc->status = IB_WC_MW_BIND_ERR;
                            break;
            ...

            I cannot find any other reference to MLX5_CQE_SYNDROME_MW_BIND_ERR, so I am assuming this comes from the MLX5 driver or firmware.

            doug Doug Oucharek (Inactive) added a comment - It would be very useful to know under what conditions MOFED returns this error. Without access to the MOFED source or the firmware source (if the error is generated by firmware), I cannot determine that.

            Do you have a support ticket open with Mellanox for this? If they can provide us with a list of conditions that generate this error, we would have something to work with to debug what we are doing wrong in o2iblnd.

            simmonsja James A Simmons added a comment - That's all we have.

            doug Doug Oucharek (Inactive) added a comment - Is this only happening with Power8 to/from x86?

            simmonsja James A Simmons added a comment - FastReg

            People

              ashehata Amir Shehata (Inactive)
              simmonsja James A Simmons