[LU-8693] ko2iblnd receiving IB_WC_MW_BIND_ERR errors. Created: 11/Oct/16  Updated: 12/Oct/17  Resolved: 31/Jan/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0, Lustre 2.9.0
Fix Version/s: None

Type: Bug
Priority: Critical
Reporter: James A Simmons
Assignee: Amir Shehata (Inactive)
Resolution: Duplicate
Votes: 0
Labels: None
Environment:

Power8 running RHEL with a MOFED 3.3 stack.


Issue Links:
Duplicate
duplicates LU-8752 mlx5_warn:mlx5_0:dump_cqe:257: Resolved
Related
is related to LU-6387 Add Power8 support to Lustre Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

After moving to our production Power8 system running a MOFED stack, we are seeing a new IB error in ko2iblnd that we had not encountered before.

[ 170.597561] mlx5_warn:mlx5_0:dump_cqe:257:(pid 8738): dump error cqe
[ 170.597620] mlx5_warn:mlx5_0:dump_cqe:257:(pid 8714): dump error cqe
[ 170.597622] 00000000 00000000 00000000 00000000
[ 170.597623] 00000000 00000000 00000000 00000000
[ 170.597625] 00000000 00000000 00000000 00000000
[ 170.597626] 00000000 08007806 25000039 0642b3d2
[ 170.597651] LNet: 8714:0:(o2iblnd_cb.c:3433:kiblnd_complete()) FastReg failed: 6
[ 170.597728] LNet: 8713:0:(o2iblnd_cb.c:3444:kiblnd_complete()) RDMA (tx: c000003c6a78c5a8) failed: 5
[ 170.598355] 00000000 00000000 00000000 00000000
[ 170.598403] 00000000 00000000 00000000 00000000
[ 170.599245] powernv-cpufreq: CPU 104 on Chip 1 has Pmax restored to 0
[ 170.599647] LNet: 8714:0:(o2iblnd_cb.c:990:kiblnd_tx_complete()) Tx -> 10.39.232.11@o2ib6 cookie 0x63e sending 1 waiting 0: failed 5
[ 170.599651] LNet: 8714:0:(o2iblnd_cb.c:990:kiblnd_tx_complete()) Skipped 2 previous similar messages
[ 170.599654] LNet: 8713:0:(o2iblnd_cb.c:1934:kiblnd_close_conn_locked()) Closing conn to 10.39.232.11@o2ib6: error -5(waiting)
[ 170.599669] LustreError: 8714:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc c000003c62cf5c00
[ 170.599675] Lustre: 8896:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1476124274/real 1476124274] req@c000003c4e340000 x1547828424878916/t0(0) o4->atlastds-OST0035-osc-c000001fc5b75000@10.36.226.69@o2ib:6/4 lens 608/448 e 0 to 1 dl 1476124841 ref 2 fl Rpc:X/0/ffffffff rc 0/-1
[ 170.599681] Lustre: atlastds-OST0035-osc-c000001fc5b75000: Connection to atlastds-OST0035 (at 10.36.226.69@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[ 170.611219] 00000000 00000000 00000000 00000000
[ 170.612270] 00000000 08007806 2500003a 06789cd2
[ 170.613866] LustreError: 8737:0:(events.c:201:client_bulk_callback()) event type 1, status -5, desc c000001fb98c0400
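
The integers printed by kiblnd_complete() in the log above ("FastReg failed: 6", "RDMA ... failed: 5") appear to be work-completion status values. Assuming they follow enum ib_wc_status as ordered in upstream include/rdma/ib_verbs.h (the MOFED 3.3 headers are assumed to match), 6 is IB_WC_MW_BIND_ERR and 5 is IB_WC_WR_FLUSH_ERR. A minimal lookup sketch follows; wc_status_name is a hypothetical helper for reading logs, not a Lustre or OFED function.

```c
#include <stddef.h>

/* First eight values of enum ib_wc_status, in the order used by
 * upstream include/rdma/ib_verbs.h. Later values are omitted. */
static const char *const wc_status_names[] = {
    "IB_WC_SUCCESS",        /* 0 */
    "IB_WC_LOC_LEN_ERR",    /* 1 */
    "IB_WC_LOC_QP_OP_ERR",  /* 2 */
    "IB_WC_LOC_EEC_OP_ERR", /* 3 */
    "IB_WC_LOC_PROT_ERR",   /* 4 */
    "IB_WC_WR_FLUSH_ERR",   /* 5 */
    "IB_WC_MW_BIND_ERR",    /* 6 */
    "IB_WC_BAD_RESP_ERR",   /* 7 */
};

/* Hypothetical helper: map a logged status integer to its enum name.
 * Returns "unknown" for values outside the partial table above. */
const char *wc_status_name(int status)
{
    size_t n = sizeof(wc_status_names) / sizeof(wc_status_names[0]);

    if (status < 0 || (size_t)status >= n)
        return "unknown";
    return wc_status_names[status];
}
```

Under that assumption, the FastReg work request failed with a bind error, and the RDMA work request queued behind it was flushed once the queue pair entered the error state.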



 Comments   
Comment by Peter Jones [ 11/Oct/16 ]

Doug

Could you please advise on this one?

Peter

Comment by Doug Oucharek (Inactive) [ 11/Oct/16 ]

James, do you know if this is using FastReg or the older FMR?

Comment by James A Simmons [ 11/Oct/16 ]

FastReg

Comment by Doug Oucharek (Inactive) [ 11/Oct/16 ]

Is this only happening with Power8 to/from x86?

Comment by James A Simmons [ 13/Oct/16 ]

That's all we have.

Comment by Doug Oucharek (Inactive) [ 13/Oct/16 ]

It would be very useful to know under what conditions MOFED returns this error. Without access to the MOFED source or the firmware source (if the error is generated by firmware), I cannot determine that.

Do you have a support ticket opened with Mellanox for this? If they can provide us with a list of conditions which generate this error, we would have something to work with to debug what we are doing wrong in o2iblnd.

Comment by Doug Oucharek (Inactive) [ 13/Oct/16 ]

The only reference I can find to IB_WC_MW_BIND_ERR in the upstream OFED code is in Linux/drivers/infiniband/hw/mlx5/cq.c, routine: mlx5_handle_error_cqe():

...
switch (cqe->syndrome) {
...
        case MLX5_CQE_SYNDROME_MW_BIND_ERR:
                wc->status = IB_WC_MW_BIND_ERR;
                break;
...

I cannot find any other reference to MLX5_CQE_SYNDROME_MW_BIND_ERR so I am assuming this comes from the MLX5 driver or firmware.
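
The syndrome that feeds this switch can probably be read straight out of the dumped CQE in the description. The sketch below assumes the upstream layout of struct mlx5_err_cqe (64 bytes, vendor_err_synd at byte 54, syndrome at byte 55) and that dump_cqe prints the CQE as sixteen big-endian 32-bit words, four per line; MOFED 3.3 may differ. The function names are hypothetical, and the syndrome table is a subset of the MLX5_CQE_SYNDROME_* values in upstream include/linux/mlx5/device.h.

```c
#include <stdint.h>

/* Under the assumed layout, bytes 52..55 of the CQE form word 13 of the
 * dump, so the syndrome is its low byte and the vendor syndrome is the
 * byte above it. */
uint8_t err_cqe_syndrome(const uint32_t words[16])
{
    return (uint8_t)(words[13] & 0xff);
}

uint8_t err_cqe_vendor_syndrome(const uint32_t words[16])
{
    return (uint8_t)((words[13] >> 8) & 0xff);
}

/* Partial table of MLX5_CQE_SYNDROME_* values (upstream header). */
const char *mlx5_syndrome_name(uint8_t syndrome)
{
    switch (syndrome) {
    case 0x01: return "MLX5_CQE_SYNDROME_LOCAL_LENGTH_ERR";
    case 0x02: return "MLX5_CQE_SYNDROME_LOCAL_QP_OP_ERR";
    case 0x04: return "MLX5_CQE_SYNDROME_LOCAL_PROT_ERR";
    case 0x05: return "MLX5_CQE_SYNDROME_WR_FLUSH_ERR";
    case 0x06: return "MLX5_CQE_SYNDROME_MW_BIND_ERR";
    case 0x10: return "MLX5_CQE_SYNDROME_BAD_RESP_ERR";
    default:   return "unknown";
    }
}
```

Applied to the dumped row "00000000 08007806 25000039 0642b3d2" (words 12 to 15), this gives syndrome 0x06 and vendor syndrome 0x78, which is consistent with the IB_WC_MW_BIND_ERR status reported by o2iblnd.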

Comment by Mahmoud Hanafi [ 01/Nov/16 ]

Does OFED reproduce this error?

Comment by Doug Oucharek (Inactive) [ 01/Nov/16 ]

That's a good question. James? Have you tried the upstream OFED for this?

Comment by Brad Hoagland (Inactive) [ 11/Nov/16 ]

Hi simmonsja,
Any thoughts on Doug and Mahmoud's OFED query?

Comment by James A Simmons [ 11/Nov/16 ]

We only use OFED 3.12 in our production systems. Also for our Cray systems we don't enable map_on_demand so we don't see any problems.

Comment by Doug Oucharek (Inactive) [ 16/Dec/16 ]

I believe this bug is addressed by the fix to LU-8752.
