[LU-8693] ko2iblnd receiving IB_WC_MW_BIND_ERR errors. Created: 11/Oct/16 Updated: 12/Oct/17 Resolved: 31/Jan/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0, Lustre 2.9.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | James A Simmons | Assignee: | Amir Shehata (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Power8 running RHEL with a MOFED 3.3 stack. |
||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
After moving to our production Power8 system running a MOFED stack, we are seeing a new IB error in ko2iblnd that we had not encountered before:
[ 170.597561] mlx5_warn:mlx5_0:dump_cqe:257:(pid 8738): dump error cqe |
| Comments |
| Comment by Peter Jones [ 11/Oct/16 ] |
|
Doug, could you please advise on this one? Peter |
| Comment by Doug Oucharek (Inactive) [ 11/Oct/16 ] |
|
James, do you know if this is using FastReg or the older FMR? |
| Comment by James A Simmons [ 11/Oct/16 ] |
|
FastReg |
| Comment by Doug Oucharek (Inactive) [ 11/Oct/16 ] |
|
Is this only happening with Power8 to/from x86? |
| Comment by James A Simmons [ 13/Oct/16 ] |
|
That's all we have. |
| Comment by Doug Oucharek (Inactive) [ 13/Oct/16 ] |
|
It would be very useful to know under what conditions MOFED returns this error. Without access to the MOFED source or the firmware source (if the error is generated by firmware), I cannot determine that. Do you have a support ticket opened with Mellanox for this? If they can provide us with a list of conditions which generate this error, we would have something to work with to debug what we are doing wrong in o2iblnd. |
| Comment by Doug Oucharek (Inactive) [ 13/Oct/16 ] |
|
The only reference I can find to IB_WC_MW_BIND_ERR in the upstream OFED code is in Linux/drivers/infiniband/hw/mlx5/cq.c, in the routine mlx5_handle_error_cqe():
...
switch (cqe->syndrome) {
...
case MLX5_CQE_SYNDROME_MW_BIND_ERR:
	wc->status = IB_WC_MW_BIND_ERR;
	break;
...
I cannot find any other reference to MLX5_CQE_SYNDROME_MW_BIND_ERR so I am assuming this comes from the MLX5 driver or firmware. |
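For readers following the excerpt above, here is a minimal user-space sketch of the same pattern: a hardware-reported syndrome is translated into a generic completion status, which the consumer (here, o2iblnd) only sees after translation. Every name prefixed with demo_ or DEMO_ is hypothetical, and the numeric values are illustrative stand-ins rather than the kernel's definitions. The real mapping lives in mlx5_handle_error_cqe().

```c
#include <stdio.h>

/* Hypothetical stand-ins for the driver's CQE syndrome codes.
 * The values are illustrative assumptions, not taken from the mlx5 headers. */
enum demo_cqe_syndrome {
	DEMO_CQE_SYNDROME_LOCAL_PROT_ERR = 0x04,
	DEMO_CQE_SYNDROME_WR_FLUSH_ERR   = 0x05,
	DEMO_CQE_SYNDROME_MW_BIND_ERR    = 0x06,
};

/* Hypothetical stand-ins for the generic work-completion status codes. */
enum demo_wc_status {
	DEMO_WC_SUCCESS,
	DEMO_WC_LOC_PROT_ERR,
	DEMO_WC_WR_FLUSH_ERR,
	DEMO_WC_MW_BIND_ERR,
	DEMO_WC_GENERAL_ERR,
};

/* Translate a device syndrome into a generic status, mirroring the
 * switch statement quoted in the comment above. Unknown syndromes
 * fall back to a general error. */
enum demo_wc_status demo_syndrome_to_status(unsigned int syndrome)
{
	switch (syndrome) {
	case DEMO_CQE_SYNDROME_LOCAL_PROT_ERR:
		return DEMO_WC_LOC_PROT_ERR;
	case DEMO_CQE_SYNDROME_WR_FLUSH_ERR:
		return DEMO_WC_WR_FLUSH_ERR;
	case DEMO_CQE_SYNDROME_MW_BIND_ERR:
		return DEMO_WC_MW_BIND_ERR;
	default:
		return DEMO_WC_GENERAL_ERR;
	}
}

/* Human-readable label for logging a failed completion. */
const char *demo_wc_status_str(enum demo_wc_status status)
{
	switch (status) {
	case DEMO_WC_SUCCESS:      return "success";
	case DEMO_WC_LOC_PROT_ERR: return "local protection error";
	case DEMO_WC_WR_FLUSH_ERR: return "work request flushed";
	case DEMO_WC_MW_BIND_ERR:  return "memory window bind error";
	default:                   return "general error";
	}
}
```

Because the status is derived purely from the syndrome, the consumer cannot tell from the status alone which condition in the driver or firmware raised it. That is why a list of triggering conditions from the vendor would be needed.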
| Comment by Mahmoud Hanafi [ 01/Nov/16 ] |
|
Does OFED reproduce this error? |
| Comment by Doug Oucharek (Inactive) [ 01/Nov/16 ] |
|
That's a good question. James? Have you tried the upstream OFED for this? |
| Comment by Brad Hoagland (Inactive) [ 11/Nov/16 ] |
|
Hi simmonsja, |
| Comment by James A Simmons [ 11/Nov/16 ] |
|
We only use OFED 3.12 in our production systems. Also, on our Cray systems we don't enable map_on_demand, so we don't see any problems. |
| Comment by Doug Oucharek (Inactive) [ 16/Dec/16 ] |
|
I believe this bug is addressed by the fix to |