Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version: Lustre 2.10.0
    • Affects Version: Lustre 2.7.0
    • Labels: None
    • Environment: mlnx ofed3.2
      lustre-2.7.2-2nas-fe
      Linux elrtr1 3.0.101-77.1.20160630-nasa #1 SMP Thu Jun 30 00:56:32 UTC 2016 (a082ea6) x86_64 x86_64 x86_64 GNU/Linux

    Description

      Running lnet selftest on an mlx5 card, we get these errors:

      [1477328975.069684] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10912): dump error cqe
      [1477328975.085684] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10906): dump error cqe
      [1477328975.085684] 00000000 00000000 00000000 00000000
      [1477328975.085684] 00000000 00000000 00000000 00000000
      [1477328975.085684] 00000000 00000000 00000000 00000000
      [1477328975.085684] 00000000 08007806 2500002f 00085dd0
      [1477328975.085684] LustreError: 11028:0:(brw_test.c:388:brw_bulk_ready()) BRW bulk READ failed for RPC from 12345-10.151.27.25@o2ib: -5
      [1477328975.085684] LustreError: 11028:0:(brw_test.c:362:brw_server_rpc_done()) Bulk transfer to 12345-10.151.27.25@o2ib has failed: -5
      [1477328975.093683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10922): dump error cqe
      [1477328975.093683] 00000000 00000000 00000000 00000000
      [1477328975.093683] 00000000 00000000 00000000 00000000
      [1477328975.093683] 00000000 00000000 00000000 00000000
      [1477328975.093683] 00000000 08007806 25000030 000842d0
      [1477328975.105683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10915): dump error cqe
      [1477328975.105683] 00000000 00000000 00000000 00000000
      [1477328975.105683] 00000000 00000000 00000000 00000000
      [1477328975.105683] 00000000 00000000 00000000 00000000
      [1477328975.105683] 00000000 08007806 25000031 000843d0
      [1477328975.113683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10900): dump error cqe
      [1477328975.113683] 00000000 00000000 00000000 00000000
      [1477328975.113683] 00000000 00000000 00000000 00000000
      [1477328975.113683] 00000000 00000000 00000000 00000000
      [1477328975.113683] 00000000 08007806 25000032 000840d0
      [1477328975.121683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10900): dump error cqe
      [1477328975.121683] 00000000 00000000 00000000 00000000
      [1477328975.121683] 00000000 00000000 00000000 00000000
      [1477328975.121683] 00000000 00000000 00000000 00000000
      [1477328975.121683] 00000000 08007806 25000033 000841d0
      [1477328975.129683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10915): dump error cqe
      [1477328975.129683] 00000000 00000000 00000000 00000000
      [1477328975.129683] 00000000 00000000 00000000 00000000
      [1477328975.129683] 00000000 00000000 00000000 00000000
      [1477328975.129683] 00000000 08007806 2500002e 00085cd0
      [1477328975.133683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10907): dump error cqe
      [1477328975.133683] 00000000 00000000 00000000 00000000
      [1477328975.133683] 00000000 00000000 00000000 00000000
      [1477328975.133683] 00000000 00000000 00000000 00000000
      [1477328975.133683] 00000000 08007806 25000034 000846d0
      [1477328975.205682] 00000000 00000000 00000000 00000000
      [1477328975.281682] 00000000 00000000 00000000 00000000
      [1477328975.305681] 00000000 00000000 00000000 00000000
      [1477328975.305681] 00000000 08007806 2500002d 000b57d0
      
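      For context, a minimal lnet selftest session of the kind used to trigger these errors might be set up roughly as follows. This is a sketch only; the group names and NIDs are assumptions taken from the logs in this ticket, not the actual session script:

      # load the selftest module on the console node and on all test nodes
      modprobe lnet_selftest
      export LST_SESSION=$$
      lst new_session read_write
      # one group per card type; the NIDs below are the peers seen in the logs
      lst add_group mlx5_host 10.151.27.25@o2ib
      lst add_group mlx4_host 10.151.27.56@o2ib
      lst add_batch bulk_rw
      lst add_test --batch bulk_rw --from mlx4_host --to mlx5_host brw read size=1M check=full
      lst run bulk_rw
      lst stat mlx4_host mlx5_host
      lst end_session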

          Activity

            [LU-8752] mlx5_warn:mlx5_0:dump_cqe:257:

            mhanafi Mahmoud Hanafi added a comment -

            Mellanox has also been able to reproduce this issue in their lab and is looking at it.
            doug Doug Oucharek (Inactive) added a comment - - edited

            I've been able to get mixed mlx4 and mlx5 cards in the same cluster. Using your test examples, I have been able to reproduce this issue getting the same results as you.

            I verified that the issue only occurs when map_on_demand is non-zero. I have reproduced the issue with both upstream OFED and MOFED (latest version for both). I have also reproduced it with RHEL 6.8 and 7.3.

            No solution yet, but at least I have a way to investigate this now.

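            For reference, map_on_demand is a ko2iblnd module parameter; one way to set it and confirm the value in use is sketched below (the value 32 is purely illustrative, not the setting from this cluster):

            # /etc/modprobe.d/ko2iblnd.conf
            # 0 leaves map-on-demand disabled; a non-zero value enables FMR/FastReg mapping
            options ko2iblnd map_on_demand=32

            # confirm the value on a running node
            cat /sys/module/ko2iblnd/parameters/map_on_demand
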
            mhanafi Mahmoud Hanafi added a comment - - edited

            After the host is rebooted, the error can be reproduced when reading from mlx4 to mlx5.
            This will produce the dump_cqe error.

            #test1
            lst add_test --batch bulk_rw --from mlx4_host --to mlx5_host brw read size=1M check=full
            

            But this will work!

            #test2
            lst add_test --batch bulk_rw --from mlx4_host --to mlx5_host brw write size=1M check=full 
            

            What's interesting is that once the write test is run, the read starts to work.

            doug Doug Oucharek (Inactive) added a comment - - edited

            Thank you Mahmoud. I have suspected the Fast Memory Registration code could be the culprit here. We had to add support for that on top of the older FMR code because newer MLX cards only support Fast Memory Registration. As such, that code is fairly new and could have an issue. It would be great if Mellanox can point us specifically at where we are making a mistake so we can address it.


            mhanafi Mahmoud Hanafi added a comment -

            We have opened a case with Mellanox and sent them some debugging. Waiting to hear back, but they did say the message is related to Fast Memory Registration Mode.

            doug Doug Oucharek (Inactive) added a comment -

            So, I assume that means the dump_cqe message happens without the LU-7650 patch but the "RDMA is too large for peer" message does not. They must then be unrelated.

            I'm going to have to go on the Mellanox community board to ask about this mlx5-specific failure. We do not have a support contract with Mellanox and I doubt they will help fix this failure without one. To my knowledge, there is nothing we are doing in ko2iblnd which would break the MLX code, "assuming" they did proper backwards compatibility between mlx4 and mlx5. We have never seen such an error or problem with mlx4, so I have to assume they did something to mlx5 to change things.

            If NASA has a support contract with Mellanox, perhaps you can raise a ticket with them to get their input on the potential causes of these error messages. That would help us determine if there is something we need to change in ko2iblnd.

            jaylan Jay Lan (Inactive) added a comment -

            I did pick up LU-7650. Don't remember why I picked that one up.

            However, I picked up LU-7650 in the 2.7.2-3.1nas build, but Mahmoud reported this problem against 2.7.2-2nas. The LU-7650 patch is not in 2.7.2-2nas.

            simmonsja James A Simmons added a comment -

            Only if they applied the patch. The patch for LU-7650 never landed in lustre 2.7 or lustre 2.8.0. In fact, it doesn't exist anywhere now but the upstream client. The problem they are seeing was first seen by Jeremy in one of the LU-3322 patches; I have to find it in the comments. The patch was still landed because no one else was able to reproduce the problem. I think he couldn't reproduce it after a while. It is hit or miss with this.

            doug Doug Oucharek (Inactive) added a comment -

            I can't comment on the dump_cqe message and what it means, but I do understand what is causing the "RDMA is too large for peer" message and its breakage.

            This was caused by the patch associated with LU-7650. The patch was trying to make o2iblnd fragments work with PPC's different page sizes. Even after inspections (I was one of the inspectors), we did not see the breakage. To be honest, I still don't know why that patch is causing this issue. As a result, a revert patch was created for master and has landed: http://review.whamcloud.com/#/c/23439/.

            I suspect that NASA has picked up the patch from LU-7650, but not the revert. Either remove the patch from LU-7650 or apply the revert patch.

            Hopefully, the dump_cqe message is associated with this and will go away once this is fixed.
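            A quick way to check a local branch for the LU-7650 patch and pull in the revert could look like the following sketch; the remote name and patch set number are placeholders, so confirm them against http://review.whamcloud.com/#/c/23439/ before use:

            # does the local tree carry the LU-7650 change?
            git log --oneline --grep='LU-7650'

            # if so, either drop that patch from the local stack, or cherry-pick
            # the revert that landed on master (Gerrit change 23439)
            git fetch <lustre-release-remote> refs/changes/39/23439/<patchset>
            git cherry-pick FETCH_HEAD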

            mhanafi Mahmoud Hanafi added a comment -

            The above tests were all done with a ConnectIB card. We tried an EDR card and got one piece of additional info. Just before the cqe error we get this:

            [1477940064.283784] LNetError: 11647:0:(o2iblnd_cb.c:1082:kiblnd_init_rdma()) RDMA is too large for peer 10.151.27.56@o2ib (131072), src size: 1048576 dst size: 1048576
            [1477940064.327784] LNetError: 11647:0:(o2iblnd_cb.c:434:kiblnd_handle_rx()) Can't setup rdma for PUT to 10.151.27.56@o2ib: -90
            
            [1477940064.363784] mlx5_warn:mlx5_1:dump_cqe:257:(pid 11647): dump error cqe
            [1477940064.383784] 00000000 00000000 00000000 00000000
            [1477940064.383784] 00000000 00000000 00000000 00000000
            [1477940064.383784] 00000000 00000000 00000000 00000000
            [1477940064.383784] 00000000 08007806 2500002e 026fc4d2
            [1477940064.383784] LustreError: 11597:0:(brw_test.c:362:brw_server_rpc_done()) Bulk transfer from 12345-10.151.27.56@o2ib has failed: -5
            [1477940064.383784] LustreError: 11598:0:(brw_test.c:388:brw_bulk_ready()) BRW bulk READ failed for RPC from 12345-10.151.27.56@o2ib: -5
            [1477940064.455783] LustreError: 11597:0:(brw_test.c:362:brw_server_rpc_done()) Skipped 1 previous similar message
            
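            For what it's worth, the numbers in the "RDMA is too large for peer" line look consistent with a fragment-count mismatch; a rough sketch of the arithmetic, assuming 4 KiB pages:

            peer limit:   131072 bytes = 32 fragments x 4096 bytes/page
            requested:   1048576 bytes = 256 fragments for the 1M brw transfer

            Needing 256 fragments against a 32-fragment limit would explain the -90 (EMSGSIZE) failure from kiblnd_init_rdma().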

            mhanafi Mahmoud Hanafi added a comment -

            Yes, the debug output included +neterror.

            People

              Assignee: doug Doug Oucharek (Inactive)
              Reporter: mhanafi Mahmoud Hanafi
              Votes: 0
              Watchers: 15
