[LU-8752] mlx5_warn:mlx5_0:dump_cqe:257

| Created: | 24/Oct/16 | Updated: | 18/Jan/18 | Resolved: | 18/Jan/17 |
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Mahmoud Hanafi | Assignee: | Doug Oucharek (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Environment: | MLNX OFED 3.2 |
| Severity: | 2 |
| Description |
|
Running LNet selftest on an mlx5 card, we get these errors:

[1477328975.069684] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10912): dump error cqe
[1477328975.085684] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10906): dump error cqe
[1477328975.085684] 00000000 00000000 00000000 00000000
[1477328975.085684] 00000000 00000000 00000000 00000000
[1477328975.085684] 00000000 00000000 00000000 00000000
[1477328975.085684] 00000000 08007806 2500002f 00085dd0
[1477328975.085684] LustreError: 11028:0:(brw_test.c:388:brw_bulk_ready()) BRW bulk READ failed for RPC from 12345-10.151.27.25@o2ib: -5
[1477328975.085684] LustreError: 11028:0:(brw_test.c:362:brw_server_rpc_done()) Bulk transfer to 12345-10.151.27.25@o2ib has failed: -5
[1477328975.093683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10922): dump error cqe
[1477328975.093683] 00000000 00000000 00000000 00000000
[1477328975.093683] 00000000 00000000 00000000 00000000
[1477328975.093683] 00000000 00000000 00000000 00000000
[1477328975.093683] 00000000 08007806 25000030 000842d0
[1477328975.105683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10915): dump error cqe
[1477328975.105683] 00000000 00000000 00000000 00000000
[1477328975.105683] 00000000 00000000 00000000 00000000
[1477328975.105683] 00000000 00000000 00000000 00000000
[1477328975.105683] 00000000 08007806 25000031 000843d0
[1477328975.113683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10900): dump error cqe
[1477328975.113683] 00000000 00000000 00000000 00000000
[1477328975.113683] 00000000 00000000 00000000 00000000
[1477328975.113683] 00000000 00000000 00000000 00000000
[1477328975.113683] 00000000 08007806 25000032 000840d0
[1477328975.121683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10900): dump error cqe
[1477328975.121683] 00000000 00000000 00000000 00000000
[1477328975.121683] 00000000 00000000 00000000 00000000
[1477328975.121683] 00000000 00000000 00000000 00000000
[1477328975.121683] 00000000 08007806 25000033 000841d0
[1477328975.129683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10915): dump error cqe
[1477328975.129683] 00000000 00000000 00000000 00000000
[1477328975.129683] 00000000 00000000 00000000 00000000
[1477328975.129683] 00000000 00000000 00000000 00000000
[1477328975.129683] 00000000 08007806 2500002e 00085cd0
[1477328975.133683] mlx5_warn:mlx5_0:dump_cqe:257:(pid 10907): dump error cqe
[1477328975.133683] 00000000 00000000 00000000 00000000
[1477328975.133683] 00000000 00000000 00000000 00000000
[1477328975.133683] 00000000 00000000 00000000 00000000
[1477328975.133683] 00000000 08007806 25000034 000846d0
[1477328975.205682] 00000000 00000000 00000000 00000000
[1477328975.281682] 00000000 00000000 00000000 00000000
[1477328975.305681] 00000000 00000000 00000000 00000000
[1477328975.305681] 00000000 08007806 2500002d 000b57d0
|
| Comments |
| Comment by James A Simmons [ 24/Oct/16 ] |
|
Do you have map_on_demand enabled? We saw this on one of our systems as well, and when we disabled map_on_demand it went away. This issue only shows up on specific systems. If you look at your logs you will see IB_BIND_ERRORS (or something along that line). See the related ticket. |
| Comment by Peter Jones [ 24/Oct/16 ] |
|
Doug, anything to add here? Peter |
| Comment by Doug Oucharek (Inactive) [ 24/Oct/16 ] |
|
Mahmoud: can you confirm if IB_BIND_ERRORS are present in your logs? |
| Comment by Mahmoud Hanafi [ 24/Oct/16 ] |
|
I don't see IB_BIND_ERRORS, but we are using map_on_demand. Our cluster-wide setting is:

options ko2iblnd timeout=150 retry_count=7 map_on_demand=32 peer_credits=63 concurrent_sends=63

Can we keep peer_credits and concurrent_sends at 63 with map_on_demand disabled? |
| Comment by James A Simmons [ 24/Oct/16 ] |
|
Do you see the following in your logs?

[ 170.597651] LNet: 8714:0:(o2iblnd_cb.c:3433:kiblnd_complete()) FastReg failed: 6

That error code 6 is the ib_wc_status value IB_WC_MW_BIND_ERR. I found in my testing that you only need map_on_demand if you have hardware using the mlx4 driver talking to |
| Comment by Mahmoud Hanafi [ 24/Oct/16 ] |
|
If I set map_on_demand=0 then I get this error:

elrtr1 login: [1477346024.576963] LNet: 81:0:(o2iblnd_cb.c:2367:kiblnd_passive_connect()) Can't accept conn from 10.151.27.22@o2ib (version 12): max_frags 32 incompatible without FMR pool (256 wanted)

If I set map_on_demand=256 on the mlx5 host I get:

# mlx4 config
options ko2iblnd timeout=150 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=63 concurrent_sends=63
# mlx5 config
options ko2iblnd timeout=150 retry_count=7 map_on_demand=256 peer_credits=63 concurrent_sends=63

[1477346171.287627] mlx5_warn:mlx5_0:calc_sq_size:516:(pid 81): wqe count(65536) exceeds limits(32768)
[1477346171.315626] mlx5_warn:mlx5_0:create_kernel_qp:1078:(pid 81): err -12
[1477346171.335626] mlx5_warn:mlx5_0:create_qp_common:1823:(pid 81): err -12
[1477346171.359626] mlx5_warn:mlx5_0:__create_qp:2267:(pid 81): create_qp_common failed
[1477346171.379626] mlx5_warn:mlx5_0:calc_sq_size:516:(pid 81): wqe count(65536) exceeds limits(32768)
[1477346171.407626] mlx5_warn:mlx5_0:create_kernel_qp:1078:(pid 81): err -12
[1477346171.423625] mlx5_warn:mlx5_0:create_qp_common:1823:(pid 81): err -12
[1477346171.447625] mlx5_warn:mlx5_0:__create_qp:2267:(pid 81): create_qp_common failed
[1477346171.479625] mlx5_warn:mlx5_0:dump_cqe:257:(pid 13624): dump error cqe
[1477346171.495625] LustreError: 13730:0:(brw_test.c:362:brw_server_rpc_done()) Bulk transfer from 12345-10.151.27.56@o2ib has failed: -5
[1477346171.495625] LustreError: 13735:0:(brw_test.c:388:brw_bulk_ready()) BRW bulk READ failed for RPC from 12345-10.151.27.56@o2ib: -5
[1477346171.567624] LustreError: 13730:0:(brw_test.c:362:brw_server_rpc_done()) Skipped 1 previous similar message
|
| Comment by Mahmoud Hanafi [ 24/Oct/16 ] |
|
This configuration eliminated the LNet errors, but we are still getting 'dump error cqe':

# mlx4 config
options ko2iblnd timeout=150 retry_count=7 peer_timeout=0 map_on_demand=32 peer_credits=16 concurrent_sends=16
# mlx5 config
options ko2iblnd timeout=150 retry_count=7 map_on_demand=256 peer_credits=16 concurrent_sends=16

ERRORS:
[1477347536.651112] mlx5_warn:mlx5_0:dump_cqe:257:(pid 14273): dump error cqe
[1477347536.671112] 00000000 00000000 00000000 00000000
[1477347536.671112] 00000000 00000000 00000000 00000000
[1477347536.671112] 00000000 00000000 00000000 00000000
[1477347536.671112] 00000000 08007806 2500005c 000e3cd0
[1477347536.671112] LustreError: 14383:0:(brw_test.c:362:brw_server_rpc_done()) Bulk transfer from 12345-10.151.27.56@o2ib has failed: -5
[1477347536.671112] LustreError: 14381:0:(brw_test.c:388:brw_bulk_ready()) BRW bulk READ failed for RPC from 12345-10.151.27.56@o2ib: -5
[1477347536.743111] LustreError: 14383:0:(brw_test.c:362:brw_server_rpc_done()) Skipped 1 previous similar message
|
| Comment by James A Simmons [ 24/Oct/16 ] |
|
Why map_on_demand=32? That maps only 32 * 4K = 128K on x86 platforms. We use map_on_demand=16 on Power8 since the page size there is 64K, so 1 MB divided by 64K pages equals 16 fragments for map_on_demand. For x86 platforms the optimum is map_on_demand=256, since 1 MB divided by 4K pages equals 256. |
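As a concrete illustration of that sizing rule, a modprobe fragment might look like the sketch below. The file path is illustrative, and the other option values are simply the ones quoted earlier in this ticket, not a recommendation from this comment.

```
# /etc/modprobe.d/ko2iblnd.conf (illustrative)
# map_on_demand = largest bulk RPC / page size
#   x86    (4K pages):  1 MiB / 4 KiB  = 256
#   Power8 (64K pages): 1 MiB / 64 KiB = 16
options ko2iblnd timeout=150 retry_count=7 map_on_demand=256 peer_credits=63 concurrent_sends=63
```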
| Comment by Mahmoud Hanafi [ 25/Oct/16 ] |
|
The 32 was a recommendation from a previous LU. |
| Comment by Mahmoud Hanafi [ 25/Oct/16 ] |
|
I tried turning off map_on_demand on both mlx4 and mlx5 and that worked. So we need to understand why map_on_demand causes the errors. |
| Comment by Jay Lan (Inactive) [ 25/Oct/16 ] |
|
James A. Simmons wrote: We do have a mix of mlx4 and mlx5 hardware in our fabrics, don't we, Mahmoud? |
| Comment by James A Simmons [ 25/Oct/16 ] |
|
Also, if you disable map_on_demand you lose a good chunk of performance. |
| Comment by Doug Oucharek (Inactive) [ 25/Oct/16 ] |
|
I thought that map_on_demand with Mellanox hardware only really helps in cases where you have a high-latency network. Under normal conditions, the difference should be minimal. Are you having a different experience, James? |
| Comment by Mahmoud Hanafi [ 28/Oct/16 ] |
|
Setting map_on_demand to anything, even 256, causes this error, so it must point to creating pools with FMR/FastReg.

[1477674572.063221] Lustre: Lustre: Build Version: 2.7.2-3.2nasC_mofed32v3f
[1477674572.123220] LNet: Using FastReg for registration
[1477674572.659216] LNet: Using FastReg for registration
[1477674572.659216] LNet: Skipped 5 previous similar messages
[1477674572.915214] LNet: Added LNI 10.151.25.168@o2ib [16/1024/0/180]
[1477674573.807206] LNet: Added LNI 10.150.25.168@o2ib233 [16/1024/0/180]
[1477674632.270692] mlx5_warn:mlx5_0:dump_cqe:257:(pid 9951): dump error cqe
[1477674632.286692] 00000000 00000000 00000000 00000000
[1477674632.286692] 00000000 00000000 00000000 00000000
[1477674632.286692] 00000000 00000000 00000000 00000000
[1477674632.286692] 00000000 08007806 2500002d 000d4fd2
[1477674632.290692] LustreError: 10088:0:(brw_test.c:362:brw_server_rpc_done()) Bulk transfer from 12345-10.151.27.56@o2ib has failed: -5
[1477674632.290692] LustreError: 10083:0:(brw_test.c:388:brw_bulk_ready()) BRW bulk READ failed for RPC from 12345-10.151.27.56@o2ib: -5
[1477674632.362691] LustreError: 10088:0:(brw_test.c:362:brw_server_rpc_done()) Skipped 1 previous similar message |
| Comment by Doug Oucharek (Inactive) [ 28/Oct/16 ] |
|
Mahmoud: do you have neterror turned on (going to /var/log/messages)? I'm wondering if there is some other feedback we are getting from the lower drivers. |
| Comment by Mahmoud Hanafi [ 28/Oct/16 ] |
|
Attaching +net debug output for mlx5 and mlx4 hosts. There are no other errors in /var/log/messages or console. |
| Comment by Doug Oucharek (Inactive) [ 28/Oct/16 ] |
|
Did this also include +neterror (or do you have those on by default)? Neterror is not included by default yet. I have a patch to change this but it is not getting much traction. |
| Comment by Mahmoud Hanafi [ 31/Oct/16 ] |
|
Yes, the debug output included +neterror. |
| Comment by Mahmoud Hanafi [ 31/Oct/16 ] |
|
The above tests were all done with a Connect-IB card. We tried an EDR card and got one piece of additional info. Just before the cqe error we get this:

[1477940064.283784] LNetError: 11647:0:(o2iblnd_cb.c:1082:kiblnd_init_rdma()) RDMA is too large for peer 10.151.27.56@o2ib (131072), src size: 1048576 dst size: 1048576
[1477940064.327784] LNetError: 11647:0:(o2iblnd_cb.c:434:kiblnd_handle_rx()) Can't setup rdma for PUT to 10.151.27.56@o2ib: -90
[1477940064.363784] mlx5_warn:mlx5_1:dump_cqe:257:(pid 11647): dump error cqe
[1477940064.383784] 00000000 00000000 00000000 00000000
[1477940064.383784] 00000000 00000000 00000000 00000000
[1477940064.383784] 00000000 00000000 00000000 00000000
[1477940064.383784] 00000000 08007806 2500002e 026fc4d2
[1477940064.383784] LustreError: 11597:0:(brw_test.c:362:brw_server_rpc_done()) Bulk transfer from 12345-10.151.27.56@o2ib has failed: -5
[1477940064.383784] LustreError: 11598:0:(brw_test.c:388:brw_bulk_ready()) BRW bulk READ failed for RPC from 12345-10.151.27.56@o2ib: -5
[1477940064.455783] LustreError: 11597:0:(brw_test.c:362:brw_server_rpc_done()) Skipped 1 previous similar message
|
| Comment by Doug Oucharek (Inactive) [ 31/Oct/16 ] |
|
I can't comment on the dump_cqe message and what it means, but I do understand what is causing the "RDMA is too large for peer" message and its breakage. This was caused by the patch associated with a related ticket; I suspect that NASA has picked up that patch. Hopefully, the dump_cqe message is associated with this and will go away once this is fixed. |
| Comment by James A Simmons [ 31/Oct/16 ] |
|
Only if they applied the patch. The patch for |
| Comment by Jay Lan (Inactive) [ 31/Oct/16 ] |
|
I did pick up the patch. However, I picked up |
| Comment by Doug Oucharek (Inactive) [ 31/Oct/16 ] |
|
So, I assume that means the dump_cqe message happens without that patch.

I'm going to have to go on the Mellanox community board to ask about this mlx5-specific failure. We do not have a support contract with Mellanox and I doubt they will help fix this failure without it. To my knowledge, there is nothing we are doing in ko2iblnd which would break the MLX code, "assuming" they did proper backwards compatibility between mlx4 and mlx5. We have never seen such an error or problem with mlx4, so I have to assume they did something in mlx5 to change things.

If NASA has a support contract with Mellanox, perhaps you can raise a ticket with them to get their input on the potential causes of these error messages. That would help us to determine if there is something we need to change in ko2iblnd. |
| Comment by Mahmoud Hanafi [ 01/Nov/16 ] |
|
We have opened a case with Mellanox and sent them some debugging. Waiting to hear back, but they did say the message is related to Fast Memory Registration Mode. |
| Comment by Doug Oucharek (Inactive) [ 01/Nov/16 ] |
|
Thank you Mahmoud. I have suspected the Fast Memory Registration code could be the culprit here. We had to add support for that over the older FMR code when using newer MLX cards which only support Fast Memory. As such, that code is fairly new and can have an issue. It will be great if Mellanox can point us specifically at where we are making a mistake so we can address it. |
| Comment by Mahmoud Hanafi [ 19/Nov/16 ] |
|
After the host is rebooted, the error can be reproduced when reading from mlx4 to mlx5:

# test1
lst add_test --batch bulk_rw --from mlx4_host --to mlx5_host brw read size=1M check=full

But this will work:

# test2
lst add_test --batch bulk_rw --from mlx4_host --to mlx5_host brw write size=1M check=full

What's interesting is that once the write test is run, the read will start to work. |
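For context, a complete lst session around those two commands might look like the sketch below. The group and batch names match the commands above; the session name and NIDs are illustrative, not from this ticket.

```shell
# illustrative LNet selftest session (NIDs are examples)
export LST_SESSION=$$
lst new_session rw_test
lst add_group mlx4_host 10.151.27.56@o2ib
lst add_group mlx5_host 10.151.25.168@o2ib
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from mlx4_host --to mlx5_host brw read size=1M check=full
lst run bulk_rw
lst stat mlx4_host mlx5_host
lst end_session
```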
| Comment by Doug Oucharek (Inactive) [ 23/Nov/16 ] |
|
I've been able to get mixed mlx4 and mlx5 cards in the same cluster. Using your test examples, I have been able to reproduce this issue, getting the same results as you. I verified that the issue only occurs when map_on_demand is non-zero. I have reproduced the issue with both upstream OFED and MOFED (latest versions of both). I have also reproduced it with RHEL 6.8 and 7.3. No solution yet, but at least I have a way to investigate this now. |
| Comment by Mahmoud Hanafi [ 23/Nov/16 ] |
|
Mellanox has also been able to reproduce this issue in their lab and are looking at it. |
| Comment by Doug Oucharek (Inactive) [ 23/Nov/16 ] |
|
Another interesting data point: I flipped the from and to parameters in your test making it "--from mlx5_host --to mlx4_host". In theory, the issue should happen when I then do a write as this is almost the same thing as what you have. However, in this case, I am not able to reproduce the problem with either read or write. I believe the difference when you flip the from and to parameters is who initiates the test. |
| Comment by Mahmoud Hanafi [ 01/Dec/16 ] |
|
Here is an update from Mellanox engineering: "The failure is happening because fast reg MR is called twice; the second time, the mkey is not free, but the work request is set to check that it is free, meaning the operation will succeed only if the mkey is free. In order to issue the second fast reg MR, the customer needs to do a local invalidate before that call. (These operations are explained in the IB spec.)" |
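Schematically, the sequence Mellanox describes looks like the kernel-verbs sketch below: before reusing a FastReg MR whose mkey is still "in use", post an IB_WR_LOCAL_INV for the old rkey and chain the IB_WR_REG_MR behind it. This is a minimal illustration, not the actual LU-8752 patch: the function name map_frmr() and the key_in_use flag are invented for the example, and error handling is reduced to the essentials.

```c
#include <rdma/ib_verbs.h>

static int map_frmr(struct ib_qp *qp, struct ib_mr *mr,
		    struct scatterlist *sg, int nents, bool key_in_use)
{
	struct ib_send_wr inv_wr = {};
	struct ib_reg_wr reg_wr = {};
	struct ib_send_wr *first = &reg_wr.wr;
	struct ib_send_wr *bad_wr;
	int n;

	/* Map the scatterlist into the MR's page list. */
	n = ib_map_mr_sg(mr, sg, nents, NULL, PAGE_SIZE);
	if (n < 0 || n < nents)
		return n < 0 ? n : -EINVAL;

	if (key_in_use) {
		/* The previous registration left the mkey "in use":
		 * return it to the Free state first (per the IB spec),
		 * otherwise the HCA rejects the re-registration and
		 * dumps an error CQE. */
		inv_wr.opcode		  = IB_WR_LOCAL_INV;
		inv_wr.ex.invalidate_rkey = mr->rkey;
		inv_wr.next		  = &reg_wr.wr;
		first = &inv_wr;
	}

	/* Advance to a fresh key and re-register the mapping. */
	ib_update_fast_reg_key(mr, ib_inc_rkey(mr->rkey));

	reg_wr.wr.opcode = IB_WR_REG_MR;
	reg_wr.mr	 = mr;
	reg_wr.key	 = mr->rkey;
	reg_wr.access	 = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;

	return ib_post_send(qp, first, &bad_wr);
}
```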
| Comment by Doug Oucharek (Inactive) [ 01/Dec/16 ] |
|
Thank you for that feedback. I will dig into the code for the path where we are making this mistake. |
| Comment by Gerrit Updater [ 06/Dec/16 ] |
|
Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: https://review.whamcloud.com/24162 |
| Comment by Gerrit Updater [ 12/Dec/16 ] |
|
Doug Oucharek (doug.s.oucharek@intel.com) uploaded a new patch: https://review.whamcloud.com/24306 |
| Comment by Doug Oucharek (Inactive) [ 13/Dec/16 ] |
|
I have submitted a fix to this issue above (ignore the earlier Gerrit patch as it was just for my debugging). I have validated it on our mlx4 / mlx5 mixed cluster. I still need to validate it on OmniPath to ensure it does not cause a problem there. |
| Comment by Mahmoud Hanafi [ 13/Dec/16 ] |
|
Thanks, we will build and test it.
|
| Comment by Doug Oucharek (Inactive) [ 13/Dec/16 ] |
|
OmniPath is not affected by this patch as it uses FMR and not FastReg, so this change should only affect MLX5-based cards. Do you need me to push a 2.7FE patch? |
| Comment by Nathan Dauchy (Inactive) [ 13/Dec/16 ] |
|
Yes, hopefully it is trivial, but we will want to use the 2.7fe patch. Thanks! |
| Comment by Doug Oucharek (Inactive) [ 14/Dec/16 ] |
|
The 2.7FE patch is: https://review.whamcloud.com/24336 |
| Comment by Mahmoud Hanafi [ 19/Dec/16 ] |
|
Here is some feedback from Mellanox: "According to the IB spec, the state should be set to Free. Under the "MEMORY REGION TYPES" section there is an explanation of states, specifically: " |
| Comment by Jay Lan (Inactive) [ 27/Dec/16 ] |
|
Do we need this patch for Lustre 2.9 release? |
| Comment by Peter Jones [ 28/Dec/16 ] |
|
Jay, are you running with 2.9.x? I had thought that you were using 2.7.x... Peter |
| Comment by James A Simmons [ 28/Dec/16 ] |
|
ORNL needs it for the 2.8 release. |
| Comment by Jian Yu [ 28/Dec/16 ] |
|
Hi James, |
| Comment by Jay Lan (Inactive) [ 28/Dec/16 ] |
|
Hi Peter,
|
| Comment by Doug Oucharek (Inactive) [ 03/Jan/17 ] |
|
Mahmoud: I suspect there is a bug in the Mellanox MLX5 driver somewhere. My debugging showed that the first time we use the L-key/R-key for an RDMA operation (reading from MLX5), it is being rejected as invalid. I spent a lot of time tracing the code to see if we invoked anything which could have changed the state of the R-key/L-key and found nothing. That is why I came up with the solution of just invalidating the first key we set and advancing to the next one. I have not found an issue with invalidating a free key, so I am hoping this fix will not cause any problems down the road (Dmitry and Amir's concerns in the code review). Doug |
| Comment by Gerrit Updater [ 18/Jan/17 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/24306/ |
| Comment by Peter Jones [ 18/Jan/17 ] |
|
Landed for 2.10 |