[LU-8485] workqueue overflows with mlx5 on power8 platforms. Created: 08/Aug/16 Updated: 12/Aug/16 Resolved: 12/Aug/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0, Lustre 2.9.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | James A Simmons | Assignee: | Doug Oucharek (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Power8 client nodes running RHEL7.2 with Mellanox OFED 3.2-1.04 |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
In my testing on the Power8 platform, I intermittently see the following error on the clients, after which Lustre becomes unusable: [ 3499.198051] mlx5_warn:mlx5_0:begin_wqe:4013:(pid 7712): work queue overflow |
| Comments |
| Comment by Peter Jones [ 08/Aug/16 ] |
|
Doug is looking into this |
| Comment by Doug Oucharek (Inactive) [ 08/Aug/16 ] |
|
James, is this failure with or without your patch: http://review.whamcloud.com/21304/? |
| Comment by James A Simmons [ 08/Aug/16 ] |
|
With the patch, since without patch 21304 ko2iblnd doesn't work on Power8 platforms. A small bug exists in the patch that I submitted, but I have a version locally that appears to work. |
| Comment by Doug Oucharek (Inactive) [ 08/Aug/16 ] |
|
Are NETERRORS turned on? I'm curious to see whether o2iblnd has any messages that could help us. |
| Comment by James A Simmons [ 09/Aug/16 ] |
|
Yes, they are on. It will take me some time to get any lctl debug logs since this problem happens randomly. |
| Comment by James A Simmons [ 09/Aug/16 ] |
|
Here is an lctl dump from my Power8 client nodes. On the server side we are using standard x86_64 platforms, which is why we are having issues. |
| Comment by Doug Oucharek (Inactive) [ 11/Aug/16 ] |
|
Please see my comments on http://review.whamcloud.com/21304/. A solution to this ticket may come about when addressing my comments. |
| Comment by James A Simmons [ 12/Aug/16 ] |
|
I updated the patch and the problem still exists. I will push a new version of the 21304 patch. |
| Comment by Doug Oucharek (Inactive) [ 12/Aug/16 ] |
|
After a debugging session, we seem to have tracked down a few problems with the 21304 patch and are close to having a version that works without problems (i.e. overflows). As such, I'm going to mark this ticket resolved and leave the final changes to 21304 in its ticket. |
| Comment by James A Simmons [ 12/Aug/16 ] |
|
Worked with Doug to track down the issue reported in this ticket. The main problem was that the IBLND_SEND_WRS macro in o2iblnd did not create a deep enough queue. It used the local frag size (16), but it needs to assume the worst case of working with an external node (256), so the queue ended up too small and was easily overrun. The latest patch http://review.whamcloud.com/21304 should address these problems. Even with the fixes in the 21304 patch, the queues were still too small. The solution to that was to reduce concurrent_sends from 63 down to 31, after which Lustre functioned in all my test cases. |
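
To make the sizing argument above concrete, here is a rough sketch of how the send queue depth could scale with the fragment count and concurrent_sends. The formula in send_wrs() is an assumption made for illustration (one work request per fragment plus one for the message, per in-flight send); it is not copied from the o2iblnd source, and the function and constant names are hypothetical. Only the values 16, 256, 63 and 31 come from this ticket.

```python
# Illustrative model only: the sizing rule below is an assumption,
# not the actual IBLND_SEND_WRS definition.

LOCAL_FRAGS = 16         # local frag size mentioned in the ticket
WORST_CASE_FRAGS = 256   # worst case when working with an external node


def send_wrs(max_frags: int, concurrent_sends: int) -> int:
    """Assumed rule: (fragments + 1) work requests per in-flight send."""
    return (max_frags + 1) * concurrent_sends


def overflows(queue_depth: int, frags_per_send: int, concurrent_sends: int) -> bool:
    """True when worst-case demand exceeds the allocated queue depth."""
    return send_wrs(frags_per_send, concurrent_sends) > queue_depth


if __name__ == "__main__":
    # Queue sized from the local frag count, then driven by a peer
    # that uses the worst-case frag count.
    undersized = send_wrs(LOCAL_FRAGS, 63)
    print("depth sized for 16 frags:", undersized)
    print("overrun by 256-frag sends:", overflows(undersized, WORST_CASE_FRAGS, 63))

    # Halving concurrent_sends roughly halves the worst-case depth needed.
    for cs in (63, 31):
        print("concurrent_sends", cs, "needs depth", send_wrs(WORST_CASE_FRAGS, cs))
```

Under this assumed rule, a queue sized for 16 fragments is far shallower than what 256-fragment sends can demand, and lowering concurrent_sends from 63 to 31 cuts the worst-case requirement roughly in half, which is consistent with the behaviour described above.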