Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- None
- Lustre 2.8.0, Lustre 2.9.0
- None
- Environment: Power8 client nodes running RHEL7.2 with Mellanox OFED 3.2-1.04
- 3
- 9223372036854775807
Description
During my testing on the Power8 platform, I intermittently see the following errors on the clients, after which Lustre becomes unusable.
[ 3499.198051] mlx5_warn:mlx5_0:begin_wqe:4013:(pid 7712): work queue overflow
[ 3499.198176] mlx5_warn:mlx5_0:mlx5_ib_post_send:4112:(pid 7712): Failed to prepare WQE
[ 3499.198209] mlx5_warn:mlx5_0:begin_wqe:4013:(pid 7715): work queue overflow
[ 3499.198240] LustreError: 7712:0:(events.c:203:client_bulk_callback()) event type 1, status -12, desc c000001772778c00
[ 3499.198428] mlx5_warn:mlx5_0:mlx5_ib_post_send:4112:(pid 7715): Failed to prepare WQE
[ 3499.198527] LustreError: 7715:0:(events.c:203:client_bulk_callback()) event type 1, status -12, desc c000000788600c00
[ 3499.199804] LustreError: 7713:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000000e27e06800
[ 3499.199928] LustreError: 7714:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000000788602200
[ 3499.200740] LustreError: 7712:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c00000077cec7400
[ 3499.201667] LustreError: 7715:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c00000039da2f400
[ 3499.202216] LustreError: 7715:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000000780129c00
[ 3499.202422] LustreError: 7713:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000000e270c3000
[ 3499.202642] LustreError: 7715:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000001b98441800
[ 3499.202864] LustreError: 7712:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000000c6d9fd600
[ 3499.203091] LustreError: 7714:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000000dd0309200
[ 3499.203942] LustreError: 7713:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000000e27e06200
[ 3499.558222] LNet: 7659:0:(o2iblnd_cb.c:1360:kiblnd_reconnect_peer()) Abort reconnection of 10.37.248.77@o2ib1: connected
[ 3499.558317] LNet: 7659:0:(o2iblnd_cb.c:1360:kiblnd_reconnect_peer()) Skipped 4 previous similar messages
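The status values in the client_bulk_callback() lines look like negated Linux errno numbers. Under that assumption, -12 corresponds to ENOMEM and -5 to EIO, which would fit a send failing once the work queue is full. A minimal sketch for tallying them from a saved log follows; the helper name `tally_statuses` and the regular expression are illustrative, not part of Lustre.

```python
import errno
import re
from collections import Counter

# Two sample lines copied from the log above.
LOG = """\
[ 3499.198240] LustreError: 7712:0:(events.c:203:client_bulk_callback()) event type 1, status -12, desc c000001772778c00
[ 3499.199804] LustreError: 7713:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000000e27e06800
"""

# Hypothetical pattern matching the message layout seen in this ticket.
STATUS_RE = re.compile(
    r"client_bulk_callback\(\)\) event type \d+, status (-?\d+)"
)


def tally_statuses(text):
    """Count bulk-callback failures by errno name.

    Assumes the status field is a negated errno value.
    """
    counts = Counter()
    for match in STATUS_RE.finditer(text):
        code = int(match.group(1))
        # Map the absolute value to a symbolic name such as ENOMEM.
        counts[errno.errorcode.get(abs(code), "UNKNOWN")] += 1
    return counts


if __name__ == "__main__":
    for name, count in tally_statuses(LOG).items():
        print(name, count)
```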
Issue Links
- is related to LU-6387 Add Power8 support to Lustre (Resolved)
Worked with Doug to track down the issue reported in this ticket. The main problem was that the IBLND_SEND_WRS macro in o2iblnd did not create a deep enough send queue. It sized the queue using the local frag size (16), but it needs to assume the worst case of working with an external node (256), so the queue ended up too small and was easily overrun. The latest patch, http://review.whamcloud.com/21304, should address these problems.
Besides the problems fixed in the 21304 patch, the queues were still too small. The solution was to reduce concurrent_sends from 63 down to 31, after which Lustre functioned in all my test cases.
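To illustrate why the frag size and concurrent_sends both matter, here is a rough sizing sketch. It assumes the send queue needs one work request per fragment plus one for the final message, for every in-flight send. That formula is an assumption for illustration only, not quoted from this ticket or from the patch, and `send_wrs` is a hypothetical helper.

```python
def send_wrs(frags, concurrent_sends):
    """Estimate send work requests needed.

    Assumption: (frags + 1) work requests per in-flight send.
    """
    return (frags + 1) * concurrent_sends


LOCAL_FRAGS = 16        # local frag size mentioned in the comment above
WORST_CASE_FRAGS = 256  # worst case when talking to an external node

# Queue depth if sized from the local frag size with 63 concurrent sends.
sized_for_local = send_wrs(LOCAL_FRAGS, 63)

# Depth needed in the worst case with 63 and with 31 concurrent sends.
needed_at_63 = send_wrs(WORST_CASE_FRAGS, 63)
needed_at_31 = send_wrs(WORST_CASE_FRAGS, 31)

if __name__ == "__main__":
    print("sized for local frags:", sized_for_local)
    print("worst case, 63 sends: ", needed_at_63)
    print("worst case, 31 sends: ", needed_at_31)
```

Under this assumption, a queue sized from the local frag size is about fifteen times shallower than the worst case requires, and halving concurrent_sends roughly halves the depth that has to be allocated.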