
workqueue overflows with mlx5 on power8 platforms.

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • None
    • Lustre 2.8.0, Lustre 2.9.0
    • None
    • Power8 client nodes running RHEL7.2 with Mellanox OFED 3.2-1.04
    • 3
    • 9223372036854775807

    Description

      In my testing on the Power8 platform, I occasionally see the following errors on the clients, after which Lustre becomes unusable.

      [ 3499.198051] mlx5_warn:mlx5_0:begin_wqe:4013:(pid 7712): work queue overflow
      [ 3499.198176] mlx5_warn:mlx5_0:mlx5_ib_post_send:4112:(pid 7712): Failed to prepare WQE
      [ 3499.198209] mlx5_warn:mlx5_0:begin_wqe:4013:(pid 7715): work queue overflow
      [ 3499.198240] LustreError: 7712:0:(events.c:203:client_bulk_callback()) event type 1, status -12, desc c000001772778c00
      [ 3499.198428] mlx5_warn:mlx5_0:mlx5_ib_post_send:4112:(pid 7715): Failed to prepare WQE
      [ 3499.198527] LustreError: 7715:0:(events.c:203:client_bulk_callback()) event type 1, status -12, desc c000000788600c00
      [ 3499.199804] LustreError: 7713:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000000e27e06800
      [ 3499.199928] LustreError: 7714:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000000788602200
      [ 3499.200740] LustreError: 7712:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c00000077cec7400
      [ 3499.201667] LustreError: 7715:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c00000039da2f400
      [ 3499.202216] LustreError: 7715:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000000780129c00
      [ 3499.202422] LustreError: 7713:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000000e270c3000
      [ 3499.202642] LustreError: 7715:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000001b98441800
      [ 3499.202864] LustreError: 7712:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000000c6d9fd600
      [ 3499.203091] LustreError: 7714:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000000dd0309200
      [ 3499.203942] LustreError: 7713:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000000e27e06200
      [ 3499.558222] LNet: 7659:0:(o2iblnd_cb.c:1360:kiblnd_reconnect_peer()) Abort reconnection of 10.37.248.77@o2ib1: connected
      [ 3499.558317] LNet: 7659:0:(o2iblnd_cb.c:1360:kiblnd_reconnect_peer()) Skipped 4 previous similar messages
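
      The status fields in the client_bulk_callback() lines above are negative Linux errno values: -12 is ENOMEM and -5 is EIO. When triaging a longer dmesg capture, a rough tally of those codes can show whether the ENOMEM failures (which here coincide with the mlx5 "work queue overflow" warnings) come first and the EIO failures follow. The following is only a minimal sketch; the function name and the sample list are hypothetical, and the regular expression assumes the exact message layout shown in this report.

```python
import errno
import re
from collections import Counter

# Assumption: matches the "status N" field of client_bulk_callback() lines
# in the layout shown in this report; other Lustre versions may differ.
STATUS_RE = re.compile(r"client_bulk_callback\(\)\).*?status (-?\d+)")


def tally_bulk_status(lines):
    """Count bulk callback status codes, keyed by (code, errno name)."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if not match:
            continue  # not a client_bulk_callback() line
        code = int(match.group(1))
        # Kernel status values are negated errno numbers.
        name = errno.errorcode.get(-code, "UNKNOWN") if code < 0 else "OK"
        counts[(code, name)] += 1
    return counts


# Hypothetical sample: three lines copied from the log above.
sample = [
    "[ 3499.198051] mlx5_warn:mlx5_0:begin_wqe:4013:(pid 7712): work queue overflow",
    "[ 3499.198240] LustreError: 7712:0:(events.c:203:client_bulk_callback()) event type 1, status -12, desc c000001772778c00",
    "[ 3499.199804] LustreError: 7713:0:(events.c:203:client_bulk_callback()) event type 1, status -5, desc c000000e27e06800",
]

if __name__ == "__main__":
    for (code, name), count in sorted(tally_bulk_status(sample).items()):
        print(f"status {code} ({name}): {count}")
```

      Feeding the full dmesg output through the same function, one line per list element, gives the per-code totals for the whole capture.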

          Activity

            [LU-8485] workqueue overflows with mlx5 on power8 platforms.
            simmonsja James A Simmons added a comment (edited)

            Here is an lctl dump from my Power8 client nodes. On the server side we are using standard x86_64 platforms, which is why we are having issues.

            simmonsja James A Simmons added a comment (edited)

            Yes, they are on. It will take me some time to get any lctl debug logs, since this problem happens randomly.

            doug Doug Oucharek (Inactive) added a comment

            Are NETERRORS turned on? I'm curious to see whether o2iblnd has any messages that could help us.

            simmonsja James A Simmons added a comment

            With, since without patch 21304 ko2iblnd doesn't work on Power8 platforms. A small bug exists in the patch that I submitted, but I have a version locally that appears to work.

            doug Doug Oucharek (Inactive) added a comment

            James, is this failure with or without your patch: http://review.whamcloud.com/21304/?

            pjones Peter Jones added a comment

            Doug is looking into this

            People

              Assignee: doug Doug Oucharek (Inactive)
              Reporter: simmonsja James A Simmons
              Votes: 0
              Watchers: 4
