  1. Lustre
  2. LU-16349

Excessive number of OPA disconnects / LNET network errors in cluster


Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.16.0, Lustre 2.15.3
    • None
    • None
    • Server: Lustre 2.12.6 or 2.12.9 on official CentOS 7 Lustre kernels for the respective versions
      Client: CentOS 8 with various kernels and various Lustre 2.12.x versions (including 2.12.8 and 2.12.9)
    • 3

    Description

      We see Lustre becoming massively unresponsive on many clients. Sometimes there are temporary stalls (several minutes) that eventually clear; sometimes only rebooting the client helps.

      We suspected OPA first, but couldn't find any problems with RDMA when used otherwise (e.g. MPI).

      The problem has been ongoing for a long time and is completely mysterious to us.

      When the issue occurs, kernel messages like those shown below typically appear.

      OPA counters do not show any errors and, according to the customer, there are no network problems for compute traffic (MPI, etc.).

      It doesn't seem to make a meaningful difference which versions of Lustre 2.12.x are installed on servers/clients. Apparently the fix from LU-14733 is not sufficient.

      What can we do to resolve the problem?

      Server:

      [3298078.549239] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff9114021dd800
      [3298078.560918] LustreError: 155816:0:(ldlm_lib.c:3338:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@ffff9115bbdb2050 x1739338816146624/t0(0) o4->2f66151f-7d0d-7f3c-dee4-35be6a0f2efc@10.4.16.11@o2ib1:690/0 lens 488/448 e 0 to 0 dl 1664192075 ref 1 fl Interpret:/2/0 rc 0/0
      [3298078.562601] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9111f8549800
      ...
      [3298079.099509] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff911704472800
      [3298079.801646] LNetError: 24838:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.4.16.11@o2ib1: -125
      [3298079.814642] LNetError: 24838:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Skipped 68 previous similar messages
      [3298079.826073] LustreError: 24838:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff91169debe400
      ...
      [3298166.354998] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff91152a79d400
      [3298166.366511] LustreError: 156019:0:(ldlm_lib.c:3344:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff91154f9b4850 x1739338816563968/t0(0) o4->2f66151f-7d0d-7f3c-dee4-35be6a0f2efc@10.4.16.11@o2ib1:23/0 lens 488/448 e 0 to 0 dl 1664192163 ref 1 fl Interpret:/0/0 rc 0/0
      [3298166.392860] LustreError: 156019:0:(ldlm_lib.c:3344:target_bulk_io()) Skipped 286 previous similar messages
      [3298166.411524] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff911524fb7c00
      [3298166.422885] LustreError: 24827:0:(events.c:450:server_bulk_callback()) event type 3, status -5, desc ffff9115a243b400
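
For readers triaging similar logs, the negative status values in the bulk callback lines are kernel errno codes. The following is a minimal sketch, not part of the original report, that tallies them per callback and attaches the Linux errno name. The regular expression, the function name `tally_bulk_status`, and the hardcoded errno table are assumptions made for illustration; the sample lines are copied from the excerpts in this ticket.

```python
import re
from collections import Counter

# Linux errno names for the codes seen in this ticket. Hardcoded (an
# assumption for illustration) so the sketch does not depend on the
# platform it runs on.
ERRNO_NAMES = {
    5: "EIO",
    11: "EAGAIN",
    22: "EINVAL",
    103: "ECONNABORTED",
    125: "ECANCELED",
}

# Matches e.g. "(events.c:450:server_bulk_callback()) event type 3, status -103,"
CALLBACK_RE = re.compile(r"\(\S+:\d+:(\w+)\(\)\) event type (\d+), status (-?\d+)")


def tally_bulk_status(lines):
    """Count (callback, status, errno name) triples in kernel log lines."""
    counts = Counter()
    for line in lines:
        match = CALLBACK_RE.search(line)
        if match is None:
            continue  # not a bulk callback line
        func = match.group(1)
        status = int(match.group(3))
        name = ERRNO_NAMES.get(-status, "unknown")
        counts[(func, status, name)] += 1
    return counts


# Sample lines copied from the server and client excerpts in this ticket.
SAMPLE = [
    "[3298078.549239] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff9114021dd800",
    "[3298078.562601] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9111f8549800",
    "[3298079.826073] LustreError: 24838:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff91169debe400",
    "[3298166.422885] LustreError: 24827:0:(events.c:450:server_bulk_callback()) event type 3, status -5, desc ffff9115a243b400",
    "[5453641.210037] LustreError: 2380:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 00000000292896ee",
]

if __name__ == "__main__":
    for (func, status, name), n in sorted(tally_bulk_status(SAMPLE).items()):
        print(f"{func}: status {status} ({name}) x{n}")
```

The same function can be fed a full `dmesg` capture line by line; lines that are not bulk callback messages are skipped.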
      

      Client:

      [5453641.210037] LustreError: 2380:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 00000000292896ee
      [5454253.579062] Lustre: 2475:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1664191251/real 1664191251]  req@00000000d70c7694 x1739323570965888/t0(0) o3->work-OST0004-osc-ffff888108388000@10.4.104.104@o2ib1:6/4 lens 488/4536 e 0 to 1 dl 1664191352 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
      [5454253.608388] Lustre: 2475:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 38 previous similar messages
      [5454253.618300] Lustre: work-OST0004-osc-ffff888108388000: Connection to work-OST0004 (at 10.4.104.104@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      [5454253.635574] Lustre: Skipped 36 previous similar messages
      [5454253.641478] Lustre: work-OST0004-osc-ffff888108388000: Connection restored to 10.4.104.104@o2ib1 (at 10.4.104.104@o2ib1)
      [5454253.652508] Lustre: Skipped 37 previous similar messages
      [5454253.676598] LNetError: 2379:0:(o2iblnd_cb.c:1034:kiblnd_post_tx_locked()) Error -22 posting transmit to 10.4.104.104@o2ib1
      [5454253.687807] LNetError: 2379:0:(o2iblnd_cb.c:1034:kiblnd_post_tx_locked()) Skipped 25 previous similar messages
      [5454559.560649] LustreError: 2379:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 000000000f1e9a15
      [5454559.587903] LustreError: 2381:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 0000000068d1ba49
      [5454559.599428] LustreError: 2378:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 0000000068d1ba49
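
The `ptlrpc_expire_one_request()` line above carries Unix epoch timestamps, which can help correlate client timeouts with server-side events. The following is a small sketch, not part of the original report, that extracts the send time and the window until the `dl` value. It assumes that `dl` is the request deadline in epoch seconds; the function name `request_window` and both regular expressions are hypothetical.

```python
import re
from datetime import datetime, timezone

# "[sent 1664191251/real 1664191251]" as printed by ptlrpc_expire_one_request()
SENT_RE = re.compile(r"\[sent (\d+)/real (\d+)\]")
# " dl 1664191352 " is assumed here to be the request deadline (epoch seconds)
DEADLINE_RE = re.compile(r" dl (\d+) ")


def request_window(line):
    """Return (send time in UTC, seconds from send to deadline), or None."""
    sent = SENT_RE.search(line)
    deadline = DEADLINE_RE.search(line)
    if sent is None or deadline is None:
        return None
    sent_s = int(sent.group(1))
    deadline_s = int(deadline.group(1))
    return datetime.fromtimestamp(sent_s, tz=timezone.utc), deadline_s - sent_s


# Line copied from the client excerpt in this ticket.
CLIENT_LINE = (
    "[5454253.579062] Lustre: 2475:0:(client.c:2169:ptlrpc_expire_one_request()) "
    "@@@ Request sent has timed out for slow reply: [sent 1664191251/real 1664191251]  "
    "req@00000000d70c7694 x1739323570965888/t0(0) "
    "o3->work-OST0004-osc-ffff888108388000@10.4.104.104@o2ib1:6/4 "
    "lens 488/4536 e 0 to 1 dl 1664191352 ref 2 fl Rpc:X/2/ffffffff rc -11/-1"
)

if __name__ == "__main__":
    sent_at, window = request_window(CLIENT_LINE)
    print(f"sent at {sent_at.isoformat()}, deadline {window} s later")
```

For the client line quoted in this ticket, the difference between `dl` and `sent` is 101 seconds.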
      

       

      Attachments

        1. minimal-fix.patch.gz
          2 kB
          Dean Luick
        2. no-post.patch.gz
          0.7 kB
          Dean Luick
        3. o2iblnd-debug.tar.gz
          2 kB
          Dean Luick
        4. testing-patches-230119.patch
          29 kB
          Oliver Mangold

            People

              cbordage Cyril Bordage
              omangold Oliver Mangold
              Votes: 0
              Watchers: 12
