Lustre / LU-16349

Excessive number of OPA disconnects / LNET network errors in cluster

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.16.0, Lustre 2.15.3
    • Environment:
      Server: Lustre 2.12.6 or 2.12.9 on official CentOS 7 Lustre kernels for the respective versions
      Client: CentOS 8 with various kernels and various Lustre 2.12.x versions (including 2.12.8 and 2.12.9)

    Description

      We see massive unresponsiveness of Lustre on many clients. Sometimes there are temporary stalls (several minutes) which eventually go away; sometimes only rebooting the client helps.

      We suspected OPA first, but couldn't find any problems with RDMA when used otherwise (e.g. MPI).

      The problem has been ongoing for a long time and is completely mysterious to us.

      Typically, when the issue occurs, kernel messages of the kind shown below appear.

      OPA counters do not show any errors, and according to the customer there are no network problems for compute traffic (MPI, etc.).

      It doesn't seem to make a meaningful difference which versions of Lustre 2.12.x are installed on servers/clients. Apparently the fix from LU-14733 is not sufficient.

      What can we do to resolve the problem?

      Server:

      [3298078.549239] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff9114021dd800
      [3298078.560918] LustreError: 155816:0:(ldlm_lib.c:3338:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@ffff9115bbdb2050 x1739338816146624/t0(0) o4->2f66151f-7d0d-7f3c-dee4-35be6a0f2efc@10.4.16.11@o2ib1:690/0 lens 488/448 e 0 to 0 dl 1664192075 ref 1 fl Interpret:/2/0 rc 0/0
      [3298078.562601] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9111f8549800
      ...
      [3298079.099509] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff911704472800
      [3298079.801646] LNetError: 24838:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.4.16.11@o2ib1: -125
      [3298079.814642] LNetError: 24838:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Skipped 68 previous similar messages
      [3298079.826073] LustreError: 24838:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff91169debe400
      ...
      [3298166.354998] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff91152a79d400
      [3298166.366511] LustreError: 156019:0:(ldlm_lib.c:3344:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff91154f9b4850 x1739338816563968/t0(0) o4->2f66151f-7d0d-7f3c-dee4-35be6a0f2efc@10.4.16.11@o2ib1:23/0 lens 488/448 e 0 to 0 dl 1664192163 ref 1 fl Interpret:/0/0 rc 0/0
      [3298166.392860] LustreError: 156019:0:(ldlm_lib.c:3344:target_bulk_io()) Skipped 286 previous similar messages
      [3298166.411524] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff911524fb7c00
      [3298166.422885] LustreError: 24827:0:(events.c:450:server_bulk_callback()) event type 3, status -5, desc ffff9115a243b400
      

      Client:

      [5453641.210037] LustreError: 2380:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 00000000292896ee
      [5454253.579062] Lustre: 2475:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1664191251/real 1664191251]  req@00000000d70c7694 x1739323570965888/t0(0) o3->work-OST0004-osc-ffff888108388000@10.4.104.104@o2ib1:6/4 lens 488/4536 e 0 to 1 dl 1664191352 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
      [5454253.608388] Lustre: 2475:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 38 previous similar messages
      [5454253.618300] Lustre: work-OST0004-osc-ffff888108388000: Connection to work-OST0004 (at 10.4.104.104@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      [5454253.635574] Lustre: Skipped 36 previous similar messages
      [5454253.641478] Lustre: work-OST0004-osc-ffff888108388000: Connection restored to 10.4.104.104@o2ib1 (at 10.4.104.104@o2ib1)
      [5454253.652508] Lustre: Skipped 37 previous similar messages
      [5454253.676598] LNetError: 2379:0:(o2iblnd_cb.c:1034:kiblnd_post_tx_locked()) Error -22 posting transmit to 10.4.104.104@o2ib1
      [5454253.687807] LNetError: 2379:0:(o2iblnd_cb.c:1034:kiblnd_post_tx_locked()) Skipped 25 previous similar messages
      [5454559.560649] LustreError: 2379:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 000000000f1e9a15
      [5454559.587903] LustreError: 2381:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 0000000068d1ba49
      [5454559.599428] LustreError: 2378:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 0000000068d1ba49
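      The negative status codes in the server_bulk_callback/client_bulk_callback lines above are negated Linux errno values. A minimal sketch of decoding the codes seen in these logs with Python's standard errno table (assuming Linux errno numbering, which matches the kernels in this ticket):

      ```python
      # Decode the "status" values from the Lustre bulk-callback messages
      # (e.g. "status -103") into errno names. The list below is the set of
      # statuses observed in the server and client logs of this ticket.
      import errno

      observed = [-103, -125, -5, -22]
      for status in observed:
          # kernel callbacks report -errno; negate to look up the name
          print(f"status {status}: {errno.errorcode[-status]}")
      # -103 -> ECONNABORTED, -125 -> ECANCELED, -5 -> EIO, -22 -> EINVAL
      ```

      So the server side is mostly seeing aborted/cancelled connections (ECONNABORTED, ECANCELED) while the client-side bulk callbacks fail with EINVAL.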
      

       

      Attachments

        1. minimal-fix.patch.gz
          2 kB
        2. no-post.patch.gz
          0.7 kB
        3. o2iblnd-debug.tar.gz
          2 kB
        4. testing-patches-230119.patch
          29 kB

        Issue Links

          Activity

            [LU-16349] Excessive number of OPA disconnects / LNET network errors in cluster

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50214/
            Subject: LU-16349 o2iblnd: Fix key mismatch issue
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: da98e1a6f31462ab76ce7c3a48c21eb4c9eda151

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50214/ Subject: LU-16349 o2iblnd: Fix key mismatch issue Project: fs/lustre-release Branch: b2_15 Current Patch Set: Commit: da98e1a6f31462ab76ce7c3a48c21eb4c9eda151

            "Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50564
            Subject: LU-16349 o2iblnd: Fix key mismatch issue
            Project: fs/lustre-release
            Branch: b2_14
            Current Patch Set: 1
            Commit: 329448005cefab9fe13da33c50a51f664a8cce66

            gerrit Gerrit Updater added a comment - "Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50564 Subject: LU-16349 o2iblnd: Fix key mismatch issue Project: fs/lustre-release Branch: b2_14 Current Patch Set: 1 Commit: 329448005cefab9fe13da33c50a51f664a8cce66

            "Xing Huang <hxing@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50214
            Subject: LU-16349 o2iblnd: Fix key mismatch issue
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 8811fa2cd9db66d4acdabf17e2cc93ceef5f6752

            gerrit Gerrit Updater added a comment - "Xing Huang <hxing@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50214 Subject: LU-16349 o2iblnd: Fix key mismatch issue Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: 8811fa2cd9db66d4acdabf17e2cc93ceef5f6752
            pjones Peter Jones added a comment -

            Fix landed for 2.16 so closing ticket. Should port fix to b2_15 for inclusion in a future 2.15.x release.


            bschaefer Benedikt Schaefer added a comment:

            Current state: ~600 clients are running the patched client.

            We still see Lustre errors on some clients, but the nodes recovered and did not get stuck.

            bschaefer Benedikt Schaefer added a comment:

            Hi,

            The patched client is currently being rolled out on the cluster.

            ~250 nodes are running the new Lustre client now.

            Best regards,
            Benedikt

            cbordage Cyril Bordage added a comment:

            Hello Oliver,

            Was the patch deployed?

            Thank you.

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49714/
            Subject: LU-16349 o2iblnd: Fix key mismatch issue
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 0c93919f1375ce16d42ea13755ca6ffcc66b9969

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49714/ Subject: LU-16349 o2iblnd: Fix key mismatch issue Project: fs/lustre-release Branch: master Current Patch Set: Commit: 0c93919f1375ce16d42ea13755ca6ffcc66b9969

            cbordage Cyril Bordage added a comment:

            Thank you, Oliver. I did not know Serguei's patch was also deployed.

            Here is my analysis of the timeout errors.
            The error message from o2iblnd_cb.c:3442 (1) should be followed by another error message from o2iblnd_cb.c:3517 (2). But there is nothing before the watchdog message...
            Between the two messages there are only function returns and a test of a variable. It seems the machine is blocked somehow.
            When the code runs normally, the mutex is unlocked right away after message 2.

            In n0428-31-01-2023-journal.log, we can see the two messages:

            Jan 30 16:08:36 n0428 kernel: LNetError: 2371:0:(o2iblnd_cb.c:3442:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
            Jan 30 16:08:54 n0428 telegraf[2037]: 2023-01-30T15:08:54Z W! [inputs.exec] Collection took longer than expected; not complete after interval of 10s
            Jan 30 16:08:54 n0428 telegraf[2037]: 2023-01-30T15:08:54Z E! [agent] Error killing process: os: process already finished
            Jan 30 16:08:55 n0428 kernel: LNetError: 2371:0:(o2iblnd_cb.c:3517:kiblnd_check_conns()) Timed out RDMA with 10.4.104.103@o2ib1 (19): c: 27, oc: 0, rc: 32

            But they are separated by 19 seconds, with an error in between saying that collection for telegraf took too long. That seems to confirm what we saw in the previous error.

            The flow of debug messages can be very high (more than 1000/s). We will see with the new patch whether the problem disappears.
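            The 19-second gap noted above can be checked directly from the journal timestamps. A minimal sketch, using the two LNetError lines quoted in the comment (syslog timestamps omit the year, so both are assumed to fall in the same one):

            ```python
            # Measure the delay between the kiblnd_check_txs_locked() and
            # kiblnd_check_conns() messages by parsing the syslog-style
            # timestamp at the start of each journal line.
            from datetime import datetime

            first = "Jan 30 16:08:36 n0428 kernel: LNetError: 2371:0:(o2iblnd_cb.c:3442:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds"
            second = "Jan 30 16:08:55 n0428 kernel: LNetError: 2371:0:(o2iblnd_cb.c:3517:kiblnd_check_conns()) Timed out RDMA with 10.4.104.103@o2ib1 (19): c: 27, oc: 0, rc: 32"

            def stamp(line):
                # syslog timestamps are the first 15 characters: "Mon DD HH:MM:SS"
                return datetime.strptime(line[:15], "%b %d %H:%M:%S")

            gap = (stamp(second) - stamp(first)).total_seconds()
            print(f"gap: {gap:.0f} s")  # 19 s
            ```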

            omangold Oliver Mangold added a comment:

            I just attached to the ticket the patches to be applied on top of 2.12.9.

            People

              cbordage Cyril Bordage
              omangold Oliver Mangold
              Votes: 0
              Watchers: 12

              Dates

                Created:
                Updated:
                Resolved: