Lustre / LU-16349

Excessive number of OPA disconnects / LNET network errors in cluster

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.16.0, Lustre 2.15.3
    • Environment:
      Server: Lustre 2.12.6 or 2.12.9 on official CentOS 7 Lustre kernels for the respective versions
      Client: CentOS 8 with various kernels and various Lustre 2.12.x versions (including 2.12.8 and 2.12.9)

    Description

      We see massive unresponsiveness of Lustre on many clients. Sometimes there are temporary stalls (several minutes) which eventually go away; sometimes only rebooting the client helps.

      We suspected OPA first, but couldn't find any problems with RDMA when used otherwise (e.g. MPI).

      The problem has been ongoing for a long time and is completely mysterious to us.

      Typically, when the issue occurs, kernel messages of the kind shown below appear.

      OPA counters do not show any errors, and according to the customer there are no network problems for compute traffic (MPI, etc.).

      It doesn't seem to make a meaningful difference which versions of Lustre 2.12.x are installed on servers/clients. Apparently the fix from LU-14733 is not sufficient.

      What can we do to resolve the problem?

      Server:

      [3298078.549239] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff9114021dd800
      [3298078.560918] LustreError: 155816:0:(ldlm_lib.c:3338:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@ffff9115bbdb2050 x1739338816146624/t0(0) o4->2f66151f-7d0d-7f3c-dee4-35be6a0f2efc@10.4.16.11@o2ib1:690/0 lens 488/448 e 0 to 0 dl 1664192075 ref 1 fl Interpret:/2/0 rc 0/0
      [3298078.562601] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9111f8549800
      ...
      [3298079.099509] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff911704472800
      [3298079.801646] LNetError: 24838:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.4.16.11@o2ib1: -125
      [3298079.814642] LNetError: 24838:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Skipped 68 previous similar messages
      [3298079.826073] LustreError: 24838:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff91169debe400
      ...
      [3298166.354998] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff91152a79d400
      [3298166.366511] LustreError: 156019:0:(ldlm_lib.c:3344:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff91154f9b4850 x1739338816563968/t0(0) o4->2f66151f-7d0d-7f3c-dee4-35be6a0f2efc@10.4.16.11@o2ib1:23/0 lens 488/448 e 0 to 0 dl 1664192163 ref 1 fl Interpret:/0/0 rc 0/0
      [3298166.392860] LustreError: 156019:0:(ldlm_lib.c:3344:target_bulk_io()) Skipped 286 previous similar messages
      [3298166.411524] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff911524fb7c00
      [3298166.422885] LustreError: 24827:0:(events.c:450:server_bulk_callback()) event type 3, status -5, desc ffff9115a243b400
      

      Client:

      [5453641.210037] LustreError: 2380:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 00000000292896ee
      [5454253.579062] Lustre: 2475:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1664191251/real 1664191251]  req@00000000d70c7694 x1739323570965888/t0(0) o3->work-OST0004-osc-ffff888108388000@10.4.104.104@o2ib1:6/4 lens 488/4536 e 0 to 1 dl 1664191352 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
      [5454253.608388] Lustre: 2475:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 38 previous similar messages
      [5454253.618300] Lustre: work-OST0004-osc-ffff888108388000: Connection to work-OST0004 (at 10.4.104.104@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
      [5454253.635574] Lustre: Skipped 36 previous similar messages
      [5454253.641478] Lustre: work-OST0004-osc-ffff888108388000: Connection restored to 10.4.104.104@o2ib1 (at 10.4.104.104@o2ib1)
      [5454253.652508] Lustre: Skipped 37 previous similar messages
      [5454253.676598] LNetError: 2379:0:(o2iblnd_cb.c:1034:kiblnd_post_tx_locked()) Error -22 posting transmit to 10.4.104.104@o2ib1
      [5454253.687807] LNetError: 2379:0:(o2iblnd_cb.c:1034:kiblnd_post_tx_locked()) Skipped 25 previous similar messages
      [5454559.560649] LustreError: 2379:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 000000000f1e9a15
      [5454559.587903] LustreError: 2381:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 0000000068d1ba49
      [5454559.599428] LustreError: 2378:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 0000000068d1ba49
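      The negative status codes in the server_bulk_callback/client_bulk_callback lines above are negated Linux errno values. A minimal sketch of decoding the codes seen in these logs with Python's standard errno table (assuming Linux errno numbering, which matches the kernels in this ticket):

      ```python
      # Decode the "status" values from the Lustre bulk-callback messages
      # (e.g. "status -103") into errno names. The list below is the set of
      # statuses observed in the server and client logs of this ticket.
      import errno

      observed = [-103, -125, -5, -22]
      for status in observed:
          # kernel callbacks report -errno; negate to look up the name
          print(f"status {status}: {errno.errorcode[-status]}")
      # -103 -> ECONNABORTED, -125 -> ECANCELED, -5 -> EIO, -22 -> EINVAL
      ```

      So the server side is mostly seeing aborted/cancelled connections (ECONNABORTED, ECANCELED) while the client-side bulk callbacks fail with EINVAL.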
      

       

      Attachments

        1. minimal-fix.patch.gz
          2 kB
        2. no-post.patch.gz
          0.7 kB
        3. o2iblnd-debug.tar.gz
          2 kB
        4. testing-patches-230119.patch
          29 kB

        Issue Links

          Activity

            [LU-16349] Excessive number of OPA disconnects / LNET network errors in cluster

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50214/
            Subject: LU-16349 o2iblnd: Fix key mismatch issue
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set:
            Commit: da98e1a6f31462ab76ce7c3a48c21eb4c9eda151

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50214/ Subject: LU-16349 o2iblnd: Fix key mismatch issue Project: fs/lustre-release Branch: b2_15 Current Patch Set: Commit: da98e1a6f31462ab76ce7c3a48c21eb4c9eda151

            "Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50564
            Subject: LU-16349 o2iblnd: Fix key mismatch issue
            Project: fs/lustre-release
            Branch: b2_14
            Current Patch Set: 1
            Commit: 329448005cefab9fe13da33c50a51f664a8cce66

            gerrit Gerrit Updater added a comment - "Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50564 Subject: LU-16349 o2iblnd: Fix key mismatch issue Project: fs/lustre-release Branch: b2_14 Current Patch Set: 1 Commit: 329448005cefab9fe13da33c50a51f664a8cce66

            "Xing Huang <hxing@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50214
            Subject: LU-16349 o2iblnd: Fix key mismatch issue
            Project: fs/lustre-release
            Branch: b2_15
            Current Patch Set: 1
            Commit: 8811fa2cd9db66d4acdabf17e2cc93ceef5f6752

            gerrit Gerrit Updater added a comment - "Xing Huang <hxing@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50214 Subject: LU-16349 o2iblnd: Fix key mismatch issue Project: fs/lustre-release Branch: b2_15 Current Patch Set: 1 Commit: 8811fa2cd9db66d4acdabf17e2cc93ceef5f6752
            pjones Peter Jones added a comment -

            Fix landed for 2.16 so closing ticket. Should port fix to b2_15 for inclusion in a future 2.15.x release.


            bschaefer Benedikt Schaefer added a comment:

            Current state: ~600 clients are running the patched client.

            We still see Lustre errors on some clients, but the nodes recovered and did not get stuck.

            bschaefer Benedikt Schaefer added a comment:

            Hi,

            The patched client is currently being rolled out on the cluster.

            ~250 nodes are running the new Lustre client now.

            Best regards,
            Benedikt

            cbordage Cyril Bordage added a comment:

            Hello Oliver,

            Was the patch deployed?

            Thank you.

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49714/
            Subject: LU-16349 o2iblnd: Fix key mismatch issue
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 0c93919f1375ce16d42ea13755ca6ffcc66b9969

            gerrit Gerrit Updater added a comment - "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49714/ Subject: LU-16349 o2iblnd: Fix key mismatch issue Project: fs/lustre-release Branch: master Current Patch Set: Commit: 0c93919f1375ce16d42ea13755ca6ffcc66b9969

            cbordage Cyril Bordage added a comment:

            Thank you, Oliver. I did not know Serguei's patch was also deployed.

            Here is my analysis of the timeout errors.
            The error message from o2iblnd_cb.c:3442 (1) should be followed by another error message from o2iblnd_cb.c:3517 (2). But there is nothing before the watchdog message...
            Between the two messages there are only function returns and a test of a variable. It seems the machine is blocked somehow.
            When the code runs normally, the mutex is unlocked right away after message 2.

            In n0428-31-01-2023-journal.log, we can see the two messages:

            Jan 30 16:08:36 n0428 kernel: LNetError: 2371:0:(o2iblnd_cb.c:3442:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
            Jan 30 16:08:54 n0428 telegraf[2037]: 2023-01-30T15:08:54Z W! [inputs.exec] Collection took longer than expected; not complete after interval of 10s
            Jan 30 16:08:54 n0428 telegraf[2037]: 2023-01-30T15:08:54Z E! [agent] Error killing process: os: process already finished
            Jan 30 16:08:55 n0428 kernel: LNetError: 2371:0:(o2iblnd_cb.c:3517:kiblnd_check_conns()) Timed out RDMA with 10.4.104.103@o2ib1 (19): c: 27, oc: 0, rc: 32

            But they are separated by 19 seconds, with an error in between saying that collection for telegraf took too long. That seems to confirm what we saw in the previous error.

            The flow of debug messages can be very high (more than 1000/s). We will see with the new patch whether the problem disappears.
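            The 19-second gap noted above can be checked directly from the journal timestamps. A minimal sketch, using the two LNetError lines quoted in the comment (syslog timestamps omit the year, so both are assumed to fall in the same one):

            ```python
            # Measure the delay between the kiblnd_check_txs_locked() and
            # kiblnd_check_conns() messages by parsing the syslog-style
            # timestamp at the start of each journal line.
            from datetime import datetime

            first = "Jan 30 16:08:36 n0428 kernel: LNetError: 2371:0:(o2iblnd_cb.c:3442:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds"
            second = "Jan 30 16:08:55 n0428 kernel: LNetError: 2371:0:(o2iblnd_cb.c:3517:kiblnd_check_conns()) Timed out RDMA with 10.4.104.103@o2ib1 (19): c: 27, oc: 0, rc: 32"

            def stamp(line):
                # syslog timestamps are the first 15 characters: "Mon DD HH:MM:SS"
                return datetime.strptime(line[:15], "%b %d %H:%M:%S")

            gap = (stamp(second) - stamp(first)).total_seconds()
            print(f"gap: {gap:.0f} s")  # 19 s
            ```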

            omangold Oliver Mangold added a comment:

            I just attached to the ticket the patches to be applied on top of 2.12.9.

            People

              cbordage Cyril Bordage
              omangold Oliver Mangold
              Votes: 0
              Watchers: 12

              Dates

                Created:
                Updated:
                Resolved: