Details
-
Bug
-
Resolution: Fixed
-
Major
-
Server: Lustre 2.12.6 or 2.12.9 on the official CentOS 7 Lustre kernels for the respective versions
Client: CentOS 8 with various kernels and various Lustre 2.12.x versions (including 2.12.8 and 2.12.9)
Description
We find that Lustre becomes severely unresponsive on many clients. Sometimes the stalls are temporary (several minutes) and eventually clear; at other times only rebooting the client helps.
We first suspected OPA, but couldn't find any problems with RDMA in other uses (e.g. MPI).
The problem has been ongoing for a long time and we have not been able to identify a cause.
When the issue occurs, kernel messages like those shown below typically appear.
OPA counters do not show any errors, and according to the customer there are no network problems for compute traffic (MPI, etc.).
It doesn't seem to make a meaningful difference which Lustre 2.12.x versions are installed on the servers and clients. Apparently the fix from LU-14733 is not sufficient.
What can we do to resolve the problem?
Server:
[3298078.549239] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff9114021dd800
[3298078.560918] LustreError: 155816:0:(ldlm_lib.c:3338:target_bulk_io()) @@@ Reconnect on bulk WRITE req@ffff9115bbdb2050 x1739338816146624/t0(0) o4->2f66151f-7d0d-7f3c-dee4-35be6a0f2efc@10.4.16.11@o2ib1:690/0 lens 488/448 e 0 to 0 dl 1664192075 ref 1 fl Interpret:/2/0 rc 0/0
[3298078.562601] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9111f8549800
...
[3298079.099509] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff911704472800
[3298079.801646] LNetError: 24838:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.4.16.11@o2ib1: -125
[3298079.814642] LNetError: 24838:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Skipped 68 previous similar messages
[3298079.826073] LustreError: 24838:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff91169debe400
...
[3298166.354998] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff91152a79d400
[3298166.366511] LustreError: 156019:0:(ldlm_lib.c:3344:target_bulk_io()) @@@ network error on bulk WRITE req@ffff91154f9b4850 x1739338816563968/t0(0) o4->2f66151f-7d0d-7f3c-dee4-35be6a0f2efc@10.4.16.11@o2ib1:23/0 lens 488/448 e 0 to 0 dl 1664192163 ref 1 fl Interpret:/0/0 rc 0/0
[3298166.392860] LustreError: 156019:0:(ldlm_lib.c:3344:target_bulk_io()) Skipped 286 previous similar messages
[3298166.411524] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff911524fb7c00
[3298166.422885] LustreError: 24827:0:(events.c:450:server_bulk_callback()) event type 3, status -5, desc ffff9115a243b400
Client:
[5453641.210037] LustreError: 2380:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 00000000292896ee
[5454253.579062] Lustre: 2475:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1664191251/real 1664191251] req@00000000d70c7694 x1739323570965888/t0(0) o3->work-OST0004-osc-ffff888108388000@10.4.104.104@o2ib1:6/4 lens 488/4536 e 0 to 1 dl 1664191352 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
[5454253.608388] Lustre: 2475:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 38 previous similar messages
[5454253.618300] Lustre: work-OST0004-osc-ffff888108388000: Connection to work-OST0004 (at 10.4.104.104@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
[5454253.635574] Lustre: Skipped 36 previous similar messages
[5454253.641478] Lustre: work-OST0004-osc-ffff888108388000: Connection restored to 10.4.104.104@o2ib1 (at 10.4.104.104@o2ib1)
[5454253.652508] Lustre: Skipped 37 previous similar messages
[5454253.676598] LNetError: 2379:0:(o2iblnd_cb.c:1034:kiblnd_post_tx_locked()) Error -22 posting transmit to 10.4.104.104@o2ib1
[5454253.687807] LNetError: 2379:0:(o2iblnd_cb.c:1034:kiblnd_post_tx_locked()) Skipped 25 previous similar messages
[5454559.560649] LustreError: 2379:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 000000000f1e9a15
[5454559.587903] LustreError: 2381:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 0000000068d1ba49
[5454559.599428] LustreError: 2378:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 0000000068d1ba49
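For anyone triaging similar logs, here is a rough sketch (not part of the original report) that tallies the status values in the server_bulk_callback() and client_bulk_callback() messages and prints the matching Linux errno names. The function name tally_bulk_status and the LINUX_ERRNO table are made up for this illustration; the table is hard-coded from the generic Linux errno values and only covers the codes that appear in the excerpts above. It only summarizes the messages and does not say anything about the root cause.

```python
import re
from collections import Counter

# Linux errno names for the status values seen in the excerpts above.
# Hard-coded (assumption: generic Linux errno numbering) so the sketch
# gives the same answer regardless of the platform it runs on.
LINUX_ERRNO = {
    5: "EIO",
    11: "EAGAIN",
    22: "EINVAL",
    103: "ECONNABORTED",
    125: "ECANCELED",
}

# Matches both the server and the client bulk callback messages.
CALLBACK_RE = re.compile(
    r"(?:server|client)_bulk_callback\(\)\) event type (\d+), status (-?\d+)"
)


def tally_bulk_status(log_text):
    """Count (event type, errno name) pairs in bulk-callback messages."""
    counts = Counter()
    for event_type, status in CALLBACK_RE.findall(log_text):
        code = abs(int(status))
        name = LINUX_ERRNO.get(code, "errno %d" % code)
        counts[(int(event_type), name)] += 1
    return counts


if __name__ == "__main__":
    # Three messages copied from the server and client logs above.
    sample = (
        "[3298078.549239] LustreError: 24824:0:(events.c:450:server_bulk_callback()) "
        "event type 3, status -103, desc ffff9114021dd800 "
        "[3298079.826073] LustreError: 24838:0:(events.c:450:server_bulk_callback()) "
        "event type 5, status -125, desc ffff91169debe400 "
        "[5453641.210037] LustreError: 2380:0:(events.c:205:client_bulk_callback()) "
        "event type 1, status -22, desc 00000000292896ee"
    )
    for (event_type, name), n in sorted(tally_bulk_status(sample).items()):
        print("event type %d  %-13s x%d" % (event_type, name, n))
```

The regular expression works on the raw dmesg text whether or not the messages are separated by newlines, so a whole log file can be read and passed in as one string.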
Issue Links
- is duplicated by: LU-16244 after dropping router during ior lst add_group fails with "create session RPC failed on 12345-192.168.128.110@o2ib44: Unknown error -22", lnetctl ping fails intermittently (Resolved)
- is related to: LU-16484 Linux kernel BUG when deleting and adding a peer and using a filesystem (Open)