[LU-16349] Excessive number of OPA disconnects / LNET network errors in cluster Created: 29/Nov/22 Updated: 01/May/23 Resolved: 14/Feb/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.16.0, Lustre 2.15.3 |
| Type: | Bug | Priority: | Major |
| Reporter: | Oliver Mangold | Assignee: | Cyril Bordage |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Server: Lustre 2.12.6 or 2.12.9 on official CentOS 7 Lustre kernels for the respective versions |
||
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We are seeing massive unresponsiveness of Lustre on many clients. Sometimes there are temporary stalls (several minutes) which eventually go away; sometimes only rebooting the client helps. We suspected OPA at first, but could not find any problems with RDMA when it is used otherwise (e.g. MPI). The problem has been ongoing for a long time and is completely mysterious to us. Typically, when the issue appears, kernel messages of the kind shown below appear. OPA counters do not show any errors, and according to the customer they do not see network problems for compute traffic (MPI, etc.). It does not seem to make a meaningful difference which versions of Lustre 2.12.x are installed on the servers/clients. Apparently the fix from ... What can we do to resolve the problem?

Server:
[3298078.549239] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff9114021dd800
[3298078.560918] LustreError: 155816:0:(ldlm_lib.c:3338:target_bulk_io()) @@@ Reconnect on bulk WRITE req@ffff9115bbdb2050 x1739338816146624/t0(0) o4->2f66151f-7d0d-7f3c-dee4-35be6a0f2efc@10.4.16.11@o2ib1:690/0 lens 488/448 e 0 to 0 dl 1664192075 ref 1 fl Interpret:/2/0 rc 0/0
[3298078.562601] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9111f8549800
...
[3298079.099509] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff911704472800
[3298079.801646] LNetError: 24838:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.4.16.11@o2ib1: -125
[3298079.814642] LNetError: 24838:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Skipped 68 previous similar messages
[3298079.826073] LustreError: 24838:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff91169debe400
...
[3298166.354998] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff91152a79d400
[3298166.366511] LustreError: 156019:0:(ldlm_lib.c:3344:target_bulk_io()) @@@ network error on bulk WRITE req@ffff91154f9b4850 x1739338816563968/t0(0) o4->2f66151f-7d0d-7f3c-dee4-35be6a0f2efc@10.4.16.11@o2ib1:23/0 lens 488/448 e 0 to 0 dl 1664192163 ref 1 fl Interpret:/0/0 rc 0/0
[3298166.392860] LustreError: 156019:0:(ldlm_lib.c:3344:target_bulk_io()) Skipped 286 previous similar messages
[3298166.411524] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff911524fb7c00
[3298166.422885] LustreError: 24827:0:(events.c:450:server_bulk_callback()) event type 3, status -5, desc ffff9115a243b400

Client:
[5453641.210037] LustreError: 2380:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 00000000292896ee
[5454253.579062] Lustre: 2475:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1664191251/real 1664191251] req@00000000d70c7694 x1739323570965888/t0(0) o3->work-OST0004-osc-ffff888108388000@10.4.104.104@o2ib1:6/4 lens 488/4536 e 0 to 1 dl 1664191352 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
[5454253.608388] Lustre: 2475:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 38 previous similar messages
[5454253.618300] Lustre: work-OST0004-osc-ffff888108388000: Connection to work-OST0004 (at 10.4.104.104@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
[5454253.635574] Lustre: Skipped 36 previous similar messages
[5454253.641478] Lustre: work-OST0004-osc-ffff888108388000: Connection restored to 10.4.104.104@o2ib1 (at 10.4.104.104@o2ib1)
[5454253.652508] Lustre: Skipped 37 previous similar messages
[5454253.676598] LNetError: 2379:0:(o2iblnd_cb.c:1034:kiblnd_post_tx_locked()) Error -22 posting transmit to 10.4.104.104@o2ib1
[5454253.687807] LNetError: 2379:0:(o2iblnd_cb.c:1034:kiblnd_post_tx_locked()) Skipped 25 previous similar messages
[5454559.560649] LustreError: 2379:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 000000000f1e9a15
[5454559.587903] LustreError: 2381:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 0000000068d1ba49
[5454559.599428] LustreError: 2378:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 0000000068d1ba49
|
| Comments |
| Comment by Cyril Bordage [ 10/Jan/23 ] |
|
%% Just a copy of what Holger reported %%
I will share a theory from Cornelis here, which is not yet confirmed.
If you are willing, I have a failure theory I’d like you to test.
First, the theory:
-----------------------
Background: When an lnet pool mr is created, rdmavt assigns a key to the mr. When lnet maps a mr, it (1) creates an invalidate wr with the current mr key and then increments (changes) the key in the mr, and (2) creates a wr that does a fast memory register with the updated key. A fast memory register will change the key rdmavt remembers for the mr. The intent is as follows: on first use, the two created wrs are prepended to the operation using the mr (e.g. a read or write). The two wrs, together, will invalidate the original mr and fast register the mr with a new key.
Failure scenario: A pool mr is mapped, then unmapped without being used. If this is ever done, the keys in the mr and rdmavt for the memory region are forever out of sync because the invalidate and re-register never occurred for rdmavt but the key increment in the mr did occur. The next time the mr is mapped, the created invalidate will have a key that does not match rdmavt's version of the key.
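To make the failure scenario concrete, here is a minimal user-space toy model of the key bookkeeping described above (purely illustrative: the names lnd_key, hca_key, map_mr() and post() are invented here, and this is not the o2iblnd or rdmavt code). It only models how the LND-side key can run ahead of the HCA-side key when a mapping is dropped without its work requests ever being posted:

#include <stdio.h>

static unsigned int lnd_key = 0x00646500; /* key the LND keeps in the mr */
static unsigned int hca_key = 0x00646500; /* key rdmavt remembers for the mr */

struct wr_pair {
    unsigned int invalidate_key; /* key carried by the invalidate wr */
    unsigned int fastreg_key;    /* key carried by the fast-register wr */
};

/* "map": build invalidate(current key) + fastreg(new key), bump the local key */
static struct wr_pair map_mr(void)
{
    struct wr_pair wrs = { lnd_key, lnd_key + 1 };

    lnd_key++; /* the LND's view of the key advances immediately */
    return wrs;
}

/* "post": the HCA executes the wrs; a stale invalidate key makes it fail */
static int post(struct wr_pair wrs)
{
    if (wrs.invalidate_key != hca_key) {
        printf("post failed: invalidate key %#x != HCA key %#x\n",
               wrs.invalidate_key, hca_key);
        return -22;
    }
    hca_key = wrs.fastreg_key; /* fast register installs the new key */
    return 0;
}

int main(void)
{
    struct wr_pair wrs;

    wrs = map_mr();
    post(wrs);      /* normal use: map + post, keys stay in sync */

    wrs = map_mr(); /* failure scenario: map ... */
    /* ... then unmap without posting: wrs are dropped, but lnd_key advanced */

    wrs = map_mr(); /* next use of the same pool mr */
    if (post(wrs))  /* invalidate now carries a key rdmavt never installed */
        printf("reuse failed, as the theory predicts\n");
    return 0;
}

In this model the first map/post pair keeps the two keys in sync, the second map is unmapped without a post, and the third map then fails because its invalidate carries a key the HCA never installed - the same off-by-one rkey pattern described in the failure scenario.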
Second, the test:
-----------------------
Proposed test plan for my theory:
I have attached a patch for o2iblnd that adds debugging prints. The code is uncompiled and untested.
-Dean |
| Comment by Cyril Bordage [ 10/Jan/23 ] |
|
Could someone provide the proposed patch so that we can check it? |
| Comment by Dean Luick [ 10/Jan/23 ] |
|
I have attached the debug patch I sent to Holger. The file is no-post.patch.gz. |
| Comment by Dean Luick [ 17/Jan/23 ] |
|
I was unable to force a map-unmap sequence using a 2-node setup and lnetctl; the attempt resulted in a kernel BUG (see LU-16484 for details). I have added more o2iblnd debugging, including a patch that will perform a pool mr map-unmap after a given number of uses.

Here is a mr being forced to map-unmap without being posted:

Jan 12 10:43:10 opahsx121 kernel: kiblnd_fmr_pool_map : mr 0xffff948ecf043800, rkey 0x00646500, is_rx 1, count 100
Jan 12 10:43:10 opahsx121 kernel: kiblnd_fmr_map_tx: forcing map-unmap on mr 0xffff948ecf043800
Jan 12 10:43:10 opahsx121 kernel: kiblnd_fmr_pool_unmap: mr 0xffff948ecf043800, rkey 0x00646501, posted 0
Jan 12 10:43:10 opahsx121 kernel: kiblnd_fmr_pool_unmap: mr 0xffff948ecf043800 not posted

Line 4 is the debug line that I sent to NEC. The other lines are extra debug lines I added locally to print more information. Line 1 shows that this is the 100th pool mr map - a call to both kiblnd_fmr_map_tx() and kiblnd_fmr_pool_map(). Line 2 is the commentary that I am forcing the map-unmap condition on a pool mr. On line 3, note that the rkey is incremented from its original value in line 1. This is the issue/problem that occurs when a mr is unmapped without being posted - the mr key is changed in o2iblnd, but rdmavt still holds the original key.

I then had to wait for the same pool mr to be re-used. Here is what happened on re-use:

Jan 12 10:51:11 opahsx121 kernel: kiblnd_fmr_pool_map : mr 0xffff948ecf043800, rkey 0x00646501, is_rx 1, count 1124
Jan 12 10:51:11 opahsx121 kernel: kiblnd_post_tx_locked: mr 0xffff948ecf043800, posted 0, err -22, send failure
Jan 12 10:51:11 opahsx121 kernel: LNetError: 10773:0:(o2iblnd_cb.c:1054:kiblnd_post_tx_locked()) Error -22 posting transmit to 192.168.72.174@o2ib
Jan 12 10:51:11 opahsx121 kernel: kiblnd_fmr_pool_unmap: mr 0xffff948ecf043800, rkey 0x00646502, posted 0
Jan 12 10:51:11 opahsx121 kernel: kiblnd_fmr_pool_unmap: mr 0xffff948ecf043800 not posted
Jan 12 10:51:11 opahsx121 kernel: LustreError: 10773:0:(events.c:205:client_bulk_callback()) event type 2, status -22, desc 0000000010390fca
Jan 12 10:51:11 opahsx121 kernel: Lustre: 13263:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1673538671/real 1673538671] req@00000000b9ba836a x1754831869785472/t0(0) o37->temp-MDT0000-mdc-ffff948000960000@192.168.72.174@o2ib:23/10 lens 448/440 e 0 to 1 dl 1673538678 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
Jan 12 10:51:11 opahsx121 kernel: Lustre: temp-MDT0000-mdc-ffff948000960000: Connection to temp-MDT0000 (at 192.168.72.174@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Jan 12 10:51:11 opahsx121 kernel: Lustre: temp-MDT0000-mdc-ffff948000960000: Connection restored to 192.168.72.174@o2ib (at 192.168.72.174@o2ib)

Lines 1, 4, and 5 are my debugging prints. On line 1, note that the count is now 1124; I had to wait 1024 pool mr uses before my victim mr was reused. Line 2 is the failure reported by NEC/ULM that was traced to a key mismatch. On line 4, note that the rkey is again incremented. In other words, it stays wrong and will continue to be wrong until the user portion of the key (8 bits) rolls over and re-matches what rdmavt expects. Line 5, again, is the debug print sent to NEC.
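As an aside on the "rolls over" remark: if the key is advanced only in its low 8 bits (as the kernel's ib_inc_rkey() helper does), a mr whose key has run one ahead of rdmavt would only come back into sync after 255 further unposted increments. A small, purely illustrative calculation (not o2iblnd code; the variable names are invented):

#include <stdio.h>

int main(void)
{
    unsigned int hca_rkey = 0x00646500; /* what rdmavt still holds */
    unsigned int lnd_rkey = 0x00646501; /* one unposted increment ahead */
    int n = 0;

    /* only the low 8 bits (the "user portion") are incremented */
    while ((lnd_rkey & 0xff) != (hca_rkey & 0xff)) {
        lnd_rkey = (lnd_rkey & ~0xffu) | ((lnd_rkey + 1) & 0xff);
        n++;
    }
    printf("keys re-match after %d further increments\n", n); /* prints 255 */
    return 0;
}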
On the server, this is what I see:

Jan 12 10:51:11 opahsx174 kernel: LustreError: 30188:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9d2ab07c6400
Jan 12 10:51:11 opahsx174 kernel: Lustre: temp-MDT0000: Client c4c295b0-1df0-37a4-3ada-e906e22cb42e (at 192.168.72.121@o2ib) reconnecting
Jan 12 10:51:12 opahsx174 kernel: LustreError: 11726:0:(ldlm_lib.c:3338:target_bulk_io()) @@@ Reconnect on bulk READ req@ffff9d2ba987d050 x1754831869785472/t0(0) o37->c4c295b0-1df0-37a4-3ada-e906e22cb42e@192.168.72.121@o2ib:392/0 lens 448/440 e 0 to 0 dl 1673538677 ref 1 fl Interpret:/0/0 rc 0/0
Jan 12 12:10:21 opahsx174 kernel: Lustre: MGS: Client 5188c476-a2eb-f754-b30a-890947140fdf (at 192.168.72.121@o2ib) reconnecting
Jan 12 12:10:21 opahsx174 kernel: Lustre: Skipped 1 previous similar message
Jan 12 12:10:21 opahsx174 kernel: Lustre: MGS: Connection restored to 0be70565-3123-a02c-eb27-972e46a2c761 (at 192.168.72.121@o2ib)
Jan 12 12:10:21 opahsx174 kernel: Lustre: Skipped 4 previous similar messages

Lines 1-3 match the time of the client failure. Lines 4-7 are about 1 hour 19 minutes later; I don't know if those latter lines are related. Line 1 matches one of the server error lines reported by NEC/ULM.

Summary: I think this demonstrates that there is a bug in o2iblnd when a pool mr is mapped and unmapped without a post. When this occurs, we see messages similar to those seen by NEC/ULM. What this doesn't demonstrate is this actually happening at NEC/ULM: the demonstration above was forced by an explicit coding change that simulates an error condition that may or may not be possible in practice. |
| Comment by Dean Luick [ 17/Jan/23 ] |
|
The extra debugging and the forced map-unmap are two new patches on top of the original patch sent to NEC/ULM and previously attached to this ticket. I have saved them all as a patch series and attached it here as the file o2iblnd-debug.tar.gz. |
| Comment by Cyril Bordage [ 18/Jan/23 ] |
|
Hello Dean, thank you for this demonstration; it is quite clear. You are right that this shows a real problem, but we cannot be sure it is the one we are hitting. I do not see why this issue would happen on this machine again and again, and not at other sites. I will focus on writing a patch, and will get back to you if I cannot manage to test it with your debugging process. My concern is how to know whether the patch is what prevents new crashes. Once the patch is accepted, would it be possible to leave some clients unpatched, to see whether the problem happens again on the unpatched nodes? |
| Comment by Dean Luick [ 18/Jan/23 ] |
|
I have attached a proposed fix for the issue. I have tested it, and it allows o2iblnd to correctly ride through a map-unmap without a key mismatch. This change is "minimal" in that it changes only enough to avoid the problem. This patch sits on top of the debugging and forced map-unmap patches - there may be some hand editing needed when applying it without the debug patches. The changes are simple enough that this should not be a problem. The patch file is minimal-fix.patch.gz.
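The actual change is the attached minimal-fix.patch.gz, which is not restated here. Purely as an illustration of the idea of riding through an unused map-unmap, the toy model from the theory comment can be adjusted to remember whether the prepared work requests were ever posted and, if not, to put the local key back in step with the key the HCA still holds. This is a sketch under that assumption, not the contents of the patch; all names are invented:

#include <stdio.h>

static unsigned int lnd_key = 0x00646500; /* key the LND keeps in the mr */
static unsigned int hca_key = 0x00646500; /* key the HCA (rdmavt) remembers */
static int          posted;               /* were the wrs actually posted? */

struct wr_pair { unsigned int invalidate_key, fastreg_key; };

static struct wr_pair map_mr(void)
{
    struct wr_pair wrs = { lnd_key, lnd_key + 1 };

    lnd_key++;
    posted = 0;
    return wrs;
}

static int post(struct wr_pair wrs)
{
    if (wrs.invalidate_key != hca_key)
        return -22;                 /* stale key: post failure */
    hca_key = wrs.fastreg_key;
    posted = 1;
    return 0;
}

static void unmap_mr(void)
{
    if (!posted)
        lnd_key--; /* invalidate/fast-register never ran: undo the key bump */
}

int main(void)
{
    struct wr_pair wrs;

    wrs = map_mr();
    unmap_mr();                     /* mapped, never posted */

    wrs = map_mr();                 /* reused later ... */
    printf("post on reuse: %d\n", post(wrs)); /* now 0 instead of -22 */
    unmap_mr();
    return 0;
}

With that change, the model's map, unmap-without-post, map-again, post sequence succeeds instead of hitting the stale-key failure. |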
| Comment by Gerrit Updater [ 19/Jan/23 ] |
|
"Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49714 |
| Comment by Cyril Bordage [ 19/Jan/23 ] |
|
Thank you, Dean. Your patch seemed fine, so I pushed it. |
| Comment by Cyril Bordage [ 24/Jan/23 ] |
|
Just a quick update. Waiting for the patch to be reviewed. It should be done soon. Then, I will backport it. |
| Comment by Cyril Bordage [ 27/Jan/23 ] |
|
Hello, could you post the details of the last bug with all logs here? Thank you. |
| Comment by Dean Luick [ 31/Jan/23 ] |
|
At NEC's request, Ulm has been running a modified version of Lustre 2.12.9 on the client side. The known changes applied by NEC are attached to the ticket as:
o2iblnd-debug.tar.gz (includes no-post.patch.gz)
minimal-fix.patch.gz
NEC sent the journal log of n0802, which includes the debugging output. A review of the log file shows:
In summary, I think the MR patch proposed by Cornelis fixes the observed rkey mismatch issue.
| Comment by Dean Luick [ 31/Jan/23 ] |
|
Ulm reported node crashes last weekend. Here is a review of the set of 6 BMC console logs sent by NEC to Cornelis. Summary:
| Console: n0428-bmc
There is an LNet error in the log, then 6 seconds later a Watchdog NMI. This is followed 1 second later by a fatal NMI.
[569811.333835] LNetError: 2371:0:(o2iblnd_cb.c:3442:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[569830.918525] NMI watchdog: Watchdog detected hard LOCKUP on cpu 50
Then 1 second later, another NMI, this one fatal, on the same CPU. Same traceback, although obscured by the additional traceback path through the NMI.
Other observations: this is running on
[569830.918544] Workqueue: ib_cm cm_work_handler [ib_cm]
First traceback:
[569830.918552] Call Trace:
[569830.918552] queued_write_lock_slowpath+0x75/0x80
[569830.918553] _raw_write_lock_irqsave+0x2b/0x30
[569830.918553] kiblnd_close_conn+0x1b/0x40 [ko2iblnd]
[569830.918554] kiblnd_cm_callback+0x9e0/0x2270 [ko2iblnd]
[569830.918554] cma_cm_event_handler+0x25/0xd0 [rdma_cm]
[569830.918554] cma_ib_handler+0xa7/0x2e0 [rdma_cm]
[569830.918555] cm_process_work+0x22/0xf0 [ib_cm]
[569830.918555] cm_work_handler+0xa77/0x1410 [ib_cm]
[569830.918555] ? __switch_to_asm+0x41/0x70
[569830.918556] ? __switch_to_asm+0x35/0x70
[569830.918556] ? __switch_to_asm+0x41/0x70
[569830.918556] ? __switch_to_asm+0x35/0x70
[569830.918557] ? __switch_to_asm+0x41/0x70
[569830.918557] ? __switch_to_asm+0x35/0x70
[569830.918557] ? __switch_to_asm+0x41/0x70
[569830.918558] ? __switch_to+0x167/0x470
[569830.918558] ? finish_task_switch+0xaa/0x2e0
[569830.918558] process_one_work+0x1a7/0x360
[569830.918559] ? create_worker+0x1a0/0x1a0
[569830.918559] worker_thread+0x30/0x390
[569830.918559] ? create_worker+0x1a0/0x1a0
[569830.918559] kthread+0x116/0x130
[569830.918560] ? kthread_flush_work_fn+0x10/0x10
[569830.918560] ret_from_fork+0x1f/0x40
| Console: n0501-bmc
There is an LNet error in the log, then 6 seconds later a Watchdog NMI. This is followed 1 second later by a fatal NMI.
[638125.931425] LNetError: 2377:0:(o2iblnd_cb.c:3442:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[638141.798616] NMI watchdog: Watchdog detected hard LOCKUP on cpu 9
Observations:
First traceback:
[638141.798641] Call Trace:
[638141.798641] queued_read_lock_slowpath+0x6e/0x80
[638141.798642] _raw_read_lock_irqsave+0x31/0x40
[638141.798642] kiblnd_launch_tx+0x41/0xa70 [ko2iblnd]
[638141.798642] ? lnet_copy_iov2iov+0x158/0x250 [lnet]
[638141.798643] kiblnd_send+0x1f9/0x9a0 [ko2iblnd]
[638141.798643] lnet_ni_send+0x42/0xd0 [lnet]
[638141.798643] lnet_send+0x7e/0x1b0 [lnet]
[638141.798644] LNetPut+0x2b7/0xaf0 [lnet]
[638141.798644] ? cfs_percpt_unlock+0x15/0xb0 [libcfs]
[638141.798644] ptl_send_buf+0x20b/0x560 [ptlrpc]
[638141.798645] ptl_send_rpc+0x462/0xda0 [ptlrpc]
[638141.798645] ? __switch_to_asm+0x41/0x70
[638141.798645] ptlrpc_send_new_req+0x596/0xa70 [ptlrpc]
[638141.798646] ? __switch_to_asm+0x35/0x70
[638141.798646] ptlrpc_check_set.part.30+0x725/0x1f20 [ptlrpc]
[638141.798646] ? __switch_to+0x167/0x470
[638141.798647] ? finish_task_switch+0xaa/0x2e0
[638141.798647] ptlrpcd_check+0x3d5/0x5b0 [ptlrpc]
[638141.798647] ptlrpcd+0x374/0x4b0 [ptlrpc]
[638141.798648] ? wake_up_q+0x80/0x80
[638141.798648] ? ptlrpcd_check+0x5b0/0x5b0 [ptlrpc]
[638141.798648] kthread+0x116/0x130
[638141.798649] ? kthread_flush_work_fn+0x10/0x10
[638141.798649] ret_from_fork+0x1f/0x40
| Console: n0515-bmc
There is an LNet error in the log, then 6 seconds later a Watchdog NMI. This is followed 1 second later by a fatal NMI.
[39917.837921] LNetError: 2365:0:(o2iblnd_cb.c:3442:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[39930.022676] NMI watchdog: Watchdog detected hard LOCKUP on cpu 66
Observations:
First traceback:
[39930.022703] Call Trace:
[39930.022703] queued_read_lock_slowpath+0x6e/0x80
[39930.022704] _raw_read_lock_irqsave+0x31/0x40
[39930.022704] kiblnd_launch_tx+0x41/0xa70 [ko2iblnd]
[39930.022704] ? lnet_copy_iov2iov+0x158/0x250 [lnet]
[39930.022705] kiblnd_send+0x1f9/0x9a0 [ko2iblnd]
[39930.022705] lnet_ni_send+0x42/0xd0 [lnet]
[39930.022705] lnet_send+0x7e/0x1b0 [lnet]
[39930.022706] LNetPut+0x2b7/0xaf0 [lnet]
[39930.022706] ? cfs_percpt_unlock+0x15/0xb0 [libcfs]
[39930.022706] ptl_send_buf+0x20b/0x560 [ptlrpc]
[39930.022707] ptl_send_rpc+0x462/0xda0 [ptlrpc]
[39930.022707] ? __switch_to_asm+0x41/0x70
[39930.022707] ptlrpc_send_new_req+0x596/0xa70 [ptlrpc]
[39930.022708] ? __switch_to_asm+0x35/0x70
[39930.022708] ptlrpc_check_set.part.30+0x725/0x1f20 [ptlrpc]
[39930.022708] ? __switch_to+0x167/0x470
[39930.022708] ? finish_task_switch+0xaa/0x2e0
[39930.022709] ptlrpcd_check+0x3d5/0x5b0 [ptlrpc]
[39930.022709] ptlrpcd+0x374/0x4b0 [ptlrpc]
[39930.022709] ? wake_up_q+0x80/0x80
[39930.022710] ? ptlrpcd_check+0x5b0/0x5b0 [ptlrpc]
[39930.022710] kthread+0x116/0x130
[39930.022710] ? kthread_flush_work_fn+0x10/0x10
[39930.022711] ret_from_fork+0x1f/0x40
| Console: n0601-bmc
There is an LNet error in the log, then 24 seconds later a Watchdog NMI. This is followed 1 second later by a fatal NMI.
[576238.943801] LNetError: 2377:0:(o2iblnd_cb.c:3442:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[576262.045550] NMI watchdog: Watchdog detected hard LOCKUP on cpu 12
Modules linked in: mgc(OE) lustre(OE)
Observations:
First traceback:
[576262.045570] Call Trace:
[576262.045570] _raw_read_lock_irqsave+0x31/0x40
[576262.045570] kiblnd_launch_tx+0x41/0xa70 [ko2iblnd]
[576262.045571] ? cfs_percpt_lock+0x16/0xf0 [libcfs]
[576262.045571] kiblnd_send+0x1f9/0x9a0 [ko2iblnd]
[576262.045571] lnet_ni_send+0x42/0xd0 [lnet]
[576262.045571] lnet_send+0x7e/0x1b0 [lnet]
[576262.045572] lnet_finalize+0x41d/0xf90 [lnet]
[576262.045572] kiblnd_tx_done+0x119/0x2f0 [ko2iblnd]
[576262.045572] ? kiblnd_handle_completion+0xc6/0x1a0 [ko2iblnd]
[576262.045573] kiblnd_handle_rx+0x38a/0x680 [ko2iblnd]
[576262.045573] kiblnd_scheduler+0x1036/0x10b0 [ko2iblnd]
[576262.045573] ? wake_up_q+0x80/0x80
[576262.045573] ? __switch_to_asm+0x41/0x70
[576262.045574] ? kiblnd_cq_event+0x80/0x80 [ko2iblnd]
[576262.045574] kthread+0x116/0x130
[576262.045574] ? kthread_flush_work_fn+0x10/0x10
[576262.045575] ret_from_fork+0x1f/0x40
| Console: n0922-bmc
There is an LNet error in the log, then 16 seconds later a Watchdog NMI. This is followed 1 second later by a fatal NMI.
[52223.817213] LNetError: 2353:0:(o2iblnd_cb.c:3442:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
[52239.190601] NMI watchdog: Watchdog detected hard LOCKUP on cpu 71
Observations:
First call trace:
[52239.190627] Call Trace:
[52239.190627] queued_read_lock_slowpath+0x6e/0x80
[52239.190628] _raw_read_lock_irqsave+0x31/0x40
[52239.190628] kiblnd_launch_tx+0x41/0xa70 [ko2iblnd]
[52239.190628] ? lnet_copy_iov2iov+0x158/0x250 [lnet]
[52239.190629] kiblnd_send+0x1f9/0x9a0 [ko2iblnd]
[52239.190629] lnet_ni_send+0x42/0xd0 [lnet]
[52239.190629] lnet_send+0x7e/0x1b0 [lnet]
[52239.190629] LNetPut+0x2b7/0xaf0 [lnet]
[52239.190630] ? cfs_percpt_unlock+0x15/0xb0 [libcfs]
[52239.190630] ptl_send_buf+0x20b/0x560 [ptlrpc]
[52239.190630] ptl_send_rpc+0x462/0xda0 [ptlrpc]
[52239.190631] ? __switch_to_asm+0x41/0x70
[52239.190631] ptlrpc_send_new_req+0x596/0xa70 [ptlrpc]
[52239.190631] ? __switch_to_asm+0x35/0x70
[52239.190632] ptlrpc_check_set.part.30+0x725/0x1f20 [ptlrpc]
[52239.190632] ? __switch_to+0x167/0x470
[52239.190633] ? finish_task_switch+0xaa/0x2e0
[52239.190633] ptlrpcd_check+0x3d5/0x5b0 [ptlrpc]
[52239.190633] ptlrpcd+0x374/0x4b0 [ptlrpc]
[52239.190634] ? wake_up_q+0x80/0x80
[52239.190634] ? ptlrpcd_check+0x5b0/0x5b0 [ptlrpc]
[52239.190634] kthread+0x116/0x130
[52239.190635] ? kthread_flush_work_fn+0x10/0x10
[52239.190635] ret_from_fork+0x1f/0x40
| Console: n1618-bmc
No NMI or tracebacks. I see:
|
| Comment by Gerrit Updater [ 01/Feb/23 ] |
|
"Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49864 |
| Comment by Cyril Bordage [ 01/Feb/23 ] |
|
Which version of Lustre is it? I thought it was 2.12.9 with the 4 given patches. In the log we have "o2iblnd_cb.c:3442", but from the code I would expect "o2iblnd_cb.c:3443". |
| Comment by Oliver Mangold [ 02/Feb/23 ] |
|
It was 2.12.9 with the patches from Dean. There were quite a few changes (debug output) before line 3442, so I think an off-by-one is plausible. |
| Comment by Cyril Bordage [ 02/Feb/23 ] |
|
I applied all of Dean's patches, so we should have exactly the same line numbers; I wanted to be sure we are working on the same code base. In my code, the "CWARN" call spans lines 3440 to 3443. What lines do you have? Could you provide the patched files so I can compare? In the meantime, I will keep trying to find a path that shows a lock problem. |
| Comment by Oliver Mangold [ 02/Feb/23 ] |
|
I just attached to the ticket the patches to be applied on top of 2.12.9. |
| Comment by Cyril Bordage [ 02/Feb/23 ] |
|
Thank you, Oliver. I did not know Serguei's patch was also deployed.
Here is my analysis of the timeout errors. In n0428-31-01-2023-journal.log, we can see these two LNet messages (with telegraf errors in between):
Jan 30 16:08:36 n0428 kernel: LNetError: 2371:0:(o2iblnd_cb.c:3442:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
Jan 30 16:08:54 n0428 telegraf[2037]: 2023-01-30T15:08:54Z W! [inputs.exec] Collection took longer than expected; not complete after interval of 10s
Jan 30 16:08:54 n0428 telegraf[2037]: 2023-01-30T15:08:54Z E! [agent] Error killing process: os: process already finished
Jan 30 16:08:55 n0428 kernel: LNetError: 2371:0:(o2iblnd_cb.c:3517:kiblnd_check_conns()) Timed out RDMA with 10.4.104.103@o2ib1 (19): c: 27, oc: 0, rc: 32
But they are separated by 19 seconds, with errors in the middle saying that collection for telegraf took too long. That seems to confirm what we saw in the previous error.
The rate of debug messages could be very high (more than 1000/s). We will see with the new patch whether the problem disappears.
|
| Comment by Gerrit Updater [ 03/Feb/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49714/ |
| Comment by Cyril Bordage [ 06/Feb/23 ] |
|
Hello Oliver, was the patch deployed? Thank you.
|
| Comment by Benedikt Schaefer [ 07/Feb/23 ] |
|
Hi,
The patched client is currently being rolled out on the cluster. ~250 nodes are running with the new Lustre client now.
Best regards, Benedikt |
| Comment by Benedikt Schaefer [ 14/Feb/23 ] |
|
Current state: ~600 clients are running the patched client. We still see Lustre errors on some clients, but the nodes recovered and did not get stuck. |
| Comment by Peter Jones [ 14/Feb/23 ] |
|
Fix landed for 2.16 so closing ticket. Should port fix to b2_15 for inclusion in a future 2.15.x release. |
| Comment by Gerrit Updater [ 06/Mar/23 ] |
|
"Xing Huang <hxing@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50214 |
| Comment by Gerrit Updater [ 06/Apr/23 ] |
|
"Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50564 |
| Comment by Gerrit Updater [ 11/Apr/23 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50214/ |