[LU-16349] Excessive number of OPA disconnects / LNET network errors in cluster Created: 29/Nov/22  Updated: 01/May/23  Resolved: 14/Feb/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: Lustre 2.16.0, Lustre 2.15.3

Type: Bug Priority: Major
Reporter: Oliver Mangold Assignee: Cyril Bordage
Resolution: Fixed Votes: 0
Labels: None
Environment:

Server: Lustre 2.12.6 or 2.12.9 on official Centos 7 Lustre kernels for the respective versions
Client: Centos 8 with various kernels and various Lustre 2.12.x versions (including 2.12.8 and 2.12.9)


Attachments: File minimal-fix.patch.gz     File no-post.patch.gz     File o2iblnd-debug.tar.gz     File testing-patches-230119.patch    
Issue Links:
Duplicate
is duplicated by LU-16244 after dropping router during ior lst ... Resolved
Related
is related to LU-16484 Linux kernel BUG when deleting and ad... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We see massive unresponsiveness of Lustre on many clients. Sometimes there are temporary stalls (several minutes) which eventually go away; sometimes only rebooting the client helps.

We suspected OPA at first, but couldn't find any problems with RDMA when it is used by other workloads (e.g. MPI).

The problem has been ongoing for a long time and is completely mysterious to us.

Typically, when the issue occurs, kernel messages of the kind shown below appear.

OPA counters do not show any errors, and according to the customer there are no network problems for compute workloads (MPI, etc.).

It doesn't seem to make a meaningful difference which Lustre 2.12.x versions are installed on servers/clients. Apparently the fix from LU-14733 is not sufficient.

What can we do to resolve the problem?

Server:

[3298078.549239] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff9114021dd800
[3298078.560918] LustreError: 155816:0:(ldlm_lib.c:3338:target_bulk_io()) @@@ Reconnect on bulk WRITE  req@ffff9115bbdb2050 x1739338816146624/t0(0) o4->2f66151f-7d0d-7f3c-dee4-35be6a0f2efc@10.4.16.11@o2ib1:690/0 lens 488/448 e 0 to 0 dl 1664192075 ref 1 fl Interpret:/2/0 rc 0/0
[3298078.562601] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9111f8549800
...
[3298079.099509] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff911704472800
[3298079.801646] LNetError: 24838:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Error sending GET to 12345-10.4.16.11@o2ib1: -125
[3298079.814642] LNetError: 24838:0:(lib-move.c:2955:lnet_resend_pending_msgs_locked()) Skipped 68 previous similar messages
[3298079.826073] LustreError: 24838:0:(events.c:450:server_bulk_callback()) event type 5, status -125, desc ffff91169debe400
...
[3298166.354998] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 3, status -103, desc ffff91152a79d400
[3298166.366511] LustreError: 156019:0:(ldlm_lib.c:3344:target_bulk_io()) @@@ network error on bulk WRITE  req@ffff91154f9b4850 x1739338816563968/t0(0) o4->2f66151f-7d0d-7f3c-dee4-35be6a0f2efc@10.4.16.11@o2ib1:23/0 lens 488/448 e 0 to 0 dl 1664192163 ref 1 fl Interpret:/0/0 rc 0/0
[3298166.392860] LustreError: 156019:0:(ldlm_lib.c:3344:target_bulk_io()) Skipped 286 previous similar messages
[3298166.411524] LustreError: 24824:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff911524fb7c00
[3298166.422885] LustreError: 24827:0:(events.c:450:server_bulk_callback()) event type 3, status -5, desc ffff9115a243b400

Client:

[5453641.210037] LustreError: 2380:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 00000000292896ee
[5454253.579062] Lustre: 2475:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1664191251/real 1664191251]  req@00000000d70c7694 x1739323570965888/t0(0) o3->work-OST0004-osc-ffff888108388000@10.4.104.104@o2ib1:6/4 lens 488/4536 e 0 to 1 dl 1664191352 ref 2 fl Rpc:X/2/ffffffff rc -11/-1
[5454253.608388] Lustre: 2475:0:(client.c:2169:ptlrpc_expire_one_request()) Skipped 38 previous similar messages
[5454253.618300] Lustre: work-OST0004-osc-ffff888108388000: Connection to work-OST0004 (at 10.4.104.104@o2ib1) was lost; in progress operations using this service will wait for recovery to complete
[5454253.635574] Lustre: Skipped 36 previous similar messages
[5454253.641478] Lustre: work-OST0004-osc-ffff888108388000: Connection restored to 10.4.104.104@o2ib1 (at 10.4.104.104@o2ib1)
[5454253.652508] Lustre: Skipped 37 previous similar messages
[5454253.676598] LNetError: 2379:0:(o2iblnd_cb.c:1034:kiblnd_post_tx_locked()) Error -22 posting transmit to 10.4.104.104@o2ib1
[5454253.687807] LNetError: 2379:0:(o2iblnd_cb.c:1034:kiblnd_post_tx_locked()) Skipped 25 previous similar messages
[5454559.560649] LustreError: 2379:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 000000000f1e9a15
[5454559.587903] LustreError: 2381:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 0000000068d1ba49
[5454559.599428] LustreError: 2378:0:(events.c:205:client_bulk_callback()) event type 1, status -22, desc 0000000068d1ba49

 



 Comments   
Comment by Cyril Bordage [ 10/Jan/23 ]

%% Just a copy of what Holger reported %%

 

I will share a theory from Cornelis here, which is not yet confirmed.

 

If you are willing, I have a failure theory I’d like you to test.

 

First, the theory:

-----------------------

Background: When an lnet pool mr is created, rdmavt assigns a key to the mr. When lnet maps an mr, it (1) creates an invalidate wr with the current mr key and then increments (changes) the key in the mr, and (2) creates a wr that does a fast memory register with the updated key. A fast memory register will change the key rdmavt remembers for the mr. The intent is as follows: on first use, the two created wrs are prepended to the operation using the mr (e.g. a read or write). Together, the two wrs invalidate the original mr and fast-register the mr with a new key.

 

Failure scenario: A pool mr is mapped, then unmapped without being used.  If this is ever done, the keys in the mr and rdmavt for the memory region are forever out of sync because the invalidate and re-register never occurred for rdmavt but the key increment in the mr did occur.  The next time the mr is mapped, the created invalidate will have a key that does not match rdmavt's version of the key.
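To make the key bookkeeping concrete, here is a minimal, self-contained toy model of the sequence described above (plain userspace C, not o2iblnd code; all names such as toy_mr, drv_key and lnd_key are invented for illustration). A map that is unmapped without ever being posted leaves the driver-side key behind, so the next post fails:

/*
 * Toy model of the bookkeeping described above (NOT o2iblnd code).
 * "drv_key" stands for the key rdmavt remembers for the MR, "lnd_key"
 * for the key o2iblnd keeps in the ib_mr; both names are invented.
 */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

struct toy_mr {
	uint32_t drv_key;	/* key the driver (rdmavt) will accept */
	uint32_t lnd_key;	/* key LNet/o2iblnd believes the MR has */
};

/* Map: remember the key the invalidate wr would carry, then bump the
 * key for the fast-register wr that is supposed to follow. */
static uint32_t toy_map(struct toy_mr *mr)
{
	uint32_t inv_key = mr->lnd_key;

	mr->lnd_key++;		/* key increment happens at map time */
	return inv_key;
}

/* Post: only now do the invalidate + fast register reach the driver,
 * and only if the invalidate key matches what the driver expects. */
static bool toy_post(struct toy_mr *mr, uint32_t inv_key)
{
	if (inv_key != mr->drv_key)
		return false;	/* ib_post_send() fails in the real code */
	mr->drv_key = mr->lnd_key;
	return true;
}

int main(void)
{
	struct toy_mr mr = { .drv_key = 0x646500, .lnd_key = 0x646500 };
	uint32_t inv;

	inv = toy_map(&mr);	/* mapped ... */
	/* ... and unmapped without being posted: drv_key never updated */

	inv = toy_map(&mr);	/* next use of the same pool mr */
	printf("post %s: inv_key=0x%08x drv_key=0x%08x\n",
	       toy_post(&mr, inv) ? "succeeded" : "FAILED", inv, mr.drv_key);
	return 0;
}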

 

Second, the test:

-----------------------

Proposed test plan for my theory:

  • Add debugging:
    • kiblnd_fmr_pool_unmap(): just before where fmr_frd is set to false, check its value. If it is already false, then the mr was mapped and is now being unmapped without having been used. This mr has entered the suspect error condition. Print the mr pointer, frd->frd_mr, for comparison purposes.
    • kiblnd_post_tx_locked(): if the call to ib_post_send() fails and frd was used, print out the value of the failing mr address, frd->frd_mr, for comparison purposes.
  • Test steps:
    • On the Lustre client, unload and reload o2iblnd to make sure all pool mrs are good.
    • Add debugging as described above, using one of: (a) compiled into o2iblnd, (b) kprobe, (c) ktrace, (d) systemtap.
    • Recreate the failure in kiblnd_post_tx_locked(): ib_post_send() returns an error, -22.
    • Compare the mr address printed in the kiblnd_post_tx_locked() debugging with any mr addresses that were printed in kiblnd_fmr_pool_unmap() due to not being used after being mapped.
  • Notes:
    • If there are any prints from kiblnd_fmr_pool_unmap(), then, IMO, there is a problem.
    • It would be best if the rdmavt debugging were also used to gather additional failure key information.

 

I have attached a patch for o2iblnd that adds debugging prints.  The code is uncompiled and untested.

 

-Dean
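For readers without the attachment, debug prints of the kind proposed above might look roughly like the fragment below. This is only a sketch; field names such as frd->frd_posted are assumed and may not match the 2.12 tree exactly, and the attached no-post.patch.gz remains the authoritative version.

/* In kiblnd_fmr_pool_unmap(), just before the "posted"/"used" flag is
 * cleared (field name assumed here): */
if (!frd->frd_posted)
	CWARN("mr %p was mapped but never posted\n", frd->frd_mr);

/* In kiblnd_post_tx_locked(), when ib_post_send() fails and a fast-reg
 * descriptor was used: */
if (rc != 0 && frd != NULL)
	CWARN("ib_post_send() failed rc %d on mr %p\n", rc, frd->frd_mr);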

Comment by Cyril Bordage [ 10/Jan/23 ]

Could someone provide the proposed patch to check it?

Comment by Dean Luick [ 10/Jan/23 ]

I have attached the debug patch I sent to Holger.  The file is no-post.patch.gz.

Comment by Dean Luick [ 17/Jan/23 ]

I attempted to force a map-unmap sequence using a 2-node setup and lnetctl, but was unable to; the attempt resulted in a kernel BUG.  See LU-16484 for details.

I have added more o2iblnd debugging, including a patch that will perform a pool mr map-unmap after a given number of uses.

Here is a mr being forced to map-unmap without being posted:

Jan 12 10:43:10 opahsx121 kernel: kiblnd_fmr_pool_map : mr 0xffff948ecf043800, rkey 0x00646500, is_rx 1, count 100
Jan 12 10:43:10 opahsx121 kernel: kiblnd_fmr_map_tx: forcing map-unmap on mr 0xffff948ecf043800
Jan 12 10:43:10 opahsx121 kernel: kiblnd_fmr_pool_unmap: mr 0xffff948ecf043800, rkey 0x00646501, posted 0
Jan 12 10:43:10 opahsx121 kernel: kiblnd_fmr_pool_unmap: mr 0xffff948ecf043800 not posted 

Line 4 is the debug line that I sent to NEC. The other lines are extra debug lines I added locally to print more information. Line 1 shows that this is the 100th pool mr map - a call to both kiblnd_fmr_map_tx() and kiblnd_fmr_pool_map(). Line 2 is the commentary that I am forcing the map-unmap condition on a pool mr. On line 3, note that the rkey is incremented from its original value in line 1. This is the issue/problem that occurs when the mr is unmapped without being posted - the mr key is changed in o2iblnd, but rdmavt still holds the original key.

I then had to wait for the same pool mr to be re-used. Here is what happened on re-use:

Jan 12 10:51:11 opahsx121 kernel: kiblnd_fmr_pool_map : mr 0xffff948ecf043800, rkey 0x00646501, is_rx 1, count 1124
Jan 12 10:51:11 opahsx121 kernel: kiblnd_post_tx_locked: mr 0xffff948ecf043800, posted 0, err -22, send failure
Jan 12 10:51:11 opahsx121 kernel: LNetError: 10773:0:(o2iblnd_cb.c:1054:kiblnd_post_tx_locked()) Error -22 posting transmit to 192.168.72.174@o2ib
Jan 12 10:51:11 opahsx121 kernel: kiblnd_fmr_pool_unmap: mr 0xffff948ecf043800, rkey 0x00646502, posted 0
Jan 12 10:51:11 opahsx121 kernel: kiblnd_fmr_pool_unmap: mr 0xffff948ecf043800 not posted
Jan 12 10:51:11 opahsx121 kernel: LustreError: 10773:0:(events.c:205:client_bulk_callback()) event type 2, status -22, desc 0000000010390fca
Jan 12 10:51:11 opahsx121 kernel: Lustre: 13263:0:(client.c:2169:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1673538671/real 1673538671] req@00000000b9ba836a x1754831869785472/t0(0) o37->temp-MDT0000-mdc-ffff948000960000@192.168.72.174@o2ib:23/10 lens 448/440 e 0 to 1 dl 1673538678 ref 2 fl Rpc:eX/0/ffffffff rc 0/-1
Jan 12 10:51:11 opahsx121 kernel: Lustre: temp-MDT0000-mdc-ffff948000960000: Connection to temp-MDT0000 (at 192.168.72.174@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Jan 12 10:51:11 opahsx121 kernel: Lustre: temp-MDT0000-mdc-ffff948000960000: Connection restored to 192.168.72.174@o2ib (at 192.168.72.174@o2ib) 

Lines 1, 4, and 5 are my debugging prints. On line 1, note that the count is now 1124; I had to wait 1024 pool mr uses before my victim mr was reused. Line 2 is the failure reported by NEC/ULM that was traced to a key mismatch. On line 4, note that the rkey is again incremented. In other words, it stays wrong and will continue to be wrong until the user portion of the key (8 bits) rolls over and re-matches what rdmavt expects. Line 5, again, is the debug print sent to NEC.
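For reference, the rkey bump seen above changes only the low byte of the key, which is why the mismatch persists until that 8-bit "user" portion rolls over and happens to re-match rdmavt's key. The kernel's generic helper for bumping an rkey looks like the following (as found in include/rdma/ib_verbs.h; exact form may vary across kernel versions):

static inline u32 ib_inc_rkey(u32 rkey)
{
	const u32 mask = 0x000000ff;

	/* only the low 8 bits change; the rest of the rkey is preserved */
	return ((rkey + 1) & mask) | (rkey & ~mask);
}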

On the server, this is what I see:

Jan 12 10:51:11 opahsx174 kernel: LustreError: 30188:0:(events.c:450:server_bulk_callback()) event type 5, status -103, desc ffff9d2ab07c6400
Jan 12 10:51:11 opahsx174 kernel: Lustre: temp-MDT0000: Client c4c295b0-1df0-37a4-3ada-e906e22cb42e (at 192.168.72.121@o2ib) reconnecting
Jan 12 10:51:12 opahsx174 kernel: LustreError: 11726:0:(ldlm_lib.c:3338:target_bulk_io()) @@@ Reconnect on bulk READ req@ffff9d2ba987d050 x1754831869785472/t0(0) o37->c4c295b0-1df0-37a4-3ada-e906e22cb42e@192.168.72.121@o2ib:392/0 lens 448/440 e 0 to 0 dl 1673538677 ref 1 fl Interpret:/0/0 rc 0/0
Jan 12 12:10:21 opahsx174 kernel: Lustre: MGS: Client 5188c476-a2eb-f754-b30a-890947140fdf (at 192.168.72.121@o2ib) reconnecting
Jan 12 12:10:21 opahsx174 kernel: Lustre: Skipped 1 previous similar message
Jan 12 12:10:21 opahsx174 kernel: Lustre: MGS: Connection restored to 0be70565-3123-a02c-eb27-972e46a2c761 (at 192.168.72.121@o2ib)
Jan 12 12:10:21 opahsx174 kernel: Lustre: Skipped 4 previous similar messages 

Lines 1-3 match the time of the client failure. Lines 4-7 are 19 minutes later. I don’t know if the latter lines are related. Line 1 matches one of the server error lines reported by NEC/ULM.

Summary:

I think this demonstrates that there is a bug in o2iblnd when a pool mr is mapped and unmapped without a post. When this occurs, we see messages similar to those seen by NEC/ULM. What this doesn't demonstrate is that this is actually happening at NEC/ULM. The demonstration above was forced by an explicit coding change that simulates an error condition that may or may not be possible in practice.

Comment by Dean Luick [ 17/Jan/23 ]

The extra debugging and forced map-unmap are two new patches on top of the original patch sent to NEC/ULM and previously attached in this ticket.  I have saved all as a patch series and attached that here as file o2iblnd-debug.tar.gz.

o2iblnd-debug.tar.gz

Comment by Cyril Bordage [ 18/Jan/23 ]

Hello Dean,

thank you for this demonstration. It is pretty clear.

You are right, this shows some problem, but we cannot be sure it is the one. I do not see why this issue keeps happening on this machine, and not at other sites. I will focus on writing a patch, and will get back to you if I cannot manage to test it with your debugging process.

My concern is that we will not know whether the patch is what prevents new crashes. When the patch is accepted, would it be possible not to patch all clients, so we can see whether the problem still happens on unpatched nodes?

Comment by Dean Luick [ 18/Jan/23 ]

I have attached a proposed fix for the issue.  I have tested it and it allows o2iblnd to correctly ride through a map-unmap without a key mismatch.  This change is "minimal" in that it changes only enough to avoid the problem.  This patch sits on top of the debugging and forced map-unmap patches - there may be some hand editing needed when applying without the debug patches.  The changes are simple enough that this should not be a problem.  The patch file is minimal-fix.patch.gz.

minimal-fix.patch.gz
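Without reproducing the attachment here, one way a minimal change along these lines could look is sketched below. This is an illustration of the idea only (advance and invalidate the key only when the previous mapping was actually posted); the frd_posted field is assumed, and the attached minimal-fix.patch.gz remains the authoritative change.

/* Sketch only, in the map path: advance the rkey only if the previous
 * use of this frd actually reached the HCA, so o2iblnd's key never
 * runs ahead of the key rdmavt still holds.  frd_posted is assumed to
 * be set after a successful ib_post_send() and cleared on unmap. */
if (frd->frd_posted)
	key = ib_inc_rkey(mr->rkey);	/* old key was consumed, move on */
else
	key = mr->rkey;			/* mapped but never posted: reuse key */

ib_update_fast_reg_key(mr, key);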

Comment by Gerrit Updater [ 19/Jan/23 ]

"Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49714
Subject: LU-16349 o2iblnd: Fix key mismatch issue
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: eb697083c32d19dbdb0e0f38d2bbef183fc113bf

Comment by Cyril Bordage [ 19/Jan/23 ]

Thank you, Dean. Your patch seemed fine, so I pushed it.

Comment by Cyril Bordage [ 24/Jan/23 ]

Just a quick update.

Waiting for the patch to be reviewed. It should be done soon. Then, I will backport it.

Comment by Cyril Bordage [ 27/Jan/23 ]

Hello,

Could you post the details of the last failure, with all logs, here?

Thank you.

Comment by Dean Luick [ 31/Jan/23 ]

At NEC's request, Ulm has been running a modified version of Lustre 2.12.9 on the client side.  Known changes applied by NEC are attached to ticket as:

 

o2iblnd-debug.tar.gz (includes no-post.patch.gz)
minimal-fix.patch.gz

NEC sent the journal log of n0802 which includes the debugging output.  A review of the log file shows:

  1. Several cases of an mr being mapped then unmapped without being posted.  These same mrs were successfully used later.
  2. There were no cases of kiblnd_post_tx_locked() returning Error -22 - a sign of the rkey mismatch.
  3. There were other Lustre issues seen within the journal log.

In summary, I think the MR patch proposed by Cornelis fixes the observed rkey mismatch issue.

Comment by Dean Luick [ 31/Jan/23 ]

Ulm reported node crashes last weekend.  Here is a review of the set of 6 BMC console logs sent by NEC to Cornelis.

Summary:

  • 5 of the 6 logs show a watchdog NMI following an LNet error.  The 6th log shows no fatal error - it looks like a normal start and shutdown.
  • All fatal NMI watchdog timeouts are preceded by the same LNet error.
  • 4 of the 5 NMI tracebacks are the same.  The call stack includes the modules ptlrpc, lnet, ko2iblnd.
  • 1 of the 5 NMI tracebacks is slightly different.  This is on a kernel workqueue thread.  The call stack includes the modules ib_cm, rdma_cm, ko2iblnd.
  • None of the traces involve hfi1 or rdmavt.
  • The NMI watchdog failures indicate deadlock of some form.
  • My tentative conclusion: No evidence of an OPA issue at this time.
  • My guess, with no evidence, is that the NMI is due to a code error path that contains a circular lock dependency.  The cause of the original error is unknown.
  • Alternative theory: There is a rare lockup that eventually triggers the LNet timeout error, then later the NMI watchdog.

 


Console: n0428-bmc

There is an LNet error in the log, then 6 seconds later a Watchdog NMI.  This is followed 1 second later by a fatal NMI.

[569811.333835] LNetError: 2371:0:(o2iblnd_cb.c:3442:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[569830.918525] NMI watchdog: Watchdog detected hard LOCKUP on cpu 50

Then 1 second later, another NMI, this one fatal, on the same CPU.   Same traceback, although obscured by the additional traceback path through the NMI.

Other observations:

  • None of the tracebacks have hfi or rdmavt in them
  • The tracebacks do have functions in ib_cm, rdma_cm, and ko2iblnd

This is running on

[569830.918544] Workqueue: ib_cm cm_work_handler [ib_cm]

First traceback:

[569830.918552] Call Trace:
[569830.918552]  queued_write_lock_slowpath+0x75/0x80
[569830.918553]  _raw_write_lock_irqsave+0x2b/0x30
[569830.918553]  kiblnd_close_conn+0x1b/0x40 [ko2iblnd]
[569830.918554]  kiblnd_cm_callback+0x9e0/0x2270 [ko2iblnd]
[569830.918554]  cma_cm_event_handler+0x25/0xd0 [rdma_cm]
[569830.918554]  cma_ib_handler+0xa7/0x2e0 [rdma_cm]
[569830.918555]  cm_process_work+0x22/0xf0 [ib_cm]
[569830.918555]  cm_work_handler+0xa77/0x1410 [ib_cm]
[569830.918555]  ? __switch_to_asm+0x41/0x70
[569830.918556]  ? __switch_to_asm+0x35/0x70
[569830.918556]  ? __switch_to_asm+0x41/0x70
[569830.918556]  ? __switch_to_asm+0x35/0x70
[569830.918557]  ? __switch_to_asm+0x41/0x70
[569830.918557]  ? __switch_to_asm+0x35/0x70
[569830.918557]  ? __switch_to_asm+0x41/0x70
[569830.918558]  ? __switch_to+0x167/0x470
[569830.918558]  ? finish_task_switch+0xaa/0x2e0
[569830.918558]  process_one_work+0x1a7/0x360
[569830.918559]  ? create_worker+0x1a0/0x1a0
[569830.918559]  worker_thread+0x30/0x390
[569830.918559]  ? create_worker+0x1a0/0x1a0
[569830.918559]  kthread+0x116/0x130
[569830.918560]  ? kthread_flush_work_fn+0x10/0x10
[569830.918560]  ret_from_fork+0x1f/0x40

 


Console: n0501-bmc

There is an LNet error in the log, then 6 seconds later a Watchdog NMI.  This is followed 1 second later by a fatal NMI.

 

[638125.931425] LNetError: 2377:0:(o2iblnd_cb.c:3442:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[638141.798616] NMI watchdog: Watchdog detected hard LOCKUP on cpu 9

Observations:

  • Different traceback than the first node:
    • no connection manager calls
    • not on a worker thread

First Traceback:

 

[638141.798641] Call Trace:
[638141.798641]  queued_read_lock_slowpath+0x6e/0x80
[638141.798642]  _raw_read_lock_irqsave+0x31/0x40
[638141.798642]  kiblnd_launch_tx+0x41/0xa70 [ko2iblnd]
[638141.798642]  ? lnet_copy_iov2iov+0x158/0x250 [lnet]
[638141.798643]  kiblnd_send+0x1f9/0x9a0 [ko2iblnd]
[638141.798643]  lnet_ni_send+0x42/0xd0 [lnet]
[638141.798643]  lnet_send+0x7e/0x1b0 [lnet]
[638141.798644]  LNetPut+0x2b7/0xaf0 [lnet]
[638141.798644]  ? cfs_percpt_unlock+0x15/0xb0 [libcfs]
[638141.798644]  ptl_send_buf+0x20b/0x560 [ptlrpc]
[638141.798645]  ptl_send_rpc+0x462/0xda0 [ptlrpc]
[638141.798645]  ? __switch_to_asm+0x41/0x70
[638141.798645]  ptlrpc_send_new_req+0x596/0xa70 [ptlrpc]
[638141.798646]  ? __switch_to_asm+0x35/0x70
[638141.798646]  ptlrpc_check_set.part.30+0x725/0x1f20 [ptlrpc]
[638141.798646]  ? __switch_to+0x167/0x470
[638141.798647]  ? finish_task_switch+0xaa/0x2e0
[638141.798647]  ptlrpcd_check+0x3d5/0x5b0 [ptlrpc]
[638141.798647]  ptlrpcd+0x374/0x4b0 [ptlrpc]
[638141.798648]  ? wake_up_q+0x80/0x80
[638141.798648]  ? ptlrpcd_check+0x5b0/0x5b0 [ptlrpc]
[638141.798648]  kthread+0x116/0x130
[638141.798649]  ? kthread_flush_work_fn+0x10/0x10
[638141.798649]  ret_from_fork+0x1f/0x40

 


Console: n0515-bmc

There is an LNet error in the log, then 6 seconds later a Watchdog NMI.  This is followed 1 second later by a fatal NMI.

 

[39917.837921] LNetError: 2365:0:(o2iblnd_cb.c:3442:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[39930.022676] NMI watchdog: Watchdog detected hard LOCKUP on cpu 66

Observations:

  • This traceback matches the second node.

First traceback:

 

[39930.022703] Call Trace:
[39930.022703]  queued_read_lock_slowpath+0x6e/0x80
[39930.022704]  _raw_read_lock_irqsave+0x31/0x40
[39930.022704]  kiblnd_launch_tx+0x41/0xa70 [ko2iblnd]
[39930.022704]  ? lnet_copy_iov2iov+0x158/0x250 [lnet]
[39930.022705]  kiblnd_send+0x1f9/0x9a0 [ko2iblnd]
[39930.022705]  lnet_ni_send+0x42/0xd0 [lnet]
[39930.022705]  lnet_send+0x7e/0x1b0 [lnet]
[39930.022706]  LNetPut+0x2b7/0xaf0 [lnet]
[39930.022706]  ? cfs_percpt_unlock+0x15/0xb0 [libcfs]
[39930.022706]  ptl_send_buf+0x20b/0x560 [ptlrpc]
[39930.022707]  ptl_send_rpc+0x462/0xda0 [ptlrpc]
[39930.022707]  ? __switch_to_asm+0x41/0x70
[39930.022707]  ptlrpc_send_new_req+0x596/0xa70 [ptlrpc]
[39930.022708]  ? __switch_to_asm+0x35/0x70
[39930.022708]  ptlrpc_check_set.part.30+0x725/0x1f20 [ptlrpc]
[39930.022708]  ? __switch_to+0x167/0x470
[39930.022708]  ? finish_task_switch+0xaa/0x2e0
[39930.022709]  ptlrpcd_check+0x3d5/0x5b0 [ptlrpc]
[39930.022709]  ptlrpcd+0x374/0x4b0 [ptlrpc]
[39930.022709]  ? wake_up_q+0x80/0x80
[39930.022710]  ? ptlrpcd_check+0x5b0/0x5b0 [ptlrpc]
[39930.022710]  kthread+0x116/0x130
[39930.022710]  ? kthread_flush_work_fn+0x10/0x10
[39930.022711]  ret_from_fork+0x1f/0x40

 


Console: n0601-bmc

There is an LNet error in the log, then 24 seconds later a Watchdog NMI.  This is followed 1 second later by a fatal NMI.

 

[576238.943801] LNetError: 2377:0:(o2iblnd_cb.c:3442:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[576262.045550] NMI watchdog: Watchdog detected hard LOCKUP on cpu 12Modules linked in: mgc(OE) lustre(OE)

Observations:

  • Same traceback as nodes #2 and #3 (this is #4).
  • So far, only dump 1 is different (on a worker thread, different call stack).

First traceback:

 

[576262.045570] Call Trace:
[576262.045570]  _raw_read_lock_irqsave+0x31/0x40
[576262.045570]  kiblnd_launch_tx+0x41/0xa70 [ko2iblnd]
[576262.045571]  ? cfs_percpt_lock+0x16/0xf0 [libcfs]
[576262.045571]  kiblnd_send+0x1f9/0x9a0 [ko2iblnd]
[576262.045571]  lnet_ni_send+0x42/0xd0 [lnet]
[576262.045571]  lnet_send+0x7e/0x1b0 [lnet]
[576262.045572]  lnet_finalize+0x41d/0xf90 [lnet]
[576262.045572]  kiblnd_tx_done+0x119/0x2f0 [ko2iblnd]
[576262.045572]  ? kiblnd_handle_completion+0xc6/0x1a0 [ko2iblnd]
[576262.045573]  kiblnd_handle_rx+0x38a/0x680 [ko2iblnd]
[576262.045573]  kiblnd_scheduler+0x1036/0x10b0 [ko2iblnd]
[576262.045573]  ? wake_up_q+0x80/0x80
[576262.045573]  ? __switch_to_asm+0x41/0x70
[576262.045574]  ? kiblnd_cq_event+0x80/0x80 [ko2iblnd]
[576262.045574]  kthread+0x116/0x130
[576262.045574]  ? kthread_flush_work_fn+0x10/0x10
[576262.045575]  ret_from_fork+0x1f/0x40

 

 


Console: n0922-bmc

There is an LNet error in the log, then 16 seconds later a Watchdog NMI.  This is followed 1 second later by a fatal NMI.

 

[52223.817213] LNetError: 2353:0:(o2iblnd_cb.c:3442:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
[52239.190601] NMI watchdog: Watchdog detected hard LOCKUP on cpu 71

Observations:

  • Same traceback as nodes #2, #3, and #4 (this is #5).
  • So far, only dump 1 is different (on a worker thread, different call stack).

First call trace:

 

[52239.190627] Call Trace:
[52239.190627]  queued_read_lock_slowpath+0x6e/0x80
[52239.190628]  _raw_read_lock_irqsave+0x31/0x40
[52239.190628]  kiblnd_launch_tx+0x41/0xa70 [ko2iblnd]
[52239.190628]  ? lnet_copy_iov2iov+0x158/0x250 [lnet]
[52239.190629]  kiblnd_send+0x1f9/0x9a0 [ko2iblnd]
[52239.190629]  lnet_ni_send+0x42/0xd0 [lnet]
[52239.190629]  lnet_send+0x7e/0x1b0 [lnet]
[52239.190629]  LNetPut+0x2b7/0xaf0 [lnet]
[52239.190630]  ? cfs_percpt_unlock+0x15/0xb0 [libcfs]
[52239.190630]  ptl_send_buf+0x20b/0x560 [ptlrpc]
[52239.190630]  ptl_send_rpc+0x462/0xda0 [ptlrpc]
[52239.190631]  ? __switch_to_asm+0x41/0x70
[52239.190631]  ptlrpc_send_new_req+0x596/0xa70 [ptlrpc]
[52239.190631]  ? __switch_to_asm+0x35/0x70
[52239.190632]  ptlrpc_check_set.part.30+0x725/0x1f20 [ptlrpc]
[52239.190632]  ? __switch_to+0x167/0x470
[52239.190633]  ? finish_task_switch+0xaa/0x2e0
[52239.190633]  ptlrpcd_check+0x3d5/0x5b0 [ptlrpc]
[52239.190633]  ptlrpcd+0x374/0x4b0 [ptlrpc]
[52239.190634]  ? wake_up_q+0x80/0x80
[52239.190634]  ? ptlrpcd_check+0x5b0/0x5b0 [ptlrpc]
[52239.190634]  kthread+0x116/0x130
[52239.190635]  ? kthread_flush_work_fn+0x10/0x10
[52239.190635]  ret_from_fork+0x1f/0x40

 


Console: n1618-bmc

No NMI or tracebacks.  I see:

  • Normal boot output.
  • One lustre error.  Not like the final errors seen in the other log files examined.
  • Repeated messages for CPU throttling due to temperature above threshold.
  • At the end, a controlled shutdown.

 

Comment by Gerrit Updater [ 01/Feb/23 ]

"Cyril Bordage <cbordage@whamcloud.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/49864
Subject: LU-16349 o2iblnd: Fix key mismatch issue
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 15eb62f2ef0c24a91fee049ef5d324cc40074a80

Comment by Cyril Bordage [ 01/Feb/23 ]

Which version of Lustre is it? I thought it was 2.12.9 with the 4 given patches.

In the log we have "o2iblnd_cb.c:3442", but from the code I would expect "o2iblnd_cb.c:3443".

Comment by Oliver Mangold [ 02/Feb/23 ]

It was 2.12.9 with the patches from Dean. There were quite a few changes (debug output) before line 3442, so I think an off-by-one is plausible.

Comment by Cyril Bordage [ 02/Feb/23 ]

I applied all of Dean's patches, so we should have exactly the same line numbers. I wanted to be sure we are working on the same code base. In my code, the "CWARN" call spans lines 3440 to 3443. What lines do you have? Could you provide the patched files for me to compare?

In the meantime, I will keep trying to find a path that shows a lock problem.

Comment by Oliver Mangold [ 02/Feb/23 ]

I just attached to the ticket the patches to be applied on top of 2.12.9.

Comment by Cyril Bordage [ 02/Feb/23 ]

Thank you, Oliver. I did not know Serguei's patch was also deployed.

Here is my analysis of the timeout errors.
The error message from o2iblnd_cb.c:3442 (1) should be followed by another error message from o2iblnd_cb.c:3517 (2), but there is nothing before the watchdog message...
Between the two messages there are only function returns and a test of a variable, so it seems the machine is blocked somehow.
When the code runs normally, the mutex is unlocked right away after message 2.

In n0428-31-01-2023-journal.log, we can see the two messages:

Jan 30 16:08:36 n0428 kernel: LNetError: 2371:0:(o2iblnd_cb.c:3442:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
Jan 30 16:08:54 n0428 telegraf[2037]: 2023-01-30T15:08:54Z W! [inputs.exec] Collection took longer than expected; not complete after interval of 10s
Jan 30 16:08:54 n0428 telegraf[2037]: 2023-01-30T15:08:54Z E! [agent] Error killing process: os: process already finished
Jan 30 16:08:55 n0428 kernel: LNetError: 2371:0:(o2iblnd_cb.c:3517:kiblnd_check_conns()) Timed out RDMA with 10.4.104.103@o2ib1 (19): c: 27, oc: 0, rc: 32

But they are separated by 19 seconds, with an error in the middle saying that the telegraf collection took longer than expected. That seems to confirm what we saw in the previous errors.

 

The rate of debug messages can be very high (more than 1000/s). We will see with the new patch whether the problem disappears.

 

Comment by Gerrit Updater [ 03/Feb/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/49714/
Subject: LU-16349 o2iblnd: Fix key mismatch issue
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 0c93919f1375ce16d42ea13755ca6ffcc66b9969

Comment by Cyril Bordage [ 06/Feb/23 ]

Hello Oliver,

Was the patch deployed?

Thank you.

 

Comment by Benedikt Schaefer [ 07/Feb/23 ]

Hi,

 

The patched client is currently being rolled out on the cluster.

~250 nodes are running the new Lustre client now.

 

best regards

Benedikt

Comment by Benedikt Schaefer [ 14/Feb/23 ]

Current state:

~600 clients are running the patched client.

We still see Lustre errors on some clients, but the nodes recovered and were not stuck.

Comment by Peter Jones [ 14/Feb/23 ]

Fix landed for 2.16, so closing the ticket. The fix should be ported to b2_15 for inclusion in a future 2.15.x release.

Comment by Gerrit Updater [ 06/Mar/23 ]

"Xing Huang <hxing@ddn.com>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50214
Subject: LU-16349 o2iblnd: Fix key mismatch issue
Project: fs/lustre-release
Branch: b2_15
Current Patch Set: 1
Commit: 8811fa2cd9db66d4acdabf17e2cc93ceef5f6752

Comment by Gerrit Updater [ 06/Apr/23 ]

"Gian-Carlo DeFazio <defazio1@llnl.gov>" uploaded a new patch: https://review.whamcloud.com/c/fs/lustre-release/+/50564
Subject: LU-16349 o2iblnd: Fix key mismatch issue
Project: fs/lustre-release
Branch: b2_14
Current Patch Set: 1
Commit: 329448005cefab9fe13da33c50a51f664a8cce66

Comment by Gerrit Updater [ 11/Apr/23 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/c/fs/lustre-release/+/50214/
Subject: LU-16349 o2iblnd: Fix key mismatch issue
Project: fs/lustre-release
Branch: b2_15
Current Patch Set:
Commit: da98e1a6f31462ab76ce7c3a48c21eb4c9eda151
