[LU-15541] Soft lockups in LNetPrimaryNID() and lnet_discover_peer_locked() Created: 10/Feb/22 Updated: 02/Aug/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Olaf Faaland | Assignee: | Serguei Smirnov |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | llnl, topllnl |
| Environment: |
3.10.0-1160.45.1.1chaos.ch6.x86_64 |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We upgraded a lustre server cluster from lustre-2.12.7_2.llnl to lustre-2.12.8_6.llnl. Almost immediately after boot, clients begin reporting soft lockups on the console, with stacks like this:

2022-02-08 09:43:10 [1644342190.528916] Call Trace:
queued_spin_lock_slowpath+0xb/0xf
_raw_spin_lock+0x30/0x40
cfs_percpt_lock+0xc1/0x110 [libcfs]
lnet_discover_peer_locked+0xa0/0x450 [lnet]
? wake_up_atomic_t+0x30/0x30
LNetPrimaryNID+0xd5/0x220 [lnet]
ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
target_handle_connect+0x12f1/0x2b90 [ptlrpc]
? enqueue_task_fair+0x208/0x6c0
? check_preempt_curr+0x80/0xa0
? ttwu_do_wakeup+0x19/0x100
tgt_request_handle+0x4fa/0x1570 [ptlrpc]
? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
? __getnstimeofday64+0x3f/0xd0
ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
? ptlrpc_wait_event+0xb8/0x370 [ptlrpc]
? __wake_up_common_lock+0x91/0xc0
? sched_feat_set+0xf0/0xf0
ptlrpc_main+0xc49/0x1c50 [ptlrpc]
? __switch_to+0xce/0x5a0
? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
kthread+0xd1/0xe0
? insert_kthread_work+0x40/0x40
ret_from_fork_nospec_begin+0x21/0x21
? insert_kthread_work+0x40/0x40

Some servers never exit recovery, and others do but seem to be unable to service requests.

Seen during the same lustre server update where we saw

Patch stacks are: |
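For context on where the spinning happens: cfs_percpt_lock() is libcfs's per-CPU-partition lock, and the discovery path reached from LNetPrimaryNID() contends on it, including in exclusive mode (LNET_LOCK_EX), which has to take every partition. The toy user-space model below is not Lustre code; the partition and thread counts, the loop count and all names are invented. It only illustrates how many service threads funnelling through such an exclusive path end up spinning on the same set of locks, which is the pattern the stacks above suggest.

/*
 * Toy model, not Lustre code: a per-partition lock where "exclusive"
 * acquisition must take every partition, roughly the way
 * lnet_net_lock(LNET_LOCK_EX) takes all partitions via cfs_percpt_lock().
 * NPARTS, NTHREADS and the loop count are arbitrary.
 */
#include <pthread.h>
#include <stdio.h>

#define NPARTS   4              /* stand-in for CPU partitions (CPTs) */
#define NTHREADS 8              /* stand-in for ptlrpc service threads */

static pthread_spinlock_t part_lock[NPARTS];

static void percpt_lock(int cpt)   { pthread_spin_lock(&part_lock[cpt]); }
static void percpt_unlock(int cpt) { pthread_spin_unlock(&part_lock[cpt]); }

/* Exclusive mode: take every partition in order, release in reverse. */
static void percpt_lock_ex(void)
{
        for (int i = 0; i < NPARTS; i++)
                pthread_spin_lock(&part_lock[i]);
}

static void percpt_unlock_ex(void)
{
        for (int i = NPARTS - 1; i >= 0; i--)
                pthread_spin_unlock(&part_lock[i]);
}

/* Each "service thread" repeatedly hits the exclusive (discovery-like) path. */
static void *service_thread(void *arg)
{
        int cpt = (int)(long)arg % NPARTS;

        for (int i = 0; i < 100000; i++) {
                percpt_lock(cpt);       /* cheap per-partition fast path */
                percpt_unlock(cpt);
                percpt_lock_ex();       /* everyone serializes (and spins) here */
                percpt_unlock_ex();
        }
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];

        for (int i = 0; i < NPARTS; i++)
                pthread_spin_init(&part_lock[i], PTHREAD_PROCESS_PRIVATE);
        for (long i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, service_thread, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        printf("done\n");
        return 0;
}

Build with cc -pthread; the point is only that the exclusive acquisition serializes all partitions, not any particular timing.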
| Comments |
| Comment by Olaf Faaland [ 10/Feb/22 ] |
|
behlendo found this patch, which seems to fit the stack reported with the soft lockups |
| Comment by Olaf Faaland [ 10/Feb/22 ] |
|
I've attached vmcore-dmesg.txt files from two servers where this occurred, while they were running 2.12.8_6.llnl and the later of the two kernels I mentioned. We reverted the servers back to the earlier kernel and lustre 2.12.7_2.llnl and saw the same symptoms. |
| Comment by Serguei Smirnov [ 10/Feb/22 ] |
|
Hi Olaf,

With 2.12.7_2.llnl, does the client eventually recover, or is it stuck forever?

If possible, I'd recommend trying lustre-2.12.8_6.llnl with reverted change for

Thanks, Serguei. |
| Comment by Chris Horn [ 10/Feb/22 ] |
|
> Chris Horn can comment why it was abandoned.

I only abandoned it because I found it was not actually the root cause of the issue in that ticket. I can revive the patch for b2_12 if you would like it landed there. |
| Comment by Olaf Faaland [ 10/Feb/22 ] |
|
Hi Serguei,

> With 2.12.7_2.llnl, does the client eventually recover, or is it stuck forever?

I think after "mount -t lustre <lustre-target> <dir>" the longest I waited was about 20 minutes. At that point most of the 52 targets (16 MDTs, 36 OSTs) had exited recovery, but a couple had not. Also at that point, nodes started crashing.

> If possible, I'd recommend trying lustre-2.12.8_6.llnl with reverted change for

I tried that, and saw the same soft lockups.

thanks, |
| Comment by Olaf Faaland [ 10/Feb/22 ] |
|
I have crash dumps as well, if that helps. |
| Comment by Serguei Smirnov [ 10/Feb/22 ] |
|
Olaf,

Are clients being upgraded at the same time and to the same version as the server? What was the last version which doesn't cause lockups?

I'm trying to identify the change which could have introduced this problem. That would make it easier to see if the fix is already available.

Thanks, Serguei. |
| Comment by Olaf Faaland [ 10/Feb/22 ] |
|
Hi Serguei,

> Are clients being upgraded at the same time and to the same version as the server?

No. We have 10-15 compute clusters and 3 server clusters; each server cluster hosts one file system. We typically update 3-5 clusters per day.

The "old version" is 2.12.7_2.llnl. When this happened Tuesday, we were on the first day of our update cycle. 2 compute clusters were updated to 2.12.8_6.llnl at the same time that 1 server cluster (copper, where we saw the issue) was updated to 2.12.8_6.llnl. The remainder of the clusters were still at 2.12.7_2.llnl.

During the update attempt on Tuesday, we initially brought copper up with 2.12.8_6.llnl and saw these soft lockups and also

> What was the last version which doesn't cause lockups?

Unfortunately, I don't know. It seems to me that the issue exists in both 2.12.7_2.llnl and 2.12.8_6.llnl. I'll look back and see if I have console logs indicating this issue prior to this week. |
| Comment by Chris Horn [ 10/Feb/22 ] |
FWIW, this patch is really just an optimization for a particular case where clients with discovery disabled are mounting a filesystem when one or more servers are down. If all of your servers are up/healthy when the clients mount, or if your clients have discovery enabled, then this patch would not really have an impact.
| Comment by Serguei Smirnov [ 10/Feb/22 ] |
|
One of the following patches may help resolve the issue:

https://review.whamcloud.com/35444
https://review.whamcloud.com/42109
https://review.whamcloud.com/42111

The first one (
|
| Comment by Olaf Faaland [ 14/Feb/22 ] |
|
A little more detail. At the time this happened, the file system was mounted by about 5,400 clients (based on the count of exports reported by OSTs) and was running Lustre 2.12.7_2.llnl.

The process we followed is one we always use - we shut down the file system without umounting the clients (so I/O on clients would hang), then rebooted the server nodes. After all the server nodes came up into the new image which had Lustre 2.12.8_6.llnl, we mounted the Lustre targets within the space of 30 seconds or a minute (first the MGS, then all MDTs in parallel, then all OSTs in parallel). The recovery_status procfile on the servers indicated they all entered recovery almost immediately. So there were a large number of clients all reconnecting at the same time. In some cases we saw the soft lockup console messages before we saw "Denying connection for new client ... X recovered, Y in progress, and Z evicted" messages, and in other cases we saw the soft lockup messages appear later.

The next day, we umounted (with --force) all but 4 client nodes. We then rebooted the Lustre server nodes and mounted the targets. The 4 clients reconnected, the servers evicted the other 5,400-ish clients, and the servers exited recovery. We did not see the soft lockups. We then re-mounted all the clients, and did not see the soft lockups.

So based on that experiment, it seems as if starting the servers with all the clients reconnecting at the same time triggered this issue. But it is our normal process to leave the clients mounted during file system updates like this, and we have not seen soft lockups with this stack before. So perhaps something else changed in the system between Tuesday and Wednesday, but if so it was fairly subtle (there were no clusters added or removed, no routes changed, no LNet or Lustre configuration changes in general). |
| Comment by Olaf Faaland [ 14/Feb/22 ] |
|
I checked past logs. Our servers have not reported soft lockups with stacks including LNetPrimaryNID() and lnet_discover_peer_locked() before the 2022 Feb 8th instance.

The only similar soft lockup stack was on a client node in a compute cluster (catalyst220) running Lustre 2.12.7_2.llnl, in September 2021:

Call Trace:
[<ffffffff98ba9ae6>] queued_spin_lock_slowpath+0xb/0xf
[<ffffffff98bb8b00>] _raw_spin_lock+0x30/0x40
[<ffffffffc0df7b51>] cfs_percpt_lock+0xc1/0x110 [libcfs]
[<ffffffffc10637a0>] lnet_discover_peer_locked+0xa0/0x450 [lnet]
[<ffffffff984cc540>] ? wake_up_atomic_t+0x30/0x30
[<ffffffffc1063c25>] LNetPrimaryNID+0xd5/0x220 [lnet]
[<ffffffffc15cf57e>] ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
[<ffffffffc15c344c>] ptlrpc_uuid_to_connection+0xec/0x1a0 [ptlrpc]
[<ffffffffc1594292>] import_set_conn+0xb2/0x7e0 [ptlrpc]
[<ffffffffc15949d3>] client_import_add_conn+0x13/0x20 [ptlrpc]
[<ffffffffc1339e98>] class_add_conn+0x418/0x630 [obdclass]
[<ffffffffc133bb31>] class_process_config+0x1a81/0x2830 [obdclass]
... |
| Comment by Olaf Faaland [ 14/Feb/22 ] |
|
So on the bright side, our servers and clients are up and our clusters are doing work again. But I don't have a reproducer. We will try to update this server cluster (copper) to lustre 2.12.8_6.llnl again in a week or two, so we'll see if the issue comes up again. |
| Comment by Olaf Faaland [ 14/Feb/22 ] |
Chris, why would this only apply when clients are performing a mount? As long as discovery is off, couldn't lnet_peer_is_uptodate() return false on a server just like it could on a client? |
| Comment by Etienne Aujames [ 16/Feb/22 ] |
|
Hello,

The CEA has seen this kind of symptom on servers with missing routes (asymmetrical routes), or on clients (at mount time) when a target is failed over to another node (with the original node not responding, lnet module unloaded).

I have backported the https://review.whamcloud.com/#/c/43537/ because the CEA doesn't use multirail.

Maybe the https://review.whamcloud.com/45898/ ("
| Comment by Olaf Faaland [ 16/Feb/22 ] |
|
Thanks, Etienne |
| Comment by Sarah Liu [ 15/Jun/22 ] |
|
similar one on 2.12.9 https://testing.whamcloud.com/test_sets/c858c157-7ecd-4c98-bfe3-1da2ce125f8c |
| Comment by Etienne Aujames [ 30/Sep/22 ] |
|
Hello,

We observed the same kind of stack trace as Olaf when mounting new clients after losing a DDN controller (4 OSS, targets are mounted on the other controller):

[<ffffffff98ba9ae6>] queued_spin_lock_slowpath+0xb/0xf
[<ffffffff98bb8b00>] _raw_spin_lock+0x30/0x40
[<ffffffffc0df7b51>] cfs_percpt_lock+0xc1/0x110 [libcfs]
[<ffffffffc10637a0>] lnet_discover_peer_locked+0xa0/0x450 [lnet]
[<ffffffff984cc540>] ? wake_up_atomic_t+0x30/0x30
[<ffffffffc1063c25>] LNetPrimaryNID+0xd5/0x220 [lnet]
[<ffffffffc15cf57e>] ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
[<ffffffffc15c344c>] ptlrpc_uuid_to_connection+0xec/0x1a0 [ptlrpc]
[<ffffffffc1594292>] import_set_conn+0xb2/0x7e0 [ptlrpc]
[<ffffffffc15949d3>] client_import_add_conn+0x13/0x20 [ptlrpc]
[<ffffffffc1339e98>] class_add_conn+0x418/0x630 [obdclass]
[<ffffffffc133bb31>] class_process_config+0x1a81/0x2830 [obdclass]

I have done some testing on the master branch with discovery enabled, and I was able to reproduce this. The client parses the MGS configuration llog "<fsname>-client" to initialize all the lustre devices:

#35 (224)marker 24 (flags=0x01, v2.15.51.0) lustrefs-OST0000 'add osc' Fri Sep 30 09:57:22 2022-
#36 (080)add_uuid nid=10.0.2.5@tcp(0x200000a000205) 0: 1:10.0.2.5@tcp
#37 (128)attach 0:lustrefs-OST0000-osc 1:osc 2:lustrefs-clilov_UUID
#38 (136)setup 0:lustrefs-OST0000-osc 1:lustrefs-OST0000_UUID 2:10.0.2.5@tcp
#39 (080)add_uuid nid=10.0.2.4@tcp(0x200000a000204) 0: 1:10.0.2.4@tcp
#40 (104)add_conn 0:lustrefs-OST0000-osc 1:10.0.2.4@tcp
#41 (128)lov_modify_tgts add 0:lustrefs-clilov 1:lustrefs-OST0000_UUID 2:0 3:1
#42 (224)END marker 24 (flags=0x02, v2.15.51.0) lustrefs-OST0000 'add osc' Fri Sep 30 09:57:22 2022-

The records are parsed sequentially:
The issue here is that client_import_add_conn() calls LNetPrimaryNID() and does discovery to get the remote node interfaces. On the client side (2.12.7 LTS) we do not have the https://review.whamcloud.com/#/c/43537/ ("

With the https://review.whamcloud.com/39613 ("LU-10360 mgc: Use IR for client->MDS/OST connections"), I am not sure if we have to do discovery when parsing the configuration. But I have not played a lot with multirail, so someone has to confirm or refute this. |
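A stand-alone sketch of that sequencing problem may make it clearer why one failed-over NID in the llog stalls everything behind it. This is not Lustre code: the record contents, the 3-second sleep and every helper name below are invented; it only models "each connection record does a synchronous discovery before the next record can be processed".

/*
 * Minimal model of sequential llog config processing where each connection
 * record performs a blocking primary-NID discovery, in the spirit of the
 * client_import_add_conn() -> LNetPrimaryNID() path described above.
 * All names and values are illustrative.
 */
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

struct cfg_record {
        const char *op;         /* "attach", "setup", "add_conn", ... */
        const char *nid;        /* NID named by the record, if any */
        bool        reachable;  /* pretend reachability of that peer */
};

/* Stand-in for a synchronous primary-NID discovery call. */
static void discover_primary_nid(const struct cfg_record *rec)
{
        if (!rec->reachable) {
                printf("  discovery of %s blocks here...\n", rec->nid);
                sleep(3);       /* models waiting on a peer that never answers */
        }
}

int main(void)
{
        struct cfg_record llog[] = {
                { "attach",   NULL,           true  },
                { "setup",    "10.0.2.5@tcp", true  },
                { "add_conn", "10.0.2.4@tcp", false }, /* e.g. failed-over OSS */
                { "add_conn", "10.0.2.6@tcp", true  }, /* stuck behind it */
        };

        /* Records are handled strictly one after another. */
        for (unsigned i = 0; i < sizeof(llog) / sizeof(llog[0]); i++) {
                printf("record %u: %s\n", i, llog[i].op);
                if (llog[i].nid != NULL)
                        discover_primary_nid(&llog[i]);
        }
        return 0;
}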
| Comment by Etienne Aujames [ 16/Mar/23 ] |
|
With the patches, client_import_add_conn() no longer hangs because LNetPrimaryNID() does not wait for the end of node discovery (it does the discovery in the background). |
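Schematically (this is not the actual Lustre implementation; the type and helpers below are simplified stand-ins), the behavioural change described above looks like this:

/*
 * Simplified contrast between a blocking primary-NID lookup and one that
 * leaves discovery to run in the background.  Names are illustrative.
 */
#include <stdbool.h>
#include <stdio.h>

typedef unsigned long long nid64;       /* stand-in for lnet_nid_t */

struct peer {
        nid64 primary_nid;
        bool  up_to_date;
};

/* Stub discovery; the real code pings/pushes the peer. */
static void run_discovery(struct peer *p)
{
        p->up_to_date = true;
}

/* Old behaviour: loop until discovery completes; callers such as
 * client_import_add_conn() could hang here on an unreachable peer. */
static nid64 primary_nid_blocking(struct peer *p)
{
        while (!p->up_to_date)
                run_discovery(p);
        return p->primary_nid;
}

/* New behaviour: do not wait for discovery to finish (the real code queues
 * it for the discovery thread); here we only print a note and return the
 * best currently-known primary NID. */
static nid64 primary_nid_background(struct peer *p)
{
        if (!p->up_to_date)
                printf("discovery queued in background\n");
        return p->primary_nid;
}

int main(void)
{
        struct peer p = { .primary_nid = 0x200000a000205ULL, .up_to_date = false };

        printf("background: %#llx\n", primary_nid_background(&p));
        printf("blocking:   %#llx\n", primary_nid_blocking(&p));
        return 0;
}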
| Comment by Peter Jones [ 01/Apr/23 ] |
| Comment by Xing Huang [ 03/Apr/23 ] |
|
OK, I'll port them to b2_15. |
| Comment by Olaf Faaland [ 23/May/23 ] |
|
> OK, I'll port them to b2_15.

Is this still being done? thanks |
| Comment by Serguei Smirnov [ 23/May/23 ] |
|
Hi Olaf,

Yes, there were some distractions so I started on this only late last week. I'm still porting the patches. There's a chance I'll push the ports by the end of this week.

Thanks, Serguei. |
| Comment by Serguei Smirnov [ 31/May/23 ] |
|
Here's the link to the |
| Comment by Olaf Faaland [ 03/Jun/23 ] |
|
Thank you, Serguei. We'll add them to our stack and do some testing. We haven't successfully reproduced the original issue, so we'll only be able to tell you if we have unexpected new symptoms with LNet; but that's a start. |
| Comment by Peter Jones [ 02/Aug/23 ] |
|
The patch series has merged to b2_15 for 2.15.4 |