[LU-15541] Soft lockups in LNetPrimaryNID() and lnet_discover_peer_locked() Created: 10/Feb/22  Updated: 02/Aug/23

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.7
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Olaf Faaland Assignee: Serguei Smirnov
Resolution: Unresolved Votes: 0
Labels: llnl, topllnl
Environment:

3.10.0-1160.45.1.1chaos.ch6.x86_64
lustre-2.12.7_2.llnl
3.10.0-1160.53.1.1chaos.ch6.x86_64
lustre-2.12.8_6.llnl
RHEL7.9
zfs-0.7.11-9.8llnl


Attachments: Text File vmcore-dmesg.copper1.txt     Text File vmcore-dmesg.copper2.txt    
Issue Links:
Related
is related to LU-14668 LNet: do discovery in the background Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

We upgraded a lustre server cluster from lustre-2.12.7_2.llnl to lustre-2.12.8_6.llnl. Almost immediately after boot, the servers begin reporting soft lockups on the console, with stacks like this:

2022-02-08 09:43:10 [1644342190.528916] 
Call Trace:
 queued_spin_lock_slowpath+0xb/0xf
 _raw_spin_lock+0x30/0x40
 cfs_percpt_lock+0xc1/0x110 [libcfs]
 lnet_discover_peer_locked+0xa0/0x450 [lnet]
 ? wake_up_atomic_t+0x30/0x30
 LNetPrimaryNID+0xd5/0x220 [lnet]
 ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
 target_handle_connect+0x12f1/0x2b90 [ptlrpc]
 ? enqueue_task_fair+0x208/0x6c0
 ? check_preempt_curr+0x80/0xa0
 ? ttwu_do_wakeup+0x19/0x100
 tgt_request_handle+0x4fa/0x1570 [ptlrpc]
 ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
 ? __getnstimeofday64+0x3f/0xd0
 ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
 ? ptlrpc_wait_event+0xb8/0x370 [ptlrpc]
 ? __wake_up_common_lock+0x91/0xc0
 ? sched_feat_set+0xf0/0xf0
 ptlrpc_main+0xc49/0x1c50 [ptlrpc]
 ? __switch_to+0xce/0x5a0
 ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
 kthread+0xd1/0xe0
 ? insert_kthread_work+0x40/0x40
 ret_from_fork_nospec_begin+0x21/0x21
 ? insert_kthread_work+0x40/0x40

Some servers never exit recovery, and others do but seem to be unable to service requests.

Seen during the same lustre server update where we saw LU-15539, but it appears to be a separate issue.

Patch stacks are:
https://github.com/LLNL/lustre/releases/tag/2.12.8_6.llnl
https://github.com/LLNL/lustre/releases/tag/2.12.7_2.llnl



 Comments   
Comment by Olaf Faaland [ 10/Feb/22 ]

behlendo found this patch, which seems to fit the stack reported with the soft lockups
https://review.whamcloud.com/#/c/43537/

Comment by Olaf Faaland [ 10/Feb/22 ]

I've attached vmcore-dmesg.txt files from two servers where this occurred, while they were running 2.12.8_6.llnl and the later of the two kernels I mentioned.

We reverted the servers back to the earlier kernel and lustre 2.12.7_2.llnl and saw the same symptoms.

Comment by Serguei Smirnov [ 10/Feb/22 ]

Hi Olaf,

With 2.12.7_2.llnl, does the client eventually recover, or is it stuck forever?

If possible, I'd recommend trying lustre-2.12.8_6.llnl with the change for LU-15234 "lnet: Race on discovery queue" reverted. It never landed on lustre b2_12; perhaps hornc can comment why it was abandoned.

Thanks,

Serguei.

Comment by Chris Horn [ 10/Feb/22 ]

> Chris Horn can comment why it was abandoned.

I only abandoned it because I found it was not actually the root cause of the issue in that ticket. I can revive the patch for b2_12 if you would like it landed there.

Comment by Olaf Faaland [ 10/Feb/22 ]

Hi Serguei,

> With 2.12.7_2.llnl, does the client eventually recover, or is it stuck forever?

I think after "mount -t lustre <lustre-target> <dir>" the longest I waited was about 20 minutes. At that point most of the 52 targets (16 MDTs, 36 OSTs) had exited recovery, but a couple had not. Also at that point, nodes started crashing.

> If possible, I'd recommend trying lustre-2.12.8_6.llnl with reverted change for LU-15234 "lnet: Race on discovery queue".

I tried that, and saw the same soft lockups.

thanks,
Olaf

Comment by Olaf Faaland [ 10/Feb/22 ]

I have crash dumps as well, if that helps.

Comment by Serguei Smirnov [ 10/Feb/22 ]

Olaf,

Are clients being upgraded at the same time and to the same version as the server?

What was the last version which doesn't cause lock ups?

I'm trying to identify the change which could have introduced this problem. That would make it easier to see if the fix is already available.

Thanks,

Serguei. 

Comment by Olaf Faaland [ 10/Feb/22 ]

Hi Serguei,

> Are clients being upgraded at the same time and to the same version as the server?

No. We have 10-15 compute clusters and 3 server clusters; each server cluster hosts one file system. We typically update 3-5 clusters per day.

The "old version" is 2.12.7_2.llnl. When this happened Tuesday, we were on the first day of our update cycle. 2 compute clusters were updated to 2.12.8_6.llnl at the same time that 1 server cluster (copper, the where we saw the issue) was updated to 2.12.8_6.llnl. The remainder of the clusters were still at 2.12.7_2.llnl

During the update attempt on Tuesday, we initially brought copper up with 2.12.8_6.llnl and saw these soft lockups and also LU-15539. We then brought copper down and brought it back up with its old file system image, which had 2.12.7_2.llnl. We saw these soft lockups again with 2.12.7_2.llnl.

> What was the last version which doesn't cause lock ups?

Unfortunately, I don't know. It seems to me that the issue exists in both 2.12.7_2.llnl and 2.12.8_6.llnl. I'll look back and see if I have console logs indicating this issue prior to this week.

Comment by Chris Horn [ 10/Feb/22 ]

behlendo found this patch, which seems to fit the stack reported with the soft lockups
https://review.whamcloud.com/#/c/43537/

FWIW, this patch is really just an optimization for a particular case where clients with discovery disabled are mounting a filesystem when one or more servers are down. If all of your servers are up/healthy when the clients mount, or if your clients have discovery enabled, then this patch would not really have an impact.

Comment by Serguei Smirnov [ 10/Feb/22 ]

One of the following patches may help resolve the issue:

https://review.whamcloud.com/35444

https://review.whamcloud.com/42109

https://review.whamcloud.com/42111

The first one (LU-10931) is probably more relevant because the other two deal with the situation when the remote peer is not ready to connect.

 

Comment by Olaf Faaland [ 14/Feb/22 ]

A little more detail.

At the time this happened, the file system was mounted by about 5,400 clients (based on count of exports reported by OSTs) and was running Lustre 2.12.7_2.llnl.

The process we followed is one we always use - we shut down the file system without umounting the clients (so I/O on clients would hang), then rebooted the server nodes. After all the server nodes came up into the new image which had Lustre 2.12.8_6.llnl, we mounted the Lustre targets within the space of 30 seconds or a minute (first the MGS, then all MDTs in parallel, then all OSTs in parallel). The recovery_status procfile on the servers indicated they all entered recovery almost immediately. So there were a large number of clients all reconnecting at the same time.

In some cases we saw the soft lockup console messages before we saw "Denying connection for new client ... X recovered, Y in progress, and Z evicted" messages, and in other cases we saw the soft lockup messages appear later.
After trying different lustre versions and reverting a couple of patches, we stopped for the day.

The next day, we umounted (with --force) all but 4 client nodes. We then rebooted the Lustre server nodes and mounted the targets. The 4 clients reconnected, the servers evicted the other 5,400-ish clients, and the servers exited recovery. We did not see the soft lockups. We then re-mounted all the clients, and did not see the soft lockups.

So based on that experiment, it seems as if starting the servers with all the clients reconnecting at the same time triggered this issue. But it is our normal process to leave the clients mounted during file system updates like this, and we have not seen soft lockups with this stack before. So perhaps something else changed in the system between Tuesday and Wednesday, but if so it was fairly subtle (there were no clusters added or removed, no routes changed, no LNet or Lustre configuration changes in general).

Comment by Olaf Faaland [ 14/Feb/22 ]

I checked past logs. Our servers have not reported soft lockups with stacks including LNetPrimaryNID() and lnet_discover_peer_locked() before the 2022 Feb 8th instance.

The only similar soft lockup stack was on a client node in a compute cluster (catalyst220) running Lustre 2.12.7_2.llnl, in September 2021:

Call Trace:
 [<ffffffff98ba9ae6>] queued_spin_lock_slowpath+0xb/0xf
 [<ffffffff98bb8b00>] _raw_spin_lock+0x30/0x40
 [<ffffffffc0df7b51>] cfs_percpt_lock+0xc1/0x110 [libcfs]
 [<ffffffffc10637a0>] lnet_discover_peer_locked+0xa0/0x450 [lnet]
 [<ffffffff984cc540>] ? wake_up_atomic_t+0x30/0x30
 [<ffffffffc1063c25>] LNetPrimaryNID+0xd5/0x220 [lnet]
 [<ffffffffc15cf57e>] ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
 [<ffffffffc15c344c>] ptlrpc_uuid_to_connection+0xec/0x1a0 [ptlrpc]
 [<ffffffffc1594292>] import_set_conn+0xb2/0x7e0 [ptlrpc]
 [<ffffffffc15949d3>] client_import_add_conn+0x13/0x20 [ptlrpc]
 [<ffffffffc1339e98>] class_add_conn+0x418/0x630 [obdclass]
 [<ffffffffc133bb31>] class_process_config+0x1a81/0x2830 [obdclass]
... 
Comment by Olaf Faaland [ 14/Feb/22 ]

So on the bright side, our servers and clients are up and our clusters are doing work again.  But I don't have a reproducer.  We will try to update this server cluster (copper) to lustre 2.12.8_6.llnl again in a week or two, so we'll see if the issue comes up again.

Comment by Olaf Faaland [ 14/Feb/22 ]

behlendo found this patch, which seems to fit the stack reported with the soft lockups

https://review.whamcloud.com/#/c/43537/

FWIW, this patch is really just an optimization for a particular case where clients with discovery disabled are mounting a filesystem when one or more servers are down. 

Chris, why would this only apply when clients are performing a mount? As long as discovery is off, couldn't lnet_peer_is_uptodate() return false on a server just like it could on a client?
 

Comment by Etienne Aujames [ 16/Feb/22 ]

Hello,

The CEA has seen this kind of symptom on servers due to missing routes (asymmetrical routes), or on clients (at mount time) when a target is failed over to another node (with the original node not responding and the lnet module unloaded).
At that time the CEA had an LNet credit starvation issue for the client mount cases. Mounting all the clients at the same time could result in LNet credit starvation.

I have backported https://review.whamcloud.com/#/c/43537/ because the CEA doesn't use multirail.
We also set drop_asym_route=1 on servers to protect against route misconfiguration between clients and servers.

Maybe https://review.whamcloud.com/45898/ ("LU-10931 lnet: handle unlink before send completes") could fix that issue with discovery on.

Comment by Olaf Faaland [ 16/Feb/22 ]

Thanks, Etienne

Comment by Sarah Liu [ 15/Jun/22 ]

similar one on 2.12.9 https://testing.whamcloud.com/test_sets/c858c157-7ecd-4c98-bfe3-1da2ce125f8c

Comment by Etienne Aujames [ 30/Sep/22 ]

Hello,

We observed the same kind of stack trace as Olaf when mounting new clients after losing a DDN controller (4 OSSes; targets are mounted on the other controller):

 [<ffffffff98ba9ae6>] queued_spin_lock_slowpath+0xb/0xf
 [<ffffffff98bb8b00>] _raw_spin_lock+0x30/0x40
 [<ffffffffc0df7b51>] cfs_percpt_lock+0xc1/0x110 [libcfs]
 [<ffffffffc10637a0>] lnet_discover_peer_locked+0xa0/0x450 [lnet]
 [<ffffffff984cc540>] ? wake_up_atomic_t+0x30/0x30
 [<ffffffffc1063c25>] LNetPrimaryNID+0xd5/0x220 [lnet]
 [<ffffffffc15cf57e>] ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
 [<ffffffffc15c344c>] ptlrpc_uuid_to_connection+0xec/0x1a0 [ptlrpc]
 [<ffffffffc1594292>] import_set_conn+0xb2/0x7e0 [ptlrpc]
 [<ffffffffc15949d3>] client_import_add_conn+0x13/0x20 [ptlrpc]
 [<ffffffffc1339e98>] class_add_conn+0x418/0x630 [obdclass]
 [<ffffffffc133bb31>] class_process_config+0x1a81/0x2830 [obdclass]

I have done some testing on the master branch with discovery enabled, and I was able to reproduce this.
I can only reproduce it with the client behind an LNet router or with message drop rules on the OSS (lctl net_drop_add).

The client parses the MGS configuration llog "<fsname>-client" to initialize all the Lustre devices.
e.g. the MGS client osc configuration:

#35 (224)marker  24 (flags=0x01, v2.15.51.0) lustrefs-OST0000 'add osc' Fri Sep 30 09:57:22 2022-       
#36 (080)add_uuid  nid=10.0.2.5@tcp(0x200000a000205)  0:  1:10.0.2.5@tcp                                
#37 (128)attach    0:lustrefs-OST0000-osc  1:osc  2:lustrefs-clilov_UUID                                
#38 (136)setup     0:lustrefs-OST0000-osc  1:lustrefs-OST0000_UUID  2:10.0.2.5@tcp                      
#39 (080)add_uuid  nid=10.0.2.4@tcp(0x200000a000204)  0:  1:10.0.2.4@tcp                                
#40 (104)add_conn  0:lustrefs-OST0000-osc  1:10.0.2.4@tcp                                               
#41 (128)lov_modify_tgts add 0:lustrefs-clilov  1:lustrefs-OST0000_UUID  2:0  3:1                       
#42 (224)END   marker  24 (flags=0x02, v2.15.51.0) lustrefs-OST0000 'add osc' Fri Sep 30 09:57:22 2022- 

The records are parsed sequentially:

  • setup -> client_obd_setup(): initializes the device and the primary connection (10.0.2.5@tcp, client_import_add_conn())
  • add_conn -> client_import_add_conn(): initializes the failover connections (10.0.2.4@tcp, client_import_add_conn())

The issue here is that client_import_add_conn() calls LNetPrimaryNID(), which does discovery to get the remote node's interfaces.
The discovery thread starts by pinging the node and takes transaction_timeout (+/- transaction_timeout/2) for it.
In our case, we lost 4 OSSes with 2 unreachable failover nodes each: 50s * 4 * 2 = 400s (max time is 600s).
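
To make the arithmetic concrete, here is a minimal stand-alone sketch (plain user-space C, not Lustre code; the variable names are only illustrative) of the worst-case mount-time stall when each unreachable NID is discovered synchronously:

/* Illustrative only: worst-case client mount delay when LNetPrimaryNID()
 * blocks on discovery for every unreachable NID in the config llog.
 * The numbers match the scenario described above; nothing here is taken
 * from the Lustre source. */
#include <stdio.h>

int main(void)
{
        int transaction_timeout = 50;  /* seconds (LNet transaction_timeout) */
        int lost_oss            = 4;   /* OSSes behind the failed controller */
        int failover_nids       = 2;   /* unreachable failover NIDs per OSS  */

        int typical = transaction_timeout * lost_oss * failover_nids;
        /* discovery may take up to transaction_timeout + transaction_timeout/2 */
        int worst = (transaction_timeout + transaction_timeout / 2)
                    * lost_oss * failover_nids;

        printf("typical mount-time stall: %ds, worst case: %ds\n",
               typical, worst);
        return 0;
}

This prints 400s and 600s for the failed-controller case, and 16s/24s once transaction_timeout is lowered to 2s as in the workaround described in the next paragraph.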

On the client side (2.12.7 LTS) we do not have https://review.whamcloud.com/#/c/43537/ ("LU-14566 lnet: Skip discovery in LNetPrimaryNID if DD disabled"), so to work around this issue we decrease transaction_timeout before mounting the client and then restore it: 2s * 4 * 2 = 16s (max time: 24s).

With https://review.whamcloud.com/39613 ("LU-10360 mgc: Use IR for client->MDS/OST connections"), I am not sure we still have to do discovery when parsing the configuration. But I have not played a lot with multirail, so someone will have to confirm or refute this.

Comment by Etienne Aujames [ 16/Mar/23 ]

The patches of LU-14668 seem to resolve this issue.

client_import_add_conn() does not hang anymore because LNetPrimaryNID() no longer waits for node discovery to complete (it does the discovery in the background).
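
For illustration only, here is a simplified user-space simulation (not the actual LNet implementation; all names are made up) of the behavioural difference between the old blocking LNetPrimaryNID() path and the LU-14668 background-discovery behaviour:

/* Simplified simulation of the change described above; function names are
 * illustrative, not real LNet symbols.  Build with: cc sim.c -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define DISCOVERY_DELAY 5  /* stand-in for transaction_timeout, in seconds */

/* Pretend "discovery": ping the peer and learn its interfaces. */
static void *discover_peer(void *arg)
{
        const char *nid = arg;

        sleep(DISCOVERY_DELAY);  /* ping round-trip / timeout */
        printf("discovery of %s finished\n", nid);
        return NULL;
}

/* Old behaviour: the caller (e.g. config-llog parsing) blocks until
 * discovery completes, so every unreachable peer stalls the mount. */
static void primary_nid_blocking(const char *nid)
{
        pthread_t t;

        pthread_create(&t, NULL, discover_peer, (void *)nid);
        pthread_join(t, NULL);  /* wait for discovery to finish */
        printf("blocking: got primary NID for %s\n", nid);
}

/* LU-14668-style behaviour: kick off discovery in the background and
 * return immediately with the currently known primary NID. */
static void primary_nid_background(const char *nid)
{
        pthread_t t;

        pthread_create(&t, NULL, discover_peer, (void *)nid);
        pthread_detach(t);  /* don't wait; discovery continues on its own */
        printf("background: returning current NID for %s right away\n", nid);
}

int main(void)
{
        primary_nid_blocking("10.0.2.4@tcp");   /* stalls ~DISCOVERY_DELAY */
        primary_nid_background("10.0.2.4@tcp"); /* returns immediately */
        sleep(DISCOVERY_DELAY + 1);             /* let the detached thread finish */
        return 0;
}

The point is only that the caller no longer waits on the peer ping, which is why client_import_add_conn() stops stalling during config-llog processing.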

Comment by Peter Jones [ 01/Apr/23 ]

hxing could you please port the LU-14668 patches to b2_15?

Comment by Xing Huang [ 03/Apr/23 ]

OK, I'll port them to b2_15.

Comment by Olaf Faaland [ 23/May/23 ]

> OK, I'll port them to b2_15.

Is this still being done?

thanks

Comment by Serguei Smirnov [ 23/May/23 ]

Hi Olaf,

Yes, there were some distractions so I started on this only late last week. I'm still porting the patches. There's a chance I'll push the ports by the end of this week.

Thanks,

Serguei.

Comment by Serguei Smirnov [ 31/May/23 ]

Here's the link to the LU-14668 patch series ported to b2_15: https://review.whamcloud.com/51135/

Comment by Olaf Faaland [ 03/Jun/23 ]

Thank you, Serguei. We'll add them to our stack and do some testing. We haven't successfully reproduced the original issue, so we'll only be able to tell you if we have unexpected new symptoms with LNet; but that's a start.

Comment by Peter Jones [ 02/Aug/23 ]

The patch series has merged to b2_15 for 2.15.4
