LU-15541: Soft lockups in LNetPrimaryNID() and lnet_discover_peer_locked()

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.12.7
    • Environment:
      3.10.0-1160.45.1.1chaos.ch6.x86_64
      lustre-2.12.7_2.llnl
      3.10.0-1160.53.1.1chaos.ch6.x86_64
      lustre-2.12.8_6.llnl
      RHEL7.9
      zfs-0.7.11-9.8llnl
    • Severity: 3

    Description

      We upgraded a lustre server cluster from lustre-2.12.7_2.llnl to lustre-2.12.8_6.llnl. Almost immediately after boot, clients begin reporting soft lockups on the console, with stacks like this:

      2022-02-08 09:43:10 [1644342190.528916] 
      Call Trace:
       queued_spin_lock_slowpath+0xb/0xf
       _raw_spin_lock+0x30/0x40
       cfs_percpt_lock+0xc1/0x110 [libcfs]
       lnet_discover_peer_locked+0xa0/0x450 [lnet]
       ? wake_up_atomic_t+0x30/0x30
       LNetPrimaryNID+0xd5/0x220 [lnet]
       ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
       target_handle_connect+0x12f1/0x2b90 [ptlrpc]
       ? enqueue_task_fair+0x208/0x6c0
       ? check_preempt_curr+0x80/0xa0
       ? ttwu_do_wakeup+0x19/0x100
       tgt_request_handle+0x4fa/0x1570 [ptlrpc]
       ? ptlrpc_nrs_req_get_nolock0+0xd1/0x170 [ptlrpc]
       ? __getnstimeofday64+0x3f/0xd0
       ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
       ? ptlrpc_wait_event+0xb8/0x370 [ptlrpc]
       ? __wake_up_common_lock+0x91/0xc0
       ? sched_feat_set+0xf0/0xf0
       ptlrpc_main+0xc49/0x1c50 [ptlrpc]
       ? __switch_to+0xce/0x5a0
       ? ptlrpc_register_service+0xf80/0xf80 [ptlrpc]
       kthread+0xd1/0xe0
       ? insert_kthread_work+0x40/0x40
       ret_from_fork_nospec_begin+0x21/0x21
       ? insert_kthread_work+0x40/0x40
      

      Some servers never exit recovery, and others do but seem to be unable to service requests.

      Seen during the same lustre server update where we saw LU-15539, but this appears to be a separate issue.

      Patch stacks are:
      https://github.com/LLNL/lustre/releases/tag/2.12.8_6.llnl
      https://github.com/LLNL/lustre/releases/tag/2.12.7_2.llnl

      Attachments

        Issue Links

          Activity


            ofaaland Olaf Faaland added a comment -

            Thank you, Serguei. We'll add them to our stack and do some testing. We haven't successfully reproduced the original issue, so we'll only be able to tell you if we have unexpected new symptoms with LNet; but that's a start.

            ssmirnov Serguei Smirnov added a comment -

            Here's the link to the LU-14668 patch series ported to b2_15: https://review.whamcloud.com/51135/
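
            For reference, changes from a Gerrit series like that one can usually be pulled into a local tree as follows. This is a minimal sketch: the project path and the trailing patch-set number ("/1") are assumptions to adjust for the actual review.

             # Fetch and cherry-pick one change of the series from Whamcloud Gerrit
             # (51135 is the change number; "35" is its last two digits).
             git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/35/51135/1
             git cherry-pick FETCH_HEAD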

            ssmirnov Serguei Smirnov added a comment -

            Hi Olaf,

            Yes, there were some distractions so I started on this only late last week. I'm still porting the patches. There's a chance I'll push the ports by the end of this week.

            Thanks,

            Serguei.

            ofaaland Olaf Faaland added a comment -

            > OK, I'll Port them to b2_15.

            Is this still being done?

            thanks
            hxing Xing Huang added a comment -

            OK, I'll Port them to b2_15.

            pjones Peter Jones added a comment -

            hxing could you please port the LU-14668 patches to b2_15?


            eaujames Etienne Aujames added a comment -

            The patches of LU-14668 seem to resolve this issue.

            client_import_add_conn() does not hang anymore because LNetPrimaryNID() does not wait for the end of node discovery (it does the discovery in the background).
            eaujames Etienne Aujames added a comment - - edited

            Hello,

            We observed the same kind of stack trace as Olaf when mounting new clients after losing a DDN controller (4 OSSs, with the targets mounted on the other controller):

             [<ffffffff98ba9ae6>] queued_spin_lock_slowpath+0xb/0xf
             [<ffffffff98bb8b00>] _raw_spin_lock+0x30/0x40
             [<ffffffffc0df7b51>] cfs_percpt_lock+0xc1/0x110 [libcfs]
             [<ffffffffc10637a0>] lnet_discover_peer_locked+0xa0/0x450 [lnet]
             [<ffffffff984cc540>] ? wake_up_atomic_t+0x30/0x30
             [<ffffffffc1063c25>] LNetPrimaryNID+0xd5/0x220 [lnet]
             [<ffffffffc15cf57e>] ptlrpc_connection_get+0x3e/0x450 [ptlrpc]
             [<ffffffffc15c344c>] ptlrpc_uuid_to_connection+0xec/0x1a0 [ptlrpc]
             [<ffffffffc1594292>] import_set_conn+0xb2/0x7e0 [ptlrpc]
             [<ffffffffc15949d3>] client_import_add_conn+0x13/0x20 [ptlrpc]
             [<ffffffffc1339e98>] class_add_conn+0x418/0x630 [obdclass]
             [<ffffffffc133bb31>] class_process_config+0x1a81/0x2830 [obdclass]
            

            I have done some testing on the master branch with discovery enabled, and I was able to reproduce this.
            I can only reproduce it with the client behind an LNet router or with message drop rules on the OSS (lctl net_drop_add).
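
            For reference, a minimal sketch of the message-drop reproduction: the OSS NID matches the llog example below, the client NID 10.0.2.10@tcp is a placeholder, and the -r 1 rate (drop every matching message) is an assumption.

             # On the OSS: drop LNet messages exchanged with the test client so
             # that discovery pings never complete (example NIDs).
             lctl net_drop_add -s 10.0.2.5@tcp -d 10.0.2.10@tcp -r 1
             lctl net_drop_add -s 10.0.2.10@tcp -d 10.0.2.5@tcp -r 1

             # Mount the client while the rules are active, then watch for the
             # LNetPrimaryNID()/lnet_discover_peer_locked() stacks.

             # Remove all drop rules afterwards.
             lctl net_drop_del -a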

            The client parses the MGS configuration llog "<fsname>-client" to initialize all the Lustre devices.
            e.g.: MGS client osc configuration:

            #35 (224)marker  24 (flags=0x01, v2.15.51.0) lustrefs-OST0000 'add osc' Fri Sep 30 09:57:22 2022-       
            #36 (080)add_uuid  nid=10.0.2.5@tcp(0x200000a000205)  0:  1:10.0.2.5@tcp                                
            #37 (128)attach    0:lustrefs-OST0000-osc  1:osc  2:lustrefs-clilov_UUID                                
            #38 (136)setup     0:lustrefs-OST0000-osc  1:lustrefs-OST0000_UUID  2:10.0.2.5@tcp                      
            #39 (080)add_uuid  nid=10.0.2.4@tcp(0x200000a000204)  0:  1:10.0.2.4@tcp                                
            #40 (104)add_conn  0:lustrefs-OST0000-osc  1:10.0.2.4@tcp                                               
            #41 (128)lov_modify_tgts add 0:lustrefs-clilov  1:lustrefs-OST0000_UUID  2:0  3:1                       
            #42 (224)END   marker  24 (flags=0x02, v2.15.51.0) lustrefs-OST0000 'add osc' Fri Sep 30 09:57:22 2022- 
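
            For reference, a dump like the one above can usually be obtained on the MGS with lctl llog_print; a minimal sketch, assuming the fsname "lustrefs" from the records above (device and llog names depend on the setup).

             # On the MGS node: print the client configuration llog for the
             # filesystem (records #35-#42 above come from such a dump).
             lctl --device MGS llog_print lustrefs-client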
            

            The records are parsed sequentially:

            • setup -> client_obd_setup(): initializes the device and the primary connection (10.0.2.5@tcp, client_import_add_conn())
            • add_conn -> client_import_add_conn(): initializes the failover connections (10.0.2.4@tcp, client_import_add_conn())

            The issue here is that client_import_add_conn() calls LNetPrimaryNID(), which does discovery to get the remote node's interfaces.
            The discovery thread starts by pinging the node, which takes transaction_timeout (+/- transaction_timeout/2).
            In our case, we lost 4 OSSs with 2 unreachable failover nodes each: 50s * 4 * 2 = 400s (max time is 600s).

            On the client side (2.12.7 LTS) we do not have https://review.whamcloud.com/#/c/43537/ ("LU-14566 lnet: Skip discovery in LNetPrimaryNID if DD disabled"), so to work around this issue we decrease transaction_timeout before mounting the client and then restore it: 2s * 4 * 2 = 16s (max time: 24s).
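
            For reference, a minimal sketch of that workaround, assuming lnetctl set transaction_timeout and the lnet_transaction_timeout module parameter are available on the client (LNet health, 2.12+); the MGS NID, fsname and values are placeholders.

             # Current LNet transaction timeout (50s in the example above).
             cat /sys/module/lnet/parameters/lnet_transaction_timeout

             # Shrink it so discovery pings to unreachable failover NIDs fail fast.
             lnetctl set transaction_timeout 2

             # Mount the client while the short timeout is in effect.
             mount -t lustre 10.0.2.2@tcp:/lustrefs /mnt/lustre

             # Restore the original value once the mount has completed.
             lnetctl set transaction_timeout 50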

            With https://review.whamcloud.com/39613 ("LU-10360 mgc: Use IR for client->MDS/OST connections"), I am not sure whether we still have to do discovery when parsing the configuration. But I have not played much with Multi-Rail, so someone would have to confirm or refute this.

            sarah Sarah Liu added a comment - similar one on 2.12.9 https://testing.whamcloud.com/test_sets/c858c157-7ecd-4c98-bfe3-1da2ce125f8c
            ofaaland Olaf Faaland added a comment -

            Thanks, Etienne


            eaujames Etienne Aujames added a comment -

            Hello,

            The CEA has seen this kind of symptom on servers with missing routes (asymmetrical routes), or on clients (at mount time) when a target has failed over to another node (with the original node not responding and its lnet module unloaded).
            At that time the CEA had LNet credit issues (starvation) in the client mount cases: mounting all the clients at the same time could result in LNet credit starvation.

            I have backported https://review.whamcloud.com/#/c/43537/ because the CEA doesn't use Multi-Rail.
            We also set drop_asym_route=1 on the servers to protect against route misconfiguration between clients and servers.

            Maybe https://review.whamcloud.com/45898/ ("LU-10931 lnet: handle unlink before send completes") could fix that issue with discovery on.
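
            For reference, a minimal sketch of the two settings mentioned in the comment above (the LU-14566 patch only makes LNetPrimaryNID() skip discovery when peer discovery is disabled); command and parameter names are as in recent lnetctl releases and may need checking against the local version.

             # Disable LNet peer discovery (single-rail setups only), so the
             # backported LU-14566 change makes LNetPrimaryNID() skip discovery.
             lnetctl set discovery 0

             # Drop messages arriving over an asymmetrical route on the servers.
             lnetctl set drop_asym_route 1

             # A persistent equivalent for discovery, e.g. in /etc/modprobe.d/lnet.conf:
             #   options lnet lnet_peer_discovery_disabled=1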

            People

              Assignee: ssmirnov Serguei Smirnov
              Reporter: ofaaland Olaf Faaland
              Votes: 0
              Watchers: 11
