[LU-13892] LNetError kiblnd_check_conns() Timed out RDMA with 172.19.1.217@o2ib100 (107): c: 8, oc: 0, rc: 8 Created: 07/Aug/20 Updated: 16/Jul/21 Resolved: 22/Oct/20 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.5 |
| Fix Version/s: | Lustre 2.12.6 |
| Type: | Bug | Priority: | Major |
| Reporter: | Olaf Faaland | Assignee: | Serguei Smirnov |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | llnl |
| Environment: | router node |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
lctl and lnetctl commands hang.

Console log messages on the router:

[Thu Aug 6 13:45:29 2020] LNet: 21326:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 172.19.2.24@o2ib100: 121 seconds
[Thu Aug 6 13:47:13 2020] LNetError: 21326:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[Thu Aug 6 13:47:13 2020] LNetError: 21326:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.19.1.217@o2ib100 (107): c: 8, oc: 0, rc: 8
[Thu Aug 6 13:47:14 2020] LNetError: 21326:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
[Thu Aug 6 13:47:14 2020] LNetError: 21326:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.19.1.196@o2ib100 (108): c: 8, oc: 0, rc: 8

The watchdog complains about hung kernel tasks:

[Thu Aug 6 13:47:28 2020] INFO: task kworker/43:1:584 blocked for more than 120 seconds.
[Thu Aug 6 13:47:28 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Aug 6 13:47:28 2020] kworker/43:1 D ffff98cc7df2e2a0 0 584 2 0x00000080
[Thu Aug 6 13:47:28 2020] Workqueue: ib_cm cm_work_handler [ib_cm]
[Thu Aug 6 13:47:28 2020] Call Trace:
[Thu Aug 6 13:47:28 2020] [<ffffffffa9bb4cb9>] schedule_preempt_disabled+0x39/0x90
[Thu Aug 6 13:47:28 2020] [<ffffffffa9bb27ef>] __mutex_lock_slowpath+0x10f/0x250
[Thu Aug 6 13:47:28 2020] [<ffffffffa9bb1a82>] mutex_lock+0x32/0x42
[Thu Aug 6 13:47:28 2020] [<ffffffffc14d4aa1>] lnet_nid2peerni_locked+0x71/0x150 [lnet]
[Thu Aug 6 13:47:28 2020] [<ffffffffc14c1591>] lnet_parse+0x791/0x1200 [lnet]
[Thu Aug 6 13:47:28 2020] [<ffffffffc0ad23a3>] kiblnd_handle_rx+0x223/0x6f0 [ko2iblnd]
[Thu Aug 6 13:47:28 2020] [<ffffffffa94e26ea>] ? try_to_wake_up+0x19a/0x3c0
[Thu Aug 6 13:47:28 2020] [<ffffffffc0ad28ef>] kiblnd_handle_early_rxs+0x7f/0x120 [ko2iblnd]
[Thu Aug 6 13:47:28 2020] [<ffffffffc0ad385e>] kiblnd_connreq_done+0x29e/0x6e0 [ko2iblnd]
[Thu Aug 6 13:47:28 2020] [<ffffffffc0ad6617>] kiblnd_cm_callback+0x11e7/0x23b0 [ko2iblnd]
[Thu Aug 6 13:47:28 2020] [<ffffffffa94c2eda>] ? __queue_delayed_work+0xaa/0x1a0
[Thu Aug 6 13:47:28 2020] [<ffffffffc11dffbf>] cma_ib_handler+0x13f/0x2b0 [rdma_cm]
[Thu Aug 6 13:47:28 2020] [<ffffffffc0bdafab>] cm_process_work+0x2b/0x140 [ib_cm]
[Thu Aug 6 13:47:28 2020] [<ffffffffc0bdd3f3>] cm_work_handler+0xb53/0x158a [ib_cm]
[Thu Aug 6 13:47:28 2020] [<ffffffffa94c197e>] ? move_linked_works+0x5e/0x90
[Thu Aug 6 13:47:29 2020] [<ffffffffa94c20a3>] ? pwq_activate_delayed_work+0x43/0xe0
[Thu Aug 6 13:47:29 2020] [<ffffffffa94c43cf>] process_one_work+0x18f/0x4a0
[Thu Aug 6 13:47:29 2020] [<ffffffffa94c51b6>] worker_thread+0x126/0x3e0
[Thu Aug 6 13:47:29 2020] [<ffffffffa94c5090>] ? rescuer_thread+0x430/0x430
[Thu Aug 6 13:47:29 2020] [<ffffffffa94cca01>] kthread+0xd1/0xe0
[Thu Aug 6 13:47:29 2020] [<ffffffffa94cc930>] ? insert_kthread_work+0x40/0x40
[Thu Aug 6 13:47:29 2020] [<ffffffffa9bc0f64>] ret_from_fork_nospec_begin+0xe/0x21
[Thu Aug 6 13:47:29 2020] [<ffffffffa94cc930>] ? insert_kthread_work+0x40/0x40

Utility commands hang on the router with these stacks:

[root@corona86:pass2]# for x in 21845 25481 18055; do ps -fp $x; cat /proc/${x}/stack; echo; done
UID PID PPID C STIME TTY TIME CMD
root 21845 1 0 13:49 ? 00:00:00 lctl list_nids
[<ffffffffc14a0bf6>] LNetNIInit+0x46/0xc20 [lnet]
[<ffffffffc14c29ae>] lnet_ioctl+0x4e/0x270 [lnet]
[<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
[<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
[<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
[<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
[<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
[<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
[<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
[<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff
UID PID PPID C STIME TTY TIME CMD
root 25481 1 0 14:07 ? 00:00:00 lctl ping 192.168.128.4@o2ib36
[<ffffffffc14a0bf6>] LNetNIInit+0x46/0xc20 [lnet]
[<ffffffffc14c29ae>] lnet_ioctl+0x4e/0x270 [lnet]
[<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
[<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
[<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
[<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
[<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
[<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
[<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
[<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff
UID PID PPID C STIME TTY TIME CMD
root 18055 25762 0 14:28 pts/3 00:00:00 lnetctl lnet configure
[<ffffffffc14c28f6>] lnet_configure+0x16/0x80 [lnet]
[<ffffffffc14c2bb5>] lnet_ioctl+0x255/0x270 [lnet]
[<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
[<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
[<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
[<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
[<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
[<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
[<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
[<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff
The router node is a dual-socket system with AMD EPYC 7401 24-core processors. The internal fabric is managed by a Mellanox SM, and Lustre was built against MOFED. |
| Comments |
| Comment by Olaf Faaland [ 07/Aug/20 ] |
|
For my records, my local ticket is TOSS4861 |
| Comment by Olaf Faaland [ 07/Aug/20 ] |
|
As far as I can tell, both the internal fabric and the SAN are fine. The subnet managers aren't reporting issues. Edit: ICMP pings are not working now; last night I could ping fine, but this morning I cannot. The LNet symptoms were present even when pings were succeeding 100% of the time. |
| Comment by Peter Jones [ 07/Aug/20 ] |
|
Serguei, can you please advise? Thanks, Peter |
| Comment by Olaf Faaland [ 07/Aug/20 ] |
|
Load average on the router is 52, but sysrq "l" shows no processes on-core aside from the one handling the sysrq. I assume that means that some group of processes, perhaps lnet, are repeatedly waking up all at the same time, and then immediately going to sleep again. |
| Comment by Olaf Faaland [ 10/Aug/20 ] |
|
Serguei, |
| Comment by Serguei Smirnov [ 10/Aug/20 ] |
|
Hi Olaf, Did you just upgrade your routers to version 2.12.5 and then start seeing this problem? If so, what was the version you used before? It looks like "lnetctl lnet configure" is getting stuck. Do you see it complete if you just load the modules and run it? Thanks, Serguei. |
| Comment by Olaf Faaland [ 10/Aug/20 ] |
|
Hi Serguei,

Yes, this symptom appeared after booting the cluster for the first time as 2.12.5. The cluster was at 2.12.4_6.chaos before the upgrade. At the time I ran lnetctl lnet configure, the modules were already loaded.

The module parameters look like this:

options lnet forwarding="enabled" \
    networks="o2ib100(san0),o2ib36(hsi0)" \
    routes="o2ib600 172.19.2.[22-25]@o2ib100"
options libcfs libcfs_panic_on_lbug=1
options libcfs libcfs_debug=0x3060580
options ptlrpc at_min=45
options ptlrpc at_max=600
options ksocklnd keepalive_count=100
options ksocklnd keepalive_idle=30
options lnet check_routers_before_use=1

We use systemd to start lnet. The unit file looks like this:

[Unit]
Description=lnet management
Requires=network-online.target
After=network-online.target
Before=pacemaker_remote.service
After=rc-local.service
ConditionPathExists=!/proc/sys/lnet/
ConditionPathExists=!/etc/badnode

[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/sbin/modprobe lnet
ExecStart=/usr/sbin/lnetctl lnet configure --all
ExecStart=/usr/sbin/lnetctl import /etc/lnet.conf
ExecStart=/usr/sbin/lnetctl set discovery 0
ExecStop=/usr/sbin/lustre_rmmod ptlrpc
ExecStop=/usr/sbin/lnetctl lnet unconfigure
ExecStop=/usr/sbin/lustre_rmmod libcfs ldiskfs

[Install]
WantedBy=multi-user.target
RequiredBy=pacemaker_remote.service

and /etc/lnet.conf is all comments:

[root@corona86:~]# grep -v '^#' /etc/lnet.conf
[root@corona86:~]#

thanks, |
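For reference, a minimal sketch of confirming that these tunables actually took effect after the modules load; only parameters already mentioned in this ticket are assumed to exist:

# Confirm the lnet module parameters after "modprobe lnet".
grep --with-filename . /sys/module/lnet/parameters/check_routers_before_use
grep --with-filename . /sys/module/lnet/parameters/lnet_health_sensitivity
# After "lnetctl lnet configure --all", the runtime values (transaction timeout,
# retry count, health sensitivity, ...) can also be read with:
lnetctl global show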
| Comment by Olaf Faaland [ 10/Aug/20 ] |
|
The issue is intermittent. The cluster has two router nodes. Both had the same symptoms initially. One was rebooted and came up without these errors and without hanging lctl/lnetctl commands. This node was rebooted and still had the issue. |
| Comment by Serguei Smirnov [ 10/Aug/20 ] |
|
Could you please try changing "check_routers_before_use" to 0? This could help narrow down where the initialization is getting stuck. Also, could you please make sure lnet logging is enabled (lctl set_param debug=+net)? Thanks, Serguei. |
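A minimal sketch of capturing the requested debug log; the output file name and buffer size are just examples:

lctl set_param debug=+net                    # add "net" to the debug mask
lctl set_param debug_mb=512                  # optional: enlarge the trace buffer
# ... reproduce the hang (reboot or reload the lnet modules) ...
lctl dk > /tmp/lnet-debug.$(hostname).log    # dump and clear the kernel debug buffer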
| Comment by Olaf Faaland [ 10/Aug/20 ] |
|
Hi Serguei, I've set check_routers_before_use=0 and since then haven't seen the issue (4 reboots). I'm repeating the experiment now. I'll post the results in about an hour. thanks, Olaf |
| Comment by Serguei Smirnov [ 10/Aug/20 ] |
|
Hi Olaf, I'm wondering about the state of the routers that you have configured in lnet options. Can they be "lnetctl ping"ed at the time the router under test is being rebooted? Thanks, Serguei. |
| Comment by Olaf Faaland [ 10/Aug/20 ] |
|
Hi Serguei, lnetctl ping of those routers was successful from corona85, while corona86 was booting and then entered the problem state. I had about 15 reboots of corona86 in a row without the issue when check_routers_before_use=0, and as soon as I set it to 1 and rebooted, corona86 entered the problem state. thanks, |
| Comment by Serguei Smirnov [ 11/Aug/20 ] |
|
Hi Olaf, With "check_routers_before_use" on, the issue appears to be that corona86 stalls initialization because it believes that not all of the routers on its list are "alive", even though they seem to be pingable from another node. When you provide the lnet debug log we should be able to see more. Please capture one with "check_routers_before_use" turned off, too (for this one, allow ~2 minutes). I checked with Amir on this. Could you please also provide the output of "lnetctl global show" for corona86? If the health feature is on, you can try disabling and see if there's any effect. Thanks, Serguei. |
| Comment by Olaf Faaland [ 11/Aug/20 ] |
|
Hi Serguei, both have modprobe.conf files, dmesg, and debug logs. The one where check_routers_before_use is 0 has the output of lnetctl export which includes the global settings. That output isn't present for check_routers_before_use=1 because the lnetctl command hangs, as reported originally. thanks |
| Comment by Olaf Faaland [ 11/Aug/20 ] |
|
Disabling LNet health also seems to work around the issue: 3 reboots without incident with check_routers_before_use=1 and lnet_health_sensitivity=0. I've uploaded logs in pass9.check_1.health_0.tgz, including debug logs, dmesg, modprobe.d files, and lnetctl export output. |
| Comment by Olaf Faaland [ 12/Aug/20 ] |
|
In case it helps, the relevant lnet topology is:

[corona compute nodes]--o2ib36/coronaIBfabric---[corona85,86]--o2ib100/IB_SAN_B451--[orelic2,3,4,5]--10gigE--[zrelic2,3,4,5]--o2ib600/IB_SAN_B654--[two lustre clusters]

So the gateways specified in the routes lnet module parameter are the orelic nodes in that diagram.

# corona86 NID on the IB_SAN_B451 net is 172.19.2.27@o2ib100
[root@orelic2:~]# lnetctl export | grep -A14 "primary.*172.19.2.27@o2ib100"
- primary nid: 172.19.2.27@o2ib100
Multi-Rail: False
peer ni:
- nid: 172.19.2.27@o2ib100
state: up
max_ni_tx_credits: 8
available_tx_credits: 8
min_tx_credits: -23
tx_q_num_of_buf: 0
available_rtr_credits: 8
min_rtr_credits: -74
send_count: 1486791834
recv_count: 1548015297
drop_count: 6805
refcount: 4
[root@orelic2:~]# rpm -q lustre
lustre-2.10.8_9.chaos-1.ch6.x86_64
I wonder if our buffers or credits configuration is causing messages to be dropped on orelic nodes, which then causes ping failures on corona86? [root@zrelic2:~]# lctl list_nids
172.16.70.62@tcp
172.19.3.62@o2ib600
[root@zrelic2:~]# lnetctl export | grep -A10 'primary.*172.16.66.22@tcp'
- primary nid: 172.16.66.22@tcp
Multi-Rail: False
peer ni:
- nid: 172.16.66.22@tcp
state: up
max_ni_tx_credits: 8
available_tx_credits: 8
min_tx_credits: -233
tx_q_num_of_buf: 0
available_rtr_credits: 8
min_rtr_credits: -4
and

[root@orelic2:~]# lctl list_nids
172.16.66.22@tcp
172.19.2.22@o2ib100
[root@orelic2:~]# lnetctl export | grep -A10 'primary.*172.16.70.62@tcp'
- primary nid: 172.16.70.62@tcp
Multi-Rail: False
peer ni:
- nid: 172.16.70.62@tcp
state: up
max_ni_tx_credits: 8
available_tx_credits: 8
min_tx_credits: -397
tx_q_num_of_buf: 0
available_rtr_credits: 8
min_rtr_credits: -5
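
A minimal sketch of inspecting the credit and buffer state on the routers; the commands are standard lnetctl/module-parameter names and the grep pattern is only an example. Negative "min" credit values mean messages had to queue for credits at some point.

lnetctl routing show                                  # forwarding state on the router
lnetctl stats show                                    # global send/receive/drop counters
lnetctl export | grep -B4 -A10 'min_tx_credits: -'    # peers that have queued for tx credits
# Router buffer pools are sized by the lnet module parameters
# tiny_router_buffers / small_router_buffers / large_router_buffers.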
|
| Comment by Serguei Smirnov [ 12/Aug/20 ] |
|
Hi Olaf, It may be a good idea to check the credit configuration, but I don't think it is related to the start-up issue we're trying to debug. Here are a few more experiments that would be nice to try:

1) With check_routers_before_use=1 and lnet_health_sensitivity=100, try setting retry_count=1 and transaction_timeout=100 (a sketch of these commands follows this comment).

2) This is unlikely but still would be good to check: in addition to the linux routing setup described here, the accept_local flag should be set to 1.

Amir and I were trying to track the first attempt to send to the routers, but it looks like this moment is not captured in the logs. Would it be possible to start lnet logging sooner? You should be able to reproduce without rebooting the node; just reloading the modules should work. It may give us more clues as to why the connection times out in one scenario but not the other. Thanks, Serguei |
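A minimal sketch of the settings in experiment 1), assuming the 2.12 "lnetctl set" syntax; the matching module-parameter forms are shown as comments:

lnetctl set health_sensitivity 100     # or: options lnet lnet_health_sensitivity=100
lnetctl set retry_count 1              # or: options lnet lnet_retry_count=1
lnetctl set transaction_timeout 100    # or: options lnet lnet_transaction_timeout=100
lnetctl global show                    # verify the values took effect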
| Comment by Olaf Faaland [ 12/Aug/20 ] |
|
Hi Serguei, It's odd that the initial send attempt was not captured. I'm setting thanks |
| Comment by Serguei Smirnov [ 12/Aug/20 ] |
|
Hi Olaf, For example, if I search for "172.19.2.22" in the debug log of pass9.check_1.health_0, I see that the first occurrence is the callback that signals that the connection is established. If I understand correctly, there should be messages related to queuing the tx prior to this, leading to initiating and establishing the connection. Thanks, Serguei. |
| Comment by Olaf Faaland [ 12/Aug/20 ] |
|
Hi Serguei, |
| Comment by Serguei Smirnov [ 12/Aug/20 ] |
|
Hi Olaf, Just to clarify, did the system hang in pass10 or pass11? Thanks, Serguei |
| Comment by Olaf Faaland [ 12/Aug/20 ] |
|
Hi Serguei,

2020-08-12 14:06:38 [79017.582133] LNetError: 80234:0:(lib-move.c:4391:lnet_parse()) 172.19.1.184@o2ib100, src 172.19.1.184@o2ib100: Dropping GET (error -108 looking up sender)
2020-08-12 14:06:38 [79017.595904] LNetError: 80234:0:(lib-move.c:4391:lnet_parse()) Skipped 83936 previous similar messages
2020-08-12 14:13:39 [79438.224874] LNet: 80439:0:(api-ni.c:2010:lnet_clear_zombies_nis_locked()) Waiting for zombie LNI 172.19.2.27@o2ib100
2020-08-12 14:16:39 [79617.606824] LNetError: 80204:0:(lib-move.c:4391:lnet_parse()) 172.19.2.22@o2ib100, src 172.19.3.102@o2ib600: Dropping PUT (error -108 looking up sender)
2020-08-12 14:16:42 [79617.620470] LNetError: 80204:0:(lib-move.c:4391:lnet_parse()) Skipped 88073 previous similar messages
2020-08-12 14:26:40 [80217.667311] LNetError: 80224:0:(lib-move.c:4391:lnet_parse()) 172.19.1.229@o2ib100, src 172.19.1.229@o2ib100: Dropping GET (error -108 looking up sender)
2020-08-12 14:26:40 [80217.681198] LNetError: 80224:0:(lib-move.c:4391:lnet_parse()) Skipped 89526 previous similar messages
2020-08-12 14:36:40 [80817.759190] LNetError: 80224:0:(lib-move.c:4391:lnet_parse()) 172.19.1.229@o2ib100, src 172.19.1.229@o2ib100: Dropping GET (error -108 looking up sender)
2020-08-12 14:36:40 [80817.772927] LNetError: 80224:0:(lib-move.c:4391:lnet_parse()) Skipped 83044 previous similar messages

and I ended up rebooting the node.

edit: I pasted in the wrong log entries initially |
| Comment by Olaf Faaland [ 13/Aug/20 ] |
|
Hi Serguei,

I'm considering continuing my rollout with these settings:

options lnet check_routers_before_use=1
options lnet lnet_health_sensitivity=0

Do you have any clearer idea where the issue is yet? Do you have any idea whether the issue is really worked around this way, or just made less obvious? thanks, |
| Comment by Serguei Smirnov [ 14/Aug/20 ] |
|
Hi Olaf, My current understanding is that the issue is with the connection taking a long time to establish, exceeding the lnd timeout. Something goes wrong then: the tx is retried too soon, and that makes things even more wrong. I believe this is why there's no more hanging when the transaction timeout, from which the lnd timeout is derived, is increased one way or another. That said, I'm not sure why this behaviour was not seen with the release you used previously, as I haven't been able to locate the change in 2.12.5 that could affect this. So in my opinion, everything that you've tried successfully so far is a workaround: the issue still exists, but is hidden because a long enough lnd timeout is in effect. Thanks, Serguei. |
| Comment by Olaf Faaland [ 14/Aug/20 ] |
|
Hi Serguei, Thanks |
| Comment by Olaf Faaland [ 26/Aug/20 ] |
|
Hi Serguei, Have you learned anything further? thanks |
| Comment by Serguei Smirnov [ 26/Aug/20 ] |
|
Hi Olaf, I'm trying to simulate the issue using eth as I currently have no access to IB. I haven't been able to figure it out yet. Did you have a chance to check the linux routing setup on your machines? I'm wondering about the value of the accept_local flag. This can affect local interface recovery. There's more detail to be found here: http://wiki.lustre.org/LNet_Router_Config_Guide#ARP_flux_issue_for_MR_node and here: https://wiki.whamcloud.com/display/LNet/MR+Cluster+Setup Please try sysctl -w net.ipv4.conf.all.accept_local=1 if you see that accept_local is 0 on your system. Thanks, Serguei. |
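A minimal sketch of checking and applying the suggested flag; persisting it under /etc/sysctl.d is only an example of how it might be kept across reboots:

sysctl net.ipv4.conf.all.accept_local        # 0 means packets carrying a local source address are dropped when they arrive on another local interface
sysctl -w net.ipv4.conf.all.accept_local=1   # apply the recommendation at runtime
echo 'net.ipv4.conf.all.accept_local = 1' > /etc/sysctl.d/90-lnet-router.conf   # example way to persist it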
| Comment by Olaf Faaland [ 26/Aug/20 ] |
|
Hi Serguei, Sorry, I didn't look into that because I misread the situation where that applies. Those routers do have accept_local set to 0. I'll try it. thanks, |
| Comment by Cameron Harr [ 10/Sep/20 ] |
|
As a data point, setting net.ipv4.conf.all.accept_local=1 doesn't appear to have an effect on lnet shutdown. I still get the following, and indefinite hangs:

2020-09-10 10:48:14 [ 3433.237844] LNetError: 20855:0:(lib-move.c:4391:lnet_parse()) 192.168.128.123@o2ib35, src 192.168.128.123@o2ib35: Dropping GET (error -108 looking up sender)
2020-09-10 10:48:14 [ 3433.253593] LNetError: 20855:0:(lib-move.c:4391:lnet_parse()) Skipped 2 previous similar messages
2020-09-10 10:48:15 [ 3433.773088] LNetError: 20861:0:(lib-move.c:4391:lnet_parse()) 172.19.3.136@o2ib600, src 172.19.3.136@o2ib600: Dropping GET (error -108 looking up sender)
2020-09-10 10:48:15 [ 3433.788436] LNetError: 20861:0:(lib-move.c:4391:lnet_parse()) Skipped 272 previous similar messages
2020-09-10 10:48:15 [ 3434.225432] LNet: 34467:0:(api-ni.c:2010:lnet_clear_zombies_nis_locked()) Waiting for zombie LNI 172.19.3.66@o2ib600
... |
| Comment by Serguei Smirnov [ 14/Sep/20 ] |
|
Hi Cameron, When you ran the tests with net.ipv4.conf.all.accept_local=1, did you see hangs on shutdown only? What were check_routers_before_use and lnet_health_sensitivity set to? Thanks, Serguei.
|
| Comment by Cameron Harr [ 14/Sep/20 ] |
|
Serguei, I can't say for certain, but judging by the settings on the 3 sibling routers, those settings were check_routers...=1 and lnet_health_sensitivity=0, and I'm fairly sure I didn't change those. |
| Comment by Olaf Faaland [ 24/Sep/20 ] |
|
Hi Serguei,

Reproduced even with the recommended arp settings. Node was booted, and as you see lnetctl was hung. I forgot to set debug=-1, so I can do that and reproduce if it's helpful. I did capture SysRq w; if you want that let me know.

[root@corona81:~]# pdsh -w e85 systemctl status lnet
e85: * lnet.service - lnet management
e85: Loaded: loaded (/etc/systemd/system/lnet.service; enabled; vendor preset: disabled)
e85: Active: activating (start) since Thu 2020-09-24 15:57:18 PDT; 3min 14s ago
e85: Process: 21710 ExecStart=/bin/grep --with-filename . /sys/module/lnet/parameters/lnet_health_sensitivity (code=exited, status=0/SUCCESS)
e85: Process: 21697 ExecStart=/sbin/modprobe lnet (code=exited, status=0/SUCCESS)
e85: Process: 21684 ExecStart=/tftpboot/dumps/set-arp-lnet set (code=exited, status=0/SUCCESS)
e85: Process: 21676 ExecStart=/bin/bash -c echo lnet for Lustre: Build Version: $(rpm -q lustre) | /usr/bin/logger -t lnet.service -p local7.info (code=exited, status=0/SUCCESS)
e85: Main PID: 21712 (lnetctl)
e85: CGroup: /system.slice/lnet.service
e85: `-21712 /usr/sbin/lnetctl lnet configure --all
e85:
e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.san0.arp_filter = 0
e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.san0.arp_announce = 2
e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.san0.rp_filter = 0
e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.hsi0.arp_ignore = 1
e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.hsi0.arp_filter = 0
e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.hsi0.arp_announce = 2
e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.hsi0.rp_filter = 0
e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.all.rp_filter = 0
e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.all.accept_local = 1
e85: Sep 24 15:57:18 corona85 grep[21710]: /sys/module/lnet/parameters/lnet_health_sensitivity:100

What next? Another pass with debug=-1? Another pass with debug from both this router node and the peers described in its routes? A debug patch? thanks |
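For reference, a hypothetical reconstruction of what a set-arp-lnet-style helper applies, based only on the sysctl names and values visible in the log above; the interface list and script layout are assumptions:

#!/bin/bash
# Hypothetical sketch: apply the ARP/rp_filter settings recommended for
# multi-homed LNet routers to each LNet interface.
for i in san0 hsi0; do
    sysctl -w net.ipv4.conf.$i.arp_ignore=1
    sysctl -w net.ipv4.conf.$i.arp_filter=0
    sysctl -w net.ipv4.conf.$i.arp_announce=2
    sysctl -w net.ipv4.conf.$i.rp_filter=0
done
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.all.accept_local=1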
| Comment by Serguei Smirnov [ 24/Sep/20 ] |
|
Hi Olaf, Good to know this is not because of linux routing settings. Sysrq w output would be helpful. Regarding next steps, if you'd be ok with a debug patch, that would be great, as I still haven't seen this happen anywhere else. Thanks, Serguei. |
| Comment by Olaf Faaland [ 25/Sep/20 ] |
|
Hi Serguei, Yes, a debug patch would be fine. Please just push it to gerrit so it goes through testing to show it doesn't itself blow things up, and so I can keep a record of where it came from. I've attached dmesg output including SysRq w content. It's called pass12.dmesg.corona85.1600988107.out Thank you. |
| Comment by Gerrit Updater [ 26/Sep/20 ] |
|
Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40059 |
| Comment by Olaf Faaland [ 29/Sep/20 ] |
|
Hi Serguei, Looking at a crash dump I found that the "lnetctl configure --all" from the lnet.service file was the process that was holding the_lnet.ln_api_mutex, and it was in lnet_wait_known_routerstate(). Based on your patch and what you've said earlier, I think you figured this out a long time ago; I just mention it in case you had any doubt. Thanks |
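A minimal sketch of how such a finding can be made from a vmcore with the crash utility; the file paths are examples, and whether struct mutex exposes an owner field depends on the kernel configuration:

crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/<timestamp>/vmcore
crash> mod -s lnet                        # load the lnet module's debug symbols
crash> ps | grep -e lnetctl -e lctl       # find the stuck utility processes
crash> bt <pid>                           # lnetctl sits in lnet_wait_known_routerstate()
crash> p the_lnet.ln_api_mutex            # the owner field (if present) points at that task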
| Comment by Serguei Smirnov [ 29/Sep/20 ] |
|
Hi Olaf, You're correct about the idea behind this patch. I was waiting for it to get reviewed first, but if you already looked at the change and are comfortable with it, maybe you can give it a try? If the issue can still be reproduced, please capture the lnet debug log. Thanks, Serguei. |
| Comment by Olaf Faaland [ 06/Oct/20 ] |
|
Hi Serguei, The patch looks to be working properly. Thanks!
|
| Comment by Gerrit Updater [ 07/Oct/20 ] |
|
Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40172 |
| Comment by Olaf Faaland [ 08/Oct/20 ] |
|
Serguei, Why is the "lock-up during router check" patch against b2_12 and not master? thanks |
| Comment by Serguei Smirnov [ 08/Oct/20 ] |
|
Hi Olaf, It didn't look to me that master needed this fix. There were lots of changes on master with the introduction of MRR; I guess one of these changes moved the "wait_router_start" logic out from under the api_mutex. Thanks, Serguei. |
| Comment by Olaf Faaland [ 08/Oct/20 ] |
|
Hi Serguei, |
| Comment by Gerrit Updater [ 22/Oct/20 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40172/ |
| Comment by Peter Jones [ 22/Oct/20 ] |
|
Landed for 2.12.6 |