[LU-13892] LNetError kiblnd_check_conns() Timed out RDMA with 172.19.1.217@o2ib100 (107): c: 8, oc: 0, rc: 8 Created: 07/Aug/20  Updated: 16/Jul/21  Resolved: 22/Oct/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.5
Fix Version/s: Lustre 2.12.6

Type: Bug Priority: Major
Reporter: Olaf Faaland Assignee: Serguei Smirnov
Resolution: Fixed Votes: 0
Labels: llnl
Environment:

router node
lustre-2.12.5_4.llnl-1.1mofed.ch6.x86_64
3.10.0-1127.18.2.1chaos.ch6.x86_64


Attachments: Text File dk.corona86.1596747314.txt     Text File dk.corona86.1596747437.txt     Text File dk.corona86.1596748295.txt     Text File dmesg.corona86.1596747297.txt     Text File dmesg.corona86.1596748292.txt     Text File dmesg.corona86.1596750724.txt     File pass10.sens100.retry1.check1.timeout100.tgz     File pass11.sens0.retry0.check1.timeout50.tgz     File pass12.dmesg.corona85.1600988107.out     File pass5.check_1.tgz     File pass6.check_0.tgz     File pass9.check_1.health_0.tgz    
Issue Links:
Related
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

lctl and lnetctl commands hang.

Console log messages on the router:

[Thu Aug  6 13:45:29 2020] LNet: 21326:0:(o2iblnd_cb.c:3397:kiblnd_check_conns()) Timed out tx for 172.19.2.24@o2ib100: 121 seconds
[Thu Aug  6 13:47:13 2020] LNetError: 21326:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 0 seconds
[Thu Aug  6 13:47:13 2020] LNetError: 21326:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.19.1.217@o2ib100 (107): c: 8, oc: 0, rc: 8
[Thu Aug  6 13:47:14 2020] LNetError: 21326:0:(o2iblnd_cb.c:3351:kiblnd_check_txs_locked()) Timed out tx: active_txs, 1 seconds
[Thu Aug  6 13:47:14 2020] LNetError: 21326:0:(o2iblnd_cb.c:3426:kiblnd_check_conns()) Timed out RDMA with 172.19.1.196@o2ib100 (108): c: 8, oc: 0, rc: 8 

Watchdog complains about hung kernel tasks:

[Thu Aug  6 13:47:28 2020] INFO: task kworker/43:1:584 blocked for more than 120 seconds.
[Thu Aug  6 13:47:28 2020] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Thu Aug  6 13:47:28 2020] kworker/43:1    D ffff98cc7df2e2a0     0   584      2 0x00000080
[Thu Aug  6 13:47:28 2020] Workqueue: ib_cm cm_work_handler [ib_cm]
[Thu Aug  6 13:47:28 2020] Call Trace:
[Thu Aug  6 13:47:28 2020]  [<ffffffffa9bb4cb9>] schedule_preempt_disabled+0x39/0x90
[Thu Aug  6 13:47:28 2020]  [<ffffffffa9bb27ef>] __mutex_lock_slowpath+0x10f/0x250
[Thu Aug  6 13:47:28 2020]  [<ffffffffa9bb1a82>] mutex_lock+0x32/0x42
[Thu Aug  6 13:47:28 2020]  [<ffffffffc14d4aa1>] lnet_nid2peerni_locked+0x71/0x150 [lnet]
[Thu Aug  6 13:47:28 2020]  [<ffffffffc14c1591>] lnet_parse+0x791/0x1200 [lnet]
[Thu Aug  6 13:47:28 2020]  [<ffffffffc0ad23a3>] kiblnd_handle_rx+0x223/0x6f0 [ko2iblnd]
[Thu Aug  6 13:47:28 2020]  [<ffffffffa94e26ea>] ? try_to_wake_up+0x19a/0x3c0
[Thu Aug  6 13:47:28 2020]  [<ffffffffc0ad28ef>] kiblnd_handle_early_rxs+0x7f/0x120 [ko2iblnd]
[Thu Aug  6 13:47:28 2020]  [<ffffffffc0ad385e>] kiblnd_connreq_done+0x29e/0x6e0 [ko2iblnd]
[Thu Aug  6 13:47:28 2020]  [<ffffffffc0ad6617>] kiblnd_cm_callback+0x11e7/0x23b0 [ko2iblnd]
[Thu Aug  6 13:47:28 2020]  [<ffffffffa94c2eda>] ? __queue_delayed_work+0xaa/0x1a0
[Thu Aug  6 13:47:28 2020]  [<ffffffffc11dffbf>] cma_ib_handler+0x13f/0x2b0 [rdma_cm]
[Thu Aug  6 13:47:28 2020]  [<ffffffffc0bdafab>] cm_process_work+0x2b/0x140 [ib_cm]
[Thu Aug  6 13:47:28 2020]  [<ffffffffc0bdd3f3>] cm_work_handler+0xb53/0x158a [ib_cm]
[Thu Aug  6 13:47:28 2020]  [<ffffffffa94c197e>] ? move_linked_works+0x5e/0x90
[Thu Aug  6 13:47:29 2020]  [<ffffffffa94c20a3>] ? pwq_activate_delayed_work+0x43/0xe0
[Thu Aug  6 13:47:29 2020]  [<ffffffffa94c43cf>] process_one_work+0x18f/0x4a0
[Thu Aug  6 13:47:29 2020]  [<ffffffffa94c51b6>] worker_thread+0x126/0x3e0
[Thu Aug  6 13:47:29 2020]  [<ffffffffa94c5090>] ? rescuer_thread+0x430/0x430
[Thu Aug  6 13:47:29 2020]  [<ffffffffa94cca01>] kthread+0xd1/0xe0
[Thu Aug  6 13:47:29 2020]  [<ffffffffa94cc930>] ? insert_kthread_work+0x40/0x40
[Thu Aug  6 13:47:29 2020]  [<ffffffffa9bc0f64>] ret_from_fork_nospec_begin+0xe/0x21
[Thu Aug  6 13:47:29 2020]  [<ffffffffa94cc930>] ? insert_kthread_work+0x40/0x40

Utility commands hang on the router with these stacks:

[root@corona86:pass2]# for x in 21845 25481 18055; do ps -fp $x; cat /proc/${x}/stack; echo; done       
UID         PID   PPID  C STIME TTY          TIME CMD
root      21845      1  0 13:49 ?        00:00:00 lctl list_nids
[<ffffffffc14a0bf6>] LNetNIInit+0x46/0xc20 [lnet]
[<ffffffffc14c29ae>] lnet_ioctl+0x4e/0x270 [lnet]
[<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
[<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
[<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
[<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
[<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
[<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
[<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
[<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff

UID         PID   PPID  C STIME TTY          TIME CMD
root      25481      1  0 14:07 ?        00:00:00 lctl ping 192.168.128.4@o2ib36
[<ffffffffc14a0bf6>] LNetNIInit+0x46/0xc20 [lnet]
[<ffffffffc14c29ae>] lnet_ioctl+0x4e/0x270 [lnet]
[<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
[<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
[<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
[<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
[<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
[<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
[<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
[<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff

UID         PID   PPID  C STIME TTY          TIME CMD
root      18055  25762  0 14:28 pts/3    00:00:00 lnetctl lnet configure
[<ffffffffc14c28f6>] lnet_configure+0x16/0x80 [lnet]
[<ffffffffc14c2bb5>] lnet_ioctl+0x255/0x270 [lnet]
[<ffffffffa9bbbcbf>] notifier_call_chain+0x4f/0x70
[<ffffffffa94d34ed>] __blocking_notifier_call_chain+0x4d/0x70
[<ffffffffa94d3526>] blocking_notifier_call_chain+0x16/0x20
[<ffffffffc0a5cae3>] libcfs_ioctl+0x2a3/0x510 [libcfs]
[<ffffffffc0a58887>] libcfs_psdev_ioctl+0x67/0xf0 [libcfs]
[<ffffffffa9675800>] do_vfs_ioctl+0x420/0x6d0
[<ffffffffa9675b51>] SyS_ioctl+0xa1/0xc0
[<ffffffffa9bc1112>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff

The router node is a dual-socket AMD EPYC 7401 24-core system. The internal fabric is managed by a Mellanox subnet manager (SM), and Lustre was built against MOFED.



 Comments   
Comment by Olaf Faaland [ 07/Aug/20 ]

For my records, my local ticket is TOSS4861.

Comment by Olaf Faaland [ 07/Aug/20 ]

As far as I can tell both the internal fabric and the SAN are fine. The subnet managers aren't reporting issues, and I can ping (/bin/ping) nodes on both the SAN and the internal fabric successfully.

edit: ICMP pings are not working now. Last night I could ping fine, this morning I cannot. The LNet symptoms were present even when pings were succeeding 100% of the time.

Comment by Peter Jones [ 07/Aug/20 ]

Serguei

Can you please advise?

Thanks

Peter

Comment by Olaf Faaland [ 07/Aug/20 ]

Load average on the router is 52, but sysrq "l" shows no processes on-core aside from the one handling the sysrq. I assume that means some group of processes, perhaps LNet threads, is repeatedly waking up all at the same time and then immediately going back to sleep.
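
For reference, a minimal way to repeat that check (assumes root and that sysrq is enabled via kernel.sysrq):

# show a backtrace of all active CPUs, then read it back from the kernel log
echo l > /proc/sysrq-trigger
dmesg | tail -n 100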

Comment by Olaf Faaland [ 10/Aug/20 ]

Serguei,
Please let me know if there's other information you would like me to get from the node.
thanks,
Olaf

Comment by Serguei Smirnov [ 10/Aug/20 ]

Hi Olaf,

Did you just upgrade your routers to version 2.12.5 and then start seeing this problem? If so, what version were you using before?

It looks like "lnetctl lnet configure" is getting stuck. Does it complete if you just load the modules and run it by hand?
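
If it helps, a minimal by-hand sequence would be something like:

modprobe lnet
lnetctl lnet configure --all   # this is the step that appears to hang
lctl list_nids                 # should return promptly once configured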

Thanks,

Serguei.

Comment by Olaf Faaland [ 10/Aug/20 ]

Hi Serguei,

Yes, this symptom appeared after booting the cluster for the first time as 2.12.5. The cluster was at 2.12.4_6.chaos before the upgrade.

At the time I ran lnetctl lnet configure, the modules were already loaded.

The module parameters look like this:

options lnet forwarding="enabled" \
        networks="o2ib100(san0),o2ib36(hsi0)" \
        routes="o2ib600 172.19.2.[22-25]@o2ib100" 

options libcfs libcfs_panic_on_lbug=1
options libcfs libcfs_debug=0x3060580
options ptlrpc at_min=45
options ptlrpc at_max=600
options ksocklnd keepalive_count=100
options ksocklnd keepalive_idle=30
options lnet check_routers_before_use=1

We use systemd to start lnet. The unit file looks like this:

[Unit]
Description=lnet management

Requires=network-online.target
After=network-online.target

Before=pacemaker_remote.service
After=rc-local.service

ConditionPathExists=!/proc/sys/lnet/
ConditionPathExists=!/etc/badnode

[Service]
Type=oneshot
RemainAfterExit=true
ExecStart=/sbin/modprobe lnet
ExecStart=/usr/sbin/lnetctl lnet configure --all
ExecStart=/usr/sbin/lnetctl import /etc/lnet.conf
ExecStart=/usr/sbin/lnetctl set discovery 0
ExecStop=/usr/sbin/lustre_rmmod ptlrpc
ExecStop=/usr/sbin/lnetctl lnet unconfigure
ExecStop=/usr/sbin/lustre_rmmod libcfs ldiskfs

[Install]
WantedBy=multi-user.target
RequiredBy=pacemaker_remote.service

and /etc/lnet.conf is all comments:

[root@corona86:~]# grep -v '^#' /etc/lnet.conf
[root@corona86:~]# 

thanks,
Olaf

Comment by Olaf Faaland [ 10/Aug/20 ]

The issue is intermittent.

The cluster has two router nodes.  Both had the same symptoms initially.  One was rebooted and came up without these errors and without hanging lctl/lnetctl commands.  The other node, corona86 (the one reported here), was rebooted and still had the issue.

Comment by Serguei Smirnov [ 10/Aug/20 ]

Could you please try changing "check_routers_before_use" to be 0? This could help to narrow down where the initialization is getting stuck. Also, could you please make sure lnet logging is enabled (lctl set_param debug=+net)?
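
A minimal sketch of those two steps; the modprobe.d file name below is just an example, and lctl dk is used to dump the debug buffer to a file:

# set before the lnet module is loaded
echo 'options lnet check_routers_before_use=0' > /etc/modprobe.d/lnet-debug.conf

# enable LNet network debugging at runtime and dump the buffer for attachment
lctl set_param debug=+net
lctl dk /tmp/dk.$(hostname).$(date +%s).txt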

Thanks,

Serguei.

Comment by Olaf Faaland [ 10/Aug/20 ]

Hi Serguei,

I've set check_routers_before_use=0 and since then haven't seen the issue (4 reboots).  I'm repeating the experiment now.  I'll post the results in about an hour.

thanks,

Olaf

Comment by Serguei Smirnov [ 10/Aug/20 ]

Hi Olaf,

I'm wondering about the state of the routers that you have configured in lnet options. Can they be "lnetctl ping"ed at the time the router under test is being rebooted?

Thanks,

Serguei.

Comment by Olaf Faaland [ 10/Aug/20 ]

Hi Serguei,

lnetctl ping of those routers was successful from corona85, while corona86 was booting and then entered the problem state.

I had about 15 reboots of corona86 in a row without the issue when check_routers_before_use=0, and as soon as I set it to 1 and rebooted, corona86 entered the problem state.

thanks,
Olaf

Comment by Serguei Smirnov [ 11/Aug/20 ]

Hi Olaf,

With "check_routers_before_use" on, the issue appears to be that corona86 stalls initialization because it believes that not all of the routers on its list are "alive", even though they seem to be pingable from another node. When you provide the lnet debug log we should be able to see more. Please capture one with  "check_routers_before_use" turned off, too (for this one, allow ~2 minutes). 

I checked with Amir on this. Could you please also provide the output of "lnetctl global show" for corona86?  If the health feature is on, you can try disabling and see if there's any effect. 
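
For reference, a sketch of those checks on corona86, assuming the 2.12 lnetctl syntax:

# show transaction timeout, retry count, health sensitivity, etc.
lnetctl global show

# disable LNet health at runtime to test the effect
lnetctl set health_sensitivity 0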

Thanks,

Serguei.

Comment by Olaf Faaland [ 11/Aug/20 ]

Hi Serguei,
I uploaded two files:
pass5.check_1.tgz - check_routers_before_use is set to 1 and we see the issue
pass6.check_0.tgz - check_routers_before_use is set to 0 and we do not see the issue

Both have modprobe.conf files, dmesg, and debug logs. The one where check_routers_before_use is 0 has the output of lnetctl export, which includes the global settings. That output isn't present for check_routers_before_use=1 because the lnetctl command hangs, as reported originally.

thanks
Olaf

Comment by Olaf Faaland [ 11/Aug/20 ]

Disabling lnet health also seems to work around the issue.  3 reboots without incident, with check_routers_before_use=1 and lnet_health_sensitivity=0

I've uploaded logs in pass9.check_1.health_0.tgz including debug logs, dmesg, modprobe.d files, and lnetctl export output.

Comment by Olaf Faaland [ 12/Aug/20 ]

In case it helps, the relevant lnet topology is:

[corona compute nodes]--o2ib36/coronaIBfabric---[corona85,86]--o2ib100/IB_SAN_B451--[orelic2,3,4,5]--10gigE--[zrelic2,3,4,5]--o2ib600/IB_SAN_B654--[two lustre clusters]

So the gateways specified in the router lnet module parameter are the orelic nodes in that diagram.

# corona86 NID on the IB_SAN_B451 net is  172.19.2.27@o2ib100
[root@orelic2:~]# lnetctl export | grep -A14 "primary.*172.19.2.27@o2ib100"
    - primary nid: 172.19.2.27@o2ib100
      Multi-Rail: False
      peer ni:
        - nid: 172.19.2.27@o2ib100
          state: up
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: -23
          tx_q_num_of_buf: 0
          available_rtr_credits: 8
          min_rtr_credits: -74
          send_count: 1486791834
          recv_count: 1548015297
          drop_count: 6805
          refcount: 4
[root@orelic2:~]# rpm -q lustre
lustre-2.10.8_9.chaos-1.ch6.x86_64

 

I wonder if our buffer or credit configuration is causing messages to be dropped on the orelic nodes, which then causes ping failures on corona86?
See min_tx_credits on the routes between a pair of orelic and zrelic nodes (a quick way to inspect the credit settings is sketched after these outputs):

[root@zrelic2:~]# lctl list_nids
172.16.70.62@tcp
172.19.3.62@o2ib600

[root@zrelic2:~]# lnetctl export | grep -A10 'primary.*172.16.66.22@tcp'
    - primary nid: 172.16.66.22@tcp
      Multi-Rail: False
      peer ni:
        - nid: 172.16.66.22@tcp
          state: up
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: -233
          tx_q_num_of_buf: 0
          available_rtr_credits: 8
          min_rtr_credits: -4

and

[root@orelic2:~]# lctl list_nids
172.16.66.22@tcp
172.19.2.22@o2ib100

[root@orelic2:~]# lnetctl export | grep -A10 'primary.*172.16.70.62@tcp'
    - primary nid: 172.16.70.62@tcp
      Multi-Rail: False
      peer ni:
        - nid: 172.16.70.62@tcp
          state: up
          max_ni_tx_credits: 8
          available_tx_credits: 8
          min_tx_credits: -397
          tx_q_num_of_buf: 0
          available_rtr_credits: 8
          min_rtr_credits: -5
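
In case it's useful, a rough way to look at the credit settings referred to above (the ko2iblnd parameter names are as found on 2.12-era builds; adjust if yours differ):

# full LNet state, including the peer credit counters shown above
lnetctl export

# o2iblnd credit-related module parameters on a router
grep . /sys/module/ko2iblnd/parameters/peer_credits \
       /sys/module/ko2iblnd/parameters/credits
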
Comment by Serguei Smirnov [ 12/Aug/20 ]

Hi Olaf, 

It may be a good idea to check the credit configuration, but I don't think it is related to the issue with the start-up we're trying to debug.

Here are a few more experiments that would be nice to try:

1) With check_routers_before_use=1 and lnet_health_sensitivity=100, try setting retry_count=1 and transaction_timeout=100 (sketched below).

2) This is unlikely but still would be good to check: in addition to the Linux routing setup described here, the accept_local flag should be set to 1.

Amir and I were trying to track the first attempt to send to the routers, but it looks like that moment is not captured in the logs. Would it be possible to start lnet logging sooner? You should be able to reproduce without rebooting the node; just reloading the modules should work. It may give us more clues as to why the connection times out in one scenario but not the other.
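
A sketch of how those settings could be applied before module load, plus a reload-with-early-logging sequence. The modprobe.d file name is only an example, and the lnet parameter names are as I recall them for 2.12, so adjust to whatever appears under /sys/module/lnet/parameters/:

# experiment 1: module options applied before lnet loads
cat > /etc/modprobe.d/lnet-test.conf <<'EOF'
options lnet check_routers_before_use=1
options lnet lnet_health_sensitivity=100
options lnet lnet_retry_count=1
options lnet lnet_transaction_timeout=100
EOF

# experiment 2: linux routing prerequisite
sysctl -w net.ipv4.conf.all.accept_local=1

# reproduce without rebooting, with full logging enabled from the start
lustre_rmmod
modprobe lnet
lctl set_param debug=-1
lnetctl lnet configure --all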

Thanks,

Serguei 

Comment by Olaf Faaland [ 12/Aug/20 ]

Hi Serguei,
OK. I'll try experiments 1 and 2.
The reboots are because it's often slower to wait for lnet to unload than it is to reboot the node.

It's odd that the initial send attempt was not captured. I'm setting
options libcfs libcfs_debug=-1
via a modprobe.d config file, so I don't know how I could have missed it. Which "pass" contains the debug log dump you are referring to, so I can look and see if I can understand what happened?

thanks

Comment by Serguei Smirnov [ 12/Aug/20 ]

Hi Olaf,

For example, if I search for "172.19.2.22" in the debug log of pass9.check_1.health_0, I see that the first occurrence is the callback that signals that the connection is established. If I understand correctly, there should be messages related to queuing the tx prior to this, leading to initiating and establishing the connection.

Thanks,

Serguei.

Comment by Olaf Faaland [ 12/Aug/20 ]

Hi Serguei,
I uploaded logs for the test with transaction_timeout=100 and a re-run of the test with sensitivity=0. See logs with names pass10* and pass11*.
thanks,
Olaf

Comment by Serguei Smirnov [ 12/Aug/20 ]

Hi Olaf,

Just to clarify, did the system hang in pass10 or pass11?

Thanks,

Serguei

Comment by Olaf Faaland [ 12/Aug/20 ]

Hi Serguei,
The system did not hang on startup in either pass10 or pass11.
For pass10, I was able to stop lnet and unload the modules.
For pass11, the system hung when I stopped lnet, with console log errors like these:

2020-08-12 14:06:38 [79017.582133] LNetError: 80234:0:(lib-move.c:4391:lnet_parse()) 172.19.1.184@o2ib100, src 172.19.1.184@o2ib100: Dropping GET (error -108 looking up sender)
2020-08-12 14:06:38 [79017.595904] LNetError: 80234:0:(lib-move.c:4391:lnet_parse()) Skipped 83936 previous similar messages
2020-08-12 14:13:39 [79438.224874] LNet: 80439:0:(api-ni.c:2010:lnet_clear_zombies_nis_locked()) Waiting for zombie LNI 172.19.2.27@o2ib100
2020-08-12 14:16:39 [79617.606824] LNetError: 80204:0:(lib-move.c:4391:lnet_parse()) 172.19.2.22@o2ib100, src 172.19.3.102@o2ib600: Dropping PUT (error -108 looking up sender)
2020-08-12 14:16:42 [79617.620470] LNetError: 80204:0:(lib-move.c:4391:lnet_parse()) Skipped 88073 previous similar messages
2020-08-12 14:26:40 [80217.667311] LNetError: 80224:0:(lib-move.c:4391:lnet_parse()) 172.19.1.229@o2ib100, src 172.19.1.229@o2ib100: Dropping GET (error -108 looking up sender)
2020-08-12 14:26:40 [80217.681198] LNetError: 80224:0:(lib-move.c:4391:lnet_parse()) Skipped 89526 previous similar messages
2020-08-12 14:36:40 [80817.759190] LNetError: 80224:0:(lib-move.c:4391:lnet_parse()) 172.19.1.229@o2ib100, src 172.19.1.229@o2ib100: Dropping GET (error -108 looking up sender)
2020-08-12 14:36:40 [80817.772927] LNetError: 80224:0:(lib-move.c:4391:lnet_parse()) Skipped 83044 previous similar messages

and I ended up rebooting the node.

edit: I pasted in the wrong log entries initially

Comment by Olaf Faaland [ 13/Aug/20 ]

Hi Serguei,

I'm considering continuing my rollout with settings

options lnet check_routers_before_use=1
options lnet lnet_health_sensitivity=0 

Do you have any clearer idea where the issue is yet? Do you have any idea whether the issue is really worked around this way, or just made less obvious?

thanks,
Olaf

Comment by Serguei Smirnov [ 14/Aug/20 ]

Hi Olaf,

My current understanding is that the issue is the connection taking a long time to establish, exceeding the lnd timeout. Something then goes wrong, the tx is retried too soon, and that makes things even worse. I believe this is why there's no more hanging when the transaction timeout, from which the lnd timeout is derived, is increased one way or another. That said, I'm not sure why this behaviour was not seen with the release you used previously, as I haven't been able to locate a change in 2.12.5 that could affect this.

So in my opinion, everything that you've tried successfully so far is a workaround: the issue still exists, but is hidden because a long enough lnd timeout is in effect.
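
For what it's worth, a quick way to see and adjust the values in play; my understanding (an assumption, not verified against the 2.12 source) is that the lnd timeout is derived roughly as transaction_timeout / (retry_count + 1):

# current transaction timeout, retry count, health sensitivity
lnetctl global show

# raise the transaction timeout (and hence the derived lnd timeout) at runtime
lnetctl set transaction_timeout 100
lnetctl set retry_count 1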

Thanks,

Serguei.

Comment by Olaf Faaland [ 14/Aug/20 ]

Hi Serguei, Thanks

Comment by Olaf Faaland [ 26/Aug/20 ]

Hi Serguei,

Have you learned anything further?

thanks

Comment by Serguei Smirnov [ 26/Aug/20 ]

Hi Olaf,

I'm trying to simulate the issue using eth as I currently have no access to IB. I haven't been able to figure it out yet.

Did you have a chance to check the Linux routing setup on your machines? I'm wondering about the value of the accept_local flag. This can affect local interface recovery. There's more detail to be found here: http://wiki.lustre.org/LNet_Router_Config_Guide#ARP_flux_issue_for_MR_node and here: https://wiki.whamcloud.com/display/LNet/MR+Cluster+Setup

Please try 

 sysctl -w net.ipv4.conf.all.accept_local=1

if you see that accept_local is 0 on your system.
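
To check the current value first and to persist the change across reboots (the sysctl.d file name is just an example):

sysctl net.ipv4.conf.all.accept_local    # 0 means packets with a local source address are rejected
sysctl -w net.ipv4.conf.all.accept_local=1
echo 'net.ipv4.conf.all.accept_local = 1' > /etc/sysctl.d/90-lnet-accept-local.conf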

Thanks,

Serguei.

Comment by Olaf Faaland [ 26/Aug/20 ]

Hi Serguei,

Sorry, I didn't look into that because I misread the situation where that applies. Those routers do have accept_local set to 0. I'll try it.

thanks,
Olaf

Comment by Cameron Harr [ 10/Sep/20 ]

As a data point, setting net.ipv4.conf.all.accept_local=1 doesn't appear to have an effect on lnet shutdown. I still get the following errors and indefinite hangs:

2020-09-10 10:48:14 [ 3433.237844] LNetError: 20855:0:(lib-move.c:4391:lnet_parse()) 192.168.128.123@o2ib35, src 192.168.128.123@o2ib35: Dropping GET (error -108 looking up sender)
2020-09-10 10:48:14 [ 3433.253593] LNetError: 20855:0:(lib-move.c:4391:lnet_parse()) Skipped 2 previous similar messages
2020-09-10 10:48:15 [ 3433.773088] LNetError: 20861:0:(lib-move.c:4391:lnet_parse()) 172.19.3.136@o2ib600, src 172.19.3.136@o2ib600: Dropping GET (error -108 looking up sender)
2020-09-10 10:48:15 [ 3433.788436] LNetError: 20861:0:(lib-move.c:4391:lnet_parse()) Skipped 272 previous similar messages
2020-09-10 10:48:15 [ 3434.225432] LNet: 34467:0:(api-ni.c:2010:lnet_clear_zombies_nis_locked()) Waiting for zombie LNI 172.19.3.66@o2ib600
... 
Comment by Serguei Smirnov [ 14/Sep/20 ]

Hi Cameron,

When you ran the tests with net.ipv4.conf.all.accept_local=1, did you see hangs on shutdown only?

What were check_routers_before_use and lnet_health_sensitivity set to?

Thanks,

Serguei.

 

 

Comment by Cameron Harr [ 14/Sep/20 ]

Serguei,

I can't say for certain, but judging by the settings on the 3 sibling routers, those settings were check_routers_before_use=1 and lnet_health_sensitivity=0, and I'm fairly sure I didn't change those.

Comment by Olaf Faaland [ 24/Sep/20 ]

Hi Serguei,

Reproduced even with the recommended arp settings.  The node was booted, and as you can see, lnetctl was hung.  I forgot to set debug=-1, so I can do that and reproduce if it's helpful. I did capture SysRq w; if you want that, let me know.

[root@corona81:~]# pdsh -w e85 systemctl status lnet
e85: * lnet.service - lnet management
e85:    Loaded: loaded (/etc/systemd/system/lnet.service; enabled; vendor preset: disabled)
e85:    Active: activating (start) since Thu 2020-09-24 15:57:18 PDT; 3min 14s ago
e85:   Process: 21710 ExecStart=/bin/grep --with-filename . /sys/module/lnet/parameters/lnet_health_sensitivity (code=exited, status=0/SUCCESS)
e85:   Process: 21697 ExecStart=/sbin/modprobe lnet (code=exited, status=0/SUCCESS)
e85:   Process: 21684 ExecStart=/tftpboot/dumps/set-arp-lnet set (code=exited, status=0/SUCCESS)
e85:   Process: 21676 ExecStart=/bin/bash -c echo lnet for Lustre: Build Version: $(rpm -q lustre) | /usr/bin/logger -t lnet.service -p local7.info (code=exited, status=0/SUCCESS)
e85:  Main PID: 21712 (lnetctl)
e85:    CGroup: /system.slice/lnet.service
e85:            `-21712 /usr/sbin/lnetctl lnet configure --all
e85: 
e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.san0.arp_filter = 0
e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.san0.arp_announce = 2
e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.san0.rp_filter = 0
e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.hsi0.arp_ignore = 1
e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.hsi0.arp_filter = 0
e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.hsi0.arp_announce = 2
e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.hsi0.rp_filter = 0
e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.all.rp_filter = 0
e85: Sep 24 15:57:18 corona85 set-arp-lnet[21684]: net.ipv4.conf.all.accept_local = 1
e85: Sep 24 15:57:18 corona85 grep[21710]: /sys/module/lnet/parameters/lnet_health_sensitivity:100
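
For reference, the SysRq w capture mentioned above can be grabbed along these lines (assumes root and that sysrq is enabled):

# dump blocked (uninterruptible) tasks to the kernel log, then save it
echo w > /proc/sysrq-trigger
dmesg > /tmp/dmesg.$(hostname).$(date +%s).out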

What next? Another pass with debug=-1? Another pass with debug from both this router node and the peers described in its routes? A debug patch?

thanks

Comment by Serguei Smirnov [ 24/Sep/20 ]

Hi Olaf,

Good to know this is not because of the Linux routing settings. The SysRq w output would be helpful. Regarding next steps, if you'd be OK with a debug patch, that would be great, as I still haven't seen this happen anywhere else.

Thanks,

Serguei.

Comment by Olaf Faaland [ 25/Sep/20 ]

Hi Serguei,

Yes, a debug patch would be fine.  Please just push it to gerrit so it goes through testing to show it doesn't itself blow things up, and so I can keep a record of where it came from.

I've attached dmesg output including SysRq w content.  It's called pass12.dmesg.corona85.1600988107.out

Thank you.

Comment by Gerrit Updater [ 26/Sep/20 ]

Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40059
Subject: LU-13892 lnet: lock-up during router check
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 74097295693793ca3566dccfe96abf3a7e56a659

Comment by Olaf Faaland [ 29/Sep/20 ]

Hi Serguei,

Looking at a crash dump, I found that the "lnetctl configure --all" process from the lnet.service file was the one holding the_lnet.ln_api_mutex, and it was in lnet_wait_known_routerstate().  Based on your patch and what you've said earlier, I think you figured this out a long time ago; I just mention it in case you had any doubt.
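
For the record, this is roughly how it can be confirmed with the crash utility (the vmlinux/vmcore paths are placeholders):

crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/<dump-dir>/vmcore
crash> mod -S                 # load module debuginfo so the lnet symbols resolve
crash> ps | grep lnetctl      # find the hung "lnetctl lnet configure --all" task
crash> bt <pid>               # its stack shows lnet_wait_known_routerstate()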

Thanks

Comment by Serguei Smirnov [ 29/Sep/20 ]

Hi Olaf,

You're correct about the idea behind this patch. I was waiting for it to get reviewed first, but if you already looked at the change and are comfortable with it, maybe you can give it a try? If the issue can still be reproduced, please capture the lnet debug log.

Thanks,

Serguei.

Comment by Olaf Faaland [ 06/Oct/20 ]

Hi Serguei,

The patch looks to be working properly.

Thanks!

 

Comment by Gerrit Updater [ 07/Oct/20 ]

Serguei Smirnov (ssmirnov@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40172
Subject: LU-13892 lnet: lock-up during router check
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: db1905826983024a1bb4808c5ae93d6d32446b12

Comment by Olaf Faaland [ 08/Oct/20 ]

Serguei,

Why is the "lock-up during router check" patch against b2_12 and not master?

thanks

Comment by Serguei Smirnov [ 08/Oct/20 ]

Hi Olaf,

It didn't look to me like master needed this fix. There were lots of changes on master with the introduction of MRR; I guess one of those changes moved the "wait_router_start" out from under the api_mutex.

Thanks,

Serguei.

Comment by Olaf Faaland [ 08/Oct/20 ]

Hi Serguei,
OK, thanks.

Comment by Gerrit Updater [ 22/Oct/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40172/
Subject: LU-13892 lnet: lock-up during router check
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 877d95b582db3d182d13dac4947c1f43b0e851dc

Comment by Peter Jones [ 22/Oct/20 ]

Landed for 2.12.6
