[LU-9659] lnet assert after timeout on reconnect ASSERTION( !peer_ni->ibp_accepting && !peer_ni->ibp_connecting && list_empty(&peer_ni->ibp_conns) Created: 13/Jun/17 Updated: 13/Jun/17 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | John Salinas (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | LS_RZ, prod | ||
| Environment: | CentOS 7.3 |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
Running on CentOS 7.3, Lustre 2.9.58, build 15 coral-rc1-combined (based on 0.7.0 RC4). Found Lustre down this morning, so I rebooted all of the Lustre servers to get ready for a demo. The mds came up, but when I brought the OSS online it crashed. When I brought the mds node back up, all the OSS nodes crashed as soon as I mounted Lustre.

From ssu2_oss2:

[ 1886.319606] LNet: 44156:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Timed out tx for 192.168.1.101@o2ib: 48 seconds
[ 1925.319158] LNetError: 44156:0:(o2iblnd_cb.c:1355:kiblnd_reconnect_peer()) ASSERTION( !peer_ni->ibp_accepting && !peer_ni->ibp_connecting && list_empty(&peer_ni->ibp_conns) ) failed:
[ 1925.319167] LNetError: 44156:0:(o2iblnd_cb.c:1355:kiblnd_reconnect_peer()) LBUG
[ 1925.319170] Pid: 44156, comm: kiblnd_connd
[ 1925.319171] Call Trace:
[ 1925.319198] [<ffffffffa09697ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
[ 1925.319208] [<ffffffffa096987c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[ 1925.319225] [<ffffffffa0a6e206>] kiblnd_reconnect_peer+0x216/0x220 [ko2iblnd]
[ 1925.319234] [<ffffffffa0a77214>] kiblnd_connd+0x464/0x900 [ko2iblnd]
[ 1925.319245] [<ffffffff810c54c0>] ? default_wake_function+0x0/0x20
[ 1925.319253] [<ffffffffa0a76db0>] ? kiblnd_connd+0x0/0x900 [ko2iblnd]
[ 1925.319259] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 1925.319264] [<ffffffff810b0980>] ? kthread+0x0/0xe0
[ 1925.319269] [<ffffffff816970d8>] ret_from_fork+0x58/0x90
[ 1925.319273] [<ffffffff810b0980>] ? kthread+0x0/0xe0
[ 1925.319275]
[ 1925.319277] Kernel panic - not syncing: LBUG
[ 1925.319323] CPU: 3 PID: 44156 Comm: kiblnd_connd Tainted: P OE ------------ 3.10.0-514.16.1.el7.x86_64 #1
[ 1925.319401] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0015.012820160943 01/28/2016
[ 1925.319477] ffffffffa0987dac 00000000665c43d0 ffff885e892f3d38 ffffffff81686ac3
[ 1925.319543] ffff885e892f3db8 ffffffff8167feca ffffffff00000008 ffff885e892f3dc8
[ 1925.319606] ffff885e892f3d68 00000000665c43d0 00000000665c43d0 0000000000000046
[ 1925.319668] Call Trace:
[ 1925.319696] [<ffffffff81686ac3>] dump_stack+0x19/0x1b
[ 1925.319741] [<ffffffff8167feca>] panic+0xe3/0x1f2
[ 1925.319789] [<ffffffffa0969894>] lbug_with_loc+0x64/0xb0 [libcfs]
[ 1925.319844] [<ffffffffa0a6e206>] kiblnd_reconnect_peer+0x216/0x220 [ko2iblnd]
[ 1925.319904] [<ffffffffa0a77214>] kiblnd_connd+0x464/0x900 [ko2iblnd]
[ 1925.319957] [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20
[ 1925.320006] [<ffffffffa0a76db0>] ? kiblnd_check_conns+0x840/0x840 [ko2iblnd]
[ 1925.320062] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 1925.320103] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 1925.320154] [<ffffffff816970d8>] ret_from_fork+0x58/0x90
[ 1925.320198] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140

Here is a different node, ssu1_oss1:

[ 726.973784] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
[ 815.624944] Lustre: nlsdraid-OST0000: recovery is timed out, evict stale exports
[ 815.624967] Lustre: nlsdraid-OST0000: disconnecting 2 stale clients
[ 815.856729] Lustre: nlsdraid-OST0000: Recovery over after 5:01, of 9 clients 7 recovered and 2 were evicted.
[ 876.965091] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1497360193/real 1497360193] req@ffff885e1fea5700 x1570095499313792/t0(0) o38->nlsdraid-MDT0000-lwp-OST0000@192.168.1.101@o2ib:12/10 lens 520/544 e 0 to 1 dl 1497360248 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[ 876.965100] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
[ 1151.949457] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1497360468/real 1497360468] req@ffff885e1fea6300 x1570095499314144/t0(0) o38->nlsdraid-MDT0000-lwp-OST0000@192.168.1.101@o2ib:12/10 lens 520/544 e 0 to 1 dl 1497360523 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[ 1151.949467] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 21 previous similar messages
[ 1676.919121] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1497360993/real 1497360993] req@ffff885e1fea1b00 x1570095499314816/t0(0) o38->nlsdraid-MDT0000-lwp-OST0000@192.168.1.101@o2ib:12/10 lens 520/544 e 0 to 1 dl 1497361048 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[ 1676.919131] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 41 previous similar messages
[ 1942.553084] LNet: 12875:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Timed out tx for 192.168.1.101@o2ib: 9 seconds
[ 1955.552348] LNet: 12875:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Timed out tx for 192.168.1.101@o2ib: 22 seconds
[ 1955.552355] LNet: 12875:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Skipped 1 previous similar message
[ 1977.551131] LNetError: 12875:0:(o2iblnd_cb.c:1355:kiblnd_reconnect_peer()) ASSERTION( !peer_ni->ibp_accepting && !peer_ni->ibp_connecting && list_empty(&peer_ni->ibp_conns) ) failed:
[ 1977.551141] LNetError: 12875:0:(o2iblnd_cb.c:1355:kiblnd_reconnect_peer()) LBUG
[ 1977.551144] Pid: 12875, comm: kiblnd_connd
[ 1977.551146] Call Trace:
[ 1977.551175] [<ffffffffa09dc7ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
[ 1977.551185] [<ffffffffa09dc87c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[ 1977.551201] [<ffffffffa0b96206>] kiblnd_reconnect_peer+0x216/0x220 [ko2iblnd]
[ 1977.551211] [<ffffffffa0b9f214>] kiblnd_connd+0x464/0x900 [ko2iblnd]
[ 1977.551221] [<ffffffff810c54c0>] ? default_wake_function+0x0/0x20
[ 1977.551230] [<ffffffffa0b9edb0>] ? kiblnd_connd+0x0/0x900 [ko2iblnd]
[ 1977.551236] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 1977.551241] [<ffffffff810b0980>] ? kthread+0x0/0xe0
[ 1977.551246] [<ffffffff816970d8>] ret_from_fork+0x58/0x90
[ 1977.551251] [<ffffffff810b0980>] ? kthread+0x0/0xe0
[ 1977.551253]
[ 1977.551255] Kernel panic - not syncing: LBUG
[ 1977.551300] CPU: 17 PID: 12875 Comm: kiblnd_connd Tainted: P OE ------------ 3.10.0-514.16.1.el7.x86_64 #1
[ 1977.551378] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0015.012820160943 01/28/2016
[ 1977.551454] ffffffffa09fadac 000000009c68e137 ffff882fb8a23d38 ffffffff81686ac3
[ 1977.551519] ffff882fb8a23db8 ffffffff8167feca ffffffff00000008 ffff882fb8a23dc8
[ 1977.551582] ffff882fb8a23d68 000000009c68e137 000000009c68e137 0000000000000046
[ 1977.551645] Call Trace:
[ 1977.551674] [<ffffffff81686ac3>] dump_stack+0x19/0x1b
[ 1977.551719] [<ffffffff8167feca>] panic+0xe3/0x1f2
[ 1977.551767] [<ffffffffa09dc894>] lbug_with_loc+0x64/0xb0 [libcfs]
[ 1977.551823] [<ffffffffa0b96206>] kiblnd_reconnect_peer+0x216/0x220 [ko2iblnd]
[ 1977.551883] [<ffffffffa0b9f214>] kiblnd_connd+0x464/0x900 [ko2iblnd]
[ 1977.551936] [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20
[ 1977.551985] [<ffffffffa0b9edb0>] ? kiblnd_check_conns+0x840/0x840 [ko2iblnd]
[ 1977.552041] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 1977.552083] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 1977.552137] [<ffffffff816970d8>] ret_from_fork+0x58/0x90
[ 1977.552182] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140

On onyx: /scratch/dumps/lustre/jsalinas/kiblnd_reconnect_peer_assert.tgz |
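For reference, the LASSERT at o2iblnd_cb.c:1355 quoted above encodes the invariant that a peer being put back on the reconnect list is fully idle: no passive connect, no active connect, and no established connections. A minimal userspace sketch of that invariant follows; the struct fields mirror the names in the trace, but the reduced struct, the counter standing in for list_empty(&ibp_conns), and main() are hypothetical simplifications, not the ko2iblnd code:

    #include <assert.h>
    #include <stdbool.h>

    /* Reduced stand-in for struct kib_peer_ni: only the three conditions
     * the LASSERT checks, with the ibp_conns list mocked as a counter. */
    struct kib_peer_ni {
        bool ibp_accepting;   /* passive (incoming) connect in progress */
        bool ibp_connecting;  /* active (outgoing) connect in progress */
        int  ibp_nconns;      /* stand-in for list_empty(&ibp_conns) */
    };

    static void kiblnd_reconnect_peer_sketch(struct kib_peer_ni *p)
    {
        /* The LBUG above fires when a peer reaches the reconnect path
         * while a connect attempt or an established connection still
         * exists, i.e. when this invariant has been violated by a race. */
        assert(!p->ibp_accepting && !p->ibp_connecting &&
               p->ibp_nconns == 0);
        /* ... the real code re-issues the connection attempt here ... */
    }

    int main(void)
    {
        /* Racing state: a connect is still in flight when the connd
         * thread tries to reconnect; the assert aborts, mirroring the
         * kernel LBUG. */
        struct kib_peer_ni peer = { .ibp_connecting = true };
        kiblnd_reconnect_peer_sketch(&peer);
        return 0;
    }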
| Comments |
| Comment by John Salinas (Inactive) [ 13/Jun/17 ] |
|
Currently this blocks us from bringing up our cluster. |
| Comment by Amir Shehata (Inactive) [ 13/Jun/17 ] |
|
That assertion has been removed in the latest code. Would you be able to try the latest master? The exact patch is:
|
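The exact patch reference did not survive in this export. Purely as a hypothetical illustration of the general direction (replacing the hard assertion with a tolerant check, so a racing connect attempt defers the reconnect instead of panicking the node), and not the actual patch:

    #include <stdbool.h>
    #include <stdio.h>

    /* Same hypothetical reduced peer struct as in the sketch above. */
    struct kib_peer_ni {
        bool ibp_accepting;
        bool ibp_connecting;
        int  ibp_nconns;      /* stand-in for list_empty(&ibp_conns) */
    };

    /* Hypothetical tolerant variant: if the peer is still busy, skip
     * this reconnect request and let the in-flight attempt (or its
     * failure path) drive the peer's state machine instead. */
    static bool kiblnd_try_reconnect(struct kib_peer_ni *p)
    {
        if (p->ibp_accepting || p->ibp_connecting || p->ibp_nconns != 0) {
            fprintf(stderr, "peer busy, deferring reconnect\n");
            return false;   /* caller drops this reconnect request */
        }
        /* ... safe to re-issue the connect ... */
        return true;
    }

    int main(void)
    {
        struct kib_peer_ni peer = { .ibp_connecting = true };
        return kiblnd_try_reconnect(&peer) ? 0 : 1;  /* exits 1: deferred */
    }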
| Comment by John Salinas (Inactive) [ 13/Jun/17 ] |
|
Hi Amir, let's see if I can trigger a build with that patch. |
| Comment by John Salinas (Inactive) [ 13/Jun/17 ] |
|
Any chance this could also be our mds issue? I am not getting a dump; the last messages I see in /var/log/messages are:

Jun 13 09:46:56 mgs_mds_1 kernel: LNet: 7800:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Timed out tx for 192.168.1.103@o2ib: 4294948 seconds |
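One plausible but unconfirmed reading of that huge timeout: 4294948 sits just below 2^32/1000, which is what you get when a small negative interval in jiffies (a deadline still in the future, roughly 19 seconds at an assumed HZ of 1000) is interpreted as an unsigned 32-bit value before being converted to seconds. A standalone C illustration of that wraparound, with the HZ and delta values assumed for the example:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t hz = 1000;      /* assumed tick rate for the example */
        int32_t delta = -19296;  /* now - deadline, in jiffies: deadline
                                  * is ~19.3 s in the future */
        /* Interpreting the negative delta as unsigned before dividing
         * wraps it to 4294948000, giving the bogus seconds value. */
        uint32_t secs = (uint32_t)delta / hz;
        printf("%u seconds\n", (unsigned)secs); /* prints: 4294948 seconds */
        return 0;
    }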
| Comment by Doug Oucharek (Inactive) [ 13/Jun/17 ] |
|
John, Is this system using OmniPath? |
| Comment by Amir Shehata (Inactive) [ 13/Jun/17 ] |
|
Can you enable net and neterror to see more info from LNet/LND?

lctl set_param debug=+net
lctl set_param debug=+neterror

Do your nodes have any OPA interfaces? Did you make sure that all the LND tunables are the same across your network? |
| Comment by John Salinas (Inactive) [ 13/Jun/17 ] |
|
Yes, this is OPA1 |
| Comment by John Salinas (Inactive) [ 13/Jun/17 ] |
|
I will try to set this and see. Everything was working; I had some Lustre errors on a few nodes, so I rebooted everything, and it couldn't come up after that. Maybe some settings are inconsistent? I will look and see. I can also try the debug settings. |
| Comment by Doug Oucharek (Inactive) [ 13/Jun/17 ] |
|
I think there are two issues here:
Doug |
| Comment by John Salinas (Inactive) [ 13/Jun/17 ] |
|
On the mds node we rebooted the switch and the node, set the debug flags, and we see:

These timeouts appear to be to Lustre OSS nodes that are not up.

Testing with lnet selftest I can produce failures that look like this on the head node:

I have opened a ticket with the admin of the system to look at that. |
| Comment by Doug Oucharek (Inactive) [ 13/Jun/17 ] |
|
Yeah, seeing HFI take a port down is not a good sign. Could be a marginal cable. |