[LU-9659] lnet assert after timeout on reconnect ASSERTION( !peer_ni->ibp_accepting && !peer_ni->ibp_connecting && list_empty(&peer_ni->ibp_conns) Created: 13/Jun/17  Updated: 13/Jun/17

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: John Salinas (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: LS_RZ, prod
Environment:

Centos 7.3
Lustre 2.9.58
ZFS from coral-rc1-combined (based on 0.7.0 RC4)


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

Running on Centos 7.3, Lustre 2.9.58, build 15 coral-rc1-combined (based on 0.7.0 RC4).

Found Lustre down this morning, so I rebooted all of the Lustre servers to get ready for a demo. The MDS came up, but when I brought the OSS online it crashed. When I brought the MDS node back up, all the OSS nodes crashed as soon as I mounted Lustre:

From ssu2_oss2

[ 1886.319606] LNet: 44156:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Timed out tx for 192.168.1.101@o2ib: 48 seconds
[ 1925.319158] LNetError: 44156:0:(o2iblnd_cb.c:1355:kiblnd_reconnect_peer()) ASSERTION( !peer_ni->ibp_accepting && !peer_ni->ibp_connecting && list_empty(&peer_ni->ibp_conns) ) failed: 
[ 1925.319167] LNetError: 44156:0:(o2iblnd_cb.c:1355:kiblnd_reconnect_peer()) LBUG
[ 1925.319170] Pid: 44156, comm: kiblnd_connd
[ 1925.319171] 
Call Trace:
[ 1925.319198]  [<ffffffffa09697ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
[ 1925.319208]  [<ffffffffa096987c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[ 1925.319225]  [<ffffffffa0a6e206>] kiblnd_reconnect_peer+0x216/0x220 [ko2iblnd]
[ 1925.319234]  [<ffffffffa0a77214>] kiblnd_connd+0x464/0x900 [ko2iblnd]
[ 1925.319245]  [<ffffffff810c54c0>] ? default_wake_function+0x0/0x20
[ 1925.319253]  [<ffffffffa0a76db0>] ? kiblnd_connd+0x0/0x900 [ko2iblnd]
[ 1925.319259]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 1925.319264]  [<ffffffff810b0980>] ? kthread+0x0/0xe0
[ 1925.319269]  [<ffffffff816970d8>] ret_from_fork+0x58/0x90
[ 1925.319273]  [<ffffffff810b0980>] ? kthread+0x0/0xe0
[ 1925.319275] 
[ 1925.319277] Kernel panic - not syncing: LBUG
[ 1925.319323] CPU: 3 PID: 44156 Comm: kiblnd_connd Tainted: P           OE  ------------   3.10.0-514.16.1.el7.x86_64 #1
[ 1925.319401] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0015.012820160943 01/28/2016
[ 1925.319477]  ffffffffa0987dac 00000000665c43d0 ffff885e892f3d38 ffffffff81686ac3
[ 1925.319543]  ffff885e892f3db8 ffffffff8167feca ffffffff00000008 ffff885e892f3dc8
[ 1925.319606]  ffff885e892f3d68 00000000665c43d0 00000000665c43d0 0000000000000046
[ 1925.319668] Call Trace:
[ 1925.319696]  [<ffffffff81686ac3>] dump_stack+0x19/0x1b
[ 1925.319741]  [<ffffffff8167feca>] panic+0xe3/0x1f2
[ 1925.319789]  [<ffffffffa0969894>] lbug_with_loc+0x64/0xb0 [libcfs]
[ 1925.319844]  [<ffffffffa0a6e206>] kiblnd_reconnect_peer+0x216/0x220 [ko2iblnd]
[ 1925.319904]  [<ffffffffa0a77214>] kiblnd_connd+0x464/0x900 [ko2iblnd]
[ 1925.319957]  [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20
[ 1925.320006]  [<ffffffffa0a76db0>] ? kiblnd_check_conns+0x840/0x840 [ko2iblnd]
[ 1925.320062]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 1925.320103]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 1925.320154]  [<ffffffff816970d8>] ret_from_fork+0x58/0x90
[ 1925.320198]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140

Here is a different node ssu1_oss1:

 
[  726.973784] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
[  815.624944] Lustre: nlsdraid-OST0000: recovery is timed out, evict stale exports
[  815.624967] Lustre: nlsdraid-OST0000: disconnecting 2 stale clients
[  815.856729] Lustre: nlsdraid-OST0000: Recovery over after 5:01, of 9 clients 7 recovered and 2 were evicted.
[  876.965091] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1497360193/real 1497360193]  req@ffff885e1fea5700 x1570095499313792/t0(0) o38->nlsdraid-MDT0000-lwp-OST0000@192.168.1.101@o2ib:12/10 lens 520/544 e 0 to 1 dl 1497360248 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[  876.965100] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
[ 1151.949457] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1497360468/real 1497360468]  req@ffff885e1fea6300 x1570095499314144/t0(0) o38->nlsdraid-MDT0000-lwp-OST0000@192.168.1.101@o2ib:12/10 lens 520/544 e 0 to 1 dl 1497360523 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[ 1151.949467] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 21 previous similar messages
[ 1676.919121] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1497360993/real 1497360993]  req@ffff885e1fea1b00 x1570095499314816/t0(0) o38->nlsdraid-MDT0000-lwp-OST0000@192.168.1.101@o2ib:12/10 lens 520/544 e 0 to 1 dl 1497361048 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[ 1676.919131] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 41 previous similar messages
[ 1942.553084] LNet: 12875:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Timed out tx for 192.168.1.101@o2ib: 9 seconds
[ 1955.552348] LNet: 12875:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Timed out tx for 192.168.1.101@o2ib: 22 seconds
[ 1955.552355] LNet: 12875:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Skipped 1 previous similar message
[ 1977.551131] LNetError: 12875:0:(o2iblnd_cb.c:1355:kiblnd_reconnect_peer()) ASSERTION( !peer_ni->ibp_accepting && !peer_ni->ibp_connecting && list_empty(&peer_ni->ibp_conns) ) failed: 
[ 1977.551141] LNetError: 12875:0:(o2iblnd_cb.c:1355:kiblnd_reconnect_peer()) LBUG
[ 1977.551144] Pid: 12875, comm: kiblnd_connd
[ 1977.551146] 
Call Trace:
[ 1977.551175]  [<ffffffffa09dc7ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
[ 1977.551185]  [<ffffffffa09dc87c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[ 1977.551201]  [<ffffffffa0b96206>] kiblnd_reconnect_peer+0x216/0x220 [ko2iblnd]
[ 1977.551211]  [<ffffffffa0b9f214>] kiblnd_connd+0x464/0x900 [ko2iblnd]
[ 1977.551221]  [<ffffffff810c54c0>] ? default_wake_function+0x0/0x20
[ 1977.551230]  [<ffffffffa0b9edb0>] ? kiblnd_connd+0x0/0x900 [ko2iblnd]
[ 1977.551236]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 1977.551241]  [<ffffffff810b0980>] ? kthread+0x0/0xe0
[ 1977.551246]  [<ffffffff816970d8>] ret_from_fork+0x58/0x90
[ 1977.551251]  [<ffffffff810b0980>] ? kthread+0x0/0xe0
[ 1977.551253] 
[ 1977.551255] Kernel panic - not syncing: LBUG
[ 1977.551300] CPU: 17 PID: 12875 Comm: kiblnd_connd Tainted: P           OE  ------------   3.10.0-514.16.1.el7.x86_64 #1
[ 1977.551378] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0015.012820160943 01/28/2016
[ 1977.551454]  ffffffffa09fadac 000000009c68e137 ffff882fb8a23d38 ffffffff81686ac3
[ 1977.551519]  ffff882fb8a23db8 ffffffff8167feca ffffffff00000008 ffff882fb8a23dc8
[ 1977.551582]  ffff882fb8a23d68 000000009c68e137 000000009c68e137 0000000000000046
[ 1977.551645] Call Trace:
[ 1977.551674]  [<ffffffff81686ac3>] dump_stack+0x19/0x1b
[ 1977.551719]  [<ffffffff8167feca>] panic+0xe3/0x1f2
[ 1977.551767]  [<ffffffffa09dc894>] lbug_with_loc+0x64/0xb0 [libcfs]
[ 1977.551823]  [<ffffffffa0b96206>] kiblnd_reconnect_peer+0x216/0x220 [ko2iblnd]
[ 1977.551883]  [<ffffffffa0b9f214>] kiblnd_connd+0x464/0x900 [ko2iblnd]
[ 1977.551936]  [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20
[ 1977.551985]  [<ffffffffa0b9edb0>] ? kiblnd_check_conns+0x840/0x840 [ko2iblnd]
[ 1977.552041]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 1977.552083]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 1977.552137]  [<ffffffff816970d8>] ret_from_fork+0x58/0x90
[ 1977.552182]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140

On onyx: /scratch/dumps/lustre/jsalinas/kiblnd_reconnect_peer_assert.tgz



 Comments   
Comment by John Salinas (Inactive) [ 13/Jun/17 ]

Currently this blocks us from bringing up our cluster

Comment by Amir Shehata (Inactive) [ 13/Jun/17 ]

That assertion has been removed in the latest code. Would you be able to try the latest master?

The exact patch is:
LU-9507 lnd: Don't Assert On Reconnect with MultiQP

 

Comment by John Salinas (Inactive) [ 13/Jun/17 ]

Hi Amir, let me see if I can trigger a build with that patch.

Comment by John Salinas (Inactive) [ 13/Jun/17 ]

Any chance this could also be our MDS issue? I am not getting a dump; the last messages I see in /var/log/messages are:

Jun 13 09:46:56 mgs_mds_1 kernel: LNet: 7800:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Timed out tx for 192.168.1.103@o2ib: 4294948 seconds
Jun 13 09:46:56 mgs_mds_1 kernel: LNet: 7800:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Skipped 3 previous similar messages
Jun 13 09:46:56 mgs_mds_1 kernel: Lustre: 7812:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1497361611/real 1497361616] req@ffff882f44c59b00 x1570097395139168/t0(0) o8->nlsdraid-OST0003-osc-MDT0000@192.168.1.106@o2ib:28/4 lens 520/544 e 0 to 1 dl 1497361626 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Jun 13 09:46:56 mgs_mds_1 kernel: Lustre: 7812:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Jun 13 09:47:21 mgs_mds_1 kernel: LNet: 7800:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Timed out tx for 192.168.1.103@o2ib: 4294973 seconds
Jun 13 09:47:21 mgs_mds_1 kernel: LNet: 7800:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Skipped 3 previous similar messages
Jun 13 09:47:21 mgs_mds_1 kernel: Lustre: 7812:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1497361636/real 1497361641] req@ffff882f44c5ad00 x1570097395139264/t0(0) o8->nlsdraid-OST0003-osc-MDT0000@192.168.1.106@o2ib:28/4 lens 520/544 e 0 to 1 dl 1497361656 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Jun 13 09:47:21 mgs_mds_1 kernel: Lustre: 7812:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages
Jun 13 09:47:41 mgs_mds_1 kernel: Lustre: 7812:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1497361661/real 1497361661] req@ffff882f44c5b600 x1570097395139312/t0(0) o8->nlsdraid-OST0000-osc-MDT0000@192.168.1.103@o2ib:28/4 lens 520/544 e 0 to 1 dl 1497361686 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
Jun 13 09:47:41 mgs_mds_1 kernel: Lustre: 7812:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 3 previous similar messages

Comment by Doug Oucharek (Inactive) [ 13/Jun/17 ]

John,

Is this system using OmniPath?

Comment by Amir Shehata (Inactive) [ 13/Jun/17 ]

Can you enable net and neterror to see more info from LNet/LND:

lctl set_param debug=+net
lctl set_param debug=+neterror

Do your nodes have any OPA interfaces? Did you make sure that all the LND tunables are the same across your network?
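One way to answer the tunables question is to dump the ko2iblnd module parameters from every server and diff them against the first host's dump; any diff output means the LND tunables differ. This is a sketch, assuming passwordless root ssh; the host names are the ones from this cluster.

```shell
#!/bin/sh
# Collect ko2iblnd tunables from each server, then compare against the
# first host's dump. Host names and ssh access are assumptions.
hosts="mgs_mds_1 ssu1_oss1 ssu1_oss2 ssu2_oss1 ssu2_oss2"
for h in $hosts; do
  ssh "$h" 'grep -H . /sys/module/ko2iblnd/parameters/* | sort' > "/tmp/lnd.$h"
done
first=$(echo $hosts | cut -d' ' -f1)
for h in $hosts; do
  # Silent diff means the tunables match the reference host.
  diff -u "/tmp/lnd.$first" "/tmp/lnd.$h" && echo "$h: OK"
done
```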

Comment by John Salinas (Inactive) [ 13/Jun/17 ]

Yes, this is OPA1

Comment by John Salinas (Inactive) [ 13/Jun/17 ]

I will try to set this and see

Everything was working; I had some Lustre errors on a few nodes, so I rebooted everything, and it couldn't come up after that. Maybe some settings are inconsistent? I will look and see. I can also try the debug settings.

Comment by Doug Oucharek (Inactive) [ 13/Jun/17 ]

I think there are two issues here:

  1. Asserting on reconnects.  This is addressed by LU-9507.
  2. Timeouts on transmissions are failing connections, causing them to need to reconnect. This one is a mystery and could indicate a hardware issue, a switch that needs to be restarted, etc.

Doug

Comment by John Salinas (Inactive) [ 13/Jun/17 ]

On the MDS node we rebooted the switch and the node, set the debug flags, and we see:
[ 428.470987] Lustre: Lustre: Build Version: 2.9.58_dirty
[ 428.527447] LNet: Using FMR for registration
[ 428.576708] LNet: Added LNI 192.168.1.101@o2ib [128/2048/0/180]
[ 428.795760] Lustre: MGS: Connection restored to 1ae30eb4-5113-4ca8-1a68-afd0617f82be (at 0@lo)
[ 428.902858] device-mapper: multipath: Reinstating path 66:192.
[ 429.283680] Lustre: nlsdraid-MDT0000: Imperative Recovery not enabled, recovery window 300-900
[ 429.385910] LustreError: 7665:0:(osd_oi.c:503:osd_oid()) nlsdraid-MDT0000-osd: unsupported quota oid: 0x16
[ 433.527252] LNet: 7723:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Timed out tx for 192.168.1.103@o2ib: 4295100 seconds
[ 433.527328] Lustre: 7735:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1497380279/real 1497380283] req@ffff882f857e0600 x1570117022384320/t0(0) o8->nlsdraid-OST0002-osc-MDT0000@192.168.1.105@o2ib:28/4 lens 520/544 e 0 to 1 dl 1497380284 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1

These timeouts appear to be to Lustre OSS nodes that are not up:
192.168.1.103 ssu1_oss1 lustre1
192.168.1.104 ssu1_oss2 lustre2
192.168.1.105 ssu2_oss1 lustre3
192.168.1.106 ssu2_oss2 lustre4
192.168.1.101 mgs_mds_1 lustre5

Testing with LNet selftest, I can produce failures that look like this on the head node:
[3432045.164205] hfi1 0000:81:00.0: hfi1_0: 8051: Link down
[3432045.164236] hfi1 0000:81:00.0: hfi1_0: set_link_state: current ACTIVE, new OFFLINE
[3432046.288501] hfi1 0000:81:00.0: hfi1_0: logical state changed to PORT_DOWN (0x1)
[3432046.288574] hfi1 0000:81:00.0: hfi1_0: hfi1_ibphys_portstate: physical state changed to PHYS_OFFLINE (0x9), phy 0x90
[3432046.288587] hfi1 0000:81:00.0: hfi1_0: set_link_state: current OFFLINE, new POLL
[3432046.288688] hfi1 0000:81:00.0: hfi1_0: Entering SPC freeze

I have opened a ticket with the admin of the system to look at that.
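For reference, an LNet selftest run of the sort described above can be driven with the `lst` utility. A minimal sketch: the NIDs are taken from the host table above, while the session, group, and batch names and the test parameters (brw read, 1M) are arbitrary assumptions.

```shell
# Run a bulk read test from the MDS NID to the first OSS NID and watch
# the counters; errors or dropped RPCs here point at the fabric.
export LST_SESSION=$$
lst new_session reconnect_debug
lst add_group clients 192.168.1.101@o2ib
lst add_group servers 192.168.1.103@o2ib
lst add_batch bulk_rw
lst add_test --batch bulk_rw --from clients --to servers brw read size=1M
lst run bulk_rw
lst stat clients servers
lst end_session
```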

Comment by Doug Oucharek (Inactive) [ 13/Jun/17 ]

Yeah, seeing HFI take a port down is not a good sign. Could be a marginal cable.

Generated at Sat Feb 10 02:28:06 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.