Description
Running on CentOS 7.3, Lustre 2.9.58, build 15 coral-rc1-combined (based on 0.7.0 RC4).
Found Lustre down this morning, so I rebooted all of the Lustre servers to get ready for a demo. The MDS came up, but when I brought the OSS online it crashed. After I brought the MDS node back up, all of the OSS nodes crashed as soon as I mounted Lustre:
From ssu2_oss2:
[ 1886.319606] LNet: 44156:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Timed out tx for 192.168.1.101@o2ib: 48 seconds
[ 1925.319158] LNetError: 44156:0:(o2iblnd_cb.c:1355:kiblnd_reconnect_peer()) ASSERTION( !peer_ni->ibp_accepting && !peer_ni->ibp_connecting && list_empty(&peer_ni->ibp_conns) ) failed:
[ 1925.319167] LNetError: 44156:0:(o2iblnd_cb.c:1355:kiblnd_reconnect_peer()) LBUG
[ 1925.319170] Pid: 44156, comm: kiblnd_connd
[ 1925.319171] Call Trace:
[ 1925.319198] [<ffffffffa09697ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
[ 1925.319208] [<ffffffffa096987c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[ 1925.319225] [<ffffffffa0a6e206>] kiblnd_reconnect_peer+0x216/0x220 [ko2iblnd]
[ 1925.319234] [<ffffffffa0a77214>] kiblnd_connd+0x464/0x900 [ko2iblnd]
[ 1925.319245] [<ffffffff810c54c0>] ? default_wake_function+0x0/0x20
[ 1925.319253] [<ffffffffa0a76db0>] ? kiblnd_connd+0x0/0x900 [ko2iblnd]
[ 1925.319259] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 1925.319264] [<ffffffff810b0980>] ? kthread+0x0/0xe0
[ 1925.319269] [<ffffffff816970d8>] ret_from_fork+0x58/0x90
[ 1925.319273] [<ffffffff810b0980>] ? kthread+0x0/0xe0
[ 1925.319275]
[ 1925.319277] Kernel panic - not syncing: LBUG
[ 1925.319323] CPU: 3 PID: 44156 Comm: kiblnd_connd Tainted: P OE ------------ 3.10.0-514.16.1.el7.x86_64 #1
[ 1925.319401] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0015.012820160943 01/28/2016
[ 1925.319477] ffffffffa0987dac 00000000665c43d0 ffff885e892f3d38 ffffffff81686ac3
[ 1925.319543] ffff885e892f3db8 ffffffff8167feca ffffffff00000008 ffff885e892f3dc8
[ 1925.319606] ffff885e892f3d68 00000000665c43d0 00000000665c43d0 0000000000000046
[ 1925.319668] Call Trace:
[ 1925.319696] [<ffffffff81686ac3>] dump_stack+0x19/0x1b
[ 1925.319741] [<ffffffff8167feca>] panic+0xe3/0x1f2
[ 1925.319789] [<ffffffffa0969894>] lbug_with_loc+0x64/0xb0 [libcfs]
[ 1925.319844] [<ffffffffa0a6e206>] kiblnd_reconnect_peer+0x216/0x220 [ko2iblnd]
[ 1925.319904] [<ffffffffa0a77214>] kiblnd_connd+0x464/0x900 [ko2iblnd]
[ 1925.319957] [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20
[ 1925.320006] [<ffffffffa0a76db0>] ? kiblnd_check_conns+0x840/0x840 [ko2iblnd]
[ 1925.320062] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 1925.320103] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 1925.320154] [<ffffffff816970d8>] ret_from_fork+0x58/0x90
[ 1925.320198] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Here is a different node, ssu1_oss1:
[ 726.973784] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
[ 815.624944] Lustre: nlsdraid-OST0000: recovery is timed out, evict stale exports
[ 815.624967] Lustre: nlsdraid-OST0000: disconnecting 2 stale clients
[ 815.856729] Lustre: nlsdraid-OST0000: Recovery over after 5:01, of 9 clients 7 recovered and 2 were evicted.
[ 876.965091] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1497360193/real 1497360193] req@ffff885e1fea5700 x1570095499313792/t0(0) o38->nlsdraid-MDT0000-lwp-OST0000@192.168.1.101@o2ib:12/10 lens 520/544 e 0 to 1 dl 1497360248 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[ 876.965100] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
[ 1151.949457] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1497360468/real 1497360468] req@ffff885e1fea6300 x1570095499314144/t0(0) o38->nlsdraid-MDT0000-lwp-OST0000@192.168.1.101@o2ib:12/10 lens 520/544 e 0 to 1 dl 1497360523 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[ 1151.949467] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 21 previous similar messages
[ 1676.919121] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1497360993/real 1497360993] req@ffff885e1fea1b00 x1570095499314816/t0(0) o38->nlsdraid-MDT0000-lwp-OST0000@192.168.1.101@o2ib:12/10 lens 520/544 e 0 to 1 dl 1497361048 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
[ 1676.919131] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 41 previous similar messages
[ 1942.553084] LNet: 12875:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Timed out tx for 192.168.1.101@o2ib: 9 seconds
[ 1955.552348] LNet: 12875:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Timed out tx for 192.168.1.101@o2ib: 22 seconds
[ 1955.552355] LNet: 12875:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Skipped 1 previous similar message
[ 1977.551131] LNetError: 12875:0:(o2iblnd_cb.c:1355:kiblnd_reconnect_peer()) ASSERTION( !peer_ni->ibp_accepting && !peer_ni->ibp_connecting && list_empty(&peer_ni->ibp_conns) ) failed:
[ 1977.551141] LNetError: 12875:0:(o2iblnd_cb.c:1355:kiblnd_reconnect_peer()) LBUG
[ 1977.551144] Pid: 12875, comm: kiblnd_connd
[ 1977.551146] Call Trace:
[ 1977.551175] [<ffffffffa09dc7ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
[ 1977.551185] [<ffffffffa09dc87c>] lbug_with_loc+0x4c/0xb0 [libcfs]
[ 1977.551201] [<ffffffffa0b96206>] kiblnd_reconnect_peer+0x216/0x220 [ko2iblnd]
[ 1977.551211] [<ffffffffa0b9f214>] kiblnd_connd+0x464/0x900 [ko2iblnd]
[ 1977.551221] [<ffffffff810c54c0>] ? default_wake_function+0x0/0x20
[ 1977.551230] [<ffffffffa0b9edb0>] ? kiblnd_connd+0x0/0x900 [ko2iblnd]
[ 1977.551236] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 1977.551241] [<ffffffff810b0980>] ? kthread+0x0/0xe0
[ 1977.551246] [<ffffffff816970d8>] ret_from_fork+0x58/0x90
[ 1977.551251] [<ffffffff810b0980>] ? kthread+0x0/0xe0
[ 1977.551253]
[ 1977.551255] Kernel panic - not syncing: LBUG
[ 1977.551300] CPU: 17 PID: 12875 Comm: kiblnd_connd Tainted: P OE ------------ 3.10.0-514.16.1.el7.x86_64 #1
[ 1977.551378] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0015.012820160943 01/28/2016
[ 1977.551454] ffffffffa09fadac 000000009c68e137 ffff882fb8a23d38 ffffffff81686ac3
[ 1977.551519] ffff882fb8a23db8 ffffffff8167feca ffffffff00000008 ffff882fb8a23dc8
[ 1977.551582] ffff882fb8a23d68 000000009c68e137 000000009c68e137 0000000000000046
[ 1977.551645] Call Trace:
[ 1977.551674] [<ffffffff81686ac3>] dump_stack+0x19/0x1b
[ 1977.551719] [<ffffffff8167feca>] panic+0xe3/0x1f2
[ 1977.551767] [<ffffffffa09dc894>] lbug_with_loc+0x64/0xb0 [libcfs]
[ 1977.551823] [<ffffffffa0b96206>] kiblnd_reconnect_peer+0x216/0x220 [ko2iblnd]
[ 1977.551883] [<ffffffffa0b9f214>] kiblnd_connd+0x464/0x900 [ko2iblnd]
[ 1977.551936] [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20
[ 1977.551985] [<ffffffffa0b9edb0>] ? kiblnd_check_conns+0x840/0x840 [ko2iblnd]
[ 1977.552041] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[ 1977.552083] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[ 1977.552137] [<ffffffff816970d8>] ret_from_fork+0x58/0x90
[ 1977.552182] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
Crash dumps are available on onyx: /scratch/dumps/lustre/jsalinas/kiblnd_reconnect_peer_assert.tgz