Lustre / LU-9659

lnet assert after timeout on reconnect: ASSERTION( !peer_ni->ibp_accepting && !peer_ni->ibp_connecting && list_empty(&peer_ni->ibp_conns) )


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.9.0
    • Environment: CentOS 7.3, Lustre 2.9.58, ZFS from coral-rc1-combined (based on 0.7.0 RC4)
    • Severity: 3

    Description

      Running on CentOS 7.3, Lustre 2.9.58, build 15 coral-rc1-combined (based on 0.7.0 RC4).

      Found Lustre down this morning, so I rebooted all of the Lustre servers to get ready for a demo. The MDS came up, but when I brought the OSS online it crashed. When I brought the MDS node back up, all of the OSS nodes crashed as soon as I mounted Lustre:

      From ssu2_oss2:

      [ 1886.319606] LNet: 44156:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Timed out tx for 192.168.1.101@o2ib: 48 seconds
      [ 1925.319158] LNetError: 44156:0:(o2iblnd_cb.c:1355:kiblnd_reconnect_peer()) ASSERTION( !peer_ni->ibp_accepting && !peer_ni->ibp_connecting && list_empty(&peer_ni->ibp_conns) ) failed: 
      [ 1925.319167] LNetError: 44156:0:(o2iblnd_cb.c:1355:kiblnd_reconnect_peer()) LBUG
      [ 1925.319170] Pid: 44156, comm: kiblnd_connd
      [ 1925.319171] 
      Call Trace:
      [ 1925.319198]  [<ffffffffa09697ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
      [ 1925.319208]  [<ffffffffa096987c>] lbug_with_loc+0x4c/0xb0 [libcfs]
      [ 1925.319225]  [<ffffffffa0a6e206>] kiblnd_reconnect_peer+0x216/0x220 [ko2iblnd]
      [ 1925.319234]  [<ffffffffa0a77214>] kiblnd_connd+0x464/0x900 [ko2iblnd]
      [ 1925.319245]  [<ffffffff810c54c0>] ? default_wake_function+0x0/0x20
      [ 1925.319253]  [<ffffffffa0a76db0>] ? kiblnd_connd+0x0/0x900 [ko2iblnd]
      [ 1925.319259]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
      [ 1925.319264]  [<ffffffff810b0980>] ? kthread+0x0/0xe0
      [ 1925.319269]  [<ffffffff816970d8>] ret_from_fork+0x58/0x90
      [ 1925.319273]  [<ffffffff810b0980>] ? kthread+0x0/0xe0
      [ 1925.319275] 
      [ 1925.319277] Kernel panic - not syncing: LBUG
      [ 1925.319323] CPU: 3 PID: 44156 Comm: kiblnd_connd Tainted: P           OE  ------------   3.10.0-514.16.1.el7.x86_64 #1
      [ 1925.319401] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0015.012820160943 01/28/2016
      [ 1925.319477]  ffffffffa0987dac 00000000665c43d0 ffff885e892f3d38 ffffffff81686ac3
      [ 1925.319543]  ffff885e892f3db8 ffffffff8167feca ffffffff00000008 ffff885e892f3dc8
      [ 1925.319606]  ffff885e892f3d68 00000000665c43d0 00000000665c43d0 0000000000000046
      [ 1925.319668] Call Trace:
      [ 1925.319696]  [<ffffffff81686ac3>] dump_stack+0x19/0x1b
      [ 1925.319741]  [<ffffffff8167feca>] panic+0xe3/0x1f2
      [ 1925.319789]  [<ffffffffa0969894>] lbug_with_loc+0x64/0xb0 [libcfs]
      [ 1925.319844]  [<ffffffffa0a6e206>] kiblnd_reconnect_peer+0x216/0x220 [ko2iblnd]
      [ 1925.319904]  [<ffffffffa0a77214>] kiblnd_connd+0x464/0x900 [ko2iblnd]
      [ 1925.319957]  [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20
      [ 1925.320006]  [<ffffffffa0a76db0>] ? kiblnd_check_conns+0x840/0x840 [ko2iblnd]
      [ 1925.320062]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
      [ 1925.320103]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      [ 1925.320154]  [<ffffffff816970d8>] ret_from_fork+0x58/0x90
      [ 1925.320198]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      

      Here is a different node, ssu1_oss1:

       
      [  726.973784] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 5 previous similar messages
      [  815.624944] Lustre: nlsdraid-OST0000: recovery is timed out, evict stale exports
      [  815.624967] Lustre: nlsdraid-OST0000: disconnecting 2 stale clients
      [  815.856729] Lustre: nlsdraid-OST0000: Recovery over after 5:01, of 9 clients 7 recovered and 2 were evicted.
      [  876.965091] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1497360193/real 1497360193]  req@ffff885e1fea5700 x1570095499313792/t0(0) o38->nlsdraid-MDT0000-lwp-OST0000@192.168.1.101@o2ib:12/10 lens 520/544 e 0 to 1 dl 1497360248 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
      [  876.965100] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 11 previous similar messages
      [ 1151.949457] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1497360468/real 1497360468]  req@ffff885e1fea6300 x1570095499314144/t0(0) o38->nlsdraid-MDT0000-lwp-OST0000@192.168.1.101@o2ib:12/10 lens 520/544 e 0 to 1 dl 1497360523 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
      [ 1151.949467] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 21 previous similar messages
      [ 1676.919121] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) @@@ Request sent has failed due to network error: [sent 1497360993/real 1497360993]  req@ffff885e1fea1b00 x1570095499314816/t0(0) o38->nlsdraid-MDT0000-lwp-OST0000@192.168.1.101@o2ib:12/10 lens 520/544 e 0 to 1 dl 1497361048 ref 1 fl Rpc:eXN/0/ffffffff rc 0/-1
      [ 1676.919131] Lustre: 12887:0:(client.c:2114:ptlrpc_expire_one_request()) Skipped 41 previous similar messages
      [ 1942.553084] LNet: 12875:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Timed out tx for 192.168.1.101@o2ib: 9 seconds
      [ 1955.552348] LNet: 12875:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Timed out tx for 192.168.1.101@o2ib: 22 seconds
      [ 1955.552355] LNet: 12875:0:(o2iblnd_cb.c:3207:kiblnd_check_conns()) Skipped 1 previous similar message
      [ 1977.551131] LNetError: 12875:0:(o2iblnd_cb.c:1355:kiblnd_reconnect_peer()) ASSERTION( !peer_ni->ibp_accepting && !peer_ni->ibp_connecting && list_empty(&peer_ni->ibp_conns) ) failed: 
      [ 1977.551141] LNetError: 12875:0:(o2iblnd_cb.c:1355:kiblnd_reconnect_peer()) LBUG
      [ 1977.551144] Pid: 12875, comm: kiblnd_connd
      [ 1977.551146] 
      Call Trace:
      [ 1977.551175]  [<ffffffffa09dc7ee>] libcfs_call_trace+0x4e/0x60 [libcfs]
      [ 1977.551185]  [<ffffffffa09dc87c>] lbug_with_loc+0x4c/0xb0 [libcfs]
      [ 1977.551201]  [<ffffffffa0b96206>] kiblnd_reconnect_peer+0x216/0x220 [ko2iblnd]
      [ 1977.551211]  [<ffffffffa0b9f214>] kiblnd_connd+0x464/0x900 [ko2iblnd]
      [ 1977.551221]  [<ffffffff810c54c0>] ? default_wake_function+0x0/0x20
      [ 1977.551230]  [<ffffffffa0b9edb0>] ? kiblnd_connd+0x0/0x900 [ko2iblnd]
      [ 1977.551236]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
      [ 1977.551241]  [<ffffffff810b0980>] ? kthread+0x0/0xe0
      [ 1977.551246]  [<ffffffff816970d8>] ret_from_fork+0x58/0x90
      [ 1977.551251]  [<ffffffff810b0980>] ? kthread+0x0/0xe0
      [ 1977.551253] 
      [ 1977.551255] Kernel panic - not syncing: LBUG
      [ 1977.551300] CPU: 17 PID: 12875 Comm: kiblnd_connd Tainted: P           OE  ------------   3.10.0-514.16.1.el7.x86_64 #1
      [ 1977.551378] Hardware name: Intel Corporation S2600WTTR/S2600WTTR, BIOS SE5C610.86B.01.01.0015.012820160943 01/28/2016
      [ 1977.551454]  ffffffffa09fadac 000000009c68e137 ffff882fb8a23d38 ffffffff81686ac3
      [ 1977.551519]  ffff882fb8a23db8 ffffffff8167feca ffffffff00000008 ffff882fb8a23dc8
      [ 1977.551582]  ffff882fb8a23d68 000000009c68e137 000000009c68e137 0000000000000046
      [ 1977.551645] Call Trace:
      [ 1977.551674]  [<ffffffff81686ac3>] dump_stack+0x19/0x1b
      [ 1977.551719]  [<ffffffff8167feca>] panic+0xe3/0x1f2
      [ 1977.551767]  [<ffffffffa09dc894>] lbug_with_loc+0x64/0xb0 [libcfs]
      [ 1977.551823]  [<ffffffffa0b96206>] kiblnd_reconnect_peer+0x216/0x220 [ko2iblnd]
      [ 1977.551883]  [<ffffffffa0b9f214>] kiblnd_connd+0x464/0x900 [ko2iblnd]
      [ 1977.551936]  [<ffffffff810c54c0>] ? wake_up_state+0x20/0x20
      [ 1977.551985]  [<ffffffffa0b9edb0>] ? kiblnd_check_conns+0x840/0x840 [ko2iblnd]
      [ 1977.552041]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
      [ 1977.552083]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      [ 1977.552137]  [<ffffffff816970d8>] ret_from_fork+0x58/0x90
      [ 1977.552182]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
      

      On onyx: /scratch/dumps/lustre/jsalinas/kiblnd_reconnect_peer_assert.tgz
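
      For context, the failed check itself: kiblnd_reconnect_peer() at o2iblnd_cb.c:1355 asserts that the peer it is asked to reconnect is completely idle, i.e. no accept in progress, no connect in progress, and nothing left on ibp_conns. The panic means the connd thread reached the reconnect path while the peer still had one of those set. The sketch below is only a user-space model of that invariant for readability; the struct, helper, and values are illustrative, with field names copied from the assertion text above, and it is not the kernel source.

      #include <assert.h>
      #include <stdbool.h>
      #include <stdio.h>

      /* Illustrative stand-in for the o2iblnd peer state named in the assertion;
       * not the real struct kib_peer_ni. */
      struct peer_model {
              bool ibp_accepting;   /* passive-side (accept) handshake in flight */
              bool ibp_connecting;  /* active-side (connect) handshake in flight */
              int  ibp_conns;       /* stands in for list_empty(&ibp_conns)      */
      };

      /* Models the invariant asserted at o2iblnd_cb.c:1355: a reconnect may only
       * be queued for a peer with no connection activity of any kind. */
      static void reconnect_peer_model(const struct peer_model *p)
      {
              assert(!p->ibp_accepting && !p->ibp_connecting && p->ibp_conns == 0);
              printf("peer idle: safe to queue for reconnect\n");
      }

      int main(void)
      {
              struct peer_model idle = { .ibp_accepting  = false,
                                         .ibp_connecting = false,
                                         .ibp_conns      = 0 };
              reconnect_peer_model(&idle);  /* invariant holds, assertion passes */
              return 0;
      }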

Attachments

Activity

People

    Assignee: WC Triage (wc-triage)
    Reporter: John Salinas (jsalians_intel) (Inactive)
    Votes: 0
    Watchers: 5
