
Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Minor
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.11.0
    • Environment: soak lustre-master-ib #16 DNE without reformat; previous build was #12
    • Severity: 3
    • 9223372036854775807

    Description

      After reprovisioning SOAK with MOFED build #16, one OSS (soak-7) hung after running for about 2 hours. I tried restarting the node and mounting the targets again, but it quickly hung again. In the dmesg output below, ll_ost_io service threads are stuck waiting on ZFS transactions (txg_wait_synced()/dmu_tx_wait()).

      OSS dmesg

      [   59.983080] IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
      [  577.564976] LNet: HW NUMA nodes: 2, HW CPU cores: 32, npartitions: 2
      [  577.588287] alg: No test for adler32 (adler32-zlib)
      [  577.604774] alg: No test for crc32 (crc32-table)
      [  578.440401] Lustre: Lustre: Build Version: 2.10.55
      [  578.626899] LNet: Added LNI 192.168.1.107@o2ib [8/256/0/180]
      [  578.644034] LNetError: 126:0:(o2iblnd_cb.c:2305:kiblnd_passive_connect()) Can't accept conn from 192.168.1.119@o2ib on NA (ib0:1:192.168.1.107): bad dst nid 192.168.1.107@o2ib
      [  579.250293] LustreError: 137-5: soaked-OST000b_UUID: not available for connect from 192.168.1.128@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
      [  579.404451] Lustre: soaked-OST0005: Not available for connect from 192.168.1.118@o2ib (not set up)
      [  579.468193] Lustre: soaked-OST0005: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      [  579.509366] LustreError: 11-0: soaked-MDT0001-lwp-OST0005: operation mds_connect to node 192.168.1.109@o2ib failed: rc = -114
      [  579.517897] Lustre: soaked-OST0005: Will be in recovery for at least 2:30, or until 32 clients reconnect
      [  579.518181] Lustre: soaked-OST0005: Connection restored to  (at 192.168.1.126@o2ib)
      [  580.759587] LustreError: 137-5: soaked-OST0017_UUID: not available for connect from 192.168.1.133@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
      [  580.818149] LustreError: Skipped 53 previous similar messages
      [  580.844275] Lustre: soaked-OST0005: Connection restored to  (at 192.168.1.121@o2ib)
      [  580.869576] Lustre: Skipped 4 previous similar messages
      [  581.910615] LustreError: 137-5: soaked-OST0011_UUID: not available for connect from 192.168.1.120@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
      [  581.911026] Lustre: soaked-OST0005: Connection restored to  (at 192.168.1.120@o2ib)
      [  581.911029] Lustre: Skipped 1 previous similar message
      [  582.011699] LustreError: Skipped 6 previous similar messages
      [  584.510520] Lustre: soaked-OST0005: Connection restored to soaked-MDT0001-mdtlov_UUID (at 192.168.1.109@o2ib)
      [  584.543241] Lustre: Skipped 1 previous similar message
      [  584.613345] LNet: 7784:0:(o2iblnd_cb.c:3198:kiblnd_check_conns()) Timed out tx for 192.168.1.114@o2ib: 4295251 seconds
      [  585.156427] LustreError: 137-5: soaked-OST0017_UUID: not available for connect from 192.168.1.136@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
      [  585.214952] LustreError: Skipped 4 previous similar messages
      [  585.613324] LNet: 7784:0:(o2iblnd_cb.c:3198:kiblnd_check_conns()) Timed out tx for 192.168.1.115@o2ib: 4295252 seconds
      [  588.629665] Lustre: soaked-OST0005: Connection restored to  (at 192.168.1.123@o2ib)
      [  588.654988] Lustre: Skipped 15 previous similar messages
      [  588.718309] Lustre: soaked-OST000b: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      [  588.763453] LustreError: 11-0: soaked-MDT0001-lwp-OST000b: operation mds_connect to node 192.168.1.109@o2ib failed: rc = -114
      [  588.800753] LustreError: Skipped 3 previous similar messages
      [  589.290380] Lustre: soaked-OST000b: Will be in recovery for at least 2:30, or until 32 clients reconnect
      [  590.882947] LustreError: 137-5: soaked-OST0017_UUID: not available for connect from 192.168.1.109@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
      [  596.295018] Lustre: soaked-OST0011: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      [  596.343902] LustreError: 11-0: soaked-MDT0001-lwp-OST0011: operation mds_connect to node 192.168.1.109@o2ib failed: rc = -114
      [  596.349189] Lustre: soaked-OST0011: Will be in recovery for at least 2:30, or until 32 clients reconnect
      [  596.412527] LustreError: Skipped 3 previous similar messages
      [  596.989322] Lustre: soaked-OST0011: Connection restored to soaked-MDT0000-mdtlov_UUID (at 192.168.1.108@o2ib)
      [  597.025080] Lustre: Skipped 27 previous similar messages
      [  601.257640] LustreError: 137-5: soaked-OST0017_UUID: not available for connect from 192.168.1.110@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
      [  601.322237] LustreError: Skipped 1 previous similar message
      [  604.746939] LustreError: 11-0: soaked-MDT0001-lwp-OST0017: operation mds_connect to node 192.168.1.109@o2ib failed: rc = -114
      [  604.751172] Lustre: soaked-OST0017: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      [  604.755435] Lustre: soaked-OST0017: Will be in recovery for at least 2:30, or until 32 clients reconnect
      [  604.859781] LustreError: Skipped 3 previous similar messages
      [  613.202223] Lustre: soaked-OST0017: Connection restored to  (at 192.168.1.120@o2ib)
      [  613.230721] Lustre: Skipped 60 previous similar messages
      [  625.052799] Lustre: soaked-OST0011: Recovery over after 0:28, of 32 clients 32 recovered and 0 were evicted.
      [  625.178799] Lustre: soaked-OST0011: deleting orphan objects from 0x740000402:1954773 to 0x740000402:1956385
      [  625.186027] Lustre: soaked-OST0011: deleting orphan objects from 0x740000401:2946444 to 0x740000401:2946577
      [  625.247687] Lustre: soaked-OST0011: deleting orphan objects from 0x740000400:1324860 to 0x740000400:1326312
      [  629.859616] LustreError: 11-0: soaked-MDT0003-lwp-OST0005: operation mds_connect to node 192.168.1.111@o2ib failed: rc = -114
      [  629.900030] LustreError: Skipped 3 previous similar messages
      [  632.686916] Lustre: soaked-OST0017: Recovery over after 0:28, of 32 clients 32 recovered and 0 were evicted.
      [  632.812885] Lustre: soaked-OST0017: deleting orphan objects from 0x8c0000402:1330224 to 0x8c0000402:1330433
      [  632.845820] Lustre: soaked-OST0017: deleting orphan objects from 0x8c0000401:1961276 to 0x8c0000401:1962209
      [  632.892404] Lustre: soaked-OST0017: deleting orphan objects from 0x8c0000400:2942161 to 0x8c0000400:2942193
      [  633.363106] LustreError: 7850:0:(lustre_dlm.h:1372:ldlm_res_lvbo_update()) delayed lvb init failed (rc -2)
      [  633.370215] LustreError: 29224:0:(ofd_io.c:616:ofd_preprw_write()) soaked-OST0005: BRW to missing obj 0x440000402:1949013
      [  633.434168] LustreError: 7850:0:(lustre_dlm.h:1372:ldlm_res_lvbo_update()) Skipped 1 previous similar message
      [  633.468475] format at lustre_dlm.h:1098:ldlm_lvbo_fill doesn't end in newline
      [  635.387737] Lustre: soaked-OST000b: Recovery over after 0:46, of 32 clients 32 recovered and 0 were evicted.
      [  635.423428] Lustre: Skipped 1 previous similar message
      [  633.493466] LustreError: 7850:0:(lustre_dlm.h:1098:ldlm_lvbo_fill()) lock ffff880797e70800: delayed lvb init failed (rc -2)
      [  635.794185] LustreError: 7850:0:(lustre_dlm.h:1372:ldlm_res_lvbo_update()) delayed lvb init failed (rc -2)
      [  635.830221] LustreError: 7850:0:(lustre_dlm.h:1372:ldlm_res_lvbo_update()) Skipped 10 previous similar messages
      [  654.748696] LustreError: 11-0: soaked-MDT0000-lwp-OST0011: operation mds_connect to node 192.168.1.108@o2ib failed: rc = -114
      [  654.789007] LustreError: Skipped 1 previous similar message
      [  679.789097] LustreError: 11-0: soaked-MDT0003-lwp-OST0005: operation mds_connect to node 192.168.1.111@o2ib failed: rc = -114
      [  679.829488] LustreError: Skipped 5 previous similar messages
      [  729.808913] LustreError: 11-0: soaked-MDT0003-lwp-OST0005: operation mds_connect to node 192.168.1.111@o2ib failed: rc = -114
      [  729.849328] LustreError: Skipped 7 previous similar messages
      [  749.610650] LNet: 7784:0:(o2iblnd_cb.c:3198:kiblnd_check_conns()) Timed out tx for 192.168.1.114@o2ib: 0 seconds
      [  804.746279] LustreError: 11-0: soaked-MDT0000-lwp-OST0017: operation mds_connect to node 192.168.1.108@o2ib failed: rc = -114
      [  804.786597] LustreError: Skipped 9 previous similar messages
      [  833.890248] LNet: Service thread pid 35547 was inactive for 200.52s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [  833.952191] Pid: 35547, comm: ll_ost_io01_022
      [  833.969466] 
      Call Trace:
      [  833.987632]  [<ffffffff816a9589>] schedule+0x29/0x70
      [  834.006382]  [<ffffffffc06b34d5>] cv_wait_common+0x125/0x150 [spl]
      [  834.028931]  [<ffffffff810b1920>] ? autoremove_wake_function+0x0/0x40
      [  834.052232]  [<ffffffffc06b3515>] __cv_wait+0x15/0x20 [spl]
      [  834.072541]  [<ffffffffc1f7181f>] txg_wait_synced+0xef/0x140 [zfs]
      [  834.094665]  [<ffffffffc0b562c8>] osd_trans_stop+0x468/0x590 [osd_zfs]
      [  834.117804]  [<ffffffffc117629f>] ofd_trans_stop+0x1f/0x60 [ofd]
      [  834.139098]  [<ffffffffc117c4c4>] ofd_commitrw_write+0x7e4/0x1c50 [ofd]
      [  834.162354]  [<ffffffffc11805e9>] ofd_commitrw+0x4b9/0xac0 [ofd]
      [  834.183547]  [<ffffffffc0f74677>] obd_commitrw+0x2ed/0x330 [ptlrpc]
      [  834.205542]  [<ffffffffc0f47c41>] tgt_brw_write+0xff1/0x17c0 [ptlrpc]
      [  834.227963]  [<ffffffff811de501>] ? __slab_free+0x81/0x2f0
      [  834.247183]  [<ffffffff8118aeb4>] ? free_one_page+0x2e4/0x310
      [  834.267230]  [<ffffffffc0e99f80>] ? target_bulk_timeout+0x0/0xb0 [ptlrpc]
      [  834.290662]  [<ffffffffc0f43c55>] tgt_request_handle+0x925/0x1370 [ptlrpc]
      [  834.314361]  [<ffffffffc0eeca3e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
      [  834.340655]  [<ffffffffc0ee9908>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [  834.364065]  [<ffffffff811dca2e>] ? kmalloc_order_trace+0x2e/0xa0
      [  834.385198]  [<ffffffffc0ef01e2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [  834.406858]  [<ffffffff81029557>] ? __switch_to+0xd7/0x510
      [  834.425923]  [<ffffffff816a904d>] ? __schedule+0x39d/0x8b0
      [  834.444999]  [<ffffffffc0eef750>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
      [  834.466633]  [<ffffffff810b099f>] kthread+0xcf/0xe0
      [  834.483693]  [<ffffffff810b08d0>] ? kthread+0x0/0xe0
      [  834.501018]  [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
      [  834.519757]  [<ffffffff810b08d0>] ? kthread+0x0/0xe0
      
      [  834.542850] LustreError: dumping log to /tmp/lustre-log.1510193654.35547
      [  834.802579] Pid: 32862, comm: ll_ost_io00_019
      [  834.819060] 
      Call Trace:
      [  834.835684]  [<ffffffff816a9589>] schedule+0x29/0x70
      [  834.853784]  [<ffffffffc06b34d5>] cv_wait_common+0x125/0x150 [spl]
      [  834.875745]  [<ffffffff810b1920>] ? autoremove_wake_function+0x0/0x40
      [  834.898451]  [<ffffffffc06b3515>] __cv_wait+0x15/0x20 [spl]
      [  834.918158]  [<ffffffffc1f26bfb>] dmu_tx_wait+0x20b/0x3c0 [zfs]
      [  834.938932]  [<ffffffffc1f26e41>] dmu_tx_assign+0x91/0x490 [zfs]
      [  834.959865]  [<ffffffffc0b52f37>] osd_trans_start+0xa7/0x3a0 [osd_zfs]
      [  834.982424]  [<ffffffffc117620b>] ofd_trans_start+0x6b/0xe0 [ofd]
      [  835.003465]  [<ffffffffc117c62b>] ofd_commitrw_write+0x94b/0x1c50 [ofd]
      [  835.026153]  [<ffffffffc11805e9>] ofd_commitrw+0x4b9/0xac0 [ofd]
      [  835.046815]  [<ffffffffc0f74677>] obd_commitrw+0x2ed/0x330 [ptlrpc]
      [  835.068248]  [<ffffffffc0f47c41>] tgt_brw_write+0xff1/0x17c0 [ptlrpc]
      [  835.090232]  [<ffffffff810cd7a4>] ? update_curr+0x104/0x190
      [  835.110138]  [<ffffffff810c8f28>] ? __enqueue_entity+0x78/0x80
      [  835.130802]  [<ffffffff810cf91c>] ? enqueue_entity+0x26c/0xb60
      [  835.151423]  [<ffffffffc0e99f80>] ? target_bulk_timeout+0x0/0xb0 [ptlrpc]
      [  835.175126]  [<ffffffffc0f43c55>] tgt_request_handle+0x925/0x1370 [ptlrpc]
      [  835.199009]  [<ffffffffc0eeca3e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
      [  835.225389]  [<ffffffffc0ee9908>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
      [  835.248817]  [<ffffffff810c4832>] ? default_wake_function+0x12/0x20
      [  835.270461]  [<ffffffff810ba598>] ? __wake_up_common+0x58/0x90
      [  835.290685]  [<ffffffffc0ef01e2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
      [  835.312227]  [<ffffffff81029557>] ? __switch_to+0xd7/0x510
      [  835.331138]  [<ffffffff816a9000>] ? __schedule+0x350/0x8b0
      [  835.350071]  [<ffffffffc0eef750>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
      [  835.371526]  [<ffffffff810b099f>] kthread+0xcf/0xe0
      [  835.388344]  [<ffffffff810b08d0>] ? kthread+0x0/0xe0
      [  835.405400]  [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
      [  835.423854]  [<ffffffff810b08d0>] ? kthread+0x0/0xe0
      
      [  835.446443] Pid: 29220, comm: ll_ost_io01_004
      

    People

      Assignee: WC Triage (wc-triage)
      Reporter: Sarah Liu (sarah)
