Details
- Type: Bug
- Resolution: Duplicate
- Priority: Minor
- Fix Version/s: None
- Affects Version/s: Lustre 2.11.0
- Environment: soak lustre-master-ib #16 DNE without reformat; previous build was #12
- Severity: 3
Description
After reprovisioning SOAK with MOFED build #16, one OSS (soak-7) hung after running for about 2 hours. I tried restarting the node and remounting, but it quickly hung again.
OSS dmesg
[ 59.983080] IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
[ 577.564976] LNet: HW NUMA nodes: 2, HW CPU cores: 32, npartitions: 2
[ 577.588287] alg: No test for adler32 (adler32-zlib)
[ 577.604774] alg: No test for crc32 (crc32-table)
[ 578.440401] Lustre: Lustre: Build Version: 2.10.55
[ 578.626899] LNet: Added LNI 192.168.1.107@o2ib [8/256/0/180]
[ 578.644034] LNetError: 126:0:(o2iblnd_cb.c:2305:kiblnd_passive_connect()) Can't accept conn from 192.168.1.119@o2ib on NA (ib0:1:192.168.1.107): bad dst nid 192.168.1.107@o2ib
[ 579.250293] LustreError: 137-5: soaked-OST000b_UUID: not available for connect from 192.168.1.128@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 579.404451] Lustre: soaked-OST0005: Not available for connect from 192.168.1.118@o2ib (not set up)
[ 579.468193] Lustre: soaked-OST0005: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[ 579.509366] LustreError: 11-0: soaked-MDT0001-lwp-OST0005: operation mds_connect to node 192.168.1.109@o2ib failed: rc = -114
[ 579.517897] Lustre: soaked-OST0005: Will be in recovery for at least 2:30, or until 32 clients reconnect
[ 579.518181] Lustre: soaked-OST0005: Connection restored to (at 192.168.1.126@o2ib)
[ 580.759587] LustreError: 137-5: soaked-OST0017_UUID: not available for connect from 192.168.1.133@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 580.818149] LustreError: Skipped 53 previous similar messages
[ 580.844275] Lustre: soaked-OST0005: Connection restored to (at 192.168.1.121@o2ib)
[ 580.869576] Lustre: Skipped 4 previous similar messages
[ 581.910615] LustreError: 137-5: soaked-OST0011_UUID: not available for connect from 192.168.1.120@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 581.911026] Lustre: soaked-OST0005: Connection restored to (at 192.168.1.120@o2ib)
[ 581.911029] Lustre: Skipped 1 previous similar message
[ 582.011699] LustreError: Skipped 6 previous similar messages
[ 584.510520] Lustre: soaked-OST0005: Connection restored to soaked-MDT0001-mdtlov_UUID (at 192.168.1.109@o2ib)
[ 584.543241] Lustre: Skipped 1 previous similar message
[ 584.613345] LNet: 7784:0:(o2iblnd_cb.c:3198:kiblnd_check_conns()) Timed out tx for 192.168.1.114@o2ib: 4295251 seconds
[ 585.156427] LustreError: 137-5: soaked-OST0017_UUID: not available for connect from 192.168.1.136@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 585.214952] LustreError: Skipped 4 previous similar messages
[ 585.613324] LNet: 7784:0:(o2iblnd_cb.c:3198:kiblnd_check_conns()) Timed out tx for 192.168.1.115@o2ib: 4295252 seconds
[ 588.629665] Lustre: soaked-OST0005: Connection restored to (at 192.168.1.123@o2ib)
[ 588.654988] Lustre: Skipped 15 previous similar messages
[ 588.718309] Lustre: soaked-OST000b: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[ 588.763453] LustreError: 11-0: soaked-MDT0001-lwp-OST000b: operation mds_connect to node 192.168.1.109@o2ib failed: rc = -114
[ 588.800753] LustreError: Skipped 3 previous similar messages
[ 589.290380] Lustre: soaked-OST000b: Will be in recovery for at least 2:30, or until 32 clients reconnect
[ 590.882947] LustreError: 137-5: soaked-OST0017_UUID: not available for connect from 192.168.1.109@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 596.295018] Lustre: soaked-OST0011: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[ 596.343902] LustreError: 11-0: soaked-MDT0001-lwp-OST0011: operation mds_connect to node 192.168.1.109@o2ib failed: rc = -114
[ 596.349189] Lustre: soaked-OST0011: Will be in recovery for at least 2:30, or until 32 clients reconnect
[ 596.412527] LustreError: Skipped 3 previous similar messages
[ 596.989322] Lustre: soaked-OST0011: Connection restored to soaked-MDT0000-mdtlov_UUID (at 192.168.1.108@o2ib)
[ 597.025080] Lustre: Skipped 27 previous similar messages
[ 601.257640] LustreError: 137-5: soaked-OST0017_UUID: not available for connect from 192.168.1.110@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 601.322237] LustreError: Skipped 1 previous similar message
[ 604.746939] LustreError: 11-0: soaked-MDT0001-lwp-OST0017: operation mds_connect to node 192.168.1.109@o2ib failed: rc = -114
[ 604.751172] Lustre: soaked-OST0017: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[ 604.755435] Lustre: soaked-OST0017: Will be in recovery for at least 2:30, or until 32 clients reconnect
[ 604.859781] LustreError: Skipped 3 previous similar messages
[ 613.202223] Lustre: soaked-OST0017: Connection restored to (at 192.168.1.120@o2ib)
[ 613.230721] Lustre: Skipped 60 previous similar messages
[ 625.052799] Lustre: soaked-OST0011: Recovery over after 0:28, of 32 clients 32 recovered and 0 were evicted.
[ 625.178799] Lustre: soaked-OST0011: deleting orphan objects from 0x740000402:1954773 to 0x740000402:1956385
[ 625.186027] Lustre: soaked-OST0011: deleting orphan objects from 0x740000401:2946444 to 0x740000401:2946577
[ 625.247687] Lustre: soaked-OST0011: deleting orphan objects from 0x740000400:1324860 to 0x740000400:1326312
[ 629.859616] LustreError: 11-0: soaked-MDT0003-lwp-OST0005: operation mds_connect to node 192.168.1.111@o2ib failed: rc = -114
[ 629.900030] LustreError: Skipped 3 previous similar messages
[ 632.686916] Lustre: soaked-OST0017: Recovery over after 0:28, of 32 clients 32 recovered and 0 were evicted.
[ 632.812885] Lustre: soaked-OST0017: deleting orphan objects from 0x8c0000402:1330224 to 0x8c0000402:1330433
[ 632.845820] Lustre: soaked-OST0017: deleting orphan objects from 0x8c0000401:1961276 to 0x8c0000401:1962209
[ 632.892404] Lustre: soaked-OST0017: deleting orphan objects from 0x8c0000400:2942161 to 0x8c0000400:2942193
[ 633.363106] LustreError: 7850:0:(lustre_dlm.h:1372:ldlm_res_lvbo_update()) delayed lvb init failed (rc -2)
[ 633.370215] LustreError: 29224:0:(ofd_io.c:616:ofd_preprw_write()) soaked-OST0005: BRW to missing obj 0x440000402:1949013
[ 633.434168] LustreError: 7850:0:(lustre_dlm.h:1372:ldlm_res_lvbo_update()) Skipped 1 previous similar message
[ 633.468475] format at lustre_dlm.h:1098:ldlm_lvbo_fill doesn't end in newline
[ 635.387737] Lustre: soaked-OST000b: Recovery over after 0:46, of 32 clients 32 recovered and 0 were evicted.
[ 635.423428] Lustre: Skipped 1 previous similar message
[ 633.493466] LustreError: 7850:0:(lustre_dlm.h:1098:ldlm_lvbo_fill()) lock ffff880797e70800: delayed lvb init failed (rc -2)
[ 635.794185] LustreError: 7850:0:(lustre_dlm.h:1372:ldlm_res_lvbo_update()) delayed lvb init failed (rc -2)
[ 635.830221] LustreError: 7850:0:(lustre_dlm.h:1372:ldlm_res_lvbo_update()) Skipped 10 previous similar messages
[ 654.748696] LustreError: 11-0: soaked-MDT0000-lwp-OST0011: operation mds_connect to node 192.168.1.108@o2ib failed: rc = -114
[ 654.789007] LustreError: Skipped 1 previous similar message
[ 679.789097] LustreError: 11-0: soaked-MDT0003-lwp-OST0005: operation mds_connect to node 192.168.1.111@o2ib failed: rc = -114
[ 679.829488] LustreError: Skipped 5 previous similar messages
[ 729.808913] LustreError: 11-0: soaked-MDT0003-lwp-OST0005: operation mds_connect to node 192.168.1.111@o2ib failed: rc = -114
[ 729.849328] LustreError: Skipped 7 previous similar messages
[ 749.610650] LNet: 7784:0:(o2iblnd_cb.c:3198:kiblnd_check_conns()) Timed out tx for 192.168.1.114@o2ib: 0 seconds
[ 804.746279] LustreError: 11-0: soaked-MDT0000-lwp-OST0017: operation mds_connect to node 192.168.1.108@o2ib failed: rc = -114
[ 804.786597] LustreError: Skipped 9 previous similar messages
[ 833.890248] LNet: Service thread pid 35547 was inactive for 200.52s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 833.952191] Pid: 35547, comm: ll_ost_io01_022
[ 833.969466] Call Trace:
[ 833.987632] [<ffffffff816a9589>] schedule+0x29/0x70
[ 834.006382] [<ffffffffc06b34d5>] cv_wait_common+0x125/0x150 [spl]
[ 834.028931] [<ffffffff810b1920>] ? autoremove_wake_function+0x0/0x40
[ 834.052232] [<ffffffffc06b3515>] __cv_wait+0x15/0x20 [spl]
[ 834.072541] [<ffffffffc1f7181f>] txg_wait_synced+0xef/0x140 [zfs]
[ 834.094665] [<ffffffffc0b562c8>] osd_trans_stop+0x468/0x590 [osd_zfs]
[ 834.117804] [<ffffffffc117629f>] ofd_trans_stop+0x1f/0x60 [ofd]
[ 834.139098] [<ffffffffc117c4c4>] ofd_commitrw_write+0x7e4/0x1c50 [ofd]
[ 834.162354] [<ffffffffc11805e9>] ofd_commitrw+0x4b9/0xac0 [ofd]
[ 834.183547] [<ffffffffc0f74677>] obd_commitrw+0x2ed/0x330 [ptlrpc]
[ 834.205542] [<ffffffffc0f47c41>] tgt_brw_write+0xff1/0x17c0 [ptlrpc]
[ 834.227963] [<ffffffff811de501>] ? __slab_free+0x81/0x2f0
[ 834.247183] [<ffffffff8118aeb4>] ? free_one_page+0x2e4/0x310
[ 834.267230] [<ffffffffc0e99f80>] ? target_bulk_timeout+0x0/0xb0 [ptlrpc]
[ 834.290662] [<ffffffffc0f43c55>] tgt_request_handle+0x925/0x1370 [ptlrpc]
[ 834.314361] [<ffffffffc0eeca3e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
[ 834.340655] [<ffffffffc0ee9908>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[ 834.364065] [<ffffffff811dca2e>] ? kmalloc_order_trace+0x2e/0xa0
[ 834.385198] [<ffffffffc0ef01e2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[ 834.406858] [<ffffffff81029557>] ? __switch_to+0xd7/0x510
[ 834.425923] [<ffffffff816a904d>] ? __schedule+0x39d/0x8b0
[ 834.444999] [<ffffffffc0eef750>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
[ 834.466633] [<ffffffff810b099f>] kthread+0xcf/0xe0
[ 834.483693] [<ffffffff810b08d0>] ? kthread+0x0/0xe0
[ 834.501018] [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
[ 834.519757] [<ffffffff810b08d0>] ? kthread+0x0/0xe0
[ 834.542850] LustreError: dumping log to /tmp/lustre-log.1510193654.35547
[ 834.802579] Pid: 32862, comm: ll_ost_io00_019
[ 834.819060] Call Trace:
[ 834.835684] [<ffffffff816a9589>] schedule+0x29/0x70
[ 834.853784] [<ffffffffc06b34d5>] cv_wait_common+0x125/0x150 [spl]
[ 834.875745] [<ffffffff810b1920>] ? autoremove_wake_function+0x0/0x40
[ 834.898451] [<ffffffffc06b3515>] __cv_wait+0x15/0x20 [spl]
[ 834.918158] [<ffffffffc1f26bfb>] dmu_tx_wait+0x20b/0x3c0 [zfs]
[ 834.938932] [<ffffffffc1f26e41>] dmu_tx_assign+0x91/0x490 [zfs]
[ 834.959865] [<ffffffffc0b52f37>] osd_trans_start+0xa7/0x3a0 [osd_zfs]
[ 834.982424] [<ffffffffc117620b>] ofd_trans_start+0x6b/0xe0 [ofd]
[ 835.003465] [<ffffffffc117c62b>] ofd_commitrw_write+0x94b/0x1c50 [ofd]
[ 835.026153] [<ffffffffc11805e9>] ofd_commitrw+0x4b9/0xac0 [ofd]
[ 835.046815] [<ffffffffc0f74677>] obd_commitrw+0x2ed/0x330 [ptlrpc]
[ 835.068248] [<ffffffffc0f47c41>] tgt_brw_write+0xff1/0x17c0 [ptlrpc]
[ 835.090232] [<ffffffff810cd7a4>] ? update_curr+0x104/0x190
[ 835.110138] [<ffffffff810c8f28>] ? __enqueue_entity+0x78/0x80
[ 835.130802] [<ffffffff810cf91c>] ? enqueue_entity+0x26c/0xb60
[ 835.151423] [<ffffffffc0e99f80>] ? target_bulk_timeout+0x0/0xb0 [ptlrpc]
[ 835.175126] [<ffffffffc0f43c55>] tgt_request_handle+0x925/0x1370 [ptlrpc]
[ 835.199009] [<ffffffffc0eeca3e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
[ 835.225389] [<ffffffffc0ee9908>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[ 835.248817] [<ffffffff810c4832>] ? default_wake_function+0x12/0x20
[ 835.270461] [<ffffffff810ba598>] ? __wake_up_common+0x58/0x90
[ 835.290685] [<ffffffffc0ef01e2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[ 835.312227] [<ffffffff81029557>] ? __switch_to+0xd7/0x510
[ 835.331138] [<ffffffff816a9000>] ? __schedule+0x350/0x8b0
[ 835.350071] [<ffffffffc0eef750>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
[ 835.371526] [<ffffffff810b099f>] kthread+0xcf/0xe0
[ 835.388344] [<ffffffff810b08d0>] ? kthread+0x0/0xe0
[ 835.405400] [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
[ 835.423854] [<ffffffff810b08d0>] ? kthread+0x0/0xe0
[ 835.446443] Pid: 29220, comm: ll_ost_io01_004
Issue Links
- is related to: LU-10250 replay-single test_74: hang and timed out (Open)