[LU-10223] OSS hang Created: 10/Nov/17  Updated: 29/Aug/18  Resolved: 29/Aug/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0
Fix Version/s: None

Type: Bug
Priority: Minor
Reporter: Sarah Liu
Assignee: WC Triage
Resolution: Duplicate
Votes: 0
Labels: soak
Environment:

soak lustre-master-ib #16 DNE without reformat; previous build was #12


Attachments: File lustre-log.1510193654.35547    
Issue Links:
Related: is related to LU-10250 replay-single test_74: hang and time... (Open)
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

After reprovisioning SOAK with MOFED build #16, one OSS (soak-7) hung after running for about 2 hours. The node was restarted and the targets remounted, but it quickly hung again.

OSS dmesg

[   59.983080] IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
[  577.564976] LNet: HW NUMA nodes: 2, HW CPU cores: 32, npartitions: 2
[  577.588287] alg: No test for adler32 (adler32-zlib)
[  577.604774] alg: No test for crc32 (crc32-table)
[  578.440401] Lustre: Lustre: Build Version: 2.10.55
[  578.626899] LNet: Added LNI 192.168.1.107@o2ib [8/256/0/180]
[  578.644034] LNetError: 126:0:(o2iblnd_cb.c:2305:kiblnd_passive_connect()) Can't accept conn from 192.168.1.119@o2ib on NA (ib0:1:192.168.1.107): bad dst nid 192.168.1.107@o2ib
[  579.250293] LustreError: 137-5: soaked-OST000b_UUID: not available for connect from 192.168.1.128@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[  579.404451] Lustre: soaked-OST0005: Not available for connect from 192.168.1.118@o2ib (not set up)
[  579.468193] Lustre: soaked-OST0005: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[  579.509366] LustreError: 11-0: soaked-MDT0001-lwp-OST0005: operation mds_connect to node 192.168.1.109@o2ib failed: rc = -114
[  579.517897] Lustre: soaked-OST0005: Will be in recovery for at least 2:30, or until 32 clients reconnect
[  579.518181] Lustre: soaked-OST0005: Connection restored to  (at 192.168.1.126@o2ib)
[  580.759587] LustreError: 137-5: soaked-OST0017_UUID: not available for connect from 192.168.1.133@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[  580.818149] LustreError: Skipped 53 previous similar messages
[  580.844275] Lustre: soaked-OST0005: Connection restored to  (at 192.168.1.121@o2ib)
[  580.869576] Lustre: Skipped 4 previous similar messages
[  581.910615] LustreError: 137-5: soaked-OST0011_UUID: not available for connect from 192.168.1.120@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[  581.911026] Lustre: soaked-OST0005: Connection restored to  (at 192.168.1.120@o2ib)
[  581.911029] Lustre: Skipped 1 previous similar message
[  582.011699] LustreError: Skipped 6 previous similar messages
[  584.510520] Lustre: soaked-OST0005: Connection restored to soaked-MDT0001-mdtlov_UUID (at 192.168.1.109@o2ib)
[  584.543241] Lustre: Skipped 1 previous similar message
[  584.613345] LNet: 7784:0:(o2iblnd_cb.c:3198:kiblnd_check_conns()) Timed out tx for 192.168.1.114@o2ib: 4295251 seconds
[  585.156427] LustreError: 137-5: soaked-OST0017_UUID: not available for connect from 192.168.1.136@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[  585.214952] LustreError: Skipped 4 previous similar messages
[  585.613324] LNet: 7784:0:(o2iblnd_cb.c:3198:kiblnd_check_conns()) Timed out tx for 192.168.1.115@o2ib: 4295252 seconds
[  588.629665] Lustre: soaked-OST0005: Connection restored to  (at 192.168.1.123@o2ib)
[  588.654988] Lustre: Skipped 15 previous similar messages
[  588.718309] Lustre: soaked-OST000b: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[  588.763453] LustreError: 11-0: soaked-MDT0001-lwp-OST000b: operation mds_connect to node 192.168.1.109@o2ib failed: rc = -114
[  588.800753] LustreError: Skipped 3 previous similar messages
[  589.290380] Lustre: soaked-OST000b: Will be in recovery for at least 2:30, or until 32 clients reconnect
[  590.882947] LustreError: 137-5: soaked-OST0017_UUID: not available for connect from 192.168.1.109@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[  596.295018] Lustre: soaked-OST0011: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[  596.343902] LustreError: 11-0: soaked-MDT0001-lwp-OST0011: operation mds_connect to node 192.168.1.109@o2ib failed: rc = -114
[  596.349189] Lustre: soaked-OST0011: Will be in recovery for at least 2:30, or until 32 clients reconnect
[  596.412527] LustreError: Skipped 3 previous similar messages
[  596.989322] Lustre: soaked-OST0011: Connection restored to soaked-MDT0000-mdtlov_UUID (at 192.168.1.108@o2ib)
[  597.025080] Lustre: Skipped 27 previous similar messages
[  601.257640] LustreError: 137-5: soaked-OST0017_UUID: not available for connect from 192.168.1.110@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[  601.322237] LustreError: Skipped 1 previous similar message
[  604.746939] LustreError: 11-0: soaked-MDT0001-lwp-OST0017: operation mds_connect to node 192.168.1.109@o2ib failed: rc = -114
[  604.751172] Lustre: soaked-OST0017: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[  604.755435] Lustre: soaked-OST0017: Will be in recovery for at least 2:30, or until 32 clients reconnect
[  604.859781] LustreError: Skipped 3 previous similar messages
[  613.202223] Lustre: soaked-OST0017: Connection restored to  (at 192.168.1.120@o2ib)
[  613.230721] Lustre: Skipped 60 previous similar messages
[  625.052799] Lustre: soaked-OST0011: Recovery over after 0:28, of 32 clients 32 recovered and 0 were evicted.
[  625.178799] Lustre: soaked-OST0011: deleting orphan objects from 0x740000402:1954773 to 0x740000402:1956385
[  625.186027] Lustre: soaked-OST0011: deleting orphan objects from 0x740000401:2946444 to 0x740000401:2946577
[  625.247687] Lustre: soaked-OST0011: deleting orphan objects from 0x740000400:1324860 to 0x740000400:1326312
[  629.859616] LustreError: 11-0: soaked-MDT0003-lwp-OST0005: operation mds_connect to node 192.168.1.111@o2ib failed: rc = -114
[  629.900030] LustreError: Skipped 3 previous similar messages
[  632.686916] Lustre: soaked-OST0017: Recovery over after 0:28, of 32 clients 32 recovered and 0 were evicted.
[  632.812885] Lustre: soaked-OST0017: deleting orphan objects from 0x8c0000402:1330224 to 0x8c0000402:1330433
[  632.845820] Lustre: soaked-OST0017: deleting orphan objects from 0x8c0000401:1961276 to 0x8c0000401:1962209
[  632.892404] Lustre: soaked-OST0017: deleting orphan objects from 0x8c0000400:2942161 to 0x8c0000400:2942193
[  633.363106] LustreError: 7850:0:(lustre_dlm.h:1372:ldlm_res_lvbo_update()) delayed lvb init failed (rc -2)
[  633.370215] LustreError: 29224:0:(ofd_io.c:616:ofd_preprw_write()) soaked-OST0005: BRW to missing obj 0x440000402:1949013
[  633.434168] LustreError: 7850:0:(lustre_dlm.h:1372:ldlm_res_lvbo_update()) Skipped 1 previous similar message
[  633.468475] format at lustre_dlm.h:1098:ldlm_lvbo_fill doesn't end in newline
[  635.387737] Lustre: soaked-OST000b: Recovery over after 0:46, of 32 clients 32 recovered and 0 were evicted.
[  635.423428] Lustre: Skipped 1 previous similar message
[  633.493466] LustreError: 7850:0:(lustre_dlm.h:1098:ldlm_lvbo_fill()) lock ffff880797e70800: delayed lvb init failed (rc -2)
[  635.794185] LustreError: 7850:0:(lustre_dlm.h:1372:ldlm_res_lvbo_update()) delayed lvb init failed (rc -2)
[  635.830221] LustreError: 7850:0:(lustre_dlm.h:1372:ldlm_res_lvbo_update()) Skipped 10 previous similar messages
[  654.748696] LustreError: 11-0: soaked-MDT0000-lwp-OST0011: operation mds_connect to node 192.168.1.108@o2ib failed: rc = -114
[  654.789007] LustreError: Skipped 1 previous similar message
[  679.789097] LustreError: 11-0: soaked-MDT0003-lwp-OST0005: operation mds_connect to node 192.168.1.111@o2ib failed: rc = -114
[  679.829488] LustreError: Skipped 5 previous similar messages
[  729.808913] LustreError: 11-0: soaked-MDT0003-lwp-OST0005: operation mds_connect to node 192.168.1.111@o2ib failed: rc = -114
[  729.849328] LustreError: Skipped 7 previous similar messages
[  749.610650] LNet: 7784:0:(o2iblnd_cb.c:3198:kiblnd_check_conns()) Timed out tx for 192.168.1.114@o2ib: 0 seconds
[  804.746279] LustreError: 11-0: soaked-MDT0000-lwp-OST0017: operation mds_connect to node 192.168.1.108@o2ib failed: rc = -114
[  804.786597] LustreError: Skipped 9 previous similar messages
[  833.890248] LNet: Service thread pid 35547 was inactive for 200.52s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[  833.952191] Pid: 35547, comm: ll_ost_io01_022
[  833.969466] 
Call Trace:
[  833.987632]  [<ffffffff816a9589>] schedule+0x29/0x70
[  834.006382]  [<ffffffffc06b34d5>] cv_wait_common+0x125/0x150 [spl]
[  834.028931]  [<ffffffff810b1920>] ? autoremove_wake_function+0x0/0x40
[  834.052232]  [<ffffffffc06b3515>] __cv_wait+0x15/0x20 [spl]
[  834.072541]  [<ffffffffc1f7181f>] txg_wait_synced+0xef/0x140 [zfs]
[  834.094665]  [<ffffffffc0b562c8>] osd_trans_stop+0x468/0x590 [osd_zfs]
[  834.117804]  [<ffffffffc117629f>] ofd_trans_stop+0x1f/0x60 [ofd]
[  834.139098]  [<ffffffffc117c4c4>] ofd_commitrw_write+0x7e4/0x1c50 [ofd]
[  834.162354]  [<ffffffffc11805e9>] ofd_commitrw+0x4b9/0xac0 [ofd]
[  834.183547]  [<ffffffffc0f74677>] obd_commitrw+0x2ed/0x330 [ptlrpc]
[  834.205542]  [<ffffffffc0f47c41>] tgt_brw_write+0xff1/0x17c0 [ptlrpc]
[  834.227963]  [<ffffffff811de501>] ? __slab_free+0x81/0x2f0
[  834.247183]  [<ffffffff8118aeb4>] ? free_one_page+0x2e4/0x310
[  834.267230]  [<ffffffffc0e99f80>] ? target_bulk_timeout+0x0/0xb0 [ptlrpc]
[  834.290662]  [<ffffffffc0f43c55>] tgt_request_handle+0x925/0x1370 [ptlrpc]
[  834.314361]  [<ffffffffc0eeca3e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
[  834.340655]  [<ffffffffc0ee9908>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[  834.364065]  [<ffffffff811dca2e>] ? kmalloc_order_trace+0x2e/0xa0
[  834.385198]  [<ffffffffc0ef01e2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[  834.406858]  [<ffffffff81029557>] ? __switch_to+0xd7/0x510
[  834.425923]  [<ffffffff816a904d>] ? __schedule+0x39d/0x8b0
[  834.444999]  [<ffffffffc0eef750>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
[  834.466633]  [<ffffffff810b099f>] kthread+0xcf/0xe0
[  834.483693]  [<ffffffff810b08d0>] ? kthread+0x0/0xe0
[  834.501018]  [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
[  834.519757]  [<ffffffff810b08d0>] ? kthread+0x0/0xe0

[  834.542850] LustreError: dumping log to /tmp/lustre-log.1510193654.35547
[  834.802579] Pid: 32862, comm: ll_ost_io00_019
[  834.819060] 
Call Trace:
[  834.835684]  [<ffffffff816a9589>] schedule+0x29/0x70
[  834.853784]  [<ffffffffc06b34d5>] cv_wait_common+0x125/0x150 [spl]
[  834.875745]  [<ffffffff810b1920>] ? autoremove_wake_function+0x0/0x40
[  834.898451]  [<ffffffffc06b3515>] __cv_wait+0x15/0x20 [spl]
[  834.918158]  [<ffffffffc1f26bfb>] dmu_tx_wait+0x20b/0x3c0 [zfs]
[  834.938932]  [<ffffffffc1f26e41>] dmu_tx_assign+0x91/0x490 [zfs]
[  834.959865]  [<ffffffffc0b52f37>] osd_trans_start+0xa7/0x3a0 [osd_zfs]
[  834.982424]  [<ffffffffc117620b>] ofd_trans_start+0x6b/0xe0 [ofd]
[  835.003465]  [<ffffffffc117c62b>] ofd_commitrw_write+0x94b/0x1c50 [ofd]
[  835.026153]  [<ffffffffc11805e9>] ofd_commitrw+0x4b9/0xac0 [ofd]
[  835.046815]  [<ffffffffc0f74677>] obd_commitrw+0x2ed/0x330 [ptlrpc]
[  835.068248]  [<ffffffffc0f47c41>] tgt_brw_write+0xff1/0x17c0 [ptlrpc]
[  835.090232]  [<ffffffff810cd7a4>] ? update_curr+0x104/0x190
[  835.110138]  [<ffffffff810c8f28>] ? __enqueue_entity+0x78/0x80
[  835.130802]  [<ffffffff810cf91c>] ? enqueue_entity+0x26c/0xb60
[  835.151423]  [<ffffffffc0e99f80>] ? target_bulk_timeout+0x0/0xb0 [ptlrpc]
[  835.175126]  [<ffffffffc0f43c55>] tgt_request_handle+0x925/0x1370 [ptlrpc]
[  835.199009]  [<ffffffffc0eeca3e>] ptlrpc_server_handle_request+0x24e/0xab0 [ptlrpc]
[  835.225389]  [<ffffffffc0ee9908>] ? ptlrpc_wait_event+0x98/0x340 [ptlrpc]
[  835.248817]  [<ffffffff810c4832>] ? default_wake_function+0x12/0x20
[  835.270461]  [<ffffffff810ba598>] ? __wake_up_common+0x58/0x90
[  835.290685]  [<ffffffffc0ef01e2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
[  835.312227]  [<ffffffff81029557>] ? __switch_to+0xd7/0x510
[  835.331138]  [<ffffffff816a9000>] ? __schedule+0x350/0x8b0
[  835.350071]  [<ffffffffc0eef750>] ? ptlrpc_main+0x0/0x1e40 [ptlrpc]
[  835.371526]  [<ffffffff810b099f>] kthread+0xcf/0xe0
[  835.388344]  [<ffffffff810b08d0>] ? kthread+0x0/0xe0
[  835.405400]  [<ffffffff816b4fd8>] ret_from_fork+0x58/0x90
[  835.423854]  [<ffffffff810b08d0>] ? kthread+0x0/0xe0

[  835.446443] Pid: 29220, comm: ll_ost_io01_004


 Comments   
Comment by Andreas Dilger [ 29/Aug/18 ]

Closing as a duplicate of LU-10250; not much information here.
