[LU-12979] OSS hung during failback Created: 18/Nov/19  Updated: 16/Apr/20  Resolved: 16/Apr/20

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Sarah Liu Assignee: Alex Zhuravlev
Resolution: Cannot Reproduce Votes: 0
Labels: soak
Environment:

b2_13-ib build #2


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

During OSS failover testing, the failover pair node hung during failback after the system had been running for about 35 hours.

soak-7

[17882.991912] Lustre: soaked-OST0006: deleting orphan objects from 0x0:1148098 to 0x0:1149825
[17884.486957] Lustre: Failing over soaked-OST0002
[17884.728160] Lustre: server umount soaked-OST0002 complete
[17895.395319] Lustre: Failing over soaked-OST000a
[17895.413977] Lustre: server umount soaked-OST000a complete
[17896.790424] LustreError: 137-5: soaked-OST0002_UUID: not available for connect from 192.168.1.124@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[17896.810226] LustreError: Skipped 437 previous similar messages
[17902.199718] Lustre: Failing over soaked-OST0006
[17902.346578] Lustre: server umount soaked-OST0006 complete
[17908.627101] Lustre: Failing over soaked-OST000e
[17908.758771] Lustre: server umount soaked-OST000e complete
[17923.789823] Lustre: soaked-OST0007: Export ffff9724a6452c00 already connecting from 192.168.1.118@o2ib
[17931.042012] Lustre: soaked-OST0007: Export ffff9724d6813400 already connecting from 192.168.1.126@o2ib
[17937.914063] Lustre: soaked-OST0007: Export ffff9728a5360800 already connecting from 192.168.1.130@o2ib
[17940.946368] Lustre: soaked-OST0007: Export ffff9728d851d800 already connecting from 192.168.1.137@o2ib
[17940.956765] Lustre: Skipped 1 previous similar message
[17946.942948] Lustre: soaked-OST0007: Export ffff9728d81df400 already connecting from 192.168.1.124@o2ib
[17973.969095] Lustre: soaked-OST0007: Export ffff9724a6452c00 already connecting from 192.168.1.118@o2ib
[17973.979489] Lustre: Skipped 13 previous similar messages
[17991.122858] Lustre: soaked-OST0007: Export ffff9728d851d800 already connecting from 192.168.1.137@o2ib
[17991.133257] Lustre: Skipped 3 previous similar messages
[18024.148290] Lustre: soaked-OST0007: Export ffff9724a6452c00 already connecting from 192.168.1.118@o2ib
[18024.158682] Lustre: Skipped 14 previous similar messages
[18064.712164] Lustre: ll_ost00_008: service thread pid 18052 was inactive for 200.107 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[18064.733418] Pid: 18052, comm: ll_ost00_008 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Nov 8 18:37:40 UTC 2019
[18064.744672] Call Trace:
[18064.747423]  [<ffffffffc091e2d5>] cv_wait_common+0x125/0x150 [spl]
[18064.754345]  [<ffffffffc091e315>] __cv_wait+0x15/0x20 [spl]
[18064.760577]  [<ffffffffc0c992ef>] txg_wait_synced+0xef/0x140 [zfs]
[18064.767530]  [<ffffffffc0c4ecc5>] dmu_tx_wait+0x275/0x3c0 [zfs]
[18064.774174]  [<ffffffffc0c4eea2>] dmu_tx_assign+0x92/0x490 [zfs]
[18064.780910]  [<ffffffffc16c2fd9>] osd_trans_start+0x199/0x440 [osd_zfs]
[18064.788312]  [<ffffffffc14e8430>] tgt_server_data_update+0x3c0/0x510 [ptlrpc]
[18064.796347]  [<ffffffffc14ea40d>] tgt_client_del+0x29d/0x6a0 [ptlrpc]
[18064.803581]  [<ffffffffc180523c>] ofd_obd_disconnect+0x1ac/0x220 [ofd]
[18064.810885]  [<ffffffffc144f176>] target_handle_disconnect+0xd6/0x450 [ptlrpc]
[18064.818985]  [<ffffffffc14f0d38>] tgt_disconnect+0x58/0x170 [ptlrpc]
[18064.826127]  [<ffffffffc14f983a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
[18064.833851]  [<ffffffffc149ba96>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[18064.842439]  [<ffffffffc149f5cc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[18064.849482]  [<ffffffff93cc50d1>] kthread+0xd1/0xe0
[18064.854943]  [<ffffffff9438cd37>] ret_from_fork_nospec_end+0x0/0x39
[18064.861953]  [<ffffffffffffffff>] 0xffffffffffffffff
[18065.736218] Lustre: ll_ost00_006: service thread pid 18039 was inactive for 200.435 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[18065.757478] Pid: 18039, comm: ll_ost00_006 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Nov 8 18:37:40 UTC 2019
[18065.768742] Call Trace:
[18065.771481]  [<ffffffffc091e2d5>] cv_wait_common+0x125/0x150 [spl]
[18065.778397]  [<ffffffffc091e315>] __cv_wait+0x15/0x20 [spl]
[18065.784636]  [<ffffffffc0c992ef>] txg_wait_synced+0xef/0x140 [zfs]
[18065.791576]  [<ffffffffc0c4ecc5>] dmu_tx_wait+0x275/0x3c0 [zfs]
[18065.798219]  [<ffffffffc0c4eea2>] dmu_tx_assign+0x92/0x490 [zfs]
[18065.804957]  [<ffffffffc16c2fd9>] osd_trans_start+0x199/0x440 [osd_zfs]
[18065.812358]  [<ffffffffc14e8430>] tgt_server_data_update+0x3c0/0x510 [ptlrpc]
[18065.820373]  [<ffffffffc14ea40d>] tgt_client_del+0x29d/0x6a0 [ptlrpc]
[18065.827613]  [<ffffffffc180523c>] ofd_obd_disconnect+0x1ac/0x220 [ofd]
[18065.834923]  [<ffffffffc144f176>] target_handle_disconnect+0xd6/0x450 [ptlrpc]
[18065.843022]  [<ffffffffc14f0d38>] tgt_disconnect+0x58/0x170 [ptlrpc]
[18065.850165]  [<ffffffffc14f983a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
[18065.857886]  [<ffffffffc149ba96>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[18065.866478]  [<ffffffffc149f5cc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[18065.873511]  [<ffffffff93cc50d1>] kthread+0xd1/0xe0
[18065.878972]  [<ffffffff9438cd37>] ret_from_fork_nospec_end+0x0/0x39
[18065.885982]  [<ffffffffffffffff>] 0xffffffffffffffff
...
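
For reference (not from the original console log): the hung threads above are all blocked in the standard ZFS DMU transaction-assign path. osd_trans_start() requests a transaction with TXG_WAIT, and when the open transaction group cannot accept it, dmu_tx_assign() sleeps in dmu_tx_wait()/txg_wait_synced() until the current txg commits. A minimal sketch of that generic pattern follows; the function and variable names are placeholders, not the actual osd-zfs code:

    /*
     * Generic DMU transaction pattern; "os", "object", "off" and "len"
     * are hypothetical placeholders, not osd-zfs variables.
     */
    static int
    modify_object(objset_t *os, uint64_t object, uint64_t off, int len)
    {
            dmu_tx_t *tx;
            int err;

            tx = dmu_tx_create(os);
            dmu_tx_hold_write(tx, object, off, len);

            /*
             * With TXG_WAIT, dmu_tx_assign() blocks internally in
             * dmu_tx_wait() -> txg_wait_synced() whenever the open txg
             * cannot accept the transaction -- the same frames seen in
             * the ll_ost stack traces above.
             */
            err = dmu_tx_assign(tx, TXG_WAIT);
            if (err != 0) {
                    dmu_tx_abort(tx);
                    return (err);
            }

            /* ... perform the modification, e.g. dmu_write() ... */

            dmu_tx_commit(tx);
            return (0);
    }

If the pool's sync thread never completes the current txg (for example because the pool is suspended or I/O has stalled), every caller of this pattern blocks indefinitely, which matches the hang seen here.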

[21128.204987] Lustre: Skipped 1 previous similar message
[21138.287341] Lustre: soaked-OST0002: Not available for connect from 192.168.1.137@o2ib (not set up)
[21138.297347] Lustre: Skipped 2 previous similar messages
[21167.633849] Lustre: soaked-OST0002: Not available for connect from 192.168.1.111@o2ib (not set up)
[21167.643858] Lustre: Skipped 9 previous similar messages
[21195.356109] Lustre: 23774:0:(service.c:1442:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/5), not sending early reply  req@ffff97278de0e780 x1650366648233984/t0(0) o5->soaked-MDT0002-mdtlov_UUID@192.168.1.110@o2ib:545/0 lens 432/432 e 1 to 0 dl 1573925640 ref 2 fl Interpret:/0/0 rc 0/0 job:''
[21195.386391] Lustre: 23774:0:(service.c:1442:ptlrpc_at_send_early_reply()) Skipped 3 previous similar messages
[21199.340302] ptlrpc_watchdog_fire: 3 callbacks suppressed
[21199.346241] Lustre: ll_ost00_024: service thread pid 5497 was inactive for 1200.379 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[21199.367494] Pid: 5497, comm: ll_ost00_024 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Nov 8 18:37:40 UTC 2019
[21199.378663] Call Trace:
[21199.381399]  [<ffffffffc1801873>] ofd_create_hdl+0xcc3/0x2100 [ofd]
[21199.388410]  [<ffffffffc14f983a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
[21199.396155]  [<ffffffffc149ba96>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[21199.404748]  [<ffffffffc149f5cc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[21199.411790]  [<ffffffff93cc50d1>] kthread+0xd1/0xe0
[21199.417239]  [<ffffffff9438cd37>] ret_from_fork_nospec_end+0x0/0x39
[21199.424249]  [<ffffffffffffffff>] 0xffffffffffffffff
[21203.436516] Pid: 18121, comm: ll_ost01_023 3.10.0-1062.1.1.el7_lustre.x86_64 #1 SMP Fri Nov 8 18:37:40 UTC 2019
[21203.447781] Call Trace:
[21203.450514]  [<ffffffffc1801873>] ofd_create_hdl+0xcc3/0x2100 [ofd]
[21203.457536]  [<ffffffffc14f983a>] tgt_request_handle+0x98a/0x1630 [ptlrpc]
[21203.465297]  [<ffffffffc149ba96>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[21203.473893]  [<ffffffffc149f5cc>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[21203.480928]  [<ffffffff93cc50d1>] kthread+0xd1/0xe0
[21203.486390]  [<ffffffff9438cd37>] ret_from_fork_nospec_end+0x0/0x39
[21203.493401]  [<ffffffffffffffff>] 0xffffffffffffffff
[21217.812639] Lustre: soaked-OST0002: Not available for connect from 192.168.1.111@o2ib (not set up)
[21217.822646] Lustre: Skipped 23 previous similar messages



 Comments   
Comment by Joseph Gmitter (Inactive) [ 20/Nov/19 ]

Hi Alex,

Could you take a look to see what may be going on here?

Thanks.
Joe

Comment by Alex Zhuravlev [ 20/Nov/19 ]

It would be very helpful to get full stack traces. I think I saw something similar in maloo recently, but also without full traces.

Comment by Sarah Liu [ 21/Nov/19 ]

Sorry, there is no full stack trace. I will try to see if this issue can be reproduced and get the stack.
