[LU-5214] Failure on test suite replay-ost-single test_5 Created: 17/Jun/14  Updated: 16/Jan/19  Resolved: 16/Jan/19

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0, Lustre 2.7.0, Lustre 2.8.0, Lustre 2.10.0, Lustre 2.11.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Yang Sheng
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

server and client: lustre-master build # 2091 DNE


Issue Links:
Duplicate
Related
is related to LU-9273 replay-ost-single test_5: timeout aft... Resolved
is related to LU-4950 sanity-benchmark test fsx hung: txg_s... Closed
is related to LU-5575 Failure on test suite replay-ost-sing... Closed
Severity: 3
Rank (Obsolete): 14548

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/221291fa-f523-11e3-b29e-52540035b04c.

The sub-test test_5 failed with the following error:

test failed to respond and timed out



 Comments   
Comment by Oleg Drokin [ 18/Jun/14 ]

shadow-49vm4 has these locked-up threads in its dmesg logs:

LNet: Service thread pid 13815 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 13815, comm: ll_ost_io00_062

Call Trace:
 [<ffffffff815287f3>] io_schedule+0x73/0xc0
 [<ffffffff81267cc8>] get_request_wait+0x108/0x1d0
 [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8126160e>] ? elv_merge+0x17e/0x1c0
 [<ffffffff81267e29>] blk_queue_bio+0x99/0x620
 [<ffffffff8116e900>] ? cache_alloc_refill+0x1c0/0x240
 [<ffffffff81266ebf>] generic_make_request+0x29f/0x5f0
 [<ffffffffa0002f97>] ? dm_merge_bvec+0xc7/0x100 [dm_mod]
 [<ffffffff81267280>] submit_bio+0x70/0x120
 [<ffffffffa05bf80e>] ? lprocfs_oh_tally+0x2e/0x50 [obdclass]
 [<ffffffffa0d3f80c>] osd_submit_bio+0x1c/0x60 [osd_ldiskfs]
 [<ffffffffa0d3fc3c>] osd_do_bio+0x3ec/0x820 [osd_ldiskfs]
 [<ffffffffa0436878>] ? __ldiskfs_journal_stop+0x68/0xa0 [ldiskfs]
 [<ffffffffa0d4317c>] osd_write_commit+0x31c/0x610 [osd_ldiskfs]
 [<ffffffffa0e63d04>] ofd_commitrw_write+0x604/0xfd0 [ofd]
 [<ffffffffa0e64bfa>] ofd_commitrw+0x52a/0x8c0 [ofd]
 [<ffffffffa05cac31>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
 [<ffffffffa088358d>] obd_commitrw.clone.0+0x11d/0x390 [ptlrpc]
 [<ffffffffa088a7ce>] tgt_brw_write+0xc7e/0x1530 [ptlrpc]
 [<ffffffffa07e6750>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa08892cc>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
 [<ffffffffa0838d3a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
 [<ffffffffa0838020>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
 [<ffffffff8109ab56>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109aac0>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20

LustreError: dumping log to /tmp/lustre-log.1402825080.13815
LNet: Service thread pid 13815 completed after 50.47s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
LNet: Service thread pid 12699 completed after 62.13s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
LNet: Service thread pid 12663 was inactive for 62.16s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 12663, comm: ll_ost_io00_016

Call Trace:
 [<ffffffff815287f3>] io_schedule+0x73/0xc0
 [<ffffffff81267cc8>] get_request_wait+0x108/0x1d0
 [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8126160e>] ? elv_merge+0x17e/0x1c0
 [<ffffffff81267e29>] blk_queue_bio+0x99/0x620
 [<ffffffff8116e637>] ? cache_grow+0x217/0x320
 [<ffffffff81266ebf>] generic_make_request+0x29f/0x5f0
 [<ffffffffa0002f97>] ? dm_merge_bvec+0xc7/0x100 [dm_mod]
 [<ffffffff81267280>] submit_bio+0x70/0x120
 [<ffffffffa05bf80e>] ? lprocfs_oh_tally+0x2e/0x50 [obdclass]
 [<ffffffffa0d3f80c>] osd_submit_bio+0x1c/0x60 [osd_ldiskfs]
 [<ffffffffa0d3fc3c>] osd_do_bio+0x3ec/0x820 [osd_ldiskfs]
 [<ffffffffa0436878>] ? __ldiskfs_journal_stop+0x68/0xa0 [ldiskfs]
 [<ffffffffa0d4317c>] osd_write_commit+0x31c/0x610 [osd_ldiskfs]
 [<ffffffffa0e63d04>] ofd_commitrw_write+0x604/0xfd0 [ofd]
 [<ffffffffa0e64bfa>] ofd_commitrw+0x52a/0x8c0 [ofd]
 [<ffffffffa05cac31>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
 [<ffffffffa088358d>] obd_commitrw.clone.0+0x11d/0x390 [ptlrpc]
 [<ffffffffa088a7ce>] tgt_brw_write+0xc7e/0x1530 [ptlrpc]
 [<ffffffffa07e6750>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa08892cc>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
 [<ffffffffa0838d3a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
 [<ffffffffa0838020>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
 [<ffffffff8109ab56>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109aac0>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20

Pid: 12682, comm: ll_ost_io00_035

Call Trace:
 [<ffffffffa03dd9b7>] ? jbd2_journal_stop+0x1e7/0x2b0 [jbd2]
 [<ffffffff8109b22e>] ? prepare_to_wait+0x4e/0x80
 [<ffffffffa0d203f5>] osd_trans_stop+0x195/0x550 [osd_ldiskfs]
 [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0e5c5ff>] ofd_trans_stop+0x1f/0x60 [ofd]
 [<ffffffffa0e63aa2>] ofd_commitrw_write+0x3a2/0xfd0 [ofd]
 [<ffffffffa0e64bfa>] ofd_commitrw+0x52a/0x8c0 [ofd]
 [<ffffffffa05cac31>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
 [<ffffffffa088358d>] obd_commitrw.clone.0+0x11d/0x390 [ptlrpc]
 [<ffffffffa088a7ce>] tgt_brw_write+0xc7e/0x1530 [ptlrpc]
 [<ffffffffa07e6750>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa08892cc>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
 [<ffffffffa0838d3a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
 [<ffffffffa0838020>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
 [<ffffffff8109ab56>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109aac0>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20

Pid: 12668, comm: ll_ost_io00_021

Call Trace:
 [<ffffffffa03de08a>] start_this_handle+0x25a/0x480 [jbd2]
 [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa03de495>] jbd2_journal_start+0xb5/0x100 [jbd2]
 [<ffffffffa0436906>] ldiskfs_journal_start_sb+0x56/0xe0 [ldiskfs]
 [<ffffffffa0d21fdf>] osd_trans_start+0x1df/0x660 [osd_ldiskfs]
 [<ffffffffa0d3182a>] ? osd_declare_attr_set+0x13a/0x7b0 [osd_ldiskfs]
 [<ffffffffa0e5c6bc>] ofd_trans_start+0x7c/0x100 [ofd]
 [<ffffffffa0e63c23>] ofd_commitrw_write+0x523/0xfd0 [ofd]
 [<ffffffffa0e64bfa>] ofd_commitrw+0x52a/0x8c0 [ofd]
 [<ffffffffa05cac31>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
 [<ffffffffa088358d>] obd_commitrw.clone.0+0x11d/0x390 [ptlrpc]
 [<ffffffffa088a7ce>] tgt_brw_write+0xc7e/0x1530 [ptlrpc]
 [<ffffffffa07e6750>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa08892cc>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
 [<ffffffffa0838d3a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
 [<ffffffffa0838020>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
 [<ffffffff8109ab56>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109aac0>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
Comment by Saurabh Tandan (Inactive) [ 10/Dec/15 ]

master, build# 3264, 2.7.64 tag
Hard Failover: EL6.7 Server/Client - ZFS
https://testing.hpdd.intel.com/test_sets/2dc08784-9ebc-11e5-98a4-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 20/Jan/16 ]

Another instance found for hard failover: EL6.7 Server/Client - ZFS
build# 3305
https://testing.hpdd.intel.com/test_sets/e3cfd3b2-bbd7-11e5-8506-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 20/Jan/16 ]

Another instance found for hard failover: EL7 Server/Client - ZFS
build# 3305
https://testing.hpdd.intel.com/test_sets/febe1384-bbc6-11e5-8506-5254006e85c2

Comment by Sarah Liu [ 20/Jan/16 ]

Instance on master build # 3305, RHEL6.7:
https://testing.hpdd.intel.com/test_sets/8b9aed50-bc84-11e5-b3b7-5254006e85c2

It looks like this issue affects multiple branches; could it be given a higher priority?

Comment by Peter Jones [ 22/Jan/16 ]

YangSheng

Could you please look into this issue?

Thanks

Peter

Comment by Saurabh Tandan (Inactive) [ 04/Feb/16 ]

Another instance occurred for FULL - EL6.7 Server/EL6.7 Client - ZFS, master, build# 3314
https://testing.hpdd.intel.com/test_sets/98eb99ce-cb47-11e5-a59a-5254006e85c2

Another instance on master for FULL - EL7.1 Server/EL7.1 Client - ZFS, build# 3314
https://testing.hpdd.intel.com/test_sets/ddc75dc6-cb88-11e5-b49e-5254006e85c2

Comment by Yang Sheng [ 16/Jan/19 ]

Closing as cannot reproduce. Please reopen this ticket if the issue is hit again.

Generated at Sat Feb 10 01:49:31 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.