[LU-5214] Failure on test suite replay-ost-single test_5
| Created: | 17/Jun/14 | Updated: | 16/Jan/19 | Resolved: | 16/Jan/19 |
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.7.0, Lustre 2.8.0, Lustre 2.10.0, Lustre 2.11.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Yang Sheng |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: | server and client: lustre-master build # 2091 DNE |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 14548 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>. It relates to the following test suite run: http://maloo.whamcloud.com/test_sets/221291fa-f523-11e3-b29e-52540035b04c. The sub-test test_5 failed with the following error:
|
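For reference, a minimal sketch of rerunning just this sub-test with the in-tree Lustre test framework; this assumes an already-configured test cluster, and uses the ONLY= variable, which is the usual lustre/tests way to restrict a suite to one sub-test:

```sh
# Run from a Lustre source tree on the test node; the cluster layout
# comes from the usual lustre/tests config (e.g. cfg/local.sh).
cd lustre/tests
# Restrict the replay-ost-single suite to sub-test 5 only.
ONLY=5 bash replay-ost-single.sh
```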
| Comments |
| Comment by Oleg Drokin [ 18/Jun/14 ] |
|
shadow-49vm4 has these locked up threads in the dmesg logs:

```
LNet: Service thread pid 13815 was inactive for 40.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 13815, comm: ll_ost_io00_062
Call Trace:
 [<ffffffff815287f3>] io_schedule+0x73/0xc0
 [<ffffffff81267cc8>] get_request_wait+0x108/0x1d0
 [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8126160e>] ? elv_merge+0x17e/0x1c0
 [<ffffffff81267e29>] blk_queue_bio+0x99/0x620
 [<ffffffff8116e900>] ? cache_alloc_refill+0x1c0/0x240
 [<ffffffff81266ebf>] generic_make_request+0x29f/0x5f0
 [<ffffffffa0002f97>] ? dm_merge_bvec+0xc7/0x100 [dm_mod]
 [<ffffffff81267280>] submit_bio+0x70/0x120
 [<ffffffffa05bf80e>] ? lprocfs_oh_tally+0x2e/0x50 [obdclass]
 [<ffffffffa0d3f80c>] osd_submit_bio+0x1c/0x60 [osd_ldiskfs]
 [<ffffffffa0d3fc3c>] osd_do_bio+0x3ec/0x820 [osd_ldiskfs]
 [<ffffffffa0436878>] ? __ldiskfs_journal_stop+0x68/0xa0 [ldiskfs]
 [<ffffffffa0d4317c>] osd_write_commit+0x31c/0x610 [osd_ldiskfs]
 [<ffffffffa0e63d04>] ofd_commitrw_write+0x604/0xfd0 [ofd]
 [<ffffffffa0e64bfa>] ofd_commitrw+0x52a/0x8c0 [ofd]
 [<ffffffffa05cac31>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
 [<ffffffffa088358d>] obd_commitrw.clone.0+0x11d/0x390 [ptlrpc]
 [<ffffffffa088a7ce>] tgt_brw_write+0xc7e/0x1530 [ptlrpc]
 [<ffffffffa07e6750>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa08892cc>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
 [<ffffffffa0838d3a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
 [<ffffffffa0838020>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
 [<ffffffff8109ab56>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109aac0>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
LustreError: dumping log to /tmp/lustre-log.1402825080.13815
LNet: Service thread pid 13815 completed after 50.47s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
LNet: Service thread pid 12699 completed after 62.13s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
LNet: Service thread pid 12663 was inactive for 62.16s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 12663, comm: ll_ost_io00_016
Call Trace:
 [<ffffffff815287f3>] io_schedule+0x73/0xc0
 [<ffffffff81267cc8>] get_request_wait+0x108/0x1d0
 [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8126160e>] ? elv_merge+0x17e/0x1c0
 [<ffffffff81267e29>] blk_queue_bio+0x99/0x620
 [<ffffffff8116e637>] ? cache_grow+0x217/0x320
 [<ffffffff81266ebf>] generic_make_request+0x29f/0x5f0
 [<ffffffffa0002f97>] ? dm_merge_bvec+0xc7/0x100 [dm_mod]
 [<ffffffff81267280>] submit_bio+0x70/0x120
 [<ffffffffa05bf80e>] ? lprocfs_oh_tally+0x2e/0x50 [obdclass]
 [<ffffffffa0d3f80c>] osd_submit_bio+0x1c/0x60 [osd_ldiskfs]
 [<ffffffffa0d3fc3c>] osd_do_bio+0x3ec/0x820 [osd_ldiskfs]
 [<ffffffffa0436878>] ? __ldiskfs_journal_stop+0x68/0xa0 [ldiskfs]
 [<ffffffffa0d4317c>] osd_write_commit+0x31c/0x610 [osd_ldiskfs]
 [<ffffffffa0e63d04>] ofd_commitrw_write+0x604/0xfd0 [ofd]
 [<ffffffffa0e64bfa>] ofd_commitrw+0x52a/0x8c0 [ofd]
 [<ffffffffa05cac31>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
 [<ffffffffa088358d>] obd_commitrw.clone.0+0x11d/0x390 [ptlrpc]
 [<ffffffffa088a7ce>] tgt_brw_write+0xc7e/0x1530 [ptlrpc]
 [<ffffffffa07e6750>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa08892cc>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
 [<ffffffffa0838d3a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
 [<ffffffffa0838020>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
 [<ffffffff8109ab56>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109aac0>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
Pid: 12682, comm: ll_ost_io00_035
Call Trace:
 [<ffffffffa03dd9b7>] ? jbd2_journal_stop+0x1e7/0x2b0 [jbd2]
 [<ffffffff8109b22e>] ? prepare_to_wait+0x4e/0x80
 [<ffffffffa0d203f5>] osd_trans_stop+0x195/0x550 [osd_ldiskfs]
 [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0e5c5ff>] ofd_trans_stop+0x1f/0x60 [ofd]
 [<ffffffffa0e63aa2>] ofd_commitrw_write+0x3a2/0xfd0 [ofd]
 [<ffffffffa0e64bfa>] ofd_commitrw+0x52a/0x8c0 [ofd]
 [<ffffffffa05cac31>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
 [<ffffffffa088358d>] obd_commitrw.clone.0+0x11d/0x390 [ptlrpc]
 [<ffffffffa088a7ce>] tgt_brw_write+0xc7e/0x1530 [ptlrpc]
 [<ffffffffa07e6750>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa08892cc>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
 [<ffffffffa0838d3a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
 [<ffffffffa0838020>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
 [<ffffffff8109ab56>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109aac0>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
Pid: 12668, comm: ll_ost_io00_021
Call Trace:
 [<ffffffffa03de08a>] start_this_handle+0x25a/0x480 [jbd2]
 [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa03de495>] jbd2_journal_start+0xb5/0x100 [jbd2]
 [<ffffffffa0436906>] ldiskfs_journal_start_sb+0x56/0xe0 [ldiskfs]
 [<ffffffffa0d21fdf>] osd_trans_start+0x1df/0x660 [osd_ldiskfs]
 [<ffffffffa0d3182a>] ? osd_declare_attr_set+0x13a/0x7b0 [osd_ldiskfs]
 [<ffffffffa0e5c6bc>] ofd_trans_start+0x7c/0x100 [ofd]
 [<ffffffffa0e63c23>] ofd_commitrw_write+0x523/0xfd0 [ofd]
 [<ffffffffa0e64bfa>] ofd_commitrw+0x52a/0x8c0 [ofd]
 [<ffffffffa05cac31>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
 [<ffffffffa088358d>] obd_commitrw.clone.0+0x11d/0x390 [ptlrpc]
 [<ffffffffa088a7ce>] tgt_brw_write+0xc7e/0x1530 [ptlrpc]
 [<ffffffffa07e6750>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa08892cc>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
 [<ffffffffa0838d3a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
 [<ffffffffa0838020>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
 [<ffffffff8109ab56>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109aac0>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
```
|
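Reading the traces: pids 13815 and 12663 are parked in get_request_wait(), waiting for a free slot in the backing block device's request queue, while pids 12682 and 12668 queue up behind them on the jbd2 journal. That pattern points at a saturated OST disk rather than a Lustre deadlock, consistent with the "system was overloaded" console messages. A minimal sketch of checks one might run on the OST node while the test is stalled (the device name below is a placeholder, not taken from these logs):

```sh
# Placeholder device name; substitute the OST's actual backing device.
DEV=sdb

# Depth of the block-layer request queue that get_request_wait() blocks on.
cat /sys/block/$DEV/queue/nr_requests

# Requests currently in flight (reads, then writes); a value pinned near
# the queue depth would confirm the queue-full picture in the traces.
cat /sys/block/$DEV/inflight

# Extended device statistics; sustained high await/%util indicates the
# disk is simply overloaded, matching the LNet "overloaded" messages.
iostat -x "$DEV" 5
```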
| Comment by Saurabh Tandan (Inactive) [ 10/Dec/15 ] |
|
Another instance on master, build #3264, tag 2.7.64. |
| Comment by Saurabh Tandan (Inactive) [ 20/Jan/16 ] |
|
Another instance found for hardfailover: EL6.7 Server/Client - ZFS |
| Comment by Saurabh Tandan (Inactive) [ 20/Jan/16 ] |
|
Another instance found for hardfailover: EL7 Server/Client - ZFS |
| Comment by Sarah Liu [ 20/Jan/16 ] |
|
Another instance on master, build #3305, RHEL6.7. It looks like this issue affects multiple branches; could this be considered a higher priority? |
| Comment by Peter Jones [ 22/Jan/16 ] |
|
YangSheng, could you please look into this issue? Thanks, Peter |
| Comment by Saurabh Tandan (Inactive) [ 04/Feb/16 ] |
|
Another instance occurred on master for FULL - EL6.7 Server/EL6.7 Client - ZFS, build #3314. Another instance on master for FULL - EL7.1 Server/EL7.1 Client - ZFS, build #3314. |
| Comment by Yang Sheng [ 16/Jan/19 ] |
|
Closing as Cannot Reproduce. Please reopen if this is hit again. |