[LU-4439] Test failure on replay-single test_70b: rundbench load failed Created: 06/Jan/14 Updated: 14/Oct/15 Resolved: 13/Oct/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Nathaniel Clark |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | performance, zfs |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 12183 |
| Description |
|
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/80971760-7539-11e3-936d-52540035b04c.

The sub-test test_70b failed with the following error:

rundbench load failed

Info required for matching: replay-single 70b |
| Comments |
| Comment by Nathaniel Clark [ 09/Jan/14 ] |
|
For the review-zfs failures, the errors are as follows:
|
| Comment by Nathaniel Clark [ 24/Jan/14 ] |
|
Similar occurrence of a blocked ll_ost thread (ll_ost00_002) in the replay-single/10 OST log:

20:59:40:Lustre: DEBUG MARKER: == replay-single test 10: create |X| rename unlink == 20:59:01 (1390453141)
20:59:40:LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 210s: evicting client at 10.10.16.209@tcp ns: filter-lustre-OST0001_UUID lock: ffff88001a728540/0xfea81e9df99a13dc lrc: 3/0,0 mode: PW/PW res: [0x78c5:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000000010020 nid: 10.10.16.209@tcp remote: 0xa2e90a29d78a83f6 expref: 115 pid: 26117 timeout: 4308758774 lvb_type: 0
20:59:40:LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) Skipped 1 previous similar message
21:00:13:Lustre: lustre-OST0001: Export ffff88006ddd4800 already connecting from 10.10.16.209@tcp
21:00:13:Lustre: Skipped 10 previous similar messages
21:00:13:INFO: task ll_ost00_002:25056 blocked for more than 120 seconds.
21:00:13:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
21:00:13:ll_ost00_002 D 0000000000000000 0 25056 2 0x00000080
21:00:13: ffff8800735e9a10 0000000000000046 ffff88006f2716c0 ffffc900076dae40
21:00:13: ffff8800735e99f0 ffffffffa01fa09a ffff8800735e9a60 0000000000000000
21:00:13: ffff880079bc3098 ffff8800735e9fd8 000000000000fb88 ffff880079bc3098
21:00:13:Call Trace:
21:00:13: [<ffffffffa01fa09a>] ? dmu_tx_count_twig+0x10a/0x1a0 [zfs]
21:00:13: [<ffffffff81096fae>] ? prepare_to_wait_exclusive+0x4e/0x80
21:00:13: [<ffffffffa0135edd>] cv_wait_common+0xed/0x100 [spl]
21:00:13: [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
21:00:13: [<ffffffffa0135f45>] __cv_wait+0x15/0x20 [spl]
21:00:13: [<ffffffffa0234e9b>] txg_wait_open+0x7b/0xa0 [zfs]
21:00:13: [<ffffffffa01f9a5d>] dmu_tx_wait+0xed/0xf0 [zfs]
21:00:13: [<ffffffffa01f9aee>] dmu_tx_assign+0x8e/0x4e0 [zfs]
21:00:13: [<ffffffffa0e1c531>] ? osd_declare_quota+0x1c1/0x2d0 [osd_zfs]
21:00:13: [<ffffffffa0e1756c>] osd_trans_start+0x9c/0x410 [osd_zfs]
21:00:13: [<ffffffffa09d4cfe>] tgt_client_data_update+0x26e/0x5a0 [ptlrpc]
21:00:13: [<ffffffffa09d5280>] tgt_client_del+0x250/0x510 [ptlrpc]
21:00:13: [<ffffffffa0ece47e>] ? ofd_grant_discard+0x3e/0x1c0 [ofd]
21:00:14: [<ffffffffa0eba55b>] ofd_obd_disconnect+0x1bb/0x200 [ofd]
21:00:14: [<ffffffffa093c481>] target_handle_disconnect+0x1b1/0x480 [ptlrpc]
21:00:14: [<ffffffffa09dd6d9>] tgt_disconnect+0x39/0x160 [ptlrpc]
21:00:14: [<ffffffffa09ddf2c>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
21:00:14: [<ffffffffa098d61a>] ptlrpc_main+0xd1a/0x1970 [ptlrpc]
21:00:14: [<ffffffffa098c900>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]
21:00:14: [<ffffffff81096a36>] kthread+0x96/0xa0
21:00:14: [<ffffffff8100c0ca>] child_rip+0xa/0x20
21:00:14: [<ffffffff810969a0>] ? kthread+0x0/0xa0
21:00:14: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
|
| Comment by Nathaniel Clark [ 27/Jan/14 ] |
|
Similar stack in review-zfs recovery-small/27:

Jan 23 20:54:34 wtm-14vm4 kernel: LNet: Service thread pid 18275 was inactive for 136.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jan 23 20:54:34 wtm-14vm4 kernel: Pid: 18275, comm: ll_ost_io00_016
Jan 23 20:54:34 wtm-14vm4 kernel:
Jan 23 20:54:34 wtm-14vm4 kernel: Call Trace:
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffff81096fae>] ? prepare_to_wait_exclusive+0x4e/0x80
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa0135edd>] cv_wait_common+0xed/0x100 [spl]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa0135f45>] __cv_wait+0x15/0x20 [spl]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa0234e9b>] txg_wait_open+0x7b/0xa0 [zfs]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa01f9a5d>] dmu_tx_wait+0xed/0xf0 [zfs]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa01f9aee>] dmu_tx_assign+0x8e/0x4e0 [zfs]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa0e1756c>] osd_trans_start+0x9c/0x410 [osd_zfs]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa0ec30ec>] ofd_trans_start+0x7c/0x100 [ofd]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa0ec7df3>] ofd_commitrw_write+0x523/0xfd0 [ofd]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa0ec8dca>] ofd_commitrw+0x52a/0x8c0 [ofd]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa09d81dd>] obd_commitrw.clone.0+0x11d/0x390 [ptlrpc]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa09df3fe>] tgt_brw_write+0xc7e/0x1530 [ptlrpc]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa05e9a81>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa05e9a81>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa093b7c0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa09ddefc>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa098d61a>] ptlrpc_main+0xd1a/0x1970 [ptlrpc]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa098c900>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffff81096a36>] kthread+0x96/0xa0
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffff810969a0>] ? kthread+0x0/0xa0
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
|
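Both stacks above are parked in txg_wait_open() via dmu_tx_wait()/dmu_tx_assign(). Below is a minimal sketch of the standard ZFS-on-Linux DMU transaction-assignment pattern that osd_trans_start() reaches; sketch_write() is a hypothetical helper for illustration (not the actual Lustre osd-zfs code), but the dmu_tx_*() calls and the ERESTART retry loop are the stock DMU API usage.

    #include <sys/dmu.h>
    #include <sys/dmu_tx.h>

    /*
     * Sketch of the DMU transaction-assignment pattern behind the
     * stacks above.  ZFS batches modifications into transaction
     * groups (txgs).  When the open txg cannot accept another
     * transaction (e.g. the dirty-data limit is reached while the
     * previous txg is still syncing), dmu_tx_assign(TXG_NOWAIT)
     * fails with ERESTART and the caller sleeps in
     * dmu_tx_wait() -> txg_wait_open() until the next txg opens.
     * A slow or stuck txg sync therefore leaves every ll_ost /
     * ll_ost_io thread that needs a transaction blocked right here.
     */
    static int
    sketch_write(objset_t *os, uint64_t object, uint64_t off, int len)
    {
            dmu_tx_t *tx;
            int err;

    top:
            tx = dmu_tx_create(os);
            dmu_tx_hold_write(tx, object, off, len);

            err = dmu_tx_assign(tx, TXG_NOWAIT);
            if (err == ERESTART) {
                    dmu_tx_wait(tx);        /* blocks in txg_wait_open() */
                    dmu_tx_abort(tx);
                    goto top;               /* retry against the new txg */
            }
            if (err != 0) {
                    dmu_tx_abort(tx);
                    return (err);
            }

            /* ... perform the modification under this tx ... */

            dmu_tx_commit(tx);
            return (0);
    }

Note that in the replay-single/10 stack the same assign path is reached from disconnect processing (tgt_client_del() -> tgt_client_data_update() -> osd_trans_start()), so client eviction/disconnect handling stalls behind the same txg wait as the bulk-write path.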
| Comment by Andreas Dilger [ 13/Oct/15 ] |
|
Haven't seen this in a long time. |
| Comment by Saurabh Tandan (Inactive) [ 14/Oct/15 ] |
|
Encountered another instance during hard-failover testing of an EL6.7 server/client ZFS configuration, for tag 2.7.61: |