[LU-4439] Test failure on replay-single test_70b: rundbench load failed Created: 06/Jan/14  Updated: 14/Oct/15  Resolved: 13/Oct/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.6.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Nathaniel Clark
Resolution: Cannot Reproduce Votes: 0
Labels: performance, zfs

Issue Links:
Related
is related to LU-6844 replay-single test 70b failure: 'rund... Resolved
Severity: 3
Rank (Obsolete): 12183

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/80971760-7539-11e3-936d-52540035b04c.

The sub-test test_70b failed with the following error:

rundbench load on .* failed!

Info required for matching: replay-single 70b



 Comments   
Comment by Nathaniel Clark [ 09/Jan/14 ]

For the review-zfs failures, the errors are as follows:

  • dbench on the client gets a "No space left on device" (ENOSPC) error
  • MDT:
    17:25:20:LustreError: 8125:0:(osp_precreate.c:481:osp_precreate_send()) lustre-OST0001-osc-MDT0000: can't precreate: rc = -5
    17:25:20:LustreError: 8125:0:(osp_precreate.c:984:osp_precreate_thread()) lustre-OST0001-osc-MDT0000: cannot precreate objects: rc = -5
    
  • OST (the I/O service thread is blocked in txg_wait_open waiting for a new ZFS transaction group; see the sketch after this trace):
    17:24:36:Lustre: DEBUG MARKER: /usr/sbin/lctl mark test_70b fail mds1 1 times
    17:24:36:Lustre: DEBUG MARKER: test_70b fail mds1 1 times
    17:24:36:LNet: Service thread pid 9687 was inactive for 62.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
    17:24:36:LNet: Skipped 3 previous similar messages
    17:24:36:Pid: 9687, comm: ll_ost_io00_019
    17:24:36:
    17:24:36:Call Trace:
    17:24:36: [<ffffffff81096fae>] ? prepare_to_wait_exclusive+0x4e/0x80
    17:24:36: [<ffffffffa0135edd>] cv_wait_common+0xed/0x100 [spl]
    17:24:36: [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
    17:24:36: [<ffffffffa0135f45>] __cv_wait+0x15/0x20 [spl]
    17:24:36: [<ffffffffa0234e9b>] txg_wait_open+0x7b/0xa0 [zfs]
    17:24:36: [<ffffffffa01f9a5d>] dmu_tx_wait+0xed/0xf0 [zfs]
    17:24:36: [<ffffffffa01f9aee>] dmu_tx_assign+0x8e/0x4e0 [zfs]
    17:24:36: [<ffffffffa0e9c56c>] osd_trans_start+0x9c/0x410 [osd_zfs]
    17:24:36: [<ffffffffa0f4947c>] ofd_trans_start+0x7c/0x100 [ofd]
    17:24:36: [<ffffffffa0f4e5c3>] ofd_commitrw_write+0x523/0xfd0 [ofd]
    17:24:36: [<ffffffffa0f4fe13>] ofd_commitrw+0x5a3/0xad0 [ofd]
    17:24:36: [<ffffffffa0722271>] ? lprocfs_counter_add+0x151/0x1c0 [obdclass]
    17:24:36: [<ffffffffa09ccb7d>] obd_commitrw.clone.0+0x11d/0x390 [ptlrpc]
    17:24:36: [<ffffffffa09d1318>] tgt_brw_write+0xd18/0x1550 [ptlrpc]
    17:24:36: [<ffffffffa05d927b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
    17:24:36: [<ffffffffa09357c0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
    17:24:36: [<ffffffffa09d2fea>] tgt_handle_request0+0x2ea/0x1490 [ptlrpc]
    17:24:36: [<ffffffffa09d45ca>] tgt_request_handle+0x43a/0x980 [ptlrpc]
    17:24:36: [<ffffffffa0987725>] ptlrpc_main+0xd25/0x1970 [ptlrpc]
    17:24:36: [<ffffffffa0986a00>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]
    17:24:36: [<ffffffff81096a36>] kthread+0x96/0xa0
    17:24:36: [<ffffffff8100c0ca>] child_rip+0xa/0x20
    17:24:37: [<ffffffff810969a0>] ? kthread+0x0/0xa0
    17:24:37: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
    ...
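All of the traces above bottom out in dmu_tx_assign() -> dmu_tx_wait() -> txg_wait_open(), i.e. the OSD thread is parked waiting for ZFS to open a new transaction group. Below is a minimal, hypothetical C sketch of that blocking pattern; it is not the actual osd-zfs osd_trans_start() code, and sketch_write_trans_start() and its parameters are made up for illustration:

/*
 * Sketch only: illustrates why a thread assigning a DMU transaction
 * with TXG_WAIT can end up stuck in txg_wait_open(), as seen in the
 * ll_ost_io00_019 trace above.
 */
#include <sys/dmu.h>

static int
sketch_write_trans_start(objset_t *os, uint64_t object,
                         uint64_t offset, int len)
{
        dmu_tx_t *tx = dmu_tx_create(os);
        int rc;

        /* Declare the write so the DMU can reserve space for it. */
        dmu_tx_hold_write(tx, object, offset, len);

        /*
         * TXG_WAIT retries internally (dmu_tx_wait -> txg_wait_open)
         * until the transaction fits into an open txg, so the thread
         * blocks if the open txg is full and syncing is slow.  If the
         * pool is genuinely out of space, ENOSPC is returned to the
         * caller, which matches the client-side ENOSPC and the MDT
         * precreate failures (rc = -5) reported above.
         */
        rc = dmu_tx_assign(tx, TXG_WAIT);
        if (rc != 0) {
                dmu_tx_abort(tx);
                return rc;
        }

        /* ... perform the write under this tx ... */

        dmu_tx_commit(tx);
        return 0;
}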
    
Comment by Nathaniel Clark [ 24/Jan/14 ]

A similar occurrence of a blocked ll_ost_io thread in replay-single test 10:
https://maloo.whamcloud.com/test_sets/6c773394-8445-11e3-a862-52540035b04c

OST log:

20:59:40:Lustre: DEBUG MARKER: == replay-single test 10: create |X| rename unlink == 20:59:01 (1390453141)
20:59:40:LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) ### lock callback timer expired after 210s: evicting client at 10.10.16.209@tcp  ns: filter-lustre-OST0001_UUID lock: ffff88001a728540/0xfea81e9df99a13dc lrc: 3/0,0 mode: PW/PW res: [0x78c5:0x0:0x0].0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->4095) flags: 0x60000000010020 nid: 10.10.16.209@tcp remote: 0xa2e90a29d78a83f6 expref: 115 pid: 26117 timeout: 4308758774 lvb_type: 0
20:59:40:LustreError: 0:0:(ldlm_lockd.c:344:waiting_locks_callback()) Skipped 1 previous similar message
21:00:13:Lustre: lustre-OST0001: Export ffff88006ddd4800 already connecting from 10.10.16.209@tcp
21:00:13:Lustre: Skipped 10 previous similar messages
21:00:13:INFO: task ll_ost00_002:25056 blocked for more than 120 seconds.
21:00:13:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
21:00:13:ll_ost00_002  D 0000000000000000     0 25056      2 0x00000080
21:00:13: ffff8800735e9a10 0000000000000046 ffff88006f2716c0 ffffc900076dae40
21:00:13: ffff8800735e99f0 ffffffffa01fa09a ffff8800735e9a60 0000000000000000
21:00:13: ffff880079bc3098 ffff8800735e9fd8 000000000000fb88 ffff880079bc3098
21:00:13:Call Trace:
21:00:13: [<ffffffffa01fa09a>] ? dmu_tx_count_twig+0x10a/0x1a0 [zfs]
21:00:13: [<ffffffff81096fae>] ? prepare_to_wait_exclusive+0x4e/0x80
21:00:13: [<ffffffffa0135edd>] cv_wait_common+0xed/0x100 [spl]
21:00:13: [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
21:00:13: [<ffffffffa0135f45>] __cv_wait+0x15/0x20 [spl]
21:00:13: [<ffffffffa0234e9b>] txg_wait_open+0x7b/0xa0 [zfs]
21:00:13: [<ffffffffa01f9a5d>] dmu_tx_wait+0xed/0xf0 [zfs]
21:00:13: [<ffffffffa01f9aee>] dmu_tx_assign+0x8e/0x4e0 [zfs]
21:00:13: [<ffffffffa0e1c531>] ? osd_declare_quota+0x1c1/0x2d0 [osd_zfs]
21:00:13: [<ffffffffa0e1756c>] osd_trans_start+0x9c/0x410 [osd_zfs]
21:00:13: [<ffffffffa09d4cfe>] tgt_client_data_update+0x26e/0x5a0 [ptlrpc]
21:00:13: [<ffffffffa09d5280>] tgt_client_del+0x250/0x510 [ptlrpc]
21:00:13: [<ffffffffa0ece47e>] ? ofd_grant_discard+0x3e/0x1c0 [ofd]
21:00:14: [<ffffffffa0eba55b>] ofd_obd_disconnect+0x1bb/0x200 [ofd]
21:00:14: [<ffffffffa093c481>] target_handle_disconnect+0x1b1/0x480 [ptlrpc]
21:00:14: [<ffffffffa09dd6d9>] tgt_disconnect+0x39/0x160 [ptlrpc]
21:00:14: [<ffffffffa09ddf2c>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
21:00:14: [<ffffffffa098d61a>] ptlrpc_main+0xd1a/0x1970 [ptlrpc]
21:00:14: [<ffffffffa098c900>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]
21:00:14: [<ffffffff81096a36>] kthread+0x96/0xa0
21:00:14: [<ffffffff8100c0ca>] child_rip+0xa/0x20
21:00:14: [<ffffffff810969a0>] ? kthread+0x0/0xa0
21:00:14: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Comment by Nathaniel Clark [ 27/Jan/14 ]

Another review-zfs occurrence, in recovery-small test 27:
https://maloo.whamcloud.com/test_sets/ffbbaa16-851e-11e3-ac35-52540035b04c

Jan 23 20:54:34 wtm-14vm4 kernel: LNet: Service thread pid 18275 was inactive for 136.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Jan 23 20:54:34 wtm-14vm4 kernel: Pid: 18275, comm: ll_ost_io00_016
Jan 23 20:54:34 wtm-14vm4 kernel: 
Jan 23 20:54:34 wtm-14vm4 kernel: Call Trace:
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffff81096fae>] ? prepare_to_wait_exclusive+0x4e/0x80
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa0135edd>] cv_wait_common+0xed/0x100 [spl]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa0135f45>] __cv_wait+0x15/0x20 [spl]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa0234e9b>] txg_wait_open+0x7b/0xa0 [zfs]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa01f9a5d>] dmu_tx_wait+0xed/0xf0 [zfs]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa01f9aee>] dmu_tx_assign+0x8e/0x4e0 [zfs]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa0e1756c>] osd_trans_start+0x9c/0x410 [osd_zfs]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa0ec30ec>] ofd_trans_start+0x7c/0x100 [ofd]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa0ec7df3>] ofd_commitrw_write+0x523/0xfd0 [ofd]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa0ec8dca>] ofd_commitrw+0x52a/0x8c0 [ofd]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa09d81dd>] obd_commitrw.clone.0+0x11d/0x390 [ptlrpc]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa09df3fe>] tgt_brw_write+0xc7e/0x1530 [ptlrpc]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa05e9a81>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa05e9a81>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa093b7c0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa09ddefc>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa098d61a>] ptlrpc_main+0xd1a/0x1970 [ptlrpc]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffffa098c900>] ? ptlrpc_main+0x0/0x1970 [ptlrpc]
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffff81096a36>] kthread+0x96/0xa0
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffff810969a0>] ? kthread+0x0/0xa0
Jan 23 20:54:34 wtm-14vm4 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Comment by Andreas Dilger [ 13/Oct/15 ]

Haven't seen this in a long time.

Comment by Saurabh Tandan (Inactive) [ 14/Oct/15 ]

Encountered another instance during Hard Failover testing of EL6.7 Server/Client - ZFS, for tag 2.7.61:
https://testing.hpdd.intel.com/test_sets/4d8e8fc6-6d42-11e5-bf10-5254006e85c2
