Details
Bug
Resolution: Fixed
Critical
Lustre 2.6.0
Hyperion/LLNL
3
14730
Description
Running IOR with 100 clients. Performance is terrible. OSTs are wedging and dropping watchdogs.
Example:
2014-07-01 08:22:47 LNet: Service thread pid 8308 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
2014-07-01 08:22:47 Pid: 8308, comm: ll_ost_io00_014
2014-07-01 08:22:47
2014-07-01 08:22:47 Call Trace:
2014-07-01 08:22:47 [<ffffffffa05b34ba>] ? dmu_zfetch+0x51a/0xd70 [zfs]
2014-07-01 08:22:47 [<ffffffff810a6d01>] ? ktime_get_ts+0xb1/0xf0
2014-07-01 08:22:47 [<ffffffff815287f3>] io_schedule+0x73/0xc0
2014-07-01 08:22:47 [<ffffffffa04f841c>] cv_wait_common+0x8c/0x100 [spl]
2014-07-01 08:22:47 [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
2014-07-01 08:22:47 [<ffffffffa04f84a8>] __cv_wait_io+0x18/0x20 [spl]
2014-07-01 08:22:47 [<ffffffffa062f0ab>] zio_wait+0xfb/0x1b0 [zfs]
2014-07-01 08:22:47 [<ffffffffa05a503d>] dmu_buf_hold_array_by_dnode+0x19d/0x4c0 [zfs]
2014-07-01 08:22:47 [<ffffffffa05a5e68>] dmu_buf_hold_array_by_bonus+0x68/0x90 [zfs]
2014-07-01 08:22:47 [<ffffffffa0e3f1a3>] osd_bufs_get+0x493/0xb00 [osd_zfs]
2014-07-01 08:22:47 [<ffffffffa03be488>] ? libcfs_log_return+0x28/0x40 [libcfs]
2014-07-01 08:22:47 [<ffffffffa0f2e00b>] ofd_preprw_read+0x15b/0x890 [ofd]
2014-07-01 08:22:47 [<ffffffffa0f30709>] ofd_preprw+0x749/0x1650 [ofd]
2014-07-01 08:22:47 [<ffffffffa09d71b1>] obd_preprw.clone.3+0x121/0x390 [ptlrpc]
2014-07-01 08:22:47 [<ffffffffa09deb03>] tgt_brw_read+0x2d3/0x1150 [ptlrpc]
2014-07-01 08:22:47 [<ffffffffa03be488>] ? libcfs_log_return+0x28/0x40 [libcfs]
2014-07-01 08:22:47 [<ffffffffa097ab36>] ? lustre_pack_reply_v2+0x216/0x280 [ptlrpc]
2014-07-01 08:22:47 [<ffffffffa097ac4e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
2014-07-01 08:22:47 [<ffffffffa09dca7c>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
2014-07-01 08:22:47 [<ffffffffa098c29a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
2014-07-01 08:22:47 [<ffffffffa098b580>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
2014-07-01 08:22:47 [<ffffffff8109ab56>] kthread+0x96/0xa0
2014-07-01 08:22:47 [<ffffffff8100c20a>] child_rip+0xa/0x20
2014-07-01 08:22:47 [<ffffffff8109aac0>] ? kthread+0x0/0xa0
2014-07-01 08:22:47 [<ffffffff8100c200>] ? child_rip+0x0/0x20
Lustre dump attached.
Second example:
2014-07-01 09:38:41 Pid: 9299, comm: ll_ost_io00_070
2014-07-01 09:38:41
2014-07-01 09:38:41 Call Trace:
2014-07-01 09:38:41 [<ffffffffa05b02f7>] ? dmu_zfetch+0x357/0xd70 [zfs]
2014-07-01 09:38:41 [<ffffffffa05957f2>] ? arc_read+0x572/0x8d0 [zfs]
2014-07-01 09:38:41 [<ffffffff810a6d01>] ? ktime_get_ts+0xb1/0xf0
2014-07-01 09:38:41 [<ffffffff815287f3>] io_schedule+0x73/0xc0
2014-07-01 09:38:41 [<ffffffffa04f841c>] cv_wait_common+0x8c/0x100 [spl]
2014-07-01 09:38:41 [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
2014-07-01 09:38:41 [<ffffffffa04f84a8>] __cv_wait_io+0x18/0x20 [spl]
2014-07-01 09:38:41 [<ffffffffa062c0ab>] zio_wait+0xfb/0x1b0 [zfs]
2014-07-01 09:38:41 [<ffffffffa05a203d>] dmu_buf_hold_array_by_dnode+0x19d/0x4c0 [zfs]
2014-07-01 09:38:41 [<ffffffffa05a2e68>] dmu_buf_hold_array_by_bonus+0x68/0x90 [zfs]
2014-07-01 09:38:41 [<ffffffffa0e441a3>] osd_bufs_get+0x493/0xb00 [osd_zfs]
2014-07-01 09:38:41 [<ffffffffa03be488>] ? libcfs_log_return+0x28/0x40 [libcfs]
2014-07-01 09:38:41 [<ffffffffa0f3700b>] ofd_preprw_read+0x15b/0x890 [ofd]
2014-07-01 09:38:41 [<ffffffffa0f39709>] ofd_preprw+0x749/0x1650 [ofd]
2014-07-01 09:38:41 [<ffffffffa09d41b1>] obd_preprw.clone.3+0x121/0x390 [ptlrpc]
2014-07-01 09:38:41 [<ffffffffa09dbb03>] tgt_brw_read+0x2d3/0x1150 [ptlrpc]
2014-07-01 09:38:41 [<ffffffffa03be488>] ? libcfs_log_return+0x28/0x40 [libcfs]
2014-07-01 09:38:41 [<ffffffffa0977b36>] ? lustre_pack_reply_v2+0x216/0x280 [ptlrpc]
2014-07-01 09:38:41 [<ffffffffa0977c4e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
2014-07-01 09:38:41 [<ffffffffa09d9a7c>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
2014-07-01 09:38:41 [<ffffffffa098929a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
2014-07-01 09:38:41 [<ffffffffa0988580>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
2014-07-01 09:38:41 [<ffffffff8109ab56>] kthread+0x96/0xa0
2014-07-01 09:38:41 [<ffffffff8100c20a>] child_rip+0xa/0x20
2014-07-01 09:38:41 [<ffffffff8109aac0>] ? kthread+0x0/0xa0
2014-07-01 09:38:41 [<ffffffff8100c200>] ? child_rip+0x0/0x20
2014-07-01 09:38:41