Details
- Type: Bug
- Resolution: Duplicate
- Priority: Blocker
- None
- Affects Version/s: Lustre 2.4.0
- Environment: Hyperion/LLNL
- 3
- 7557
Description
Running ior file-per-process, we observe that one or two of the OSTs have an excessive load factor compared to the others (load average of 112 vs. 0.1).
The system log shows a large number of watchdog timeouts. IO is not failing, but rates are very, very slow.
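For context, the workload is a file-per-process IOR run. The exact command line is not recorded in this ticket, so the sketch below is illustrative only; the process count, transfer/block sizes, and output path are assumptions, while -F (file-per-process), -w/-r (write/read), -t (transfer size), -b (block size), and -o (test file) are standard IOR options:

  # hypothetical file-per-process IOR invocation; -F gives each rank its own file
  mpirun -np 64 ior -F -w -r -t 1m -b 4g -o /mnt/lustre/ior-fpp/testfile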
First watchdog (log attached)
2013-04-04 11:26:05 LNet: Service thread pid 8074 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
2013-04-04 11:26:05 Pid: 8074, comm: ll_ost_io00_018
2013-04-04 11:26:05
2013-04-04 11:26:05 Call Trace:
2013-04-04 11:26:05 [<ffffffffa056cd40>] ? arc_read_nolock+0x530/0x810 [zfs]
2013-04-04 11:26:05 [<ffffffffa04e45ac>] cv_wait_common+0x9c/0x1a0 [spl]
2013-04-04 11:26:05 [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
2013-04-04 11:26:05 [<ffffffffa04e46e3>] __cv_wait+0x13/0x20 [spl]
2013-04-04 11:26:05 [<ffffffffa060633b>] zio_wait+0xeb/0x160 [zfs]
2013-04-04 11:26:05 [<ffffffffa057106d>] dbuf_read+0x3fd/0x720 [zfs]
2013-04-04 11:26:06 [<ffffffffa0572c1b>] dbuf_prefetch+0x10b/0x2b0 [zfs]
2013-04-04 11:26:06 [<ffffffffa0586381>] dmu_zfetch_dofetch+0xf1/0x160 [zfs]
2013-04-04 11:26:06 [<ffffffffa0570280>] ? dbuf_read_done+0x0/0x110 [zfs]
2013-04-04 11:26:06 [<ffffffffa0587211>] dmu_zfetch+0xaa1/0xe40 [zfs]
2013-04-04 11:26:06 [<ffffffffa05710fa>] dbuf_read+0x48a/0x720 [zfs]
2013-04-04 11:26:06 [<ffffffffa0578bc9>] dmu_buf_hold_array_by_dnode+0x179/0x570 [zfs]
2013-04-04 11:26:06 [<ffffffffa0579b28>] dmu_buf_hold_array_by_bonus+0x68/0x90 [zfs]
2013-04-04 11:26:06 [<ffffffffa0d4c95d>] osd_bufs_get+0x49d/0x9a0 [osd_zfs]
2013-04-04 11:26:06 [<ffffffff81270f7c>] ? put_dec+0x10c/0x110
2013-04-04 11:26:06 [<ffffffffa0723736>] ? lu_object_find+0x16/0x20 [obdclass]
2013-04-04 11:26:06 [<ffffffffa0ded49f>] ofd_preprw_read+0x13f/0x7e0 [ofd]
2013-04-04 11:26:06 [<ffffffffa0dedec5>] ofd_preprw+0x385/0x1190 [ofd]
2013-04-04 11:26:06 [<ffffffffa0da739c>] obd_preprw+0x12c/0x3d0 [ost]
2013-04-04 11:26:06 [<ffffffffa0dace80>] ost_brw_read+0xd00/0x12e0 [ost]
2013-04-04 11:26:06 [<ffffffff812739b6>] ? vsnprintf+0x2b6/0x5f0
2013-04-04 11:26:06 [<ffffffffa035127b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
2013-04-04 11:26:06 [<ffffffffa0361bdb>] ? libcfs_debug_vmsg2+0x50b/0xbb0 [libcfs]
2013-04-04 11:26:06 [<ffffffffa08a2f4c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
2013-04-04 11:26:06 [<ffffffffa08a30a8>] ? lustre_msg_check_version+0xe8/0x100 [ptlrpc]
2013-04-04 11:26:06 [<ffffffffa0db3a63>] ost_handle+0x2b53/0x46f0 [ost]
2013-04-04 11:26:06 [<ffffffffa035e0e4>] ? libcfs_id2str+0x74/0xb0 [libcfs]
2013-04-04 11:26:06 [<ffffffffa08b21ac>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc]
2013-04-04 11:26:06 [<ffffffffa03525de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
2013-04-04 11:26:06 [<ffffffffa08a97e9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
2013-04-04 11:26:06 [<ffffffff81052223>] ? __wake_up+0x53/0x70
2013-04-04 11:26:06 [<ffffffffa08b36f5>] ptlrpc_main+0xb75/0x1870 [ptlrpc]
2013-04-04 11:26:06 [<ffffffffa08b2b80>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
2013-04-04 11:26:06 [<ffffffff8100c0ca>] child_rip+0xa/0x20
2013-04-04 11:26:06 [<ffffffffa08b2b80>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
2013-04-04 11:26:06 [<ffffffffa08b2b80>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
2013-04-04 11:26:06 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
2013-04-04 11:26:06
2013-04-04 11:26:06 LustreError: dumping log to /tmp/lustre-log.1365099965.8074