Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version: Lustre 2.8.0
    • Affects Version: Lustre 2.6.0
    • Environment: Hyperion/LLNL
    • Severity: 3
    • 14730

    Description

      Running IOR with 100 clients. Performance is terrible; the OSTs are wedging and tripping service-thread watchdogs.
      Example:

      2014-07-01 08:22:47 LNet: Service thread pid 8308 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      2014-07-01 08:22:47 Pid: 8308, comm: ll_ost_io00_014
      2014-07-01 08:22:47
      2014-07-01 08:22:47 Call Trace:
      2014-07-01 08:22:47  [<ffffffffa05b34ba>] ? dmu_zfetch+0x51a/0xd70 [zfs]
      2014-07-01 08:22:47  [<ffffffff810a6d01>] ? ktime_get_ts+0xb1/0xf0
      2014-07-01 08:22:47  [<ffffffff815287f3>] io_schedule+0x73/0xc0
      2014-07-01 08:22:47  [<ffffffffa04f841c>] cv_wait_common+0x8c/0x100 [spl]
      2014-07-01 08:22:47  [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
      2014-07-01 08:22:47  [<ffffffffa04f84a8>] __cv_wait_io+0x18/0x20 [spl]
      2014-07-01 08:22:47  [<ffffffffa062f0ab>] zio_wait+0xfb/0x1b0 [zfs]
      2014-07-01 08:22:47  [<ffffffffa05a503d>] dmu_buf_hold_array_by_dnode+0x19d/0x4c0 [zfs]
      2014-07-01 08:22:47  [<ffffffffa05a5e68>] dmu_buf_hold_array_by_bonus+0x68/0x90 [zfs]
      2014-07-01 08:22:47  [<ffffffffa0e3f1a3>] osd_bufs_get+0x493/0xb00 [osd_zfs]
      2014-07-01 08:22:47  [<ffffffffa03be488>] ? libcfs_log_return+0x28/0x40 [libcfs]
      2014-07-01 08:22:47  [<ffffffffa0f2e00b>] ofd_preprw_read+0x15b/0x890 [ofd]
      2014-07-01 08:22:47  [<ffffffffa0f30709>] ofd_preprw+0x749/0x1650 [ofd]
      2014-07-01 08:22:47  [<ffffffffa09d71b1>] obd_preprw.clone.3+0x121/0x390 [ptlrpc]
      2014-07-01 08:22:47  [<ffffffffa09deb03>] tgt_brw_read+0x2d3/0x1150 [ptlrpc]
      2014-07-01 08:22:47  [<ffffffffa03be488>] ? libcfs_log_return+0x28/0x40 [libcfs]
      2014-07-01 08:22:47  [<ffffffffa097ab36>] ? lustre_pack_reply_v2+0x216/0x280 [ptlrpc]
      2014-07-01 08:22:47  [<ffffffffa097ac4e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
      2014-07-01 08:22:47  [<ffffffffa09dca7c>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
      2014-07-01 08:22:47  [<ffffffffa098c29a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
      2014-07-01 08:22:47  [<ffffffffa098b580>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
      2014-07-01 08:22:47  [<ffffffff8109ab56>] kthread+0x96/0xa0
      2014-07-01 08:22:47  [<ffffffff8100c20a>] child_rip+0xa/0x20
      2014-07-01 08:22:47  [<ffffffff8109aac0>] ? kthread+0x0/0xa0
      2014-07-01 08:22:47  [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      Lustre dump attached.

      Second example:

      2014-07-01 09:38:41 Pid: 9299, comm: ll_ost_io00_070
      2014-07-01 09:38:41
      2014-07-01 09:38:41 Call Trace:
      2014-07-01 09:38:41  [<ffffffffa05b02f7>] ? dmu_zfetch+0x357/0xd70 [zfs]
      2014-07-01 09:38:41  [<ffffffffa05957f2>] ? arc_read+0x572/0x8d0 [zfs]
      2014-07-01 09:38:41  [<ffffffff810a6d01>] ? ktime_get_ts+0xb1/0xf0
      2014-07-01 09:38:41  [<ffffffff815287f3>] io_schedule+0x73/0xc0
      2014-07-01 09:38:41  [<ffffffffa04f841c>] cv_wait_common+0x8c/0x100 [spl]
      2014-07-01 09:38:41  [<ffffffff8109af00>] ? autoremove_wake_function+0x0/0x40
      2014-07-01 09:38:41  [<ffffffffa04f84a8>] __cv_wait_io+0x18/0x20 [spl]
      2014-07-01 09:38:41  [<ffffffffa062c0ab>] zio_wait+0xfb/0x1b0 [zfs]
      2014-07-01 09:38:41  [<ffffffffa05a203d>] dmu_buf_hold_array_by_dnode+0x19d/0x4c0 [zfs]
      2014-07-01 09:38:41  [<ffffffffa05a2e68>] dmu_buf_hold_array_by_bonus+0x68/0x90 [zfs]
      2014-07-01 09:38:41  [<ffffffffa0e441a3>] osd_bufs_get+0x493/0xb00 [osd_zfs]
      2014-07-01 09:38:41  [<ffffffffa03be488>] ? libcfs_log_return+0x28/0x40 [libcfs]
      2014-07-01 09:38:41  [<ffffffffa0f3700b>] ofd_preprw_read+0x15b/0x890 [ofd]
      2014-07-01 09:38:41  [<ffffffffa0f39709>] ofd_preprw+0x749/0x1650 [ofd]
      2014-07-01 09:38:41  [<ffffffffa09d41b1>] obd_preprw.clone.3+0x121/0x390 [ptlrpc]
      2014-07-01 09:38:41  [<ffffffffa09dbb03>] tgt_brw_read+0x2d3/0x1150 [ptlrpc]
      2014-07-01 09:38:41  [<ffffffffa03be488>] ? libcfs_log_return+0x28/0x40 [libcfs]
      2014-07-01 09:38:41  [<ffffffffa0977b36>] ? lustre_pack_reply_v2+0x216/0x280 [ptlrpc]
      2014-07-01 09:38:41  [<ffffffffa0977c4e>] ? lustre_pack_reply_flags+0xae/0x1f0 [ptlrpc]
      2014-07-01 09:38:41  [<ffffffffa09d9a7c>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
      2014-07-01 09:38:41  [<ffffffffa098929a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
      2014-07-01 09:38:41  [<ffffffffa0988580>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
      2014-07-01 09:38:41  [<ffffffff8109ab56>] kthread+0x96/0xa0
      2014-07-01 09:38:41  [<ffffffff8100c20a>] child_rip+0xa/0x20
      2014-07-01 09:38:41  [<ffffffff8109aac0>] ? kthread+0x0/0xa0
      2014-07-01 09:38:41  [<ffffffff8100c200>] ? child_rip+0x0/0x20
      2014-07-01 09:38:41
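
      For reference, a 100-client, file-per-process IOR run is typically launched along the following lines. This is only an illustrative sketch: the host file, block/transfer sizes and mount point are assumptions, not the actual Hyperion invocation.

      # hypothetical IOR invocation: 100 ranks, POSIX API, file per process
      mpirun -np 100 --hostfile clients.txt \
          ior -a POSIX -F -w -r -e \
              -b 4g -t 1m \
              -o /mnt/lustre/ior_test/ior.out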
      

      Attachments

        1. ior.iws28.txt.gz
          0.2 kB
        2. iws24.dump.txt.gz
          0.2 kB
        3. iws28.dump.txt.gz
          0.2 kB
        4. Hyperion Performance 17 Nov 2014.xlsx
          132 kB
        5. lustre-log.1429199475.64826.txt.gz
          0.3 kB
        6. proc_spl.tgz
          3.99 MB
        7. proc_spl_MDS.tgz
          3.89 MB

        Issue Links

          Activity

            [LU-5278] ZFS - many OST watchdogs with IOR
            pjones Peter Jones added a comment -

            Landed for 2.8


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13612/
            Subject: LU-5278 echo: request pages in batches
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 89021de564c27f38a4146357e58dd80ddf68e246


            bzzz Alex Zhuravlev added a comment -

            The following isn't exactly the same, but looks very similar:

            13:06:17:INFO: task txg_sync:16276 blocked for more than 120 seconds.
            13:06:17: Tainted: P --------------- 2.6.32-504.12.2.el6_lustre.g036b949.x86_64 #1
            13:06:17:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            13:06:17:txg_sync D 0000000000000001 0 16276 2 0x00000080
            13:06:17: ffff88006de4b890 0000000000000046 ffff88006de4b820 ffffffff81041e98
            13:06:17: 00000000ffffffff 000007051e05132a 0000000000000000 ffff88007918e980
            13:06:17: 00000000002301ae ffffffffaad2f4da ffff88006fc5bab8 ffff88006de4bfd8
            13:06:17:Call Trace:
            13:06:17: [<ffffffff81041e98>] ? pvclock_clocksource_read+0x58/0xd0
            13:06:17: [<ffffffff810aaa21>] ? ktime_get_ts+0xb1/0xf0
            13:06:17: [<ffffffff8152aad3>] io_schedule+0x73/0xc0
            13:06:17: [<ffffffffa0145596>] cv_wait_common+0xa6/0x120 [spl]
            13:06:17: [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
            13:06:17: [<ffffffffa0145628>] __cv_wait_io+0x18/0x20 [spl]
            13:08:18: [<ffffffffa028f81b>] zio_wait+0xfb/0x1c0 [zfs]
            13:08:18: [<ffffffffa029134b>] zio_free+0xab/0xe0 [zfs]
            13:08:18: [<ffffffffa02279a1>] dsl_free+0x11/0x20 [zfs]
            13:08:18: [<ffffffffa021b102>] dsl_dataset_block_kill+0x352/0x380 [zfs]
            13:08:18: [<ffffffffa0214bee>] free_blocks+0x6e/0xb0 [zfs]
            13:08:18: [<ffffffffa0215838>] dnode_sync+0x4c8/0xac0 [zfs]
            13:08:18: [<ffffffffa01fb3fb>] ? dbuf_sync_list+0x7b/0x80 [zfs]
            13:08:18: [<ffffffffa01f034a>] ? arc_write+0xea/0x100 [zfs]
            13:08:18: [<ffffffffa0204e49>] dmu_objset_sync_dnodes+0x89/0xb0 [zfs]
            13:08:18: [<ffffffffa020503a>] dmu_objset_sync+0x1ca/0x2d0 [zfs]
            13:08:18: [<ffffffffa02040c0>] ? dmu_objset_write_ready+0x0/0x50 [zfs]
            13:08:18: [<ffffffffa0205140>] ? dmu_objset_write_done+0x0/0x70 [zfs]
            13:08:18: [<ffffffffa0222b8b>] dsl_pool_sync+0x2ab/0x3f0 [zfs]
            13:08:18: [<ffffffffa023b8bf>] spa_sync+0x40f/0xa70 [zfs]
            13:08:18: [<ffffffffa0245771>] ? spa_txg_history_set+0xc1/0xf0 [zfs]
            13:08:18: [<ffffffffa0248c7d>] txg_sync_thread+0x30d/0x520 [zfs]
            13:08:18: [<ffffffff8105c2f9>] ? set_user_nice+0xc9/0x130
            13:08:18: [<ffffffffa0248970>] ? txg_sync_thread+0x0/0x520 [zfs]

            https://testing.hpdd.intel.com/test_logs/c128d706-e305-11e4-a348-5254006e85c2/show_text


            cliffw Cliff White (Inactive) added a comment -

            /proc/spl from the MDS


            cliffw Cliff White (Inactive) added a comment -

            Lustre-log dumped by watchdog, all files under /proc/spl on OST


            cliffw Cliff White (Inactive) added a comment -

            Watchdogs continue with prefetch disabled.

            
            

            LNet: Service thread pid 64826 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
            Pid: 64826, comm: ll_ost03_025

            Call Trace:
            [<ffffffffa054f790>] ? vdev_mirror_child_done+0x0/0x30 [zfs]
            [<ffffffff8152acee>] ? mutex_lock+0x1e/0x50
            [<ffffffff8152acee>] ? mutex_lock+0x1e/0x50
            [<ffffffff81529e83>] io_schedule+0x73/0xc0
            [<ffffffffa044b596>] cv_wait_common+0xa6/0x120 [spl]
            [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
            [<ffffffffa044b628>] __cv_wait_io+0x18/0x20 [spl]
            [<ffffffffa058c81b>] zio_wait+0xfb/0x1c0 [zfs]
            [<ffffffffa04f673a>] dbuf_read+0x47a/0x7f0 [zfs]
            [<ffffffffa04fed98>] dmu_buf_hold+0x108/0x1d0 [zfs]
            [<ffffffffa0555ab2>] zap_get_leaf_byblk+0x52/0x300 [zfs]
            [<ffffffffa0554584>] ? zap_idx_to_blk+0xe4/0x150 [zfs]
            [<ffffffffa0555dca>] zap_deref_leaf+0x6a/0x80 [zfs]
            [<ffffffffa0556430>] fzap_lookup+0x60/0x120 [zfs]
            [<ffffffffa05598f8>] ? zap_name_alloc+0x88/0xf0 [zfs]
            [<ffffffffa055ba21>] zap_lookup_norm+0xe1/0x190 [zfs]
            [<ffffffffa055bb63>] zap_lookup+0x33/0x40 [zfs]
            [<ffffffffa108afa5>] osd_fid_lookup+0xb5/0x2f0 [osd_zfs]
            [<ffffffffa1084a1c>] osd_object_init+0x19c/0x6c0 [osd_zfs]
            [<ffffffffa03bb798>] ? libcfs_log_return+0x28/0x40 [libcfs]
            [<ffffffffa0fac9d9>] ? ofd_object_init+0x99/0x180 [ofd]
            [<ffffffffa07c6318>] lu_object_alloc+0xd8/0x320 [obdclass]
            [<ffffffffa07c7821>] lu_object_find_try+0x151/0x260 [obdclass]
            [<ffffffffa07c79e1>] lu_object_find_at+0xb1/0xe0 [obdclass]
            [<ffffffffa03bf161>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
            [<ffffffffa07c7a26>] lu_object_find+0x16/0x20 [obdclass]
            [<ffffffffa0fc3215>] ofd_object_find+0x35/0xf0 [ofd]
            [<ffffffffa0fc5b0b>] ofd_precreate_objects+0x1fb/0x19e0 [ofd]
            [<ffffffffa03bf161>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
            [<ffffffffa0fd2928>] ? ofd_grant_create+0x2b8/0x450 [ofd]
            [<ffffffffa0fb6ca6>] ofd_create_hdl+0x566/0x25c0 [ofd]
            [<ffffffffa09e78c0>] ? lustre_pack_reply_v2+0x220/0x280 [ptlrpc]
            [<ffffffffa0a4946e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
            [<ffffffffa09f8e61>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
            [<ffffffffa09f8020>] ? ptlrpc_main+0x0/0x1960 [ptlrpc]
            [<ffffffff8109abf6>] kthread+0x96/0xa0
            [<ffffffff8100c20a>] child_rip+0xa/0x20
            [<ffffffff8109ab60>] ? kthread+0x0/0xa0
            [<ffffffff8100c200>] ? child_rip+0x0/0x20

            LustreError: dumping log to /tmp/lustre-log.1429199475.64826

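            For context, "prefetch disabled" presumably means the standard ZFS file-level prefetch tunable was turned off on the OSS; a minimal sketch of how that is usually done (the modprobe.d path is an assumption):

            # disable ZFS file-level prefetch at runtime on the OSS
            echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable

            # or persistently across module reloads (path is an assumption)
            echo "options zfs zfs_prefetch_disable=1" >> /etc/modprobe.d/zfs.conf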

            cliffw Cliff White (Inactive) added a comment -

            I am a bit confused by all the network tuning comments. Are there patches available that have not landed in 2.7.52?


            rpwagner Rick Wagner (Inactive) added a comment -

            Andreas & Gabriele, I have moved my network tuning questions over to LU-6228.


            gabriele.paciucci Gabriele Paciucci (Inactive) added a comment -

            If you are using Ethernet, you should also tune sysctl.conf. Please refer to your Ethernet vendor's recommendations. This Mellanox guide is a good starting point, but the same tuning applies to adapters from other vendors:
            http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf

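            As an illustration only, the kind of sysctl settings such vendor guides recommend for 10/40GbE looks roughly like the following; the values are generic examples, not tested recommendations for this system:

            # /etc/sysctl.conf - illustrative high-speed Ethernet tuning (example values)
            net.core.rmem_max = 16777216
            net.core.wmem_max = 16777216
            net.ipv4.tcp_rmem = 4096 87380 16777216
            net.ipv4.tcp_wmem = 4096 65536 16777216
            net.core.netdev_max_backlog = 250000
            # apply with: sysctl -p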

            rpwagner Rick Wagner (Inactive) added a comment -

            Gabriele, thanks. There are negative numbers in /proc/sys/lnet/peers, and even bumping up the credits on the server gave a 10% or so improvement. I'll have to shift to another set of clients to test both sides, since I'm using production system nodes as clients and can't reload the kernel modules. This would help explain the remaining bottleneck.

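            For anyone following along, the check and the credit bump described above typically look something like the sketch below; the credit values are assumptions, not the settings used here, and for TCP networks the equivalent ksocklnd parameters would be adjusted instead:

            # look for negative values in the "min" credit columns
            cat /proc/sys/lnet/peers

            # raise credits for the o2ib LND in /etc/modprobe.d/lustre.conf,
            # then unload/reload the Lustre modules (example values only)
            options ko2iblnd peer_credits=128 credits=1024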

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: cliffw Cliff White (Inactive)
              Votes: 0
              Watchers: 15

              Dates

                Created:
                Updated:
                Resolved: