[LU-3109] ZFS - very slow reads, OST watchdogs | Created: 04/Apr/13 | Updated: 14/Feb/15 | Resolved: 14/Feb/15 |
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Cliff White (Inactive) | Assignee: | WC Triage |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | performance, prz, zfs |
| Environment: | Hyperion/LLNL |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 7557 |
| Description |
|
Running ior file-per-process, we observe that one or two of the OSTs have a very excessive load factor compared to the others (load average of 112 vs. 0.1).

2013-04-04 11:26:05 LNet: Service thread pid 8074 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
2013-04-04 11:26:05 Pid: 8074, comm: ll_ost_io00_018
2013-04-04 11:26:05
2013-04-04 11:26:05 Call Trace:
2013-04-04 11:26:05 [<ffffffffa056cd40>] ? arc_read_nolock+0x530/0x810 [zfs]
2013-04-04 11:26:05 [<ffffffffa04e45ac>] cv_wait_common+0x9c/0x1a0 [spl]
2013-04-04 11:26:05 [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
2013-04-04 11:26:05 [<ffffffffa04e46e3>] __cv_wait+0x13/0x20 [spl]
2013-04-04 11:26:05 [<ffffffffa060633b>] zio_wait+0xeb/0x160 [zfs]
2013-04-04 11:26:05 [<ffffffffa057106d>] dbuf_read+0x3fd/0x720 [zfs]
2013-04-04 11:26:06 [<ffffffffa0572c1b>] dbuf_prefetch+0x10b/0x2b0 [zfs]
2013-04-04 11:26:06 [<ffffffffa0586381>] dmu_zfetch_dofetch+0xf1/0x160 [zfs]
2013-04-04 11:26:06 [<ffffffffa0570280>] ? dbuf_read_done+0x0/0x110 [zfs]
2013-04-04 11:26:06 [<ffffffffa0587211>] dmu_zfetch+0xaa1/0xe40 [zfs]
2013-04-04 11:26:06 [<ffffffffa05710fa>] dbuf_read+0x48a/0x720 [zfs]
2013-04-04 11:26:06 [<ffffffffa0578bc9>] dmu_buf_hold_array_by_dnode+0x179/0x570 [zfs]
2013-04-04 11:26:06 [<ffffffffa0579b28>] dmu_buf_hold_array_by_bonus+0x68/0x90 [zfs]
2013-04-04 11:26:06 [<ffffffffa0d4c95d>] osd_bufs_get+0x49d/0x9a0 [osd_zfs]
2013-04-04 11:26:06 [<ffffffff81270f7c>] ? put_dec+0x10c/0x110
2013-04-04 11:26:06 [<ffffffffa0723736>] ? lu_object_find+0x16/0x20 [obdclass]
2013-04-04 11:26:06 [<ffffffffa0ded49f>] ofd_preprw_read+0x13f/0x7e0 [ofd]
2013-04-04 11:26:06 [<ffffffffa0dedec5>] ofd_preprw+0x385/0x1190 [ofd]
2013-04-04 11:26:06 [<ffffffffa0da739c>] obd_preprw+0x12c/0x3d0 [ost]
2013-04-04 11:26:06 [<ffffffffa0dace80>] ost_brw_read+0xd00/0x12e0 [ost]
2013-04-04 11:26:06 [<ffffffff812739b6>] ? vsnprintf+0x2b6/0x5f0
2013-04-04 11:26:06 [<ffffffffa035127b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
2013-04-04 11:26:06 [<ffffffffa0361bdb>] ? libcfs_debug_vmsg2+0x50b/0xbb0 [libcfs]
2013-04-04 11:26:06 [<ffffffffa08a2f4c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
2013-04-04 11:26:06 [<ffffffffa08a30a8>] ? lustre_msg_check_version+0xe8/0x100 [ptlrpc]
2013-04-04 11:26:06 [<ffffffffa0db3a63>] ost_handle+0x2b53/0x46f0 [ost]
2013-04-04 11:26:06 [<ffffffffa035e0e4>] ? libcfs_id2str+0x74/0xb0 [libcfs]
2013-04-04 11:26:06 [<ffffffffa08b21ac>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc]
2013-04-04 11:26:06 [<ffffffffa03525de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
2013-04-04 11:26:06 [<ffffffffa08a97e9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
2013-04-04 11:26:06 [<ffffffff81052223>] ? __wake_up+0x53/0x70
2013-04-04 11:26:06 [<ffffffffa08b36f5>] ptlrpc_main+0xb75/0x1870 [ptlrpc]
2013-04-04 11:26:06 [<ffffffffa08b2b80>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
2013-04-04 11:26:06 [<ffffffff8100c0ca>] child_rip+0xa/0x20
2013-04-04 11:26:06 [<ffffffffa08b2b80>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
2013-04-04 11:26:06 [<ffffffffa08b2b80>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
2013-04-04 11:26:06 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
2013-04-04 11:26:06
2013-04-04 11:26:06 LustreError: dumping log to /tmp/lustre-log.1365099965.8074 |
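The trace above is blocked in zio_wait() underneath dbuf_prefetch()/dmu_zfetch(), i.e. the OST I/O service thread is waiting synchronously on prefetch I/O. As a minimal sketch (assuming the stock ZFS-on-Linux kstat layout and module parameters; this ticket does not confirm these steps were run), the prefetcher could be inspected and temporarily disabled for a test:

# DMU prefetch (zfetch) hit/miss counters
cat /proc/spl/kstat/zfs/zfetchstats

# ARC hit/miss behaviour while the slow reads are in flight
grep -E '^(hits|misses)' /proc/spl/kstat/zfs/arcstats

# Temporarily disable file-level prefetch for a test run
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable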
| Comments |
| Comment by Cliff White (Inactive) [ 04/Apr/13 ] |
|
lustre-log from watchdog |
| Comment by Keith Mannthey (Inactive) [ 04/Apr/13 ] |
|
A little system state info:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
0.00 0.00 1.17 0.00 0.00 98.83
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdg 0.00 0.00 505.50 0.00 187472.50 0.00 370.87 6.35 12.53 1.98 100.00
The await is comparable to other systems that are not blocking. |
| Comment by Alex Zhuravlev [ 04/Apr/13 ] |
|
I have seen this on a local setup a few times (over about a year). |
| Comment by Keith Mannthey (Inactive) [ 04/Apr/13 ] |
|
With this zfs setup it is happening fairly regularly. There is just a ton of waiting in zfs code. There are a few servers with the same general workload, and one or two of them will be in this very high load average state. Is there some zfs data we can dump from sysfs or proc to give better insight into what is going on? |
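As a sketch of what is exported (assuming these kstats exist in this spl/zfs release), the modules expose counters under /proc/spl/ that could be sampled on a slow server and a normal one and compared:

# Everything the zfs module exports
ls /proc/spl/kstat/zfs/

# ARC size, hit/miss, and memory-throttle counters
cat /proc/spl/kstat/zfs/arcstats

# SPL slab allocator usage
cat /proc/spl/kmem/slab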
| Comment by Keith Mannthey (Inactive) [ 04/Apr/13 ] |
|
Maybe this is a storage thing. iostat -x 10 output:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
0.00 0.00 3.20 0.00 0.00 96.80
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdi 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdj 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdk 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdl 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 5.30 1263.20 51.70 638661.00 503.52 3.47 2.72 0.76 96.39
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-5 0.00 1299.00 5.30 1263.20 51.70 638661.00 503.52 3.83 3.00 0.78 98.87
Next "normal" server
avg-cpu: %user %nice %system %iowait %steal %idle
0.01 0.00 4.15 0.00 0.00 95.84
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 2.50 1548.10 6.50 746937.90 481.71 1.09 0.71 0.18 28.48
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdd 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sde 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdf 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdh 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdg 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdi 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdj 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdk 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdl 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 1445.40 2.50 1548.10 6.50 746937.90 481.71 1.34 0.87 0.19 29.90
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
The "normal" server seems to be going faster with roughly 1/3 of the device utilization of the slower server. |
| Comment by Keith Mannthey (Inactive) [ 05/Apr/13 ] |
|
Also, there is a huge number of threads on both the good and the bad performing servers.

hyperion-agb14@mannthey: ps aux | grep ll_ost_ | wc -l

Just about 300 threads for one block device and one filesystem? There are only 16 CPUs and 1 device. |
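If the thread count itself is suspect, a rough sketch of how it could be checked and capped (the value 64 is purely illustrative, not a recommendation from this ticket):

# Current OSS I/O service thread counts and limits
lctl get_param ost.OSS.ost_io.threads_started ost.OSS.ost_io.threads_max

# Cap at runtime; already-running threads are not stopped immediately
lctl set_param ost.OSS.ost_io.threads_max=64

# Or pin the count at module load time, e.g. in /etc/modprobe.d/lustre.conf
options ost oss_num_threads=64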
| Comment by Keith Mannthey (Inactive) [ 05/Apr/13 ] |
|
The hotspot (high load avg) seems to move between servers depending on config.

"normal" /proc/fs/lustre/ost/OSS/ost/stats @ 1365123595.929143
Name                 Cur.Count Cur.Rate #Events Unit   last      min       avg         max       stddev
req_waittime         13        0        20685   [usec] 929       9         214.98      712727    8332.40
req_qdepth           13        0        20685   [reqs] 0         0         0.00        1         0.03
req_active           13        0        20685   [reqs] 13        1         1.28        11        1.39
req_timeout          13        0        20685   [sec]  585       45        45.00       45        0.00
reqbuf_avail         25        1        41400   [bufs] 1575      49        62.70       64        1.43
ldlm_extent_enqueue  0         0        768     [reqs] 0         1         1.00        1         0.00
ost_create           0         0        11      [usec] 0         170       958523.00   8438749   2519109.30
ost_destroy          0         0        512     [usec] 0         356414    1530694.28  7608546   775048.51
ost_get_info         0         0        1       [usec] 0         19398     19398.00    19398     0.00
ost_connect          0         0        125     [usec] 0         415       498.31      1069      94.21
obd_ping             13        0        19268   [usec] 579       7         42.13       207       15.58

"bad" /proc/fs/lustre/ost/OSS/ost/stats @ 1365123593.682607
Name                 Cur.Count Cur.Rate #Events Unit   last      min       avg         max       stddev
req_waittime         13        0        20687   [usec] 1862      24        617.66      3054848   34127.66
req_qdepth           13        0        20687   [reqs] 0         0         0.00        1         0.04
req_active           13        0        20687   [reqs] 13        1         1.29        12        1.38
req_timeout          13        0        20687   [sec]  585       45        45.00       45        0.00
reqbuf_avail         28        1        45215   [bufs] 1764      51        62.67       64        1.45
ldlm_extent_enqueue  0         0        768     [reqs] 0         1         1.00        1         0.00
ost_create           0         0        11      [usec] 0         65        150352.36   575236    201593.16
ost_destroy          0         0        512     [usec] 0         272486    1592186.24  4242310   735618.42
ost_get_info         0         0        1       [usec] 0         17611     17611.00    17611     0.00
ost_connect          0         0        125     [usec] 0         412       484.32      1126      73.94
obd_ping             13        0        19270   [usec] 2143      50        158.49      2066      26.96 |
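These one-shot dumps mix lifetime totals with instantaneous rates; a sketch for comparing the two servers on per-interval deltas instead (assuming the llstat utility from lustre-utils is available):

# Reset the counters, then print deltas every 10 seconds
echo clear > /proc/fs/lustre/ost/OSS/ost/stats
llstat -i 10 /proc/fs/lustre/ost/OSS/ost/stats

# The bulk I/O service keeps its own stats file
llstat -i 10 /proc/fs/lustre/ost/OSS/ost_io/stats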
| Comment by Cliff White (Inactive) [ 05/Apr/13 ] |
|
Alex, this is consistently reproducible on Hyperion - what data should we be looking at? |
| Comment by Cliff White (Inactive) [ 05/Apr/13 ] |
|
We are using zfs-0.6.0-rc10 |
| Comment by Keith Mannthey (Inactive) [ 05/Apr/13 ] |
|
"normal" /proc/fs/lustre/obdfilter/lustre-OST0001/stats @ 1365124587.752293 Name Cur.Count Cur.Rate #Events Unit last min avg max read_bytes 0 0 1625295 [bytes] 0 4096 393794.95 1048576 write_bytes 6568 437 652523 [bytes] 6880755712 4096 1047962.14 1048576 get_info 73 4 1538 set_info_async 0 0 3 connect 0 0 125 statfs 3 0 1226 create 0 0 12 destroy 34 2 768 sync 39 2 768 preprw 6568 437 2277818 commitrw 6568 437 2277818 ping 108 7 23612 "bad" /proc/fs/lustre/obdfilter/lustre-OST0003/stats @ 1365124582.616412 Name Cur.Count Cur.Rate #Events Unit last min avg max read_bytes 0 0 1226305 [bytes] 0 4096 521965.14 1048576 write_bytes 5602 373 653713 [bytes] 5861138432 4096 1042516.96 1048576 get_info 127 8 1426 set_info_async 0 0 2 connect 0 0 125 statfs 3 0 1223 create 0 0 12 destroy 63 4 712 sync 64 4 712 preprw 5602 373 1880018 commitrw 5602 373 1880018 ping 95 6 23601 Looks like a "normal" server is doing more work. |
| Comment by Keith Mannthey (Inactive) [ 05/Apr/13 ] |
|
It seems our zfs is 0.6.0-rc10, and there is a 0.6.1 available. |
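For reference, a quick sketch of confirming which spl/zfs builds are actually loaded (assuming the modules carry version tags and were installed via dkms; adjust to the local packaging):

cat /sys/module/zfs/version /sys/module/spl/version
dkms status | grep -E 'spl|zfs'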
| Comment by Keith Mannthey (Inactive) [ 05/Apr/13 ] |
|
oprofile didn't work out of the box. |
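If oprofile stays uncooperative, a minimal alternative sketch using perf (assuming the distro's perf package is installed; these are standard perf invocations, not steps taken in this ticket):

# Sample the whole system with call graphs for 30 seconds, then summarize
perf record -a -g -- sleep 30
perf report --sort comm,dso,symbol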
| Comment by Christopher Morrone [ 05/Apr/13 ] |
|
rc10 is old; you should definitely upgrade to the latest 0.6.0 rc. 0.6.1 is a little TOO new, because the packaging has changed there, and Lustre will need a little tweaking to find the new paths and things automatically. You can build by hand by giving the spl and zfs paths, but the latest 0.6.0 rc will just be easier. |
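For the build-by-hand route mentioned above, a rough sketch of pointing Lustre's configure at explicit spl/zfs source trees (the directory names below are placeholders, not the paths used on Hyperion):

cd lustre-release
sh autogen.sh
./configure --with-linux=/usr/src/kernels/$(uname -r) \
            --with-spl=/usr/src/spl-0.6.0 \
            --with-zfs=/usr/src/zfs-0.6.0
make rpms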
| Comment by Andreas Dilger [ 14/Feb/15 ] |
|
This is probably the same as |