[LU-3109] ZFS - very slow reads, OST watchdogs Created: 04/Apr/13  Updated: 14/Feb/15  Resolved: 14/Feb/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Cliff White (Inactive) Assignee: WC Triage
Resolution: Duplicate Votes: 0
Labels: performance, prz, zfs
Environment:

Hyperion/LLNL


Attachments: File agb13.lustre.log.gz    
Issue Links:
Duplicate
duplicates LU-5278 ZFS - many OST watchdogs with IOR Resolved
Related
is related to LU-2887 sanity-quota test_12a: slow due to ZF... Resolved
Severity: 3
Rank (Obsolete): 7557

 Description   

Running IOR file-per-process, we observe that one or two of the OSTs have an excessively high load compared to the others (load average of 112 vs. 0.1).
The system log shows a large number of watchdogs. IO is not failing, but rates are very, very slow.
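For reference, the workload is along these lines (an illustrative invocation only, not the exact command line used on Hyperion; process count, sizes, and paths are placeholders):

mpirun -np 128 ior -F -w -r -t 1m -b 4g -o /p/lustre/ior/testfile
# -F = file-per-process; -t/-b = transfer/block size; -o = test file path (placeholder values)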
First watchdog (log attached)

2013-04-04 11:26:05 LNet: Service thread pid 8074 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
2013-04-04 11:26:05 Pid: 8074, comm: ll_ost_io00_018
2013-04-04 11:26:05
2013-04-04 11:26:05 Call Trace:
2013-04-04 11:26:05  [<ffffffffa056cd40>] ? arc_read_nolock+0x530/0x810 [zfs]
2013-04-04 11:26:05  [<ffffffffa04e45ac>] cv_wait_common+0x9c/0x1a0 [spl]
2013-04-04 11:26:05  [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
2013-04-04 11:26:05  [<ffffffffa04e46e3>] __cv_wait+0x13/0x20 [spl]
2013-04-04 11:26:05  [<ffffffffa060633b>] zio_wait+0xeb/0x160 [zfs]
2013-04-04 11:26:05  [<ffffffffa057106d>] dbuf_read+0x3fd/0x720 [zfs]
2013-04-04 11:26:06  [<ffffffffa0572c1b>] dbuf_prefetch+0x10b/0x2b0 [zfs]
2013-04-04 11:26:06  [<ffffffffa0586381>] dmu_zfetch_dofetch+0xf1/0x160 [zfs]
2013-04-04 11:26:06  [<ffffffffa0570280>] ? dbuf_read_done+0x0/0x110 [zfs]
2013-04-04 11:26:06  [<ffffffffa0587211>] dmu_zfetch+0xaa1/0xe40 [zfs]
2013-04-04 11:26:06  [<ffffffffa05710fa>] dbuf_read+0x48a/0x720 [zfs]
2013-04-04 11:26:06  [<ffffffffa0578bc9>] dmu_buf_hold_array_by_dnode+0x179/0x570 [zfs]
2013-04-04 11:26:06  [<ffffffffa0579b28>] dmu_buf_hold_array_by_bonus+0x68/0x90 [zfs]
2013-04-04 11:26:06  [<ffffffffa0d4c95d>] osd_bufs_get+0x49d/0x9a0 [osd_zfs]
2013-04-04 11:26:06  [<ffffffff81270f7c>] ? put_dec+0x10c/0x110
2013-04-04 11:26:06  [<ffffffffa0723736>] ? lu_object_find+0x16/0x20 [obdclass]
2013-04-04 11:26:06  [<ffffffffa0ded49f>] ofd_preprw_read+0x13f/0x7e0 [ofd]
2013-04-04 11:26:06  [<ffffffffa0dedec5>] ofd_preprw+0x385/0x1190 [ofd]
2013-04-04 11:26:06  [<ffffffffa0da739c>] obd_preprw+0x12c/0x3d0 [ost]
2013-04-04 11:26:06  [<ffffffffa0dace80>] ost_brw_read+0xd00/0x12e0 [ost]
2013-04-04 11:26:06  [<ffffffff812739b6>] ? vsnprintf+0x2b6/0x5f0
2013-04-04 11:26:06  [<ffffffffa035127b>] ? cfs_set_ptldebug_header+0x2b/0xc0 [libcfs]
2013-04-04 11:26:06  [<ffffffffa0361bdb>] ? libcfs_debug_vmsg2+0x50b/0xbb0 [libcfs]
2013-04-04 11:26:06  [<ffffffffa08a2f4c>] ? lustre_msg_get_version+0x8c/0x100 [ptlrpc]
2013-04-04 11:26:06  [<ffffffffa08a30a8>] ? lustre_msg_check_version+0xe8/0x100 [ptlrpc]
2013-04-04 11:26:06  [<ffffffffa0db3a63>] ost_handle+0x2b53/0x46f0 [ost]
2013-04-04 11:26:06  [<ffffffffa035e0e4>] ? libcfs_id2str+0x74/0xb0 [libcfs]
2013-04-04 11:26:06  [<ffffffffa08b21ac>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc]
2013-04-04 11:26:06  [<ffffffffa03525de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
2013-04-04 11:26:06  [<ffffffffa08a97e9>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
2013-04-04 11:26:06  [<ffffffff81052223>] ? __wake_up+0x53/0x70
2013-04-04 11:26:06  [<ffffffffa08b36f5>] ptlrpc_main+0xb75/0x1870 [ptlrpc]
2013-04-04 11:26:06  [<ffffffffa08b2b80>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
2013-04-04 11:26:06  [<ffffffff8100c0ca>] child_rip+0xa/0x20
2013-04-04 11:26:06  [<ffffffffa08b2b80>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
2013-04-04 11:26:06  [<ffffffffa08b2b80>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
2013-04-04 11:26:06  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
2013-04-04 11:26:06
2013-04-04 11:26:06 LustreError: dumping log to /tmp/lustre-log.1365099965.8074


 Comments   
Comment by Cliff White (Inactive) [ 04/Apr/13 ]

lustre-log from watchdog

Comment by Keith Mannthey (Inactive) [ 04/Apr/13 ]

A little system state info.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    1.17    0.00    0.00   98.83

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdg               0.00     0.00  505.50    0.00 187472.50     0.00   370.87     6.35   12.53   1.98 100.00

The await is comparable to other systems that are not blocking.

Comment by Alex Zhuravlev [ 04/Apr/13 ]

I have seen this on my local setup a few times (over about a year).

Comment by Keith Mannthey (Inactive) [ 04/Apr/13 ]

With this ZFS setup it is happening fairly often. There is just a ton of waiting in ZFS code.

There are a few servers with the same general workload; one or two of them will be in this very high load-average state.

Is there some zfs data we can dump from sysfs or proc to give a better insight as to what is going on?
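One place to look (a sketch, assuming the SPL/ZFS kstat layout shipped with zfs-0.6.0) would be the kstat counters under /proc/spl/kstat/zfs, plus per-vdev zpool iostat:

cat /proc/spl/kstat/zfs/arcstats      # ARC hits/misses and current vs. target size
cat /proc/spl/kstat/zfs/zfetchstats   # prefetch hits/misses (the stack above is in dmu_zfetch)
zpool iostat -v 10                    # per-vdev bandwidth and IOPS every 10s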

Comment by Keith Mannthey (Inactive) [ 04/Apr/13 ]

Maybe this is a storage thing:

iostat -x 10 output.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    3.20    0.00    0.00   96.80

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdi               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdj               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdk               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdl               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdg               0.00     0.00    5.30 1263.20    51.70 638661.00   503.52     3.47    2.72   0.76  96.39
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-5              0.00  1299.00    5.30 1263.20    51.70 638661.00   503.52     3.83    3.00   0.78  98.87

Next "normal" server

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.01    0.00    4.15    0.00    0.00   95.84

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00    2.50 1548.10     6.50 746937.90   481.71     1.09    0.71   0.18  28.48
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdd               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdf               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdi               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdj               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdk               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdl               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-0              0.00  1445.40    2.50 1548.10     6.50 746937.90   481.71     1.34    0.87   0.19  29.90
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-2              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-3              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-4              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
dm-5              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

The "normal" server seems to be going faster with about 1/3 of the disk utilization (%util) of the slower server.

Comment by Keith Mannthey (Inactive) [ 05/Apr/13 ]

Also, there is a huge number of threads on both the good- and bad-performing servers.

hyperion-agb14@mannthey:ps aux | grep ll_ost_ | wc -l
202
hyperion-agb18@mannthey:ps aux | grep z_ | wc -l
90

Just about 300 threads for one block device and one filesystem?

There are only 16 CPUs and 1 device.
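For reference, the OSS IO thread counts can be read and capped at runtime (a sketch; parameter names assume a 2.4-era OSS, and the value below is illustrative, not a recommendation):

lctl get_param ost.OSS.ost_io.threads_started ost.OSS.ost_io.threads_max
lctl set_param ost.OSS.ost_io.threads_max=64
# or at module load time, e.g. in /etc/modprobe.d/lustre.conf:
#   options ost oss_num_threads=64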

Comment by Keith Mannthey (Inactive) [ 05/Apr/13 ]

The hotspot (high load average) seems to move between servers depending on the config.

"normal"

/proc/fs/lustre/ost/OSS/ost/stats @ 1365123595.929143
Name                      Cur.Count  Cur.Rate   #Events   Unit           last        min          avg        max    stddev
req_waittime              13         0          20685     [usec]          929          9       214.98     712727   8332.40 
req_qdepth                13         0          20685     [reqs]            0          0         0.00          1      0.03 
req_active                13         0          20685     [reqs]           13          1         1.28         11      1.39 
req_timeout               13         0          20685     [sec]           585         45        45.00         45      0.00 
reqbuf_avail              25         1          41400     [bufs]         1575         49        62.70         64      1.43 
ldlm_extent_enqueue       0          0          768       [reqs]            0          1         1.00          1      0.00 
ost_create                0          0          11        [usec]            0        170    958523.00    8438749 2519109.30 
ost_destroy               0          0          512       [usec]            0     356414   1530694.28    7608546 775048.51 
ost_get_info              0          0          1         [usec]            0      19398     19398.00      19398      0.00 
ost_connect               0          0          125       [usec]            0        415       498.31       1069     94.21 
obd_ping                  13         0          19268     [usec]          579          7        42.13        207     15.58 

bad

/proc/fs/lustre/ost/OSS/ost/stats @ 1365123593.682607
Name                      Cur.Count  Cur.Rate   #Events   Unit           last        min          avg        max    stddev
req_waittime              13         0          20687     [usec]         1862         24       617.66    3054848  34127.66 
req_qdepth                13         0          20687     [reqs]            0          0         0.00          1      0.04 
req_active                13         0          20687     [reqs]           13          1         1.29         12      1.38 
req_timeout               13         0          20687     [sec]           585         45        45.00         45      0.00 
reqbuf_avail              28         1          45215     [bufs]         1764         51        62.67         64      1.45 
ldlm_extent_enqueue       0          0          768       [reqs]            0          1         1.00          1      0.00 
ost_create                0          0          11        [usec]            0         65    150352.36     575236 201593.16 
ost_destroy               0          0          512       [usec]            0     272486   1592186.24    4242310 735618.42 
ost_get_info              0          0          1         [usec]            0      17611     17611.00      17611      0.00 
ost_connect               0          0          125       [usec]            0        412       484.32       1126     73.94 
obd_ping                  13         0          19270     [usec]         2143         50       158.49       2066     26.96 

Comment by Cliff White (Inactive) [ 05/Apr/13 ]

Alex, this is consistently reproducible on Hyperion - what data should we be looking at?

Comment by Cliff White (Inactive) [ 05/Apr/13 ]

We are using zfs-0.6.0-rc10

Comment by Keith Mannthey (Inactive) [ 05/Apr/13 ]

"normal"

/proc/fs/lustre/obdfilter/lustre-OST0001/stats @ 1365124587.752293
Name                      Cur.Count  Cur.Rate   #Events   Unit           last        min          avg        max
read_bytes                0          0          1625295   [bytes]           0       4096    393794.95    1048576
write_bytes               6568       437        652523    [bytes]  6880755712       4096   1047962.14    1048576
get_info                  73         4          1538      
set_info_async            0          0          3         
connect                   0          0          125       
statfs                    3          0          1226      
create                    0          0          12        
destroy                   34         2          768       
sync                      39         2          768       
preprw                    6568       437        2277818   
commitrw                  6568       437        2277818   
ping                      108        7          23612 

"bad"

/proc/fs/lustre/obdfilter/lustre-OST0003/stats @ 1365124582.616412
Name                      Cur.Count  Cur.Rate   #Events   Unit           last        min          avg        max
read_bytes                0          0          1226305   [bytes]           0       4096    521965.14    1048576
write_bytes               5602       373        653713    [bytes]  5861138432       4096   1042516.96    1048576
get_info                  127        8          1426      
set_info_async            0          0          2         
connect                   0          0          125       
statfs                    3          0          1223      
create                    0          0          12        
destroy                   63         4          712       
sync                      64         4          712       
preprw                    5602       373        1880018   
commitrw                  5602       373        1880018   
ping                      95         6          23601   

Looks like the "normal" server is doing more work.
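For side-by-side comparisons like this, the same counters can be sampled over a fixed window with lctl instead of reading /proc directly (a sketch; assumes these stats files accept "clear", as other Lustre stats files do):

lctl set_param ost.OSS.ost.stats=clear obdfilter.*.stats=clear
sleep 60
lctl get_param ost.OSS.ost.stats obdfilter.*.stats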

Comment by Keith Mannthey (Inactive) [ 05/Apr/13 ]

It seems our ZFS is 0.6.0-rc10, and a 0.6.1 release is available.

Comment by Keith Mannthey (Inactive) [ 05/Apr/13 ]

oprofile didn't just work out of the box. LU-3111 filed.

Comment by Christopher Morrone [ 05/Apr/13 ]

rc10 is old; you should definitely upgrade to the latest rc of 0.6.0. 0.6.1 is a little TOO new, because the packaging has changed there, and Lustre will need a little tweaking to find the new paths and things automatically. You can build by hand by giving the spl and zfs paths, but the latest 0.6.0 rc will just be easier.
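(For reference, "giving the spl and zfs paths" means pointing Lustre's configure at the source trees by hand; a sketch with placeholder paths:)

./configure --with-spl=/usr/src/spl-0.6.0 \
            --with-zfs=/usr/src/zfs-0.6.0
make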

Comment by Andreas Dilger [ 14/Feb/15 ]

This is probably the same as LU-5278, which has patches that improve ZFS read performance significantly.
