[LU-2887] sanity-quota test_12a: slow due to ZFS VMs sharing single disk Created: 27/Feb/13  Updated: 29/May/17  Resolved: 03/Dec/16

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.4.1
Fix Version/s: Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.1

Type: Bug Priority: Minor
Reporter: Maloo Assignee: Alex Zhuravlev
Resolution: Fixed Votes: 0
Labels: MB, performance, zfs

Attachments: Text File txgs.1380193340.log     Text File txgs.1380193506.log     Text File txgs.1380194292.log    
Issue Links:
Blocker
is blocked by LU-4009 Add ZIL support to osd-zfs Open
Duplicate
is duplicated by LU-4108 Failure on test suite performance-san... Resolved
Related
is related to LU-2176 ZFS: running racer grounds everything... Resolved
is related to LU-2872 Test timeout failure on test suite sa... Resolved
is related to LU-3089 Test failure recovery-small test_55: ... Resolved
is related to LU-3225 Timeout failure on test suite sanity-... Resolved
is related to LU-2085 sanityn test_16 (fsx) ran over its Au... Closed
is related to LU-1960 Test failure on test suite sanity-ben... Closed
is related to LU-2829 Timeout on sanityn test_33a: zfs slow... Resolved
is related to LU-2874 Test timeout failure on test suite re... Resolved
is related to LU-3109 ZFS - very slow reads, OST watchdogs Resolved
is related to LU-2836 Test failure on test suite sanity-quo... Resolved
is related to LU-5148 OSTs won't mount following upgrade to... Resolved
is related to LU-2891 Test failure on test suite sanity-quo... Resolved
is related to LU-2955 replay-ost-single 8b: Hung until Auto... Resolved
is related to LU-4072 sanity, subtest test_24v takes a VERY... Resolved
is related to LU-4444 conf-sanity test_69: ZFS took too lon... Resolved
is related to LU-2124 Test failure on test suite obdfilter-... Closed
is related to LU-4950 sanity-benchmark test fsx hung: txg_s... Closed
Severity: 3
Rank (Obsolete): 6968

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/c3a0b364-812d-11e2-b609-52540035b04c.

The sub-test test_12a failed with the following error:

test failed to respond and timed out

Info required for matching: sanity-quota 12a

Looking through test 12a, things seem to have hung up on the runas dd (with oflag=sync) at the end of the test.

OST has threads that are blocked on disk I/O (oss dmesg):

txg_sync      D 0000000000000000     0 24236      2 0x00000080
 ffff88005027dbc0 0000000000000046 ffff88004b906ec0 0000000000000086
 ffff88005027db70 ffff88007c7d4408 0000000000000001 ffff88007c7d4420
 ffff880052101058 ffff88005027dfd8 000000000000fb88 ffff880052101058
Call Trace:
 [<ffffffff81090b9e>] ? prepare_to_wait_exclusive+0x4e/0x80
 [<ffffffffa016b5ac>] cv_wait_common+0x9c/0x1a0 [spl]
 [<ffffffffa02d5160>] ? zio_execute+0x0/0xf0 [zfs]
 [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa016b6e3>] __cv_wait+0x13/0x20 [spl]
 [<ffffffffa02d533b>] zio_wait+0xeb/0x160 [zfs]
 [<ffffffffa026b807>] dsl_pool_sync+0x2a7/0x480 [zfs]
 [<ffffffffa027e147>] spa_sync+0x397/0x9a0 [zfs]
 [<ffffffffa028fd41>] txg_sync_thread+0x2c1/0x490 [zfs]
 [<ffffffff810527f9>] ? set_user_nice+0xc9/0x130
 [<ffffffffa028fa80>] ? txg_sync_thread+0x0/0x490 [zfs]
 [<ffffffffa0164668>] thread_generic_wrapper+0x68/0x80 [spl]
 [<ffffffffa0164600>] ? thread_generic_wrapper+0x0/0x80 [spl]
 [<ffffffff81090626>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff81090590>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
ll_ost_io00_0 D 0000000000000000     0 18170      2 0x00000080
 ffff8800427b9820 0000000000000046 0000000000000046 0000000000000001
 ffff8800427b98b0 0000000000000086 ffff8800427b97e0 ffff88005027dd60
 ffff8800427b7ab8 ffff8800427b9fd8 000000000000fb88 ffff8800427b7ab8
Call Trace:
 [<ffffffff81090b9e>] ? prepare_to_wait_exclusive+0x4e/0x80
 [<ffffffffa016b5ac>] cv_wait_common+0x9c/0x1a0 [spl]
 [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa016b6e3>] __cv_wait+0x13/0x20 [spl]
 [<ffffffffa028f573>] txg_wait_synced+0xb3/0x190 [zfs]
 [<ffffffffa0c71015>] osd_trans_stop+0x365/0x420 [osd_zfs]
 [<ffffffffa0cb9062>] ofd_trans_stop+0x22/0x60 [ofd]
 [<ffffffffa0cbdf06>] ofd_commitrw_write+0x406/0x11b0 [ofd]
 [<ffffffffa0cbf13d>] ofd_commitrw+0x48d/0x920 [ofd]
 [<ffffffffa085b708>] obd_commitrw+0x128/0x3d0 [ost]
 [<ffffffffa0862599>] ost_brw_write+0xe49/0x14d0 [ost]
 [<ffffffff812739b6>] ? vsnprintf+0x2b6/0x5f0
 [<ffffffffa088c1f0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa08680e3>] ost_handle+0x31e3/0x46f0 [ost]
 [<ffffffffa05ca154>] ? libcfs_id2str+0x74/0xb0 [libcfs]
 [<ffffffffa08dc02c>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc]
 [<ffffffffa05be5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
 [<ffffffffa08d3759>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
 [<ffffffff81052223>] ? __wake_up+0x53/0x70
 [<ffffffffa08dd576>] ptlrpc_main+0xb76/0x1870 [ptlrpc]
 [<ffffffffa08dca00>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffffa08dca00>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
 [<ffffffffa08dca00>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20


 Comments   
Comment by Alex Zhuravlev [ 27/Feb/13 ]

I've seen this in local testing as well.

Comment by Nathaniel Clark [ 27/Feb/13 ]

Comparing results with ldiskfs via maloo:

Some zfs runs complete in ~4K seconds (possibly SLOW=yes).
ldiskfs runs complete in ~300 to 900 seconds.

A possible fix is to mark this test as SLOW-only, or as SLOW just for zfs?
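
A rough sketch of how that could look in sanity-quota.sh, assuming the usual test-framework.sh helpers skip() and facet_fstype() (not an actual patch):

test_12a() {
        if [ "$(facet_fstype ost1)" = "zfs" ] && [ "$SLOW" = "no" ]; then
                skip "test_12a is too slow on ZFS without SLOW=yes"
                return 0
        fi
        # ... existing test body unchanged ...
}
# run_test 12a line stays as-is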

Comment by Niu Yawei (Inactive) [ 27/Feb/13 ]

Will you work on this, Alex? Seems it's a generic zfs sync problem?

Comment by Alex Zhuravlev [ 27/Feb/13 ]

sorry, not anytime soon

Comment by Nathaniel Clark [ 05/Mar/13 ]

Tag should be "sometimes_except" since it's now a SLOW test for zfs

Comment by Bruno Faccini (Inactive) [ 08/Mar/13 ]

Alex, do you think we can learn something from some profiling?

Comment by Bruno Faccini (Inactive) [ 11/Mar/13 ]

Analysis of the whole logs does not indicate that the threads (at least ll_ost_io00_0 with pid 18170, which could be traced in the Lustre debug log) were hung. So we may simply be facing very slow test execution due to some external cause; the performance was horribly slow:

09:41:50: [dd] [if=/dev/zero] [bs=1M] [of=/mnt/lustre/d0.sanity-quota/d12/f.sanity-quota.12a-0] [count=17] [oflag=sync]
10:18:20:17+0 records in
10:18:23:17+0 records out
10:18:24:17825792 bytes (18 MB) copied, 2288.82 s, 7.8 kB/s
10:18:24:Write to ost1...
10:18:25:running as uid/gid/euid/egid 60000/60000/60000/60000, groups:
10:18:26: [dd] [if=/dev/zero] [bs=1M] [of=/mnt/lustre/d0.sanity-quota/d12/f.sanity-quota.12a-1] [count=17] [oflag=sync]
10:26:34:dd: writing `/mnt/lustre/d0.sanity-quota/d12/f.sanity-quota.12a-1': Disk quota exceeded
10:26:37:5+0 records in
10:26:38:4+0 records out
10:26:39:4194304 bytes (4.2 MB) copied, 490.729 s, 8.5 kB/s
10:26:39:Free space from ost0...
10:26:40:CMD: wtm-24vm7 lctl set_param -n osd*.*MD*.force_sync 1
10:26:42:CMD: wtm-24vm7 lctl get_param -n osc.*MDT*.sync_*
10:26:43:CMD: wtm-24vm7 lctl get_param -n osc.*MDT*.sync_*
10:26:44:CMD: wtm-24vm7 lctl get_param -n osc.*MDT*.sync_*
10:26:45:CMD: wtm-24vm7 lctl get_param -n osc.*MDT*.sync_*
10:26:47:Waiting for local destroys to complete
10:26:51:CMD: wtm-24vm8 lctl set_param -n osd*.*OS*.force_sync 1
10:26:52:Write to ost1 after space freed from ost0...
10:26:54:running as uid/gid/euid/egid 60000/60000/60000/60000, groups:
10:26:55: [dd] [if=/dev/zero] [bs=1M] [of=/mnt/lustre/d0.sanity-quota/d12/f.sanity-quota.12a-1] [count=17] [oflag=sync]

BTW, my own runs of test_12a on zfs show much better performance/stats.

Nathaniel, did you hit this problem/time-out multiple times? Also, you have already set test_12a as ZFS_SLOW in change 5553 for LU-2874!

I will try to re-run the same test suite for your original LU-2829/change 5530/patch #2/build 13602 where these https://maloo.whamcloud.com/test_sets/c3a0b364-812d-11e2-b609-52540035b04c tests failed.

Comment by Bruno Faccini (Inactive) [ 11/Mar/13 ]

Hmm, too bad / too late: Build 13602 has been removed ...

So back in trying to reproduce the very slow runs of test_12a.

Comment by Nathaniel Clark [ 12/Mar/13 ]

Most recent failure https://maloo.whamcloud.com/test_sets/bb4b4890-8895-11e2-b643-52540035b04c

Failure Rate: 7.00% of last 100 executions [all branches]

Comment by Alex Zhuravlev [ 13/Mar/13 ]

it might make sense to add some stats to osd-zfs so we know for sure how long txg_wait_synced() takes?

Comment by Jodi Levi (Inactive) [ 15/Mar/13 ]

Lowering priority per discussion with Oleg.

Comment by Jian Yu [ 20/Aug/13 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/33/
FSTYPE=zfs

sanity-quota test_7a hung as follows:

running as uid/gid/euid/egid 60000/60000/60000/60000, groups:
 [dd] [if=/dev/zero] [bs=1M] [of=/mnt/lustre/d0.sanity-quota/d7/f.sanity-quota.7a] [count=21] [oflag=sync]

Syslog on OSS:

Aug 18 23:22:00 wtm-15vm4 kernel: txg_sync      D 0000000000000000     0 15003      2 0x00000080
Aug 18 23:22:00 wtm-15vm4 kernel: ffff880052733ba0 0000000000000046 ffff8800ffffffff 0000333ae2e38992
Aug 18 23:22:00 wtm-15vm4 kernel: 00000000a0529170 ffff88006a5fce50 0000000000386bda ffffffffadee9654
Aug 18 23:22:00 wtm-15vm4 kernel: ffff8800523f3ab8 ffff880052733fd8 000000000000fb88 ffff8800523f3ab8
Aug 18 23:22:00 wtm-15vm4 kernel: Call Trace:
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffff810a2351>] ? ktime_get_ts+0xb1/0xf0
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffff8150e913>] io_schedule+0x73/0xc0
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffffa03e6d4c>] cv_wait_common+0x8c/0x100 [spl]
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffff81096cc0>] ? autoremove_wake_function+0x0/0x40
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffffa03e3717>] ? taskq_dispatch_ent+0x57/0x110 [spl]
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffffa03e6dd8>] __cv_wait_io+0x18/0x20 [spl]
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffffa052939b>] zio_wait+0xfb/0x190 [zfs]
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffffa04c2135>] dsl_pool_sync+0x2f5/0x540 [zfs]
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffffa04da82e>] spa_sync+0x39e/0x970 [zfs]
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffff8103b8d9>] ? kvm_clock_get_cycles+0x9/0x10
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffffa04e582a>] txg_sync_thread+0x27a/0x4b0 [zfs]
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffff810560a9>] ? set_user_nice+0xc9/0x130
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffffa04e55b0>] ? txg_sync_thread+0x0/0x4b0 [zfs]
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffffa03e2a3f>] thread_generic_wrapper+0x5f/0x70 [spl]
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffffa03e29e0>] ? thread_generic_wrapper+0x0/0x70 [spl]
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffff81096956>] kthread+0x96/0xa0
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffff810968c0>] ? kthread+0x0/0xa0
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Maloo report: https://maloo.whamcloud.com/test_sets/760750e4-095f-11e3-ad8a-52540035b04c

Comment by Jian Yu [ 20/Aug/13 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/33/
FSTYPE=zfs

replay-ost-single test_5 hung, and syslog on OSS showed that:

Aug 18 17:17:57 wtm-15vm4 kernel: txg_sync      D 0000000000000000     0 29194      2 0x00000080
Aug 18 17:17:57 wtm-15vm4 kernel: ffff8800683c5ba0 0000000000000046 ffff8800ffffffff 00004af162da63c4
Aug 18 17:17:57 wtm-15vm4 kernel: 000000004a452700 ffff88004515a750 000000000070a032 ffffffffadef130d
Aug 18 17:17:57 wtm-15vm4 kernel: ffff88006c9c1058 ffff8800683c5fd8 000000000000fb88 ffff88006c9c1058
Aug 18 17:17:57 wtm-15vm4 kernel: Call Trace:
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffff810a2351>] ? ktime_get_ts+0xb1/0xf0
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffff8150e913>] io_schedule+0x73/0xc0
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffffa03e6d4c>] cv_wait_common+0x8c/0x100 [spl]
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffff81096cc0>] ? autoremove_wake_function+0x0/0x40
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffffa03e3717>] ? taskq_dispatch_ent+0x57/0x110 [spl]
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffffa03e6dd8>] __cv_wait_io+0x18/0x20 [spl]
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffffa052939b>] zio_wait+0xfb/0x190 [zfs]
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffffa04c1f2c>] dsl_pool_sync+0xec/0x540 [zfs]
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffffa04da82e>] spa_sync+0x39e/0x970 [zfs]
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffff8103b8d9>] ? kvm_clock_get_cycles+0x9/0x10
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffffa04e582a>] txg_sync_thread+0x27a/0x4b0 [zfs]
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffff810560a9>] ? set_user_nice+0xc9/0x130
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffffa04e55b0>] ? txg_sync_thread+0x0/0x4b0 [zfs]
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffffa03e2a3f>] thread_generic_wrapper+0x5f/0x70 [spl]
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffffa03e29e0>] ? thread_generic_wrapper+0x0/0x70 [spl]
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffff81096956>] kthread+0x96/0xa0
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffff810968c0>] ? kthread+0x0/0xa0
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Maloo report: https://maloo.whamcloud.com/test_sets/15fe8b0a-095e-11e3-ad8a-52540035b04c

The following sub-tests also hung at "txg_sync":
large-scale test_3a: https://maloo.whamcloud.com/test_sets/9ed19422-0963-11e3-ad8a-52540035b04c
obdfilter-survey test_1a: https://maloo.whamcloud.com/test_sets/653914a0-0964-11e3-ad8a-52540035b04c
parallel-scale test_compilebench: https://maloo.whamcloud.com/test_sets/18f7c150-0963-11e3-ad8a-52540035b04c
parallel-scale-nfsv3 test_compilebench: https://maloo.whamcloud.com/test_sets/bd745a9e-0964-11e3-ad8a-52540035b04c
parallel-scale-nfsv4 test_compilebench: https://maloo.whamcloud.com/test_sets/17b573b2-0965-11e3-ad8a-52540035b04c
posix test_1: https://maloo.whamcloud.com/test_sets/73bee652-0965-11e3-ad8a-52540035b04c

The issue in this ticket is blocking the zfs test session on Lustre b2_4 branch:
https://maloo.whamcloud.com/test_sessions/0343b17e-095a-11e3-ad8a-52540035b04c

Comment by Jian Yu [ 05/Sep/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/44/ (2.4.1 RC1)
Distro/Arch: RHEL6.4/x86_64
FSTYPE=zfs

sanity-quota test 12a hit the same failure:
https://maloo.whamcloud.com/test_sets/15cc22b2-1559-11e3-8938-52540035b04c

parallel-scale test simul: https://maloo.whamcloud.com/test_sets/4e4fc084-155a-11e3-8938-52540035b04c

Comment by Alex Zhuravlev [ 17/Sep/13 ]

is there a way to generate a crash dump for this case? I'm still scratching my head over the issue. I talked to Brian B., but he doesn't see the root cause either.

Comment by Jian Yu [ 18/Sep/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/45/ (2.4.1 RC2)
Distro/Arch: RHEL6.4/x86_64
FSTYPE=zfs

MDSCOUNT=1
OSTCOUNT=7
MDSSIZE=2097152
OSTSIZE=2097152

In autotest run, parallel-scale test simul timed out in 3600s: https://maloo.whamcloud.com/test_sets/7a8c3714-18ab-11e3-aa54-52540035b04c

Dmesg on OSS node showed that:

txg_sync      D 0000000000000000     0 30848      2 0x00000080
 ffff880038d8bba0 0000000000000046 ffff8800ffffffff 0000c4392b16901e
 000000001a86df70 ffff88006f691f30 00000000013837fa ffffffffadd58c90
 ffff88002319d098 ffff880038d8bfd8 000000000000fb88 ffff88002319d098
Call Trace:
 [<ffffffff810a2431>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff8150ed03>] io_schedule+0x73/0xc0
 [<ffffffffa03e6d4c>] cv_wait_common+0x8c/0x100 [spl]
 [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa03e3717>] ? taskq_dispatch_ent+0x57/0x110 [spl]
 [<ffffffffa03e6dd8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa052939b>] zio_wait+0xfb/0x190 [zfs]
 [<ffffffffa04c1f2c>] dsl_pool_sync+0xec/0x540 [zfs]
 [<ffffffffa04da82e>] spa_sync+0x39e/0x970 [zfs]
 [<ffffffff8103b8d9>] ? kvm_clock_get_cycles+0x9/0x10
 [<ffffffffa04e582a>] txg_sync_thread+0x27a/0x4b0 [zfs]
 [<ffffffff810560a9>] ? set_user_nice+0xc9/0x130
 [<ffffffffa04e55b0>] ? txg_sync_thread+0x0/0x4b0 [zfs]
 [<ffffffffa03e2a3f>] thread_generic_wrapper+0x5f/0x70 [spl]
 [<ffffffffa03e29e0>] ? thread_generic_wrapper+0x0/0x70 [spl]
 [<ffffffff81096a36>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff810969a0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

In manual test run, parallel-scale test simul passed in 8727s: https://maloo.whamcloud.com/test_sets/aacb3504-1a8d-11e3-8fec-52540035b04c

In the above manual test report, we can see that although all of the sub-tests passed, many of them took a very long time to finish:

simul                   8727s
connectathon 	        6880s
iorssf 	                5570s
iorfpp 	                5960s
write_append_truncate 	13940s 	
write_disjoint 	        29272s

I'll set up kdump on the test node and run the test manually again to try to get the vmcore dump file.
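
Roughly, the kdump setup on a RHEL6 test node would be along these lines (the crashkernel size is only an example):

# append crashkernel=128M to the kernel line in /boot/grub/grub.conf, then:
yum install -y kexec-tools
chkconfig kdump on
reboot                          # needed so the crashkernel reservation takes effect
service kdump status            # verify kdump is operational after the reboot
# if the node wedges, a vmcore can be forced with:
echo c > /proc/sysrq-trigger    # dumps land in /var/crash by default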

Comment by Jian Yu [ 18/Sep/13 ]

Is this ticket related to LU-2476/LU-2600 which is about the performance issue on ZFS?

Comment by Jian Yu [ 22/Sep/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/45/ (2.4.1 RC2)
Distro/Arch: RHEL6.4/x86_64
FSTYPE=zfs

MDSCOUNT=1
OSTCOUNT=7
MDSSIZE=2097152
OSTSIZE=2097152

In manual test runs, sanity-quota did not hang, it finally passed in 27153s: https://maloo.whamcloud.com/test_sets/529eda72-21ac-11e3-8292-52540035b04c
Among the sub-tests, tests 2 and 12a took a very long time to finish:

test_2        16566s
test_12a      3196s

Comment by Alex Zhuravlev [ 23/Sep/13 ]

could someone run a single test (like simul) with ofd.*.sync_on_lock_cancel set to never, please?
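
i.e. something along these lines on the OSS before starting the run (sketch only; restore the default afterwards):

lctl set_param ofd.*.sync_on_lock_cancel=never
lctl get_param ofd.*.sync_on_lock_cancel     # verify the new setting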

Comment by Alex Zhuravlev [ 23/Sep/13 ]

I did submit http://review.whamcloud.com/#/c/7711/ a few days ago, but Jira hasn't even started it yet.

Comment by Jian Yu [ 23/Sep/13 ]

could someone run a single test (like simul) with ofd.*.sync_on_lock_cancel set to never, please?

Patch for Lustre b2_4 branch is in http://review.whamcloud.com/7725.

A manual test run is ongoing on wtm-[82-85], which are the test nodes where the parallel-scale test was performed against Lustre 2.4.1 RC2 previously. We'll compare the test results.

Comment by Jian Yu [ 24/Sep/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/45/ (2.4.1 RC2)
Distro/Arch: RHEL6.4/x86_64
FSTYPE=zfs

MDSCOUNT=1
OSTCOUNT=7
MDSSIZE=2097152
OSTSIZE=2097152

Without the patch, parallel-scale test simul passed in 8727s: https://maloo.whamcloud.com/test_sets/aacb3504-1a8d-11e3-8fec-52540035b04c
With the patch, parallel-scale test simul passed in 189s: https://maloo.whamcloud.com/test_sets/a0eb119e-25af-11e3-aeff-52540035b04c

Comment by Alex Zhuravlev [ 24/Sep/13 ]

ahaha

Comment by Alex Zhuravlev [ 25/Sep/13 ]

Jian, can you tell which devices were used on OST in that 189s run? thanks.

Comment by Alex Zhuravlev [ 25/Sep/13 ]

yet another question: if we're using raw devices for ZFS pools (not /dev/loop*), then what I/O scheduler is used?

Comment by Jian Yu [ 25/Sep/13 ]

Hi Alex,

The devices are SCSI disk partitions.

For I/O scheduler:

# cat /sys/block/sdb/queue/scheduler
noop anticipatory [deadline] cfq

Comment by Alex Zhuravlev [ 25/Sep/13 ]

thanks. no luck then..

Comment by Alex Zhuravlev [ 25/Sep/13 ]

Jian, would you mind trying with noop, please? (on the clean master branch)
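
e.g. roughly, on the disk backing the ZFS pools (sdb in the earlier output; sketch only):

echo noop > /sys/block/sdb/queue/scheduler
cat /sys/block/sdb/queue/scheduler           # should now show: [noop] anticipatory deadline cfq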

Comment by Jian Yu [ 26/Sep/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-master/1689/
Distro/Arch: RHEL6.4/x86_64
FSTYPE=zfs
Test nodes: wtm-[82-85]

With the default "deadline" I/O scheduler:

# pdsh -l root -S -w wtm-[84,85] "cat /sys/block/sdb/queue/scheduler"
wtm-84: noop anticipatory [deadline] cfq
wtm-85: noop anticipatory [deadline] cfq

parallel-scale test simul passed in 13457s: https://maloo.whamcloud.com/test_sets/a154ee76-2680-11e3-b741-52540035b04c

With the "noop" I/O scheduler:

# pdsh -l root -S -w wtm-[84,85] "cat /sys/block/sdb/queue/scheduler"
wtm-84: [noop] anticipatory deadline cfq
wtm-85: [noop] anticipatory deadline cfq

parallel-scale test simul passed in 13589s: https://maloo.whamcloud.com/test_sets/dae61290-26a4-11e3-94b1-52540035b04c

Comment by Alex Zhuravlev [ 26/Sep/13 ]

given there is no quick reply with the results, it's going to take another zillion seconds

would you mind running simul manually and grabbing /proc/spl/kstat/zfs/txgs-* a few times over the run?

Comment by Alex Zhuravlev [ 26/Sep/13 ]

I guess it makes sense to specify zfs_txg_history=500 (or 1000) when you load the zfs module.
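
Roughly (sketch only; this assumes zfs_txg_history is also writable at runtime via /sys/module/zfs/parameters):

# at module load time ...
modprobe zfs zfs_txg_history=1000
# ... or on an already-loaded module
echo 1000 > /sys/module/zfs/parameters/zfs_txg_history

# then sample the per-pool txg history a few times during the run
cat /proc/spl/kstat/zfs/txgs-*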

Comment by Jian Yu [ 26/Sep/13 ]

would you mind running simul manually and grabbing /proc/spl/kstat/zfs/txgs-* a few times over the run?

I guess it makes sense to specify zfs_txg_history=500 (or 1000) when you load the zfs module.

OK, let me do more experiments.

Comment by Jian Yu [ 26/Sep/13 ]

Hi Alex,
Since the simul test is still running with the noop scheduler, I did not stop it; I gathered the /proc/spl/kstat/zfs/txgs-* data from both the MDS (wtm-84) and the OSS (wtm-85) three times. Please refer to the attachments.

Comment by Alex Zhuravlev [ 26/Sep/13 ]

thanks! learning ...

Comment by Alex Zhuravlev [ 26/Sep/13 ]

do I understand correctly that all OSTs are using the same physical device?

Comment by Jian Yu [ 26/Sep/13 ]

do I understand correctly that all OSTs are using the same physical device?

Yes.

ost_HOST=wtm-85
OSTCOUNT=7
OSTSIZE=2097152
OSTDEV1=/dev/sdb5 OSTDEV2=/dev/sdb6 OSTDEV3=/dev/sdb7 OSTDEV4=/dev/sdb8 OSTDEV5=/dev/sdb9 OSTDEV6=/dev/sdb10 OSTDEV7=/dev/sdb11

Comment by Alex Zhuravlev [ 26/Sep/13 ]

would it be possible to run simul using a single OST on a non-shared physical device?

Comment by Jian Yu [ 26/Sep/13 ]

would it be possible to run simul using a single OST on a non-shared physical device?

Sure. I can start a new test run now because the previous simul test run was just finished. I'm uploading the Maloo report.

So, the new test run will be performed with single OST, deadline I/O scheduler, and zfs_txg_history=1000 while loading zfs module.

Comment by Alex Zhuravlev [ 26/Sep/13 ]

please make sure that the single OST uses a physical device that nobody else uses (including the MDT). The idea is to have one ZFS pool in use on that device and zero concurrency between different ZFS pools. Thanks in advance.

Comment by Jian Yu [ 26/Sep/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-master/1689/
Distro/Arch: RHEL6.4/x86_64
FSTYPE=zfs

Test configuration:

mds_HOST=wtm-84 MDSCOUNT=1 MDSSIZE=2097152  MDSDEV1=/dev/sdc 
ost_HOST=wtm-85 OSTCOUNT=1 OSTSIZE=15000000 OSTDEV1=/dev/sdd
# pdsh -l root -S -w wtm-84 "cat /sys/block/sdc/queue/scheduler"
wtm-84: [noop] anticipatory deadline cfq

# pdsh -l root -S -w wtm-85 "cat /sys/block/sdd/queue/scheduler"
wtm-85: [noop] anticipatory deadline cfq

parallel-scale test simul passed in 1950s: https://maloo.whamcloud.com/test_sets/55b02a2c-26b1-11e3-9d3a-52540035b04c

Comment by Alex Zhuravlev [ 26/Sep/13 ]

hmm, sounds interesting.. any idea whether simul's working set depends on the space available? or perhaps it makes sense to re-run with OSTSIZE=$((6*2097152))?

Comment by Jian Yu [ 26/Sep/13 ]

I think the performance of the simul test does not depend on the available space on the OST, because no parameter is calculated from the available OST space, and the simul test mainly performs open/close/unlink/rename/creat/truncate, etc. operations.
In the above test run, I used OSTSIZE=15000000 (about 14G). I just re-ran the test with OSTSIZE=$((6*2097152))=12582912 (about 12G) and it passed in 2095s: https://maloo.whamcloud.com/test_sets/e089858a-26c0-11e3-83d1-52540035b04c

Comment by Alex Zhuravlev [ 26/Sep/13 ]

thanks a lot for the help. It seems the issue is due to two factors: 1) expensive sync (still to be understood), and 2) the high cost of several pools competing on the same physical device. I don't think anybody will be running several pools on the same device in the wild..

Comment by Jian Yu [ 26/Sep/13 ]

You're welcome, Alex.

I just checked the autotest configuration and found the OST devices used by it were:

OSTDEV1=/dev/lvm-OSS/P1
OSTDEV2=/dev/lvm-OSS/P2
OSTDEV3=/dev/lvm-OSS/P3
OSTDEV4=/dev/lvm-OSS/P4
OSTDEV5=/dev/lvm-OSS/P5
OSTDEV6=/dev/lvm-OSS/P6
OSTDEV7=/dev/lvm-OSS/P7

It is likely that all of the LVs were created from the same PV, especially on virtual machines. Since most of the test nodes used by the autotest system are VMs, and ZFS full test runs are performed by autotest regularly, it's hard to bypass factor 2) above.

Comment by Alex Zhuravlev [ 26/Sep/13 ]

frankly I don't think we should be trying to improve ZFS for such an environment - it's not realistic to have many pools on the same device. A solution to this isn't obvious either.

Comment by Nathaniel Clark [ 26/Sep/13 ]

http://review.whamcloud.com/7778

Skip simul test for ZFS, since it takes too long.

Comment by Nathaniel Clark [ 26/Sep/13 ]

http://review.whamcloud.com/7780 - FOR TEST ONLY patch to run parallel-scale with only 2 OSTs

Comment by Jian Yu [ 27/Sep/13 ]

Hi Alex and Nathaniel,

FYI, all of the following sub-tests run slowly on ZFS under the current test environment (7 OSTs sharing the same physical device):

sanityn:
    test 33a                    13949s

sanity-quota:
    test 2                      16905s
    test 12a                    2556s

replay-ost-single:
    test 8a                     2326s
    test 8b                     2290s

parallel-scale:
    test compilebench           6913s
    test simul                  8727s
    test connectathon 	        6880s
    test iorssf 	        5570s
    test iorfpp 	        5960s
    test write_append_truncate 	13940s 	
    test write_disjoint 	29272s

We'd better not skip those tests for ZFS. We can improve the test scripts to reduce file sizes/counts, iteration counts, etc. so that the tests still run for ZFS within a reasonable time (less than 1800s with SLOW=yes).

Comment by Alex Zhuravlev [ 27/Sep/13 ]

can we reconfigure with OSTCOUNT=1 for parallel-scale specifically?

Comment by Jian Yu [ 27/Sep/13 ]

can we reconfigure with OSTCOUNT=1 for parallel-scale specifically?

This needs changes on the autotest system. We can get the result of the parallel-scale test with SLOW=no first, to see whether those sub-tests can pass within a reasonable time or not.

Comment by Jian Yu [ 29/Sep/13 ]

In manual test run with SLOW=no and OSTCOUNT=7, parallel-scale test simul passed in 112s: https://maloo.whamcloud.com/test_sets/4b220e1e-28c7-11e3-8951-52540035b04c
However, the following sub-tests still took very long time:

compilebench    7760s
iorssf          5061s
iorfpp          5748s

Let's wait for the autotest test result in http://review.whamcloud.com/7778 to do a comparison.

Comment by Jian Yu [ 30/Sep/13 ]

Let's wait for the autotest test result in http://review.whamcloud.com/7778 to do a comparison.

In autotest test run with SLOW=no and OSTCOUNT=7, parallel-scale timed out: https://maloo.whamcloud.com/test_sets/487e3fe6-29c3-11e3-b5ea-52540035b04c

compilebench    6005s
metabench       14400s (TIMEOUT)

We still have to reduce cbench_IDIRS, cbench_RUNS, mbench_NFILES, etc. for ZFS.
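
For example, something like the following in the ZFS test environment (the values are only placeholders and would need tuning):

# example only -- shrink the parallel-scale workloads when FSTYPE=zfs
if [ "$FSTYPE" = "zfs" ]; then
        export cbench_IDIRS=2        # compilebench initial directories
        export cbench_RUNS=2         # compilebench iterations
        export mbench_NFILES=10000   # metabench file count
fi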

Comment by Nathaniel Clark [ 30/Sep/13 ]

From 7778 https://maloo.whamcloud.com/test_sets/487e3fe6-29c3-11e3-b5ea-52540035b04c
the metabench result seems very strange.

metabench normally runs fairly quickly (~200-500s), judging by the results in maloo for ZFS runs.

It looks like the whole system (client-30) just went out to lunch for 4 hours.

Comment by Nathaniel Clark [ 03/Oct/13 ]

Reduce the performance expectation for ZFS in sanity-quota/0; the lowest observed over the last 4 weeks is ~150.

http://review.whamcloud.com/7848

Comment by Andreas Dilger [ 16/Oct/13 ]

I would prefer the approach of setting OSTCOUNT=2 for ZFS-backed test filesystems. This is more in line with real systems, since we will typically only have a single OST per OSS with ZFS instead of 4 or more OSTs per OSS with ldiskfs. I think a lot of tests depend on having at least two OSTs, so OSTCOUNT=1 will probably cause some tests to be skipped.

Another complementary approach would be to format a single ZFS pool across a few LVs and then have the different OST/MDT targets in their own datasets in the shared pool. That would avoid the extra commits caused by having separate pools. The drawback is that all of the datasets would store their files in the same space, so some of the Lustre tests would be broken if we don't add ZFS reservations for the minimum size of the datasets (e.g. filling one OST and allocating objects to another OST would break).
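
A rough sketch of that shared-pool layout (pool/dataset names and reservation sizes are made up for illustration):

# one shared pool built from several LVs
zpool create lustre-pool /dev/lvm-OSS/P1 /dev/lvm-OSS/P2 /dev/lvm-OSS/P3
# each target gets its own dataset in the shared pool ...
zfs create lustre-pool/ost1
zfs create lustre-pool/ost2
# ... with a reservation so "fill one OST" style tests still behave
zfs set reservation=2G lustre-pool/ost1
zfs set reservation=2G lustre-pool/ost2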

Comment by Jian Yu [ 17/Oct/13 ]

I would prefer the approach of setting OSTCOUNT=2 for ZFS-backed test filesystems.

I just created TEI-790 to ask TEI team for help on this change.

Another complementary approach would be to format a single ZFS pool across a few LVs and then have the different OST/MDT targets in their own datasets in the shared pool.

This needs changes to mdsdevname(), ostdevname() and the failover testing support code for ZFS (http://review.whamcloud.com/6429) in test-framework.sh.

Comment by Jian Yu [ 01/Nov/13 ]

Since TEI-790 was fixed, I've triggered a full group test session on ZFS against Lustre b2_4 build #47. I'll vet the test results to see whether the timed-out tests can pass or not with OSTCOUNT=2 and SLOW=yes.

Comment by Jian Yu [ 04/Nov/13 ]

Here is the test result on Lustre b2_4 build #47 with FSTYPE=zfs, OSTCOUNT=2 and SLOW=yes:
https://maloo.whamcloud.com/test_sessions/05b82736-444d-11e3-8472-52540035b04c

Timeout failures with OSTCOUNT=7 on the following sub-tests disappeared with OSTCOUNT=2:

sanity-benchmark     test bonnie        LU-1960
replay-ost-single    test 5             LU-2887
sanity-quota         test 7a and 12a    LU-2887
large-scale          test 3a            LU-2887
obdfilter-survey     test 1a            LU-2124

However, timeout failure on sanityn test 33a (LU-2829) still occurred.

And because of LU-3906 (an out-of-space issue), the parallel-scale, parallel-scale-nfsv3 and parallel-scale-nfsv4 tests were not really run, so we do not know whether the timeout failures on them disappear with OSTCOUNT=2.

Comment by Jian Yu [ 15/Nov/13 ]

http://review.whamcloud.com/7778

The above patch and http://review.whamcloud.com/8234 have landed on the master branch. They are also needed on the Lustre b2_4 and b2_5 branches.

The two patches are combined and back-ported to Lustre b2_4 branch: http://review.whamcloud.com/8284

Comment by Jian Yu [ 22/Nov/13 ]

Patch http://review.whamcloud.com/8284 landed on the Lustre b2_4 branch. I'll trigger a ZFS full group test session on Lustre b2_4 build #57 to check the test results.

Comment by Jian Yu [ 26/Nov/13 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/58/
Distro/Arch: RHEL6.4/x86_64

FSTYPE=zfs
MDSCOUNT=1
MDSSIZE=2097152
OSTCOUNT=2
OSTSIZE=2097152

replay-ost-single test 5 still timed out: https://maloo.whamcloud.com/test_sets/a76a6b78-5606-11e3-8e94-52540035b04c

txg_sync      D 0000000000000000     0 32309      2 0x00000080
 ffff88004925dba0 0000000000000046 ffff8800ffffffff 00001e6701abaaf4
 000000005cd72ac8 ffff88007126d2f0 00000000001ec6e4 ffffffffad6d9703
 ffff880072f07ab8 ffff88004925dfd8 000000000000fb88 ffff880072f07ab8
Call Trace:
 [<ffffffff810a2431>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff8150ed93>] io_schedule+0x73/0xc0
 [<ffffffffa03e6d4c>] cv_wait_common+0x8c/0x100 [spl]
 [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa03e3717>] ? taskq_dispatch_ent+0x57/0x110 [spl]
 [<ffffffffa03e6dd8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa052939b>] zio_wait+0xfb/0x190 [zfs]
 [<ffffffffa04c1f2c>] dsl_pool_sync+0xec/0x540 [zfs]
 [<ffffffffa04da82e>] spa_sync+0x39e/0x970 [zfs]
 [<ffffffff8103b8d9>] ? kvm_clock_get_cycles+0x9/0x10
 [<ffffffffa04e582a>] txg_sync_thread+0x27a/0x4b0 [zfs]
 [<ffffffff810560a9>] ? set_user_nice+0xc9/0x130
 [<ffffffffa04e55b0>] ? txg_sync_thread+0x0/0x4b0 [zfs]
 [<ffffffffa03e2a3f>] thread_generic_wrapper+0x5f/0x70 [spl]
 [<ffffffffa03e29e0>] ? thread_generic_wrapper+0x0/0x70 [spl]
 [<ffffffff81096a36>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff810969a0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Comment by Jian Yu [ 16/Dec/13 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/67/
Distro/Arch: RHEL6.4/x86_64

FSTYPE=zfs
MDSCOUNT=1
MDSSIZE=2097152
OSTCOUNT=2
OSTSIZE=8388608

performance-sanity test 4 timed out in 28800s:
https://maloo.whamcloud.com/test_sets/68896e68-6579-11e3-8518-52540035b04c

parallel-scale test metabench timed out in 14400s:
https://maloo.whamcloud.com/test_sets/bdff6c44-6579-11e3-8518-52540035b04c

Comment by Peter Jones [ 20/Dec/13 ]

Landed for 2.4.2 and 2.6. Will land for 2.5.1 shortly.

Comment by Andreas Dilger [ 01/Oct/14 ]

sanity-quota.sh test_12a is still being skipped on ZFS due to this bug, reopen until problem is actually fixed. At a minimum, the skip should only be done in the case of VM nodes, or ZFS-on-LVM or something.

Comment by Alex Zhuravlev [ 30/Sep/15 ]

ZIL will improve this
