[LU-2887] sanity-quota test_12a: slow due to ZFS VMs sharing single disk Created: 27/Feb/13 Updated: 29/May/17 Resolved: 03/Dec/16 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.4.1 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.1 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | Alex Zhuravlev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | MB, performance, zfs |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6968 |
| Description |
|
This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>.
This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/c3a0b364-812d-11e2-b609-52540035b04c.
The sub-test test_12a failed with the following error:
Info required for matching: sanity-quota 12a

Looking through test 12a, things seem to have hung up on the runas dd (with oflag=sync) at the end of the test. The OST has threads that are blocked on disk I/O (oss dmesg):

txg_sync D 0000000000000000 0 24236 2 0x00000080
 ffff88005027dbc0 0000000000000046 ffff88004b906ec0 0000000000000086
 ffff88005027db70 ffff88007c7d4408 0000000000000001 ffff88007c7d4420
 ffff880052101058 ffff88005027dfd8 000000000000fb88 ffff880052101058
Call Trace:
 [<ffffffff81090b9e>] ? prepare_to_wait_exclusive+0x4e/0x80
 [<ffffffffa016b5ac>] cv_wait_common+0x9c/0x1a0 [spl]
 [<ffffffffa02d5160>] ? zio_execute+0x0/0xf0 [zfs]
 [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa016b6e3>] __cv_wait+0x13/0x20 [spl]
 [<ffffffffa02d533b>] zio_wait+0xeb/0x160 [zfs]
 [<ffffffffa026b807>] dsl_pool_sync+0x2a7/0x480 [zfs]
 [<ffffffffa027e147>] spa_sync+0x397/0x9a0 [zfs]
 [<ffffffffa028fd41>] txg_sync_thread+0x2c1/0x490 [zfs]
 [<ffffffff810527f9>] ? set_user_nice+0xc9/0x130
 [<ffffffffa028fa80>] ? txg_sync_thread+0x0/0x490 [zfs]
 [<ffffffffa0164668>] thread_generic_wrapper+0x68/0x80 [spl]
 [<ffffffffa0164600>] ? thread_generic_wrapper+0x0/0x80 [spl]
 [<ffffffff81090626>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff81090590>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

ll_ost_io00_0 D 0000000000000000 0 18170 2 0x00000080
 ffff8800427b9820 0000000000000046 0000000000000046 0000000000000001
 ffff8800427b98b0 0000000000000086 ffff8800427b97e0 ffff88005027dd60
 ffff8800427b7ab8 ffff8800427b9fd8 000000000000fb88 ffff8800427b7ab8
Call Trace:
 [<ffffffff81090b9e>] ? prepare_to_wait_exclusive+0x4e/0x80
 [<ffffffffa016b5ac>] cv_wait_common+0x9c/0x1a0 [spl]
 [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa016b6e3>] __cv_wait+0x13/0x20 [spl]
 [<ffffffffa028f573>] txg_wait_synced+0xb3/0x190 [zfs]
 [<ffffffffa0c71015>] osd_trans_stop+0x365/0x420 [osd_zfs]
 [<ffffffffa0cb9062>] ofd_trans_stop+0x22/0x60 [ofd]
 [<ffffffffa0cbdf06>] ofd_commitrw_write+0x406/0x11b0 [ofd]
 [<ffffffffa0cbf13d>] ofd_commitrw+0x48d/0x920 [ofd]
 [<ffffffffa085b708>] obd_commitrw+0x128/0x3d0 [ost]
 [<ffffffffa0862599>] ost_brw_write+0xe49/0x14d0 [ost]
 [<ffffffff812739b6>] ? vsnprintf+0x2b6/0x5f0
 [<ffffffffa088c1f0>] ? target_bulk_timeout+0x0/0xc0 [ptlrpc]
 [<ffffffffa08680e3>] ost_handle+0x31e3/0x46f0 [ost]
 [<ffffffffa05ca154>] ? libcfs_id2str+0x74/0xb0 [libcfs]
 [<ffffffffa08dc02c>] ptlrpc_server_handle_request+0x41c/0xdf0 [ptlrpc]
 [<ffffffffa05be5de>] ? cfs_timer_arm+0xe/0x10 [libcfs]
 [<ffffffffa08d3759>] ? ptlrpc_wait_event+0xa9/0x290 [ptlrpc]
 [<ffffffff81052223>] ? __wake_up+0x53/0x70
 [<ffffffffa08dd576>] ptlrpc_main+0xb76/0x1870 [ptlrpc]
 [<ffffffffa08dca00>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffffa08dca00>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
 [<ffffffffa08dca00>] ? ptlrpc_main+0x0/0x1870 [ptlrpc]
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 |
| Comments |
| Comment by Alex Zhuravlev [ 27/Feb/13 ] |
|
I've been hitting this in local testing as well. |
| Comment by Nathaniel Clark [ 27/Feb/13 ] |
|
Comparing results with ldiskfs via maloo: some zfs runs complete in ~4K seconds (possibly SLOW=yes). A possible fix is to mark this test as SLOW only, or as SLOW for zfs? |
| Comment by Niu Yawei (Inactive) [ 27/Feb/13 ] |
|
Will you work on this, Alex? It seems to be a generic zfs sync problem. |
| Comment by Alex Zhuravlev [ 27/Feb/13 ] |
|
sorry, not anytime soon |
| Comment by Nathaniel Clark [ 05/Mar/13 ] |
|
Tag should be "sometimes_except" since it's now a SLOW test for zfs |
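For context, a minimal sketch (not the actual patch) of how such a fstype-conditional skip is typically expressed in the Lustre test scripts, assuming the facet_fstype() and skip() helpers from test-framework.sh:

# sketch only: gate the expensive sub-test on the OST backend type and SLOW
[ "$(facet_fstype ost1)" = "zfs" ] && ZFS_SLOW="yes" || ZFS_SLOW="no"

test_12a() {
	# on ZFS this sync-heavy dd takes far too long on shared VM disks,
	# so only run it when SLOW=yes
	if [ "$ZFS_SLOW" = "yes" -a "$SLOW" = "no" ]; then
		skip "too slow on ZFS without SLOW=yes" && return 0
	fi
	: # original test body goes here
}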
| Comment by Bruno Faccini (Inactive) [ 08/Mar/13 ] |
|
Alex, do you think we can learn something from some profiling? |
| Comment by Bruno Faccini (Inactive) [ 11/Mar/13 ] |
|
Analysis of the whole logs does not indicate that the threads (at least ll_ost_io00_0 with pid 18170, which could be traced in the Lustre debug log) were hung. So we may only be facing very slow test execution due to some external cause; the performance was horribly slow:

09:41:50: [dd] [if=/dev/zero] [bs=1M] [of=/mnt/lustre/d0.sanity-quota/d12/f.sanity-quota.12a-0] [count=17] [oflag=sync]
10:18:20:17+0 records in
10:18:23:17+0 records out
10:18:24:17825792 bytes (18 MB) copied, 2288.82 s, 7.8 kB/s
10:18:24:Write to ost1...
10:18:25:running as uid/gid/euid/egid 60000/60000/60000/60000, groups:
10:18:26: [dd] [if=/dev/zero] [bs=1M] [of=/mnt/lustre/d0.sanity-quota/d12/f.sanity-quota.12a-1] [count=17] [oflag=sync]
10:26:34:dd: writing `/mnt/lustre/d0.sanity-quota/d12/f.sanity-quota.12a-1': Disk quota exceeded
10:26:37:5+0 records in
10:26:38:4+0 records out
10:26:39:4194304 bytes (4.2 MB) copied, 490.729 s, 8.5 kB/s
10:26:39:Free space from ost0...
10:26:40:CMD: wtm-24vm7 lctl set_param -n osd*.*MD*.force_sync 1
10:26:42:CMD: wtm-24vm7 lctl get_param -n osc.*MDT*.sync_*
10:26:43:CMD: wtm-24vm7 lctl get_param -n osc.*MDT*.sync_*
10:26:44:CMD: wtm-24vm7 lctl get_param -n osc.*MDT*.sync_*
10:26:45:CMD: wtm-24vm7 lctl get_param -n osc.*MDT*.sync_*
10:26:47:Waiting for local destroys to complete
10:26:51:CMD: wtm-24vm8 lctl set_param -n osd*.*OS*.force_sync 1
10:26:52:Write to ost1 after space freed from ost0...
10:26:54:running as uid/gid/euid/egid 60000/60000/60000/60000, groups:
10:26:55: [dd] [if=/dev/zero] [bs=1M] [of=/mnt/lustre/d0.sanity-quota/d12/f.sanity-quota.12a-1] [count=17] [oflag=sync]

BTW, my own runs of test_12a on zfs show much better perfs/stats. Nathaniel, did you hit this problem/time-out multiple times? Also, you have already set test_12a as ZFS_SLOW in change 5553 for  I will try to re-run the same test suite for your original |
| Comment by Bruno Faccini (Inactive) [ 11/Mar/13 ] |
|
Hmm, too bad/too late: Build 13602 has been removed... So I am back to trying to reproduce the very slow runs of test_12a. |
| Comment by Nathaniel Clark [ 12/Mar/13 ] |
|
Most recent failure https://maloo.whamcloud.com/test_sets/bb4b4890-8895-11e2-b643-52540035b04c
|
| Comment by Alex Zhuravlev [ 13/Mar/13 ] |
|
It might make sense to add some stats to osd-zfs so we know for sure how long txg_wait_synced() takes. |
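Until such stats exist, a crude userspace proxy (a sketch only, reusing the force_sync parameter already visible in the test logs above) is to time a forced sync on the OSS while the test is running:

# time how long a forced txg sync takes on the OSS-side osd-zfs devices
time lctl set_param -n osd*.*OS*.force_sync 1
# repeating this a few times during the dd gives a rough idea of sync latency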
| Comment by Jodi Levi (Inactive) [ 15/Mar/13 ] |
|
Lowering priority per discussion with Oleg. |
| Comment by Jian Yu [ 20/Aug/13 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/33/

sanity-quota test_7a hung as follows:

running as uid/gid/euid/egid 60000/60000/60000/60000, groups:
 [dd] [if=/dev/zero] [bs=1M] [of=/mnt/lustre/d0.sanity-quota/d7/f.sanity-quota.7a] [count=21] [oflag=sync]

Syslog on OSS:

Aug 18 23:22:00 wtm-15vm4 kernel: txg_sync D 0000000000000000 0 15003 2 0x00000080
Aug 18 23:22:00 wtm-15vm4 kernel: ffff880052733ba0 0000000000000046 ffff8800ffffffff 0000333ae2e38992
Aug 18 23:22:00 wtm-15vm4 kernel: 00000000a0529170 ffff88006a5fce50 0000000000386bda ffffffffadee9654
Aug 18 23:22:00 wtm-15vm4 kernel: ffff8800523f3ab8 ffff880052733fd8 000000000000fb88 ffff8800523f3ab8
Aug 18 23:22:00 wtm-15vm4 kernel: Call Trace:
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffff810a2351>] ? ktime_get_ts+0xb1/0xf0
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffff8150e913>] io_schedule+0x73/0xc0
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffffa03e6d4c>] cv_wait_common+0x8c/0x100 [spl]
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffff81096cc0>] ? autoremove_wake_function+0x0/0x40
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffffa03e3717>] ? taskq_dispatch_ent+0x57/0x110 [spl]
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffffa03e6dd8>] __cv_wait_io+0x18/0x20 [spl]
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffffa052939b>] zio_wait+0xfb/0x190 [zfs]
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffffa04c2135>] dsl_pool_sync+0x2f5/0x540 [zfs]
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffffa04da82e>] spa_sync+0x39e/0x970 [zfs]
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffff8103b8d9>] ? kvm_clock_get_cycles+0x9/0x10
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffffa04e582a>] txg_sync_thread+0x27a/0x4b0 [zfs]
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffff810560a9>] ? set_user_nice+0xc9/0x130
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffffa04e55b0>] ? txg_sync_thread+0x0/0x4b0 [zfs]
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffffa03e2a3f>] thread_generic_wrapper+0x5f/0x70 [spl]
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffffa03e29e0>] ? thread_generic_wrapper+0x0/0x70 [spl]
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffff81096956>] kthread+0x96/0xa0
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffff810968c0>] ? kthread+0x0/0xa0
Aug 18 23:22:00 wtm-15vm4 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Maloo report: https://maloo.whamcloud.com/test_sets/760750e4-095f-11e3-ad8a-52540035b04c |
| Comment by Jian Yu [ 20/Aug/13 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/33/

replay-ost-single test_5 hung, and syslog on the OSS showed:

Aug 18 17:17:57 wtm-15vm4 kernel: txg_sync D 0000000000000000 0 29194 2 0x00000080
Aug 18 17:17:57 wtm-15vm4 kernel: ffff8800683c5ba0 0000000000000046 ffff8800ffffffff 00004af162da63c4
Aug 18 17:17:57 wtm-15vm4 kernel: 000000004a452700 ffff88004515a750 000000000070a032 ffffffffadef130d
Aug 18 17:17:57 wtm-15vm4 kernel: ffff88006c9c1058 ffff8800683c5fd8 000000000000fb88 ffff88006c9c1058
Aug 18 17:17:57 wtm-15vm4 kernel: Call Trace:
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffff810a2351>] ? ktime_get_ts+0xb1/0xf0
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffff8150e913>] io_schedule+0x73/0xc0
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffffa03e6d4c>] cv_wait_common+0x8c/0x100 [spl]
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffff81096cc0>] ? autoremove_wake_function+0x0/0x40
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffffa03e3717>] ? taskq_dispatch_ent+0x57/0x110 [spl]
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffffa03e6dd8>] __cv_wait_io+0x18/0x20 [spl]
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffffa052939b>] zio_wait+0xfb/0x190 [zfs]
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffffa04c1f2c>] dsl_pool_sync+0xec/0x540 [zfs]
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffffa04da82e>] spa_sync+0x39e/0x970 [zfs]
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffff8103b8d9>] ? kvm_clock_get_cycles+0x9/0x10
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffffa04e582a>] txg_sync_thread+0x27a/0x4b0 [zfs]
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffff810560a9>] ? set_user_nice+0xc9/0x130
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffffa04e55b0>] ? txg_sync_thread+0x0/0x4b0 [zfs]
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffffa03e2a3f>] thread_generic_wrapper+0x5f/0x70 [spl]
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffffa03e29e0>] ? thread_generic_wrapper+0x0/0x70 [spl]
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffff81096956>] kthread+0x96/0xa0
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffff8100c0ca>] child_rip+0xa/0x20
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffff810968c0>] ? kthread+0x0/0xa0
Aug 18 17:17:57 wtm-15vm4 kernel: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

Maloo report: https://maloo.whamcloud.com/test_sets/15fe8b0a-095e-11e3-ad8a-52540035b04c

The following sub-tests also hung at "txg_sync":
The issue in this ticket is blocking the zfs test session on Lustre b2_4 branch: |
| Comment by Jian Yu [ 05/Sep/13 ] |
|
Lustre build: http://build.whamcloud.com/job/lustre-b2_4/44/ (2.4.1 RC1)
sanity-quota test 12a hit the same failure:
parallel-scale test simul: https://maloo.whamcloud.com/test_sets/4e4fc084-155a-11e3-8938-52540035b04c |
| Comment by Alex Zhuravlev [ 17/Sep/13 ] |
|
Is there a way to generate a crash dump for this case? I'm still scratching my head over the issue. I talked to Brian B., but he doesn't see the root cause either. |
| Comment by Jian Yu [ 18/Sep/13 ] |
|
Lustre build: http://build.whamcloud.com/job/lustre-b2_4/45/ (2.4.1 RC2)
MDSCOUNT=1

In the autotest run, parallel-scale test simul timed out in 3600s: https://maloo.whamcloud.com/test_sets/7a8c3714-18ab-11e3-aa54-52540035b04c

Dmesg on the OSS node showed:

txg_sync D 0000000000000000 0 30848 2 0x00000080
 ffff880038d8bba0 0000000000000046 ffff8800ffffffff 0000c4392b16901e
 000000001a86df70 ffff88006f691f30 00000000013837fa ffffffffadd58c90
 ffff88002319d098 ffff880038d8bfd8 000000000000fb88 ffff88002319d098
Call Trace:
 [<ffffffff810a2431>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff8150ed03>] io_schedule+0x73/0xc0
 [<ffffffffa03e6d4c>] cv_wait_common+0x8c/0x100 [spl]
 [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa03e3717>] ? taskq_dispatch_ent+0x57/0x110 [spl]
 [<ffffffffa03e6dd8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa052939b>] zio_wait+0xfb/0x190 [zfs]
 [<ffffffffa04c1f2c>] dsl_pool_sync+0xec/0x540 [zfs]
 [<ffffffffa04da82e>] spa_sync+0x39e/0x970 [zfs]
 [<ffffffff8103b8d9>] ? kvm_clock_get_cycles+0x9/0x10
 [<ffffffffa04e582a>] txg_sync_thread+0x27a/0x4b0 [zfs]
 [<ffffffff810560a9>] ? set_user_nice+0xc9/0x130
 [<ffffffffa04e55b0>] ? txg_sync_thread+0x0/0x4b0 [zfs]
 [<ffffffffa03e2a3f>] thread_generic_wrapper+0x5f/0x70 [spl]
 [<ffffffffa03e29e0>] ? thread_generic_wrapper+0x0/0x70 [spl]
 [<ffffffff81096a36>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff810969a0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

In a manual test run, parallel-scale test simul passed in 8727s: https://maloo.whamcloud.com/test_sets/aacb3504-1a8d-11e3-8fec-52540035b04c

In the above manual test report, we can see that although all of the sub-tests passed, many of them took a very long time to finish:

simul 8727s
connectathon 6880s
iorssf 5570s
iorfpp 5960s
write_append_truncate 13940s
write_disjoint 29272s

I'll set up kdump on the test node and run the test manually again to try to get the vmcore dump file. |
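For reference, a generic RHEL6-style kdump setup looks roughly like the sketch below (the package name, memory reservation and sysrq trigger are assumptions about the test VM, not steps taken from this ticket):

# sketch: enable kdump so a hung node can be turned into a vmcore
yum install -y kexec-tools
grubby --update-kernel=ALL --args="crashkernel=128M"   # needs a reboot
chkconfig kdump on
reboot
# after reboot, check the crash kernel is loaded, then a stuck node can be
# crashed on purpose to collect /var/crash/<date>/vmcore:
service kdump status
echo c > /proc/sysrq-trigger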
| Comment by Jian Yu [ 18/Sep/13 ] |
|
Is this ticket related to |
| Comment by Jian Yu [ 22/Sep/13 ] |
|
Lustre build: http://build.whamcloud.com/job/lustre-b2_4/45/ (2.4.1 RC2)
MDSCOUNT=1
In manual test runs, sanity-quota did not hang; it finally passed in 27153s: https://maloo.whamcloud.com/test_sets/529eda72-21ac-11e3-8292-52540035b04c
test_2 16566s
test_12a 3196s |
| Comment by Alex Zhuravlev [ 23/Sep/13 ] |
|
Could someone run a single test (like simul) with ofd.*.sync_on_lock_cancel set to never, please? |
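A sketch of that experiment (the parameter name is the one quoted above; the test invocation is only an example):

# on the OSS: stop forcing a sync when an extent lock is cancelled
lctl set_param ofd.*.sync_on_lock_cancel=never
lctl get_param ofd.*.sync_on_lock_cancel
# then run just the one sub-test, e.g.:
ONLY=simul bash parallel-scale.sh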
| Comment by Alex Zhuravlev [ 23/Sep/13 ] |
|
I did submit http://review.whamcloud.com/#/c/7711/ a few days ago, but Jira hasn't even started yet... |
| Comment by Jian Yu [ 23/Sep/13 ] |
Patch for Lustre b2_4 branch is in http://review.whamcloud.com/7725. A manual test run is ongoing on wtm-[82-85], which are the test nodes where the parallel-scale test was performed against Lustre 2.4.1 RC2 previously. We'll compare the test results. |
| Comment by Jian Yu [ 24/Sep/13 ] |
|
Lustre build: http://build.whamcloud.com/job/lustre-b2_4/45/ (2.4.1 RC2)
MDSCOUNT=1
Without the patch, parallel-scale test simul passed in 8727s: https://maloo.whamcloud.com/test_sets/aacb3504-1a8d-11e3-8fec-52540035b04c |
| Comment by Alex Zhuravlev [ 24/Sep/13 ] |
|
ahaha |
| Comment by Alex Zhuravlev [ 25/Sep/13 ] |
|
Jian, can you tell which devices were used on OST in that 189s run? thanks. |
| Comment by Alex Zhuravlev [ 25/Sep/13 ] |
|
yet another question: if we're using raw devices for ZFS pools (not /dev/loop*), then what I/O scheduler is used? |
| Comment by Jian Yu [ 25/Sep/13 ] |
|
Hi Alex,
The devices are SCSI disk partitions. For the I/O scheduler:
# cat /sys/block/sdb/queue/scheduler
noop anticipatory [deadline] cfq |
| Comment by Alex Zhuravlev [ 25/Sep/13 ] |
|
thanks. no luck then.. |
| Comment by Alex Zhuravlev [ 25/Sep/13 ] |
|
Jian, would you mind trying with noop, please? (on the clean master branch) |
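For reference, switching the elevator does not need a reboot; a sketch using the device name from the earlier comment:

# select the noop scheduler for the disk backing the ZFS pools
echo noop > /sys/block/sdb/queue/scheduler
cat /sys/block/sdb/queue/scheduler    # the active scheduler is shown in brackets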
| Comment by Jian Yu [ 26/Sep/13 ] |
|
Lustre build: http://build.whamcloud.com/job/lustre-master/1689/

With the default "deadline" I/O scheduler:
# pdsh -l root -S -w wtm-[84,85] "cat /sys/block/sdb/queue/scheduler"
wtm-84: noop anticipatory [deadline] cfq
wtm-85: noop anticipatory [deadline] cfq
parallel-scale test simul passed in 13457s: https://maloo.whamcloud.com/test_sets/a154ee76-2680-11e3-b741-52540035b04c

With the "noop" I/O scheduler:
# pdsh -l root -S -w wtm-[84,85] "cat /sys/block/sdb/queue/scheduler"
wtm-84: [noop] anticipatory deadline cfq
wtm-85: [noop] anticipatory deadline cfq
parallel-scale test simul passed in 13589s: https://maloo.whamcloud.com/test_sets/dae61290-26a4-11e3-94b1-52540035b04c |
| Comment by Alex Zhuravlev [ 26/Sep/13 ] |
|
Given there's no quick reply with the results, it's going to take another zillion seconds... Would you mind running simul manually and grabbing /proc/spl/kstat/zfs/txgs-* a few times over the run? |
| Comment by Alex Zhuravlev [ 26/Sep/13 ] |
|
I guess it makes sense to specify zfs_txg_history=500 (or 1000) when you load the zfs module. |
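A sketch of how that could be collected (module parameter and kstat path as mentioned above; the sampling interval is arbitrary):

# keep a longer per-txg history, either at module load time...
modprobe zfs zfs_txg_history=1000
# ...or on an already-loaded module:
echo 1000 > /sys/module/zfs/parameters/zfs_txg_history
# sample the txg kstats a few times while simul runs
for i in $(seq 1 5); do
	date
	cat /proc/spl/kstat/zfs/txgs-*
	sleep 300
done > /tmp/txg-history.log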
| Comment by Jian Yu [ 26/Sep/13 ] |
OK, let me do more experiments. |
| Comment by Jian Yu [ 26/Sep/13 ] |
|
Hi Alex, |
| Comment by Alex Zhuravlev [ 26/Sep/13 ] |
|
thanks! learning ... |
| Comment by Alex Zhuravlev [ 26/Sep/13 ] |
|
do I understand correctly that all OSTs are using the same physical device? |
| Comment by Jian Yu [ 26/Sep/13 ] |
Yes.
ost_HOST=wtm-85
OSTCOUNT=7
OSTSIZE=2097152
OSTDEV1=/dev/sdb5
OSTDEV2=/dev/sdb6
OSTDEV3=/dev/sdb7
OSTDEV4=/dev/sdb8
OSTDEV5=/dev/sdb9
OSTDEV6=/dev/sdb10
OSTDEV7=/dev/sdb11 |
| Comment by Alex Zhuravlev [ 26/Sep/13 ] |
|
Would it be possible to run simul using a single OST on a non-shared physical device? |
| Comment by Jian Yu [ 26/Sep/13 ] |
Sure. I can start a new test run now because the previous simul test run just finished. I'm uploading the Maloo report. So, the new test run will be performed with a single OST, the deadline I/O scheduler, and zfs_txg_history=1000 set while loading the zfs module.
| Comment by Alex Zhuravlev [ 26/Sep/13 ] |
|
Please make sure that the single OST uses a physical device nobody else uses (including the MDT). The idea is to have one ZFS pool in use on that device and zero concurrency between different ZFS pools. Thanks in advance. |
| Comment by Jian Yu [ 26/Sep/13 ] |
|
Lustre build: http://build.whamcloud.com/job/lustre-master/1689/

Test configuration:
mds_HOST=wtm-84
MDSCOUNT=1
MDSSIZE=2097152
MDSDEV1=/dev/sdc
ost_HOST=wtm-85
OSTCOUNT=1
OSTSIZE=15000000
OSTDEV1=/dev/sdd

# pdsh -l root -S -w wtm-84 "cat /sys/block/sdc/queue/scheduler"
wtm-84: [noop] anticipatory deadline cfq
# pdsh -l root -S -w wtm-85 "cat /sys/block/sdd/queue/scheduler"
wtm-85: [noop] anticipatory deadline cfq

parallel-scale test simul passed in 1950s: https://maloo.whamcloud.com/test_sets/55b02a2c-26b1-11e3-9d3a-52540035b04c |
| Comment by Alex Zhuravlev [ 26/Sep/13 ] |
|
Hmm, sounds interesting... Any idea whether simul's working set depends on the space available? Or perhaps it makes sense to re-run with OSTSIZE=$((6*2097152))? |
| Comment by Jian Yu [ 26/Sep/13 ] |
|
I think the performance of the simul test does not depend on the available space on the OST, because no parameter is calculated from the available space, and the simul test mainly performs open/close/unlink/rename/creat/truncate and similar operations. |
| Comment by Alex Zhuravlev [ 26/Sep/13 ] |
|
Thanks a lot for the help. It seems the issue is due to two factors: 1) expensive sync (still to be understood), and 2) the high cost of several pools competing on the same physical device. I don't think anybody will be running several pools on the same device in the wild. |
| Comment by Jian Yu [ 26/Sep/13 ] |
|
You're welcome, Alex. I just checked the autotest configuration and found the OST devices it uses were:
OSTDEV1=/dev/lvm-OSS/P1
OSTDEV2=/dev/lvm-OSS/P2
OSTDEV3=/dev/lvm-OSS/P3
OSTDEV4=/dev/lvm-OSS/P4
OSTDEV5=/dev/lvm-OSS/P5
OSTDEV6=/dev/lvm-OSS/P6
OSTDEV7=/dev/lvm-OSS/P7
It is likely that all of the LVs were created from the same PV, especially on virtual machines. Since most of the test nodes used by the autotest system are VMs, and ZFS full test runs are performed by autotest regularly, it's hard to bypass factor 2) above. |
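Whether the LVs really share one PV can be confirmed on the OSS with standard LVM tools, e.g. (a sketch; the VG name is taken from the device paths above):

# show which physical volume(s) back each logical volume in the lvm-OSS VG
lvs -o lv_name,vg_name,devices lvm-OSS
pvs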
| Comment by Alex Zhuravlev [ 26/Sep/13 ] |
|
Frankly, I don't think we should be trying to improve ZFS for such an environment - it's not realistic to have many pools on the same device. A solution to this isn't obvious either. |
| Comment by Nathaniel Clark [ 26/Sep/13 ] |
|
http://review.whamcloud.com/7778 Skip simul test for ZFS, since it takes too long. |
| Comment by Nathaniel Clark [ 26/Sep/13 ] |
|
http://review.whamcloud.com/7780 - FOR TEST ONLY patch to run parallel-scale with only 2 OSTs |
| Comment by Jian Yu [ 27/Sep/13 ] |
|
Hi Alex and Nathaniel, FYI, all of the following sub-tests run slowly on ZFS under the current test environment (7 OSTs share the same physical device): sanityn:
test 33a 13949s
sanity-quota:
test 2 16905s
test 12a 2556s
replay-ost-single:
test 8a 2326s
test 8b 2290s
parallel-scale:
test compilebench 6913s
test simul 8727s
test connectathon 6880s
test iorssf 5570s
test iorfpp 5960s
test write_append_truncate 13940s
test write_disjoint 29272s
We'd better not skip those tests for ZFS. We can improve the test scripts to reduce file sizes/counts, iteration times, etc., so that the tests still run for ZFS within a reasonable time (less than 1800s with SLOW=yes). |
| Comment by Alex Zhuravlev [ 27/Sep/13 ] |
|
Can we reconfigure with OSTCOUNT=1 for parallel-scale specifically? |
| Comment by Jian Yu [ 27/Sep/13 ] |
This needs changes to the autotest system. We can get the results of the parallel-scale test with SLOW=no first, to see whether those sub-tests can pass within a reasonable time or not.
| Comment by Jian Yu [ 29/Sep/13 ] |
|
In a manual test run with SLOW=no and OSTCOUNT=7, parallel-scale test simul passed in 112s: https://maloo.whamcloud.com/test_sets/4b220e1e-28c7-11e3-8951-52540035b04c
compilebench 7760s
iorssf 5061s
iorfpp 5748s
Let's wait for the autotest result in http://review.whamcloud.com/7778 to do a comparison. |
| Comment by Jian Yu [ 30/Sep/13 ] |
In the autotest run with SLOW=no and OSTCOUNT=7, parallel-scale timed out: https://maloo.whamcloud.com/test_sets/487e3fe6-29c3-11e3-b5ea-52540035b04c
compilebench 6005s
metabench 14400s (TIMEOUT)
We still have to reduce cbench_IDIRS, cbench_RUNS, mbench_NFILES, etc. for ZFS.
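As a sketch of the kind of tuning meant here, the workload sizes can be lowered through the environment variables parallel-scale.sh reads (the exact values below are only illustrative):

# shrink compilebench/metabench workloads for slow shared-disk ZFS setups
export cbench_IDIRS=2        # fewer initial compilebench directories
export cbench_RUNS=2         # fewer compilebench iterations
export mbench_NFILES=10000   # fewer files created by metabench
ONLY="compilebench metabench" bash parallel-scale.sh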
| Comment by Nathaniel Clark [ 30/Sep/13 ] |
|
From 7778: https://maloo.whamcloud.com/test_sets/487e3fe6-29c3-11e3-b5ea-52540035b04c - metabench normally runs fairly quickly (~200-500s), judging by the results in Maloo for ZFS runs. It looks like the whole system (client-30) just went out to lunch for 4 hours. |
| Comment by Nathaniel Clark [ 03/Oct/13 ] |
|
Reduce the performance expectation for ZFS in sanity-quota/0; the lowest observed over the last 4 weeks is ~150. |
| Comment by Andreas Dilger [ 16/Oct/13 ] |
|
I would prefer the approach of setting OSTCOUNT=2 for ZFS-backed test filesystems. This is more in line with real systems, since we will typically only have a single OST per OSS with ZFS instead of 4 or more OSTs per OSS with ldiskfs. I think a lot of tests depend on having at least two OSTs, so OSTCOUNT=1 will probably cause some tests to be skipped. Another complementary approach would be to format a single ZFS pool across a few LVs and then have the different OST/MDT targets in their own datasets in the shared pool. That would avoid the extra commits caused by having separate pools. The drawback is that all of the datasets would store their files into the same space, so some of the Lustre tests would be broken if we don't add ZFS reservations for the minimum size of the datasets (e.g. fill one OST and allocate objects to another OST would break). |
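A sketch of the shared-pool layout described above (pool and dataset names are hypothetical, and the reservation sizes would need to match the minimum target sizes the tests assume):

# one ZFS pool built from several LVs, one dataset per Lustre target
zpool create lustre-test /dev/lvm-OSS/P1 /dev/lvm-OSS/P2 /dev/lvm-OSS/P3
zfs create lustre-test/mdt1
zfs create lustre-test/ost1
zfs create lustre-test/ost2
# reserve a minimum of space per target so fill-one-OST style tests still work
zfs set reservation=2G lustre-test/ost1
zfs set reservation=2G lustre-test/ost2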
| Comment by Jian Yu [ 17/Oct/13 ] |
I just created TEI-790 to ask the TEI team for help with this change.
This needs changes to mdsdevname(), ostdevname(), and the failover testing support code for ZFS (http://review.whamcloud.com/6429) in test-framework.sh. |
| Comment by Jian Yu [ 01/Nov/13 ] |
|
Since TEI-790 was fixed, I've triggered a full group test session on ZFS against Lustre b2_4 build #47. I'll vet the test results to see whether the timed-out tests can pass or not with OSTCOUNT=2 and SLOW=yes. |
| Comment by Jian Yu [ 04/Nov/13 ] |
|
Here are the test results on Lustre b2_4 build #47 with FSTYPE=zfs, OSTCOUNT=2 and SLOW=yes:
Timeout failures seen with OSTCOUNT=7 on the following sub-tests disappeared with OSTCOUNT=2:
sanity-benchmark test bonnie LU-1960
replay-ost-single test 5 LU-2887
sanity-quota test 7a and 12a LU-2887
large-scale test 3a LU-2887
obdfilter-survey test 1a LU-2124
However, timeout failure on sanityn test 33a ( And because of |
| Comment by Jian Yu [ 15/Nov/13 ] |
|
The above patch and http://review.whamcloud.com/8234 have landed on the master branch. They are also needed on the Lustre b2_4 and b2_5 branches. The two patches have been combined and back-ported to the Lustre b2_4 branch: http://review.whamcloud.com/8284 |
| Comment by Jian Yu [ 22/Nov/13 ] |
|
Patch http://review.whamcloud.com/8284 landed on Lustre b2_4 branch. I'll trigger ZFS full group test session on Lustre b2_4 build #57 to check the test results. |
| Comment by Jian Yu [ 26/Nov/13 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/58/
FSTYPE=zfs

replay-ost-single test 5 still timed out: https://maloo.whamcloud.com/test_sets/a76a6b78-5606-11e3-8e94-52540035b04c

txg_sync D 0000000000000000 0 32309 2 0x00000080
 ffff88004925dba0 0000000000000046 ffff8800ffffffff 00001e6701abaaf4
 000000005cd72ac8 ffff88007126d2f0 00000000001ec6e4 ffffffffad6d9703
 ffff880072f07ab8 ffff88004925dfd8 000000000000fb88 ffff880072f07ab8
Call Trace:
 [<ffffffff810a2431>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff8150ed93>] io_schedule+0x73/0xc0
 [<ffffffffa03e6d4c>] cv_wait_common+0x8c/0x100 [spl]
 [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa03e3717>] ? taskq_dispatch_ent+0x57/0x110 [spl]
 [<ffffffffa03e6dd8>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa052939b>] zio_wait+0xfb/0x190 [zfs]
 [<ffffffffa04c1f2c>] dsl_pool_sync+0xec/0x540 [zfs]
 [<ffffffffa04da82e>] spa_sync+0x39e/0x970 [zfs]
 [<ffffffff8103b8d9>] ? kvm_clock_get_cycles+0x9/0x10
 [<ffffffffa04e582a>] txg_sync_thread+0x27a/0x4b0 [zfs]
 [<ffffffff810560a9>] ? set_user_nice+0xc9/0x130
 [<ffffffffa04e55b0>] ? txg_sync_thread+0x0/0x4b0 [zfs]
 [<ffffffffa03e2a3f>] thread_generic_wrapper+0x5f/0x70 [spl]
 [<ffffffffa03e29e0>] ? thread_generic_wrapper+0x0/0x70 [spl]
 [<ffffffff81096a36>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff810969a0>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 |
| Comment by Jian Yu [ 16/Dec/13 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/67/
FSTYPE=zfs
performance-sanity test 4 timed out in 28800s:
parallel-scale test metabench timed out in 14400s: |
| Comment by Peter Jones [ 20/Dec/13 ] |
|
Landed for 2.4.2 and 2.6. Will land for 2.5.1 shortly. |
| Comment by Andreas Dilger [ 01/Oct/14 ] |
|
sanity-quota.sh test_12a is still being skipped on ZFS due to this bug; reopening until the problem is actually fixed. At a minimum, the skip should only be done in the case of VM nodes, or ZFS-on-LVM, or something similar. |
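One possible way to narrow the skip, purely as a sketch (facet_fstype(), ostdevname() and skip_env() are test-framework.sh helpers; the virt-what based VM detection and the device-name pattern are assumptions):

# only skip when the ZFS OST sits on a VM disk or an LVM volume
is_vm_or_lvm_ost() {
	virt-what 2>/dev/null | grep -q . && return 0
	case "$(ostdevname 1)" in
		/dev/*lvm*|/dev/mapper/*) return 0 ;;
	esac
	return 1
}

if [ "$(facet_fstype ost1)" = "zfs" ] && is_vm_or_lvm_ost; then
	skip_env "ZFS on a shared VM/LVM disk is too slow for test_12a"
fi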
| Comment by Alex Zhuravlev [ 30/Sep/15 ] |
|
ZIL will improve this |