[LU-2829] Timeout on sanityn test_33a: zfs slow when commit_on_sharing enabled Created: 18/Feb/13  Updated: 31/Dec/13  Resolved: 02/Dec/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.4.1, Lustre 2.5.0
Fix Version/s: Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.1

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Nathaniel Clark
Resolution: Fixed Votes: 0
Labels: performance, zfs

Issue Links:
Duplicate
is duplicated by LU-4251 Failure on test suite sanityn test_33a Closed
Related
is related to LU-2887 sanity-quota test_12a: slow due to ZF... Resolved
Severity: 3
Rank (Obsolete): 6852

 Description   

This issue was created by maloo for sarah <sarah@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/2b779918-7787-11e2-987d-52540035b04c.

The sub-test test_33a failed with the following error:

test failed to respond and timed out



 Comments   
Comment by Keith Mannthey (Inactive) [ 18/Feb/13 ]

Also the test reports
test_33a
Error: 'test failed to respond and timed out'
Failure Rate: 100.00% of last 100 executions [all branches]

The systems look like they just waited around for an hour.

The test log has a good number of "/proc/fs/jbd2/mdt1:*/info: No such file or directory" errors in it. I am guessing the test got lost; I don't expect zfs to have any jbd2 business.
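Those errors suggest the test reads jbd2 statistics unconditionally. A minimal sketch of a guard, assuming the test suite's FSTYPE variable and the mdt1 target name (the guard itself is illustrative, not the actual sanityn fix):

```shell
# Hedged sketch: only attempt the jbd2 proc read on ldiskfs backends,
# since ZFS has no jbd2 journal and the proc entries do not exist.
FSTYPE=${FSTYPE:-zfs}

read_jbd2_info() {
    if [ "$FSTYPE" != "ldiskfs" ]; then
        echo "skipping jbd2 info: no jbd2 on FSTYPE=$FSTYPE"
        return 0
    fi
    # Only reached on ldiskfs, where these proc files should exist.
    cat /proc/fs/jbd2/"$1":*/info 2>/dev/null
}

read_jbd2_info mdt1
```

With FSTYPE=zfs this prints the skip message instead of spamming "No such file or directory".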

Comment by Nathaniel Clark [ 26/Feb/13 ]

http://review.whamcloud.com/5530 - ABANDONED

Comment by Nathaniel Clark [ 28/Feb/13 ]

The fix for 33a and 33b needs to correct assumptions about device names for these tests to be useful even on autotest, since those systems use dm devices.
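One way such assumptions break: device-mapper devices are typically reached through symlinks (e.g. under /dev/mapper), so comparing the configured device name against kernel-reported names fails unless the path is canonicalized first. An illustrative sketch, using a throwaway symlink as a stand-in for a dm device:

```shell
# Illustrative only: canonicalize a symlinked device path before comparing
# names. A temp symlink stands in for a real /dev/mapper entry here.
tmp=$(mktemp -d)
touch "$tmp/dm-3"                  # stand-in for the underlying device node
ln -s "$tmp/dm-3" "$tmp/mapper-link"
readlink -f "$tmp/mapper-link"     # prints the resolved path ending in dm-3
rm -rf "$tmp"
```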

Comment by Nathaniel Clark [ 28/Feb/13 ]

This particular failure is not actually related to zfs proc entries.

One of the createmany processes (run on client1) appears to hang:

 17:43:26 client-19vm1 xinetd[5727]: START: mshell pid=4698 from=::ffff:10.10.4.220
Feb 14 17:43:26 client-19vm1 mrshd[4699]: root@client-19vm1.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre"  MGSFSTYPE=zfs MDSFSTYPE=zfs OSTFSTYPE=zfs FSTYPE=zfs sh -c "createmany -o /mnt/lustre/d0.sanityn/d33-\$(hostname)-1/f- -r /mnt/lustre2/d0.sanityn/d33-\$(hostname)-1/f- 10000 > /dev/null 2>&1");echo XXRETCODE:$?'
Feb 14 18:40:17 client-19vm1 xinetd[5727]: START: shell pid=28648 from=::ffff:10.10.4.220
Feb 14 18:40:17 client-19vm1 rshd[28650]: autotest@client-19vm1.lab.whamcloud.com as root: cmd='/home/autotest/.autotest/dynamic_bash/70180499028200+1360896014.64875'
Feb 14 18:40:18 client-19vm1 kernel: SysRq : Show State
...

OST has one process blocked on:

18:41:13:txg_sync      D 0000000000000000     0  2707      2 0x00000080
18:41:13: ffff88007c881bc0 0000000000000046 ffff88007cecf680 0000000000000086
18:41:13: ffff88007c881b70 ffff88007cfeda48 0000000000000001 ffff88007cfeda60
18:41:13: ffff8800796a9af8 ffff88007c881fd8 000000000000fb88 ffff8800796a9af8
18:41:13:Call Trace:
18:41:13: [<ffffffff81090b9e>] ? prepare_to_wait_exclusive+0x4e/0x80
18:41:13: [<ffffffffa016b5ac>] cv_wait_common+0x9c/0x1a0 [spl]
18:41:13: [<ffffffffa02d5160>] ? zio_execute+0x0/0xf0 [zfs]
18:41:13: [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
18:41:13: [<ffffffffa016b6e3>] __cv_wait+0x13/0x20 [spl]
18:41:13: [<ffffffffa02d533b>] zio_wait+0xeb/0x160 [zfs]
18:41:13: [<ffffffffa026b807>] dsl_pool_sync+0x2a7/0x480 [zfs]
18:41:13: [<ffffffffa027e147>] spa_sync+0x397/0x9a0 [zfs]
18:41:13: [<ffffffffa028fd41>] txg_sync_thread+0x2c1/0x490 [zfs]
18:41:13: [<ffffffff810527f9>] ? set_user_nice+0xc9/0x130
18:41:13: [<ffffffffa028fa80>] ? txg_sync_thread+0x0/0x490 [zfs]
18:41:13: [<ffffffffa0164668>] thread_generic_wrapper+0x68/0x80 [spl]
18:41:13: [<ffffffffa0164600>] ? thread_generic_wrapper+0x0/0x80 [spl]
18:41:14: [<ffffffff81090626>] kthread+0x96/0xa0
18:41:14: [<ffffffff8100c0ca>] child_rip+0xa/0x20
18:41:14: [<ffffffff81090590>] ? kthread+0x0/0xa0
18:41:14: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

This txg_sync stack trace looks very similar to the one posted in LU-2887

Comment by Nathaniel Clark [ 12/Mar/13 ]

This test will complete on ZFS but it takes an inordinate amount of time (usually slightly over 1 hour).

zfs times:
commit_on_sharing=0 ~58s
commit_on_sharing=1 ~1300s

ldiskfs times:
commit_on_sharing=0 ~58s
commit_on_sharing=1 ~240s
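The arithmetic on those timings (values taken from this comment) shows how disproportionate the ZFS penalty is:

```shell
# Slowdown factor from enabling commit_on_sharing, per backend,
# computed from the ~58s / ~1300s / ~240s figures above.
awk 'BEGIN {
    printf "zfs slowdown:     %.1fx\n", 1300 / 58
    printf "ldiskfs slowdown: %.1fx\n", 240 / 58
}'
```

That is roughly a 22x slowdown on ZFS versus about 4x on ldiskfs for the same setting.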

Comment by Nathaniel Clark [ 12/Mar/13 ]

Patch to make logging print fewer errors:
http://review.whamcloud.com/5693

Comment by Peter Jones [ 13/Mar/13 ]

Landed for 2.4

Comment by Nathaniel Clark [ 14/Mar/13 ]

The core issue isn't resolved. This could be de-prioritized because the test will EVENTUALLY pass, just not within a reasonable time.

The slowness is probably related to LU-2887 and its ilk.

Comment by Jian Yu [ 12/Aug/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/29/

With FSTYPE=zfs, sanityn test 33a cannot pass within 3600s:
https://maloo.whamcloud.com/test_sets/3fbb7eca-0257-11e3-b384-52540035b04c

Comment by Jian Yu [ 05/Sep/13 ]

Lustre build: http://build.whamcloud.com/job/lustre-b2_4/44/ (2.4.1 RC1)
Distro/Arch: RHEL6.4/x86_64
FSTYPE=zfs

sanityn test 33a timed out after 18000s:
https://maloo.whamcloud.com/test_sets/fbf13be0-1556-11e3-8938-52540035b04c

Comment by Jian Yu [ 01/Nov/13 ]

Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/46/
FSTYPE=zfs

sanityn test 33a timed out after 3600s:
https://maloo.whamcloud.com/test_sets/cff9f9da-305b-11e3-b28a-52540035b04c

We'll see whether the timeout failure disappears after TEI-790 is resolved.

Comment by Jian Yu [ 04/Nov/13 ]

Even with OSTCOUNT=2, sanityn test 33a still timed out in 18000s:
https://maloo.whamcloud.com/test_sets/70536e32-444f-11e3-8472-52540035b04c
https://maloo.whamcloud.com/test_sets/fb7ca724-43ed-11e3-8599-52540035b04c

Comment by Nathaniel Clark [ 04/Nov/13 ]

http://review.whamcloud.com/8161

Comment by Jian Yu [ 14/Nov/13 ]

The above patch is also needed on Lustre b2_4 and b2_5 branches.

Comment by Bob Glossman (Inactive) [ 14/Nov/13 ]

backport to b2_4
http://review.whamcloud.com/8273

Comment by Nathaniel Clark [ 15/Nov/13 ]

8161 cherry-picks over to b2_5 just fine

Comment by Nathaniel Clark [ 15/Nov/13 ]

Landed post 2.5.51

Comment by Jian Yu [ 22/Nov/13 ]

Patch landed on Lustre b2_4 branch.

Comment by Jian Yu [ 26/Nov/13 ]

backport to b2_4 http://review.whamcloud.com/8273

The failure still occurred:

Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/58/
Distro/Arch: RHEL6.4/x86_64

FSTYPE=zfs
MDSCOUNT=1
MDSSIZE=2097152
OSTCOUNT=2
OSTSIZE=2097152

https://maloo.whamcloud.com/test_sets/0f0a6172-5605-11e3-8e94-52540035b04c

Comment by Jian Yu [ 29/Nov/13 ]

Patches to decrease the number of files again for sanityn test 33a and 33b:
For master branch: http://review.whamcloud.com/8440
For b2_4 branch: http://review.whamcloud.com/8439

Comment by Jian Yu [ 02/Dec/13 ]

The above patches landed on master and b2_4 branches.

Generated at Sat Feb 10 01:28:34 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.