[LU-2829] Timeout on sanityn test_33a: zfs slow when commit_on_sharing enabled Created: 18/Feb/13 Updated: 31/Dec/13 Resolved: 02/Dec/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.4.1, Lustre 2.5.0 |
| Fix Version/s: | Lustre 2.6.0, Lustre 2.4.2, Lustre 2.5.1 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Nathaniel Clark |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | performance, zfs |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 6852 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>.
This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/2b779918-7787-11e2-987d-52540035b04c.
The sub-test test_33a failed with the following error:
|
| Comments |
| Comment by Keith Mannthey (Inactive) [ 18/Feb/13 ] |
|
Also, the test reports that the systems look like they just waited around for an hour. The test log has a good bit of "/proc/fs/jbd2/mdt1:*/info: No such file or directory" in it. I am guessing the test got lost. I don't expect ZFS to have any jbd2 business. |
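Those jbd2 statistics only exist for ldiskfs-backed targets. A minimal sketch of the kind of guard the stats collection could use, assuming the FSTYPE variable seen elsewhere in these logs (the path glob is taken from the error message above):

```bash
# Only collect jbd2 journal stats on ldiskfs-backed targets;
# ZFS uses no jbd2 journal, so /proc/fs/jbd2 entries never exist there.
if [ "${FSTYPE:-ldiskfs}" = "ldiskfs" ]; then
    cat /proc/fs/jbd2/mdt1:*/info 2>/dev/null
fi
```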
| Comment by Nathaniel Clark [ 26/Feb/13 ] |
|
http://review.whamcloud.com/5530 - ABANDONED |
| Comment by Nathaniel Clark [ 28/Feb/13 ] |
|
The fix for 33a and 33b also needs to correct the assumptions these tests make about device names for them to be useful even under autotest, since those systems use dm devices. |
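As a hedged illustration of what resolving such names could look like (the helper name and approach here are hypothetical, not the contents of the eventual patch):

```bash
# Hypothetical helper: given a device-mapper node such as /dev/mapper/mds1,
# find the underlying kernel device so per-device /proc or /sys statistics
# can be looked up under the name the kernel actually uses.
resolve_backing_device() {
    local dev=$1
    # lsblk's PKNAME column is the parent kernel device name
    # (empty for a plain, non-stacked block device).
    local parent
    parent=$(lsblk -no pkname "$dev" 2>/dev/null | head -n1)
    if [ -n "$parent" ]; then
        echo "/dev/$parent"
    else
        echo "$dev"
    fi
}

resolve_backing_device /dev/mapper/mds1
```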
| Comment by Nathaniel Clark [ 28/Feb/13 ] |
|
This particular failure is not actually related to zfs proc entries. One of the createmany processes (run on client1) seems to go out to lunch:

Feb 14 17:43:26 client-19vm1 xinetd[5727]: START: mshell pid=4698 from=::ffff:10.10.4.220
Feb 14 17:43:26 client-19vm1 mrshd[4699]: root@client-19vm1.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre" MGSFSTYPE=zfs MDSFSTYPE=zfs OSTFSTYPE=zfs FSTYPE=zfs sh -c "createmany -o /mnt/lustre/d0.sanityn/d33-\$(hostname)-1/f- -r /mnt/lustre2/d0.sanityn/d33-\$(hostname)-1/f- 10000 > /dev/null 2>&1");echo XXRETCODE:$?'
Feb 14 18:40:17 client-19vm1 xinetd[5727]: START: shell pid=28648 from=::ffff:10.10.4.220
Feb 14 18:40:17 client-19vm1 rshd[28650]: autotest@client-19vm1.lab.whamcloud.com as root: cmd='/home/autotest/.autotest/dynamic_bash/70180499028200+1360896014.64875'
Feb 14 18:40:18 client-19vm1 kernel: SysRq : Show State
...

OST has one process blocked on:

18:41:13:txg_sync      D 0000000000000000     0  2707      2 0x00000080
18:41:13: ffff88007c881bc0 0000000000000046 ffff88007cecf680 0000000000000086
18:41:13: ffff88007c881b70 ffff88007cfeda48 0000000000000001 ffff88007cfeda60
18:41:13: ffff8800796a9af8 ffff88007c881fd8 000000000000fb88 ffff8800796a9af8
18:41:13:Call Trace:
18:41:13: [<ffffffff81090b9e>] ? prepare_to_wait_exclusive+0x4e/0x80
18:41:13: [<ffffffffa016b5ac>] cv_wait_common+0x9c/0x1a0 [spl]
18:41:13: [<ffffffffa02d5160>] ? zio_execute+0x0/0xf0 [zfs]
18:41:13: [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
18:41:13: [<ffffffffa016b6e3>] __cv_wait+0x13/0x20 [spl]
18:41:13: [<ffffffffa02d533b>] zio_wait+0xeb/0x160 [zfs]
18:41:13: [<ffffffffa026b807>] dsl_pool_sync+0x2a7/0x480 [zfs]
18:41:13: [<ffffffffa027e147>] spa_sync+0x397/0x9a0 [zfs]
18:41:13: [<ffffffffa028fd41>] txg_sync_thread+0x2c1/0x490 [zfs]
18:41:13: [<ffffffff810527f9>] ? set_user_nice+0xc9/0x130
18:41:13: [<ffffffffa028fa80>] ? txg_sync_thread+0x0/0x490 [zfs]
18:41:13: [<ffffffffa0164668>] thread_generic_wrapper+0x68/0x80 [spl]
18:41:13: [<ffffffffa0164600>] ? thread_generic_wrapper+0x0/0x80 [spl]
18:41:14: [<ffffffff81090626>] kthread+0x96/0xa0
18:41:14: [<ffffffff8100c0ca>] child_rip+0xa/0x20
18:41:14: [<ffffffff81090590>] ? kthread+0x0/0xa0
18:41:14: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

This txg_sync stack trace looks very similar to the one posted in |
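For anyone re-gathering this kind of evidence: the "SysRq : Show State" dump above is the standard kernel task-state dump, which can be triggered on a wedged node with:

```bash
# Log the state and kernel stack of every task to dmesg/syslog;
# this is what produced the "SysRq : Show State" output quoted above.
echo t > /proc/sysrq-trigger
```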
| Comment by Nathaniel Clark [ 12/Mar/13 ] |
|
This test will complete on ZFS, but it takes an inordinate amount of time (usually slightly over 1 hour).
zfs times:
ldiskfs times: |
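For reproducing the comparison, a minimal sketch of timing just this sub-test under each backend, assuming the standard test-framework ONLY and FSTYPE variables that appear in the logs above (the cluster must be formatted with the matching backend for each run):

```bash
# Run and time only sanityn test_33a under each backing filesystem.
# ONLY= restricts the suite to a single sub-test.
cd /usr/lib64/lustre/tests
time FSTYPE=ldiskfs ONLY=33a sh sanityn.sh
time FSTYPE=zfs     ONLY=33a sh sanityn.sh
```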
| Comment by Nathaniel Clark [ 12/Mar/13 ] |
|
Patch to make logging print fewer errors: |
| Comment by Peter Jones [ 13/Mar/13 ] |
|
Landed for 2.4 |
| Comment by Nathaniel Clark [ 14/Mar/13 ] |
|
The core issue isn't resolved. This could be de-prioritized because the test will EVENTUALLY pass, just not within a reasonable time. The speed is probably related to bug |
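Since the ticket title ties the slowdown to commit-on-sharing, a hedged sketch of inspecting and toggling that setting on the MDS while investigating (the mdt.*.commit_on_sharing parameter path should be verified against the build under test):

```bash
# Check whether commit-on-sharing is enabled on the MDS, then turn it
# off to see how much of the test_33a runtime it accounts for.
lctl get_param mdt.*.commit_on_sharing
lctl set_param mdt.*.commit_on_sharing=0
```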
| Comment by Jian Yu [ 12/Aug/13 ] |
|
Lustre build: http://build.whamcloud.com/job/lustre-b2_4/29/
With FSTYPE=zfs, sanityn test 33a cannot pass within 3600s: |
| Comment by Jian Yu [ 05/Sep/13 ] |
|
Lustre build: http://build.whamcloud.com/job/lustre-b2_4/44/ (2.4.1 RC1)
sanityn test 33a timed out after 18000s: |
| Comment by Jian Yu [ 01/Nov/13 ] |
|
Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/46/
sanityn test 33a timed out after 3600s:
We'll see whether the timeout failure disappears after TEI-790 is resolved. |
| Comment by Jian Yu [ 04/Nov/13 ] |
|
Even with OSTCOUNT=2, sanityn test 33a still timed out after 18000s: |
| Comment by Nathaniel Clark [ 04/Nov/13 ] |
| Comment by Jian Yu [ 14/Nov/13 ] |
|
The above patch is also needed on Lustre b2_4 and b2_5 branches. |
| Comment by Bob Glossman (Inactive) [ 14/Nov/13 ] |
|
backport to b2_4 |
| Comment by Nathaniel Clark [ 15/Nov/13 ] |
|
Change 8161 cherry-picks over to b2_5 just fine. |
| Comment by Nathaniel Clark [ 15/Nov/13 ] |
|
Landed post 2.5.51 |
| Comment by Jian Yu [ 22/Nov/13 ] |
|
Patch landed on Lustre b2_4 branch. |
| Comment by Jian Yu [ 26/Nov/13 ] |
The failure still occurred: Lustre Build: http://build.whamcloud.com/job/lustre-b2_4/58/ FSTYPE=zfs https://maloo.whamcloud.com/test_sets/0f0a6172-5605-11e3-8e94-52540035b04c |
| Comment by Jian Yu [ 29/Nov/13 ] |
|
Patches to decrease the number of files again for sanityn test 33a and 33b: |
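The landed patches are the authority here; purely as an illustration of the shape such a change takes in sanityn.sh (counts and the NFILES name are hypothetical, not the landed values; $DIR1, $DIR2, and $tdir follow the usual sanityn conventions):

```bash
# Illustrative only: run a smaller createmany workload on ZFS, where each
# cross-mount create under commit-on-sharing is much more expensive.
if [ "$FSTYPE" = "zfs" ]; then
    NFILES=1000     # reduced from the 10000 used in the original run
else
    NFILES=10000
fi
createmany -o $DIR1/$tdir/f- -r $DIR2/$tdir/f- $NFILES
```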
| Comment by Jian Yu [ 02/Dec/13 ] |
|
The above patches landed on master and b2_4 branches. |