Lustre / LU-8326

sanity-quota test_21 test failed to respond and timed out

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Fix Version/s: Lustre 2.10.0
    • Affects Version/s: Lustre 2.9.0, Lustre 2.10.0
    • Severity: 3

    Description

      This issue was created by maloo for nasf <fan.yong@intel.com>

      Please provide additional information about the failure here.

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/648de7b4-3a40-11e6-bbf5-5254006e85c2.

    Activity

            [LU-8326] sanity-quota test_21 test failed to respond and timed out
            adilger Andreas Dilger added a comment - https://review.whamcloud.com/15294

            niu Niu Yawei (Inactive) added a comment - Andreas, that could be helpful. Will that fix be deployed in our testing system soon?

            adilger Andreas Dilger added a comment - Would this be fixed by the native ZFS dnode accounting patch from Jinshan?
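
            As a hedged aside: the native dnode accounting referenced above corresponds to the userobj_accounting pool feature in ZFS on Linux 0.7+, so whether a given MDS pool could use it can be checked roughly as sketched below (the pool name lustre-mdt1 is a placeholder, not taken from this ticket):

            # Hedged sketch: native dnode accounting depends on the ZFS
            # "userobj_accounting" pool feature (ZFS on Linux 0.7+).
            # "lustre-mdt1" is a placeholder pool name, not from this ticket.
            zpool get feature@userobj_accounting lustre-mdt1
            # "active"/"enabled" means the pool can account dnodes natively;
            # "disabled" (or an error on older ZFS) means the patch can't help yet.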

            niu Niu Yawei (Inactive) added a comment - I'm wondering if we should disable this test for ZFS temporarily until the ZFS slow sync problem is fixed. What's your opinion, Andreas/Alex?
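
            A minimal sketch of what such a temporary disable could look like in lustre/tests/sanity-quota.sh, assuming the usual test-framework helpers (facet_fstype, ALWAYS_EXCEPT); this is an illustration, not the actual patch:

            # Hedged sketch, not an actual change from this ticket: exclude
            # test_21 on ZFS-backed servers until slow txg sync is resolved.
            if [ "$(facet_fstype mds1)" = "zfs" ]; then
                    # the test framework skips any test listed in ALWAYS_EXCEPT
                    ALWAYS_EXCEPT="$ALWAYS_EXCEPT 21"
            fi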
            yujian Jian Yu added a comment - More failure instances in patch review test sessions on master branch: https://testing.hpdd.intel.com/test_sets/6d09a4dc-dbad-11e6-8b41-5254006e85c2 https://testing.hpdd.intel.com/test_sets/ba9eb33c-daee-11e6-8752-5254006e85c2
            adilger Andreas Dilger added a comment - +1 on master https://testing.hpdd.intel.com/test_sets/48ddbe22-d919-11e6-a61d-5254006e85c2
            jamesanunez James Nunez (Inactive) added a comment (edited) - Wrong ticket
            yong.fan nasf (Inactive) added a comment - +1 on master: https://testing.hpdd.intel.com/test_sets/4f5129c4-d18f-11e6-a844-5254006e85c2

            niu Niu Yawei (Inactive) added a comment -

            Several 'dd' processes were blocked on the dir rwsem, like:

            02:05:56:[14258.624960] dd              D ffff88007add20c0     0 32124  32116 0x00000080
            02:05:56:[14258.626623]  ffff880052e4bc70 0000000000000082 ffff88000b628fb0 ffff880052e4bfd8
            02:05:56:[14258.628406]  ffff880052e4bfd8 ffff880052e4bfd8 ffff88000b628fb0 ffff88007add20b8
            02:05:56:[14258.630175]  ffff88007add20bc ffff88000b628fb0 00000000ffffffff ffff88007add20c0
            02:05:56:[14258.631880] Call Trace:
            02:05:56:[14258.633182]  [<ffffffff8168c669>] schedule_preempt_disabled+0x29/0x70
            02:05:56:[14258.634743]  [<ffffffff8168a2c5>] __mutex_lock_slowpath+0xc5/0x1c0
            02:05:56:[14258.636370]  [<ffffffff8168972f>] mutex_lock+0x1f/0x2f
            02:05:56:[14258.637839]  [<ffffffff8120cb9f>] do_last+0x28f/0x12a0
            02:05:56:[14258.639371]  [<ffffffff8120dc72>] path_openat+0xc2/0x490
            02:05:56:[14258.640849]  [<ffffffff8118055b>] ? unlock_page+0x2b/0x30
            02:05:56:[14258.642408]  [<ffffffff8120fdeb>] do_filp_open+0x4b/0xb0
            02:05:56:[14258.643862]  [<ffffffff8121ca67>] ? __alloc_fd+0xa7/0x130
            02:05:56:[14258.645382]  [<ffffffff811fd2f3>] do_sys_open+0xf3/0x1f0
            02:05:56:[14258.646831]  [<ffffffff811fd40e>] SyS_open+0x1e/0x20
            02:05:56:[14258.648296]  [<ffffffff816964c9>] system_call_fastpath+0x16/0x1b
            

            Because one 'dd' couldn't finish its open-create (or the open-create took too long), the others queued up behind it on the parent directory lock:

            02:05:56:[14258.566251] dd              S 000000000000f888     0 32123  32115 0x00000080
            02:05:56:[14258.567866]  ffff88007b03f700 0000000000000082 ffff88007c075e20 ffff88007b03ffd8
            02:05:56:[14258.569629]  ffff88007b03ffd8 ffff88007b03ffd8 ffff88007c075e20 ffff88007c0dc000
            02:05:56:[14258.571373]  ffff88007b03f730 0000000100d5b01c ffff88007c0dc000 000000000000f888
            02:05:56:[14258.573070] Call Trace:
            02:05:56:[14258.574276]  [<ffffffff8168b579>] schedule+0x29/0x70
            02:05:56:[14258.575745]  [<ffffffff81688f05>] schedule_timeout+0x175/0x2d0
            02:05:56:[14258.577234]  [<ffffffff81096820>] ? internal_add_timer+0x70/0x70
            02:05:56:[14258.578795]  [<ffffffffa09e3d50>] ptlrpc_set_wait+0x4c0/0x900 [ptlrpc]
            02:05:56:[14258.580347]  [<ffffffff810c4ec0>] ? wake_up_state+0x20/0x20
            02:05:56:[14258.581878]  [<ffffffffa09e420d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
            02:05:56:[14258.583463]  [<ffffffffa09c9236>] ldlm_cli_enqueue+0x686/0x810 [ptlrpc]
            02:05:56:[14258.585097]  [<ffffffffa09c40b0>] ? ldlm_expired_completion_wait+0x250/0x250 [ptlrpc]
            02:05:56:[14258.586766]  [<ffffffffa0c88740>] ? ll_invalidate_negative_children+0x1d0/0x1d0 [lustre]
            02:05:56:[14258.588506]  [<ffffffffa0b3f1df>] mdc_enqueue_base+0x2bf/0x17d0 [mdc]
            02:05:56:[14258.590072]  [<ffffffffa0b4164b>] mdc_intent_lock+0x26b/0x520 [mdc]
            02:05:56:[14258.591682]  [<ffffffffa0c88740>] ? ll_invalidate_negative_children+0x1d0/0x1d0 [lustre]
            02:05:56:[14258.593387]  [<ffffffffa09c40b0>] ? ldlm_expired_completion_wait+0x250/0x250 [ptlrpc]
            02:05:56:[14258.595129]  [<ffffffffa0b0e6dd>] lmv_intent_lock+0xc0d/0x1b50 [lmv]
            02:05:56:[14258.596700]  [<ffffffff811144f2>] ? from_kgid+0x12/0x20
            02:05:56:[14258.598234]  [<ffffffffa0c89217>] ? ll_i2suppgid+0x37/0x40 [lustre]
            02:05:56:[14258.599804]  [<ffffffffa0c89244>] ? ll_i2gids+0x24/0xb0 [lustre]
            02:05:56:[14258.601401]  [<ffffffff811144f2>] ? from_kgid+0x12/0x20
            02:05:56:[14258.602856]  [<ffffffffa0c73dd0>] ? ll_prep_md_op_data+0x220/0x510 [lustre]
            02:05:56:[14258.604559]  [<ffffffffa0c88740>] ? ll_invalidate_negative_children+0x1d0/0x1d0 [lustre]
            02:05:56:[14258.606285]  [<ffffffffa0ca174e>] ? ll_dentry_init_security+0x2e/0x90 [lustre]
            02:05:56:[14258.608026]  [<ffffffffa0c8a8ee>] ll_lookup_it+0x27e/0xe40 [lustre]
            02:05:56:[14258.609663]  [<ffffffffa0c8b5cd>] ll_atomic_open+0x11d/0x11e0 [lustre]
            02:05:56:[14258.611268]  [<ffffffff812a88cc>] ? security_inode_permission+0x1c/0x30
            02:05:56:[14258.612919]  [<ffffffff8120d321>] do_last+0xa11/0x12a0
            02:05:56:[14258.614392]  [<ffffffff8120dc72>] path_openat+0xc2/0x490
            02:05:56:[14258.615933]  [<ffffffff8118055b>] ? unlock_page+0x2b/0x30
            02:05:56:[14258.617410]  [<ffffffff8120fdeb>] do_filp_open+0x4b/0xb0
            02:05:56:[14258.618930]  [<ffffffff8121ca67>] ? __alloc_fd+0xa7/0x130
            02:05:56:[14258.620405]  [<ffffffff811fd2f3>] do_sys_open+0xf3/0x1f0
            02:05:56:[14258.621929]  [<ffffffff811fd40e>] SyS_open+0x1e/0x20
            02:05:56:[14258.623368]  [<ffffffff816964c9>] system_call_fastpath+0x16/0x1b
            

            The weird thing is that I didn't find any MDT service thread working on the open-create; perhaps the open-create had already completed by the time the stack traces were dumped? What I can see on the MDS side is that the txg_sync thread was stuck:

            [14300.192361] txg_sync        D 0000000000000001     0 23196      2 0x00000080
            [14300.194020]  ffff8800467cbb20 0000000000000046 ffff880038e2de20 ffff8800467cbfd8
            [14300.195757]  ffff8800467cbfd8 ffff8800467cbfd8 ffff880038e2de20 ffff88007fd16c40
            [14300.197507]  0000000000000000 7fffffffffffffff ffff880038e437d0 0000000000000001
            [14300.199213] Call Trace:
            [14300.200474]  [<ffffffff8168b7c9>] schedule+0x29/0x70
            [14300.202035]  [<ffffffff81689219>] schedule_timeout+0x239/0x2d0
            [14300.203583]  [<ffffffff810c4ed2>] ? default_wake_function+0x12/0x20
            [14300.205164]  [<ffffffff810ba118>] ? __wake_up_common+0x58/0x90
            [14300.206703]  [<ffffffff81060aef>] ? kvm_clock_get_cycles+0x1f/0x30
            [14300.208274]  [<ffffffff8168ad6e>] io_schedule_timeout+0xae/0x130
            [14300.209830]  [<ffffffff810b12f6>] ? prepare_to_wait_exclusive+0x56/0x90
            [14300.211437]  [<ffffffff8168ae08>] io_schedule+0x18/0x20
            [14300.212934]  [<ffffffffa02d16d0>] cv_wait_common+0xb0/0x150 [spl]
            [14300.214491]  [<ffffffff810b1600>] ? wake_up_atomic_t+0x30/0x30
            [14300.216043]  [<ffffffffa02d17c8>] __cv_wait_io+0x18/0x20 [spl]
            [14300.217600]  [<ffffffffa0505ab3>] zio_wait+0x123/0x210 [zfs]
            [14300.219131]  [<ffffffffa04a84db>] spa_sync+0x75b/0xb70 [zfs]
            [14300.220611]  [<ffffffff810b162b>] ? autoremove_wake_function+0x2b/0x40
            [14300.222179]  [<ffffffff81689972>] ? mutex_lock+0x12/0x2f
            [14300.223665]  [<ffffffffa04ba81c>] txg_sync_thread+0x3cc/0x640 [zfs]
            [14300.225225]  [<ffffffffa04ba450>] ? txg_fini+0x2a0/0x2a0 [zfs]
            [14300.226724]  [<ffffffffa02cc851>] thread_generic_wrapper+0x71/0x80 [spl]
            [14300.228287]  [<ffffffffa02cc7e0>] ? __thread_exit+0x20/0x20 [spl]
            [14300.229808]  [<ffffffff810b052f>] kthread+0xcf/0xe0
            [14300.231226]  [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
            [14300.232768]  [<ffffffff81696658>] ret_from_fork+0x58/0x90
            [14300.234226]  [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
            

            I think this is perhaps another ZFS slow sync issue? Alex, do you have any comments? Is there still an open ticket for this kind of problem? Thanks.
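
            As a hedged diagnostic sketch for the slow-sync theory: ZFS on Linux exposes per-txg timing through a kstat, so one could check on the MDS how long recent txgs spent syncing (the pool name is a placeholder, not from this ticket):

            # Hedged sketch: the per-txg kstats report open/quiesce/wait/sync
            # times (otime/qtime/wtime/stime, in ns) for recent txgs.
            # "lustre-mdt1" is a placeholder pool name.
            cat /proc/spl/kstat/zfs/lustre-mdt1/txgs
            # Consistently large stime values would point at genuinely slow
            # txg sync rather than a deadlocked txg_sync thread.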


            jgmitter Joseph Gmitter (Inactive) added a comment -

            Hi Niu,

            Can you please have a look at this issue? James reports that there are 32 test failures on master in the past 10 days due to this issue.

            Thanks.
            Joe

            yong.fan nasf (Inactive) added a comment - +2 on master: https://testing.hpdd.intel.com/test_sets/d6b55222-c8a3-11e6-a95f-5254006e85c2 https://testing.hpdd.intel.com/test_sets/eca4660a-c889-11e6-a95f-5254006e85c2

            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 19
