[LU-8326] sanity-quota test_21 test failed to respond and timed out Created: 25/Jun/16 Updated: 17/Apr/17 Resolved: 17/Apr/17 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.9.0, Lustre 2.10.0 |
| Fix Version/s: | Lustre 2.10.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Maloo | Assignee: | Niu Yawei (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | zfs |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for nasf <fan.yong@intel.com>. Please provide additional information about the failure here. This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/648de7b4-3a40-11e6-bbf5-5254006e85c2. |
| Comments |
| Comment by Jian Yu [ 13/Jul/16 ] |
|
One more instance on master branch: https://testing.hpdd.intel.com/test_sets/ab8deb42-487c-11e6-a80f-5254006e85c2 |
| Comment by nasf (Inactive) [ 29/Jul/16 ] |
|
Another one on master: |
| Comment by Bob Glossman (Inactive) [ 04/Aug/16 ] |
|
another on master: |
| Comment by Bob Glossman (Inactive) [ 06/Aug/16 ] |
|
another on master: |
| Comment by James Nunez (Inactive) [ 20/Oct/16 ] |
|
I've seen this issue several times recently. In the client log, the dd is hung, but I'm not sure why:

08:26:46:[ 9196.946469] Lustre: DEBUG MARKER: Files for user (quota_usr):
08:26:46:[ 9240.580066] INFO: task dd:28549 blocked for more than 120 seconds.
08:26:46:[ 9240.581799] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
08:26:46:[ 9240.583601] dd D ffff880016aeb2c0 0 28549 28540 0x00000080
08:26:46:[ 9240.585324] ffff8800001cbc70 0000000000000086 ffff880079d19700 ffff8800001cbfd8
08:26:46:[ 9240.587145] ffff8800001cbfd8 ffff8800001cbfd8 ffff880079d19700 ffff880016aeb2b8
08:26:46:[ 9240.589335] ffff880016aeb2bc ffff880079d19700 00000000ffffffff ffff880016aeb2c0
08:26:46:[ 9240.591322] Call Trace:
08:26:46:[ 9240.592709] [<ffffffff8163cae9>] schedule_preempt_disabled+0x29/0x70
08:26:46:[ 9240.594300] [<ffffffff8163a7e5>] __mutex_lock_slowpath+0xc5/0x1c0
08:26:46:[ 9240.595846] [<ffffffff81639c4f>] mutex_lock+0x1f/0x2f
08:26:46:[ 9240.597336] [<ffffffff811ec11f>] do_last+0x28f/0x1270
08:26:46:[ 9240.598791] [<ffffffff812f9062>] ? radix_tree_lookup_slot+0x22/0x50
08:26:46:[ 9240.600310] [<ffffffff811eede2>] path_openat+0xc2/0x490
08:26:46:[ 9240.601744] [<ffffffff811f05ab>] do_filp_open+0x4b/0xb0
08:26:46:[ 9240.603178] [<ffffffff811fd147>] ? __alloc_fd+0xa7/0x130
08:26:46:[ 9240.604623] [<ffffffff811ddf53>] do_sys_open+0xf3/0x1f0
08:26:46:[ 9240.606030] [<ffffffff811de06e>] SyS_open+0x1e/0x20
08:26:46:[ 9240.607404] [<ffffffff81646a09>] system_call_fastpath+0x16/0x1b

The recent failures that I've seen are all for ZFS servers. Here are a few logs from recent failures: |
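For anyone reproducing this, a hedged sketch of the generic Linux commands we'd use to gather more state about a task hung in D state like the dd above; nothing here is Lustre-specific, and the paths are standard kernel interfaces:

```shell
# Threshold after which the kernel prints "blocked for more than N seconds":
TIMEOUT=$(cat /proc/sys/kernel/hung_task_timeout_secs 2>/dev/null || echo unknown)
echo "hung_task warning threshold: ${TIMEOUT}s"

# Dump stacks of all uninterruptible (D-state) tasks to the kernel log
# (requires root; read the result back with dmesg):
if [ -w /proc/sysrq-trigger ]; then
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 40
else
    echo "not root: skipping sysrq-w dump"
fi
```

The sysrq-w dump is what produces the "Call Trace" blocks quoted in these comments, so running it while the test is wedged captures all blocked tasks at once.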
| Comment by Bob Glossman (Inactive) [ 04/Nov/16 ] |
|
another on master: |
| Comment by nasf (Inactive) [ 27/Nov/16 ] |
|
+1 on master: |
| Comment by nasf (Inactive) [ 01/Dec/16 ] |
|
+1 on master: |
| Comment by nasf (Inactive) [ 01/Dec/16 ] |
|
+1 on master: |
| Comment by Jian Yu [ 08/Dec/16 ] |
|
More failures on master branch in review-zfs-part-1 patch review test sessions: |
| Comment by Jian Yu [ 09/Dec/16 ] |
|
This is blocking patch review testing for https://review.whamcloud.com/22588 on master branch. |
| Comment by Patrick Farrell (Inactive) [ 09/Dec/16 ] |
|
Also blocking testing of https://review.whamcloud.com/#/c/22853/... |
| Comment by Patrick Farrell (Inactive) [ 09/Dec/16 ] |
|
Referencing https://jira.hpdd.intel.com/browse/LU-2887... Are we sure this test isn't just too slow sometimes, since it's only on ZFS that it's failing? (Just something to consider. It doesn't really match up with the dd waiting on i_mutex that James shared.) |
| Comment by Patrick Farrell (Inactive) [ 09/Dec/16 ] |
|
Another failure: |
| Comment by nasf (Inactive) [ 10/Dec/16 ] |
|
+1 on master: |
| Comment by nasf (Inactive) [ 11/Dec/16 ] |
|
+1 on master: |
| Comment by nasf (Inactive) [ 23/Dec/16 ] |
|
+2 on master: |
| Comment by Joseph Gmitter (Inactive) [ 28/Dec/16 ] |
|
Hi Niu, can you please have a look at this issue? James reports 32 test failures on master in the past 10 days due to this issue. Thanks. |
| Comment by Niu Yawei (Inactive) [ 29/Dec/16 ] |
|
Several 'dd' processes were blocked on the directory i_mutex, like:

02:05:56:[14258.624960] dd D ffff88007add20c0 0 32124 32116 0x00000080
02:05:56:[14258.626623] ffff880052e4bc70 0000000000000082 ffff88000b628fb0 ffff880052e4bfd8
02:05:56:[14258.628406] ffff880052e4bfd8 ffff880052e4bfd8 ffff88000b628fb0 ffff88007add20b8
02:05:56:[14258.630175] ffff88007add20bc ffff88000b628fb0 00000000ffffffff ffff88007add20c0
02:05:56:[14258.631880] Call Trace:
02:05:56:[14258.633182] [<ffffffff8168c669>] schedule_preempt_disabled+0x29/0x70
02:05:56:[14258.634743] [<ffffffff8168a2c5>] __mutex_lock_slowpath+0xc5/0x1c0
02:05:56:[14258.636370] [<ffffffff8168972f>] mutex_lock+0x1f/0x2f
02:05:56:[14258.637839] [<ffffffff8120cb9f>] do_last+0x28f/0x12a0
02:05:56:[14258.639371] [<ffffffff8120dc72>] path_openat+0xc2/0x490
02:05:56:[14258.640849] [<ffffffff8118055b>] ? unlock_page+0x2b/0x30
02:05:56:[14258.642408] [<ffffffff8120fdeb>] do_filp_open+0x4b/0xb0
02:05:56:[14258.643862] [<ffffffff8121ca67>] ? __alloc_fd+0xa7/0x130
02:05:56:[14258.645382] [<ffffffff811fd2f3>] do_sys_open+0xf3/0x1f0
02:05:56:[14258.646831] [<ffffffff811fd40e>] SyS_open+0x1e/0x20
02:05:56:[14258.648296] [<ffffffff816964c9>] system_call_fastpath+0x16/0x1b

because one 'dd' could not finish its open-create (or the open-create took too long):

02:05:56:[14258.566251] dd S 000000000000f888 0 32123 32115 0x00000080
02:05:56:[14258.567866] ffff88007b03f700 0000000000000082 ffff88007c075e20 ffff88007b03ffd8
02:05:56:[14258.569629] ffff88007b03ffd8 ffff88007b03ffd8 ffff88007c075e20 ffff88007c0dc000
02:05:56:[14258.571373] ffff88007b03f730 0000000100d5b01c ffff88007c0dc000 000000000000f888
02:05:56:[14258.573070] Call Trace:
02:05:56:[14258.574276] [<ffffffff8168b579>] schedule+0x29/0x70
02:05:56:[14258.575745] [<ffffffff81688f05>] schedule_timeout+0x175/0x2d0
02:05:56:[14258.577234] [<ffffffff81096820>] ? internal_add_timer+0x70/0x70
02:05:56:[14258.578795] [<ffffffffa09e3d50>] ptlrpc_set_wait+0x4c0/0x900 [ptlrpc]
02:05:56:[14258.580347] [<ffffffff810c4ec0>] ? wake_up_state+0x20/0x20
02:05:56:[14258.581878] [<ffffffffa09e420d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
02:05:56:[14258.583463] [<ffffffffa09c9236>] ldlm_cli_enqueue+0x686/0x810 [ptlrpc]
02:05:56:[14258.585097] [<ffffffffa09c40b0>] ? ldlm_expired_completion_wait+0x250/0x250 [ptlrpc]
02:05:56:[14258.586766] [<ffffffffa0c88740>] ? ll_invalidate_negative_children+0x1d0/0x1d0 [lustre]
02:05:56:[14258.588506] [<ffffffffa0b3f1df>] mdc_enqueue_base+0x2bf/0x17d0 [mdc]
02:05:56:[14258.590072] [<ffffffffa0b4164b>] mdc_intent_lock+0x26b/0x520 [mdc]
02:05:56:[14258.591682] [<ffffffffa0c88740>] ? ll_invalidate_negative_children+0x1d0/0x1d0 [lustre]
02:05:56:[14258.593387] [<ffffffffa09c40b0>] ? ldlm_expired_completion_wait+0x250/0x250 [ptlrpc]
02:05:56:[14258.595129] [<ffffffffa0b0e6dd>] lmv_intent_lock+0xc0d/0x1b50 [lmv]
02:05:56:[14258.596700] [<ffffffff811144f2>] ? from_kgid+0x12/0x20
02:05:56:[14258.598234] [<ffffffffa0c89217>] ? ll_i2suppgid+0x37/0x40 [lustre]
02:05:56:[14258.599804] [<ffffffffa0c89244>] ? ll_i2gids+0x24/0xb0 [lustre]
02:05:56:[14258.601401] [<ffffffff811144f2>] ? from_kgid+0x12/0x20
02:05:56:[14258.602856] [<ffffffffa0c73dd0>] ? ll_prep_md_op_data+0x220/0x510 [lustre]
02:05:56:[14258.604559] [<ffffffffa0c88740>] ? ll_invalidate_negative_children+0x1d0/0x1d0 [lustre]
02:05:56:[14258.606285] [<ffffffffa0ca174e>] ? ll_dentry_init_security+0x2e/0x90 [lustre]
02:05:56:[14258.608026] [<ffffffffa0c8a8ee>] ll_lookup_it+0x27e/0xe40 [lustre]
02:05:56:[14258.609663] [<ffffffffa0c8b5cd>] ll_atomic_open+0x11d/0x11e0 [lustre]
02:05:56:[14258.611268] [<ffffffff812a88cc>] ? security_inode_permission+0x1c/0x30
02:05:56:[14258.612919] [<ffffffff8120d321>] do_last+0xa11/0x12a0
02:05:56:[14258.614392] [<ffffffff8120dc72>] path_openat+0xc2/0x490
02:05:56:[14258.615933] [<ffffffff8118055b>] ? unlock_page+0x2b/0x30
02:05:56:[14258.617410] [<ffffffff8120fdeb>] do_filp_open+0x4b/0xb0
02:05:56:[14258.618930] [<ffffffff8121ca67>] ? __alloc_fd+0xa7/0x130
02:05:56:[14258.620405] [<ffffffff811fd2f3>] do_sys_open+0xf3/0x1f0
02:05:56:[14258.621929] [<ffffffff811fd40e>] SyS_open+0x1e/0x20
02:05:56:[14258.623368] [<ffffffff816964c9>] system_call_fastpath+0x16/0x1b

The odd thing is that I didn't find any MDT service thread working on the open-create; perhaps the open-create was already done by the time the stack traces were dumped? What I can see on the MDS side is that the txg_sync thread was stuck:

[14300.192361] txg_sync D 0000000000000001 0 23196 2 0x00000080
[14300.194020] ffff8800467cbb20 0000000000000046 ffff880038e2de20 ffff8800467cbfd8
[14300.195757] ffff8800467cbfd8 ffff8800467cbfd8 ffff880038e2de20 ffff88007fd16c40
[14300.197507] 0000000000000000 7fffffffffffffff ffff880038e437d0 0000000000000001
[14300.199213] Call Trace:
[14300.200474] [<ffffffff8168b7c9>] schedule+0x29/0x70
[14300.202035] [<ffffffff81689219>] schedule_timeout+0x239/0x2d0
[14300.203583] [<ffffffff810c4ed2>] ? default_wake_function+0x12/0x20
[14300.205164] [<ffffffff810ba118>] ? __wake_up_common+0x58/0x90
[14300.206703] [<ffffffff81060aef>] ? kvm_clock_get_cycles+0x1f/0x30
[14300.208274] [<ffffffff8168ad6e>] io_schedule_timeout+0xae/0x130
[14300.209830] [<ffffffff810b12f6>] ? prepare_to_wait_exclusive+0x56/0x90
[14300.211437] [<ffffffff8168ae08>] io_schedule+0x18/0x20
[14300.212934] [<ffffffffa02d16d0>] cv_wait_common+0xb0/0x150 [spl]
[14300.214491] [<ffffffff810b1600>] ? wake_up_atomic_t+0x30/0x30
[14300.216043] [<ffffffffa02d17c8>] __cv_wait_io+0x18/0x20 [spl]
[14300.217600] [<ffffffffa0505ab3>] zio_wait+0x123/0x210 [zfs]
[14300.219131] [<ffffffffa04a84db>] spa_sync+0x75b/0xb70 [zfs]
[14300.220611] [<ffffffff810b162b>] ? autoremove_wake_function+0x2b/0x40
[14300.222179] [<ffffffff81689972>] ? mutex_lock+0x12/0x2f
[14300.223665] [<ffffffffa04ba81c>] txg_sync_thread+0x3cc/0x640 [zfs]
[14300.225225] [<ffffffffa04ba450>] ? txg_fini+0x2a0/0x2a0 [zfs]
[14300.226724] [<ffffffffa02cc851>] thread_generic_wrapper+0x71/0x80 [spl]
[14300.228287] [<ffffffffa02cc7e0>] ? __thread_exit+0x20/0x20 [spl]
[14300.229808] [<ffffffff810b052f>] kthread+0xcf/0xe0
[14300.231226] [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
[14300.232768] [<ffffffff81696658>] ret_from_fork+0x58/0x90
[14300.234226] [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140

I think this is perhaps another ZFS slow-sync issue. Alex, do you have any comments? Is there still an open ticket for this kind of problem? Thanks. |
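To confirm a slow-txg-sync theory like the one above, a quick sketch of the per-pool kstat one can check on the ZFS MDS; the pool name "lustre-mdt1" here is hypothetical, and the txgs kstat is the one exposed by ZFS on Linux:

```shell
# Show recent transaction-group history for the MDT's backing pool.
POOL=${POOL:-lustre-mdt1}          # hypothetical pool name; substitute yours
KSTAT=/proc/spl/kstat/zfs/$POOL/txgs

if [ -r "$KSTAT" ]; then
    # Columns include per-txg open/quiesce/wait/sync times in nanoseconds;
    # consistently large sync times point at the txg_sync stall seen above.
    tail -n 20 "$KSTAT"
else
    echo "no txg kstat at $KSTAT (not a ZFS server node, or pool name wrong?)"
fi
```

Sampling this while the test is running would show whether sync times balloon when the dd processes block.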
| Comment by nasf (Inactive) [ 03/Jan/17 ] |
|
+1 on master: |
| Comment by James Nunez (Inactive) [ 12/Jan/17 ] |
|
Wrong ticket |
| Comment by Andreas Dilger [ 13/Jan/17 ] |
|
+1 on master https://testing.hpdd.intel.com/test_sets/48ddbe22-d919-11e6-a61d-5254006e85c2 |
| Comment by Jian Yu [ 17/Jan/17 ] |
|
More failure instances in patch review test sessions on master branch: |
| Comment by Niu Yawei (Inactive) [ 18/Jan/17 ] |
|
I'm wondering if we should disable this test for ZFS temporarily until the ZFS slow-sync problem is fixed. What's your opinion, Andreas/Alex? |
| Comment by Andreas Dilger [ 18/Jan/17 ] |
|
Would this be fixed by the native ZFS dnode accounting patch from Jinshan? |
| Comment by Niu Yawei (Inactive) [ 19/Jan/17 ] |
|
Andreas, that could be helpful. Will that fix be deployed in our testing system soon? |
| Comment by Andreas Dilger [ 23/Jan/17 ] |
| Comment by Bruno Faccini (Inactive) [ 25/Jan/17 ] |
|
+2 more failures in patch review test sessions: https://testing.hpdd.intel.com/test_sessions/4819224e-de94-11e6-a009-5254006e85c2 https://testing.hpdd.intel.com/test_sessions/c16af6e2-e28e-11e6-8473-5254006e85c2 Both were wrongly assigned to LU-4448. |
| Comment by Bob Glossman (Inactive) [ 03/Feb/17 ] |
|
another on master: |
| Comment by Jian Yu [ 06/Feb/17 ] |
|
+1 on master: https://testing.hpdd.intel.com/test_sets/dd3ea466-ea33-11e6-ac20-5254006e85c2 |
| Comment by Sebastien Buisson (Inactive) [ 06/Feb/17 ] |
|
3 more instances on master: |
| Comment by Bob Glossman (Inactive) [ 10/Feb/17 ] |
|
another on master: |
| Comment by Bob Glossman (Inactive) [ 20/Feb/17 ] |
|
another on master: |
| Comment by Bob Glossman (Inactive) [ 25/Feb/17 ] |
|
another on master: |
| Comment by Bob Glossman (Inactive) [ 02/Mar/17 ] |
|
another on master: |
| Comment by Andreas Dilger [ 03/Mar/17 ] |
|
+1 on master ZFS https://testing.hpdd.intel.com/test_sets/d8a98be0-ffc9-11e6-9f56-5254006e85c2 |
| Comment by Joseph Gmitter (Inactive) [ 08/Mar/17 ] |
|
Per discussion with Niu, he believes this is caused by slow ZFS performance, and that the problem should be resolved once all the related ZFS patches are deployed to the test system. Niu recommends disabling the test if the ZFS patches cannot be deployed to the test system quickly. Niu, do you have any more detail you would like to add? |
| Comment by Alex Zhuravlev [ 08/Mar/17 ] |
|
|
| Comment by Gerrit Updater [ 09/Mar/17 ] |
|
James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/25919 |
| Comment by James Nunez (Inactive) [ 09/Mar/17 ] |
|
Uploaded a patch, https://review.whamcloud.com/25919, to add sanity-quota test 21 to the ALWAYS_EXCEPT list for ZFS testing, in case we decide to stop running the test temporarily. |
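For reference, a minimal sketch of what such an exclusion looks like, following the Lustre test-framework convention of skipping tests via ALWAYS_EXCEPT; the variable names mirror sanity-quota.sh from memory, not the actual patch:

```shell
# Backend filesystem type, normally set by the test harness.
FSTYPE=${FSTYPE:-zfs}
# Site/branch-specific exclusions, appended to by the script.
ALWAYS_EXCEPT="${SANITY_QUOTA_EXCEPT:-}"

if [ "$FSTYPE" = "zfs" ]; then
    # LU-8326: sanity-quota test_21 times out on ZFS; skip it until
    # the slow txg sync issue is resolved.
    ALWAYS_EXCEPT="$ALWAYS_EXCEPT 21"
fi

echo "ALWAYS_EXCEPT:$ALWAYS_EXCEPT"
```

Tests listed in ALWAYS_EXCEPT are reported as skipped rather than failed, so the exclusion is easy to revert once the underlying ZFS issue is fixed.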
| Comment by Andreas Dilger [ 13/Mar/17 ] |
|
As discussed today, we will try to get patch https://review.whamcloud.com/15294 " |
| Comment by Emoly Liu [ 14/Mar/17 ] |
|
+1 on master: |
| Comment by Henri Doreau (Inactive) [ 14/Mar/17 ] |
|
+1 on master: |
| Comment by Shuichi Ihara (Inactive) [ 26/Mar/17 ] |
|
+1 on master |
| Comment by Peter Jones [ 16/Apr/17 ] |
|
Has this been hit since |
| Comment by Niu Yawei (Inactive) [ 17/Apr/17 ] |
|
dup of |