[LU-8326] sanity-quota test_21 test failed to respond and timed out Created: 25/Jun/16  Updated: 17/Apr/17  Resolved: 17/Apr/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.9.0, Lustre 2.10.0
Fix Version/s: Lustre 2.10.0

Type: Bug Priority: Blocker
Reporter: Maloo Assignee: Niu Yawei (Inactive)
Resolution: Duplicate Votes: 0
Labels: zfs

Issue Links:
Related
is related to LU-2435 inode accounting in osd-zfs is racy Resolved
is related to LU-4448 Test failure timeout on sanity-quota ... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for nasf <fan.yong@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/648de7b4-3a40-11e6-bbf5-5254006e85c2.



 Comments   
Comment by Jian Yu [ 13/Jul/16 ]

One more instance on master branch: https://testing.hpdd.intel.com/test_sets/ab8deb42-487c-11e6-a80f-5254006e85c2

Comment by nasf (Inactive) [ 29/Jul/16 ]

Another one on master:
https://testing.hpdd.intel.com/test_sets/ed5db8b4-5483-11e6-a39e-5254006e85c2

Comment by Bob Glossman (Inactive) [ 04/Aug/16 ]

another on master:
https://testing.hpdd.intel.com/test_sets/93b4ce8a-5a85-11e6-aa74-5254006e85c2

Comment by Bob Glossman (Inactive) [ 06/Aug/16 ]

another on master:
https://testing.hpdd.intel.com/test_sessions/ba14ecf6-5bc3-11e6-aa74-5254006e85c2

Comment by James Nunez (Inactive) [ 20/Oct/16 ]

I've seen this issue several times recently. In the client log, the dd process is hung, but I'm not sure why:

08:26:46:[ 9196.946469] Lustre: DEBUG MARKER: Files for user (quota_usr):
08:26:46:[ 9240.580066] INFO: task dd:28549 blocked for more than 120 seconds.
08:26:46:[ 9240.581799] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
08:26:46:[ 9240.583601] dd              D ffff880016aeb2c0     0 28549  28540 0x00000080
08:26:46:[ 9240.585324]  ffff8800001cbc70 0000000000000086 ffff880079d19700 ffff8800001cbfd8
08:26:46:[ 9240.587145]  ffff8800001cbfd8 ffff8800001cbfd8 ffff880079d19700 ffff880016aeb2b8
08:26:46:[ 9240.589335]  ffff880016aeb2bc ffff880079d19700 00000000ffffffff ffff880016aeb2c0
08:26:46:[ 9240.591322] Call Trace:
08:26:46:[ 9240.592709]  [<ffffffff8163cae9>] schedule_preempt_disabled+0x29/0x70
08:26:46:[ 9240.594300]  [<ffffffff8163a7e5>] __mutex_lock_slowpath+0xc5/0x1c0
08:26:46:[ 9240.595846]  [<ffffffff81639c4f>] mutex_lock+0x1f/0x2f
08:26:46:[ 9240.597336]  [<ffffffff811ec11f>] do_last+0x28f/0x1270
08:26:46:[ 9240.598791]  [<ffffffff812f9062>] ? radix_tree_lookup_slot+0x22/0x50
08:26:46:[ 9240.600310]  [<ffffffff811eede2>] path_openat+0xc2/0x490
08:26:46:[ 9240.601744]  [<ffffffff811f05ab>] do_filp_open+0x4b/0xb0
08:26:46:[ 9240.603178]  [<ffffffff811fd147>] ? __alloc_fd+0xa7/0x130
08:26:46:[ 9240.604623]  [<ffffffff811ddf53>] do_sys_open+0xf3/0x1f0
08:26:46:[ 9240.606030]  [<ffffffff811de06e>] SyS_open+0x1e/0x20
08:26:46:[ 9240.607404]  [<ffffffff81646a09>] system_call_fastpath+0x16/0x1b

The recent failures that I've seen are all for ZFS servers.

Here are a few logs from recent failures:
https://testing.hpdd.intel.com/test_sets/4c947094-95f1-11e6-a96f-5254006e85c2
https://testing.hpdd.intel.com/test_sets/156ff7f6-964f-11e6-9722-5254006e85c2
https://testing.hpdd.intel.com/test_sets/3c2e0518-967c-11e6-9fed-5254006e85c2
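
For reference, the same blocked-task stacks can be pulled on demand from a hung client with SysRq 'w' (a generic kernel facility, nothing Lustre-specific here):

# dump stacks of all D-state (uninterruptible) tasks to the kernel log
echo w > /proc/sysrq-trigger
dmesg | tail -100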

Comment by Bob Glossman (Inactive) [ 04/Nov/16 ]

another on master:
https://testing.hpdd.intel.com/test_sets/54b0e7e2-a23c-11e6-8ed2-5254006e85c2

Comment by nasf (Inactive) [ 27/Nov/16 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/7ad90eb6-b4e5-11e6-b735-5254006e85c2

Comment by nasf (Inactive) [ 01/Dec/16 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/589cc6e6-b817-11e6-a76e-5254006e85c2

Comment by nasf (Inactive) [ 01/Dec/16 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/965a8882-b818-11e6-a76e-5254006e85c2

Comment by Jian Yu [ 08/Dec/16 ]

More failures on master branch in review-zfs-part-1 patch review test sessions:
https://testing.hpdd.intel.com/test_sets/0d49d972-bd67-11e6-9937-5254006e85c2
https://testing.hpdd.intel.com/test_sets/408bfcbe-bd1e-11e6-9937-5254006e85c2

Comment by Jian Yu [ 09/Dec/16 ]

This is blocking patch review testing for https://review.whamcloud.com/22588 on master branch.

Comment by Patrick Farrell (Inactive) [ 09/Dec/16 ]

Also blocking testing of https://review.whamcloud.com/#/c/22853/...

Comment by Patrick Farrell (Inactive) [ 09/Dec/16 ]

Referencing https://jira.hpdd.intel.com/browse/LU-2887...

Are we sure this test isn't simply too slow sometimes, given that it's only failing on ZFS? (Just something to consider; it doesn't really match up with the dd waiting on i_mutex that James shared.)
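
One way to check that would be to loop just this subtest on a ZFS setup and compare wall times; ONLY= is the standard test-framework knob for running a single subtest:

# from lustre/tests on the client, run only test_21 of sanity-quota
ONLY=21 sh ./sanity-quota.sh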

Comment by Patrick Farrell (Inactive) [ 09/Dec/16 ]

Another failure:
https://review.whamcloud.com/#/c/24253/
https://testing.hpdd.intel.com/test_sessions/6eeeb11c-be0b-11e6-8111-5254006e85c2

Comment by nasf (Inactive) [ 10/Dec/16 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/3bd42d8a-be5f-11e6-9f18-5254006e85c2

Comment by nasf (Inactive) [ 11/Dec/16 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/93be821e-bf2e-11e6-8560-5254006e85c2

Comment by nasf (Inactive) [ 23/Dec/16 ]

+2 on master:
https://testing.hpdd.intel.com/test_sets/d6b55222-c8a3-11e6-a95f-5254006e85c2
https://testing.hpdd.intel.com/test_sets/eca4660a-c889-11e6-a95f-5254006e85c2

Comment by Joseph Gmitter (Inactive) [ 28/Dec/16 ]

Hi Niu,

Can you please have a look at this issue? James reports that there have been 32 test failures on master in the past 10 days due to this issue.

Thanks.
Joe

Comment by Niu Yawei (Inactive) [ 29/Dec/16 ]

Several 'dd' processes were blocked on the parent directory's i_mutex, like:

02:05:56:[14258.624960] dd              D ffff88007add20c0     0 32124  32116 0x00000080
02:05:56:[14258.626623]  ffff880052e4bc70 0000000000000082 ffff88000b628fb0 ffff880052e4bfd8
02:05:56:[14258.628406]  ffff880052e4bfd8 ffff880052e4bfd8 ffff88000b628fb0 ffff88007add20b8
02:05:56:[14258.630175]  ffff88007add20bc ffff88000b628fb0 00000000ffffffff ffff88007add20c0
02:05:56:[14258.631880] Call Trace:
02:05:56:[14258.633182]  [<ffffffff8168c669>] schedule_preempt_disabled+0x29/0x70
02:05:56:[14258.634743]  [<ffffffff8168a2c5>] __mutex_lock_slowpath+0xc5/0x1c0
02:05:56:[14258.636370]  [<ffffffff8168972f>] mutex_lock+0x1f/0x2f
02:05:56:[14258.637839]  [<ffffffff8120cb9f>] do_last+0x28f/0x12a0
02:05:56:[14258.639371]  [<ffffffff8120dc72>] path_openat+0xc2/0x490
02:05:56:[14258.640849]  [<ffffffff8118055b>] ? unlock_page+0x2b/0x30
02:05:56:[14258.642408]  [<ffffffff8120fdeb>] do_filp_open+0x4b/0xb0
02:05:56:[14258.643862]  [<ffffffff8121ca67>] ? __alloc_fd+0xa7/0x130
02:05:56:[14258.645382]  [<ffffffff811fd2f3>] do_sys_open+0xf3/0x1f0
02:05:56:[14258.646831]  [<ffffffff811fd40e>] SyS_open+0x1e/0x20
02:05:56:[14258.648296]  [<ffffffff816964c9>] system_call_fastpath+0x16/0x1b

This is because one 'dd' couldn't finish its open-create (or the open-create took too long):

02:05:56:[14258.566251] dd              S 000000000000f888     0 32123  32115 0x00000080
02:05:56:[14258.567866]  ffff88007b03f700 0000000000000082 ffff88007c075e20 ffff88007b03ffd8
02:05:56:[14258.569629]  ffff88007b03ffd8 ffff88007b03ffd8 ffff88007c075e20 ffff88007c0dc000
02:05:56:[14258.571373]  ffff88007b03f730 0000000100d5b01c ffff88007c0dc000 000000000000f888
02:05:56:[14258.573070] Call Trace:
02:05:56:[14258.574276]  [<ffffffff8168b579>] schedule+0x29/0x70
02:05:56:[14258.575745]  [<ffffffff81688f05>] schedule_timeout+0x175/0x2d0
02:05:56:[14258.577234]  [<ffffffff81096820>] ? internal_add_timer+0x70/0x70
02:05:56:[14258.578795]  [<ffffffffa09e3d50>] ptlrpc_set_wait+0x4c0/0x900 [ptlrpc]
02:05:56:[14258.580347]  [<ffffffff810c4ec0>] ? wake_up_state+0x20/0x20
02:05:56:[14258.581878]  [<ffffffffa09e420d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
02:05:56:[14258.583463]  [<ffffffffa09c9236>] ldlm_cli_enqueue+0x686/0x810 [ptlrpc]
02:05:56:[14258.585097]  [<ffffffffa09c40b0>] ? ldlm_expired_completion_wait+0x250/0x250 [ptlrpc]
02:05:56:[14258.586766]  [<ffffffffa0c88740>] ? ll_invalidate_negative_children+0x1d0/0x1d0 [lustre]
02:05:56:[14258.588506]  [<ffffffffa0b3f1df>] mdc_enqueue_base+0x2bf/0x17d0 [mdc]
02:05:56:[14258.590072]  [<ffffffffa0b4164b>] mdc_intent_lock+0x26b/0x520 [mdc]
02:05:56:[14258.591682]  [<ffffffffa0c88740>] ? ll_invalidate_negative_children+0x1d0/0x1d0 [lustre]
02:05:56:[14258.593387]  [<ffffffffa09c40b0>] ? ldlm_expired_completion_wait+0x250/0x250 [ptlrpc]
02:05:56:[14258.595129]  [<ffffffffa0b0e6dd>] lmv_intent_lock+0xc0d/0x1b50 [lmv]
02:05:56:[14258.596700]  [<ffffffff811144f2>] ? from_kgid+0x12/0x20
02:05:56:[14258.598234]  [<ffffffffa0c89217>] ? ll_i2suppgid+0x37/0x40 [lustre]
02:05:56:[14258.599804]  [<ffffffffa0c89244>] ? ll_i2gids+0x24/0xb0 [lustre]
02:05:56:[14258.601401]  [<ffffffff811144f2>] ? from_kgid+0x12/0x20
02:05:56:[14258.602856]  [<ffffffffa0c73dd0>] ? ll_prep_md_op_data+0x220/0x510 [lustre]
02:05:56:[14258.604559]  [<ffffffffa0c88740>] ? ll_invalidate_negative_children+0x1d0/0x1d0 [lustre]
02:05:56:[14258.606285]  [<ffffffffa0ca174e>] ? ll_dentry_init_security+0x2e/0x90 [lustre]
02:05:56:[14258.608026]  [<ffffffffa0c8a8ee>] ll_lookup_it+0x27e/0xe40 [lustre]
02:05:56:[14258.609663]  [<ffffffffa0c8b5cd>] ll_atomic_open+0x11d/0x11e0 [lustre]
02:05:56:[14258.611268]  [<ffffffff812a88cc>] ? security_inode_permission+0x1c/0x30
02:05:56:[14258.612919]  [<ffffffff8120d321>] do_last+0xa11/0x12a0
02:05:56:[14258.614392]  [<ffffffff8120dc72>] path_openat+0xc2/0x490
02:05:56:[14258.615933]  [<ffffffff8118055b>] ? unlock_page+0x2b/0x30
02:05:56:[14258.617410]  [<ffffffff8120fdeb>] do_filp_open+0x4b/0xb0
02:05:56:[14258.618930]  [<ffffffff8121ca67>] ? __alloc_fd+0xa7/0x130
02:05:56:[14258.620405]  [<ffffffff811fd2f3>] do_sys_open+0xf3/0x1f0
02:05:56:[14258.621929]  [<ffffffff811fd40e>] SyS_open+0x1e/0x20
02:05:56:[14258.623368]  [<ffffffff816964c9>] system_call_fastpath+0x16/0x1b

The weird thing is that I couldn't find any MDT service thread working on the open-create; perhaps the open-create had already completed by the time the stack traces were dumped? What I can see on the MDS side is that the txg_sync thread was stuck:

[14300.192361] txg_sync        D 0000000000000001     0 23196      2 0x00000080
[14300.194020]  ffff8800467cbb20 0000000000000046 ffff880038e2de20 ffff8800467cbfd8
[14300.195757]  ffff8800467cbfd8 ffff8800467cbfd8 ffff880038e2de20 ffff88007fd16c40
[14300.197507]  0000000000000000 7fffffffffffffff ffff880038e437d0 0000000000000001
[14300.199213] Call Trace:
[14300.200474]  [<ffffffff8168b7c9>] schedule+0x29/0x70
[14300.202035]  [<ffffffff81689219>] schedule_timeout+0x239/0x2d0
[14300.203583]  [<ffffffff810c4ed2>] ? default_wake_function+0x12/0x20
[14300.205164]  [<ffffffff810ba118>] ? __wake_up_common+0x58/0x90
[14300.206703]  [<ffffffff81060aef>] ? kvm_clock_get_cycles+0x1f/0x30
[14300.208274]  [<ffffffff8168ad6e>] io_schedule_timeout+0xae/0x130
[14300.209830]  [<ffffffff810b12f6>] ? prepare_to_wait_exclusive+0x56/0x90
[14300.211437]  [<ffffffff8168ae08>] io_schedule+0x18/0x20
[14300.212934]  [<ffffffffa02d16d0>] cv_wait_common+0xb0/0x150 [spl]
[14300.214491]  [<ffffffff810b1600>] ? wake_up_atomic_t+0x30/0x30
[14300.216043]  [<ffffffffa02d17c8>] __cv_wait_io+0x18/0x20 [spl]
[14300.217600]  [<ffffffffa0505ab3>] zio_wait+0x123/0x210 [zfs]
[14300.219131]  [<ffffffffa04a84db>] spa_sync+0x75b/0xb70 [zfs]
[14300.220611]  [<ffffffff810b162b>] ? autoremove_wake_function+0x2b/0x40
[14300.222179]  [<ffffffff81689972>] ? mutex_lock+0x12/0x2f
[14300.223665]  [<ffffffffa04ba81c>] txg_sync_thread+0x3cc/0x640 [zfs]
[14300.225225]  [<ffffffffa04ba450>] ? txg_fini+0x2a0/0x2a0 [zfs]
[14300.226724]  [<ffffffffa02cc851>] thread_generic_wrapper+0x71/0x80 [spl]
[14300.228287]  [<ffffffffa02cc7e0>] ? __thread_exit+0x20/0x20 [spl]
[14300.229808]  [<ffffffff810b052f>] kthread+0xcf/0xe0
[14300.231226]  [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
[14300.232768]  [<ffffffff81696658>] ret_from_fork+0x58/0x90
[14300.234226]  [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140

I think this is perhaps another ZFS slow-sync issue? Alex, do you have any comments? Is there still an open ticket for this kind of problem? Thanks.
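
If someone can catch this on a live MDS, a quick check for the slow-sync theory would be the per-pool txg history kstat (ZFS on Linux; the pool name below is just a placeholder):

# otime/qtime/wtime/stime are per-txg times in nanoseconds; consistently
# large stime (sync time) would point at slow txg sync rather than Lustre
cat /proc/spl/kstat/zfs/lustre-mdt1/txgs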

Comment by nasf (Inactive) [ 03/Jan/17 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/4f5129c4-d18f-11e6-a844-5254006e85c2

Comment by James Nunez (Inactive) [ 12/Jan/17 ]

Wrong ticket

Comment by Andreas Dilger [ 13/Jan/17 ]

+1 on master https://testing.hpdd.intel.com/test_sets/48ddbe22-d919-11e6-a61d-5254006e85c2

Comment by Jian Yu [ 17/Jan/17 ]

More failure instances in patch review test sessions on master branch:
https://testing.hpdd.intel.com/test_sets/6d09a4dc-dbad-11e6-8b41-5254006e85c2
https://testing.hpdd.intel.com/test_sets/ba9eb33c-daee-11e6-8752-5254006e85c2

Comment by Niu Yawei (Inactive) [ 18/Jan/17 ]

I'm wondering if we should disable this test for ZFS temporarily until the ZFS slow-sync problem is fixed. What's your opinion, Andreas/Alex?

Comment by Andreas Dilger [ 18/Jan/17 ]

Would this be fixed by the native ZFS dnode accounting patch from Jinshan?

Comment by Niu Yawei (Inactive) [ 19/Jan/17 ]

Andreas, that could be helpful. Will that fix be deployed in our testing system soon?

Comment by Andreas Dilger [ 23/Jan/17 ]

https://review.whamcloud.com/15294
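
As I understand it, that patch moves osd-zfs over to the ZFS-native per-user/group dnode accounting, which depends on the userobj_accounting pool feature from ZFS 0.7+. Whether a given test pool has it can be checked with (pool name is a placeholder):

zpool get feature@userobj_accounting lustre-mdt1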

Comment by Bruno Faccini (Inactive) [ 25/Jan/17 ]

+2 more failures in patch review test sessions for LU-6499 on master branch:

https://testing.hpdd.intel.com/test_sessions/4819224e-de94-11e6-a009-5254006e85c2
https://testing.hpdd.intel.com/test_sessions/c16af6e2-e28e-11e6-8473-5254006e85c2

and, as usual, wrongly assigned to LU-4448.

Comment by Bob Glossman (Inactive) [ 03/Feb/17 ]

another on master:
https://testing.hpdd.intel.com/test_sets/40df5852-e184-11e6-8cc2-5254006e85c2

Comment by Jian Yu [ 06/Feb/17 ]

+1 on master:

https://testing.hpdd.intel.com/test_sets/dd3ea466-ea33-11e6-ac20-5254006e85c2

Comment by Sebastien Buisson (Inactive) [ 06/Feb/17 ]

3 more instances on master:
https://testing.hpdd.intel.com/test_sets/3f5a9cca-ea77-11e6-be3b-5254006e85c2
https://testing.hpdd.intel.com/test_sets/ce1f7286-ea7d-11e6-be3b-5254006e85c2
https://testing.hpdd.intel.com/test_sets/d7f08a76-ea7c-11e6-ac20-5254006e85c2

Comment by Bob Glossman (Inactive) [ 20/Feb/17 ]

another on master:
https://testing.hpdd.intel.com/test_sets/9c081ffa-f784-11e6-bd34-5254006e85c2

Comment by Bob Glossman (Inactive) [ 25/Feb/17 ]

another on master:
https://testing.hpdd.intel.com/test_sets/21489ab4-fb27-11e6-816d-5254006e85c2

Comment by Bob Glossman (Inactive) [ 02/Mar/17 ]

another on master:
https://testing.hpdd.intel.com/test_sets/3d56a15c-ff45-11e6-aa1f-5254006e85c2

Comment by Andreas Dilger [ 03/Mar/17 ]

+1 on master ZFS https://testing.hpdd.intel.com/test_sets/d8a98be0-ffc9-11e6-9f56-5254006e85c2

Comment by Joseph Gmitter (Inactive) [ 08/Mar/17 ]

Per discussion with Niu, he believes this is caused by slow ZFS performance; once all the related ZFS patches are deployed to the test system, the problem should be resolved. Niu recommends disabling the test if the ZFS patches cannot be deployed to the test system quickly.

Niu, do you have any more detail you would like to add?

Comment by Alex Zhuravlev [ 08/Mar/17 ]

The fix for LU-2435 has been around and ready for months... let's not disable the tests, but land the fix instead?

Comment by Gerrit Updater [ 09/Mar/17 ]

James Nunez (james.a.nunez@intel.com) uploaded a new patch: https://review.whamcloud.com/25919
Subject: LU-8326 test: skip sanity-quota 21 for ZFS
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 6af31b54dbf5783b067c24d0c712b1127b8f1ee1

Comment by James Nunez (Inactive) [ 09/Mar/17 ]

Uploaded a patch, https://review.whamcloud.com/25919, to add sanity-quota test 21 to the ALWAYS_EXCEPT list for ZFS testing, in case we decide to stop running the test temporarily.
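
The change itself is the usual one-liner in lustre/tests/sanity-quota.sh, roughly (see the patch for the exact form):

# skip test_21 when the MDS backend is ZFS
[ "$(facet_fstype $SINGLEMDS)" = "zfs" ] &&
	ALWAYS_EXCEPT="$ALWAYS_EXCEPT 21"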

Comment by Andreas Dilger [ 13/Mar/17 ]

As discussed today, we will try to get patch https://review.whamcloud.com/15294 "LU-2435 osd-zfs: use zfs native dnode accounting" landed, since that actually fixes the problem rather than skipping the test. It should be very close to landing.

Comment by Emoly Liu [ 14/Mar/17 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/4772f1a6-07d6-11e7-9ad1-5254006e85c2

Comment by Henri Doreau (Inactive) [ 14/Mar/17 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/8079c25a-085c-11e7-a090-5254006e85c2

Comment by Shuichi Ihara (Inactive) [ 26/Mar/17 ]

+1 on master
https://testing.hpdd.intel.com/test_sessions/e930b05a-d1c3-4ab9-ab8d-fa8e222f9d54

Comment by Peter Jones [ 16/Apr/17 ]

Has this been hit since LU-2435 landed on March 30th? If not, then can we mark it resolved as a dupe?

Comment by Niu Yawei (Inactive) [ 17/Apr/17 ]

dup of LU-2435.
