Lustre / LU-8326

sanity-quota test_21 test failed to respond and timed out

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Fix Version/s: Lustre 2.10.0
    • Affects Version/s: Lustre 2.9.0, Lustre 2.10.0
    • Severity: 3

    Description

      This issue was created by maloo for nasf <fan.yong@intel.com>

      Please provide additional information about the failure here.

      This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/648de7b4-3a40-11e6-bbf5-5254006e85c2.

    Activity

            [LU-8326] sanity-quota test_21 test failed to respond and timed out
            adilger Andreas Dilger added a comment - https://review.whamcloud.com/15294

            niu Niu Yawei (Inactive) added a comment - Andreas, that could be helpful. Will that fix be deployed in our testing system soon?

            adilger Andreas Dilger added a comment - Would this be fixed by the native ZFS dnode accounting patch from Jinshan?
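
            As a hedged aside: the native dnode accounting referenced above corresponds to the userobj_accounting pool feature in ZFS on Linux 0.7+, so whether a given MDS pool could use it can be checked roughly as sketched below (the pool name lustre-mdt1 is a placeholder, not taken from this ticket):

            # Hedged sketch: native dnode accounting depends on the ZFS
            # "userobj_accounting" pool feature (ZFS on Linux 0.7+).
            # "lustre-mdt1" is a placeholder pool name, not from this ticket.
            zpool get feature@userobj_accounting lustre-mdt1
            # "active"/"enabled" means the pool can account dnodes natively;
            # "disabled" (or an error on older ZFS) means the patch can't help yet.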

            niu Niu Yawei (Inactive) added a comment - I'm wondering if we should disable this test for ZFS temporarily until the ZFS slow sync problem is fixed. What's your opinion, Andreas/Alex?
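
            A minimal sketch of what such a temporary disable could look like in lustre/tests/sanity-quota.sh, assuming the usual test-framework helpers (facet_fstype, ALWAYS_EXCEPT); this is an illustration, not the actual patch:

            # Hedged sketch, not an actual change from this ticket: exclude
            # test_21 on ZFS-backed servers until slow txg sync is resolved.
            if [ "$(facet_fstype mds1)" = "zfs" ]; then
                    # the test framework skips any test listed in ALWAYS_EXCEPT
                    ALWAYS_EXCEPT="$ALWAYS_EXCEPT 21"
            fi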
            yujian Jian Yu added a comment - More failure instances in patch review test sessions on master branch: https://testing.hpdd.intel.com/test_sets/6d09a4dc-dbad-11e6-8b41-5254006e85c2 https://testing.hpdd.intel.com/test_sets/ba9eb33c-daee-11e6-8752-5254006e85c2
            adilger Andreas Dilger added a comment - +1 on master https://testing.hpdd.intel.com/test_sets/48ddbe22-d919-11e6-a61d-5254006e85c2
            jamesanunez James Nunez (Inactive) added a comment (edited) - Wrong ticket
            yong.fan nasf (Inactive) added a comment - +1 on master: https://testing.hpdd.intel.com/test_sets/4f5129c4-d18f-11e6-a844-5254006e85c2

            niu Niu Yawei (Inactive) added a comment -

            Several 'dd' processes were blocked on the dir rwsem, like:

            02:05:56:[14258.624960] dd              D ffff88007add20c0     0 32124  32116 0x00000080
            02:05:56:[14258.626623]  ffff880052e4bc70 0000000000000082 ffff88000b628fb0 ffff880052e4bfd8
            02:05:56:[14258.628406]  ffff880052e4bfd8 ffff880052e4bfd8 ffff88000b628fb0 ffff88007add20b8
            02:05:56:[14258.630175]  ffff88007add20bc ffff88000b628fb0 00000000ffffffff ffff88007add20c0
            02:05:56:[14258.631880] Call Trace:
            02:05:56:[14258.633182]  [<ffffffff8168c669>] schedule_preempt_disabled+0x29/0x70
            02:05:56:[14258.634743]  [<ffffffff8168a2c5>] __mutex_lock_slowpath+0xc5/0x1c0
            02:05:56:[14258.636370]  [<ffffffff8168972f>] mutex_lock+0x1f/0x2f
            02:05:56:[14258.637839]  [<ffffffff8120cb9f>] do_last+0x28f/0x12a0
            02:05:56:[14258.639371]  [<ffffffff8120dc72>] path_openat+0xc2/0x490
            02:05:56:[14258.640849]  [<ffffffff8118055b>] ? unlock_page+0x2b/0x30
            02:05:56:[14258.642408]  [<ffffffff8120fdeb>] do_filp_open+0x4b/0xb0
            02:05:56:[14258.643862]  [<ffffffff8121ca67>] ? __alloc_fd+0xa7/0x130
            02:05:56:[14258.645382]  [<ffffffff811fd2f3>] do_sys_open+0xf3/0x1f0
            02:05:56:[14258.646831]  [<ffffffff811fd40e>] SyS_open+0x1e/0x20
            02:05:56:[14258.648296]  [<ffffffff816964c9>] system_call_fastpath+0x16/0x1b
            

            Because one 'dd' couldn't finish its open-create (or the open-create took too long), the others queued up behind it on the parent directory lock:

            02:05:56:[14258.566251] dd              S 000000000000f888     0 32123  32115 0x00000080
            02:05:56:[14258.567866]  ffff88007b03f700 0000000000000082 ffff88007c075e20 ffff88007b03ffd8
            02:05:56:[14258.569629]  ffff88007b03ffd8 ffff88007b03ffd8 ffff88007c075e20 ffff88007c0dc000
            02:05:56:[14258.571373]  ffff88007b03f730 0000000100d5b01c ffff88007c0dc000 000000000000f888
            02:05:56:[14258.573070] Call Trace:
            02:05:56:[14258.574276]  [<ffffffff8168b579>] schedule+0x29/0x70
            02:05:56:[14258.575745]  [<ffffffff81688f05>] schedule_timeout+0x175/0x2d0
            02:05:56:[14258.577234]  [<ffffffff81096820>] ? internal_add_timer+0x70/0x70
            02:05:56:[14258.578795]  [<ffffffffa09e3d50>] ptlrpc_set_wait+0x4c0/0x900 [ptlrpc]
            02:05:56:[14258.580347]  [<ffffffff810c4ec0>] ? wake_up_state+0x20/0x20
            02:05:56:[14258.581878]  [<ffffffffa09e420d>] ptlrpc_queue_wait+0x7d/0x220 [ptlrpc]
            02:05:56:[14258.583463]  [<ffffffffa09c9236>] ldlm_cli_enqueue+0x686/0x810 [ptlrpc]
            02:05:56:[14258.585097]  [<ffffffffa09c40b0>] ? ldlm_expired_completion_wait+0x250/0x250 [ptlrpc]
            02:05:56:[14258.586766]  [<ffffffffa0c88740>] ? ll_invalidate_negative_children+0x1d0/0x1d0 [lustre]
            02:05:56:[14258.588506]  [<ffffffffa0b3f1df>] mdc_enqueue_base+0x2bf/0x17d0 [mdc]
            02:05:56:[14258.590072]  [<ffffffffa0b4164b>] mdc_intent_lock+0x26b/0x520 [mdc]
            02:05:56:[14258.591682]  [<ffffffffa0c88740>] ? ll_invalidate_negative_children+0x1d0/0x1d0 [lustre]
            02:05:56:[14258.593387]  [<ffffffffa09c40b0>] ? ldlm_expired_completion_wait+0x250/0x250 [ptlrpc]
            02:05:56:[14258.595129]  [<ffffffffa0b0e6dd>] lmv_intent_lock+0xc0d/0x1b50 [lmv]
            02:05:56:[14258.596700]  [<ffffffff811144f2>] ? from_kgid+0x12/0x20
            02:05:56:[14258.598234]  [<ffffffffa0c89217>] ? ll_i2suppgid+0x37/0x40 [lustre]
            02:05:56:[14258.599804]  [<ffffffffa0c89244>] ? ll_i2gids+0x24/0xb0 [lustre]
            02:05:56:[14258.601401]  [<ffffffff811144f2>] ? from_kgid+0x12/0x20
            02:05:56:[14258.602856]  [<ffffffffa0c73dd0>] ? ll_prep_md_op_data+0x220/0x510 [lustre]
            02:05:56:[14258.604559]  [<ffffffffa0c88740>] ? ll_invalidate_negative_children+0x1d0/0x1d0 [lustre]
            02:05:56:[14258.606285]  [<ffffffffa0ca174e>] ? ll_dentry_init_security+0x2e/0x90 [lustre]
            02:05:56:[14258.608026]  [<ffffffffa0c8a8ee>] ll_lookup_it+0x27e/0xe40 [lustre]
            02:05:56:[14258.609663]  [<ffffffffa0c8b5cd>] ll_atomic_open+0x11d/0x11e0 [lustre]
            02:05:56:[14258.611268]  [<ffffffff812a88cc>] ? security_inode_permission+0x1c/0x30
            02:05:56:[14258.612919]  [<ffffffff8120d321>] do_last+0xa11/0x12a0
            02:05:56:[14258.614392]  [<ffffffff8120dc72>] path_openat+0xc2/0x490
            02:05:56:[14258.615933]  [<ffffffff8118055b>] ? unlock_page+0x2b/0x30
            02:05:56:[14258.617410]  [<ffffffff8120fdeb>] do_filp_open+0x4b/0xb0
            02:05:56:[14258.618930]  [<ffffffff8121ca67>] ? __alloc_fd+0xa7/0x130
            02:05:56:[14258.620405]  [<ffffffff811fd2f3>] do_sys_open+0xf3/0x1f0
            02:05:56:[14258.621929]  [<ffffffff811fd40e>] SyS_open+0x1e/0x20
            02:05:56:[14258.623368]  [<ffffffff816964c9>] system_call_fastpath+0x16/0x1b
            

            The weird thing is that I didn't find any MDT service thread working on the open-create; perhaps the open-create had already completed by the time the stack traces were dumped? What I can see on the MDS side is that the txg_sync thread was stuck:

            [14300.192361] txg_sync        D 0000000000000001     0 23196      2 0x00000080
            [14300.194020]  ffff8800467cbb20 0000000000000046 ffff880038e2de20 ffff8800467cbfd8
            [14300.195757]  ffff8800467cbfd8 ffff8800467cbfd8 ffff880038e2de20 ffff88007fd16c40
            [14300.197507]  0000000000000000 7fffffffffffffff ffff880038e437d0 0000000000000001
            [14300.199213] Call Trace:
            [14300.200474]  [<ffffffff8168b7c9>] schedule+0x29/0x70
            [14300.202035]  [<ffffffff81689219>] schedule_timeout+0x239/0x2d0
            [14300.203583]  [<ffffffff810c4ed2>] ? default_wake_function+0x12/0x20
            [14300.205164]  [<ffffffff810ba118>] ? __wake_up_common+0x58/0x90
            [14300.206703]  [<ffffffff81060aef>] ? kvm_clock_get_cycles+0x1f/0x30
            [14300.208274]  [<ffffffff8168ad6e>] io_schedule_timeout+0xae/0x130
            [14300.209830]  [<ffffffff810b12f6>] ? prepare_to_wait_exclusive+0x56/0x90
            [14300.211437]  [<ffffffff8168ae08>] io_schedule+0x18/0x20
            [14300.212934]  [<ffffffffa02d16d0>] cv_wait_common+0xb0/0x150 [spl]
            [14300.214491]  [<ffffffff810b1600>] ? wake_up_atomic_t+0x30/0x30
            [14300.216043]  [<ffffffffa02d17c8>] __cv_wait_io+0x18/0x20 [spl]
            [14300.217600]  [<ffffffffa0505ab3>] zio_wait+0x123/0x210 [zfs]
            [14300.219131]  [<ffffffffa04a84db>] spa_sync+0x75b/0xb70 [zfs]
            [14300.220611]  [<ffffffff810b162b>] ? autoremove_wake_function+0x2b/0x40
            [14300.222179]  [<ffffffff81689972>] ? mutex_lock+0x12/0x2f
            [14300.223665]  [<ffffffffa04ba81c>] txg_sync_thread+0x3cc/0x640 [zfs]
            [14300.225225]  [<ffffffffa04ba450>] ? txg_fini+0x2a0/0x2a0 [zfs]
            [14300.226724]  [<ffffffffa02cc851>] thread_generic_wrapper+0x71/0x80 [spl]
            [14300.228287]  [<ffffffffa02cc7e0>] ? __thread_exit+0x20/0x20 [spl]
            [14300.229808]  [<ffffffff810b052f>] kthread+0xcf/0xe0
            [14300.231226]  [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
            [14300.232768]  [<ffffffff81696658>] ret_from_fork+0x58/0x90
            [14300.234226]  [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
            

            I think this is perhaps another ZFS slow sync issue? Alex, do you have any comments? Is there still an open ticket for this kind of problem? Thanks.
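
            As a hedged diagnostic sketch for the slow-sync theory: ZFS on Linux exposes per-txg timing through a kstat, so one could check on the MDS how long recent txgs spent syncing (the pool name is a placeholder, not from this ticket):

            # Hedged sketch: the per-txg kstats report open/quiesce/wait/sync
            # times (otime/qtime/wtime/stime, in ns) for recent txgs.
            # "lustre-mdt1" is a placeholder pool name.
            cat /proc/spl/kstat/zfs/lustre-mdt1/txgs
            # Consistently large stime values would point at genuinely slow
            # txg sync rather than a deadlocked txg_sync thread.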


            jgmitter Joseph Gmitter (Inactive) added a comment -

            Hi Niu,

            Can you please have a look at this issue? James reports that there are 32 test failures on master in the past 10 days due to this issue.

            Thanks.
            Joe

            yong.fan nasf (Inactive) added a comment - +2 on master: https://testing.hpdd.intel.com/test_sets/d6b55222-c8a3-11e6-a95f-5254006e85c2 https://testing.hpdd.intel.com/test_sets/eca4660a-c889-11e6-a95f-5254006e85c2

            People

              Assignee: niu Niu Yawei (Inactive)
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 19
