
LU-11235: sanity: test_230d timeout on ZFS backends

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor

    Description

      This issue was created by maloo for Lai Siyao <lai.siyao@whamcloud.com>

      This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/b3a1452a-9c82-11e8-8ee3-52540065bddc

      The MDS backtrace (https://testing.whamcloud.com/test_logs/b50461b8-9c82-11e8-8ee3-52540065bddc/show_text) shows the service thread is stuck in txg_wait_synced():

      [10285.072535] ldlm_cn00_002   D ffff9825b67adee0     0 32218      2 0x00000080
      [10285.073583] Call Trace:
      [10285.073963]  [<ffffffff8b914029>] schedule+0x29/0x70
      [10285.074659]  [<ffffffffc05da205>] cv_wait_common+0x125/0x150 [spl]
      [10285.075360]  [<ffffffff8b2bc610>] ? wake_up_atomic_t+0x30/0x30
      [10285.076201]  [<ffffffffc05da245>] __cv_wait+0x15/0x20 [spl]
      [10285.076939]  [<ffffffffc0704bdf>] txg_wait_synced+0xef/0x140 [zfs]
      [10285.077663]  [<ffffffffc1348a43>] osd_sync+0xc3/0x140 [osd_zfs]
      [10285.078507]  [<ffffffffc1163a5b>] tgt_sync+0x14b/0x270 [ptlrpc]
      [10285.079287]  [<ffffffffc1165f32>] tgt_blocking_ast+0x302/0x630 [ptlrpc]
      [10285.080043]  [<ffffffffc10b6e5a>] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
      [10285.080787]  [<ffffffffc10b7156>] ldlm_lock_cancel+0x56/0x1f0 [ptlrpc]
      [10285.081492]  [<ffffffffc10dc06b>] ldlm_request_cancel+0x18b/0x730 [ptlrpc]
      [10285.082266]  [<ffffffffc10df5ea>] ldlm_handle_cancel+0xba/0x250 [ptlrpc]
      [10285.082964]  [<ffffffffc10df8d8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
      [10285.083688]  [<ffffffffc111040b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [10285.084449]  [<ffffffff8b2c52ab>] ? __wake_up_common+0x5b/0x90
      [10285.085095]  [<ffffffffc1113c44>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
      [10285.085752]  [<ffffffffc1113130>] ? ptlrpc_register_service+0xe90/0xe90 [ptlrpc]
      [10285.086464]  [<ffffffff8b2bb621>] kthread+0xd1/0xe0
      [10285.087009]  [<ffffffff8b2bb550>] ? insert_kthread_work+0x40/0x40
      [10285.087626]  [<ffffffff8b9205f7>] ret_from_fork_nospec_begin+0x21/0x21
      [10285.088259]  [<ffffffff8b2bb550>] ? insert_kthread_work+0x40/0x40
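
      For context, osd_sync() on osd-zfs forces the currently open transaction group out to disk via txg_wait_synced() (as the backtrace above shows), so the lock-cancel service thread cannot finish until the pool's sync thread completes its I/O. The following is a minimal user-space analogue of that blocking relationship, for illustration only, not the Lustre or ZFS source:

      /*
       * Illustration only: a user-space model of txg_wait_synced().
       * Waiters block until the sync thread has pushed out the txg they
       * care about; if the sync thread is stuck on I/O, every waiter
       * (here, the "lock cancel" thread doing the osd_sync() step) hangs.
       * Build with: cc -o txgwait txgwait.c -lpthread
       */
      #include <pthread.h>
      #include <stdio.h>
      #include <unistd.h>

      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;
      static unsigned long open_txg = 2;    /* txg currently accepting changes */
      static unsigned long synced_txg = 1;  /* last txg known to be on disk */

      static void txg_wait_synced_model(unsigned long txg)
      {
              pthread_mutex_lock(&lock);
              if (txg == 0)
                      txg = open_txg;       /* "sync whatever is open now" */
              while (synced_txg < txg)      /* the cv_wait_common() in the trace */
                      pthread_cond_wait(&cv, &lock);
              pthread_mutex_unlock(&lock);
      }

      static void *sync_thread(void *arg)   /* stands in for txg_sync */
      {
              for (;;) {
                      sleep(1);             /* "write out the txg"; if this I/O
                                             * never completed, waiters would
                                             * hang forever */
                      pthread_mutex_lock(&lock);
                      synced_txg = open_txg++;
                      pthread_cond_broadcast(&cv);
                      pthread_mutex_unlock(&lock);
              }
              return NULL;
      }

      static void *lock_cancel_thread(void *arg)
      {
              /* the tgt_blocking_ast() -> tgt_sync() -> osd_sync() step */
              printf("cancel thread: waiting for txg to sync\n");
              txg_wait_synced_model(0);
              printf("cancel thread: txg synced, cancel can proceed\n");
              return NULL;
      }

      int main(void)
      {
              pthread_t s, c;

              pthread_create(&s, NULL, sync_thread, NULL);
              pthread_create(&c, NULL, lock_cancel_thread, NULL);
              pthread_join(c, NULL);
              return 0;
      }

      If the sleep() standing in for the txg write never returned, the cancel thread would hang exactly like ldlm_cn00_002 in the trace.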
      


          Activity


            adilger Andreas Dilger added a comment -

            +1 on master review-dne-selinux-ssk:
            https://testing.whamcloud.com/test_sets/dd21c09d-9d88-44e4-9e8e-2f7e93045034

            There are no errors in any of the logs; it just looks like it is taking a long time?
            yujian Jian Yu added a comment -

            Occurred again on Lustre b2_10 branch:
            https://testing.whamcloud.com/test_sets/ff7e4cd0-e698-11e8-bfe1-52540065bddc
            pjones Peter Jones added a comment -

            >  BTW which version of ZFS is deployed in autotest?

            Will be 0.7.9 IIUC


            bzzz Alex Zhuravlev added a comment -

            Another theoretical possibility would be a missing I/O completion callback or some race in that area.
            laisiyao Lai Siyao added a comment -

            I agree. There are two possible ZFS issues that need investigation:
            1. transactions in sync mode may not work.
            2. txg_wait_synced() may hang if called frequently.

            I looked through the latest commits in the ZFS code but didn't find a related fix. BTW, which version of ZFS is deployed in autotest?
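
            One crude way to poke at the second point from user space (an assumption, not something taken from this ticket) is to time back-to-back sync requests against a plain ZFS filesystem on the same pool; syncfs() on ZFS should end up waiting on the same TXG machinery, though it only approximates the server-side osd_sync() path. The /tank/probe path below is a placeholder:

            /*
             * Crude user-space probe for "txg_wait_synced() may hang if called
             * frequently": time back-to-back syncfs() calls on a ZFS filesystem.
             * /tank/probe is a placeholder; point it at a dataset on the pool
             * under test.  Build with: cc -o syncprobe syncprobe.c
             */
            #define _GNU_SOURCE
            #include <fcntl.h>
            #include <stdio.h>
            #include <time.h>
            #include <unistd.h>

            int main(int argc, char **argv)
            {
                    const char *dir = argc > 1 ? argv[1] : "/tank/probe";
                    struct timespec t0, t1;
                    int fd = open(dir, O_RDONLY | O_DIRECTORY);

                    if (fd < 0) {
                            perror(dir);
                            return 1;
                    }
                    for (int i = 0; i < 100; i++) {
                            clock_gettime(CLOCK_MONOTONIC, &t0);
                            if (syncfs(fd) != 0) {  /* should reach the TXG sync path */
                                    perror("syncfs");
                                    return 1;
                            }
                            clock_gettime(CLOCK_MONOTONIC, &t1);
                            printf("sync %3d: %.3f s\n", i,
                                   (t1.tv_sec - t0.tv_sec) +
                                   (t1.tv_nsec - t0.tv_nsec) / 1e9);
                    }
                    close(fd);
                    return 0;
            }

            If the latencies climb into tens of seconds, or the calls simply stop returning, that would point at the ZFS side rather than at the migration patches.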

            adilger Andreas Dilger added a comment -

            I guess the important question is not whether the test is failing, but whether the test represents something that will happen in the real world when directory restripe is being used. If yes, then it doesn't help us to disable the test, because it just means that users will get similar hangs on their production systems.
            laisiyao Lai Siyao added a comment -

            This may be an old issue, because the same symptom appears on 2.10 as well, see LDEV-309 and LU-8601.

            I've enabled full debug and checked the logs: each time it's stuck in osd_sync() --> txg_wait_synced(). With the LU-4684 patches this happens more often because the new directory migration implementation triggers more lock revokes, which trigger Sync-on-Lock-Cancel more often. Per Alex's suggestion in LU-8601, I'd suggest disabling this test on ZFS systems, otherwise the patch for LU-4684 can't pass this test.

            Andreas, what's your opinion?

            adilger Andreas Dilger added a comment -

            I see some other stuck threads; in particular txg_sync is waiting on I/O to complete, and a second thread is also stuck in the same place (I'm not sure what the in:imjournal thread is for):

            [10284.337042] in:imjournal    D ffff9825fafbdee0     0  1304      1 0x00000080
            [10284.337732] Call Trace:
            [10284.338475]  [<ffffffff8b914029>] schedule+0x29/0x70
            [10284.338958]  [<ffffffff8b911999>] schedule_timeout+0x239/0x2c0
            [10284.341130]  [<ffffffff8b91353d>] io_schedule_timeout+0xad/0x130
            [10284.341696]  [<ffffffff8b9135d8>] io_schedule+0x18/0x20
            [10284.342205]  [<ffffffff8b911fc1>] bit_wait_io+0x11/0x50
            [10284.342703]  [<ffffffff8b911ae7>] __wait_on_bit+0x67/0x90
            [10284.343789]  [<ffffffff8b3936d1>] wait_on_page_bit+0x81/0xa0
            [10284.344939]  [<ffffffff8b3a4e7b>] truncate_inode_pages_range+0x42b/0x700
            [10284.348657]  [<ffffffff8b3a51bf>] truncate_inode_pages_final+0x4f/0x60
            [10284.349288]  [<ffffffffc047610f>] ext4_evict_inode+0x11f/0x4c0 [ext4]
            [10284.349899]  [<ffffffff8b4387d4>] evict+0xb4/0x180
            [10284.350372]  [<ffffffff8b4390dc>] iput+0xfc/0x190
            [10284.350816]  [<ffffffff8b4339a0>] __dentry_kill+0x120/0x180
            [10284.351361]  [<ffffffff8b433aa9>] dput+0xa9/0x160
            [10284.351811]  [<ffffffff8b42db08>] SYSC_renameat2+0x518/0x5a0
            [10284.355455]  [<ffffffff8b42ea0e>] SyS_renameat2+0xe/0x10
            [10284.355977]  [<ffffffff8b42ea4e>] SyS_rename+0x1e/0x20
            
            [10285.072535] ldlm_cn00_002   D ffff9825b67adee0     0 32218      2 0x00000080
            [10285.073583] Call Trace:
            [10285.073963]  [<ffffffff8b914029>] schedule+0x29/0x70
            [10285.074659]  [<ffffffffc05da205>] cv_wait_common+0x125/0x150 [spl]
            [10285.076201]  [<ffffffffc05da245>] __cv_wait+0x15/0x20 [spl]
            [10285.076939]  [<ffffffffc0704bdf>] txg_wait_synced+0xef/0x140 [zfs]
            [10285.077663]  [<ffffffffc1348a43>] osd_sync+0xc3/0x140 [osd_zfs]
            [10285.078507]  [<ffffffffc1163a5b>] tgt_sync+0x14b/0x270 [ptlrpc]
            [10285.079287]  [<ffffffffc1165f32>] tgt_blocking_ast+0x302/0x630 [ptlrpc]
            [10285.080043]  [<ffffffffc10b6e5a>] ldlm_cancel_callback+0x8a/0x330 [ptlrpc]
            [10285.080787]  [<ffffffffc10b7156>] ldlm_lock_cancel+0x56/0x1f0 [ptlrpc]
            [10285.081492]  [<ffffffffc10dc06b>] ldlm_request_cancel+0x18b/0x730 [ptlrpc]
            [10285.082266]  [<ffffffffc10df5ea>] ldlm_handle_cancel+0xba/0x250 [ptlrpc]
            [10285.082964]  [<ffffffffc10df8d8>] ldlm_cancel_handler+0x158/0x590 [ptlrpc]
            [10285.083688]  [<ffffffffc111040b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
            [10285.085095]  [<ffffffffc1113c44>] ptlrpc_main+0xb14/0x1fb0 [ptlrpc]
            [10285.086464]  [<ffffffff8b2bb621>] kthread+0xd1/0xe0
            
            [10285.745845] txg_sync        D ffff9825d9792f70     0  1592      2 0x00000080
            [10285.746553] Call Trace:
            [10285.747373]  [<ffffffff8b914029>] schedule+0x29/0x70
            [10285.747843]  [<ffffffff8b911999>] schedule_timeout+0x239/0x2c0
            [10285.749545]  [<ffffffff8b91353d>] io_schedule_timeout+0xad/0x130
            [10285.750737]  [<ffffffff8b9135d8>] io_schedule+0x18/0x20
            [10285.751252]  [<ffffffffc05da192>] cv_wait_common+0xb2/0x150 [spl]
            [10285.752390]  [<ffffffffc05da268>] __cv_wait_io+0x18/0x20 [spl]
            [10285.752994]  [<ffffffffc075d023>] zio_wait+0x113/0x1c0 [zfs]
            [10285.753560]  [<ffffffffc0711bc1>] vdev_config_sync+0xf1/0x180 [zfs]
            [10285.754199]  [<ffffffffc06f1a9c>] spa_sync+0xa1c/0xd90 [zfs]
            [10285.755372]  [<ffffffffc0705c41>] txg_sync_thread+0x301/0x510 [zfs]
            [10285.756565]  [<ffffffffc05d5013>] thread_generic_wrapper+0x73/0x80 [spl]
            [10285.757786]  [<ffffffff8b2bb621>] kthread+0xd1/0xe0
            

            There is a second txg_sync thread that doesn't appear to be doing anything; I'm not sure if that is for a different MDT zpool:

            [10285.004345] txg_sync        S ffff9825e40b9fa0     0 31971      2 0x00000080
            [10285.005189] Call Trace:
            [10285.005442]  [<ffffffff8b914029>] schedule+0x29/0x70
            [10285.005912]  [<ffffffff8b9118d4>] schedule_timeout+0x174/0x2c0
            [10285.007064]  [<ffffffffc05da362>] __cv_timedwait_common+0xd2/0x190 [spl]
            [10285.008255]  [<ffffffffc05da476>] __cv_timedwait_sig+0x16/0x20 [spl]
            [10285.008890]  [<ffffffffc0705add>] txg_sync_thread+0x19d/0x510 [zfs]
            [10285.010105]  [<ffffffffc05d5013>] thread_generic_wrapper+0x73/0x80 [spl]
            [10285.011345]  [<ffffffff8b2bb621>] kthread+0xd1/0xe0
            

            This could conceivably be stuck waiting for I/O from the VM host to complete; I'm not sure yet, so I've resubmitted this test session to re-run and see if the problem hits again. I wouldn't want to just accept a 1/2 failure rate for landing, but it is possible this is a freak failure. Resubmitting the patch with several lines of

            Test-Parameters: mdtbackfstype=zfs ostbackfstype=zfs mdscount=2 mdtcount=2 testlist=sanity,sanity,sanity,sanity

            would tell us whether this was a freak accident or a regular problem.


            adilger Andreas Dilger added a comment -

            Given that this test is "check migrate big directory", it seems probable that this relates to the patches themselves. The txg_wait_synced() call is the equivalent of being stuck in jbd2_journal_stop() - it is waiting for the transaction to commit, but something is keeping the transaction open for a long time. Normally ZFS will commit a TXG every 1s, so I suspect some kind of reference leak.
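
            If that reading is right, the shape of the problem is: the open txg cannot be quiesced and synced while a transaction hold on it is outstanding, so one leaked hold (the ZFS analogue of a jbd2 handle that is never stopped) leaves every txg_wait_synced() caller waiting indefinitely. A toy user-space model of that shape, for illustration only, not ZFS code:

            /*
             * Toy model of "something is keeping the transaction open": the
             * sync thread cannot retire the txg while tx_holds > 0, so a hold
             * that is never dropped blocks every txg_wait_synced() caller.
             * Build with: cc -o txgleak txgleak.c -lpthread
             * Run with no arguments to see the hang; pass any argument to
             * release the hold and let it complete.
             */
            #include <pthread.h>
            #include <stdio.h>

            static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
            static pthread_cond_t  cv   = PTHREAD_COND_INITIALIZER;
            static int tx_holds;          /* open transactions in the current txg */
            static int txg_synced;        /* set once the txg has been written out */

            static void tx_enter(void)    /* like starting a tx / jbd2 handle */
            {
                    pthread_mutex_lock(&lock);
                    tx_holds++;
                    pthread_mutex_unlock(&lock);
            }

            static void tx_exit(void)     /* like committing / stopping the handle */
            {
                    pthread_mutex_lock(&lock);
                    tx_holds--;
                    pthread_cond_broadcast(&cv);
                    pthread_mutex_unlock(&lock);
            }

            static void *sync_thread(void *arg)
            {
                    pthread_mutex_lock(&lock);
                    while (tx_holds > 0)  /* txg cannot quiesce with holds open */
                            pthread_cond_wait(&cv, &lock);
                    txg_synced = 1;       /* "wrote" the txg to disk */
                    pthread_cond_broadcast(&cv);
                    pthread_mutex_unlock(&lock);
                    return NULL;
            }

            static void txg_wait_synced_model(void)
            {
                    pthread_mutex_lock(&lock);
                    while (!txg_synced)
                            pthread_cond_wait(&cv, &lock);
                    pthread_mutex_unlock(&lock);
            }

            int main(int argc, char **argv)
            {
                    pthread_t s;

                    tx_enter();           /* a transaction that is never committed */
                    pthread_create(&s, NULL, sync_thread, NULL);

                    if (argc > 1)
                            tx_exit();    /* releasing the hold lets everything finish */

                    printf("waiting for txg to sync...\n");
                    txg_wait_synced_model();  /* hangs when the hold is leaked */
                    printf("txg synced\n");
                    pthread_join(s, NULL);
                    return 0;
            }

            The rough in-kernel equivalent would be a transaction from the migration path that is assigned to a txg but never committed or aborted, which would match "something is keeping the transaction open for a long time" (speculation on the mechanism, not something confirmed in this ticket).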
            laisiyao Lai Siyao added a comment -

            The test log shows it was stuck there for over 170 minutes.

            People

              Assignee: wc-triage WC Triage
              Reporter: maloo Maloo
              Votes: 0
              Watchers: 6
