
[LU-4950] sanity-benchmark test fsx hung: txg_sync was stuck on OSS

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • None
    • Affects Version/s: Lustre 2.6.0, Lustre 2.5.1, Lustre 2.7.0, Lustre 2.4.3, Lustre 2.5.3, Lustre 2.8.0, Lustre 2.10.0
    • 3
    • 13698

    Description

      While running sanity-benchmark test fsx with the following configuration, the test timed out after 3600 seconds.

      MDSCOUNT=1
      MDSSIZE=2097152
      OSTCOUNT=2
      OSTSIZE=8388608
      FSTYPE=zfs
      SLOW=yes
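
      For reference, a minimal sketch of how this configuration could be reproduced with the in-tree test framework; the llmount.sh setup step, the script invocation, and the ONLY= filter are assumptions based on the standard lustre/tests layout, not taken from this report:

      # minimal sketch, assuming the standard lustre/tests framework; sizes are in KB
      cd lustre/tests
      export FSTYPE=zfs SLOW=yes
      export MDSCOUNT=1 MDSSIZE=2097152       # one 2 GB MDT
      export OSTCOUNT=2 OSTSIZE=8388608       # two 8 GB OSTs
      sh llmount.sh                           # format and mount the test filesystem
      ONLY=fsx sh sanity-benchmark.sh         # run only the fsx subtest (timed out after 3600s here)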
      

      txg_sync was stuck on OSS node:

      txg_sync      D 0000000000000001     0  2665      2 0x00000080
       ffff88006f4cbba0 0000000000000046 00000000ffffffff 000037007418a66d
       ffff88006f4cbb10 ffff88007cba1a60 0000000000bba084 ffffffffaca9e91e
       ffff880070bc05f8 ffff88006f4cbfd8 000000000000fbc8 ffff880070bc05f8
      Call Trace:
       [<ffffffff810a7091>] ? ktime_get_ts+0xb1/0xf0
       [<ffffffff81528823>] io_schedule+0x73/0xc0
       [<ffffffffa0142e7c>] cv_wait_common+0x8c/0x100 [spl]
       [<ffffffff8109b290>] ? autoremove_wake_function+0x0/0x40
       [<ffffffffa0142f08>] __cv_wait_io+0x18/0x20 [spl]
       [<ffffffffa02864ab>] zio_wait+0xfb/0x1b0 [zfs]
       [<ffffffffa021e5a5>] dsl_pool_sync+0x2f5/0x540 [zfs]
       [<ffffffffa023725e>] spa_sync+0x40e/0xa00 [zfs]
       [<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10
       [<ffffffffa024253a>] txg_sync_thread+0x27a/0x4b0 [zfs]
       [<ffffffff81059329>] ? set_user_nice+0xc9/0x130
       [<ffffffffa02422c0>] ? txg_sync_thread+0x0/0x4b0 [zfs]
       [<ffffffffa013e9bf>] thread_generic_wrapper+0x5f/0x70 [spl]
       [<ffffffffa013e960>] ? thread_generic_wrapper+0x0/0x70 [spl]
       [<ffffffff8109aee6>] kthread+0x96/0xa0
       [<ffffffff8100c20a>] child_rip+0xa/0x20
       [<ffffffff8109ae50>] ? kthread+0x0/0xa0
       [<ffffffff8100c200>] ? child_rip+0x0/0x20
      

      Maloo report: https://maloo.whamcloud.com/test_sets/7ec85f26-cb6e-11e3-95c9-52540035b04c

      Attachments

        Issue Links

          Activity

            [LU-4950] sanity-benchmark test fsx hung: txg_sync was stuck on OSS

            bzzz Alex Zhuravlev added a comment - a duplicate of LU-5242

            bzzz Alex Zhuravlev added a comment - Sarah, I can't find ofd_destroy() in the traces, I doubt this is LU-4950..
            sarah Sarah Liu added a comment -

            On the master branch (tag 2.9.53, el7, zfs), the same error is still occurring; reopening this to see if anything else is needed.
            https://testing.hpdd.intel.com/test_sets/19e62a78-f8f9-11e6-aac4-5254006e85c2

            oss console

            07:34:00:[ 3545.939658] Lustre: DEBUG MARKER: == obdfilter-survey test 1a: Object Storage Targets survey =========================================== 07:23:02 (1487748182)
            07:34:00:[ 3546.225025] Lustre: DEBUG MARKER: lctl dl | grep obdfilter
            07:34:00:[ 3546.764980] Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d '@'
            07:34:00:[ 3547.846310] Lustre: Echo OBD driver; http://www.lustre.org/
            07:34:00:[ 4200.108290] INFO: task txg_sync:22316 blocked for more than 120 seconds.
            07:34:00:[ 4200.110327] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            07:34:00:
            07:34:00:[ 4200.112402] txg_sync        D 0000000000000001     0 22316      2 0x00000080
            07:34:00:[ 4200.116024]  ffff88003b79fac0 0000000000000046 ffff88003a438fb0 ffff88003b79ffd8
            07:34:00:[ 4200.118769]  ffff88003b79ffd8 ffff88003b79ffd8 ffff88003a438fb0 ffff88007fd16c40
            07:34:00:[ 4200.120870]  0000000000000000 7fffffffffffffff ffff88003bd09908 0000000000000001
            07:34:00:[ 4200.123100] Call Trace:
            07:34:00:[ 4200.124786]  [<ffffffff8168bac9>] schedule+0x29/0x70
            07:34:00:[ 4200.126894]  [<ffffffff81689519>] schedule_timeout+0x239/0x2d0
            07:34:00:[ 4200.128916]  [<ffffffff810c4fe2>] ? default_wake_function+0x12/0x20
            07:34:00:[ 4200.130962]  [<ffffffff810ba238>] ? __wake_up_common+0x58/0x90
            07:34:00:[ 4200.133032]  [<ffffffff81060c1f>] ? kvm_clock_get_cycles+0x1f/0x30
            07:34:00:[ 4200.135083]  [<ffffffff8168b06e>] io_schedule_timeout+0xae/0x130
            07:34:00:[ 4200.137129]  [<ffffffff810b1416>] ? prepare_to_wait_exclusive+0x56/0x90
            07:34:00:[ 4200.139259]  [<ffffffff8168b108>] io_schedule+0x18/0x20
            07:34:00:[ 4200.141213]  [<ffffffffa0679617>] cv_wait_common+0xa7/0x130 [spl]
            07:34:00:[ 4200.143303]  [<ffffffff810b1720>] ? wake_up_atomic_t+0x30/0x30
            07:34:00:[ 4200.145316]  [<ffffffffa06796f8>] __cv_wait_io+0x18/0x20 [spl]
            07:34:00:[ 4200.147464]  [<ffffffffa07d351b>] zio_wait+0x10b/0x1f0 [zfs]
            07:34:00:[ 4200.149508]  [<ffffffffa075ccdf>] dsl_pool_sync+0xbf/0x440 [zfs]
            07:34:00:[ 4200.151553]  [<ffffffffa0777868>] spa_sync+0x388/0xb50 [zfs]
            07:34:00:[ 4200.153602]  [<ffffffff810b174b>] ? autoremove_wake_function+0x2b/0x40
            07:34:00:[ 4200.155664]  [<ffffffff81689c72>] ? mutex_lock+0x12/0x2f
            07:34:00:[ 4200.157616]  [<ffffffffa07894e5>] txg_sync_thread+0x3c5/0x620 [zfs]
            07:34:00:[ 4200.159653]  [<ffffffffa0789120>] ? txg_init+0x280/0x280 [zfs]
            07:34:00:[ 4200.161605]  [<ffffffffa0674851>] thread_generic_wrapper+0x71/0x80 [spl]
            07:34:00:[ 4200.163625]  [<ffffffffa06747e0>] ? __thread_exit+0x20/0x20 [spl]
            07:34:00:[ 4200.165586]  [<ffffffff810b064f>] kthread+0xcf/0xe0
            07:34:00:[ 4200.167428]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
            07:34:00:[ 4200.169377]  [<ffffffff81696958>] ret_from_fork+0x58/0x90
            07:34:00:[ 4200.171244]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
            07:34:00:[ 4200.173334] INFO: task lctl:3231 blocked for more than 120 seconds.
            07:34:00:[ 4200.175136] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            07:34:00:[ 4200.177040] lctl            D 0000000000000000     0  3231   3197 0x00000080
            07:34:00:[ 4200.178923]  ffff88003d3db948 0000000000000082 ffff88001d2f9f60 ffff88003d3dbfd8
            07:34:00:[ 4200.181132]  ffff88003d3dbfd8 ffff88003d3dbfd8 ffff88001d2f9f60 ffff88003b9c5b50
            07:34:00:[ 4200.183052]  ffff88003b9c5a10 ffff88003b9c5b58 ffff88003b9c5a38 0000000000000000
            07:34:00:[ 4200.185136] Call Trace:
            07:34:00:[ 4200.186747]  [<ffffffff8168bac9>] schedule+0x29/0x70
            07:34:00:[ 4200.188372]  [<ffffffffa067967d>] cv_wait_common+0x10d/0x130 [spl]
            07:34:00:[ 4200.190210]  [<ffffffff810b1720>] ? wake_up_atomic_t+0x30/0x30
            07:34:00:[ 4200.191823]  [<ffffffffa06796b5>] __cv_wait+0x15/0x20 [spl]
            07:34:00:[ 4200.193657]  [<ffffffffa07886a3>] txg_wait_open+0xb3/0xf0 [zfs]
            07:34:00:[ 4200.195240]  [<ffffffffa0743b0d>] dmu_tx_wait+0x34d/0x360 [zfs]
            07:34:00:[ 4200.196968]  [<ffffffffa0743bb1>] dmu_tx_assign+0x91/0x4b0 [zfs]
            07:34:00:[ 4200.198521]  [<ffffffffa0f8c05a>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
            07:34:00:[ 4200.200290]  [<ffffffffa108e2db>] ofd_trans_start+0x6b/0xe0 [ofd]
            07:34:00:[ 4200.202029]  [<ffffffffa109460b>] ofd_commitrw_write+0x94b/0x1c30 [ofd]
            07:34:00:[ 4200.203800]  [<ffffffffa109850d>] ofd_commitrw+0x56d/0xa30 [ofd]
            07:34:00:[ 4200.205564]  [<ffffffffa0b30679>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
            07:34:00:[ 4200.207167]  [<ffffffffa11895ea>] echo_client_prep_commit.isra.49+0x69a/0xc30 [obdecho]
            07:34:00:[ 4200.209038]  [<ffffffffa11930cf>] echo_client_iocontrol+0x9cf/0x1c50 [obdecho]
            07:34:00:[ 4200.210635]  [<ffffffffa0b30679>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
            07:34:00:[ 4200.212403]  [<ffffffffa0b1bb1e>] class_handle_ioctl+0x19de/0x2150 [obdclass]
            07:34:00:[ 4200.213966]  [<ffffffff812a8558>] ? security_capable+0x18/0x20
            07:34:00:[ 4200.215613]  [<ffffffffa0b005d2>] obd_class_ioctl+0xd2/0x170 [obdclass]
            07:34:00:[ 4200.217176]  [<ffffffff81212025>] do_vfs_ioctl+0x2d5/0x4b0
            07:34:00:[ 4200.218778]  [<ffffffff81691ef1>] ? __do_page_fault+0x171/0x450
            07:34:00:[ 4200.220285]  [<ffffffffa0b1b258>] ? class_handle_ioctl+0x1118/0x2150 [obdclass]
            07:34:00:[ 4200.222049]  [<ffffffff812122a1>] SyS_ioctl+0xa1/0xc0
            07:34:00:[ 4200.223471]  [<ffffffff81696a09>] system_call_fastpath+0x16/0x1b
            07:34:00:[ 4200.225197] INFO: task lctl:3232 blocked for more than 120 seconds.
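
            Both traces above point at the same condition: the lctl thread (the obdecho/ofd write path) is blocked in dmu_tx_assign() -> dmu_tx_wait() -> txg_wait_open() waiting for a new transaction group, while txg_sync itself is stuck in zio_wait(), so no txg can complete. When reproducing this, the txg pipeline on the OSS can be inspected while the hang is in progress; a minimal diagnostic sketch (the pool name "lustre-ost1" is an assumption):

            # minimal diagnostic sketch; the pool name "lustre-ost1" is an assumption
            zpool status -v                              # look for pool errors or suspended I/O
            zpool iostat -v 5                            # check whether writes to the vdevs are progressing at all
            cat /proc/spl/kstat/zfs/lustre-ost1/txgs     # per-txg state (open/quiescing/syncing) and timing history
            echo w > /proc/sysrq-trigger && dmesg | tail # re-dump blocked-task stacks to confirm txg_sync is still in zio_wait()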
            

            bzzz Alex Zhuravlev added a comment - dup of LU-5242
            standan Saurabh Tandan (Inactive) added a comment (edited) - Another instance found on b2_8 during failover testing, build #6.
            https://testing.hpdd.intel.com/test_sessions/54ec62da-d99d-11e5-9ebe-5254006e85c2
            https://testing.hpdd.intel.com/test_sessions/c5a8e44c-d9c7-11e5-85dd-5254006e85c2

            bzzz Alex Zhuravlev added a comment - it would be interesting to try with http://review.whamcloud.com/#/c/18341/
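
            To test with that change, the patch can be fetched from Gerrit and applied to a local tree; a minimal sketch (the project path and patch-set number are assumptions):

            # minimal sketch; project path and patch-set number are assumptions
            git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/41/18341/1
            git cherry-pick FETCH_HEAD      # or: git checkout FETCH_HEAD to test the patched tree directly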

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: yujian Jian Yu
              Votes: 0
              Watchers: 15

              Dates

                Created:
                Updated:
                Resolved: