Lustre / LU-6923

writing process hung at txg_wait_open

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • Labels: ZFS
    • Severity: 3

    Description

      I saw an iozone writing process hang while waiting for a transaction group to open. The stack shows it blocked in dmu_tx_assign()/dmu_tx_wait() while freeing the file's blocks during unlink. The backtrace is as follows:

      [10459.403319] iozone          D ffff88082f353440     0  6714      1 0x00000004
      [10459.403321]  ffff880085e83b28 0000000000000082 ffff88055b805180 ffff880085e83fd8
      [10459.403323]  0000000000013440 0000000000013440 ffff880803713d20 ffff8807f3701368
      [10459.403326]  ffff8807f3701220 ffff8807f3701370 ffff8807f3701248 0000000000000000
      [10459.403328] Call Trace:
      [10459.403330]  [<ffffffff81766c59>] schedule+0x29/0x70
      [10459.403334]  [<ffffffffc09b4b45>] cv_wait_common+0xe5/0x120 [spl]
      [10459.403337]  [<ffffffff810b4cd0>] ? prepare_to_wait_event+0x100/0x100
      [10459.403341]  [<ffffffffc09b4b95>] __cv_wait+0x15/0x20 [spl]
      [10459.403360]  [<ffffffffc0ab2f43>] txg_wait_open+0x83/0xd0 [zfs]
      [10459.403379]  [<ffffffffc0a6e760>] dmu_tx_wait+0x380/0x390 [zfs]
      [10459.403387]  [<ffffffff817690b2>] ? mutex_lock+0x12/0x2f
      [10459.403406]  [<ffffffffc0a6e80a>] dmu_tx_assign+0x9a/0x510 [zfs]
      [10459.403423]  [<ffffffffc0a5f69c>] dmu_free_long_range+0x18c/0x240 [zfs]
      [10459.403449]  [<ffffffffc0ad266d>] zfs_rmnode+0x5d/0x340 [zfs]
      [10459.403473]  [<ffffffffc0af2bb8>] zfs_zinactive+0x168/0x180 [zfs]
      [10459.403494]  [<ffffffffc0aecc30>] zfs_inactive+0x60/0x200 [zfs]
      [10459.403518]  [<ffffffffc0b034c3>] zpl_evict_inode+0x43/0x60 [zfs]
      [10459.403521]  [<ffffffff811ee864>] evict+0xb4/0x180
      [10459.403523]  [<ffffffff811ef045>] iput+0xf5/0x180
      [10459.403525]  [<ffffffff811e3a83>] do_unlinkat+0x193/0x2c0
      [10459.403528]  [<ffffffff811d8cbf>] ? SYSC_newstat+0x2f/0x40
      [10459.403530]  [<ffffffff811e4b06>] SyS_unlink+0x16/0x20
      [10459.403532]  [<ffffffff8176aced>] system_call_fastpath+0x1a/0x1f
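
      The exact iozone invocation isn't recorded in this ticket. Purely as an illustration (file size, record size, and mount point below are assumptions), a single-stream write run that ends with iozone unlinking its temporary file, which is the path shown in the stack above, could look like:

      # Illustrative only: single-stream write test on the ZFS mount.
      # Without -w, iozone removes its temporary file when the run finishes,
      # which is when the unlink/evict path shown above is exercised.
      iozone -i 0 -s 8g -r 1m -f /zfstest/iozone.tmp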
      

      I will attach the backtraces of all running processes soon.
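
      (For reference, one common way to capture the stacks of all tasks on a node in this state is the kernel's sysrq trigger; whether the output ends up in dmesg, /var/log/messages, or on a serial console depends on the logging setup.)

      # Dump the stack of every task into the kernel log (requires root).
      echo t > /proc/sysrq-trigger
      # Collect the traces from the kernel ring buffer.
      dmesg > all-task-backtraces.txt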

          Activity

            yong.fan nasf (Inactive) added a comment:

            We hit similar trouble on lola as described in LU-7221.

            jay Jinshan Xiong (Inactive) added a comment:

            Hi Alex,

            If you're referring to commit 6102d037 for issue #3349, it has already been included in my code base for testing.
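
            (A quick way to confirm that a given fix is present in a local checkout, independent of this ticket, is to ask git whether the commit is an ancestor of HEAD; the repository path below is just the one shown later in this comment.)

            # In the local ZFS checkout:
            cd ~/work/zfs/zfs
            # Exits 0 if commit 6102d037 is already contained in HEAD.
            git merge-base --is-ancestor 6102d037 HEAD && echo "6102d037 is in this tree"
            # Or search the history directly:
            git log --oneline | grep 6102d037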

            Don,

            jxiong@titan:~/work/zfs/zfs$ sudo zdb -C zfstest
            
            MOS Configuration:
                    version: 5000
                    name: 'zfstest'
                    state: 0
                    txg: 101
                    pool_guid: 4724987114355670347
                    errata: 0
                    hostname: 'titan'
                    vdev_children: 1
                    vdev_tree:
                        type: 'root'
                        id: 0
                        guid: 4724987114355670347
                        children[0]:
                            type: 'disk'
                            id: 0
                            guid: 4902936745656769906
                            path: '/dev/nvme0n1p1'
                            whole_disk: 0
                            metaslab_array: 34
                            metaslab_shift: 29
                            ashift: 9
                            asize: 99993780224
                            is_log: 0
                            create_txg: 4
                    features_for_read:
                        com.delphix:hole_birth
                        com.delphix:embedded_data
            

            bzzz Alex Zhuravlev added a comment:

            Sorry, my wording wasn't good enough. See the patch: it switches Lustre to use ZFS 0.6.4.2, which includes:

            • Fix hard lockup due to infinite loop in zfs_zget() (zfsonlinux/zfs#3349)
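
            (A minimal way to check which ZFS/SPL release a node is actually running, assuming the modules are loaded, is the version files exposed by the modules:)

            # Version of the loaded kernel modules:
            cat /sys/module/zfs/version
            cat /sys/module/spl/version
            # Or query the installed module metadata:
            modinfo zfs | grep '^version:'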

            dbrady Don Brady (Inactive) added a comment:

            Could you supply the config (zdb -C <poolname>) and any recent pool events (zpool events -v)? Thanks.
            jay Jinshan Xiong (Inactive) added a comment:

            Filed a ticket at: https://github.com/zfsonlinux/zfs/issues/3645

            jay Jinshan Xiong (Inactive) added a comment:

            txg_sync has been running for ages with the following backtrace:

            [16932.253158] txg_sync        R  running task        0 26338      2 0x00000000
            [16932.253166]  ffff8807b1c77c38 ffff8807f659a3e8 ffffc900086ad7f0 000000000ccd7dca
            [16932.253177]  ffffffffc09adb20 ffffffffc09adb20 0000000000000000 000000000000c210
            [16932.253185]  0000000000001000 0000000000000000 0000000000000000 ffff8807b1c77c68
            [16932.253187] Call Trace:
            [16932.253190]  [<ffffffff811b6cf9>] ? __kmalloc_node+0x1c9/0x2a0
            [16932.253193]  [<ffffffffc09adb20>] ? spl_kmem_zalloc+0xc0/0x170 [spl]
            [16932.253197]  [<ffffffffc09adb20>] ? spl_kmem_zalloc+0xc0/0x170 [spl]
            [16932.253201]  [<ffffffffc09b4bd8>] ? __cv_wait_io+0x18/0x20 [spl]
            [16932.253226]  [<ffffffffc0afbe83>] ? zio_wait+0x123/0x210 [zfs]
            [16932.253240]  [<ffffffffc0a5d9da>] ? ddt_get_dedup_stats+0x3a/0x60 [zfs]
            [16932.253243]  [<ffffffff8107c2ca>] ? mod_timer+0x12a/0x1e0
            [16932.253260]  [<ffffffffc0a875b1>] ? dsl_pool_sync+0xb1/0x470 [zfs]
            [16932.253267]  [<ffffffffc09ad76a>] ? spl_kmem_free+0x2a/0x40 [spl]
            [16932.253292]  [<ffffffffc0aae346>] ? spa_update_dspace+0x26/0x40 [zfs]
            [16932.253315]  [<ffffffffc0aa1d32>] ? spa_sync+0x3a2/0xb20 [zfs]
            [16932.253321]  [<ffffffff810b4ce2>] ? autoremove_wake_function+0x12/0x40
            [16932.253327]  [<ffffffff8101bdc9>] ? read_tsc+0x9/0x20
            [16932.253350]  [<ffffffffc0ab3d8b>] ? txg_sync_thread+0x36b/0x630 [zfs]
            [16932.253357]  [<ffffffff8101c3b9>] ? sched_clock+0x9/0x10
            [16932.253380]  [<ffffffffc0ab3a20>] ? txg_quiesce_thread+0x380/0x380 [zfs]
            [16932.253388]  [<ffffffffc09aff61>] ? thread_generic_wrapper+0x71/0x80 [spl]
            [16932.253395]  [<ffffffffc09afef0>] ? __thread_exit+0x20/0x20 [spl]
            [16932.253402]  [<ffffffff810912c2>] ? kthread+0xd2/0xf0
            [16932.253408]  [<ffffffff810911f0>] ? kthread_create_on_node+0x1c0/0x1c0
            [16932.253413]  [<ffffffff8176ac3c>] ? ret_from_fork+0x7c/0xb0
            [16932.253415]  [<ffffffff810911f0>] ? kthread_create_on_node+0x1c0/0x1c0
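
            (When txg_sync looks stuck like this, a couple of starting points for narrowing it down on ZFS on Linux, assuming the kstats are available, are the per-pool txg history and the internal debug log; the pool name below is the one from the zdb output above.)

            # Per-txg history (states, wait/sync times), if txg history is enabled:
            cat /proc/spl/kstat/zfs/zfstest/txgs
            # Internal ZFS debug messages, if enabled:
            cat /proc/spl/kstat/zfs/dbgmsg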
            

            jay Jinshan Xiong (Inactive) added a comment:

            Hi Alex,

            No, I was benchmarking ZFS master and there is no Lustre code involved.

            bzzz Alex Zhuravlev added a comment:

            Probably this will be fixed with http://review.whamcloud.com/#/c/15481/

            People

              Assignee: wc-triage WC Triage
              Reporter: jay Jinshan Xiong (Inactive)
              Votes: 0
              Watchers: 5