[LU-6923] writing process hung at txg_wait_open Created: 28/Jul/15  Updated: 15/Dec/18

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Jinshan Xiong (Inactive) Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: zfs
Environment:

ZFS


Issue Links:
Duplicate
is duplicated by LU-7721 lfsck: slab 'size-1048576' exhaust me... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

I saw an iozone write process hung waiting for a transaction group to open. The backtrace is as follows:

[10459.403319] iozone          D ffff88082f353440     0  6714      1 0x00000004
[10459.403321]  ffff880085e83b28 0000000000000082 ffff88055b805180 ffff880085e83fd8
[10459.403323]  0000000000013440 0000000000013440 ffff880803713d20 ffff8807f3701368
[10459.403326]  ffff8807f3701220 ffff8807f3701370 ffff8807f3701248 0000000000000000
[10459.403328] Call Trace:
[10459.403330]  [<ffffffff81766c59>] schedule+0x29/0x70
[10459.403334]  [<ffffffffc09b4b45>] cv_wait_common+0xe5/0x120 [spl]
[10459.403337]  [<ffffffff810b4cd0>] ? prepare_to_wait_event+0x100/0x100
[10459.403341]  [<ffffffffc09b4b95>] __cv_wait+0x15/0x20 [spl]
[10459.403360]  [<ffffffffc0ab2f43>] txg_wait_open+0x83/0xd0 [zfs]
[10459.403379]  [<ffffffffc0a6e760>] dmu_tx_wait+0x380/0x390 [zfs]
[10459.403387]  [<ffffffff817690b2>] ? mutex_lock+0x12/0x2f
[10459.403406]  [<ffffffffc0a6e80a>] dmu_tx_assign+0x9a/0x510 [zfs]
[10459.403423]  [<ffffffffc0a5f69c>] dmu_free_long_range+0x18c/0x240 [zfs]
[10459.403449]  [<ffffffffc0ad266d>] zfs_rmnode+0x5d/0x340 [zfs]
[10459.403473]  [<ffffffffc0af2bb8>] zfs_zinactive+0x168/0x180 [zfs]
[10459.403494]  [<ffffffffc0aecc30>] zfs_inactive+0x60/0x200 [zfs]
[10459.403518]  [<ffffffffc0b034c3>] zpl_evict_inode+0x43/0x60 [zfs]
[10459.403521]  [<ffffffff811ee864>] evict+0xb4/0x180
[10459.403523]  [<ffffffff811ef045>] iput+0xf5/0x180
[10459.403525]  [<ffffffff811e3a83>] do_unlinkat+0x193/0x2c0
[10459.403528]  [<ffffffff811d8cbf>] ? SYSC_newstat+0x2f/0x40
[10459.403530]  [<ffffffff811e4b06>] SyS_unlink+0x16/0x20
[10459.403532]  [<ffffffff8176aced>] system_call_fastpath+0x1a/0x1f

I will attach the backtraces of all running processes soon.
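
For context, the path in the backtrace (do_unlinkat -> zfs_rmnode -> dmu_free_long_range -> dmu_tx_assign -> dmu_tx_wait -> txg_wait_open) is the usual DMU pattern of freeing a large object in chunked transactions: each chunk is assigned with TXG_WAIT, so the unlinking process sleeps in txg_wait_open() whenever the next transaction group is not yet open. The following is only a simplified sketch of that pattern, not the actual dmu_free_long_range() code; it assumes the ZFS kernel headers, and FREE_CHUNK_SIZE is an illustrative value.

/*
 * Simplified sketch of the free path seen in the backtrace: the free is
 * split into chunks, and each chunk is a DMU transaction assigned with
 * TXG_WAIT, so the caller can block in dmu_tx_wait()/txg_wait_open().
 * Illustrative only; not the real dmu_free_long_range() implementation.
 */
#include <sys/dmu.h>
#include <sys/dmu_tx.h>

#define FREE_CHUNK_SIZE	(1ULL << 20)	/* made-up chunk size */

static int
free_object_range(objset_t *os, uint64_t object, uint64_t off, uint64_t len)
{
	while (len != 0) {
		uint64_t chunk = (len < FREE_CHUNK_SIZE) ? len : FREE_CHUNK_SIZE;
		dmu_tx_t *tx = dmu_tx_create(os);
		int err;

		dmu_tx_hold_free(tx, object, off, chunk);

		/* TXG_WAIT: sleeps in txg_wait_open() until a txg can take it */
		err = dmu_tx_assign(tx, TXG_WAIT);
		if (err != 0) {
			dmu_tx_abort(tx);
			return (err);
		}

		err = dmu_free_range(os, object, off, chunk, tx);
		dmu_tx_commit(tx);
		if (err != 0)
			return (err);

		off += chunk;
		len -= chunk;
	}
	return (0);
}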



 Comments   
Comment by Alex Zhuravlev [ 28/Jul/15 ]

Probably this will be fixed by http://review.whamcloud.com/#/c/15481/

Comment by Jinshan Xiong (Inactive) [ 28/Jul/15 ]

Hi Alex,

No, I was benchmarking ZFS master and there was no Lustre code involved.

Comment by Jinshan Xiong (Inactive) [ 28/Jul/15 ]

txg_sync has been running for ages with the following backtrace:

[16932.253158] txg_sync        R  running task        0 26338      2 0x00000000
[16932.253166]  ffff8807b1c77c38 ffff8807f659a3e8 ffffc900086ad7f0 000000000ccd7dca
[16932.253177]  ffffffffc09adb20 ffffffffc09adb20 0000000000000000 000000000000c210
[16932.253185]  0000000000001000 0000000000000000 0000000000000000 ffff8807b1c77c68
[16932.253187] Call Trace:
[16932.253190]  [<ffffffff811b6cf9>] ? __kmalloc_node+0x1c9/0x2a0
[16932.253193]  [<ffffffffc09adb20>] ? spl_kmem_zalloc+0xc0/0x170 [spl]
[16932.253197]  [<ffffffffc09adb20>] ? spl_kmem_zalloc+0xc0/0x170 [spl]
[16932.253201]  [<ffffffffc09b4bd8>] ? __cv_wait_io+0x18/0x20 [spl]
[16932.253226]  [<ffffffffc0afbe83>] ? zio_wait+0x123/0x210 [zfs]
[16932.253240]  [<ffffffffc0a5d9da>] ? ddt_get_dedup_stats+0x3a/0x60 [zfs]
[16932.253243]  [<ffffffff8107c2ca>] ? mod_timer+0x12a/0x1e0
[16932.253260]  [<ffffffffc0a875b1>] ? dsl_pool_sync+0xb1/0x470 [zfs]
[16932.253267]  [<ffffffffc09ad76a>] ? spl_kmem_free+0x2a/0x40 [spl]
[16932.253292]  [<ffffffffc0aae346>] ? spa_update_dspace+0x26/0x40 [zfs]
[16932.253315]  [<ffffffffc0aa1d32>] ? spa_sync+0x3a2/0xb20 [zfs]
[16932.253321]  [<ffffffff810b4ce2>] ? autoremove_wake_function+0x12/0x40
[16932.253327]  [<ffffffff8101bdc9>] ? read_tsc+0x9/0x20
[16932.253350]  [<ffffffffc0ab3d8b>] ? txg_sync_thread+0x36b/0x630 [zfs]
[16932.253357]  [<ffffffff8101c3b9>] ? sched_clock+0x9/0x10
[16932.253380]  [<ffffffffc0ab3a20>] ? txg_quiesce_thread+0x380/0x380 [zfs]
[16932.253388]  [<ffffffffc09aff61>] ? thread_generic_wrapper+0x71/0x80 [spl]
[16932.253395]  [<ffffffffc09afef0>] ? __thread_exit+0x20/0x20 [spl]
[16932.253402]  [<ffffffff810912c2>] ? kthread+0xd2/0xf0
[16932.253408]  [<ffffffff810911f0>] ? kthread_create_on_node+0x1c0/0x1c0
[16932.253413]  [<ffffffff8176ac3c>] ? ret_from_fork+0x7c/0xb0
[16932.253415]  [<ffffffff810911f0>] ? kthread_create_on_node+0x1c0/0x1c0
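
In other words, the iozone writer is stuck behind this sync thread: at most one txg can be in each of the open, quiescing, and syncing states, so txg_wait_open() cannot return to the writer until spa_sync()/dsl_pool_sync() finishes writing out the txg it is currently handling. A toy user-space model of that handoff is below; all names are made up and this is not ZFS code, just an illustration of why a long-running sync stalls writers.

/*
 * Toy model of the txg handoff implied by the two backtraces: a writer
 * in "txg_wait_open()" can only proceed once the sync thread finishes
 * the txg it is working on and the next one opens. Build with -lpthread.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t tx_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  tx_open_cv = PTHREAD_COND_INITIALIZER;
static uint64_t tx_open_txg = 1;		/* currently open txg */

/* writer side: roughly what the wait in txg_wait_open() amounts to */
static void wait_for_open(uint64_t txg)
{
	pthread_mutex_lock(&tx_lock);
	while (tx_open_txg < txg)		/* blocks while sync lags behind */
		pthread_cond_wait(&tx_open_cv, &tx_lock);
	pthread_mutex_unlock(&tx_lock);
}

/* sync side: a slow "spa_sync()" keeps the open txg from advancing */
static void *sync_thread(void *arg)
{
	for (;;) {
		sleep(5);			/* stands in for a long sync pass */
		pthread_mutex_lock(&tx_lock);
		tx_open_txg++;			/* next txg can now open */
		pthread_cond_broadcast(&tx_open_cv);
		pthread_mutex_unlock(&tx_lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, sync_thread, NULL);
	printf("waiting for txg 3 to open...\n");
	wait_for_open(3);			/* returns only after two sync passes */
	printf("txg 3 open\n");
	return 0;
}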
Comment by Jinshan Xiong (Inactive) [ 28/Jul/15 ]

Filed a ticket at: https://github.com/zfsonlinux/zfs/issues/3645

Comment by Don Brady (Inactive) [ 28/Jul/15 ]

Could you supply the config (zdb -C <poolname>) and any recent pool events (zpool events -v)? Thanks.

Comment by Alex Zhuravlev [ 29/Jul/15 ]

Sorry, my wording wasn't clear enough. See the patch: it switches Lustre to ZFS 0.6.4.2, which includes:

  • Fix hard lockup due to infinite loop in zfs_zget() (zfsonlinux/zfs#3349)
Comment by Jinshan Xiong (Inactive) [ 29/Jul/15 ]

Hi Alex,

If you're referring to commit 6102d037 for issue #3349, it has already been included in my code base for testing.

Don,

jxiong@titan:~/work/zfs/zfs$ sudo zdb -C zfstest

MOS Configuration:
        version: 5000
        name: 'zfstest'
        state: 0
        txg: 101
        pool_guid: 4724987114355670347
        errata: 0
        hostname: 'titan'
        vdev_children: 1
        vdev_tree:
            type: 'root'
            id: 0
            guid: 4724987114355670347
            children[0]:
                type: 'disk'
                id: 0
                guid: 4902936745656769906
                path: '/dev/nvme0n1p1'
                whole_disk: 0
                metaslab_array: 34
                metaslab_shift: 29
                ashift: 9
                asize: 99993780224
                is_log: 0
                create_txg: 4
        features_for_read:
            com.delphix:hole_birth
            com.delphix:embedded_data
Comment by nasf (Inactive) [ 29/Jan/16 ]

We hit a similar problem on lola, as described in LU-7221.

Generated at Sat Feb 10 02:04:27 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.