[LU-6923] writing process hung at txg_wait_open  Created: 28/Jul/15  Updated: 15/Dec/18
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Jinshan Xiong (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | zfs |
| Environment: | ZFS |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
I saw an iozone writing process hung waiting for a transaction group to open. The backtrace is as follows:

[10459.403319] iozone D ffff88082f353440 0 6714 1 0x00000004
[10459.403321] ffff880085e83b28 0000000000000082 ffff88055b805180 ffff880085e83fd8
[10459.403323] 0000000000013440 0000000000013440 ffff880803713d20 ffff8807f3701368
[10459.403326] ffff8807f3701220 ffff8807f3701370 ffff8807f3701248 0000000000000000
[10459.403328] Call Trace:
[10459.403330] [<ffffffff81766c59>] schedule+0x29/0x70
[10459.403334] [<ffffffffc09b4b45>] cv_wait_common+0xe5/0x120 [spl]
[10459.403337] [<ffffffff810b4cd0>] ? prepare_to_wait_event+0x100/0x100
[10459.403341] [<ffffffffc09b4b95>] __cv_wait+0x15/0x20 [spl]
[10459.403360] [<ffffffffc0ab2f43>] txg_wait_open+0x83/0xd0 [zfs]
[10459.403379] [<ffffffffc0a6e760>] dmu_tx_wait+0x380/0x390 [zfs]
[10459.403387] [<ffffffff817690b2>] ? mutex_lock+0x12/0x2f
[10459.403406] [<ffffffffc0a6e80a>] dmu_tx_assign+0x9a/0x510 [zfs]
[10459.403423] [<ffffffffc0a5f69c>] dmu_free_long_range+0x18c/0x240 [zfs]
[10459.403449] [<ffffffffc0ad266d>] zfs_rmnode+0x5d/0x340 [zfs]
[10459.403473] [<ffffffffc0af2bb8>] zfs_zinactive+0x168/0x180 [zfs]
[10459.403494] [<ffffffffc0aecc30>] zfs_inactive+0x60/0x200 [zfs]
[10459.403518] [<ffffffffc0b034c3>] zpl_evict_inode+0x43/0x60 [zfs]
[10459.403521] [<ffffffff811ee864>] evict+0xb4/0x180
[10459.403523] [<ffffffff811ef045>] iput+0xf5/0x180
[10459.403525] [<ffffffff811e3a83>] do_unlinkat+0x193/0x2c0
[10459.403528] [<ffffffff811d8cbf>] ? SYSC_newstat+0x2f/0x40
[10459.403530] [<ffffffff811e4b06>] SyS_unlink+0x16/0x20
[10459.403532] [<ffffffff8176aced>] system_call_fastpath+0x1a/0x1f

I will attach the backtrace of all running processes soon.
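The call chain explains where the process sleeps: do_unlinkat() evicts the inode, zfs_rmnode() frees the file body via dmu_free_long_range(), and the resulting dmu_tx_assign() with TXG_WAIT goes through dmu_tx_wait() into txg_wait_open(), which blocks until the quiesce/sync pipeline opens the next transaction group. For reference, here is a minimal user-space model of that txg_wait_open() handshake; the names are hypothetical and pthreads stand in for the SPL cv_* primitives, so this is an illustrative sketch, not the actual ZFS code:

/* Model: a writer blocks until the "open" txg reaches the txg it is
 * waiting for; a sync thread advances the open txg and wakes waiters.
 * If the sync side stalls, the writer hangs exactly as in the report.
 * Compile with: cc -pthread txg_model.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t tx_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  tx_open_cv = PTHREAD_COND_INITIALIZER;
static unsigned long   tx_open_txg = 1;   /* currently open txg */

/* Block until txg 'txg' (or a later one) is open. */
static void txg_wait_open_model(unsigned long txg)
{
    pthread_mutex_lock(&tx_lock);
    while (tx_open_txg < txg)             /* nothing to do but sleep */
        pthread_cond_wait(&tx_open_cv, &tx_lock);
    pthread_mutex_unlock(&tx_lock);
}

/* Sync side: after each (simulated) spa_sync() pass, open the next
 * txg and wake any waiters. */
static void *txg_sync_model(void *arg)
{
    for (int i = 0; i < 3; i++) {
        sleep(1);                         /* stands in for spa_sync() work */
        pthread_mutex_lock(&tx_lock);
        tx_open_txg++;
        pthread_cond_broadcast(&tx_open_cv);
        pthread_mutex_unlock(&tx_lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, txg_sync_model, NULL);
    txg_wait_open_model(3);               /* writer waits for txg 3 to open */
    printf("txg %lu open, writer may assign its tx\n", tx_open_txg);
    pthread_join(t, NULL);
    return 0;
}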
| Comments |
| Comment by Alex Zhuravlev [ 28/Jul/15 ] |
|
This will probably be fixed by http://review.whamcloud.com/#/c/15481/
| Comment by Jinshan Xiong (Inactive) [ 28/Jul/15 ] |
|
Hi Alex, no: I was benchmarking ZFS master, and no Lustre code is involved.
| Comment by Jinshan Xiong (Inactive) [ 28/Jul/15 ] |
|
txg_sync has been running for ages with the following backtrace:

[16932.253158] txg_sync R running task 0 26338 2 0x00000000
[16932.253166] ffff8807b1c77c38 ffff8807f659a3e8 ffffc900086ad7f0 000000000ccd7dca
[16932.253177] ffffffffc09adb20 ffffffffc09adb20 0000000000000000 000000000000c210
[16932.253185] 0000000000001000 0000000000000000 0000000000000000 ffff8807b1c77c68
[16932.253187] Call Trace:
[16932.253190] [<ffffffff811b6cf9>] ? __kmalloc_node+0x1c9/0x2a0
[16932.253193] [<ffffffffc09adb20>] ? spl_kmem_zalloc+0xc0/0x170 [spl]
[16932.253197] [<ffffffffc09adb20>] ? spl_kmem_zalloc+0xc0/0x170 [spl]
[16932.253201] [<ffffffffc09b4bd8>] ? __cv_wait_io+0x18/0x20 [spl]
[16932.253226] [<ffffffffc0afbe83>] ? zio_wait+0x123/0x210 [zfs]
[16932.253240] [<ffffffffc0a5d9da>] ? ddt_get_dedup_stats+0x3a/0x60 [zfs]
[16932.253243] [<ffffffff8107c2ca>] ? mod_timer+0x12a/0x1e0
[16932.253260] [<ffffffffc0a875b1>] ? dsl_pool_sync+0xb1/0x470 [zfs]
[16932.253267] [<ffffffffc09ad76a>] ? spl_kmem_free+0x2a/0x40 [spl]
[16932.253292] [<ffffffffc0aae346>] ? spa_update_dspace+0x26/0x40 [zfs]
[16932.253315] [<ffffffffc0aa1d32>] ? spa_sync+0x3a2/0xb20 [zfs]
[16932.253321] [<ffffffff810b4ce2>] ? autoremove_wake_function+0x12/0x40
[16932.253327] [<ffffffff8101bdc9>] ? read_tsc+0x9/0x20
[16932.253350] [<ffffffffc0ab3d8b>] ? txg_sync_thread+0x36b/0x630 [zfs]
[16932.253357] [<ffffffff8101c3b9>] ? sched_clock+0x9/0x10
[16932.253380] [<ffffffffc0ab3a20>] ? txg_quiesce_thread+0x380/0x380 [zfs]
[16932.253388] [<ffffffffc09aff61>] ? thread_generic_wrapper+0x71/0x80 [spl]
[16932.253395] [<ffffffffc09afef0>] ? __thread_exit+0x20/0x20 [spl]
[16932.253402] [<ffffffff810912c2>] ? kthread+0xd2/0xf0
[16932.253408] [<ffffffff810911f0>] ? kthread_create_on_node+0x1c0/0x1c0
[16932.253413] [<ffffffff8176ac3c>] ? ret_from_fork+0x7c/0xb0
[16932.253415] [<ffffffff810911f0>] ? kthread_create_on_node+0x1c0/0x1c0
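This trace shows txg_sync parked under __cv_wait_io()/zio_wait() inside spa_sync(), so the open txg cannot advance, which matches the writer stuck in txg_wait_open() above. One way to check from user space whether sync is making any progress at all is to sample the per-pool txg kstat that ZFS on Linux exposes. A minimal sketch, assuming the /proc/spl/kstat/zfs/<pool>/txgs kstat is present on the system under test (the pool name argument is illustrative):

/* Sample the ZoL txg kstat twice; if no txg changes state between the
 * two samples, the sync pipeline is stuck rather than merely slow.
 * Compile with: cc txg_watch.c */
#include <stdio.h>
#include <unistd.h>

static void dump_txgs(const char *pool)
{
    char path[256], line[512];
    snprintf(path, sizeof(path), "/proc/spl/kstat/zfs/%s/txgs", pool);
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        perror(path);
        return;
    }
    while (fgets(line, sizeof(line), f) != NULL)
        fputs(line, stdout);
    fclose(f);
}

int main(int argc, char **argv)
{
    const char *pool = (argc > 1) ? argv[1] : "zfstest";
    dump_txgs(pool);
    sleep(10);   /* no txg state change within 10s suggests a stuck sync */
    dump_txgs(pool);
    return 0;
}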
| Comment by Jinshan Xiong (Inactive) [ 28/Jul/15 ] |
|
Filed a ticket upstream at: https://github.com/zfsonlinux/zfs/issues/3645
| Comment by Don Brady (Inactive) [ 28/Jul/15 ] |
|
Could you supply the config (zdb -C <poolname>) and any recent pool events (zpool events -v)? Thanks. |
| Comment by Alex Zhuravlev [ 29/Jul/15 ] |
|
Sorry, my wording wasn't good enough. See the patch; it switches Lustre to use 0.6.4.2, which:
|
| Comment by Jinshan Xiong (Inactive) [ 29/Jul/15 ] |
|
Hi Alex, if you're referring to commit 6102d037 for issue #3349, it has already been included in my code base for testing.

Don, here is the pool config:

jxiong@titan:~/work/zfs/zfs$ sudo zdb -C zfstest
MOS Configuration:
version: 5000
name: 'zfstest'
state: 0
txg: 101
pool_guid: 4724987114355670347
errata: 0
hostname: 'titan'
vdev_children: 1
vdev_tree:
type: 'root'
id: 0
guid: 4724987114355670347
children[0]:
type: 'disk'
id: 0
guid: 4902936745656769906
path: '/dev/nvme0n1p1'
whole_disk: 0
metaslab_array: 34
metaslab_shift: 29
ashift: 9
asize: 99993780224
is_log: 0
create_txg: 4
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
|
| Comment by nasf (Inactive) [ 29/Jan/16 ] |
|
We hit similar trouble on lola as described in |