[LU-4950] sanity-benchmark test fsx hung: txg_sync was stuck on OSS Created: 24/Apr/14 Updated: 16/Apr/20 Resolved: 16/Apr/20 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0, Lustre 2.5.1, Lustre 2.7.0, Lustre 2.4.3, Lustre 2.5.3, Lustre 2.8.0, Lustre 2.10.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Jian Yu | Assignee: | Alex Zhuravlev |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | zfs |
| Environment: | Lustre build: http://build.whamcloud.com/job/lustre-b2_5/47/ |
| Attachments: |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 13698 |
| Description |
|
While running sanity-benchmark test fsx with the following configuration, it timed out in 3600s.

MDSCOUNT=1
MDSSIZE=2097152
OSTCOUNT=2
OSTSIZE=8388608
FSTYPE=zfs
SLOW=yes

txg_sync was stuck on OSS node:

txg_sync      D 0000000000000001     0  2665      2 0x00000080
 ffff88006f4cbba0 0000000000000046 00000000ffffffff 000037007418a66d
 ffff88006f4cbb10 ffff88007cba1a60 0000000000bba084 ffffffffaca9e91e
 ffff880070bc05f8 ffff88006f4cbfd8 000000000000fbc8 ffff880070bc05f8
Call Trace:
 [<ffffffff810a7091>] ? ktime_get_ts+0xb1/0xf0
 [<ffffffff81528823>] io_schedule+0x73/0xc0
 [<ffffffffa0142e7c>] cv_wait_common+0x8c/0x100 [spl]
 [<ffffffff8109b290>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa0142f08>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa02864ab>] zio_wait+0xfb/0x1b0 [zfs]
 [<ffffffffa021e5a5>] dsl_pool_sync+0x2f5/0x540 [zfs]
 [<ffffffffa023725e>] spa_sync+0x40e/0xa00 [zfs]
 [<ffffffff8103ea79>] ? kvm_clock_get_cycles+0x9/0x10
 [<ffffffffa024253a>] txg_sync_thread+0x27a/0x4b0 [zfs]
 [<ffffffff81059329>] ? set_user_nice+0xc9/0x130
 [<ffffffffa02422c0>] ? txg_sync_thread+0x0/0x4b0 [zfs]
 [<ffffffffa013e9bf>] thread_generic_wrapper+0x5f/0x70 [spl]
 [<ffffffffa013e960>] ? thread_generic_wrapper+0x0/0x70 [spl]
 [<ffffffff8109aee6>] kthread+0x96/0xa0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff8109ae50>] ? kthread+0x0/0xa0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20

Maloo report: https://maloo.whamcloud.com/test_sets/7ec85f26-cb6e-11e3-95c9-52540035b04c |
| Comments |
| Comment by Jian Yu [ 24/Apr/14 ] |
|
Lustre build: http://build.whamcloud.com/job/lustre-b2_4/74/
sanity-benchmark test iozone also hit the same issue: |
| Comment by Jian Yu [ 24/Apr/14 ] |
|
Hello Alex, could you please take a look at whether this is a duplicate of |
| Comment by Isaac Huang (Inactive) [ 25/Apr/14 ] |
|
It looks more like a duplicate of
What were the storage devices for the OST storage pools? At the very least, we should make sure not to use IO schedulers known to cause ZFS IO starvation on those devices. |
| Comment by Jian Yu [ 27/Apr/14 ] |
|
Hi Isaac,

The tests were performed on virtual machines by autotest. The storage devices were LVs on a /dev/vda partition. And the I/O scheduler was:

# cat /sys/block/vda/queue/scheduler
noop anticipatory [deadline] cfq

FYI, the same failure was reported in |
| Comment by Alex Zhuravlev [ 28/Apr/14 ] |
|
Jian, no, this is definitely not 2476/2600. I hit this locally a few times as well - basically everything got stuck. We discussed this a bit with Brian B., but the root cause isn't clear yet. |
| Comment by Isaac Huang (Inactive) [ 12/May/14 ] |
|
Hi Jian, I'd suggest setting the IO scheduler of /dev/vda to noop. In fact, ZFS automatically sets the IO scheduler of its storage pool devices to noop, as other schedulers could cause ZFS IO starvation. But since the pool uses LVs, ZFS can't see /dev/vda at all. |
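[Editor's note] For reference, a minimal sketch of how the scheduler could be switched at runtime on the OSS, assuming the backing device is /dev/vda as reported above (the setting does not persist across reboots):

# cat /sys/block/vda/queue/scheduler           (the active scheduler is shown in brackets)
# echo noop > /sys/block/vda/queue/scheduler   (switch the elevator at runtime)
# cat /sys/block/vda/queue/scheduler           (verify that [noop] is now selected)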
| Comment by Jian Yu [ 05/Jun/14 ] |
|
Hi Isaac, I tried noop before, please refer to: https://jira.hpdd.intel.com/browse/LU-2887?focusedCommentId=67639&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-67639 |
| Comment by Nathaniel Clark [ 05/Jun/14 ] |
|
This seems to be similar to the txg_sync issue described in |
| Comment by Jian Yu [ 05/Jun/14 ] |
|
In fact, in |
| Comment by Nathaniel Clark [ 06/Jun/14 ] |
|
This may actually be a ZFS issue and should be retested with zfs 0.6.3 when it arrives. |
| Comment by Andreas Dilger [ 18/Jun/14 ] |
|
There was also a failure of this test with ldiskfs in |
| Comment by Sarah Liu [ 21/Jul/14 ] |
|
Hit several timeout failures in lustre-b2_6-rc2 ZFS testing; cannot find error messages. https://testing.hpdd.intel.com/test_sets/254eb710-0ecb-11e4-9f57-5254006e85c2 |
| Comment by Peter Jones [ 29/Jul/14 ] |
|
That rules out this being fixed in zfs 0.6.3 |
| Comment by Jian Yu [ 24/Aug/14 ] |
|
Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/84/
Patch http://review.whamcloud.com/11298 was already in Lustre b2_5 build #84. However, the failure still occurred: |
| Comment by Jian Yu [ 31/Aug/14 ] |
|
Lustre Build: https://build.hpdd.intel.com/job/lustre-b2_5/86/ (2.5.3 RC1)
The same failure occurred: |
| Comment by Isaac Huang (Inactive) [ 03/Sep/14 ] |
|
I assume the latest reports were already using ZFS 0.6.3. Can you elaborate on how the OSS storage is configured:
It appeared that the OSS had 2G of memory. Any ZFS tunings? Can you include the output of 'cat /sys/module/zfs/parameters/*', i.e. the run-time values of the tunings? The write throttle and IO scheduling have been completely rewritten in 0.6.3, and the defaults may not be appropriate for a VM test system. So first I need to understand exactly how the OSS is configured. |
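[Editor's note] A sketch of standard commands that could gather the information requested above in one pass (nothing here is specific to this test setup):

# zpool status                            (pool layout: which vdevs back each OST pool)
# lsblk                                   (whether those vdevs are LVs, partitions, or whole disks)
# grep . /sys/module/zfs/parameters/*     (run-time ZFS tunings, printed as name:value)
# free -m                                 (memory available on the OSS)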
| Comment by Jian Yu [ 03/Sep/14 ] |
|
Hi Isaac, I just added Minh and Joshua to watch this ticket. The OSS and OSTs were configured by the autotest system. To answer the above questions, I need to look into the test nodes while a ZFS test session is running. If you would like, I can trigger a ZFS test session and tell you the test nodes, so that you can log into them and investigate the parameters directly. |
| Comment by Isaac Huang (Inactive) [ 04/Sep/14 ] |
|
I checked the stack dumps from the most recent three failures reported by Jian, and found that the OSS z_ioctl_iss threads were all stuck in the same place:

z_ioctl_iss/0 S 0000000000000001     0  3996      2 0x00000080

Likely it's really stuck in vdev_disk_io_flush(), which is probably inlined. It seemed like the syncing txg was about to complete and a WRITE_FLUSH_FUA was to be issued after the uberblocks had been updated. There were also other threads waiting for the txg to be synced. The only place it could get stuck in vdev_disk_io_flush() seemed to be bio_alloc(), which indicated a deadlock under memory pressure.

There are two things to help troubleshoot:
1. Set zfs_nocacheflush on the OSS.
2. Gather "zpool events -v" output. |
| Comment by Sarah Liu [ 12/Sep/14 ] |
|
Seen this in the hard failover test for the master branch with ZFS. |
| Comment by Jian Yu [ 12/Sep/14 ] |
I pushed a for-test-only patch on the Lustre b2_5 branch to run the previously timed-out tests on ZFS with "options zfs zfs_nocacheflush=1": http://review.whamcloud.com/11906
If the deadlock still occurs, I'll ask the TEI team for help to implement item 2. |
| Comment by Isaac Huang (Inactive) [ 17/Sep/14 ] |
|
My previous analysis might be inaccurate, as I neglected the '?' in the stack entry:

[<ffffffffa024aa46>] ? vdev_disk_io_start+0x156/0x1b0 [zfs]

This meant that vdev_disk_io_start() had likely returned and the WRITE_FLUSH_FUA bio had already been submitted - the z_ioctl_iss thread was simply waiting for more work to do. I checked the OSS stack dumps, and the z_* threads were all idle, so there was no more IO left to do. These led me to believe that:

Suggestions to troubleshoot stay the same: on the OSS, set zfs_nocacheflush, and gather "zpool events -v" outputs. |
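[Editor's note] A minimal sketch of those two suggestions on the OSS (the output file path below is only an example):

# echo 1 > /sys/module/zfs/parameters/zfs_nocacheflush   (or put "options zfs zfs_nocacheflush=1" in /etc/modprobe.d/zfs.conf before the module loads)
# cat /sys/module/zfs/parameters/zfs_nocacheflush        (verify the setting took effect)
# zpool events -v > /tmp/zpool-events.txt                (collect the verbose event history after the hang)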
| Comment by Jian Yu [ 25/Sep/14 ] |
|
I set up a test cluster on onyx-23vm[5-8] with the following configuration:

NAME=ncli
RCLIENTS=onyx-23vm6
mds_HOST=onyx-23vm7
MDSCOUNT=1
MDSSIZE=3145728
MDSDEV1=/dev/vda5
ost_HOST=onyx-23vm8
OSTCOUNT=2
OSTSIZE=8388608
OSTDEV1=/dev/vda5
OSTDEV2=/dev/vda6
SLOW=yes
ENABLE_QUOTA=yes
FSTYPE=zfs

# pdsh -l root -S -w onyx-23vm[5-8] "export PATH=\$PATH:/sbin:/usr/sbin; cat /sys/block/vda/queue/scheduler"
onyx-23vm6: [noop] anticipatory deadline cfq
onyx-23vm5: [noop] anticipatory deadline cfq
onyx-23vm7: [noop] anticipatory deadline cfq
onyx-23vm8: [noop] anticipatory deadline cfq

Lustre build: https://build.hpdd.intel.com/job/lustre-b2_5/88/

I'm re-running the tests with the "deadline" scheduler, which is the default value after the test nodes are provisioned. |
| Comment by Jian Yu [ 25/Sep/14 ] |
sanity-benchmark passed in 15688s. However, performance-sanity test 8 hung. The output of "zpool events -v" on the OSS is attached, and here is the test report: https://testing.hpdd.intel.com/test_sets/810760ae-45c4-11e4-aeca-5254006e85c2
Hi Isaac, if you can access the onyx cluster, please feel free to use onyx-23vm[5-8]. The testing environment has been preserved. |
| Comment by Isaac Huang (Inactive) [ 26/Sep/14 ] |
|
Looks like the combination of LVM elimination and noop scheduler made a difference. Can you please repeat the same test again (i.e. noop, and no LVM)? Considering the low pass rate, we'd be able to isolate LVM+deadline as the cause of IO timeout if it passes again. |
| Comment by Jian Yu [ 27/Sep/14 ] |
Sure. Testing was started. |
| Comment by Jian Yu [ 29/Sep/14 ] |
|
Hi Isaac, I ran the same tests twice on the same test nodes, with the same noop-scheduler configuration as in the previous comments. Both runs hung at performance-sanity. Here is the test report of the latest run: https://testing.hpdd.intel.com/test_sessions/b9ebf592-4803-11e4-a4e8-5254006e85c2
So the noop scheduler does not resolve the issue. |
| Comment by Isaac Huang (Inactive) [ 02/Oct/14 ] |
|
The test failure above appeared to be a different issue:

Lustre: DEBUG MARKER: ===== mdsrate-lookup-10dirs.sh Test preparation: creating 10 dirs with 7840 files.
INFO: task mkdir:23576 blocked for more than 120 seconds.
Not tainted 2.6.32-431.29.2.el6.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mkdir D 0000000000000000 0 23576 23568 0x00000080
ffff88003ca05db8 0000000000000086 ffff880067cb2040 ffff880067cb2040
ffff880067cb2040 ffff880067cb2040 000000103ca05e38 ffff88005a5f9000
ffff880067cb25f8 ffff88003ca05fd8 000000000000fbc8 ffff880067cb25f8
Call Trace:
[<ffffffff811ab390>] ? mntput_no_expire+0x30/0x110
[<ffffffff8152a5be>] __mutex_lock_slowpath+0x13e/0x180
[<ffffffff8152a45b>] mutex_lock+0x2b/0x50
[<ffffffff81196c00>] lookup_create+0x30/0xd0
[<ffffffff8119ae63>] sys_mkdirat+0x83/0x1b0
[<ffffffff811ab390>] ? mntput_no_expire+0x30/0x110
[<ffffffff810e1e07>] ? audit_syscall_entry+0x1d7/0x200
[<ffffffff8119afa8>] sys_mkdir+0x18/0x20
So it still looks positive that the combination of LVM elimination and noop scheduler made a difference. |
| Comment by Nathaniel Clark [ 20/Oct/14 ] |
|
sanity-hsm/24a on review-zfs on master: |
| Comment by Isaac Huang (Inactive) [ 30/Oct/14 ] |
|
Likely it can be fixed by this ZFS patch:
I'll test it soon. |
| Comment by Andreas Dilger [ 12/Nov/14 ] |
|
I know that we prefer to use tagged upstream ZFS releases, but if this allows us to pass tests, I'd prefer that it be landed into our repository so we can use it. In that case, it is worthwhile to post in the upstream GitHub ticket that this change is needed for Lustre to pass testing, and hopefully Brian will land it more quickly. |
| Comment by Gerrit Updater [ 21/Nov/14 ] |
|
Nathaniel Clark (nathaniel.l.clark@intel.com) uploaded a new patch: http://review.whamcloud.com/12810 |
| Comment by Nathaniel Clark [ 21/Nov/14 ] |
|
Build ZFS with this change: |
| Comment by Andreas Dilger [ 15/Dec/14 ] |
|
Isaac, can you look at this issue again? It seems the upstream Illumos patch that you were looking at didn't fix this problem in ZFS. We are still trying to land that patch for fixing |
| Comment by Isaac Huang (Inactive) [ 09/Jan/15 ] |
|
Nathaniel is working on updating our ZFS/SPL version to 0.6.3-1.2 in
It looks positive that the SPL fix would fix many of the ZIO timeouts we've been seeing, including this one and the |
| Comment by Peter Jones [ 11/Mar/15 ] |
|
|
| Comment by Jian Yu [ 11/Mar/15 ] |
|
Patch http://review.whamcloud.com/13322 for
Console log on OSS:

10:51:22:txg_sync      D 0000000000000001     0  2753      2 0x00000080
10:51:22: ffff8800652c9b90 0000000000000046 00000000ffffffff 00001708d56216b3
10:51:22: 0000000000000000 ffff88006c6eeb40 000000000038fc98 ffffffffab01512e
10:51:22: ffff880063d04638 ffff8800652c9fd8 000000000000fbc8 ffff880063d04638
10:51:22:Call Trace:
10:51:22: [<ffffffff810a6d31>] ? ktime_get_ts+0xb1/0xf0
10:51:22: [<ffffffff81529e83>] io_schedule+0x73/0xc0
10:51:22: [<ffffffffa0143596>] cv_wait_common+0xa6/0x120 [spl]
10:51:22: [<ffffffff8109afa0>] ? autoremove_wake_function+0x0/0x40
10:51:22: [<ffffffffa0143628>] __cv_wait_io+0x18/0x20 [spl]
10:51:22: [<ffffffffa028d81b>] zio_wait+0xfb/0x1c0 [zfs]
10:51:22: [<ffffffffa0220993>] dsl_pool_sync+0xb3/0x3f0 [zfs]
10:51:22: [<ffffffffa02398bf>] spa_sync+0x40f/0xa70 [zfs]
10:51:22: [<ffffffffa0243771>] ? spa_txg_history_set+0xc1/0xf0 [zfs]
10:51:22: [<ffffffffa0246c7d>] txg_sync_thread+0x30d/0x520 [zfs]
10:51:22: [<ffffffff810591a9>] ? set_user_nice+0xc9/0x130
10:51:22: [<ffffffffa0246970>] ? txg_sync_thread+0x0/0x520 [zfs]
10:51:22: [<ffffffffa013ec3f>] thread_generic_wrapper+0x5f/0x70 [spl]
10:51:22: [<ffffffffa013ebe0>] ? thread_generic_wrapper+0x0/0x70 [spl]
10:51:22: [<ffffffff8109abf6>] kthread+0x96/0xa0
10:51:22: [<ffffffff8100c20a>] child_rip+0xa/0x20
10:51:22: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
10:51:22: [<ffffffff8100c200>] ? child_rip+0x0/0x20 |
| Comment by Sarah Liu [ 20/Jul/15 ] |
|
Server and client: lustre-master build #3094, RHEL6.6, ZFS
https://testing.hpdd.intel.com/test_sets/a8e1a402-2865-11e5-8329-5254006e85c2 |
| Comment by Saurabh Tandan (Inactive) [ 21/Dec/15 ] |
|
Another instance for EL6.7 Server/EL6.7 Client - ZFS:

txg_sync      D 0000000000000000     0 23282      2 0x00000080
 ffff880039e73b60 0000000000000046 0000000000000000 0000a3cd3e7c281e
 ffff88007cd68ae8 ffff880037d9e600 00003b391acd9c96 ffffffffa9941bdc
 0000000006f6b1a7 0000000103dd04a2 ffff88007cd69068 ffff880039e73fd8
Call Trace:
 [<ffffffff815395d3>] io_schedule+0x73/0xc0
 [<ffffffffa01a6eaf>] cv_wait_common+0xaf/0x130 [spl]
 [<ffffffff810a1460>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8106ee1a>] ? __cond_resched+0x2a/0x40
 [<ffffffffa01a6f48>] __cv_wait_io+0x18/0x20 [spl]
 [<ffffffffa02b129b>] zio_wait+0x10b/0x1e0 [zfs]
 [<ffffffffa0238b1b>] dsl_pool_sync+0x2bb/0x440 [zfs]
 [<ffffffffa024fb3e>] spa_sync+0x35e/0xb10 [zfs]
 [<ffffffffa02661f8>] txg_sync_thread+0x3d8/0x670 [zfs]
 [<ffffffffa0265e20>] ? txg_sync_thread+0x0/0x670 [zfs]
 [<ffffffffa0265e20>] ? txg_sync_thread+0x0/0x670 [zfs]
 [<ffffffffa01a2258>] thread_generic_wrapper+0x68/0x80 [spl]
 [<ffffffffa01a21f0>] ? thread_generic_wrapper+0x0/0x80 [spl]
 [<ffffffff810a0fce>] kthread+0x9e/0xc0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a0f30>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20 |
| Comment by Sarah Liu [ 20/Jan/16 ] |
|
Another instance on master build #3305, RHEL6.7. |
| Comment by Saurabh Tandan (Inactive) [ 05/Feb/16 ] |
|
Another instance on master for FULL - EL7.1 Server/EL7.1 Client - ZFS, build# 3314 |
| Comment by Saurabh Tandan (Inactive) [ 09/Feb/16 ] |
|
Another instance found for hardfailover: EL6.7 Server/Client - ZFS, tag 2.7.66, master build 3314
Another instance found for hardfailover: EL7 Server/Client - ZFS, tag 2.7.66, master build 3314
Another instance found for Full tag 2.7.66: EL6.7 Server/EL6.7 Client - ZFS, build #3314
Another instance found for Full tag 2.7.66: EL7.1 Server/EL7.1 Client - ZFS, build #3314 |
| Comment by Alex Zhuravlev [ 10/Feb/16 ] |
|
it would be interesting to try with http://review.whamcloud.com/#/c/18341/ |
| Comment by Saurabh Tandan (Inactive) [ 24/Feb/16 ] |
|
Another instance found on b2_8 for failover testing, build #6. |
| Comment by Alex Zhuravlev [ 31/May/16 ] |
|
dup of |
| Comment by Sarah Liu [ 10/Mar/17 ] |
|
In the master branch, tag 2.9.53 el7 zfs is still experiencing the same error; reopening it to see if anything else is needed.

OSS console:

07:34:00:[ 3545.939658] Lustre: DEBUG MARKER: == obdfilter-survey test 1a: Object Storage Targets survey =========================================== 07:23:02 (1487748182)
07:34:00:[ 3546.225025] Lustre: DEBUG MARKER: lctl dl | grep obdfilter
07:34:00:[ 3546.764980] Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d '@'
07:34:00:[ 3547.846310] Lustre: Echo OBD driver; http://www.lustre.org/
07:34:00:[ 4200.108290] INFO: task txg_sync:22316 blocked for more than 120 seconds.
07:34:00:[ 4200.110327] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
07:34:00:
07:34:00:[ 4200.112402] txg_sync        D 0000000000000001     0 22316      2 0x00000080
07:34:00:[ 4200.116024]  ffff88003b79fac0 0000000000000046 ffff88003a438fb0 ffff88003b79ffd8
07:34:00:[ 4200.118769]  ffff88003b79ffd8 ffff88003b79ffd8 ffff88003a438fb0 ffff88007fd16c40
07:34:00:[ 4200.120870]  0000000000000000 7fffffffffffffff ffff88003bd09908 0000000000000001
07:34:00:[ 4200.123100] Call Trace:
07:34:00:[ 4200.124786]  [<ffffffff8168bac9>] schedule+0x29/0x70
07:34:00:[ 4200.126894]  [<ffffffff81689519>] schedule_timeout+0x239/0x2d0
07:34:00:[ 4200.128916]  [<ffffffff810c4fe2>] ? default_wake_function+0x12/0x20
07:34:00:[ 4200.130962]  [<ffffffff810ba238>] ? __wake_up_common+0x58/0x90
07:34:00:[ 4200.133032]  [<ffffffff81060c1f>] ? kvm_clock_get_cycles+0x1f/0x30
07:34:00:[ 4200.135083]  [<ffffffff8168b06e>] io_schedule_timeout+0xae/0x130
07:34:00:[ 4200.137129]  [<ffffffff810b1416>] ? prepare_to_wait_exclusive+0x56/0x90
07:34:00:[ 4200.139259]  [<ffffffff8168b108>] io_schedule+0x18/0x20
07:34:00:[ 4200.141213]  [<ffffffffa0679617>] cv_wait_common+0xa7/0x130 [spl]
07:34:00:[ 4200.143303]  [<ffffffff810b1720>] ? wake_up_atomic_t+0x30/0x30
07:34:00:[ 4200.145316]  [<ffffffffa06796f8>] __cv_wait_io+0x18/0x20 [spl]
07:34:00:[ 4200.147464]  [<ffffffffa07d351b>] zio_wait+0x10b/0x1f0 [zfs]
07:34:00:[ 4200.149508]  [<ffffffffa075ccdf>] dsl_pool_sync+0xbf/0x440 [zfs]
07:34:00:[ 4200.151553]  [<ffffffffa0777868>] spa_sync+0x388/0xb50 [zfs]
07:34:00:[ 4200.153602]  [<ffffffff810b174b>] ? autoremove_wake_function+0x2b/0x40
07:34:00:[ 4200.155664]  [<ffffffff81689c72>] ? mutex_lock+0x12/0x2f
07:34:00:[ 4200.157616]  [<ffffffffa07894e5>] txg_sync_thread+0x3c5/0x620 [zfs]
07:34:00:[ 4200.159653]  [<ffffffffa0789120>] ? txg_init+0x280/0x280 [zfs]
07:34:00:[ 4200.161605]  [<ffffffffa0674851>] thread_generic_wrapper+0x71/0x80 [spl]
07:34:00:[ 4200.163625]  [<ffffffffa06747e0>] ? __thread_exit+0x20/0x20 [spl]
07:34:00:[ 4200.165586]  [<ffffffff810b064f>] kthread+0xcf/0xe0
07:34:00:[ 4200.167428]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
07:34:00:[ 4200.169377]  [<ffffffff81696958>] ret_from_fork+0x58/0x90
07:34:00:[ 4200.171244]  [<ffffffff810b0580>] ? kthread_create_on_node+0x140/0x140
07:34:00:[ 4200.173334] INFO: task lctl:3231 blocked for more than 120 seconds.
07:34:00:[ 4200.175136] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
07:34:00:[ 4200.177040] lctl            D 0000000000000000     0  3231   3197 0x00000080
07:34:00:[ 4200.178923]  ffff88003d3db948 0000000000000082 ffff88001d2f9f60 ffff88003d3dbfd8
07:34:00:[ 4200.181132]  ffff88003d3dbfd8 ffff88003d3dbfd8 ffff88001d2f9f60 ffff88003b9c5b50
07:34:00:[ 4200.183052]  ffff88003b9c5a10 ffff88003b9c5b58 ffff88003b9c5a38 0000000000000000
07:34:00:[ 4200.185136] Call Trace:
07:34:00:[ 4200.186747]  [<ffffffff8168bac9>] schedule+0x29/0x70
07:34:00:[ 4200.188372]  [<ffffffffa067967d>] cv_wait_common+0x10d/0x130 [spl]
07:34:00:[ 4200.190210]  [<ffffffff810b1720>] ? wake_up_atomic_t+0x30/0x30
07:34:00:[ 4200.191823]  [<ffffffffa06796b5>] __cv_wait+0x15/0x20 [spl]
07:34:00:[ 4200.193657]  [<ffffffffa07886a3>] txg_wait_open+0xb3/0xf0 [zfs]
07:34:00:[ 4200.195240]  [<ffffffffa0743b0d>] dmu_tx_wait+0x34d/0x360 [zfs]
07:34:00:[ 4200.196968]  [<ffffffffa0743bb1>] dmu_tx_assign+0x91/0x4b0 [zfs]
07:34:00:[ 4200.198521]  [<ffffffffa0f8c05a>] osd_trans_start+0xaa/0x3c0 [osd_zfs]
07:34:00:[ 4200.200290]  [<ffffffffa108e2db>] ofd_trans_start+0x6b/0xe0 [ofd]
07:34:00:[ 4200.202029]  [<ffffffffa109460b>] ofd_commitrw_write+0x94b/0x1c30 [ofd]
07:34:00:[ 4200.203800]  [<ffffffffa109850d>] ofd_commitrw+0x56d/0xa30 [ofd]
07:34:00:[ 4200.205564]  [<ffffffffa0b30679>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
07:34:00:[ 4200.207167]  [<ffffffffa11895ea>] echo_client_prep_commit.isra.49+0x69a/0xc30 [obdecho]
07:34:00:[ 4200.209038]  [<ffffffffa11930cf>] echo_client_iocontrol+0x9cf/0x1c50 [obdecho]
07:34:00:[ 4200.210635]  [<ffffffffa0b30679>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
07:34:00:[ 4200.212403]  [<ffffffffa0b1bb1e>] class_handle_ioctl+0x19de/0x2150 [obdclass]
07:34:00:[ 4200.213966]  [<ffffffff812a8558>] ? security_capable+0x18/0x20
07:34:00:[ 4200.215613]  [<ffffffffa0b005d2>] obd_class_ioctl+0xd2/0x170 [obdclass]
07:34:00:[ 4200.217176]  [<ffffffff81212025>] do_vfs_ioctl+0x2d5/0x4b0
07:34:00:[ 4200.218778]  [<ffffffff81691ef1>] ? __do_page_fault+0x171/0x450
07:34:00:[ 4200.220285]  [<ffffffffa0b1b258>] ? class_handle_ioctl+0x1118/0x2150 [obdclass]
07:34:00:[ 4200.222049]  [<ffffffff812122a1>] SyS_ioctl+0xa1/0xc0
07:34:00:[ 4200.223471]  [<ffffffff81696a09>] system_call_fastpath+0x16/0x1b
07:34:00:[ 4200.225197] INFO: task lctl:3232 blocked for more than 120 seconds. |
| Comment by Alex Zhuravlev [ 13/Mar/17 ] |
|
Sarah, I can't find ofd_destroy() in the traces, so I doubt this is |
| Comment by Alex Zhuravlev [ 16/Apr/20 ] |
|
a duplicate of |