[LU-6649] obdfilter-survey test_1a: lctl in D state Created: 26/May/15 Updated: 25/Mar/19 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0, Lustre 2.10.0, Lustre 2.11.0, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.10.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | zfs | ||
| Environment: |
lustre-master build #3029 |
||
| Issue Links: |
|
||||||||
| Severity: | 3 | ||||||||
| Rank (Obsolete): | 9223372036854775807 | ||||||||
| Description |
|
This issue was created by maloo for sarah_lw <wei3.liu@intel.com>
This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/71df9008-fe72-11e4-a865-5254006e85c2.
The sub-test test_1a failed with the following error:
test failed to respond and timed out
similar as
12:57:43:Lustre: DEBUG MARKER: == obdfilter-survey test 1a: Object Storage Targets survey == 12:00:21 (1432036821)
12:57:43:Lustre: DEBUG MARKER: lctl dl | grep obdfilter
12:57:43:Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d '@'
12:57:43:INFO: task lctl:13285 blocked for more than 120 seconds.
12:57:43: Tainted: P --------------- 2.6.32-504.16.2.el6_lustre.x86_64 #1
12:57:43:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
12:57:43:lctl D 0000000000000000 0 13285 13277 0x00000080
12:57:43: ffff880070fe3768 0000000000000086 0000000000000000 ffffffff81064a2e
12:57:43: ffff8800532a8b10 ffffffff00000000 000014516be02fbb 0000000000000001
12:57:43: ffff880070fe3738 0000000101504d82 ffff8800541bc5f8 ffff880070fe3fd8
12:57:43:Call Trace:
12:57:43: [<ffffffff81064a2e>] ? try_to_wake_up+0x24e/0x3e0
12:57:43: [<ffffffff8109edfe>] ? prepare_to_wait_exclusive+0x4e/0x80
12:57:43: [<ffffffffa019e78d>] cv_wait_common+0x11d/0x130 [spl]
12:57:43: [<ffffffff8109ebb0>] ? autoremove_wake_function+0x0/0x40
12:57:43: [<ffffffffa019e7f5>] __cv_wait+0x15/0x20 [spl]
12:57:43: [<ffffffffa02556db>] txg_wait_open+0x8b/0xd0 [zfs]
12:57:43: [<ffffffffa0213f27>] dmu_tx_wait+0x3f7/0x400 [zfs]
12:57:43: [<ffffffffa02285da>] ? dsl_dir_tempreserve_space+0xca/0x190 [zfs]
12:57:43: [<ffffffffa0214121>] dmu_tx_assign+0xa1/0x570 [zfs]
12:57:43: [<ffffffffa1c51b3d>] osd_trans_start+0xed/0x430 [osd_zfs]
12:57:43: [<ffffffffa1af3f0c>] ofd_trans_start+0x7c/0x100 [ofd]
12:57:43: [<ffffffffa1afb7a3>] ofd_commitrw_write+0x543/0x1050 [ofd]
12:57:43: [<ffffffffa1afc862>] ofd_commitrw+0x5b2/0xb00 [ofd]
12:57:43: [<ffffffffa177211f>] echo_client_brw_ioctl+0xccf/0x1430 [obdecho]
12:57:43: [<ffffffffa177472b>] echo_client_iocontrol+0x64b/0x29e0 [obdecho]
12:57:43: [<ffffffff810b2a3d>] ? get_futex_key+0x18d/0x2d0
12:57:43: [<ffffffff81174f6c>] ? __kmalloc+0x21c/0x230
12:57:43: [<ffffffffa119ef91>] ? obd_ioctl_getdata+0xe1/0x1140 [obdclass]
12:57:43: [<ffffffffa11b703c>] class_handle_ioctl+0x163c/0x21c0 [obdclass]
12:57:43: [<ffffffff810b4d60>] ? do_futex+0x100/0xae0
12:57:43: [<ffffffffa119e2ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
12:57:43: [<ffffffff811a3ed2>] vfs_ioctl+0x22/0xa0
12:57:43: [<ffffffff811a4074>] do_vfs_ioctl+0x84/0x580
12:57:43: [<ffffffff810b57bb>] ? sys_futex+0x7b/0x170
12:57:43: [<ffffffff811a45f1>] sys_ioctl+0x81/0xa0
12:57:43: [<ffffffff810e5f9e>] ? __audit_syscall_exit+0x25e/0x290
12:57:43: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
12:57:43:INFO: task lctl:13286 blocked for more than 120 seconds.
12:57:43: Tainted: P --------------- 2.6.32-504.16.2.el6_lustre.x86_64 #1
12:57:43:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
12:57:43:lctl D 0000000000000001 0 13286 13277 0x00000080
12:57:43: ffff8800477a5768 0000000000000086 0000000000000000 ffffffff81064a2e
12:57:43: ffff8800532a8b10 ffffffff00000000 0000146709f18046 0000000000000001
12:57:43: ffff8800477a5738 000000010151b82d ffff88006bee1ad8 ffff8800477a5fd8 |
| Comments |
| Comment by Saurabh Tandan (Inactive) [ 18/Dec/15 ] |
|
Another instance for EL6.7 Server/EL6.7 Client - ZFS |
| Comment by Saurabh Tandan (Inactive) [ 19/Jan/16 ] |
|
Another instance found for interop: 2.5.5 Server/EL6.7 Client |
| Comment by Saurabh Tandan (Inactive) [ 04/Feb/16 ] |
|
Another instance for FULL - EL6.7 Server/EL6.7 Client - ZFS, master, build# 3314.
Another instance on master for FULL - EL7.1 Server/EL7.1 Client - ZFS, build# 3314 |
| Comment by Saurabh Tandan (Inactive) [ 10/Feb/16 ] |
|
Another instance found for Full tag 2.7.66 - EL6.7 Server/EL6.7 Client - ZFS, build# 3314
Another instance found for Full tag 2.7.66 - EL7.1 Server/EL7.1 Client - ZFS, build# 3314
Another instance found for Full tag 2.7.66 - EL6.7 Server/SLES11 SP3 Client, build# 3316 |
| Comment by Niu Yawei (Inactive) [ 25/Oct/16 ] |
|
Hit on master: https://testing.hpdd.intel.com/test_sets/b809a044-99cd-11e6-a018-5254006e85c2
It failed on test_1c this time. |
| Comment by Niu Yawei (Inactive) [ 25/Oct/16 ] |
|
I think the root cause should be the same as |
| Comment by James Casper [ 24/May/17 ] |
|
2.9.57, b3575: |
| Comment by Sarah Liu [ 07/Jun/17 ] |
|
I suspect the error found on master is the same as |
| Comment by Sarah Liu [ 20/May/18 ] |
|
+1 on b2_10 https://testing.hpdd.intel.com/test_sets/bea30518-5c17-11e8-b303-52540065bddc |
| Comment by James Nunez (Inactive) [ 14/Aug/18 ] |
|
An updated stack trace for 2.10.5 RC1 at https://testing.whamcloud.com/test_sets/0c4797ee-9bb9-11e8-8ee3-52540065bddc. The OSS console has
[35623.218415] Lustre: Echo OBD driver; http://www.lustre.org/
[37200.342923] INFO: task lctl:28554 blocked for more than 120 seconds.
[37200.343656] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[37200.344422] lctl D ffff8f117c520000 0 28554 28547 0x00000080
[37200.345237] Call Trace:
[37200.345574] [<ffffffffb8314029>] schedule+0x29/0x70
[37200.346111] [<ffffffffb8311999>] schedule_timeout+0x239/0x2c0
[37200.346763] [<ffffffffb7c6814e>] ? kvm_clock_get_cycles+0x1e/0x20
[37200.347409] [<ffffffffb7cf7ed2>] ? ktime_get_ts64+0x52/0xf0
[37200.347992] [<ffffffffb831353d>] io_schedule_timeout+0xad/0x130
[37200.348625] [<ffffffffb7cbc1c6>] ? prepare_to_wait_exclusive+0x56/0x90
[37200.349268] [<ffffffffb83135d8>] io_schedule+0x18/0x20
[37200.350017] [<ffffffffc026b192>] cv_wait_common+0xb2/0x150 [spl]
[37200.350591] [<ffffffffb7cbc610>] ? wake_up_atomic_t+0x30/0x30
[37200.351167] [<ffffffffc026b268>] __cv_wait_io+0x18/0x20 [spl]
[37200.352006] [<ffffffffc042c023>] zio_wait+0x113/0x1c0 [zfs]
[37200.352559] [<ffffffffc03771f4>] dmu_buf_hold_array_by_dnode+0x154/0x4a0 [zfs]
[37200.353317] [<ffffffffc03775a9>] dmu_buf_hold_array_by_bonus+0x69/0x90 [zfs]
[37200.354207] [<ffffffffc10144f2>] osd_bufs_get+0x412/0xc60 [osd_zfs]
[37200.354857] [<ffffffffc11517fb>] ofd_preprw+0x6bb/0x1170 [ofd]
[37200.355505] [<ffffffffb7d9934e>] ? __get_free_pages+0xe/0x40
[37200.356074] [<ffffffffb7df4f9e>] ? kmalloc_order_trace+0x2e/0xa0
[37200.356764] [<ffffffffb7df8b41>] ? __kmalloc+0x211/0x230
[37200.357300] [<ffffffffc122217a>] echo_client_prep_commit.isra.49+0x33a/0xc30 [obdecho]
[37200.358088] [<ffffffffc1229ebf>] echo_client_iocontrol+0x95f/0x1be0 [obdecho]
[37200.359298] [<ffffffffc0b7f7b9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[37200.360060] [<ffffffffc0b6a619>] class_handle_ioctl+0x1939/0x1dd0 [obdclass]
[37200.360728] [<ffffffffb7dc7c3d>] ? handle_mm_fault+0x39d/0x9b0
[37200.361369] [<ffffffffb7ed0b1e>] ? security_capable+0x1e/0x20
[37200.361938] [<ffffffffc0b4f5d2>] obd_class_ioctl+0xd2/0x170 [obdclass]
[37200.362631] [<ffffffffb7e30350>] do_vfs_ioctl+0x350/0x560
[37200.363176] [<ffffffffb831b56c>] ? __do_page_fault+0x1bc/0x4f0
[37200.363843] [<ffffffffb7e30601>] SyS_ioctl+0xa1/0xc0
[37200.364326] [<ffffffffb83206d5>] ? system_call_after_swapgs+0xa2/0x146
[37200.364949] [<ffffffffb8320795>] system_call_fastpath+0x1c/0x21
[37200.365593] [<ffffffffb83206e1>] ? system_call_after_swapgs+0xae/0x146
[37200.366232] INFO: task lctl:28556 blocked for more than 120 seconds.
[37200.366905] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. |
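The two traces in this ticket block at different points in ZFS: the 2015 trace waits in txg_wait_open (reached via dmu_tx_assign on the write commit path), while the 2.10.5 RC1 trace waits in zio_wait (reached via dmu_buf_hold_array_by_dnode from ofd_preprw). When triaging further instances, it may help to extract the hung-task reports from a console log and compare the first ZFS frame of each. The following is only a minimal sketch, assuming the console lines look like the ones quoted above. The names parse_hung_tasks, first_frame_in and SAMPLE are hypothetical helpers for illustration and are not part of Lustre or its test framework.

```python
import re

# Kernel hung-task banner, e.g.
# "INFO: task lctl:13285 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")

# Stack frame, e.g. "[<ffffffffa02556db>] txg_wait_open+0x8b/0xd0 [zfs]".
# A leading "? " marks a frame the kernel unwinder is not sure about.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(\w+)\])?")


def parse_hung_tasks(lines):
    """Return one dict per hung-task report, keeping only reliable frames."""
    reports = []
    current = None
    for line in lines:
        banner = HUNG_RE.search(line)
        if banner:
            current = {
                "comm": banner.group(1),
                "pid": int(banner.group(2)),
                "timeout": int(banner.group(3)),
                "frames": [],
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        # Skip frames before the first banner and frames marked with "?".
        if frame and current is not None and frame.group(1) is None:
            current["frames"].append((frame.group(2), frame.group(3)))
    return reports


def first_frame_in(report, modules):
    """Return the innermost reliable function belonging to one of `modules`."""
    for func, module in report["frames"]:
        if module in modules:
            return func
    return None


# Excerpt copied from the description of this ticket.
SAMPLE = [
    "12:57:43:INFO: task lctl:13285 blocked for more than 120 seconds.",
    "12:57:43:Call Trace:",
    "12:57:43: [<ffffffff81064a2e>] ? try_to_wake_up+0x24e/0x3e0",
    "12:57:43: [<ffffffffa019e78d>] cv_wait_common+0x11d/0x130 [spl]",
    "12:57:43: [<ffffffffa019e7f5>] __cv_wait+0x15/0x20 [spl]",
    "12:57:43: [<ffffffffa02556db>] txg_wait_open+0x8b/0xd0 [zfs]",
    "12:57:43: [<ffffffffa0213f27>] dmu_tx_wait+0x3f7/0x400 [zfs]",
]
```

Calling parse_hung_tasks(SAMPLE) and then first_frame_in(report, {"zfs"}) on the result gives the ZFS function each lctl thread is blocked in, which is a quick way to tell a transaction-group wait from an I/O wait. Dropping the "?" frames is a design choice: they are stale stack entries and tend to obscure the real call chain.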