[LU-6649] obdfilter-survey test_1a: lctl in D state Created: 26/May/15  Updated: 25/Mar/19

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0, Lustre 2.10.0, Lustre 2.11.0, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.10.7
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Maloo Assignee: WC Triage
Resolution: Unresolved Votes: 0
Labels: zfs
Environment:

lustre-master build #3029


Issue Links:
Related
is related to LU-10872 obdfilter-survey test 1a hangs in lctl Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

This issue was created by maloo for sarah_lw <wei3.liu@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/71df9008-fe72-11e4-a865-5254006e85c2.

The sub-test test_1a failed with the following error:

test failed to respond and timed out

Similar to LU-5775.
OST console:

12:57:43:Lustre: DEBUG MARKER: == obdfilter-survey test 1a: Object Storage Targets survey == 12:00:21 (1432036821)
12:57:43:Lustre: DEBUG MARKER: lctl dl | grep obdfilter
12:57:43:Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep tcp | cut -f 1 -d '@'
12:57:43:INFO: task lctl:13285 blocked for more than 120 seconds.
12:57:43:      Tainted: P           ---------------    2.6.32-504.16.2.el6_lustre.x86_64 #1
12:57:43:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
12:57:43:lctl          D 0000000000000000     0 13285  13277 0x00000080
12:57:43: ffff880070fe3768 0000000000000086 0000000000000000 ffffffff81064a2e
12:57:43: ffff8800532a8b10 ffffffff00000000 000014516be02fbb 0000000000000001
12:57:43: ffff880070fe3738 0000000101504d82 ffff8800541bc5f8 ffff880070fe3fd8
12:57:43:Call Trace:
12:57:43: [<ffffffff81064a2e>] ? try_to_wake_up+0x24e/0x3e0
12:57:43: [<ffffffff8109edfe>] ? prepare_to_wait_exclusive+0x4e/0x80
12:57:43: [<ffffffffa019e78d>] cv_wait_common+0x11d/0x130 [spl]
12:57:43: [<ffffffff8109ebb0>] ? autoremove_wake_function+0x0/0x40
12:57:43: [<ffffffffa019e7f5>] __cv_wait+0x15/0x20 [spl]
12:57:43: [<ffffffffa02556db>] txg_wait_open+0x8b/0xd0 [zfs]
12:57:43: [<ffffffffa0213f27>] dmu_tx_wait+0x3f7/0x400 [zfs]
12:57:43: [<ffffffffa02285da>] ? dsl_dir_tempreserve_space+0xca/0x190 [zfs]
12:57:43: [<ffffffffa0214121>] dmu_tx_assign+0xa1/0x570 [zfs]
12:57:43: [<ffffffffa1c51b3d>] osd_trans_start+0xed/0x430 [osd_zfs]
12:57:43: [<ffffffffa1af3f0c>] ofd_trans_start+0x7c/0x100 [ofd]
12:57:43: [<ffffffffa1afb7a3>] ofd_commitrw_write+0x543/0x1050 [ofd]
12:57:43: [<ffffffffa1afc862>] ofd_commitrw+0x5b2/0xb00 [ofd]
12:57:43: [<ffffffffa177211f>] echo_client_brw_ioctl+0xccf/0x1430 [obdecho]
12:57:43: [<ffffffffa177472b>] echo_client_iocontrol+0x64b/0x29e0 [obdecho]
12:57:43: [<ffffffff810b2a3d>] ? get_futex_key+0x18d/0x2d0
12:57:43: [<ffffffff81174f6c>] ? __kmalloc+0x21c/0x230
12:57:43: [<ffffffffa119ef91>] ? obd_ioctl_getdata+0xe1/0x1140 [obdclass]
12:57:43: [<ffffffffa11b703c>] class_handle_ioctl+0x163c/0x21c0 [obdclass]
12:57:43: [<ffffffff810b4d60>] ? do_futex+0x100/0xae0
12:57:43: [<ffffffffa119e2ab>] obd_class_ioctl+0x4b/0x190 [obdclass]
12:57:43: [<ffffffff811a3ed2>] vfs_ioctl+0x22/0xa0
12:57:43: [<ffffffff811a4074>] do_vfs_ioctl+0x84/0x580
12:57:43: [<ffffffff810b57bb>] ? sys_futex+0x7b/0x170
12:57:43: [<ffffffff811a45f1>] sys_ioctl+0x81/0xa0
12:57:43: [<ffffffff810e5f9e>] ? __audit_syscall_exit+0x25e/0x290
12:57:43: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
12:57:43:INFO: task lctl:13286 blocked for more than 120 seconds.
12:57:43:      Tainted: P           ---------------    2.6.32-504.16.2.el6_lustre.x86_64 #1
12:57:43:"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
12:57:43:lctl          D 0000000000000001     0 13286  13277 0x00000080
12:57:43: ffff8800477a5768 0000000000000086 0000000000000000 ffffffff81064a2e
12:57:43: ffff8800532a8b10 ffffffff00000000 0000146709f18046 0000000000000001
12:57:43: ffff8800477a5738 000000010151b82d ffff88006bee1ad8 ffff8800477a5fd8
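
When comparing console logs across the instances reported in this ticket, it can help to reduce each hung-task report to the task name, PID, and the reliable frames of its call trace. The following is a minimal Python sketch, not part of Lustre or its test framework. The function name parse_hung_tasks and both regular expressions are illustrative assumptions derived only from the line formats visible in the console output above.

```python
import re

# "INFO: task lctl:13285 blocked for more than 120 seconds."
HUNG_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")

# "[<ffffffffa02556db>] txg_wait_open+0x8b/0xd0 [zfs]"
# A leading "?" marks a frame the kernel unwinder considers unreliable.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def parse_hung_tasks(text):
    """Return one dict per hung-task report found in a console log.

    Each dict holds the command name, PID, timeout in seconds, and a list
    of (function, module) tuples for frames not marked with "?".
    module is None for frames in the core kernel.
    """
    tasks = []
    current = None
    for line in text.splitlines():
        hung = HUNG_RE.search(line)
        if hung:
            current = {
                "comm": hung.group(1),
                "pid": int(hung.group(2)),
                "timeout": int(hung.group(3)),
                "frames": [],
            }
            tasks.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            if frame.group(1):
                continue  # skip unreliable "?" frames
            current["frames"].append((frame.group(2), frame.group(3)))
    return tasks


# A few lines copied from the OST console output in this ticket.
SAMPLE = """\
12:57:43:INFO: task lctl:13285 blocked for more than 120 seconds.
12:57:43:Call Trace:
12:57:43: [<ffffffff81064a2e>] ? try_to_wake_up+0x24e/0x3e0
12:57:43: [<ffffffffa019e78d>] cv_wait_common+0x11d/0x130 [spl]
12:57:43: [<ffffffffa02556db>] txg_wait_open+0x8b/0xd0 [zfs]
12:57:43: [<ffffffff811a45f1>] sys_ioctl+0x81/0xa0
"""

if __name__ == "__main__":
    for task in parse_hung_tasks(SAMPLE):
        print(task["comm"], task["pid"], task["frames"])
```

Comparing the first few reliable frames per task is one way to tell the txg_wait_open hang in the description apart from hangs that block elsewhere.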


 Comments   
Comment by Saurabh Tandan (Inactive) [ 18/Dec/15 ]

Another instance for EL6.7 Server/EL6.7 Client - ZFS
Master, build# 3270
https://testing.hpdd.intel.com/test_sets/a16f9ef6-a275-11e5-bdef-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 19/Jan/16 ]

Another instance found for interop: 2.5.5 Server/EL6.7 Client
Server: 2.5.5, b2_5_fe/62
Client: master, build# 3303, RHEL 6.7
https://testing.hpdd.intel.com/test_sets/1676bc94-bb25-11e5-861c-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 04/Feb/16 ]

Another instance for FULL - EL6.7 Server/EL6.7 Client - ZFS, master, build# 3314.
https://testing.hpdd.intel.com/test_sets/a6829740-cb47-11e5-a59a-5254006e85c2

Another instance on master for FULL - EL7.1 Server/EL7.1 Client - ZFS, build# 3314
https://testing.hpdd.intel.com/test_sets/e76d64e2-cb88-11e5-b49e-5254006e85c2

Comment by Saurabh Tandan (Inactive) [ 10/Feb/16 ]

Another instance found for Full tag 2.7.66 - EL6.7 Server/EL6.7 Client - ZFS, build# 3314
https://testing.hpdd.intel.com/test_sets/a6829740-cb47-11e5-a59a-5254006e85c2

Another instance found for Full tag 2.7.66 - EL7.1 Server/EL7.1 Client - ZFS, build# 3314
https://testing.hpdd.intel.com/test_sets/e76d64e2-cb88-11e5-b49e-5254006e85c2

Another instance found for Full tag 2.7.66 - EL6.7 Server/SLES11 SP3 Client, build# 3316
https://testing.hpdd.intel.com/test_sets/fd4a8d5a-cce9-11e5-8b0e-5254006e85c2

Comment by Niu Yawei (Inactive) [ 25/Oct/16 ]

Hit on master: https://testing.hpdd.intel.com/test_sets/b809a044-99cd-11e6-a018-5254006e85c2

It failed on test_1c this time.

Comment by Niu Yawei (Inactive) [ 25/Oct/16 ]

I think the root cause is the same as in LU-5242.

Comment by James Casper [ 24/May/17 ]

2.9.57, b3575:
https://testing.hpdd.intel.com/test_sessions/edde2a3e-9ae8-434a-8170-b64e9e85529c

Comment by Sarah Liu [ 07/Jun/17 ]

I suspect the error found on master is the same as LU-9247

Comment by Sarah Liu [ 20/May/18 ]

+1 on b2_10 https://testing.hpdd.intel.com/test_sets/bea30518-5c17-11e8-b303-52540065bddc

Comment by James Nunez (Inactive) [ 14/Aug/18 ]

An updated stack trace for 2.10.5 RC1 is at https://testing.whamcloud.com/test_sets/0c4797ee-9bb9-11e8-8ee3-52540065bddc. The OSS console shows:

[35623.218415] Lustre: Echo OBD driver; http://www.lustre.org/
[37200.342923] INFO: task lctl:28554 blocked for more than 120 seconds.
[37200.343656] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[37200.344422] lctl            D ffff8f117c520000     0 28554  28547 0x00000080
[37200.345237] Call Trace:
[37200.345574]  [<ffffffffb8314029>] schedule+0x29/0x70
[37200.346111]  [<ffffffffb8311999>] schedule_timeout+0x239/0x2c0
[37200.346763]  [<ffffffffb7c6814e>] ? kvm_clock_get_cycles+0x1e/0x20
[37200.347409]  [<ffffffffb7cf7ed2>] ? ktime_get_ts64+0x52/0xf0
[37200.347992]  [<ffffffffb831353d>] io_schedule_timeout+0xad/0x130
[37200.348625]  [<ffffffffb7cbc1c6>] ? prepare_to_wait_exclusive+0x56/0x90
[37200.349268]  [<ffffffffb83135d8>] io_schedule+0x18/0x20
[37200.350017]  [<ffffffffc026b192>] cv_wait_common+0xb2/0x150 [spl]
[37200.350591]  [<ffffffffb7cbc610>] ? wake_up_atomic_t+0x30/0x30
[37200.351167]  [<ffffffffc026b268>] __cv_wait_io+0x18/0x20 [spl]
[37200.352006]  [<ffffffffc042c023>] zio_wait+0x113/0x1c0 [zfs]
[37200.352559]  [<ffffffffc03771f4>] dmu_buf_hold_array_by_dnode+0x154/0x4a0 [zfs]
[37200.353317]  [<ffffffffc03775a9>] dmu_buf_hold_array_by_bonus+0x69/0x90 [zfs]
[37200.354207]  [<ffffffffc10144f2>] osd_bufs_get+0x412/0xc60 [osd_zfs]
[37200.354857]  [<ffffffffc11517fb>] ofd_preprw+0x6bb/0x1170 [ofd]
[37200.355505]  [<ffffffffb7d9934e>] ? __get_free_pages+0xe/0x40
[37200.356074]  [<ffffffffb7df4f9e>] ? kmalloc_order_trace+0x2e/0xa0
[37200.356764]  [<ffffffffb7df8b41>] ? __kmalloc+0x211/0x230
[37200.357300]  [<ffffffffc122217a>] echo_client_prep_commit.isra.49+0x33a/0xc30 [obdecho]
[37200.358088]  [<ffffffffc1229ebf>] echo_client_iocontrol+0x95f/0x1be0 [obdecho]
[37200.359298]  [<ffffffffc0b7f7b9>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[37200.360060]  [<ffffffffc0b6a619>] class_handle_ioctl+0x1939/0x1dd0 [obdclass]
[37200.360728]  [<ffffffffb7dc7c3d>] ? handle_mm_fault+0x39d/0x9b0
[37200.361369]  [<ffffffffb7ed0b1e>] ? security_capable+0x1e/0x20
[37200.361938]  [<ffffffffc0b4f5d2>] obd_class_ioctl+0xd2/0x170 [obdclass]
[37200.362631]  [<ffffffffb7e30350>] do_vfs_ioctl+0x350/0x560
[37200.363176]  [<ffffffffb831b56c>] ? __do_page_fault+0x1bc/0x4f0
[37200.363843]  [<ffffffffb7e30601>] SyS_ioctl+0xa1/0xc0
[37200.364326]  [<ffffffffb83206d5>] ? system_call_after_swapgs+0xa2/0x146
[37200.364949]  [<ffffffffb8320795>] system_call_fastpath+0x1c/0x21
[37200.365593]  [<ffffffffb83206e1>] ? system_call_after_swapgs+0xae/0x146
[37200.366232] INFO: task lctl:28556 blocked for more than 120 seconds.
[37200.366905] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.