Details
-
Bug
-
Resolution: Unresolved
-
Minor
-
None
-
Lustre 2.11.0, Lustre 2.12.0, Lustre 2.10.3, Lustre 2.10.4, Lustre 2.10.5, Lustre 2.13.0, Lustre 2.10.6, Lustre 2.10.7, Lustre 2.15.2
-
None
-
ZFS
-
3
-
9223372036854775807
Description
obdfilter-survey test_1a hangs. The last thing seen in the client test_log is:
== obdfilter-survey test 1a: Object Storage Targets survey =========================================== 12:43:34 (1522586614)
Unable to detect ip address for host: ''
+ NETTYPE=tcp thrlo=8 nobjhi=1 thrhi=16 size=1024 case=disk rslt_loc=/tmp targets="10.9.6.25:lustre-OST0000 10.9.6.25:lustre-OST0001 10.9.6.25:lustre-OST0002 10.9.6.25:lustre-OST0003 10.9.6.25:lustre-OST0004 10.9.6.25:lustre-OST0005 10.9.6.25:lustre-OST0006 10.9.6.25:lustre-OST0007" /usr/bin/obdfilter-survey
Warning: Permanently added '10.9.6.25' (ECDSA) to the list of known hosts.
Sun Apr 1 12:43:46 UTC 2018 Obdfilter-survey for case=disk from trevis-45vm5.trevis.hpdd.intel.com
ost 8 sz 8388608K rsz 1024K obj 8 thr 64 write 8.68 [ 0.00, 74.99] rewrite 8.34 [ 0.00, 36.00] read 28.79 [ 0.00, 66.99]
ost 8 sz 8388608K rsz 1024K obj 8 thr 128 write 10.07 [ 0.00, 71.99] rewrite
The console logs for the OSS show that lctl is hung.
[45533.585190] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == obdfilter-survey test 1a: Object Storage Targets survey =========================================== 12:43:34 \(1522586614\)
[45533.839472] Lustre: DEBUG MARKER: == obdfilter-survey test 1a: Object Storage Targets survey =========================================== 12:43:34 (1522586614)
[45534.132589] LustreError: 0-0: lustre-OST0000: can't enable quota enforcement since space accounting isn't functional. Please run tunefs.lustre --quota on an unmounted filesystem if not done already
[45534.139448] LustreError: Skipped 15 previous similar messages
[45535.043556] Lustre: Echo OBD driver; http://www.lustre.org/
[47640.146175] INFO: task lctl:15920 blocked for more than 120 seconds.
[47640.149032] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[47640.151858] lctl D ffff880057268fd0 0 15920 15879 0x00000080
[47640.154680] Call Trace:
[47640.157152] [<ffffffffc08cef72>] ? zio_nowait+0xc2/0x160 [zfs]
[47640.159858] [<ffffffff816b40e9>] schedule+0x29/0x70
[47640.162388] [<ffffffff816b1a49>] schedule_timeout+0x239/0x2c0
[47640.164992] [<ffffffff81063f5e>] ? kvm_clock_get_cycles+0x1e/0x20
[47640.167590] [<ffffffff816b35ed>] io_schedule_timeout+0xad/0x130
[47640.170180] [<ffffffff810b4cb6>] ? prepare_to_wait_exclusive+0x56/0x90
[47640.172812] [<ffffffff816b3688>] io_schedule+0x18/0x20
[47640.175437] [<ffffffffc0726502>] cv_wait_common+0xb2/0x150 [spl]
[47640.178083] [<ffffffff810b4fc0>] ? wake_up_atomic_t+0x30/0x30
[47640.180646] [<ffffffffc07265f8>] __cv_wait_io+0x18/0x20 [spl]
[47640.183211] [<ffffffffc08ce833>] zio_wait+0x113/0x1c0 [zfs]
[47640.185714] [<ffffffffc08184d4>] dmu_buf_hold_array_by_dnode+0x154/0x4a0 [zfs]
[47640.188421] [<ffffffffc0818889>] dmu_buf_hold_array_by_bonus+0x69/0x90 [zfs]
[47640.191206] [<ffffffffc10d8782>] osd_bufs_get+0x422/0xd00 [osd_zfs]
[47640.193941] [<ffffffffc12160cb>] ofd_preprw+0x6bb/0x1170 [ofd]
[47640.196494] [<ffffffff8118bb0e>] ? __get_free_pages+0xe/0x40
[47640.198992] [<ffffffff811e146e>] ? kmalloc_order_trace+0x2e/0xa0
[47640.201483] [<ffffffff811e5011>] ? __kmalloc+0x211/0x230
[47640.203908] [<ffffffffc12fc134>] echo_client_prep_commit.isra.51+0x2f4/0xcb0 [obdecho]
[47640.206558] [<ffffffffc1303f7f>] echo_client_iocontrol+0x95f/0x1bb0 [obdecho]
[47640.209174] [<ffffffffc0c95229>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[47640.211622] [<ffffffffc0c7fa93>] class_handle_ioctl+0x18e3/0x1df0 [obdclass]
[47640.214026] [<ffffffff811b1f16>] ? do_read_fault.isra.44+0xe6/0x130
[47640.216353] [<ffffffff812b72be>] ? security_capable+0x1e/0x20
[47640.218605] [<ffffffffc0c647f2>] obd_class_ioctl+0xd2/0x170 [obdclass]
[47640.220979] [<ffffffff81219e90>] do_vfs_ioctl+0x350/0x560
[47640.223400] [<ffffffff816bb521>] ? __do_page_fault+0x171/0x450
[47640.226015] [<ffffffff8121a141>] SyS_ioctl+0xa1/0xc0
[47640.228072] [<ffffffff816c0655>] ? system_call_after_swapgs+0xa2/0x146
[47640.230250] [<ffffffff816c0715>] system_call_fastpath+0x1c/0x21
[47640.232290] [<ffffffff816c0661>] ? system_call_after_swapgs+0xae/0x146
This stack trace is very similar to the ones seen in LU-6649 and LU-5775.
So far, this has only been seen in ZFS testing.
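When comparing this failure against the related tickets, it can help to pull the hung-task report out of an OSS console log mechanically. The following is a minimal sketch only, not part of the Lustre test framework: the function name find_hung_tasks and both regular expressions are assumptions, written against the message format shown in the console log above. Frames that the kernel marks with "?" are skipped as unreliable.

```python
import re

# Kernel hung-task line as it appears in the console log, e.g.
# "INFO: task lctl:15920 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)

# One call-trace frame, e.g. "[<ffffffffc08ce833>] zio_wait+0x113/0x1c0 [zfs]".
# A leading "?" marks a frame the kernel could not verify.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:[ \t]+\[(?P<mod>[A-Za-z_]\w*)\])?"
)


def find_hung_tasks(console_text):
    """Return one dict per hung-task report found in console_text.

    Hypothetical helper: frames are collected from the text between one
    report and the next (or the end of the log), so unrelated traces that
    follow a report would be attributed to it.
    """
    hits = list(HUNG_TASK_RE.finditer(console_text))
    reports = []
    for i, hit in enumerate(hits):
        end = hits[i + 1].start() if i + 1 < len(hits) else len(console_text)
        frames = [
            (f.group("sym"), f.group("mod"))
            for f in FRAME_RE.finditer(console_text, hit.end(), end)
            if not f.group("unreliable")
        ]
        reports.append({
            "comm": hit.group("comm"),
            "pid": int(hit.group("pid")),
            "secs": int(hit.group("secs")),
            "frames": frames,
        })
    return reports
```

Run against the console log above, the reliable frames would include zio_wait [zfs], osd_bufs_get [osd_zfs] and echo_client_prep_commit.isra.51 [obdecho], which is the path to compare with the traces in LU-6649 and LU-5775.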
Logs for this failure are at:
2.11.0-RC3 - https://testing.hpdd.intel.com/test_sets/33a1e65c-35be-11e8-8f8a-52540065bddc
2.11.0-RC1 - https://testing.hpdd.intel.com/test_sessions/854854b1-df4e-4c3a-82ee-67efd0b2e5da
2.10.3 - https://testing.hpdd.intel.com/test_sets/e688805a-2cf4-11e8-b74b-52540065bddc
2.10.3 - https://testing.hpdd.intel.com/test_sessions/32fecab7-704a-44b5-9645-5b9c10bc41e9
Attachments
Issue Links
- is related to
-
LU-16395 obdfilter-survey: Error: 'Timeout occurred after 342 minutes, last suite running was obdfilter-survey'
- Open
- is related to
-
LU-6649 obdfilter-survey test_1a: lctl in D state
- Open
-
LU-11707 obdfilter-survey test 1c hangs with lctl blocked
- Open