Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version/s: Lustre 2.11.0, Lustre 2.10.2, Lustre 2.10.3, Lustre 2.10.7, Lustre 2.12.3
- Environment: ZFS 0.7.x
- Severity: 3
Description
ost-pools test 22 hangs when testing with ZFS 0.7.0. The last lines in the test_log before the hang are:
loop for testpool2 complete
total: 100 open/close in 0.09 seconds: 1061.34 ops/second
CMD: onyx-35vm9 lctl pool_remove lustre.testpool OST[0001-0001/1]
onyx-35vm9: OST lustre-OST0001_UUID removed from pool lustre.testpool
Pool testpool, iteration 2
CMD: onyx-35vm9 lctl pool_remove lustre.testpool OST[0001-0001/1]
onyx-35vm9: OST lustre-OST0001_UUID removed from pool lustre.testpool
Pool testpool, iteration 3
CMD: onyx-35vm9 lctl pool_remove lustre.testpool OST[0001-0001/1]
onyx-35vm9: OST lustre-OST0001_UUID removed from pool lustre.testpool
Pool testpool, iteration 4
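The "Pool testpool, iteration N" messages come from the test's add_loop() helper, which repeatedly adds and removes OST0001 from the pool in the background while files are being created. A minimal standalone sketch of that add/remove cycle, mirroring the lctl commands visible in the log above (a hypothetical reproduction loop, not the exact test-framework helper):

    # Hammer a pool with add/remove cycles, as add_loop does.
    # "lustre" is the fsname and "testpool" the pool name from the log.
    for i in $(seq 1 10); do
            lctl pool_add lustre.testpool "OST[0000-0001/1]"
            lctl pool_remove lustre.testpool "OST[0001-0001/1]"
            echo "Pool testpool, iteration $i"
    done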
In the MDS console logs, we see:
04:17:47:[11669.022658] Lustre: DEBUG MARKER: lctl pool_remove lustre.testpool OST[0001-0001/1]
04:17:47:[11677.206802] Lustre: DEBUG MARKER: lctl pool_add lustre.testpool OST[0000-0001/1]
04:17:47:[11687.391683] Lustre: DEBUG MARKER: lctl pool_remove lustre.testpool OST[0001-0001/1]
04:17:47:[11693.557227] WARNING: Pool 'lustre-mdt1' has encountered an uncorrectable I/O failure and has been suspended.
04:17:47:[11693.557227]
04:17:47:[11697.575911] Lustre: DEBUG MARKER: lctl pool_add lustre.testpool OST[0000-0001/1]
04:17:47:[15221.223579] SysRq : Show State
04:17:47:[15221.224071] task PC stack pid father
04:17:47:[15221.224729] systemd S 0000000000000000 0 1 0 0x00000000
04:17:47:[15221.225617] ffff88007c7bbdb8 0000000000000082 ffff88007c7b0000 ffff88007c7bbfd8
04:17:47:[15221.226469] ffff88007c7bbfd8 ffff88007c7bbfd8 ffff88007c7b0000 0000000000000000
04:17:47:[15221.227325] 0000000000000000 ffff8800369d1a20 ffff88007c7b0000 0000000000000000
04:17:47:[15221.228217] Call Trace:
04:17:47:[15221.228581] [<ffffffff8168ca49>] schedule+0x29/0x70
04:17:47:[15221.229094] [<ffffffff8168bb4d>] schedule_hrtimeout_range_clock+0x12d/0x150
04:17:47:[15221.229864] [<ffffffff812486f9>] ? ep_scan_ready_list.isra.7+0x1b9/0x1f0
04:17:47:[15221.230669] [<ffffffff8168bb83>] schedule_hrtimeout_range+0x13/0x20
04:17:47:[15221.231363] [<ffffffff8124898e>] ep_poll+0x23e/0x360
04:17:47:[15221.231927] [<ffffffff81200873>] ? __fput+0x183/0x260
04:17:47:[15221.232481] [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
04:17:47:[15221.233121] [<ffffffff81249bfd>] SyS_epoll_wait+0xed/0x120
04:17:47:[15221.233737] [<ffffffff81697a49>] system_call_fastpath+0x16/0x1b
…
04:17:47:[15222.536161] cfs_rh_00 S ffff88007a8667b0 0 2617 2 0x00000080
04:17:47:[15222.536916] ffff8800793ebe28 0000000000000046 ffff88000001bec0 ffff8800793ebfd8
04:17:47:[15222.537791] ffff8800793ebfd8 ffff8800793ebfd8 ffff88000001bec0 ffff88007a866780
04:17:47:[15222.538646] ffff88007a866798 ffff88007a866790 0000000000000000 ffff88007a8667b0
04:17:47:[15222.539495] Call Trace:
04:17:47:[15222.539749] [<ffffffff8168ca49>] schedule+0x29/0x70
04:17:47:[15222.540295] [<ffffffffa05cab58>] cfs_wi_scheduler+0x2f8/0x440 [libcfs]
04:17:47:[15222.541047] [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
04:17:47:[15222.541647] [<ffffffffa05ca860>] ? cfs_wi_schedule+0x190/0x190 [libcfs]
04:17:47:[15222.542384] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
04:17:47:[15222.542905] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
04:17:47:[15222.543561] [<ffffffff81697998>] ret_from_fork+0x58/0x90
04:17:47:[15222.544161] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
04:17:47:[15222.544827] cfs_rh_01 S ffff88007a8667b0 0 2618 2 0x00000080
04:17:47:[15222.545646] ffff88007b4ffe28 0000000000000046 ffff880000018fb0 ffff88007b4fffd8
04:17:47:[15222.546465] ffff88007b4fffd8 ffff88007b4fffd8 ffff880000018fb0 ffff88007a866780
04:17:47:[15222.547284] ffff88007a866798 ffff88007a866790 0000000000000000 ffff88007a8667b0
04:17:47:[15222.548152] Call Trace:
04:17:47:[15222.548437] [<ffffffff8168ca49>] schedule+0x29/0x70
04:17:47:[15222.549020] [<ffffffffa05cab58>] cfs_wi_scheduler+0x2f8/0x440 [libcfs]
04:17:47:[15222.549696] [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
04:17:47:[15222.550348] [<ffffffffa05ca860>] ? cfs_wi_schedule+0x190/0x190 [libcfs]
04:17:47:[15222.551049] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
04:17:47:[15222.551554] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
04:17:47:[15222.552281] [<ffffffff81697998>] ret_from_fork+0x58/0x90
04:17:47:[15222.552858] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
04:17:47:[15222.553513] spl_system_task S ffff8800791e43c0 0 3362 2 0x00000080
04:17:47:[15222.554288] ffff88007bae3e00 0000000000000046 ffff88007a24af10 ffff88007bae3fd8
04:17:47:[15222.555149] ffff88007bae3fd8 ffff88007bae3fd8 ffff88007a24af10 ffff8800791e4300
04:17:47:[15222.556011] ffff88007a9c6de0 ffff8800791e4390 ffff8800366fbd30 ffff8800791e43c0
04:17:47:[15222.556870] Call Trace:
04:17:47:[15222.557125] [<ffffffff8168ca49>] schedule+0x29/0x70
04:17:47:[15222.557656] [<ffffffffa061f0a9>] taskq_thread+0x429/0x470 [spl]
04:17:47:[15222.558369] [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
04:17:47:[15222.558965] [<ffffffffa061ec80>] ? taskq_thread_spawn+0x60/0x60 [spl]
04:17:47:[15222.559630] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
04:17:47:[15222.560181] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
04:17:47:[15222.560842] [<ffffffff81697998>] ret_from_fork+0x58/0x90
04:17:47:[15222.561451] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
04:17:47:[15222.562133] spl_delay_taskq S ffff88007b8fe4c0 0 3363 2 0x00000080
04:17:47:[15222.562894] ffff88007a907e00 0000000000000046 ffff88007a24bec0 ffff88007a907fd8
04:17:47:[15222.563748] ffff88007a907fd8 ffff88007a907fd8 ffff88007a24bec0 ffff88007b8fe400
04:17:47:[15222.564605] ffff88007a9c6f00 ffff88007b8fe490 0000000000000000 ffff88007b8fe4c0
04:17:47:[15222.565474] Call Trace:
04:17:47:[15222.565733] [<ffffffff8168ca49>] schedule+0x29/0x70
04:17:47:[15222.566242] [<ffffffffa061f0a9>] taskq_thread+0x429/0x470 [spl]
04:17:47:[15222.566941] [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
04:17:47:[15222.567515] [<ffffffffa061ec80>] ? taskq_thread_spawn+0x60/0x60 [spl]
04:17:47:[15222.568229] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
04:17:47:[15222.568761] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
04:17:47:[15222.569429] [<ffffffff81697998>] ret_from_fork+0x58/0x90
04:17:47:[15222.570037] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
04:17:47:[15222.570703] spl_dynamic_tas S ffff88007b8fe2c0 0 3364 2 0x00000080
04:17:47:[15222.571482] ffff88007b28fe00 0000000000000046 ffff88007a248000 ffff88007b28ffd8
04:17:47:[15222.572328] ffff88007b28ffd8 ffff88007b28ffd8 ffff88007a248000 ffff88007b8fe200
04:17:47:[15222.573178] ffff88007a9c6c60 ffff88007b8fe290 ffff8800668c9670 ffff88007b8fe2c0
04:17:47:[15222.574045] Call Trace:
04:17:47:[15222.574314] [<ffffffff8168ca49>] schedule+0x29/0x70
04:17:47:[15222.574893] [<ffffffffa061f0a9>] taskq_thread+0x429/0x470 [spl]
04:17:47:[15222.575505] [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
04:17:47:[15222.576122] [<ffffffffa061ec80>] ? taskq_thread_spawn+0x60/0x60 [spl]
04:17:47:[15222.576785] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
04:17:47:[15222.577352] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
04:17:47:[15222.578036] [<ffffffff81697998>] ret_from_fork+0x58/0x90
04:17:47:[15222.578596] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
04:17:47:[15222.579309] spl_kmem_cache S ffff88007b8feec0 0 3365 2 0x00000080
04:17:47:[15222.580047] ffff88007ba3fe00 0000000000000046 ffff88007a24ce70 ffff88007ba3ffd8
04:17:47:[15222.580902] ffff88007ba3ffd8 ffff88007ba3ffd8 ffff88007a24ce70 ffff88007b8fee00
04:17:47:[15222.581750] ffff88007a9c6a20 ffff88007b8fee90 ffff8800387fe800 ffff88007b8feec0
04:17:47:[15222.582597] Call Trace:
04:17:47:[15222.582853] [<ffffffff8168ca49>] schedule+0x29/0x70
04:17:47:[15222.583358] [<ffffffffa061f0a9>] taskq_thread+0x429/0x470 [spl]
04:17:47:[15222.584057] [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
04:17:47:[15222.584645] [<ffffffffa061ec80>] ? taskq_thread_spawn+0x60/0x60 [spl]
04:17:47:[15222.585372] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
04:17:47:[15222.585891] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
04:17:47:[15222.586549] [<ffffffff81697998>] ret_from_fork+0x58/0x90
04:17:47:[15222.587153] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
04:17:47:[15222.587820] zvol S ffff88007b5abbc0 0 3366 2 0x00000080
04:17:47:[15222.588648] ffff88007b29be00 0000000000000046 ffff88007a249f60 ffff88007b29bfd8
04:17:47:[15222.589473] ffff88007b29bfd8 ffff88007b29bfd8 ffff88007a249f60 ffff88007b5abb00
04:17:47:[15222.590300] ffff8800364fa6c0 ffff88007b5abb90 0000000000000000 ffff88007b5abbc0
04:17:47:[15222.591167] Call Trace:
04:17:47:[15222.591437] [<ffffffff8168ca49>] schedule+0x29/0x70
04:17:47:[15222.592015] [<ffffffffa061f0a9>] taskq_thread+0x429/0x470 [spl]
04:17:47:[15222.592643] [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
04:17:47:[15222.593264] [<ffffffffa061ec80>] ? taskq_thread_spawn+0x60/0x60 [spl]
04:17:47:[15222.593951] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
04:17:47:[15222.594450] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
04:17:47:[15222.595156] [<ffffffff81697998>] ret_from_fork+0x58/0x90
04:17:47:[15222.595709] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
04:17:47:[15222.596425] arc_prune S ffff880079bc1ac0 0 3372 2 0x00000080
04:17:47:[15222.597166] ffff88007b8a7e00 0000000000000046 ffff88007a9d8000 ffff88007b8a7fd8
04:17:47:[15222.598033] ffff88007b8a7fd8 ffff88007b8a7fd8 ffff88007a9d8000 ffff880079bc1a00
04:17:47:[15222.598901] ffff8800796aad80 ffff880079bc1a90 0000000000000000 ffff880079bc1ac0
04:17:47:[15222.599755] Call Trace:
04:17:47:[15222.600008] [<ffffffff8168ca49>] schedule+0x29/0x70
04:17:47:[15222.600537] [<ffffffffa061f0a9>] taskq_thread+0x429/0x470 [spl]
04:17:47:[15222.601256] [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
04:17:47:[15222.601857] [<ffffffffa061ec80>] ? taskq_thread_spawn+0x60/0x60 [spl]
04:17:47:[15222.602514] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
04:17:47:[15222.603059] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
04:17:47:[15222.603722] [<ffffffff81697998>] ret_from_fork+0x58/0x90
04:17:47:[15222.604329] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
04:17:47:[15222.605006] arc_reclaim S ffffffffa0914a08 0 3373 2 0x00000080
04:17:47:[15222.605764] ffff88007b933cf8 0000000000000046 ffff88007a9dedd0 ffff88007b933fd8
04:17:47:[15222.606622] ffff88007b933fd8 ffff88007b933fd8 ffff88007a9dedd0 ffff88007b933dc8
04:17:47:[15222.607473] 00000000000186a0 0000000000000001 ffffffffa0914a88 ffffffffa0914a08
04:17:47:[15222.608292] Call Trace:
04:17:47:[15222.608635] [<ffffffff8168ca49>] schedule+0x29/0x70
04:17:47:[15222.609129] [<ffffffff8168bad2>] schedule_hrtimeout_range_clock+0xb2/0x150
04:17:47:[15222.609843] [<ffffffff810b4640>] ? hrtimer_get_res+0x50/0x50
04:17:47:[15222.610414] [<ffffffff8168bac6>] ? schedule_hrtimeout_range_clock+0xa6/0x150
04:17:47:[15222.611121] [<ffffffff8168bb83>] schedule_hrtimeout_range+0x13/0x20
04:17:47:[15222.611769] [<ffffffffa06236ab>] cv_timedwait_hires_common+0x12b/0x220 [spl]
04:17:47:[15222.612503] [<ffffffff810b1b20>] ? wake_up_atomic_t+0x30/0x30
04:17:47:[15222.613080] [<ffffffffa06237d4>] cv_timedwait_sig_hires+0x14/0x20 [spl]
04:17:47:[15222.613844] [<ffffffffa071ad31>] arc_reclaim_thread+0x121/0x260 [zfs]
04:17:47:[15222.614529] [<ffffffffa071ac10>] ? arc_shrink+0xc0/0xc0 [zfs]
04:17:47:[15222.615337] [<ffffffffa061df91>] thread_generic_wrapper+0x71/0x80 [spl]
04:17:47:[15222.616077] [<ffffffffa061df20>] ? __thread_exit+0x20/0x20 [spl]
04:17:47:[15222.616766] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
04:17:47:[15222.617319] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
04:17:47:[15222.618001] [<ffffffff81697998>] ret_from_fork+0x58/0x90
04:17:47:[15222.618548] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
04:17:47:[15222.619262] dbu_evict S ffff88003641fbc0 0 3374 2 0x00000080
04:17:47:[15222.620014] ffff88000005be00 0000000000000046 ffff88007a9daf10 ffff88000005bfd8
04:17:47:[15222.620866] ffff88000005bfd8 ffff88000005bfd8 ffff88007a9daf10 ffff88003641fb00
04:17:47:[15222.621722] ffff8800796aad20 ffff88003641fb90 ffff88004caa2070 ffff88003641fbc0
04:17:47:[15222.622573] Call Trace:
04:17:47:[15222.622826] [<ffffffff8168ca49>] schedule+0x29/0x70
04:17:47:[15222.623336] [<ffffffffa061f0a9>] taskq_thread+0x429/0x470 [spl]
04:17:47:[15222.624031] [<ffffffff810c54e0>] ? wake_up_state+0x20/0x20
04:17:47:[15222.624606] [<ffffffffa061ec80>] ? taskq_thread_spawn+0x60/0x60 [spl]
04:17:47:[15222.625315] [<ffffffff810b0a4f>] kthread+0xcf/0xe0
04:17:47:[15222.625832] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
04:17:47:[15222.626491] [<ffffffff81697998>] ret_from_fork+0x58/0x90
04:17:47:[15222.627083] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
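The "uncorrectable I/O failure and has been suspended" warning means ZFS suspended the pool backing the MDT. With the default failmode=wait pool property, all I/O to a suspended pool blocks until an administrator intervenes, so the node hangs silently rather than panicking; that matches the task dump above, where every SPL/ZFS thread is simply sleeping in schedule(). When reproducing by hand, the pool state and failure mode can be inspected with standard zpool commands (a hedged example; lustre-mdt1 is the pool name from the log):

    # Show pool health; a suspended pool is reported here
    zpool status lustre-mdt1

    # Default failmode is "wait": I/O blocks until the pool is cleared
    zpool get failmode lustre-mdt1

    # Optionally panic instead of hanging, so a crash dump is collected
    zpool set failmode=panic lustre-mdt1

    # Once the underlying device is healthy again, resume the pool
    zpool clear lustre-mdt1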
From ost-pools.sh:

test_22() {
	local POOL_ROOT=${POOL_ROOT:-$DIR/$tdir}
	[[ $OSTCOUNT -le 1 ]] && skip_env "Need at least 2 OSTs" && return

	local numfiles=100

	create_pool_nofail $POOL
	add_pool $POOL "OST0000" "$FSNAME-OST0000_UUID "
	create_pool_nofail $POOL2
	add_pool $POOL2 "OST0000" "$FSNAME-OST0000_UUID "

	add_loop $POOL 1 &
	add_loop $POOL2 2 &
	sleep 5
	create_dir $POOL_ROOT $POOL
	createmany -o $POOL_ROOT/${tfile} $numfiles ||
		error "createmany $POOL_ROOT/${tfile} failed!"
	wait

	return 0
}
run_test 22 "Simultaneous manipulation of a pool"
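The test runs two add_loop workers in the background while createmany creates files in a directory bound to the pool, so pool membership changes race with file creation on the MDS. To rerun just this subtest against a ZFS back end, the usual test-framework invocation should be something like the following (assuming an already configured Lustre test environment in lustre/tests):

    # Run only ost-pools test 22 with ZFS OSDs
    FSTYPE=zfs ONLY=22 bash ost-pools.sh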
There’s only one instance of this failure. Logs are at:
https://testing.hpdd.intel.com/test_sets/d0619996-799f-11e7-9b69-5254006e85c2
Attachments
Issue Links
- is duplicated by
  - LU-10342 sanity test_27F timeout (Resolved)
  - LU-10832 sanity-flr test_200: Timeout occurred after 69 mins (Resolved)
  - LU-11217 conf-sanity: test_32c client crashed (Resolved)
  - LU-12281 replay-single: ZFS mmp crashed (Resolved)
  - LU-9991 sanity-hsm test 31c fails with 'copytools failed to stop' (Closed)
  - LU-10109 sanityn test_55c: Timeout occurred after 183 mins (Closed)
  - LU-10585 parallel-scale test_compilebench: LustreError: 17907:0:(osc_cache.c:955:osc_extent_wait()) (Closed)
- is related to
  - LU-10956 sanity-pfl test_3: Kernel panic - not syncing: Pool has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic (Open)
  - LU-12258 sanity test_101d timeout when doing rolling upgrade OSS from 2.10.7 to 2.12.1 with ZFS (Open)
  - LU-11704 sanity-lfsck test 2a hangs with 'Pool 'lustre-mdt1' has encountered an uncorrectable I/O failure and has been suspended.' (Resolved)
  - LU-12393 insanity: Kernel panic - not syncing: Pool 'lustre-ost6' has encountered an uncorrectable I/O failure and the failure mode property for this pool is set to panic. (Resolved)
  - LU-10441 sanityn: test 36 timeout (Open)
  - LU-8619 Lustre to build against ZFS 0.7.0 (Resolved)