Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Lustre 2.4.0, Lustre 2.4.1
-
Lustre 2.4.0-RC1_3chaos. https://github.com/chaos/lustre/tree/2.4.0-RC1_3chaos, ZFS servers
-
3
-
8471
Description
It looks like LU-2576 is back again. The problem went away for a while, seemingly thanks to the patch from LU-2576. I note that a later fix for stack overflow in LU-2859 changed the same lines where the fix was applied, so perhaps that reintroduced the problem?
We are seeing hangs during writes on BG/Q hardware. We find tasks that appear to be stuck sleeping indefinitely here:
2013-05-29 16:15:36.363547 sysiod S 00000fffa85363bc 0 4111 3070 0x00000000 2013-05-29 16:15:36.363582 Call Trace: 2013-05-29 16:15:36.363617 [c0000002e1e12440] [c000000302e95cc8] 0xc000000302e95cc8 (unreliable) 2013-05-29 16:15:36.363653 [c0000002e1e12610] [c000000000008de0] .__switch_to+0xc4/0x100 2013-05-29 16:15:36.363688 [c0000002e1e126a0] [c00000000044dc68] .schedule+0x858/0x9c0 2013-05-29 16:15:36.363723 [c0000002e1e12950] [80000000004820a0] .cfs_waitq_wait+0x10/0x30 [libcfs] 2013-05-29 16:15:36.363758 [c0000002e1e129c0] [80000000015b3ccc] .osc_enter_cache+0xb6c/0x1410 [osc] 2013-05-29 16:15:36.363793 [c0000002e1e12ba0] [80000000015bbf30] .osc_queue_async_io+0xcd0/0x2690 [osc] 2013-05-29 16:15:36.363828 [c0000002e1e12db0] [8000000001598598] .osc_page_cache_add+0xf8/0x2a0 [osc] 2013-05-29 16:15:36.363863 [c0000002e1e12e70] [8000000000a04248] .cl_page_cache_add+0xf8/0x420 [obdclass] 2013-05-29 16:15:36.363898 [c0000002e1e12fa0] [800000000179ed28] .lov_page_cache_add+0xc8/0x340 [lov] 2013-05-29 16:15:36.363934 [c0000002e1e13070] [8000000000a04248] .cl_page_cache_add+0xf8/0x420 [obdclass] 2013-05-29 16:15:36.363968 [c0000002e1e131a0] [8000000001d2ac74] .vvp_io_commit_write+0x464/0x910 [lustre] 2013-05-29 16:15:36.364003 [c0000002e1e132c0] [8000000000a1df6c] .cl_io_commit_write+0x11c/0x2d0 [obdclass] 2013-05-29 16:15:36.364038 [c0000002e1e13380] [8000000001cebc00] .ll_commit_write+0x120/0x3e0 [lustre] 2013-05-29 16:15:36.364074 [c0000002e1e13450] [8000000001d0f134] .ll_write_end+0x34/0x80 [lustre] 2013-05-29 16:15:36.364109 [c0000002e1e134e0] [c000000000097238] .generic_file_buffered_write+0x1f4/0x388 2013-05-29 16:15:36.364143 [c0000002e1e13620] [c000000000097928] .__generic_file_aio_write+0x374/0x3d8 2013-05-29 16:15:36.364178 [c0000002e1e13720] [c000000000097a04] .generic_file_aio_write+0x78/0xe8 2013-05-29 16:15:36.364213 [c0000002e1e137d0] [8000000001d2df00] .vvp_io_write_start+0x170/0x3b0 [lustre] 2013-05-29 16:15:36.364248 [c0000002e1e138a0] [8000000000a1849c] .cl_io_start+0xcc/0x220 [obdclass] 2013-05-29 16:15:36.364283 [c0000002e1e13940] [8000000000a202a4] .cl_io_loop+0x194/0x2c0 [obdclass] 2013-05-29 16:15:36.364317 [c0000002e1e139f0] [8000000001ca0780] .ll_file_io_generic+0x4f0/0x850 [lustre] 2013-05-29 16:15:36.364352 [c0000002e1e13b30] [8000000001ca0f64] .ll_file_aio_write+0x1d4/0x3a0 [lustre] 2013-05-29 16:15:36.364387 [c0000002e1e13c00] [8000000001ca1280] .ll_file_write+0x150/0x320 [lustre] 2013-05-29 16:15:36.364422 [c0000002e1e13ce0] [c0000000000d4328] .vfs_write+0xd0/0x1c4 2013-05-29 16:15:36.364458 [c0000002e1e13d80] [c0000000000d4518] .SyS_write+0x54/0x98 2013-05-29 16:15:36.364492 [c0000002e1e13e30] [c000000000000580] syscall_exit+0x0/0x2c
This was with Lustre 2.4.0-RC1_3chaos.