Details
- Type: Bug
- Resolution: Fixed
- Priority: Critical
- Affects Version/s: Lustre 2.4.0, Lustre 2.4.1
- Environment: Lustre 2.4.0-RC1_3chaos (https://github.com/chaos/lustre/tree/2.4.0-RC1_3chaos), ZFS servers
- 3
- 8471
Description
It looks like LU-2576 is back again. The problem went away for a while, seemingly thanks to the patch from LU-2576. I note that a later fix for a stack overflow, LU-2859, changed the same lines that patch touched, so it may have reintroduced the problem.
We are seeing hangs during writes on BG/Q hardware. We find tasks that appear to be stuck sleeping indefinitely here:
2013-05-29 16:15:36.363547 sysiod S 00000fffa85363bc 0 4111 3070 0x00000000
2013-05-29 16:15:36.363582 Call Trace:
2013-05-29 16:15:36.363617 [c0000002e1e12440] [c000000302e95cc8] 0xc000000302e95cc8 (unreliable)
2013-05-29 16:15:36.363653 [c0000002e1e12610] [c000000000008de0] .__switch_to+0xc4/0x100
2013-05-29 16:15:36.363688 [c0000002e1e126a0] [c00000000044dc68] .schedule+0x858/0x9c0
2013-05-29 16:15:36.363723 [c0000002e1e12950] [80000000004820a0] .cfs_waitq_wait+0x10/0x30 [libcfs]
2013-05-29 16:15:36.363758 [c0000002e1e129c0] [80000000015b3ccc] .osc_enter_cache+0xb6c/0x1410 [osc]
2013-05-29 16:15:36.363793 [c0000002e1e12ba0] [80000000015bbf30] .osc_queue_async_io+0xcd0/0x2690 [osc]
2013-05-29 16:15:36.363828 [c0000002e1e12db0] [8000000001598598] .osc_page_cache_add+0xf8/0x2a0 [osc]
2013-05-29 16:15:36.363863 [c0000002e1e12e70] [8000000000a04248] .cl_page_cache_add+0xf8/0x420 [obdclass]
2013-05-29 16:15:36.363898 [c0000002e1e12fa0] [800000000179ed28] .lov_page_cache_add+0xc8/0x340 [lov]
2013-05-29 16:15:36.363934 [c0000002e1e13070] [8000000000a04248] .cl_page_cache_add+0xf8/0x420 [obdclass]
2013-05-29 16:15:36.363968 [c0000002e1e131a0] [8000000001d2ac74] .vvp_io_commit_write+0x464/0x910 [lustre]
2013-05-29 16:15:36.364003 [c0000002e1e132c0] [8000000000a1df6c] .cl_io_commit_write+0x11c/0x2d0 [obdclass]
2013-05-29 16:15:36.364038 [c0000002e1e13380] [8000000001cebc00] .ll_commit_write+0x120/0x3e0 [lustre]
2013-05-29 16:15:36.364074 [c0000002e1e13450] [8000000001d0f134] .ll_write_end+0x34/0x80 [lustre]
2013-05-29 16:15:36.364109 [c0000002e1e134e0] [c000000000097238] .generic_file_buffered_write+0x1f4/0x388
2013-05-29 16:15:36.364143 [c0000002e1e13620] [c000000000097928] .__generic_file_aio_write+0x374/0x3d8
2013-05-29 16:15:36.364178 [c0000002e1e13720] [c000000000097a04] .generic_file_aio_write+0x78/0xe8
2013-05-29 16:15:36.364213 [c0000002e1e137d0] [8000000001d2df00] .vvp_io_write_start+0x170/0x3b0 [lustre]
2013-05-29 16:15:36.364248 [c0000002e1e138a0] [8000000000a1849c] .cl_io_start+0xcc/0x220 [obdclass]
2013-05-29 16:15:36.364283 [c0000002e1e13940] [8000000000a202a4] .cl_io_loop+0x194/0x2c0 [obdclass]
2013-05-29 16:15:36.364317 [c0000002e1e139f0] [8000000001ca0780] .ll_file_io_generic+0x4f0/0x850 [lustre]
2013-05-29 16:15:36.364352 [c0000002e1e13b30] [8000000001ca0f64] .ll_file_aio_write+0x1d4/0x3a0 [lustre]
2013-05-29 16:15:36.364387 [c0000002e1e13c00] [8000000001ca1280] .ll_file_write+0x150/0x320 [lustre]
2013-05-29 16:15:36.364422 [c0000002e1e13ce0] [c0000000000d4328] .vfs_write+0xd0/0x1c4
2013-05-29 16:15:36.364458 [c0000002e1e13d80] [c0000000000d4518] .SyS_write+0x54/0x98
2013-05-29 16:15:36.364492 [c0000002e1e13e30] [c000000000000580] syscall_exit+0x0/0x2c
This was with Lustre 2.4.0-RC1_3chaos.
Attachments
Issue Links
Activity
Comment:
Hi Team,
Clients are in a hang state, and we are seeing these corresponding errors on the clients:
[Thu Oct 20 04:03:17 2022] LustreError: 11-0: data-MDT0000-mdc-ffff9f51f6751000: operation mds_close to node 172.27.0.45@o2ib failed: rc = -107
[Thu Oct 20 04:03:17 2022] Lustre: data-MDT0000-mdc-ffff9f51f6751000: Connection to data-MDT0000 (at 172.27.0.45@o2ib) was lost; in progress operations using this service will wait for recovery to complete
[Thu Oct 20 04:03:17 2022] LustreError: Skipped 2 previous similar messages
[Thu Oct 20 04:03:19 2022] LustreError: 11-0: data-MDT0000-mdc-ffff9f51f6751000: operation ldlm_enqueue to node 172.27.0.45@o2ib failed: rc = -107
[Thu Oct 20 04:03:19 2022] LustreError: Skipped 1 previous similar message
[Thu Oct 20 04:03:20 2022] LustreError: 11-0: data-MDT0000-mdc-ffff9f51f6751000: operation ldlm_enqueue to node 172.27.0.45@o2ib failed: rc = -107
[Thu Oct 20 04:03:20 2022] LustreError: Skipped 2 previous similar messages
[Thu Oct 20 04:07:05 2022] LustreError: 9131:0:(client.c:3067:ptlrpc_replay_interpret()) @@@ status 301, old was 0 req@ffff9f3c2ff59680 x1747146287852800/t347910890929(347910890929) o101->data-MDT0000-mdc-ffff9f51f6751000@172.27.0.45@o2ib:12/10 lens 864/568 e 0 to 0 dl 1666257015 ref 2 fl Interpret:RP/4/0 rc 301/301
[Thu Oct 20 05:56:33 2022] LustreError: 65849:0:(file.c:4601:ll_inode_revalidate_fini()) data: revalidate FID [0x200000007:0x1:0x0] error: rc = -4
Regards, Hithesh Kumar
Fix Version/s | New: Lustre 2.4.2 [ 10605 ]
Fix Version/s | Original: Lustre 2.4.0 [ 10154 ]
Resolution | New: Fixed [ 1 ]
Status | Original: Open [ 1 ] | New: Resolved [ 5 ]
Fix Version/s | New: Lustre 2.5.0 [ 10295 ]
Attachment | New: rzuseqio13_drop_caches.txt.bz2 [ 13061 ]
Attachment | New: rzuseqio14_drop_caches.txt.bz2 [ 13062 ]
Attachment | New: rzuseqio15_drop_caches.txt.bz2 [ 13063 ]
Attachment | New: rzuseqio16_drop_caches.txt.bz2 [ 13064 ]
Attachment | New: rzuseqio15_console.txt.bz2 [ 13038 ]
Labels | New: mq313
Affects Version/s | New: Lustre 2.4.1 [ 10294 ]
Priority | Original: Blocker [ 1 ] | New: Critical [ 2 ]