Details
-
Bug
-
Resolution: Cannot Reproduce
-
Major
-
None
-
Lustre 2.4.0
-
Sequoia, lustre 2.3.58-5chaos (github.com/chaos/lustre)
-
3
-
6154
Description
On a number of Sequoia IONs (ppc64 lustre clients) we hit an mdc hang unde mdc_enqueue(). Many processes are stuck in do_path_loopup() like this:
2013-01-16 13:33:21.160679 {DefaultControlEventListener} [mmcs]{474}.0.0: sysiod D 00000fffaea576bc 0 15111 3083 0x00000002 2013-01-16 13:33:21.160730 {DefaultControlEventListener} [mmcs]{474}.0.0: Call Trace: 2013-01-16 13:33:21.160782 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416aa30] [8000000004012b20] nscheds+0x21dc/0xfffffffffffec004 [ko2iblnd] (unreliable) 2013-01-16 13:33:21.160835 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416ac00] [c000000000009b2c] .__switch_to+0xc4/0x100 2013-01-16 13:33:21.160886 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416ac90] [c00000000042a418] .schedule+0x7d4/0x944 2013-01-16 13:33:21.160937 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416af40] [c00000000042b4a8] .__mutex_lock_slowpath+0x208/0x390 2013-01-16 13:33:21.160988 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b040] [c00000000042bf44] .mutex_lock+0x38/0x58 2013-01-16 13:33:21.161040 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b0c0] [8000000005ac1830] .mdc_enqueue+0x6e0/0x1c60 [mdc] 2013-01-16 13:33:21.161092 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b280] [8000000005ac3060] .mdc_intent_lock+0x2b0/0x820 [mdc] 2013-01-16 13:33:21.161158 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b400] [8000000007411f64] .lmv_intent_open+0x4e4/0x1940 [lmv] 2013-01-16 13:33:21.161232 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b590] [80000000074136dc] .lmv_intent_lock+0x31c/0x4b0 [lmv] 2013-01-16 13:33:21.161314 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b690] [80000000068d3a3c] .ll_revalidate_it+0x41c/0x2220 [lustre] 2013-01-16 13:33:21.161410 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b7f0] [80000000068d59ec] .ll_revalidate_nd+0x1ac/0x560 [lustre] 2013-01-16 13:33:21.161539 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b8b0] [c0000000000dc318] .do_lookup+0x180/0x218 2013-01-16 13:33:21.161626 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416b970] [c0000000000dee8c] .__link_path_walk+0xb90/0x11d4 2013-01-16 13:33:21.161694 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416ba90] [c0000000000df830] .path_walk+0x98/0x148 2013-01-16 13:33:21.161752 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416bb30] [c0000000000dfa34] .do_path_lookup+0x38/0x78 2013-01-16 13:33:21.161817 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416bbc0] [c0000000000e08a4] .do_filp_open+0x11c/0xadc 2013-01-16 13:33:21.161880 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416bd80] [c0000000000cf498] .do_sys_open+0x8c/0x18c 2013-01-16 13:33:21.161942 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e416be30] [c000000000000580] syscall_exit+0x0/0x2c
because this processes is sleeping while holding the lock:
2013-01-16 13:33:21.078061 {DefaultControlEventListener} [mmcs]{474}.0.0: sysiod S 00000fff965b7148 0 15097 3083 0x00000000 2013-01-16 13:33:21.078112 {DefaultControlEventListener} [mmcs]{474}.0.0: Call Trace: 2013-01-16 13:33:21.078163 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae2630] [c000000000027d1c] .resched_task+0x9c/0xb4 (unreliable) 2013-01-16 13:33:21.078214 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae2800] [c000000000009b2c] .__switch_to+0xc4/0x100 2013-01-16 13:33:21.078266 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae2890] [c00000000042a418] .schedule+0x7d4/0x944 2013-01-16 13:33:21.078317 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae2b40] [8000000000ae2090] .cfs_waitq_wait+0x10/0x30 [libcfs] 2013-01-16 13:33:21.078368 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae2bb0] [8000000003b56c3c] .ldlm_completion_ast+0x6dc/0xe30 [ptlrpc] 2013-01-16 13:33:21.078420 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae2cf0] [8000000003b542c4] .ldlm_cli_enqueue_fini+0xa34/0x16b0 [ptlrpc] 2013-01-16 13:33:21.078472 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae2e20] [8000000003b553b4] .ldlm_cli_enqueue+0x474/0x970 [ptlrpc] 2013-01-16 13:33:21.078524 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae2f60] [8000000005ac18ac] .mdc_enqueue+0x75c/0x1c60 [mdc] 2013-01-16 13:33:21.078576 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3120] [80000000073ffbf4] .lmv_enqueue+0x504/0x19f0 [lmv] 2013-01-16 13:33:21.078627 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3280] [80000000068de4a4] .ll_get_dir_page+0x354/0x1ee0 [lustre] 2013-01-16 13:33:21.078679 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3450] [8000000006956a28] .do_statahead_enter+0x6d8/0x21f0 [lustre] 2013-01-16 13:33:21.078730 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3610] [800000000693c84c] .ll_lookup_it+0xb8c/0xfc0 [lustre] 2013-01-16 13:33:21.078782 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3770] [800000000693cd50] .ll_lookup_nd+0xd0/0x580 [lustre] 2013-01-16 13:33:21.078834 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3830] [c0000000000dc2b8] .do_lookup+0x120/0x218 2013-01-16 13:33:21.078885 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae38f0] [c0000000000dee8c] .__link_path_walk+0xb90/0x11d4 2013-01-16 13:33:21.078936 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3a10] [c0000000000df830] .path_walk+0x98/0x148 2013-01-16 13:33:21.078988 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3ab0] [c0000000000dfa34] .do_path_lookup+0x38/0x78 2013-01-16 13:33:21.079039 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3b40] [c0000000000e05e0] .user_path_at+0x68/0xc0 2013-01-16 13:33:21.079090 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3c90] [c0000000000d5e30] .vfs_fstatat+0x5c/0xa8 2013-01-16 13:33:21.079141 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3d30] [c0000000000d5f00] .SyS_newfstatat+0x2c/0x58 2013-01-16 13:33:21.079193 {DefaultControlEventListener} [mmcs]{474}.0.0: [c0000003e0ae3e30] [c000000000000580] syscall_exit+0x0/0x2c
It could be that this situation was a result of an eviction event. But note also that the client claimed to be fully recovered with the mds. See attached console log with sysrq-t output.
Also note that sending a "kill -9" to process 15097 did break the system out of its hang.