Details
- Bug
- Resolution: Duplicate
- Blocker
- None
- Lustre 2.4.0
- 3
- 5698
Description
We've had reports on Sequoia of many clients getting stuck during reads. I had a chance to dump the stacks on a client in this state and saw many threads that appeared to be blocked in cfs_down_read() on lov->lo_type_guard. Here's an example stack:
2012-11-28 09:33:05.886131 {DefaultControlEventListener} [mmcs]{716}.0.0:
sysiod        D 00000fffa66863e0     0  8855   3105 0x00000002
Call Trace:
[c0000003ec8e7160] [c0000003ec8e71f0] 0xc0000003ec8e71f0 (unreliable)
[c0000003ec8e7330] [c000000000008de0] .__switch_to+0xc4/0x100
[c0000003ec8e73c0] [c00000000042b0e0] .schedule+0x858/0x9c0
[c0000003ec8e7670] [c00000000042dcac] .rwsem_down_failed_common+0x270/0x2b8
[c0000003ec8e7730] [c00000000042dd60] .rwsem_down_read_failed+0x2c/0x44
[c0000003ec8e77d0] [c00000000042cee8] .down_read+0x30/0x44
[c0000003ec8e7840] [80000000051460d8] .lov_lsm_addref+0x48/0x200 [lov]
[c0000003ec8e78e0] [8000000005146e94] .lov_io_init+0x84/0x160 [lov]
[c0000003ec8e7980] [800000000247aea4] .cl_io_init0+0x104/0x260 [obdclass]
[c0000003ec8e7a30] [800000000695a09c] .ll_file_io_generic+0x11c/0x670 [lustre]
[c0000003ec8e7b30] [800000000695b134] .ll_file_aio_read+0x1d4/0x3a0 [lustre]
[c0000003ec8e7c00] [800000000695b450] .ll_file_read+0x150/0x320 [lustre]
[c0000003ec8e7ce0] [c0000000000d21a0] .vfs_read+0xd0/0x1c4
[c0000003ec8e7d80] [c0000000000d2390] .SyS_read+0x54/0x98
[c0000003ec8e7e30] [c000000000000580] syscall_exit+0x0/0x2c
My initial guess is that another thread is holding the write lock on that semaphore, but looking through the stacks I haven't been able to pin down any thread holding it for write. We also don't have access to crash or kdump enabled, so I can't inspect the semaphore directly.
I've attached the stacks for all processes on the system (sysrq-t); unfortunately, I forgot to dump the backtraces of the currently running processes (sysrq-l).
Attachments
Issue Links
- is related to: LU-1876 Layout Lock Server Patch Landings to Master (Resolved)