[LU-2404] Many threads stuck on cfs_down_read of lov->lo_type_guard Created: 28/Nov/12 Updated: 03/Dec/12 Resolved: 03/Dec/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Prakash Surya (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | sequoia | | |
| Attachments: | |
| Issue Links: | |
| Severity: | 3 |
| Rank (Obsolete): | 5698 |
| Description |
|
We've had reports on Sequoia of many clients getting stuck during reads. I had a chance to dump the stacks on a client in this state and saw many threads that appeared to be stuck in cfs_down_read on lov->lo_type_guard. Here's an example stack:
2012-11-28 09:33:05.886131 {DefaultControlEventListener} [mmcs]{716}.0.0: sysiod D 00000fffa66863e0 0 8855 3105 0x00000002
2012-11-28 09:33:05.886181 {DefaultControlEventListener} [mmcs]{716}.0.0: Call Trace:
2012-11-28 09:33:05.886232 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7160] [c0000003ec8e71f0] 0xc0000003ec8e71f0 (unreliable)
2012-11-28 09:33:05.886282 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7330] [c000000000008de0] .__switch_to+0xc4/0x100
2012-11-28 09:33:05.886333 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e73c0] [c00000000042b0e0] .schedule+0x858/0x9c0
2012-11-28 09:33:05.886384 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7670] [c00000000042dcac] .rwsem_down_failed_common+0x270/0x2b8
2012-11-28 09:33:05.886435 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7730] [c00000000042dd60] .rwsem_down_read_failed+0x2c/0x44
2012-11-28 09:33:05.886486 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e77d0] [c00000000042cee8] .down_read+0x30/0x44
2012-11-28 09:33:05.886537 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7840] [80000000051460d8] .lov_lsm_addref+0x48/0x200 [lov]
2012-11-28 09:33:05.886587 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e78e0] [8000000005146e94] .lov_io_init+0x84/0x160 [lov]
2012-11-28 09:33:05.886638 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7980] [800000000247aea4] .cl_io_init0+0x104/0x260 [obdclass]
2012-11-28 09:33:05.886689 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7a30] [800000000695a09c] .ll_file_io_generic+0x11c/0x670 [lustre]
2012-11-28 09:33:05.886740 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7b30] [800000000695b134] .ll_file_aio_read+0x1d4/0x3a0 [lustre]
2012-11-28 09:33:05.886790 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7c00] [800000000695b450] .ll_file_read+0x150/0x320 [lustre]
2012-11-28 09:33:05.886841 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7ce0] [c0000000000d21a0] .vfs_read+0xd0/0x1c4
2012-11-28 09:33:05.886893 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7d80] [c0000000000d2390] .SyS_read+0x54/0x98
2012-11-28 09:33:05.886943 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7e30] [c000000000000580] syscall_exit+0x0/0x2c
My initial guess is that another thread is holding the write lock on that semaphore, but I haven't been able to pin down a thread holding the write side from the stacks. We also don't have access to crash, and kdump isn't enabled, so I can't inspect the semaphore directly. I've attached the stacks for all processes on the system (sysrq-t); unfortunately, I forgot to dump the running processes (sysrq-l). |
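For context on the hypothesis above, here is a minimal userspace sketch (not Lustre code): a pthread rwlock stands in for the kernel rw_semaphore lo_type_guard, and the function names layout_change_path and reader_io_path are made up for illustration. It shows how a single writer that stalls while holding the write side leaves every subsequent reader blocked in the read-acquire call, which is the same shape as many sysiod threads piling up in cfs_down_read via lov_lsm_addref().

```c
/* Illustrative only: pthread rwlock in place of the kernel rw_semaphore;
 * thread/function names are hypothetical. */
#include <pthread.h>
#include <unistd.h>

static pthread_rwlock_t type_guard = PTHREAD_RWLOCK_INITIALIZER;

/* Analogue of a layout-change path: takes the write side and then
 * stalls (e.g. waiting on I/O or a server reply) while holding it. */
static void *layout_change_path(void *arg)
{
    (void)arg;
    pthread_rwlock_wrlock(&type_guard);
    sleep(60);                      /* writer stalls with the lock held */
    pthread_rwlock_unlock(&type_guard);
    return NULL;
}

/* Analogue of lov_lsm_addref() in the read path: every reader now
 * blocks on the read acquire, as the sysiod stacks show in down_read(). */
static void *reader_io_path(void *arg)
{
    (void)arg;
    pthread_rwlock_rdlock(&type_guard);  /* <- readers pile up here */
    /* ... perform the read ... */
    pthread_rwlock_unlock(&type_guard);
    return NULL;
}

int main(void)
{
    pthread_t w, r[8];
    pthread_create(&w, NULL, layout_change_path, NULL);
    sleep(1);                       /* let the writer win the lock first */
    for (int i = 0; i < 8; i++)
        pthread_create(&r[i], NULL, reader_io_path, NULL);
    pthread_join(w, NULL);
    for (int i = 0; i < 8; i++)
        pthread_join(r[i], NULL);
    return 0;
}
```

In this shape the blocked readers never make progress on their own, so the key diagnostic step is finding whichever thread holds (or is queued for) the write side in the sysrq-t dump.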
| Comments |
| Comment by Peter Jones [ 29/Nov/12 ] |
|
Alex, what do you think about this one? |
| Comment by Peter Jones [ 30/Nov/12 ] |
|
Jinshan, Andreas suggested that you should review this one. Peter |
| Comment by Jinshan Xiong (Inactive) [ 30/Nov/12 ] |
|
This problem should have been fixed at . Let's leave this ticket open until it's verified in the next test. |
| Comment by Prakash Surya (Inactive) [ 03/Dec/12 ] |
|
Thanks Jinshan. I'll pull that into our branch. I don't have a solid reproducer for this issue, so I'm OK if you want to close it. It can always be reopened if it is seen with the fix applied. |
| Comment by Peter Jones [ 03/Dec/12 ] |
|
ok thanks Prakash! |