[LU-2404] Many threads stuck on cfs_down_read of lov->lo_type_guard Created: 28/Nov/12  Updated: 03/Dec/12  Resolved: 03/Dec/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Prakash Surya (Inactive) Assignee: Jinshan Xiong (Inactive)
Resolution: Duplicate Votes: 0
Labels: sequoia

Attachments: File RB2-ID-J04.log    
Issue Links:
Related
is related to LU-1876 Layout Lock Server Patch Landings to ... Resolved
Severity: 3
Rank (Obsolete): 5698

 Description   

We've had reports on Sequoia of many clients getting stuck during reads. I had a chance to dump the stacks on a client in this state and saw many threads that appeared to be stuck in cfs_down_read() on lov->lo_type_guard. Here's an example stack:

2012-11-28 09:33:05.886131 {DefaultControlEventListener} [mmcs]{716}.0.0: sysiod        D 00000fffa66863e0     0  8855   3105 0x00000002
2012-11-28 09:33:05.886181 {DefaultControlEventListener} [mmcs]{716}.0.0: Call Trace:
2012-11-28 09:33:05.886232 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7160] [c0000003ec8e71f0] 0xc0000003ec8e71f0 (unreliable)
2012-11-28 09:33:05.886282 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7330] [c000000000008de0] .__switch_to+0xc4/0x100
2012-11-28 09:33:05.886333 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e73c0] [c00000000042b0e0] .schedule+0x858/0x9c0
2012-11-28 09:33:05.886384 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7670] [c00000000042dcac] .rwsem_down_failed_common+0x270/0x2b8
2012-11-28 09:33:05.886435 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7730] [c00000000042dd60] .rwsem_down_read_failed+0x2c/0x44
2012-11-28 09:33:05.886486 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e77d0] [c00000000042cee8] .down_read+0x30/0x44
2012-11-28 09:33:05.886537 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7840] [80000000051460d8] .lov_lsm_addref+0x48/0x200 [lov]
2012-11-28 09:33:05.886587 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e78e0] [8000000005146e94] .lov_io_init+0x84/0x160 [lov]
2012-11-28 09:33:05.886638 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7980] [800000000247aea4] .cl_io_init0+0x104/0x260 [obdclass]
2012-11-28 09:33:05.886689 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7a30] [800000000695a09c] .ll_file_io_generic+0x11c/0x670 [lustre]
2012-11-28 09:33:05.886740 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7b30] [800000000695b134] .ll_file_aio_read+0x1d4/0x3a0 [lustre]
2012-11-28 09:33:05.886790 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7c00] [800000000695b450] .ll_file_read+0x150/0x320 [lustre]
2012-11-28 09:33:05.886841 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7ce0] [c0000000000d21a0] .vfs_read+0xd0/0x1c4
2012-11-28 09:33:05.886893 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7d80] [c0000000000d2390] .SyS_read+0x54/0x98
2012-11-28 09:33:05.886943 {DefaultControlEventListener} [mmcs]{716}.0.0: [c0000003ec8e7e30] [c000000000000580] syscall_exit+0x0/0x2c

My initial guess is that another thread is holding a write lock on that semaphore, but I haven't been able to pin down a thread holding the write lock by looking at the stacks. We also don't have access to crash, and kdump isn't enabled, so I can't inspect the semaphore directly.
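
To make that hypothesis concrete, here is a minimal userspace model (my own sketch, not the Lustre code): a pthread_rwlock_t stands in for the lo_type_guard rw_semaphore, one thread takes it as a writer and then stalls (as a layout-change path might while waiting on something else), and reader threads then block in rdlock the same way the sysiod threads above block in down_read(). The stuck_writer/reader names and the 30-second stall are illustrative only.

    /*
     * Hedged userspace model of the hypothesis above; assumption only,
     * not Lustre source. Build: cc -pthread rwsem_model.c -o rwsem_model
     */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_rwlock_t type_guard = PTHREAD_RWLOCK_INITIALIZER;

    /* Models a path that takes the lock as a writer and then blocks on
     * something else (here: sleep) without releasing it. */
    static void *stuck_writer(void *arg)
    {
            pthread_rwlock_wrlock(&type_guard);
            printf("writer: holding type_guard, now stalled\n");
            sleep(30);                      /* stand-in for waiting on some other resource */
            pthread_rwlock_unlock(&type_guard);
            return NULL;
    }

    /* Models the lov_lsm_addref()-style read path seen in the stack trace. */
    static void *reader(void *arg)
    {
            long id = (long)arg;

            printf("reader %ld: entering read path\n", id);
            pthread_rwlock_rdlock(&type_guard);     /* blocks, like down_read() in the trace */
            printf("reader %ld: got the lock\n", id);
            pthread_rwlock_unlock(&type_guard);
            return NULL;
    }

    int main(void)
    {
            pthread_t w, r[4];
            long i;

            pthread_create(&w, NULL, stuck_writer, NULL);
            sleep(1);                       /* let the writer win the lock first */
            for (i = 0; i < 4; i++)
                    pthread_create(&r[i], NULL, reader, (void *)i);

            pthread_join(w, NULL);          /* readers only proceed once the writer releases */
            for (i = 0; i < 4; i++)
                    pthread_join(r[i], NULL);
            return 0;
    }

If the sysrq-t dump contains a thread sitting in rwsem_down_write_failed, or one that took the write side and is blocked elsewhere, that would match this pattern.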

I've attached the stacks for all processes on the system (sysrq-t); unfortunately, I forgot to dump the running processes (sysrq-l).



 Comments   
Comment by Peter Jones [ 29/Nov/12 ]

Alex, what do you think about this one?

Comment by Peter Jones [ 30/Nov/12 ]

Jinshan

Andreas suggested that you should review this one.

Peter

Comment by Jinshan Xiong (Inactive) [ 30/Nov/12 ]

This problem should have been fixed by LU-1876; the fix landed as commit ecaba99677b28536f9c376b2b835b554a7792668.

Let's leave this ticket open until the fix is verified in the next test run.

Comment by Prakash Surya (Inactive) [ 03/Dec/12 ]

Thanks Jinshan. I'll pull that into our branch. I don't have a solid reproducer for this issue, so I'm OK if you want to close it. It can always be reopened if it is seen with the fix applied.

Comment by Peter Jones [ 03/Dec/12 ]

ok thanks Prakash!
