[LU-3504] MDS: All cores spinning on ldlm lock in lock_res_and_lock - Whamcloud Community JIRA

Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: None
Affects Version/s: Lustre 2.4.0
Labels:
- sequoia
- zfs

Severity: 3
Rank (Obsolete): 8822

Description

Grove's MDS ran into trouble yesterday when all available CPUs ended up spinning in lock_res_and_lock, presumably on the same ldlm_lock. The system was crashed to gather a core dump. When Lustre started back up, it went through recovery and then got back into the same state, with all cores spinning in lock_res_and_lock. Rebooting again and bringing Lustre up with the "abort_recov" option got the system back into a usable state.

The captured core dump shows all CPUs spinning with one of the following backtraces:

PID: 25091 TASK: ... CPU: 0 COMMAND: "mdt00_033"                                
...                                                                                
--- <IRQ STACK> ---                                                                
#17 [...] ret_from_intr                                                            
    [exception RIP: _spin_lock+30]                                                 
#18 [...] lock_res_and_lock+0x30 at ... [ptlrpc]                                   
#19 [...] ldlm_handle_enqueue0+0x907 at ... [ptlrpc]                               
#20 [...] mdt_enqueue+0x46 at ... [mdt]                                            
#21 [...] mdt_handle_common+0x648 at ... [mdt]                                     
#22 [...] mds_regular_handle+0x15 at ... [mdt]                                     
#23 [...] ptlrpc_server_handle_request+0x398 at ... [ptlrpc]                       
#24 [...] ptlrpc_main+0xace at ... [ptlrpc]                                        
#25 [...] child_rip+0xa                                                            
                                                                                   
PID: 25291 TASK: ... CPU: 1 COMMAND: "mdt00_088"
...                                                                                
--- <NMI exception stack> ---                                                      
#6  [...] _spin_lock+0x1e                                                          
#7  [...] lock_res_and_lock+0x30 at ... [ptlrpc]                                   
#8  [...] ldlm_lock_enqueue+0x11d at ... [ptlrpc]                                  
#9  [...] ldlm_handle_enqueue+0x4ef at ... [ptlrpc]                                
#10 [...] mdt_enqueue+0x46 at ... [mdt]                                            
#11 [...] mdt_handle_common+0x648 at ... [mdt]                                     
#12 [...] mds_regular_handle+0x15 at ... [mdt]                                     
#13 [...] ptlrpc_server_handle_request+0x398 at ... [ptlrpc]                       
#14 [...] ptlrpc_main+0xace at ... [ptlrpc]                                        
#15 [...] child_rip+0xa
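
With many service threads in the dump, it can help to tally which function called lock_res_and_lock in each task rather than reading every trace by hand. The following is a minimal sketch, assuming the backtraces have been saved as plain text in the format shown above. The helper name `callers_of`, the regular expressions, and the `SAMPLE` text (trimmed from the two traces in this report) are illustrative assumptions, not part of crash or Lustre.

```python
import re
from collections import Counter

# Matches backtrace frame lines such as:
#   #18 [...] lock_res_and_lock+0x30 at ... [ptlrpc]
# and captures the function name (hypothetical pattern for this sketch).
FRAME_RE = re.compile(r"^\s*#\d+\s+\[[^\]]*\]\s+([A-Za-z_]\w*)")
# Matches the header line that starts each task's backtrace.
PID_RE = re.compile(r"^PID:\s*(\d+)")

# Trimmed copy of the two traces from this report, elisions kept as-is.
SAMPLE = """\
PID: 25091 TASK: ... CPU: 0 COMMAND: "mdt00_033"
--- <IRQ STACK> ---
#17 [...] ret_from_intr
    [exception RIP: _spin_lock+30]
#18 [...] lock_res_and_lock+0x30 at ... [ptlrpc]
#19 [...] ldlm_handle_enqueue0+0x907 at ... [ptlrpc]
#20 [...] mdt_enqueue+0x46 at ... [mdt]

PID: 25291 TASK: ... CPU: 1 COMMAND: "mdt00_088"
--- <NMI exception stack> ---
#6  [...] _spin_lock+0x1e
#7  [...] lock_res_and_lock+0x30 at ... [ptlrpc]
#8  [...] ldlm_lock_enqueue+0x11d at ... [ptlrpc]
#9  [...] ldlm_handle_enqueue+0x4ef at ... [ptlrpc]
"""


def callers_of(bt_text, target="lock_res_and_lock"):
    """Count how often each function appears as the direct caller of `target`."""
    counts = Counter()
    prev_was_target = False
    for line in bt_text.splitlines():
        if PID_RE.match(line):
            # A new task begins, so forget the previous task's frames.
            prev_was_target = False
            continue
        match = FRAME_RE.match(line)
        if not match:
            continue  # stack banners, "...", exception RIP lines, blanks
        func = match.group(1)
        if prev_was_target:
            # The frame listed right after `target` is its caller.
            counts[func] += 1
        prev_was_target = (func == target)
    return counts


if __name__ == "__main__":
    for func, n in callers_of(SAMPLE).most_common():
        print(f"{n:4d}  {func}")
```

Run against the full set of backtraces, a tally like this shows at a glance whether the spinning threads all enter lock_res_and_lock through the same enqueue paths seen in the two traces above.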

Issue Links

is related to:
- LU-4801 spin lock contention in lock_res_and_lock (Resolved)
- LU-2835 mds crash, cfs_hash_bd_del_locked()) ASSERTION( bd->bd_bucket->hsb_count > 0 ) failed (Resolved)

People

Assignee: Bruno Faccini (Inactive)
Reporter: Prakash Surya (Inactive)
Votes: 0
Watchers: 14

Dates

Created: 25/Jun/13 6:27 PM
Updated: 21/Mar/14 7:52 PM
Resolved: 29/Aug/13 1:49 PM