Lustre / LU-3504

MDS: All cores spinning on ldlm lock in lock_res_and_lock

Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.4.0
    • 3
    • 8822

    Description

      Grove's MDS had trouble yesterday when all available CPUs began spinning in lock_res_and_lock, presumably on the same ldlm_lock. We crashed the system to gather a core dump. When Lustre started back up, it went through recovery and then returned to the same state, with all cores spinning in lock_res_and_lock. Rebooting again and bringing Lustre up with the "abort_recov" option got the system back into a usable state.

      The captured core dump shows all CPUs spinning with one of the following back traces:

      PID: 25091 TASK: ... CPU: 0 COMMAND: "mdt00_033"                                
      ...                                                                                
      --- <IRQ STACK> ---                                                                
      #17 [...] ret_from_intr                                                            
          [exception RIP: _spin_lock+30]                                                 
      #18 [...] lock_res_and_lock+0x30 at ... [ptlrpc]                                   
      #19 [...] ldlm_handle_enqueue0+0x907 at ... [ptlrpc]                               
      #20 [...] mdt_enqueue+0x46 at ... [mdt]                                            
      #21 [...] mdt_handle_common+0x648 at ... [mdt]                                     
      #22 [...] mds_regular_handle+0x15 at ... [mdt]                                     
      #23 [...] ptlrpc_server_handle_request+0x398 at ... [ptlrpc]                       
      #24 [...] ptlrpc_main+0xace at ... [ptlrpc]                                        
      #25 [...] child_rip+0xa                                                            
                                                                                         
      PID: 25291 TASK: ... CPU: 1 COMMAND: "mdt00_088"
      ...                                                                                
      --- <NMI exception stack> ---                                                      
      #6  [...] _spin_lock+0x1e                                                          
      #7  [...] lock_res_and_lock+0x30 at ... [ptlrpc]                                   
      #8  [...] ldlm_lock_enqueue+0x11d at ... [ptlrpc]                                  
      #9  [...] ldlm_handle_enqueue+0x4ef at ... [ptlrpc]                                
      #10 [...] mdt_enqueue+0x46 at ... [mdt]                                            
      #11 [...] mdt_handle_common+0x648 at ... [mdt]                                     
      #12 [...] mds_regular_handle+0x15 at ... [mdt]                                     
      #13 [...] ptlrpc_server_handle_request+0x398 at ... [ptlrpc]                       
      #14 [...] ptlrpc_main+0xace at ... [ptlrpc]                                        
      #15 [...] child_rip+0xa
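
To check that every task in a dump really falls into one of these two shapes, the per-task back traces can be grouped by the function that called lock_res_and_lock. The following is a minimal sketch, not part of the original report. It assumes the traces have been saved as plain text in the layout shown above (a "PID:" header per task, then "#N [...] symbol+offset" frames, innermost first). The function name `callers_of` and the `SAMPLE` excerpt are illustrative only.

```python
import re
from collections import Counter

# A task header starts with "PID:", as in the traces above.
HEADER_RE = re.compile(r"^\s*PID:\s*\d+")
# A frame line looks like "#18 [...] lock_res_and_lock+0x30 at ... [ptlrpc]".
FRAME_RE = re.compile(r"^\s*#\d+\s+\[[^\]]*\]\s+(\w+)")


def callers_of(text, target="lock_res_and_lock"):
    """Count, once per task, which function called `target`."""
    counts = Counter()
    frames = []

    def flush():
        # Frames are listed innermost first, so the caller of a frame
        # is the frame that follows it in the list.
        for i, name in enumerate(frames[:-1]):
            if name == target:
                counts[frames[i + 1]] += 1
                break

    for line in text.splitlines():
        if HEADER_RE.match(line):
            flush()          # close out the previous task, if any
            frames = []
            continue
        m = FRAME_RE.match(line)
        if m:
            frames.append(m.group(1))
    flush()                  # close out the last task
    return counts


# Excerpt of the two traces above, shortened to the relevant frames.
SAMPLE = '''\
PID: 25091 TASK: ... CPU: 0 COMMAND: "mdt00_033"
#17 [...] ret_from_intr
    [exception RIP: _spin_lock+30]
#18 [...] lock_res_and_lock+0x30 at ... [ptlrpc]
#19 [...] ldlm_handle_enqueue0+0x907 at ... [ptlrpc]
PID: 25291 TASK: ... CPU: 1 COMMAND: "mdt00_088"
#6  [...] _spin_lock+0x1e
#7  [...] lock_res_and_lock+0x30 at ... [ptlrpc]
#8  [...] ldlm_lock_enqueue+0x11d at ... [ptlrpc]
'''

for caller, n in callers_of(SAMPLE).most_common():
    print(f"{n:4d}  {caller}")
```

A task whose trace never reaches lock_res_and_lock is simply not counted, so comparing the total against the number of CPUs gives a quick check on the "all cores" observation.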
      

            People

              bfaccini Bruno Faccini (Inactive)
              prakash Prakash Surya (Inactive)