Description
Grove's MDS ran into trouble yesterday when all available CPUs ended up spinning in lock_res_and_lock, presumably on the same ldlm_lock. We crashed the system to gather a core dump. When Lustre started back up, it went through recovery and then returned to the same state, with all cores spinning in lock_res_and_lock. Rebooting again and bringing Lustre up with the "abort_recov" option got the system back into a usable state.
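For reference, the workaround mount might look roughly like the following. This is a sketch, not the exact command used on Grove: the device path and mount point are hypothetical placeholders, and the script only prints the command rather than running it.

```shell
# Hypothetical placeholders; substitute the real MDT device and mount point.
MDT_DEV=/dev/mdt_device
MDT_MNT=/mnt/mdt

# abort_recov tells the target to skip client recovery at mount time,
# so requests that clients would otherwise replay are dropped.
MOUNT_CMD="mount -t lustre -o abort_recov $MDT_DEV $MDT_MNT"

# Dry run: print the command for review instead of executing it.
echo "$MOUNT_CMD"
```

Skipping recovery evicts the clients that were waiting to replay, so it is a way to get the server back up, not a fix for the underlying spin.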
The captured core dump shows all CPUs spinning with one of the following back traces:
PID: 25091 TASK: ... CPU: 0 COMMAND: "mdt00_033"
...
--- <IRQ STACK> ---
#17 [...] ret_from_intr
    [exception RIP: _spin_lock+30]
#18 [...] lock_res_and_lock+0x30 at ... [ptlrpc]
#19 [...] ldlm_handle_enqueue0+0x907 at ... [ptlrpc]
#20 [...] mdt_enqueue+0x46 at ... [mdt]
#21 [...] mdt_handle_common+0x648 at ... [mdt]
#22 [...] mds_regular_handle+0x15 at ... [mdt]
#23 [...] ptlrpc_server_handle_request+0x398 at ... [ptlrpc]
#24 [...] ptlrpc_main+0xace at ... [ptlrpc]
#25 [...] child_rip+0xa

PID: 25291 TASK: ... CPU: 1 COMMAND: "mdt00_088"
...
--- <NMI exception stack> ---
#6 [...] _spin_lock+0x1e
#7 [...] lock_res_and_lock+0x30 at ... [ptlrpc]
#8 [...] ldlm_lock_enqueue+0x11d at ... [ptlrpc]
#9 [...] ldlm_handle_enqueue+0x4ef at ... [ptlrpc]
#10 [...] mdt_enqueue+0x46 at ... [mdt]
#11 [...] mdt_handle_common+0x648 at ... [mdt]
#12 [...] mds_regular_handle+0x15 at ... [mdt]
#13 [...] ptlrpc_server_handle_request+0x398 at ... [ptlrpc]
#14 [...] ptlrpc_main+0xace at ... [ptlrpc]
#15 [...] child_rip+0xa