Description
Grove's MDS ran into trouble yesterday when all available CPUs ended up spinning in lock_res_and_lock, presumably on the same ldlm_lock. The system was crashed to gather a core dump. When Lustre started back up, it went through recovery and then returned to the same state, with all cores spinning in lock_res_and_lock. Rebooting again and bringing Lustre up with the "abort_recov" option got the system back into a usable state.
The captured core dump shows all CPUs spinning with one of the following backtraces:
PID: 25091 TASK: ... CPU: 0 COMMAND: "mdt00_033"
...
--- <IRQ STACK> ---
#17 [...] ret_from_intr
[exception RIP: _spin_lock+30]
#18 [...] lock_res_and_lock+0x30 at ... [ptlrpc]
#19 [...] ldlm_handle_enqueue0+0x907 at ... [ptlrpc]
#20 [...] mdt_enqueue+0x46 at ... [mdt]
#21 [...] mdt_handle_common+0x648 at ... [mdt]
#22 [...] mds_regular_handle+0x15 at ... [mdt]
#23 [...] ptlrpc_server_handle_request+0x398 at ... [ptlrpc]
#24 [...] ptlrpc_main+0xace at ... [ptlrpc]
#25 [...] child_rip+0xa
PID: 25291 TASK: ... CPU: 1 COMMAND: "mdt00_088"
...
--- <NMI exception stack> ---
#6 [...] _spin_lock+0x1e
#7 [...] lock_res_and_lock+0x30 at ... [ptlrpc]
#8 [...] ldlm_lock_enqueue+0x11d at ... [ptlrpc]
#9 [...] ldlm_handle_enqueue+0x4ef at ... [ptlrpc]
#10 [...] mdt_enqueue+0x46 at ... [mdt]
#11 [...] mdt_handle_common+0x648 at ... [mdt]
#12 [...] mds_regular_handle+0x15 at ... [mdt]
#13 [...] ptlrpc_server_handle_request+0x398 at ... [ptlrpc]
#14 [...] ptlrpc_main+0xace at ... [ptlrpc]
#15 [...] child_rip+0xa
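To check that every CPU really falls into one of these two patterns, the backtraces can be tallied by the function that called lock_res_and_lock. The following is only a rough sketch in Python, not part of Lustre or the crash utility. It assumes the backtrace text has been saved from crash and that frame lines have the shape shown above ("#N [...] symbol+0xoffset ..."). The names callers_of and SAMPLE are hypothetical, and SAMPLE holds only the relevant frames copied from the two traces above.

```python
import re
from collections import Counter

# Assumed frame-line shape, taken from the traces above:
#   "#19 [...] ldlm_handle_enqueue0+0x907 at ... [ptlrpc]"
FRAME_RE = re.compile(r"^\s*#\d+\s+\[[^\]]*\]\s+(\w+)")
PID_RE = re.compile(r"^PID:\s*\d+")

def callers_of(bt_text, target="lock_res_and_lock"):
    """Count the symbols found in the frame directly after `target`.

    In these listings the caller is printed on the line after the callee,
    so the frame following `target` is the function that called it.
    """
    counts = Counter()
    prev_symbol = None
    for line in bt_text.splitlines():
        if PID_RE.match(line):
            prev_symbol = None  # a new task starts: reset frame tracking
            continue
        match = FRAME_RE.match(line)
        if match is None:
            continue  # skip stack markers, "[exception RIP: ...]", "..." lines
        symbol = match.group(1)
        if prev_symbol == target:
            counts[symbol] += 1
        prev_symbol = symbol
    return counts

# Hypothetical input: the relevant frames from the two traces above.
SAMPLE = """\
PID: 25091 TASK: ... CPU: 0 COMMAND: "mdt00_033"
#17 [...] ret_from_intr
[exception RIP: _spin_lock+30]
#18 [...] lock_res_and_lock+0x30 at ... [ptlrpc]
#19 [...] ldlm_handle_enqueue0+0x907 at ... [ptlrpc]
PID: 25291 TASK: ... CPU: 1 COMMAND: "mdt00_088"
#6 [...] _spin_lock+0x1e
#7 [...] lock_res_and_lock+0x30 at ... [ptlrpc]
#8 [...] ldlm_lock_enqueue+0x11d at ... [ptlrpc]
"""

if __name__ == "__main__":
    for symbol, count in callers_of(SAMPLE).most_common():
        print(symbol, count)
```

Run against the full set of per-CPU backtraces, a tally like this would show how the spinning threads split between the two call paths. It does not show which thread holds the lock.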