We also encountered this situation on one of our MDSes, which was not fully operational ("df" was working, but access to some sub-trees was hanging).
In our case, there were 4 threads eating 100% CPU but regularly re-schedule()'ing:
> 9735 2 18 ffff88187c3d87d0 RU 0.0 0 0 [ldlm_elt]
> 9770 2 12 ffff88205ada0100 RU 0.0 0 0 [ll_evictor]
> 35489 2 1 ffff881854ef2790 RU 0.0 0 0 [mdt_121]
> 46002 2 15 ffff88184cca97d0 RU 0.0 0 0 [mdt_445]
Since we could not find a way to recover gracefully from the situation, we decided to reboot the MDT and to force a crash dump, in which the thread stacks look the same as in our live Alt+SysRq+l attempts:
crash> bt 9735 9770 35489 46002
PID: 9735 TASK: ffff88187c3d87d0 CPU: 18 COMMAND: "ldlm_elt"
#0 [ffff88109c707e90] crash_nmi_callback at ffffffff8101fd06
#1 [ffff88109c707ea0] notifier_call_chain at ffffffff814837f5
#2 [ffff88109c707ee0] atomic_notifier_call_chain at ffffffff8148385a
#3 [ffff88109c707ef0] notify_die at ffffffff8108026e
#4 [ffff88109c707f20] do_nmi at ffffffff81481443
#5 [ffff88109c707f50] nmi at ffffffff81480d50
[exception RIP: cfs_hash_for_each_relax+193]
RIP: ffffffffa0410e11 RSP: ffff8818675e7b80 RFLAGS: 00000282
RAX: ffff88165f78567c RBX: ffff881020d2ccc0 RCX: 0000000000000005
RDX: 000000000000000d RSI: ffff8818675e7bd0 RDI: ffff881020d2ccc0
RBP: ffff8818675e7c10 R8: 0000000209bd7a5d R9: 6d68000000000000
R10: 6b40000000000000 R11: 0c204977264ccdad R12: 0000000000000000
R13: ffffffffa058e3c0 R14: 0000000000000001 R15: ffff881020d2cd40
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#6 [ffff8818675e7b80] cfs_hash_for_each_relax at ffffffffa0410e11 [libcfs]
#7 [ffff8818675e7c18] cfs_hash_for_each_empty at ffffffffa0412466 [libcfs]
#8 [ffff8818675e7c98] ldlm_cancel_locks_for_export at ffffffffa058a49f [ptlrpc]
#9 [ffff8818675e7ca8] server_disconnect_export at ffffffffa059a2b4 [ptlrpc]
#10 [ffff8818675e7d28] mdt_obd_disconnect at ffffffffa0a079eb [mdt]
#11 [ffff8818675e7e48] class_fail_export at ffffffffa04aacde [obdclass]
#12 [ffff8818675e7e98] expired_lock_main at ffffffffa05b0dd4 [ptlrpc]
#13 [ffff8818675e7f48] kernel_thread at ffffffff810041aa
PID: 9770 TASK: ffff88205ada0100 CPU: 12 COMMAND: "ll_evictor"
#0 [ffff8800450c7e90] crash_nmi_callback at ffffffff8101fd06
#1 [ffff8800450c7ea0] notifier_call_chain at ffffffff814837f5
#2 [ffff8800450c7ee0] atomic_notifier_call_chain at ffffffff8148385a
#3 [ffff8800450c7ef0] notify_die at ffffffff8108026e
#4 [ffff8800450c7f20] do_nmi at ffffffff81481443
#5 [ffff8800450c7f50] nmi at ffffffff81480d50
[exception RIP: cfs_hash_for_each_relax+168]
RIP: ffffffffa0410df8 RSP: ffff88205adc7b20 RFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff8810254ecd80 RCX: 0000000000000005
RDX: 0000000000000003 RSI: ffff88205adc7b70 RDI: ffff8810254ecd80
RBP: ffff88205adc7bb0 R8: 0000000208590aee R9: 2e90000000000000
R10: 7480000000000000 R11: 9ddca10a56b825d2 R12: 0000000000000000
R13: ffffffffa058e3c0 R14: 0000000000000001 R15: ffff8810254ece00
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#6 [ffff88205adc7b20] cfs_hash_for_each_relax at ffffffffa0410df8 [libcfs]
#7 [ffff88205adc7bb8] cfs_hash_for_each_empty at ffffffffa0412466 [libcfs]
#8 [ffff88205adc7c38] ldlm_cancel_locks_for_export at ffffffffa058a49f [ptlrpc]
#9 [ffff88205adc7c48] server_disconnect_export at ffffffffa059a2b4 [ptlrpc]
#10 [ffff88205adc7cc8] mdt_obd_disconnect at ffffffffa0a079eb [mdt]
#11 [ffff88205adc7de8] class_fail_export at ffffffffa04aacde [obdclass]
#12 [ffff88205adc7e38] ping_evictor_main at ffffffffa05e5a5d [ptlrpc]
#13 [ffff88205adc7f48] kernel_thread at ffffffff810041aa
PID: 35489 TASK: ffff881854ef2790 CPU: 1 COMMAND: "mdt_121"
#0 [ffff88089c407e90] crash_nmi_callback at ffffffff8101fd06
#1 [ffff88089c407ea0] notifier_call_chain at ffffffff814837f5
#2 [ffff88089c407ee0] atomic_notifier_call_chain at ffffffff8148385a
#3 [ffff88089c407ef0] notify_die at ffffffff8108026e
#4 [ffff88089c407f20] do_nmi at ffffffff81481443
#5 [ffff88089c407f50] nmi at ffffffff81480d50
[exception RIP: cfs_hash_for_each_relax+193]
RIP: ffffffffa0410e11 RSP: ffff88184e85b960 RFLAGS: 00000286
RAX: ffff8817f53bf10c RBX: ffff88105d484800 RCX: 0000000000000005
RDX: 000000000000001f RSI: ffff88184e85b9b0 RDI: ffff88105d484800
RBP: ffff88184e85b9f0 R8: 0000000208590aee R9: f188000000000000
R10: 8c40000000000000 R11: 9ddca109f53c3e31 R12: 0000000000000000
R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#6 [ffff88184e85b960] cfs_hash_for_each_relax at ffffffffa0410e11 [libcfs]
#7 [ffff88184e85b9f8] cfs_hash_for_each_empty at ffffffffa0412466 [libcfs]
#8 [ffff88184e85ba78] ldlm_cancel_locks_for_export at ffffffffa058a49f [ptlrpc]
#9 [ffff88184e85ba88] server_disconnect_export at ffffffffa059a2b4 [ptlrpc]
#10 [ffff88184e85bb08] mdt_obd_disconnect at ffffffffa0a079eb [mdt]
#11 [ffff88184e85bc28] target_handle_disconnect at ffffffffa05966a9 [ptlrpc]
#12 [ffff88184e85bc88] mdt_disconnect at ffffffffa09fe2a9 [mdt]
#13 [ffff88184e85bcd8] mdt_handle_common at ffffffffa09f9865 [mdt]
#14 [ffff88184e85bd58] mdt_regular_handle at ffffffffa09fa875 [mdt]
#15 [ffff88184e85bd68] ptlrpc_main at ffffffffa05e4829 [ptlrpc]
#16 [ffff88184e85bf48] kernel_thread at ffffffff810041aa
PID: 46002 TASK: ffff88184cca97d0 CPU: 15 COMMAND: "mdt_445"
#0 [ffff88189c4c7e90] crash_nmi_callback at ffffffff8101fd06
#1 [ffff88189c4c7ea0] notifier_call_chain at ffffffff814837f5
#2 [ffff88189c4c7ee0] atomic_notifier_call_chain at ffffffff8148385a
#3 [ffff88189c4c7ef0] notify_die at ffffffff8108026e
#4 [ffff88189c4c7f20] do_nmi at ffffffff81481443
#5 [ffff88189c4c7f50] nmi at ffffffff81480d50
[exception RIP: cfs_hash_hh_hhead]
RIP: ffffffffa040f280 RSP: ffff881849f6b958 RFLAGS: 00000246
RAX: ffffffffa0422200 RBX: ffff88106b5d5b00 RCX: 0000000000000005
RDX: 0000000000000002 RSI: ffff881849f6b9b0 RDI: ffff88106b5d5b00
RBP: ffff881849f6b9f0 R8: 00000002075042cd R9: 0858000000000000
R10: 42c0000000000000 R11: c0fc1c42c8d4010b R12: 0000000000000000
R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#6 [ffff881849f6b958] cfs_hash_hh_hhead at ffffffffa040f280 [libcfs]
#7 [ffff881849f6b958] cfs_hash_for_each_relax at ffffffffa0410e05 [libcfs]
#8 [ffff881849f6b9f8] cfs_hash_for_each_empty at ffffffffa0412466 [libcfs]
#9 [ffff881849f6ba78] ldlm_cancel_locks_for_export at ffffffffa058a49f [ptlrpc]
#10 [ffff881849f6ba88] server_disconnect_export at ffffffffa059a2b4 [ptlrpc]
#11 [ffff881849f6bb08] mdt_obd_disconnect at ffffffffa0a079eb [mdt]
#12 [ffff881849f6bc28] target_handle_disconnect at ffffffffa05966a9 [ptlrpc]
#13 [ffff881849f6bc88] mdt_disconnect at ffffffffa09fe2a9 [mdt]
#14 [ffff881849f6bcd8] mdt_handle_common at ffffffffa09f9865 [mdt]
#15 [ffff881849f6bd58] mdt_regular_handle at ffffffffa09fa875 [mdt]
#16 [ffff881849f6bd68] ptlrpc_main at ffffffffa05e4829 [ptlrpc]
#17 [ffff881849f6bf48] kernel_thread at ffffffff810041aa
crash>
crash> ps | grep UN
crash> log | less
crash> bt 35683
PID: 35683 TASK: ffff8810732200c0 CPU: 21 COMMAND: "mdt_315"
#0 [ffff881025bf7650] schedule at ffffffff8147dddc
#1 [ffff881025bf7718] cfs_waitq_wait at ffffffffa040175e [libcfs]
#2 [ffff881025bf7728] ldlm_completion_ast at ffffffffa05ad372 [ptlrpc]
#3 [ffff881025bf77f8] ldlm_cli_enqueue_local at ffffffffa05aca79 [ptlrpc]
#4 [ffff881025bf78b8] mdt_object_lock at ffffffffa09f544e [mdt]
#5 [ffff881025bf7978] mdt_getattr_name_lock at ffffffffa0a0122b [mdt]
#6 [ffff881025bf7a58] mdt_intent_getattr at ffffffffa0a0247a [mdt]
#7 [ffff881025bf7af8] mdt_intent_policy at ffffffffa09ff630 [mdt]
#8 [ffff881025bf7b68] ldlm_lock_enqueue at ffffffffa058eb8a [ptlrpc]
#9 [ffff881025bf7c08] ldlm_handle_enqueue0 at ffffffffa05b5767 [ptlrpc]
#10 [ffff881025bf7ca8] mdt_enqueue at ffffffffa09ff0ca [mdt]
#11 [ffff881025bf7cd8] mdt_handle_common at ffffffffa09f9865 [mdt]
#12 [ffff881025bf7d58] mdt_regular_handle at ffffffffa09fa875 [mdt]
#13 [ffff881025bf7d68] ptlrpc_main at ffffffffa05e4829 [ptlrpc]
#14 [ffff881025bf7f48] kernel_thread at ffffffff810041aa
crash>
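Thus my current opinion is that there is a problem (a loop?) in the hash list/structure that manages an export's locks: all 4 spinning threads are in cfs_hash_for_each_relax(), called from cfs_hash_for_each_empty() via ldlm_cancel_locks_for_export().
But I need to dig further into the crash dump before drawing a conclusion...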
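To make the suspicion concrete, here is a toy model in C of how a "repeat until the hash is empty" loop could spin forever while still rescheduling regularly. This is NOT the libcfs code. The names toy_hash, toy_pass and toy_for_each_empty are hypothetical, and the assumption that one entry can never be unlinked is only a guess at the failure mode, not something the dump has confirmed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an export's lock hash (hypothetical, not libcfs). */
struct toy_hash {
    size_t nitems; /* entries currently in the table */
    size_t stuck;  /* entries the callback never manages to unlink (assumption) */
};

/* One full pass over the table: every removable entry goes away,
 * the stuck ones stay behind. */
static void toy_pass(struct toy_hash *h)
{
    assert(h != NULL);
    h->nitems = h->stuck;
}

/* "For each, until empty" loop. The real code has no pass limit;
 * max_passes exists only so that this demonstration terminates.
 * Returns true if the table drained, false if the limit was reached. */
bool toy_for_each_empty(struct toy_hash *h, unsigned max_passes,
                        unsigned *passes)
{
    unsigned n = 0;

    while (h->nitems > 0) {
        if (n == max_passes) {
            *passes = n;
            return false;
        }
        toy_pass(h);
        n++;
        /* A kernel thread would yield the CPU here between passes, which
         * would match threads that stay runnable (RU) at 100% CPU while
         * still rescheduling regularly. */
    }
    *passes = n;
    return true;
}
```

With stuck == 0 the loop drains the table in a single pass. With stuck >= 1 it never drains and only stops because of the artificial pass limit, which is the kind of behaviour the 4 stacks above would be consistent with.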
Landed for 2.1.3 and 2.3