
MDT service threads spinning in cfs_hash_for_each_relax()

Details


    Description

      We have two MDT service threads using 100% CPU on a production MDS. I can't get a backtrace from crash because they do not yield the CPU, but based on oprofile they seem to be spinning in cfs_hash_for_each_relax(). At the same time we are seeing client hangs and high lock cancellation rates on the OSTs.

      samples  %        image name               app name                 symbol name
      4225020  33.0708  libcfs.ko                libcfs.ko                cfs_hash_for_each_relax
      3345225  26.1843  libcfs.ko                libcfs.ko                cfs_hash_hh_hhead
      532409    4.1674  ptlrpc.ko                ptlrpc.ko                ldlm_cancel_locks_for_export_cb
      307199    2.4046  ptlrpc.ko                ptlrpc.ko                lock_res_and_lock
      175349    1.3725  vmlinux                  vmlinux                  native_read_tsc
      151989    1.1897  ptlrpc.ko                ptlrpc.ko                ldlm_del_waiting_lock
      136679    1.0698  libcfs.ko                libcfs.ko                cfs_hash_rw_lock
      109269    0.8553  jbd2.ko                  jbd2.ko                  journal_clean_one_cp_list
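
      As an aside, since the spinning threads never yield the CPU, one way to still capture kernel stack traces is the SysRq "l" trigger, which NMI-interrupts every CPU and dumps what it is running to the kernel log (the same mechanism as the Alt+SysRq+L attempts mentioned elsewhere in this ticket). A minimal example, assuming sysrq is enabled on the MDS:

      # echo 1 > /proc/sys/kernel/sysrq    # enable sysrq if it is not already
      # echo l > /proc/sysrq-trigger       # backtrace of all active CPUs to the kernel log
      # dmesg | tail -n 200                # read the NMI backtraces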
      



          Activity

            pjones Peter Jones added a comment -

            Landed for 2.1.3 and 2.3


            jaylan Jay Lan (Inactive) added a comment -

            NASA Ames hit this issue this morning on a 2.1.1 MDS.

            adilger Andreas Dilger added a comment -

            The patch landed on master for Lustre 2.3.0; it still needs to land on b2_1 for 2.1.3.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Liang,

            I wanted to provide more details on this problem, and particularly to point out that I feel pretty sure the looping situation will never clear, due to "orphaned" locks still being on the hash-lists; but it seems you finally found the bug/window!

            So, maybe it's too late, but just in case: digging in the server/client logs, it seems to me that at the start of the problem the affected client/export reported the following errors/messages during MDT umount:

            1339507111 2012 Jun 12 15:18:31 lascaux3332 kern err kernel LustreError: 57793:0:(lmv_obd.c:665:lmv_disconnect_mdc()) Target scratch2-MDT0000_UUID disconnect error -110
            1339507111 2012 Jun 12 15:18:31 lascaux3332 kern warning kernel Lustre: client ffff88087d539400 umount complete
            1339507130 2012 Jun 12 15:18:50 lascaux3332 kern err kernel LustreError: 57863:0:(ldlm_request.c:1172:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
            1339507130 2012 Jun 12 15:18:50 lascaux3332 kern err kernel LustreError: 57863:0:(ldlm_request.c:1172:ldlm_cli_cancel_req()) Skipped 52 previous similar messages

            Also interesting is that this situation may have a "low" impact when it concerns a single client/export that has already unmounted and whose locks cover a very specific part/directory of the filesystem. This explains why we found multiple mdt_<id> threads spinning in this situation for days, with no problem reported or found when accessing/using the affected filesystem.
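
            To make the "never clears" point concrete, here is a tiny stand-alone C toy (made-up names, not the actual libcfs/ldlm code) of the iterate-until-empty pattern: cfs_hash_for_each_empty() keeps rescanning the export's lock hash until nothing is left, so a single lock that is already destroyed but still hashed keeps the scan from ever terminating:

            #include <stdio.h>
            #include <stdbool.h>

            /* Toy model of a lock entry on an export's lock hash. */
            struct lock_entry {
                    struct lock_entry *next;
                    bool destroyed;         /* like lock->l_destroyed */
                    bool hashed;            /* still linked on exp_lock_hash */
            };

            /* Stand-in for ldlm_cancel_locks_for_export_cb(): cancel one lock.
             * A lock already marked destroyed is skipped and stays on the hash,
             * which is exactly the "orphaned" case. */
            static void cancel_one(struct lock_entry *lk)
            {
                    if (lk->destroyed)
                            return;         /* already destroyed: never unhashed again */
                    lk->destroyed = true;
                    lk->hashed = false;     /* normal path: removed from the hash here */
            }

            /* Stand-in for cfs_hash_for_each_empty(): rescan until the hash is empty.
             * One orphaned entry is enough to keep this loop running forever. */
            static void for_each_empty(struct lock_entry *head)
            {
                    int pass = 0;
                    bool any_hashed = true;

                    while (any_hashed) {
                            any_hashed = false;
                            for (struct lock_entry *lk = head; lk != NULL; lk = lk->next) {
                                    if (!lk->hashed)
                                            continue;
                                    cancel_one(lk);
                                    if (lk->hashed)
                                            any_hashed = true;   /* could not remove it */
                            }
                            if (++pass >= 5 && any_hashed) {
                                    printf("hash still not empty after %d passes"
                                           " - a real thread would spin here\n", pass);
                                    return;
                            }
                    }
                    printf("hash empty after %d pass(es)\n", pass);
            }

            int main(void)
            {
                    /* One normal lock plus one orphaned lock: already destroyed but
                     * never removed from the hash. */
                    struct lock_entry orphan = { .next = NULL,    .destroyed = true,  .hashed = true };
                    struct lock_entry normal = { .next = &orphan, .destroyed = false, .hashed = true };

                    for_each_empty(&normal);
                    return 0;
            }

            Running it reports that the hash is still not empty after the capped number of passes, which corresponds to the spin the mdt_<id> threads show.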


            nedbass Ned Bass (Inactive) added a comment -

            Thanks, we also need this patch for 2.1.

            liang Liang Zhen (Inactive) added a comment -

            I've posted a patch for review: http://review.whamcloud.com/#change,3028

            liang Liang Zhen (Inactive) added a comment -

            Looks like this one is the same as LU-1087; based on the log in LU-1087, I suspect that we left a lock on the export and can't remove it via this path:

            ldlm_cancel_locks_for_export_cb()->
                   ldlm_lock_cancel()->
                         ldlm_lock_destroy_nolock()->
                               ldlm_lock_destroy_internal()->cfs_hash_del()
            

            We actually started to see this as far back as:
            https://bugzilla.lustre.org/show_bug.cgi?id=19557

            We probably never really fixed this issue, even though we landed a patch on BZ 19557.

            I suspect it's because:

            ldlm_lock_destroy_internal:
            
                    if (lock->l_destroyed) {
                            LASSERT(cfs_list_empty(&lock->l_lru));
                            EXIT;
                            return 0;
                    }
                    lock->l_destroyed = 1;
            
                    if (lock->l_export && lock->l_export->exp_lock_hash &&
                        !cfs_hlist_unhashed(&lock->l_exp_hash))
                            cfs_hash_del(lock->l_export->exp_lock_hash,
                                         &lock->l_remote_handle, &lock->l_exp_hash);
            
            

            lock->l_exp_hash should be protected by the internal lock of cfs_hash, but we call cfs_hlist_unhashed(&lock->l_exp_hash) without holding the cfs_hash lock. This means that if someone wants to cancel this lock while export->exp_lock_hash is in the middle of rehashing (in the thread context of cfs_workitem), there is a tiny window between deleting the lock from bucket[A] and re-adding it to bucket[B] of l_exp_hash, and cfs_hlist_unhashed(&lock->l_exp_hash) will return 1 in that window. We then destroy the lock but leave it on l_exp_hash forever, because we set lock::l_destroyed to 1 and ldlm_lock_destroy_internal() won't unhash it again even if it is called multiple times.

            Making a simple change to cfs_hash_del() and removing the above check should fix this issue; I can post a patch for this.
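
            A rough sketch of that direction, with made-up toy types and a plain pthread mutex standing in for the cfs_hash bucket lock (this only illustrates moving the check under the lock; it is not the actual patch):

            #include <pthread.h>
            #include <stdbool.h>
            #include <stdio.h>

            struct toy_hash {
                    pthread_mutex_t lock;   /* stands in for the cfs_hash bucket lock */
            };

            struct toy_entry {
                    bool hashed;            /* stands in for !cfs_hlist_unhashed()    */
            };

            /* Racy pattern (mirroring the check quoted above): the unlocked peek can
             * observe the tiny rehash window where the entry is off both buckets, so
             * the delete is skipped and the lock is orphaned forever. */
            static void del_racy(struct toy_hash *hs, struct toy_entry *e)
            {
                    if (!e->hashed)         /* unlocked peek - may race with a rehash */
                            return;
                    pthread_mutex_lock(&hs->lock);
                    e->hashed = false;      /* actual removal */
                    pthread_mutex_unlock(&hs->lock);
            }

            /* Safer pattern: always take the lock, then decide.  Assuming the rehash
             * holds the same lock while it moves the entry, the transient "unhashed"
             * state can no longer be observed here. */
            static void del_safe(struct toy_hash *hs, struct toy_entry *e)
            {
                    pthread_mutex_lock(&hs->lock);
                    if (e->hashed)
                            e->hashed = false;
                    pthread_mutex_unlock(&hs->lock);
            }

            int main(void)
            {
                    struct toy_hash hs = { .lock = PTHREAD_MUTEX_INITIALIZER };
                    struct toy_entry e = { .hashed = true };

                    del_racy(&hs, &e);      /* safe only if no rehash is in flight */
                    del_safe(&hs, &e);      /* tolerates a concurrent rehash */
                    printf("hashed = %d\n", e.hashed);
                    return 0;
            }

            The key point is that the "is it still hashed?" decision and the removal happen under the same lock the rehash takes, so the transient unhashed state during a bucket move can no longer fool the caller.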


            bfaccini Bruno Faccini (Inactive) added a comment -

            Thanks for pointing that out, Ned; the patch for LU-143 was not integrated. R&D did the work of back-porting it into our Lustre 2.1 distro, so now we just need to expose it to the customer's production workload and see how things go.

            nedbass Ned Bass (Inactive) added a comment -

            Bruno,

            Do you have the LU-143 patch? It was not landed until 2.2. Since your threads were getting rescheduled, it makes me wonder whether you are just suffering from poor hash distribution.

            bfaccini Bruno Faccini (Inactive) added a comment -

            We also encountered this same situation on one of our MDSes, which was not fully operational ("df" was working, but access to some sub-trees was hanging).

            In our case there were 4 threads eating 100% CPU but regularly re-schedule()'ing:

            > 9735 2 18 ffff88187c3d87d0 RU 0.0 0 0 [ldlm_elt]
            > 9770 2 12 ffff88205ada0100 RU 0.0 0 0 [ll_evictor]
            > 35489 2 1 ffff881854ef2790 RU 0.0 0 0 [mdt_121]
            > 46002 2 15 ffff88184cca97d0 RU 0.0 0 0 [mdt_445]

            Since we could not find a way to gracefully recover from the situation, we decided to reboot the MDT and force a crash dump, in which the thread stacks look the same as in our live Alt+SysRq+L attempts:

            crash> bt 9735 9770 35489 46002
            PID: 9735 TASK: ffff88187c3d87d0 CPU: 18 COMMAND: "ldlm_elt"
            #0 [ffff88109c707e90] crash_nmi_callback at ffffffff8101fd06
            #1 [ffff88109c707ea0] notifier_call_chain at ffffffff814837f5
            #2 [ffff88109c707ee0] atomic_notifier_call_chain at ffffffff8148385a
            #3 [ffff88109c707ef0] notify_die at ffffffff8108026e
            #4 [ffff88109c707f20] do_nmi at ffffffff81481443
            #5 [ffff88109c707f50] nmi at ffffffff81480d50
            [exception RIP: cfs_hash_for_each_relax+193]
            RIP: ffffffffa0410e11 RSP: ffff8818675e7b80 RFLAGS: 00000282
            RAX: ffff88165f78567c RBX: ffff881020d2ccc0 RCX: 0000000000000005
            RDX: 000000000000000d RSI: ffff8818675e7bd0 RDI: ffff881020d2ccc0
            RBP: ffff8818675e7c10 R8: 0000000209bd7a5d R9: 6d68000000000000
            R10: 6b40000000000000 R11: 0c204977264ccdad R12: 0000000000000000
            R13: ffffffffa058e3c0 R14: 0000000000000001 R15: ffff881020d2cd40
            ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
            --- <NMI exception stack> ---
            #6 [ffff8818675e7b80] cfs_hash_for_each_relax at ffffffffa0410e11 [libcfs]
            #7 [ffff8818675e7c18] cfs_hash_for_each_empty at ffffffffa0412466 [libcfs]
            #8 [ffff8818675e7c98] ldlm_cancel_locks_for_export at ffffffffa058a49f [ptlrpc]
            #9 [ffff8818675e7ca8] server_disconnect_export at ffffffffa059a2b4 [ptlrpc]
            #10 [ffff8818675e7d28] mdt_obd_disconnect at ffffffffa0a079eb [mdt]
            #11 [ffff8818675e7e48] class_fail_export at ffffffffa04aacde [obdclass]
            #12 [ffff8818675e7e98] expired_lock_main at ffffffffa05b0dd4 [ptlrpc]
            #13 [ffff8818675e7f48] kernel_thread at ffffffff810041aa

            PID: 9770 TASK: ffff88205ada0100 CPU: 12 COMMAND: "ll_evictor"
            #0 [ffff8800450c7e90] crash_nmi_callback at ffffffff8101fd06
            #1 [ffff8800450c7ea0] notifier_call_chain at ffffffff814837f5
            #2 [ffff8800450c7ee0] atomic_notifier_call_chain at ffffffff8148385a
            #3 [ffff8800450c7ef0] notify_die at ffffffff8108026e
            #4 [ffff8800450c7f20] do_nmi at ffffffff81481443
            #5 [ffff8800450c7f50] nmi at ffffffff81480d50
            [exception RIP: cfs_hash_for_each_relax+168]
            RIP: ffffffffa0410df8 RSP: ffff88205adc7b20 RFLAGS: 00000246
            RAX: 0000000000000000 RBX: ffff8810254ecd80 RCX: 0000000000000005
            RDX: 0000000000000003 RSI: ffff88205adc7b70 RDI: ffff8810254ecd80
            RBP: ffff88205adc7bb0 R8: 0000000208590aee R9: 2e90000000000000
            R10: 7480000000000000 R11: 9ddca10a56b825d2 R12: 0000000000000000
            R13: ffffffffa058e3c0 R14: 0000000000000001 R15: ffff8810254ece00
            ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
            --- <NMI exception stack> ---
            #6 [ffff88205adc7b20] cfs_hash_for_each_relax at ffffffffa0410df8 [libcfs]
            #7 [ffff88205adc7bb8] cfs_hash_for_each_empty at ffffffffa0412466 [libcfs]
            #8 [ffff88205adc7c38] ldlm_cancel_locks_for_export at ffffffffa058a49f [ptlrpc]
            #9 [ffff88205adc7c48] server_disconnect_export at ffffffffa059a2b4 [ptlrpc]
            #10 [ffff88205adc7cc8] mdt_obd_disconnect at ffffffffa0a079eb [mdt]
            #11 [ffff88205adc7de8] class_fail_export at ffffffffa04aacde [obdclass]
            #12 [ffff88205adc7e38] ping_evictor_main at ffffffffa05e5a5d [ptlrpc]
            #13 [ffff88205adc7f48] kernel_thread at ffffffff810041aa

            PID: 35489 TASK: ffff881854ef2790 CPU: 1 COMMAND: "mdt_121"
            #0 [ffff88089c407e90] crash_nmi_callback at ffffffff8101fd06
            #1 [ffff88089c407ea0] notifier_call_chain at ffffffff814837f5
            #2 [ffff88089c407ee0] atomic_notifier_call_chain at ffffffff8148385a
            #3 [ffff88089c407ef0] notify_die at ffffffff8108026e
            #4 [ffff88089c407f20] do_nmi at ffffffff81481443
            #5 [ffff88089c407f50] nmi at ffffffff81480d50
            [exception RIP: cfs_hash_for_each_relax+193]
            RIP: ffffffffa0410e11 RSP: ffff88184e85b960 RFLAGS: 00000286
            RAX: ffff8817f53bf10c RBX: ffff88105d484800 RCX: 0000000000000005
            RDX: 000000000000001f RSI: ffff88184e85b9b0 RDI: ffff88105d484800
            RBP: ffff88184e85b9f0 R8: 0000000208590aee R9: f188000000000000
            R10: 8c40000000000000 R11: 9ddca109f53c3e31 R12: 0000000000000000
            R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000
            ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
            --- <NMI exception stack> ---
            #6 [ffff88184e85b960] cfs_hash_for_each_relax at ffffffffa0410e11 [libcfs]
            #7 [ffff88184e85b9f8] cfs_hash_for_each_empty at ffffffffa0412466 [libcfs]
            #8 [ffff88184e85ba78] ldlm_cancel_locks_for_export at ffffffffa058a49f [ptlrpc]
            #9 [ffff88184e85ba88] server_disconnect_export at ffffffffa059a2b4 [ptlrpc]
            #10 [ffff88184e85bb08] mdt_obd_disconnect at ffffffffa0a079eb [mdt]
            #11 [ffff88184e85bc28] target_handle_disconnect at ffffffffa05966a9 [ptlrpc]
            #12 [ffff88184e85bc88] mdt_disconnect at ffffffffa09fe2a9 [mdt]
            #13 [ffff88184e85bcd8] mdt_handle_common at ffffffffa09f9865 [mdt]
            #14 [ffff88184e85bd58] mdt_regular_handle at ffffffffa09fa875 [mdt]
            #15 [ffff88184e85bd68] ptlrpc_main at ffffffffa05e4829 [ptlrpc]
            #16 [ffff88184e85bf48] kernel_thread at ffffffff810041aa

            PID: 46002 TASK: ffff88184cca97d0 CPU: 15 COMMAND: "mdt_445"
            #0 [ffff88189c4c7e90] crash_nmi_callback at ffffffff8101fd06
            #1 [ffff88189c4c7ea0] notifier_call_chain at ffffffff814837f5
            #2 [ffff88189c4c7ee0] atomic_notifier_call_chain at ffffffff8148385a
            #3 [ffff88189c4c7ef0] notify_die at ffffffff8108026e
            #4 [ffff88189c4c7f20] do_nmi at ffffffff81481443
            #5 [ffff88189c4c7f50] nmi at ffffffff81480d50
            [exception RIP: cfs_hash_hh_hhead]
            RIP: ffffffffa040f280 RSP: ffff881849f6b958 RFLAGS: 00000246
            RAX: ffffffffa0422200 RBX: ffff88106b5d5b00 RCX: 0000000000000005
            RDX: 0000000000000002 RSI: ffff881849f6b9b0 RDI: ffff88106b5d5b00
            RBP: ffff881849f6b9f0 R8: 00000002075042cd R9: 0858000000000000
            R10: 42c0000000000000 R11: c0fc1c42c8d4010b R12: 0000000000000000
            R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000
            ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
            --- <NMI exception stack> ---
            #6 [ffff881849f6b958] cfs_hash_hh_hhead at ffffffffa040f280 [libcfs]
            #7 [ffff881849f6b958] cfs_hash_for_each_relax at ffffffffa0410e05 [libcfs]
            #8 [ffff881849f6b9f8] cfs_hash_for_each_empty at ffffffffa0412466 [libcfs]
            #9 [ffff881849f6ba78] ldlm_cancel_locks_for_export at ffffffffa058a49f [ptlrpc]
            #10 [ffff881849f6ba88] server_disconnect_export at ffffffffa059a2b4 [ptlrpc]
            #11 [ffff881849f6bb08] mdt_obd_disconnect at ffffffffa0a079eb [mdt]
            #12 [ffff881849f6bc28] target_handle_disconnect at ffffffffa05966a9 [ptlrpc]
            #13 [ffff881849f6bc88] mdt_disconnect at ffffffffa09fe2a9 [mdt]
            #14 [ffff881849f6bcd8] mdt_handle_common at ffffffffa09f9865 [mdt]
            #15 [ffff881849f6bd58] mdt_regular_handle at ffffffffa09fa875 [mdt]
            #16 [ffff881849f6bd68] ptlrpc_main at ffffffffa05e4829 [ptlrpc]
            #17 [ffff881849f6bf48] kernel_thread at ffffffff810041aa
            crash>
            crash> ps | grep UN
            crash> log | less
            crash> bt 35683
            PID: 35683 TASK: ffff8810732200c0 CPU: 21 COMMAND: "mdt_315"
            #0 [ffff881025bf7650] schedule at ffffffff8147dddc
            #1 [ffff881025bf7718] cfs_waitq_wait at ffffffffa040175e [libcfs]
            #2 [ffff881025bf7728] ldlm_completion_ast at ffffffffa05ad372 [ptlrpc]
            #3 [ffff881025bf77f8] ldlm_cli_enqueue_local at ffffffffa05aca79 [ptlrpc]
            #4 [ffff881025bf78b8] mdt_object_lock at ffffffffa09f544e [mdt]
            #5 [ffff881025bf7978] mdt_getattr_name_lock at ffffffffa0a0122b [mdt]
            #6 [ffff881025bf7a58] mdt_intent_getattr at ffffffffa0a0247a [mdt]
            #7 [ffff881025bf7af8] mdt_intent_policy at ffffffffa09ff630 [mdt]
            #8 [ffff881025bf7b68] ldlm_lock_enqueue at ffffffffa058eb8a [ptlrpc]
            #9 [ffff881025bf7c08] ldlm_handle_enqueue0 at ffffffffa05b5767 [ptlrpc]
            #10 [ffff881025bf7ca8] mdt_enqueue at ffffffffa09ff0ca [mdt]
            #11 [ffff881025bf7cd8] mdt_handle_common at ffffffffa09f9865 [mdt]
            #12 [ffff881025bf7d58] mdt_regular_handle at ffffffffa09fa875 [mdt]
            #13 [ffff881025bf7d68] ptlrpc_main at ffffffffa05e4829 [ptlrpc]
            #14 [ffff881025bf7f48] kernel_thread at ffffffff810041aa
            crash>

            Thus my current opinion is that there seems to be a problem (a loop?) in the hash-list/structure that manages the export's locks.

            But I need to dig more into the crash-dump to conclude...
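
            One possible way to dig further with crash (a suggestion only; the address below is the RDI/RBX value from the mdt_445 frame above, assuming it still points at the cfs_hash being walked, and <export-address> is a placeholder for the struct obd_export of the stuck client):

            crash> mod -s libcfs
            crash> mod -s obdclass
            crash> mod -s ptlrpc
            crash> bt -f 46002
            crash> struct cfs_hash ffff88106b5d5b00
            crash> struct obd_export.exp_lock_hash <export-address>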


            People

              liang Liang Zhen (Inactive)
              nedbass Ned Bass (Inactive)
              Votes: 0
              Watchers: 8
