  Lustre / LU-1428

MDT service threads spinning in cfs_hash_for_each_relax()

Details


    Description

      We have two MDT service threads using 100% CPU on a production MDS. I can't get a backtrace from crash because they do not yield the CPU, but based on oprofile they seem to be spinning in cfs_hash_for_each_relax(). At the same time we are seeing client hangs and high lock cancellation rates on the OSTs.

      samples  %        image name               app name                 symbol name
      4225020  33.0708  libcfs.ko                libcfs.ko                cfs_hash_for_each_relax
      3345225  26.1843  libcfs.ko                libcfs.ko                cfs_hash_hh_hhead
      532409    4.1674  ptlrpc.ko                ptlrpc.ko                ldlm_cancel_locks_for_export_cb
      307199    2.4046  ptlrpc.ko                ptlrpc.ko                lock_res_and_lock
      175349    1.3725  vmlinux                  vmlinux                  native_read_tsc
      151989    1.1897  ptlrpc.ko                ptlrpc.ko                ldlm_del_waiting_lock
      136679    1.0698  libcfs.ko                libcfs.ko                cfs_hash_rw_lock
      109269    0.8553  jbd2.ko                  jbd2.ko                  journal_clean_one_cp_list
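
      As a rough illustration of the failure mode (a minimal sketch with made-up names, not the libcfs implementation): a "relaxed" hash walk drops the bucket lock around each callback so the callback may block, which forces the scan to restart from the bucket head, and when the walk is meant to run until the hash is empty, entries that the callback never removes get revisited forever. That pattern would match cfs_hash_for_each_relax and cfs_hash_hh_hhead dominating the profile above.

      /* Minimal sketch with hypothetical names -- not the libcfs code.
       * A "drain until empty" hash walk: the lock is dropped around each
       * callback, so a pass may restart from the bucket head, and the
       * outer loop repeats until the table is empty.  An entry the
       * callback can never remove keeps the count nonzero, so the outer
       * loop spins at 100% CPU. */
      #include <stddef.h>

      struct entry { struct entry *next; };

      struct table {
              struct entry *head;   /* one bucket, for brevity */
              int           count;  /* number of hashed entries */
      };

      /* callback returns nonzero if it removed @e from the table */
      typedef int (*drain_cb_t)(struct table *t, struct entry *e);

      static void for_each_until_empty(struct table *t, drain_cb_t cb)
      {
              while (t->count > 0) {                  /* outer loop: "until empty" */
                      struct entry *e = t->head;

                      while (e != NULL) {
                              struct entry *next = e->next;

                              /* the real code drops the bucket lock here, runs
                               * the (possibly blocking) callback, then restarts
                               * the bucket scan from its head */
                              cb(t, e);
                              e = next;
                      }
                      /* if cb() skipped entries, count is still > 0 and the
                       * same entries are visited again on the next pass */
              }
      }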
      


          Activity

            [LU-1428] MDT service threads spinning in cfs_hash_for_each_relax()
            pjones Peter Jones added a comment -

            Landed for 2.1.3 and 2.3


            jaylan Jay Lan (Inactive) added a comment -

            NASA Ames hit this this morning on a 2.1.1 MDS.

            adilger Andreas Dilger added a comment -

            Patch landed to master for Lustre 2.3.0; it still needs to be landed to b2_1 for 2.1.3.

            bfaccini Bruno Faccini (Inactive) added a comment -

            Liang,

            I wanted to provide more details for this problem, and particularly to indicate that I felt pretty sure the looping situation would never clear, due to "orphaned" locks still being on the hash lists, but it seems that you finally found the bug/window!

            So, maybe it's too late, but just in case: digging in the server/client logs, it seems to me that at the start of the problem the concerned client/export reported the following errors/messages during MDT umount:

            1339507111 2012 Jun 12 15:18:31 lascaux3332 kern err kernel LustreError: 57793:0:(lmv_obd.c:665:lmv_disconnect_mdc()) Target scratch2-MDT0000_UUID disconnect error -110
            1339507111 2012 Jun 12 15:18:31 lascaux3332 kern warning kernel Lustre: client ffff88087d539400 umount complete
            1339507130 2012 Jun 12 15:18:50 lascaux3332 kern err kernel LustreError: 57863:0:(ldlm_request.c:1172:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
            1339507130 2012 Jun 12 15:18:50 lascaux3332 kern err kernel LustreError: 57863:0:(ldlm_request.c:1172:ldlm_cli_cancel_req()) Skipped 52 previous similar messages

            Also interesting is that this situation may have a "low" impact when it concerns a single client/export which had already umounted and held locks on a very specific part/directory of the filesystem. This explains why we found multiple mdt_<id> threads spinning in this situation for days, with no problem reported nor noticed when accessing/using the concerned filesystem!
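
            To make the "orphaned lock" idea concrete, here is a minimal sketch (hypothetical names, building on the drain-until-empty sketch under the description above, not the actual ldlm code): if the per-lock callback gives up on a lock without unhashing it, for example because the export is already disconnected and the cancel cannot complete as in the log above, that entry keeps the hash non-empty and the drain loop never terminates.

            /* Sketch with hypothetical names -- not the ldlm code.
             * A callback that skips "orphaned" entries instead of unhashing
             * them leaves the table non-empty, so a drain-until-empty walk
             * over it never finishes. */
            #include <stddef.h>

            struct entry { struct entry *next; int orphaned; };
            struct table { struct entry *head; int count; };

            static void table_del(struct table *t, struct entry *e)
            {
                    struct entry **p = &t->head;

                    while (*p != NULL && *p != e)
                            p = &(*p)->next;
                    if (*p == e) {              /* unlink @e and drop the count */
                            *p = e->next;
                            t->count--;
                    }
            }

            /* returns 1 if @e was removed from the table, 0 if it was skipped */
            static int cancel_lock_cb(struct table *t, struct entry *e)
            {
                    if (e->orphaned)            /* cancel cannot complete */
                            return 0;           /* entry stays hashed: the walk spins */

                    table_del(t, e);            /* normal path: unhash, table drains */
                    return 1;
            }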


            nedbass Ned Bass (Inactive) added a comment -

            Thanks, we also need this patch for 2.1.

            liang Liang Zhen (Inactive) added a comment -

            I've posted a patch for review: http://review.whamcloud.com/#change,3028

            People

              Assignee: liang Liang Zhen (Inactive)
              Reporter: nedbass Ned Bass (Inactive)