[LU-1087] mdt thread spinning out of control Created: 09/Feb/12  Updated: 13/Jul/12  Resolved: 05/Jun/12

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0
Fix Version/s: None

Type: Bug Priority: Critical
Reporter: Christopher Morrone Assignee: Lai Siyao
Resolution: Duplicate Votes: 0
Labels: None
Environment:

lustre 2.1.0-21chaos (github.com/chaos/lustre)


Issue Links:
Duplicate
duplicates LU-1428 MDT service threads spinning in cfs_... Resolved
Severity: 2
Rank (Obsolete): 6465

 Description   

On one of our filesystems that was upgraded to Lustre 2.1 last week, the MDS has one thread that appears to be stuck spinning:

13711 root      20   0     0    0    0 R 99.9  0.0   6289:28 mdt_253

Using sysrq (since crash can't trace a live process):

2012-02-09 16:18:39 Call Trace:
2012-02-09 16:18:39  [<ffffffffa06de440>] ? ldlm_cancel_locks_for_export_cb+0x0/0xd0 [ptlrpc]
2012-02-09 16:18:39  [<ffffffffa06de440>] ? ldlm_cancel_locks_for_export_cb+0x0/0xd0 [ptlrpc]
2012-02-09 16:18:39  [<ffffffffa0498f46>] cfs_hash_for_each_empty+0xc6/0x170 [libcfs]
2012-02-09 16:18:39  [<ffffffffa06db15f>] ldlm_cancel_locks_for_export+0x1f/0x30 [ptlrpc]
2012-02-09 16:18:39  [<ffffffffa06e7c94>] server_disconnect_export+0x64/0x170 [ptlrpc]
2012-02-09 16:18:39  [<ffffffffa0c50dcb>] mdt_obd_disconnect+0x4b/0xd50 [mdt]
2012-02-09 16:18:39  [<ffffffffa071681e>] ? lustre_pack_reply_flags+0xae/0x1b0 [ptlrpc]
2012-02-09 16:18:39  [<ffffffffa0716931>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
2012-02-09 16:18:39  [<ffffffffa06e4d4a>] target_handle_disconnect+0x16a/0x3a0 [ptlrpc]
2012-02-09 16:18:39  [<ffffffffa0c4a805>] mdt_disconnect+0x35/0x100 [mdt]
2012-02-09 16:18:39  [<ffffffffa0c4674d>] mdt_handle_common+0x73d/0x12b0 [mdt]
2012-02-09 16:18:39  [<ffffffffa0717334>] ? lustre_msg_get_transno+0x54/0x90 [ptlrpc]
2012-02-09 16:18:39  [<ffffffffa0c47395>] mdt_regular_handle+0x15/0x20 [mdt]
2012-02-09 16:18:39  [<ffffffffa0723181>] ptlrpc_main+0xcd1/0x1690 [ptlrpc]
2012-02-09 16:18:39  [<ffffffffa07224b0>] ? ptlrpc_main+0x0/0x1690 [ptlrpc]
2012-02-09 16:18:39  [<ffffffff8100c14a>] child_rip+0xa/0x20
2012-02-09 16:18:39  [<ffffffffa07224b0>] ? ptlrpc_main+0x0/0x1690 [ptlrpc]
2012-02-09 16:18:39  [<ffffffffa07224b0>] ? ptlrpc_main+0x0/0x1690 [ptlrpc]
2012-02-09 16:18:39  [<ffffffff8100c140>] ? child_rip+0x0/0x20


 Comments   
Comment by Christopher Morrone [ 09/Feb/12 ]

Excerpt from the log for the spinning thread, showing it repeatedly trying to free the same hash:

00010000:00000040:15.0:1328835885.126894:0:13711:0:(ldlm_resource.c:1102:ldlm_resource_getref()) getref res: ffff88082dc5fa80 count: 2
00010000:00000040:15.0:1328835885.126895:0:13711:0:(ldlm_resource.c:1141:ldlm_resource_putref()) putref res: ffff88082dc5fa80 count: 1
00000001:00000040:15.0:1328835885.126925:0:13711:0:(hash.c:1687:cfs_hash_for_each_empty()) Try to empty hash: 60009e44-9f98-9, loop: 178108
00010000:00000040:15.0:1328835885.126957:0:13711:0:(ldlm_resource.c:1102:ldlm_resource_getref()) getref res: ffff88082dc5fa80 count: 2
00010000:00000040:15.0:1328835885.126958:0:13711:0:(ldlm_resource.c:1141:ldlm_resource_putref()) putref res: ffff88082dc5fa80 count: 1
00000001:00000040:15.0:1328835885.126988:0:13711:0:(hash.c:1687:cfs_hash_for_each_empty()) Try to empty hash: 60009e44-9f98-9, loop: 178109
00010000:00000040:15.0:1328835885.127020:0:13711:0:(ldlm_resource.c:1102:ldlm_resource_getref()) getref res: ffff88082dc5fa80 count: 2
00010000:00000040:15.0:1328835885.127022:0:13711:0:(ldlm_resource.c:1141:ldlm_resource_putref()) putref res: ffff88082dc5fa80 count: 1
00000001:00000040:15.0:1328835885.127051:0:13711:0:(hash.c:1687:cfs_hash_for_each_empty()) Try to empty hash: 60009e44-9f98-9, loop: 178110
Comment by Peter Jones [ 09/Feb/12 ]

Lai

Could you please look into this one?

Thanks

Peter

Comment by Lai Siyao [ 10/Feb/12 ]

Hi Chris, did you see any error messages about ldlm lock before this happened?

Comment by Christopher Morrone [ 10/Feb/12 ]

We saw all manner of timeouts, reconnects, slow-service warnings, and generally a lot of log traffic. There are certainly lock callback timer expirations and other ldlm activity in there.

Comment by Lai Siyao [ 16/Feb/12 ]

Chris, your log shows you've enabled the D_INFO log. Could you enable D_DLMTRACE and check which lock is in the dead loop? (ldlm_cancel_locks_for_export_cb() should print the lock information.)

Comment by Christopher Morrone [ 16/Feb/12 ]

If it happens again I will do that. Unfortunately D_INFO caused LU-1088, so I was not able to get any further debugging information.

Comment by Liang Zhen (Inactive) [ 05/Jun/12 ]

I've posted a patch for review: http://review.whamcloud.com/#change,3028

Comment by Peter Jones [ 05/Jun/12 ]

This is believed to be a duplicate of LU-1428.
