[LU-1087] mdt thread spinning out of control Created: 09/Feb/12 Updated: 13/Jul/12 Resolved: 05/Jun/12 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical |
| Reporter: | Christopher Morrone | Assignee: | Lai Siyao |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
lustre 2.1.0-21chaos (github.com/chaos/lustre) |
||
| Issue Links: |
|
| Severity: | 2 |
| Rank (Obsolete): | 6465 |
| Description |
|
On one of our filesystems that was upgraded to Lustre 2.1 last week, the MDS has one thread that appears to be stuck spinning:

13711 root 20 0 0 0 0 R 99.9 0.0 6289:28 mdt_253

Using sysrq (since crash can't trace a live process):

2012-02-09 16:18:39 Call Trace:
2012-02-09 16:18:39 [<ffffffffa06de440>] ? ldlm_cancel_locks_for_export_cb+0x0/0xd0 [ptlrpc]
2012-02-09 16:18:39 [<ffffffffa06de440>] ? ldlm_cancel_locks_for_export_cb+0x0/0xd0 [ptlrpc]
2012-02-09 16:18:39 [<ffffffffa0498f46>] cfs_hash_for_each_empty+0xc6/0x170 [libcfs]
2012-02-09 16:18:39 [<ffffffffa06db15f>] ldlm_cancel_locks_for_export+0x1f/0x30 [ptlrpc]
2012-02-09 16:18:39 [<ffffffffa06e7c94>] server_disconnect_export+0x64/0x170 [ptlrpc]
2012-02-09 16:18:39 [<ffffffffa0c50dcb>] mdt_obd_disconnect+0x4b/0xd50 [mdt]
2012-02-09 16:18:39 [<ffffffffa071681e>] ? lustre_pack_reply_flags+0xae/0x1b0 [ptlrpc]
2012-02-09 16:18:39 [<ffffffffa0716931>] ? lustre_pack_reply+0x11/0x20 [ptlrpc]
2012-02-09 16:18:39 [<ffffffffa06e4d4a>] target_handle_disconnect+0x16a/0x3a0 [ptlrpc]
2012-02-09 16:18:39 [<ffffffffa0c4a805>] mdt_disconnect+0x35/0x100 [mdt]
2012-02-09 16:18:39 [<ffffffffa0c4674d>] mdt_handle_common+0x73d/0x12b0 [mdt]
2012-02-09 16:18:39 [<ffffffffa0717334>] ? lustre_msg_get_transno+0x54/0x90 [ptlrpc]
2012-02-09 16:18:39 [<ffffffffa0c47395>] mdt_regular_handle+0x15/0x20 [mdt]
2012-02-09 16:18:39 [<ffffffffa0723181>] ptlrpc_main+0xcd1/0x1690 [ptlrpc]
2012-02-09 16:18:39 [<ffffffffa07224b0>] ? ptlrpc_main+0x0/0x1690 [ptlrpc]
2012-02-09 16:18:39 [<ffffffff8100c14a>] child_rip+0xa/0x20
2012-02-09 16:18:39 [<ffffffffa07224b0>] ? ptlrpc_main+0x0/0x1690 [ptlrpc]
2012-02-09 16:18:39 [<ffffffffa07224b0>] ? ptlrpc_main+0x0/0x1690 [ptlrpc]
2012-02-09 16:18:39 [<ffffffff8100c140>] ? child_rip+0x0/0x20
|
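The trace shows the thread inside cfs_hash_for_each_empty(), which re-walks the export's lock hash and calls ldlm_cancel_locks_for_export_cb() on each entry until the hash is empty. A minimal, self-contained sketch of that retry-until-empty pattern (toy types and names, not the actual libcfs code) shows how it degenerates into a busy loop when one entry survives every pass:

```c
/* Toy illustration of the retry-until-empty pattern; all names here
 * (node, cancel_cb, for_each_empty) are hypothetical, not libcfs API.
 * The point: if the per-item callback never actually removes an item,
 * the outer loop has no exit condition. */
#include <stdio.h>
#include <stdbool.h>

struct node {
        struct node *next;
        bool         sticky;   /* stands in for a lock that cannot be cancelled */
};

/* In the spirit of ldlm_cancel_locks_for_export_cb(): expected to remove
 * the item; returns true only if it actually did. */
static bool cancel_cb(struct node *n)
{
        return !n->sticky;
}

static void for_each_empty(struct node **head)
{
        unsigned int loop = 0;

        while (*head != NULL) {
                struct node **pp = head;

                while (*pp != NULL) {
                        if (cancel_cb(*pp))
                                *pp = (*pp)->next;      /* unlink */
                        else
                                pp = &(*pp)->next;      /* skip: stays in table */
                }
                /* corresponds to "Try to empty hash: ..., loop: N" */
                printf("try to empty table, loop: %u\n", ++loop);
                if (loop >= 5) {        /* cap only so this demo terminates */
                        printf("livelock: entry survived every pass\n");
                        break;
                }
        }
}

int main(void)
{
        struct node stuck = { .next = NULL,   .sticky = true  };
        struct node ok    = { .next = &stuck, .sticky = false };
        struct node *head = &ok;

        for_each_empty(&head);
        return 0;
}
```

In the real code the "stuck" entry would be a lock that the cancel callback never manages to unhash, so the loop in cfs_hash_for_each_empty() never finishes and the mdt_253 thread is pinned at 99.9% CPU.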
| Comments |
| Comment by Christopher Morrone [ 09/Feb/12 ] |
|
Excerpt from the log for the spinning thread, showing it repeatedly trying to free the same hash:

00010000:00000040:15.0:1328835885.126894:0:13711:0:(ldlm_resource.c:1102:ldlm_resource_getref()) getref res: ffff88082dc5fa80 count: 2
00010000:00000040:15.0:1328835885.126895:0:13711:0:(ldlm_resource.c:1141:ldlm_resource_putref()) putref res: ffff88082dc5fa80 count: 1
00000001:00000040:15.0:1328835885.126925:0:13711:0:(hash.c:1687:cfs_hash_for_each_empty()) Try to empty hash: 60009e44-9f98-9, loop: 178108
00010000:00000040:15.0:1328835885.126957:0:13711:0:(ldlm_resource.c:1102:ldlm_resource_getref()) getref res: ffff88082dc5fa80 count: 2
00010000:00000040:15.0:1328835885.126958:0:13711:0:(ldlm_resource.c:1141:ldlm_resource_putref()) putref res: ffff88082dc5fa80 count: 1
00000001:00000040:15.0:1328835885.126988:0:13711:0:(hash.c:1687:cfs_hash_for_each_empty()) Try to empty hash: 60009e44-9f98-9, loop: 178109
00010000:00000040:15.0:1328835885.127020:0:13711:0:(ldlm_resource.c:1102:ldlm_resource_getref()) getref res: ffff88082dc5fa80 count: 2
00010000:00000040:15.0:1328835885.127022:0:13711:0:(ldlm_resource.c:1141:ldlm_resource_putref()) putref res: ffff88082dc5fa80 count: 1
00000001:00000040:15.0:1328835885.127051:0:13711:0:(hash.c:1687:cfs_hash_for_each_empty()) Try to empty hash: 60009e44-9f98-9, loop: 178110
|
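Reading the excerpt: on every pass the callback appears to take a temporary reference on the same resource (count goes to 2), fail to complete the cancel, and drop the reference again (count back to 1); because the lock is never removed from the export hash, cfs_hash_for_each_empty() comes straight back to it, and the loop counter is already past 178,000. A toy sketch of that reference cycle (hypothetical names; the real getref/putref lines are the ones quoted from ldlm_resource.c above):

```c
/* Hypothetical sketch of the refcount cycle visible in the log; this is
 * not the real ldlm code.  Each pass: getref (1 -> 2), attempt the
 * cancel, putref (2 -> 1), and the lock is still in the hash. */
#include <stdio.h>
#include <stdbool.h>

struct resource {
        int refcount;
};

static void resource_getref(struct resource *res)
{
        res->refcount++;
        printf("getref res: %p count: %d\n", (void *)res, res->refcount);
}

static void resource_putref(struct resource *res)
{
        res->refcount--;
        printf("putref res: %p count: %d\n", (void *)res, res->refcount);
}

/* returns true only if the lock was actually cancelled and unhashed */
static bool try_cancel(struct resource *res)
{
        resource_getref(res);
        bool done = false;      /* stands in for a cancel that never completes */
        resource_putref(res);
        return done;
}

int main(void)
{
        struct resource res = { .refcount = 1 };    /* the hash itself holds one ref */

        for (unsigned int loop = 178108; loop <= 178110; loop++) {
                try_cancel(&res);
                printf("Try to empty hash, loop: %u\n", loop);
        }
        return 0;
}
```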
| Comment by Peter Jones [ 09/Feb/12 ] |
|
Lai, could you please look into this one? Thanks, Peter |
| Comment by Lai Siyao [ 10/Feb/12 ] |
|
Hi Chris, did you see any error messages about LDLM locks before this happened? |
| Comment by Christopher Morrone [ 10/Feb/12 ] |
|
We saw all manner of timeouts, reconnects, slow-service warnings, and generally a lot of log traffic. There are certainly lock callback timer expirations and other LDLM activity in there. |
| Comment by Lai Siyao [ 16/Feb/12 ] |
|
Chris, your log shows you've enabled the D_INFO debug mask. Could you enable D_DLMTRACE and check which lock is caught in the dead loop? (ldlm_cancel_locks_for_export_cb() should print the lock information.) |
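(For reference, and as an assumption about the usual procedure rather than something stated in this ticket: on 2.x servers the D_DLMTRACE mask can normally be added at runtime with `lctl set_param debug=+dlmtrace` and removed again with `lctl set_param debug=-dlmtrace`.)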
| Comment by Christopher Morrone [ 16/Feb/12 ] |
|
If it happens again I will do that. Unfortunately D_INFO caused |
| Comment by Liang Zhen (Inactive) [ 05/Jun/12 ] |
|
I've posted a patch for review: http://review.whamcloud.com/#change,3028 |
| Comment by Peter Jones [ 05/Jun/12 ] |
|
This is believed to be a duplicate of |