[LU-6230] open handle leak Created: 10/Feb/15 Updated: 26/Feb/15 Resolved: 26/Feb/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.7.0 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | nasf (Inactive) | Assignee: | nasf (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 17440 |
| Description |
|
In the open case, the client-side open handling thread may hit an error after the MDT has granted the open. In that case, the client should send a close RPC to the MDT as cleanup; otherwise, the open handle on the MDT is leaked until the client unmounts or is evicted.

If LFSCK sets LU_OBJECT_HEARD_BANSHEE on an MDT-object that is opened by others in order to repair some inconsistency, such as a multiple-referenced OST-object, the leaked open handle still references the MDT-object, and this blocks subsequent threads that want to locate the object via FID:

23:07:57:INFO: task mdt00_000:6380 blocked for more than 120 seconds.
23:07:57: Not tainted 2.6.32-504.8.1.el6_lustre.g0ef66b1.x86_64 #1
23:07:57:mdt00_000 D 0000000000000001 0 6380 2 0x00000080
23:07:57:Call Trace:
23:07:57: [<ffffffffa05f62af>] ? lu_object_find_try+0x9f/0x260 [obdclass]
23:07:57: [<ffffffffa05f64ad>] lu_object_find_at+0x3d/0xe0 [obdclass]
23:07:57: [<ffffffffa05f6566>] lu_object_find+0x16/0x20 [obdclass]
23:07:57: [<ffffffffa0ebe056>] mdt_object_find+0x56/0x170 [mdt]
23:07:57: [<ffffffffa0ef5407>] mdt_reint_open+0x1527/0x2c70 [mdt]
23:07:57: [<ffffffffa0edd0cd>] mdt_reint_rec+0x5d/0x200 [mdt]
23:07:57: [<ffffffffa0ec123b>] mdt_reint_internal+0x4cb/0x7a0 [mdt]
23:07:57: [<ffffffffa0ec1706>] mdt_intent_reint+0x1f6/0x430 [mdt]
23:07:57: [<ffffffffa0ebfcf4>] mdt_intent_policy+0x494/0xce0 [mdt]
23:07:57: [<ffffffffa07c24f9>] ldlm_lock_enqueue+0x129/0x9d0 [ptlrpc]
23:07:57: [<ffffffffa07ee48b>] ldlm_handle_enqueue0+0x51b/0x13f0 [ptlrpc]
23:07:57: [<ffffffffa086e951>] tgt_enqueue+0x61/0x230 [ptlrpc]
23:07:57: [<ffffffffa086f59e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
23:07:57: [<ffffffffa081f5c1>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
23:07:57: [<ffffffff8109e66e>] kthread+0x9e/0xc0
23:07:57: [<ffffffff8100c20a>] child_rip+0xa/0x20 |
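
The interaction described above can be illustrated with a small toy model. This is only a sketch, not Lustre code: the classes `ToyObject` and `ToyMdt`, the helper `client_open`, and the `dying` flag (standing in for LU_OBJECT_HEARD_BANSHEE) are all hypothetical names invented for illustration. A lookup that would sleep in the real server is modelled by returning `None`.

```python
class ToyObject:
    """Hypothetical stand-in for a cached MDT-object."""

    def __init__(self, fid):
        self.fid = fid
        self.refcount = 0    # references held by open handles
        self.dying = False   # models LU_OBJECT_HEARD_BANSHEE


class ToyMdt:
    """Hypothetical object cache keyed by FID."""

    def __init__(self):
        self.cache = {}

    def find(self, fid):
        obj = self.cache.get(fid)
        if obj is not None and obj.dying:
            if obj.refcount > 0:
                # A real thread would sleep here until the last reference
                # is dropped; the toy model reports "blocked" as None.
                return None
            # No references left: drop the dying object from the cache.
            del self.cache[fid]
            obj = None
        if obj is None:
            obj = self.cache[fid] = ToyObject(fid)
        return obj

    def open(self, fid):
        obj = self.find(fid)
        if obj is None:
            raise RuntimeError("lookup would block")
        obj.refcount += 1    # the open handle pins the object
        return obj

    def close(self, obj):
        obj.refcount -= 1    # the close RPC releases the open handle

    def mark_dying(self, fid):
        # Models LFSCK flagging the object while repairing an inconsistency.
        self.cache[fid].dying = True


def client_open(mdt, fid, fail_after_grant, close_on_error):
    """Open a file; optionally fail on the client after the grant."""
    obj = mdt.open(fid)
    if not fail_after_grant:
        return obj
    if close_on_error:
        mdt.close(obj)       # cleanup path: send the close
    return None              # without the close, the handle is leaked
```

In this model, a failed open without the cleanup close leaves `refcount` at 1, so a later `find()` on the same FID after `mark_dying()` reports a blocked lookup. With the cleanup close, the same sequence succeeds and returns a fresh object.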
| Comments |
| Comment by Gerrit Updater [ 10/Feb/15 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13709 |
| Comment by nasf (Inactive) [ 21/Feb/15 ] |
|
In fact, the |
| Comment by Andreas Dilger [ 24/Feb/15 ] |
|
Fan Yong, as much as I want to land this patch to fix the current testing problem, it is bad that a poorly-behaving client can cause the MDS to deadlock. That seems like a bug in the MDS code that it is blocked on a client open reference, even if the client is doing the wrong thing. I'd expect the MDS to not care at all whether the file is open and/or unlinked forever, since this is a normal use case and not even an error. At worst the MDS should evict such a client if there is a clear problem. |
| Comment by nasf (Inactive) [ 24/Feb/15 ] |
|
We found this issue while investigating a sanity-lfsck test_17 failure. There are some failure instances in Generally, asking others to release the reference is NOT the right solution, because it is normal for someone to reference the MDT-object for a very long time, such as via open(). The right solution is NOT to set LU_OBJECT_HEARD_BANSHEE on the MDT-object; instead, we should make the LOD-object re-attach the OSP-object(s) after the LOV EA is refreshed. |
| Comment by Gerrit Updater [ 24/Feb/15 ] |
|
Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13848 |
| Comment by John Hammond [ 24/Feb/15 ] |
|
> For open case, the client side open handling thread may hit error after the MDT grant the open. Under the such case, the client should send close RPC to the MDT as cleanup; otherwise, the open handle on the MDT will be leaked there until the client umount or evicted. In further, if someone unlinked the file, but because the open handle holds the reference on such file/object, then it will block the subsequent threads that want to locate such object via FID.

Is this description correct? If there is an open handle then mod_count should remain positive and so the object will not be destroyed. So LU_OBJECT_HEARD_BANSHEE will not be set because of the unlink. Or am I missing something? Would you please offer a test case that reproduces the behavior in the description? |
| Comment by nasf (Inactive) [ 24/Feb/15 ] |
|
>> For open case, the client side open handling thread may hit error after the MDT grant the open. Under the such case, the client should send close RPC to the MDT as cleanup; otherwise, the open handle on the MDT will be leaked there until the client umount or evicted. In further, if someone unlinked the file, but because the open handle holds the reference on such file/object, then it will block the subsequent threads that want to locate such object via FID.

Sorry, the issue description was misleading. It should say that LFSCK sets LU_OBJECT_HEARD_BANSHEE on the MDT-object to repair a multiple-referenced OST-object, while that MDT-object is still referenced by the leaked open handle. I have updated the issue description. As for the test case, sanity-lfsck test_17 can be used to verify that. All patches based on current master failed because of this issue. |
| Comment by Gerrit Updater [ 25/Feb/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13848/ |
| Comment by Jodi Levi (Inactive) [ 26/Feb/15 ] |
|
Patch landed to Master. Additional work will land under |