[LU-6230] open handle leak Created: 10/Feb/15  Updated: 26/Feb/15  Resolved: 26/Feb/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Blocker
Reporter: nasf (Inactive) Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-6272 sanity-lfsck test_17: MDS deadlock Resolved
Related
is related to LU-6301 open handle leak Resolved
Severity: 3
Rank (Obsolete): 17440

 Description   

In the open path, the client-side thread handling the open may hit an error after the MDT has granted the open. In that case, the client should send a close RPC to the MDT as cleanup; otherwise, the open handle on the MDT is leaked until the client unmounts or is evicted.
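
The cleanup rule above can be illustrated with a toy model. This is only a sketch in Python, not Lustre code: `ToyMDT`, `client_open`, `failing_setup`, and the `close_on_error` switch are all hypothetical names invented for illustration, and the "RPCs" are plain method calls.

```python
class ToyMDT:
    """Toy stand-in for the MDT: it only tracks granted open handles."""

    def __init__(self):
        self._next_handle = 1
        self.open_handles = {}  # handle -> (client, fid)

    def open(self, client, fid):
        """Grant an open and return the new handle."""
        handle = self._next_handle
        self._next_handle += 1
        self.open_handles[handle] = (client, fid)
        return handle

    def close(self, handle):
        """Release a handle, as a close RPC would."""
        self.open_handles.pop(handle, None)

    def evict(self, client):
        """Drop every handle owned by an evicted (or unmounted) client."""
        self.open_handles = {
            h: v for h, v in self.open_handles.items() if v[0] != client
        }


def client_open(mdt, client, fid, local_setup, close_on_error=True):
    """Open on the toy MDT, then run client-side setup.

    If the setup fails after the open was granted, send the close as
    cleanup (unless close_on_error=False, which models the leak).
    """
    handle = mdt.open(client, fid)
    try:
        local_setup(handle)
    except Exception:
        if close_on_error:
            mdt.close(handle)
        raise
    return handle


def failing_setup(handle):
    """Hypothetical client-side step that fails after the open is granted."""
    raise OSError("client-side failure after the open was granted")
```

With `close_on_error=False`, a failed `client_open` leaves an entry in `mdt.open_handles` that only `evict` removes; with the default, the error path leaves the table empty.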

If LFSCK sets LU_OBJECT_HEARD_BANSHEE on an MDT-object that others have open, in order to repair an inconsistency (for example, a multiple-referenced OST-object), the leaked open handle still holds a reference on that MDT-object. Subsequent threads that try to locate the object via FID are then blocked:

23:07:57:INFO: task mdt00_000:6380 blocked for more than 120 seconds.
23:07:57:      Not tainted 2.6.32-504.8.1.el6_lustre.g0ef66b1.x86_64 #1
23:07:57:mdt00_000     D 0000000000000001     0  6380      2 0x00000080
23:07:57:Call Trace:
23:07:57: [<ffffffffa05f62af>] ? lu_object_find_try+0x9f/0x260 [obdclass]
23:07:57: [<ffffffffa05f64ad>] lu_object_find_at+0x3d/0xe0 [obdclass]
23:07:57: [<ffffffffa05f6566>] lu_object_find+0x16/0x20 [obdclass]
23:07:57: [<ffffffffa0ebe056>] mdt_object_find+0x56/0x170 [mdt]
23:07:57: [<ffffffffa0ef5407>] mdt_reint_open+0x1527/0x2c70 [mdt]
23:07:57: [<ffffffffa0edd0cd>] mdt_reint_rec+0x5d/0x200 [mdt]
23:07:57: [<ffffffffa0ec123b>] mdt_reint_internal+0x4cb/0x7a0 [mdt]
23:07:57: [<ffffffffa0ec1706>] mdt_intent_reint+0x1f6/0x430 [mdt]
23:07:57: [<ffffffffa0ebfcf4>] mdt_intent_policy+0x494/0xce0 [mdt]
23:07:57: [<ffffffffa07c24f9>] ldlm_lock_enqueue+0x129/0x9d0 [ptlrpc]
23:07:57: [<ffffffffa07ee48b>] ldlm_handle_enqueue0+0x51b/0x13f0 [ptlrpc]
23:07:57: [<ffffffffa086e951>] tgt_enqueue+0x61/0x230 [ptlrpc]
23:07:57: [<ffffffffa086f59e>] tgt_request_handle+0x8be/0x1000 [ptlrpc]
23:07:57: [<ffffffffa081f5c1>] ptlrpc_main+0xe41/0x1960 [ptlrpc]
23:07:57: [<ffffffff8109e66e>] kthread+0x9e/0xc0
23:07:57: [<ffffffff8100c20a>] child_rip+0xa/0x20
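
The blocking shown in the trace can be modeled with a small reference-counted cache. This is a simplified sketch in Python under stated assumptions, not the obdclass implementation: `ToyObject`, `ToyObjectCache`, and their methods are hypothetical, and the model assumes only that a lookup which finds a dying object must wait until the last reference to it is dropped.

```python
import threading


class ToyObject:
    """Toy cached object with a reference count and a 'dying' flag."""

    def __init__(self, fid):
        self.fid = fid
        self.refcount = 0
        self.dying = False


class ToyObjectCache:
    """Toy FID -> object cache; lookups wait while a dying object lingers."""

    def __init__(self):
        self._cond = threading.Condition()
        self._objects = {}

    def _usable(self, fid):
        obj = self._objects.get(fid)
        return obj is None or not obj.dying

    def find(self, fid, timeout=None):
        """Return a referenced object, or None if the wait timed out."""
        with self._cond:
            if not self._cond.wait_for(lambda: self._usable(fid), timeout):
                return None  # still blocked by a dying, referenced object
            obj = self._objects.get(fid)
            if obj is None:
                obj = ToyObject(fid)
                self._objects[fid] = obj
            obj.refcount += 1
            return obj

    def _drop_if_unreferenced(self, obj):
        if obj.dying and obj.refcount == 0:
            if self._objects.get(obj.fid) is obj:
                del self._objects[obj.fid]
            self._cond.notify_all()

    def put(self, obj):
        """Release one reference; a dying object is freed on the last put."""
        with self._cond:
            obj.refcount -= 1
            self._drop_if_unreferenced(obj)

    def mark_dying(self, obj):
        """Flag the object so that it is discarded once unreferenced."""
        with self._cond:
            obj.dying = True
            self._drop_if_unreferenced(obj)


cache = ToyObjectCache()
leaked = cache.find("fid-1")      # reference held by the leaked open handle
repair = cache.find("fid-1")      # reference held by the repairing thread
cache.mark_dying(repair)
cache.put(repair)
blocked = cache.find("fid-1", timeout=0.05)  # times out while 'leaked' is held
cache.put(leaked)                 # the close finally arrives
fresh = cache.find("fid-1", timeout=0.05)    # a new object is created
```

In this model the second lookup cannot proceed until the leaked reference is released, which mirrors the hung `mdt_reint_open` thread in the trace above.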


 Comments   
Comment by Gerrit Updater [ 10/Feb/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13709
Subject: LU-6230 llite: cleanup open handle for client open failure
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 307f1432afc920155b8b800841a626fcfe858bc8

Comment by nasf (Inactive) [ 21/Feb/15 ]

In fact, the LU-5791 patch http://review.whamcloud.com/#/c/13392/ (which has already landed on master) depends on this patch; without it, sanity-lfsck fails.

Comment by Andreas Dilger [ 24/Feb/15 ]

Fan Yong, as much as I want to land this patch to fix the current testing problem, it is bad that a poorly-behaving client can cause the MDS to deadlock. That seems like a bug in the MDS code that it is blocked on a client open reference, even if the client is doing the wrong thing. I'd expect the MDS to not care at all whether the file is open and/or unlinked forever, since this is a normal use case and not even an error. At worst the MDS should evict such a client if there is a clear problem.

Comment by nasf (Inactive) [ 24/Feb/15 ]

We found this issue through sanity-lfsck test_17 failures. There are some failure instances in LU-6272.
The root cause of the failure is that when LFSCK repairs a multiple-referenced OST-object, it sets LU_OBJECT_HEARD_BANSHEE on the unrecognized MDT-object so that the MDT-object reloads its OSP-object after the LOV EA is refreshed. But at that time, others may still reference that wrong MDT-object (in sanity-lfsck test_17, it is the leaked open handle). All subsequent attempts to locate that MDT-object are then blocked.

Generally, asking others to release their references is NOT the right solution, because it is normal for someone to hold a reference on the MDT-object for a very long time, for example via open().

The right solution is NOT to set LU_OBJECT_HEARD_BANSHEE on the MDT-object; instead, we should make the LOD-object re-attach the OSP-object(s) after the LOV EA is refreshed.
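
As a rough illustration of the difference, refreshing the sub-objects in place leaves the parent object valid, so nobody holding a reference has to release it. This is a toy Python sketch only; `ToyLODObject`, `set_layout`, and the sub-object names are hypothetical and do not reflect the actual LOD/OSP interfaces.

```python
class ToyLODObject:
    """Toy stand-in for a LOD-object that caches its OSP sub-objects."""

    def __init__(self, fid, layout):
        self.fid = fid
        self.refcount = 0
        # Hypothetical: names of the currently attached sub-objects.
        self.sub_objects = list(layout)

    def set_layout(self, layout):
        """Re-attach sub-objects in place; existing references stay valid."""
        self.sub_objects = list(layout)


obj = ToyLODObject("fid-1", ["ost-obj-A"])
obj.refcount += 1               # a long-lived reference, e.g. an open handle
obj.set_layout(["ost-obj-B"])   # the repair refreshes the layout in place
```

The parent object is never flagged for discard here, so later lookups have nothing to wait for regardless of how long the existing reference is held.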

Comment by Gerrit Updater [ 24/Feb/15 ]

Fan Yong (fan.yong@intel.com) uploaded a new patch: http://review.whamcloud.com/13848
Subject: LU-6230 lfsck: reload OSP-object via set LOV EA on LOD-object
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 3d020e94a3bdb74f4db956f551bac058eaf18a44

Comment by John Hammond [ 24/Feb/15 ]

> In the open path, the client-side thread handling the open may hit an error after the MDT has granted the open. In that case, the client should send a close RPC to the MDT as cleanup; otherwise, the open handle on the MDT is leaked until the client unmounts or is evicted. Furthermore, if someone unlinks the file, the open handle still holds a reference on that file/object, so it blocks subsequent threads that want to locate the object via FID.

Is this description correct? If there is an open handle then mod_count should remain positive and so the object will not be destroyed. So LU_OBJECT_HEARD_BANSHEE will not be set because of the unlink. Or am I missing something? Would you please offer a test case that reproduces the behavior in the description?

Comment by nasf (Inactive) [ 24/Feb/15 ]

>> In the open path, the client-side thread handling the open may hit an error after the MDT has granted the open. In that case, the client should send a close RPC to the MDT as cleanup; otherwise, the open handle on the MDT is leaked until the client unmounts or is evicted. Furthermore, if someone unlinks the file, the open handle still holds a reference on that file/object, so it blocks subsequent threads that want to locate the object via FID.
> Is this description correct? If there is an open handle then mod_count should remain positive and so the object will not be destroyed. So LU_OBJECT_HEARD_BANSHEE will not be set because of the unlink. Or am I missing something? Would you please offer a test case that reproduces the behavior in the description?

Sorry, the issue description was misleading. It should say that LFSCK sets LU_OBJECT_HEARD_BANSHEE on the MDT-object to repair a multiple-referenced OST-object while that MDT-object is still referenced by the leaked open handle. I have updated the issue description.

As for the test case, sanity-lfsck test_17 can be used to verify that. All patches based on the current master fail because of this issue.

Comment by Gerrit Updater [ 25/Feb/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13848/
Subject: LU-6230 lfsck: reload OSP-object via set LOV EA on LOD-object
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 7c82a9c81d03dec059132dddafd0bdde188b321d

Comment by Jodi Levi (Inactive) [ 26/Feb/15 ]

Patch landed on master. Additional work will land under LU-6301.
