[LU-2487] 2.2 Client deadlock between ll_md_blocking_ast, sys_close, and sys_open Created: 13/Dec/12  Updated: 03/Oct/13  Resolved: 02/Jan/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.2.0
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Artem Blagodarenko (Inactive) Assignee: Andreas Dilger
Resolution: Cannot Reproduce Votes: 0
Labels: client, patch

Issue Links:
Related
is related to LU-4053 client leaking objects/locks during IO Resolved
Severity: 2
Rank (Obsolete): 5837

 Description   

Spinlock Usage

Walking through the code and digging through pointers in the stack frames of the 3 threads leads to the following 3 suspect structures and their related spinlocks:

inode = 0xffff880581a26638 (i_lock)
dentry = 0xffff88051f7242c0 (d_lock)
ldlm_resource = 0xffff880308f916c0 (lr_lock)

Looks like there is a 3 way deadlock between ll_md_blocking_ast, sys_open, and sys_close using the above spinlocks.

CPU 10: ll_md_blocking_ast()
+ ll_inode_from_resource() gets lock-l_resource->lr_lock

  • igrab() wants lock->l_resource->lr_lvb_inode->i_lock

CPU 7: sys_close()
+ dput() gets dentry->d_lock

  • ldlm_resource_foreach() wants res->lr_lock

CPU 15: sys_open
+ ll_splice_alias()/ll_find_alias() gets inode->i_lock

  • ll_splice_alias()/ll_find_alias() wants dentry->d_lock

The lr_lock, d_lock, and i_lock are the same in all cases. So ll_md_blocking_ast() waits for sys_open(), sys_open waits for sys_close(), and sys_close() waits for ll_md_blocking_ast. A 3-way deadlock.

This deadlock is possible because of two pathces:
1) LU-903(not landed yet) (bz24555) - get inode lock. If we get ldlm_lock lock thin no deadlock possible.
resource dentry inode
ldlm_lock resource dentry
2) dcache_lock removing (http://review.whamcloud.com/#change,1865). Before this patch (with bz24555 enabled)also no deadlock possible.
resource dentry inode
inode resource dcache_lock



 Comments   
Comment by Artem Blagodarenko (Inactive) [ 13/Dec/12 ]

I set minor priority because race is possible only with patch from LU-903. I already have patch, but need to rebase it ot master.

Comment by Artem Blagodarenko (Inactive) [ 14/Dec/12 ]

Xyratex MRP-675

Comment by Artem Blagodarenko (Inactive) [ 14/Dec/12 ]

http://review.whamcloud.com/4833

Comment by Andreas Dilger [ 21/Dec/12 ]

Which client kernel is this? The new dcache_lock removal is only in use for kernels > 2.6.37, so I guess this is SLES11 SP2 (3.0)? I'm trying to see where there is a deadlock in the code, because your original comment is not showing the callpath to the function getting the second lock. I'm guessing something like:

CPU 10: ll_md_blocking_ast()
   + ll_inode_from_resource()/ll_inode_from_lock() gets lock->l_lock and lock->l_resource->lr_lock
   - igrab() wants lock->l_resource->lr_lvb_inode->i_lock

CPU 7: sys_close()
   + dput() gets dentry->d_lock
   - ll_ddelete->find_cbdata->ldlm_resource_foreach() wants res->lr_lock

CPU 15: sys_open
   + ll_splice_alias()/ll_find_alias() gets inode->i_lock 
   - ll_splice_alias()/ll_find_alias() wants dentry->d_lock

The ll_ddelete->find_cbdata() path has been disabled in 9f3469f1:

        /* Disable this piece of code temproarily because this is called
         * inside dcache_lock so it's not appropriate to do lots of work
         * here. */
#if 0
        /* if not ldlm lock for this inode, set i_nlink to 0 so that
         * this inode can be recycled later b=20433 */
        if (de->d_inode && !find_cbdata(de->d_inode))
                clear_nlink(de->d_inode);
#endif

So it seems there is no need for this complex patch?

Comment by Artem Blagodarenko (Inactive) [ 29/Dec/12 ]

Yes, we do not need this patch, because one of deadlock's branches is disabled in current master. But I have added comment "LU-2487 need to be resolved before branch can be enabled again" to LU-903 patch to prevent possible deadlock if somebody enable that branch.

Comment by Artem Blagodarenko (Inactive) [ 29/Dec/12 ]

Can we close this issue with "wan't fix"?

Comment by Andreas Dilger [ 02/Jan/13 ]

I think the comment you added in the LU-903 patch should be enough for now. Thanks for the update.

I'm going to close this as Cannot Reproduce, since the problem never existed in any of the public Lustre releases.

Generated at Sat Feb 10 01:25:37 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.