[LU-6085] racer stuck on mutex_lock in ll_setattr_raw() Created: 07/Jan/15 Updated: 16/Jan/15 Resolved: 16/Jan/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Jinshan Xiong (Inactive) | Assignee: | Jinshan Xiong (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | mq115 |
| Issue Links: |
| Severity: | 3 |
| Rank (Obsolete): | 16934 |
| Description |
|
The stuck process has the following stack trace:

chmod D 0000000000000000 0 25015 1 0x00000000
 ffff880175afdca8 0000000000000086 ffff880175afdc88 ffffffffa077c842
 ffff880175afdc28 ffff880182b3d400 ffffffff8100b9ce ffff880175afdca8
 ffff88018162baf8 ffff880175afdfd8 000000000000fb88 ffff88018162baf8
Call Trace:
 [<ffffffffa077c842>] ? __req_capsule_get+0x162/0x6d0 [ptlrpc]
 [<ffffffff8100b9ce>] ? common_interrupt+0xe/0x13
 [<ffffffff810521eb>] ? mutex_spin_on_owner+0x9b/0xc0
 [<ffffffff8150fc5e>] __mutex_lock_slowpath+0x13e/0x180
 [<ffffffff8150fafb>] mutex_lock+0x2b/0x50
 [<ffffffffa0e92e5c>] ll_setattr_raw+0x58c/0x1ae0 [lustre]
 [<ffffffff81192a72>] ? user_path_at+0x62/0xa0
 [<ffffffffa0e94415>] ll_setattr+0x65/0xd0 [lustre]
 [<ffffffff8119ead8>] notify_change+0x168/0x340
 [<ffffffff8117ee13>] sys_fchmodat+0xc3/0x100
 [<ffffffff81186fc6>] ? sys_newstat+0x36/0x50
 [<ffffffff8151171e>] ? do_device_not_available+0xe/0x10
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

It turned out that the inode mutex was already held by the current thread itself. The root cause is in ll_md_setattr(), which calls simple_setattr() even when setting the attribute on the MDT fails:

	ptlrpc_req_finished(request);
	if (rc == -ENOENT) {
		clear_nlink(inode);
		/* Unlinked special device node? Or just a race?
		 * Pretend we done everything. */
		if (!S_ISREG(inode->i_mode) &&
		    !S_ISDIR(inode->i_mode)) {
			ia_valid = op_data->op_attr.ia_valid;
			op_data->op_attr.ia_valid &= ~TIMES_SET_FLAGS;
			rc = simple_setattr(dentry, &op_data->op_attr);
			op_data->op_attr.ia_valid = ia_valid;
		}
	} else if (rc != -EPERM && rc != -EACCES && rc != -ETXTBSY) {
		CERROR("md_setattr fails: rc = %d\n", rc);
	}
	RETURN(rc);

racer may try to change a SOCK file into a regular file, which will always fail. If that file happens to have been deleted already, the client encounters an ENOENT error and therefore calls simple_setattr(). This changes the file's mode to that of a regular file, which in turn leaves the thread stuck in mutex_lock(). I will push a patch to fix this issue. |
| Comments |
| Comment by Jinshan Xiong (Inactive) [ 07/Jan/15 ] |
|
This issue is more complex than I thought. After I made and applied a patch on the client side, I found that the MDT actually returns a different file type in the setattr reply.

I patched my client code as follows:

diff --git a/lustre/llite/llite_lib.c b/lustre/llite/llite_lib.c
index ee14f15..81d9906 100644
--- a/lustre/llite/llite_lib.c
+++ b/lustre/llite/llite_lib.c
@@ -1498,15 +1498,18 @@ static int ll_md_setattr(struct dentry *dentry, struct md_op_data *
 		ptlrpc_req_finished(request);
 		if (rc == -ENOENT) {
 			clear_nlink(inode);
+#if 0
 			/* Unlinked special device node? Or just a race?
 			 * Pretend we done everything. */
 			if (!S_ISREG(inode->i_mode) &&
 			    !S_ISDIR(inode->i_mode)) {
 				ia_valid = op_data->op_attr.ia_valid;
 				op_data->op_attr.ia_valid &= ~TIMES_SET_FLAGS;
+				op_data->op_attr.ia_valid &= ~ATTR_MODE;
 				rc = simple_setattr(dentry, &op_data->op_attr);
 				op_data->op_attr.ia_valid = ia_valid;
 			}
+#endif
 		} else if (rc != -EPERM && rc != -EACCES && rc != -ETXTBSY) {
 			CERROR("md_setattr fails: rc = %d\n", rc);
 		}
@@ -1520,6 +1523,14 @@ static int ll_md_setattr(struct dentry *dentry, struct md_op_data *o
 		RETURN(rc);
 	}
+	if (md.body->mbo_valid & OBD_MD_FLTYPE)
+		LASSERTF((inode->i_mode & S_IFMT) == (md.body->mbo_mode & S_IFMT),
+			 "mode changed: %o -> %o, ia_valid = %x, mode = %o,"
+			 " FID = "DFID"/"DFID" \n",
+			 inode->i_mode & S_IFMT, md.body->mbo_mode & S_IFMT,
+			 ia_valid, op_data->op_attr.ia_mode,
+			 PFID(ll_inode2fid(inode)), PFID(&md.body->mbo_fid1));
+
 	ia_valid = op_data->op_attr.ia_valid;
 	/* inode size will be in ll_setattr_ost, can't do it now since dirty
 	 * cache is not cleared yet.
 	 */

and this is what I got:

LustreError: 26785:0:(llite_lib.c:1543:ll_md_setattr()) ASSERTION( (inode->i_mode & S_IFMT) == (md.body->mbo_mode & S_IFMT) ) failed: mode changed: 40000 -> 100000, ia_valid = 10000046, mode = 0, FID = [0x200000403:0x8f5:0x0]/[0x200000403:0x8f5:0x0]
LustreError: 26785:0:(llite_lib.c:1543:ll_md_setattr()) LBUG
Pid: 26785, comm: chown

Call Trace:
 [<ffffffffa0483895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
 [<ffffffffa0483e97>] lbug_with_loc+0x47/0xb0 [libcfs]
 [<ffffffffa0eedc04>] ll_setattr_raw+0x1304/0x1c60 [lustre]
 [<ffffffffa0eee5c5>] ll_setattr+0x65/0xd0 [lustre]
 [<ffffffff8119ead8>] notify_change+0x168/0x340
 [<ffffffff81192a72>] ? user_path_at+0x62/0xa0
 [<ffffffff8117e94e>] chown_common+0x6e/0x90
 [<ffffffff8117ec96>] sys_fchownat+0x96/0xb0
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

It clearly shows that the client and the MDT had different ideas about the type of the file. This needs further investigation. |
| Comment by Gerrit Updater [ 12/Jan/15 ] |
|
Jinshan Xiong (jinshan.xiong@intel.com) uploaded a new patch: http://review.whamcloud.com/13344 |
| Comment by Vinayak Hariharmath (Inactive) [ 12/Jan/15 ] |
|
Hi Jinshan,

I did not understand how the patch http://review.whamcloud.com/13344 will resolve this issue.

Call Trace:
 [<ffffffffa077c842>] ? __req_capsule_get+0x162/0x6d0 [ptlrpc]

If I am not wrong, the fix you introduced only comes into the picture after the reply has been received, but the call trace suggests that the client is stuck waiting for the request. Please give me some information, or correct me if I am wrong. |
| Comment by Jinshan Xiong (Inactive) [ 15/Jan/15 ] |
|
I think the patch is correct. The call trace shows that the process was waiting for the inode mutex, not for an RPC request to finish. |
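One way to read the disputed trace: in Linux kernel stack dumps, a frame printed with a leading `?` is an address found on the stack that the unwinder could not confirm as part of the live call chain. The `__req_capsule_get` entry is therefore likely stale, and the first unmarked frame is `__mutex_lock_slowpath`. The sketch below applies that rule to the frames from the description. `frame_is_unreliable` and `first_reliable_frame` are hypothetical helper names, and the parsing assumes the `[<address>] symbol` layout shown in this ticket.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the symbol after "[<address>]" is prefixed with '?'. */
static int frame_is_unreliable(const char *frame)
{
	const char *p = strchr(frame, ']');	/* end of "[<address>]" */

	if (p == NULL)
		return 0;
	p++;
	while (*p == ' ')
		p++;
	return *p == '?';
}

/* Return the first frame not marked with '?', or NULL if there is none. */
static const char *first_reliable_frame(const char *const frames[], size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (!frame_is_unreliable(frames[i]))
			return frames[i];
	}
	return NULL;
}
```

Given the first five frames of the chmod trace, `first_reliable_frame()` skips the three `?` entries and returns the `__mutex_lock_slowpath` frame, which is consistent with the process waiting on a mutex.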
| Comment by Gerrit Updater [ 16/Jan/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/13344/ |
| Comment by Jian Yu [ 16/Jan/15 ] |
|
Hi Jinshan, while back-porting the patch to the Lustre b2_5 branch, I hit some conflicts. Could you please create a patch on b2_5 to resolve the similar failure in
| Comment by Jodi Levi (Inactive) [ 16/Jan/15 ] |
|
Patch landed to Master. |