[LU-3544] Writing to new files under NFS export from Lustre will result in ENOENT (SLES11SP2) Created: 01/Jul/13 Updated: 27/Jan/15 Resolved: 20/Jun/14 |
|
| Status: | Closed |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0 |
| Fix Version/s: | Lustre 2.6.0 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Cheng Shao (Inactive) | Assignee: | Lai Siyao |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | patch |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 8922 |
| Description |
|
After resolving the issues in , creating and reading files works fine, but attempting to write to a new file results in ENOENT coming back:

00000080:00000001:1.0:1372108818.561839:0:12348:0:(file.c:407:ll_intent_file_open()) Process leaving via out (rc=18446744073709551614 : -2 : 0xfffffffffffffffe)

Here's the line of code:

    if (it_disposition(itp, DISP_LOOKUP_NEG))
            GOTO(out, rc = -ENOENT);

Looking through the DK logs, we noticed something strange. This is the intent open log message for the attempted write:

00800000:00000002:0.0:1372108818.561125:0:12348:0:(lmv_intent.c:198:lmv_intent_open()) OPEN_INTENT with fid1=[0x2000056c1:0x2:0x0], fid2=[0x2000056c1:0x4:0x0], name='/' -> mds #0

Except this write is to a file named something like 'myfile'. Specifically, "echo 5 > myfile". |
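For context on the stray name='/': as John Hammond notes later in this thread, the kernel uses '/' as the name of disconnected (anonymous) dentries produced by NFS file-handle conversion, so the client packs that placeholder instead of the real file name. A minimal, purely illustrative sketch of recognizing such a dentry (not the actual patch; the helper name is hypothetical):

    /* Illustrative only: an anonymous dentry from NFS file-handle
     * conversion carries the single-character name "/" rather than
     * the real file name. */
    static inline bool name_is_anonymous(const struct dentry *dentry)
    {
            return dentry->d_name.len == 1 &&
                   dentry->d_name.name[0] == '/';
    }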
| Comments |
| Comment by Cheng Shao (Inactive) [ 01/Jul/13 ] |
|
This was caused by a patch introduced in SLES11SP2, the same root cause as in . The patch has been verified to resolve the issue. Will submit for review shortly. |
| Comment by Cheng Shao (Inactive) [ 08/Jul/13 ] |
|
Patch is up: http://review.whamcloud.com/#/c/6920/ |
| Comment by Patrick Farrell (Inactive) [ 23/Jul/13 ] |
|
This patch causes an interoperability issue between 2.1 servers and 2.4 clients:

LustreError: 29927:0:(mdt_handler.c:226:mdt_lock_pdo_init()) ASSERTION( namelen > 0 ) failed.

A look at the 2.4 and 2.1 (using the WC/Intel b2_1 source) code around this assertion makes it clear why it works with 2.4. Here's 2.1:

    if (name != NULL) {
            LASSERT(namelen > 0);
            lh->mlh_pdo_hash = full_name_hash(name, namelen);
    } else {
            LASSERT(namelen == 0);
            lh->mlh_pdo_hash = 0ull;
    }

Here's 2.4. The important part is the change to the 'if' condition; ignore the 'Workaround for LU-2856' comment:

    if (name != NULL && (name[0] != '\0')) {
            LASSERT(namelen > 0);
            lh->mlh_pdo_hash = full_name_hash(name, namelen);
            /* XXX Workaround for LU-2856
             * Zero is a valid return value of full_name_hash, but several
             * users of mlh_pdo_hash assume a non-zero hash value. We
             * therefore map zero onto an arbitrary, but consistent
             * value (1) to avoid problems further down the road. */
            if (unlikely(!lh->mlh_pdo_hash))
                    lh->mlh_pdo_hash = 1;
    } else {
            LASSERT(namelen == 0);
            lh->mlh_pdo_hash = 0ull;
    }

It seems the 2.1 fix should be as simple as replacing 'if (name != NULL)' with 'if (name != NULL && (name[0] != '\0'))': apparently when the name is not provided by the client, name is not NULL, but its first character is '\0' and namelen is 0, so the 2.1 LASSERT fires. I will post a patch to b2_1 with this change. |
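A minimal sketch of what that b2_1 change would look like, simply applying the 2.4 condition to the 2.1 code shown above (my reconstruction, not the patch as posted):

    /* Sketch of the proposed b2_1 fix: with the 2.4 condition, an empty
     * name (name != NULL, name[0] == '\0', namelen == 0) falls into the
     * no-name branch instead of tripping LASSERT(namelen > 0). */
    if (name != NULL && (name[0] != '\0')) {
            LASSERT(namelen > 0);
            lh->mlh_pdo_hash = full_name_hash(name, namelen);
    } else {
            LASSERT(namelen == 0);
            lh->mlh_pdo_hash = 0ull;
    }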
| Comment by Bob Glossman (Inactive) [ 23/Jul/13 ] |
|
Patrick, |
| Comment by Patrick Farrell (Inactive) [ 23/Jul/13 ] |
|
Bob, I'm not entirely certain it doesn't. Cray doesn't run 2.1 clients anywhere, so I'm not well placed to test. However, it doesn't seem like it would: if the first character of a name is null, then it makes sense to treat the hash as zero, which is what the change does. Separately, Cheng just pointed out that 2.1 does not support the MDS_OPEN_BY_FID flag. It appears this flag is only used in ll_intent_file_open(), which is normally exercised when exporting via NFS and only occasionally otherwise. |
| Comment by Patrick Farrell (Inactive) [ 23/Jul/13 ] |
|
I'll need to take a more thorough look at the code to see why, but I wanted to report the results of today's testing. I've confirmed that despite not supporting the MDS_OPEN_BY_FID flag, which is being set by the Lustre client, 2.1 servers handle NFS-exported Lustre just fine. |
| Comment by Lai Siyao [ 17/Aug/13 ] |
|
|
| Comment by Patrick Farrell (Inactive) [ 22/Aug/13 ] |
|
This is related to the current discussion on . If we can fix that, then we can go back to sending name and namelen. Looking at this particular snippet of code, it seems the only way DISP_LOOKUP_NEG would get set is if mdt_open_by_fid_lock() returns ENOENT (in that case, it's set in open_by_fid_lock). But I don't see how that could happen when the dentry name is not null, and not happen when it is. Thoughts? |
| Comment by Lai Siyao [ 23/Aug/13 ] |
|
Open-by-FID was not clearly defined in the Lustre code when lu_fid was introduced, and before lu_fid there was no globally unique identifier for a file, so many operations still fall back to the name after the FID lookup fails; this is technical debt. I'm planning to clean up this code. After that, the client request will pack either the name or the FID (if it's available), and the server will likewise use only the FID or only the name. This change is a little big and will take more time. |
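A hypothetical sketch of the client-side half of that cleanup, using the md_op_data field names that appear elsewhere in this ticket; the branch logic is my assumption, not the code in the eventual patch:

    /* Pack either the FID or the name, never a half-filled mix of both.
     * fid_is_sane() and op_fid2 appear elsewhere in this thread; the
     * decision logic here is illustrative. */
    if (fid_is_sane(&op_data->op_fid2)) {
            /* open by FID: send no name at all */
            op_data->op_name = NULL;
            op_data->op_namelen = 0;
    } else {
            /* open by name under the parent FID */
            op_data->op_name = name;
            op_data->op_namelen = namelen;
    }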
| Comment by Oleg Drokin [ 01/Oct/13 ] |
|
The patch in this ticket was reverted from master due to interop problems described in |
| Comment by Patrick Farrell (Inactive) [ 09/Oct/13 ] |
|
Lai, Peng, I understand that cleaning up the MDS_OPEN_BY_FID code to pay down the technical debt is the eventual goal here. However: |
| Comment by Lai Siyao [ 10/Oct/13 ] |
|
Patrick, yes, the reverted patch could partly work; it was reverted because it broke the interop test, and we want that test to always work so that a new commit can't break it silently. I'll include the change http://review.whamcloud.com/#/c/6920/ in http://review.whamcloud.com/#/c/7476/ for |
| Comment by John Hammond [ 15/Oct/13 ] |
|
In mdt_reint_open() we set DISP_LOOKUP_NEG even when MDS_OPEN_CREAT is used, the create is successful, and 0 is returned:

    if (result == -ENOENT || result == -ESTALE) {
            mdt_set_disposition(info, ldlm_rep, DISP_LOOKUP_NEG);
            if (result == -ESTALE) {
                    /*
                     * -ESTALE means the parent is a dead (unlinked) dir, so
                     * it should return -ENOENT in accordance with the
                     * original MDS implementation.
                     */
                    GOTO(out_parent, result = -ENOENT);
            }
            if (!(create_flags & MDS_OPEN_CREAT))
                    GOTO(out_parent, result);
            if (exp_connect_flags(req->rq_export) & OBD_CONNECT_RDONLY)
                    GOTO(out_parent, result = -EROFS);
            *child_fid = *info->mti_rr.rr_fid2;
            LASSERTF(fid_is_sane(child_fid), "fid="DFID"\n",
                     PFID(child_fid));
            /* In the function below, .hs_keycmp resolves to
             * lu_obj_hop_keycmp() */
            /* coverity[overrun-buffer-val] */
            child = mdt_object_new(info->mti_env, mdt, child_fid);

Perhaps this issue could be addressed by only setting DISP_LOOKUP_NEG when no create will happen:

    if (result == -ENOENT || result == -ESTALE) {
            if (result == -ESTALE) {
                    /*
                     * -ESTALE means the parent is a dead (unlinked) dir, so
                     * it should return -ENOENT in accordance with the
                     * original MDS implementation.
                     */
                    mdt_set_disposition(info, ldlm_rep, DISP_LOOKUP_NEG);
                    GOTO(out_parent, result = -ENOENT);
            }
            if (!(create_flags & MDS_OPEN_CREAT)) {
                    mdt_set_disposition(info, ldlm_rep, DISP_LOOKUP_NEG);
                    GOTO(out_parent, result);
            }
            if (exp_connect_flags(req->rq_export) & OBD_CONNECT_RDONLY)
                    GOTO(out_parent, result = -EROFS);
            *child_fid = *info->mti_rr.rr_fid2;
            LASSERTF(fid_is_sane(child_fid), "fid="DFID"\n",
                     PFID(child_fid));
            /* In the function below, .hs_keycmp resolves to
             * lu_obj_hop_keycmp() */
            /* coverity[overrun-buffer-val] */
            child = mdt_object_new(info->mti_env, mdt, child_fid);
|
| Comment by John Hammond [ 15/Oct/13 ] |
|
But I think this would require the clients to change the way they check for stale FID/name pairs. In mdc_finish_intent_lock(), if M_CHECK_STALE is set then we compare the returned FID to both op_fid2 and op_fid3. If O_CREAT was set and op_fid3 was stale (because the old file was unlinked), then the logic is a bit off, since the old FID (op_fid3) is stale but the returned FID is equal to op_fid2. So the above proposal might fix ll_intent_file_open() at the expense of "trying to change FID" LBUGs in ll_update_inode() from ll_revalidate_it(). Bah! Also, if you're trying to trace through this, please be aware that mdc_finish_intent_lock() prints op_fid2 twice when it detects a stale FID:

    if ((!lu_fid_eq(&op_data->op_fid2, &mdt_body->fid1)) &&
        (!lu_fid_eq(&op_data->op_fid3, &mdt_body->fid1))) {
            CDEBUG(D_DENTRY, "Found stale data "DFID"("DFID")/"DFID
                   "\n", PFID(&op_data->op_fid2),
                   PFID(&op_data->op_fid2), PFID(&mdt_body->fid1));
            RETURN(-ESTALE);
    }

(I found that a bit confusing at first.) |
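Presumably the second PFID argument was meant to be op_fid3, so that the message shows both candidate FIDs against the returned one. A corrected debug line would look like this (my reading of the intent, not a landed fix):

    /* Assumed fix: print op_fid3 in the parenthesized slot instead of
     * repeating op_fid2. */
    CDEBUG(D_DENTRY, "Found stale data "DFID"("DFID")/"DFID"\n",
           PFID(&op_data->op_fid2), PFID(&op_data->op_fid3),
           PFID(&mdt_body->fid1));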
| Comment by Lai Siyao [ 16/Oct/13 ] |
|
Yes, it's tricky for open-by-FID when create is set because, as you know, we have the "trying to change FID" LBUG: if the file with the specified FID doesn't exist, we could only create it with the old FID (exclusively), but that is also strange, because it doesn't conform to the FID design and may confuse user-space backup tools. So in http://review.whamcloud.com/#/c/7476/, if the open is by FID, it won't create upon -ENOENT, and per Fan's suggestion the client should return -ESTALE so as not to confuse user-space applications. This may not conform to POSIX, but it's the best we can do; the root cause is that revalidate and file open are not atomic. |
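A hedged sketch of the server-side behaviour Lai describes, reusing the MDS_OPEN_BY_FID flag and the GOTO/out_parent pattern from John's excerpt above; the exact test is an assumption, not the code in 7476:

    /* If the open came in by FID, don't fall through to create on
     * -ENOENT; return the error and let the client map it to -ESTALE.
     * The open_flags check is illustrative. */
    if (result == -ENOENT && (open_flags & MDS_OPEN_BY_FID))
            GOTO(out_parent, result);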
| Comment by John Hammond [ 16/Oct/13 ] |
|
Returning -ENOENT or -ESTALE on open(..., O_CREAT) is not acceptable. We shouldn't add this breakage to fix NFS export. |
| Comment by Lai Siyao [ 16/Oct/13 ] |
|
The current code can't handle this either, IMHO: either it will return an error from the M_CHECK_STALE check if the FID differs from the known ones, or it will LBUG in ll_update_inode(). |
| Comment by John Hammond [ 16/Oct/13 ] |
|
As long as the parent exists, the current code does not return -ENOENT or -ESTALE on opens with O_CREAT set. |
| Comment by Patrick Farrell (Inactive) [ 16/Oct/13 ] |
|
John, about what you're saying here: "In mdt_reint_open() we set DISP_LOOKUP_NEG even when MDS_OPEN_CREAT is used, the create is successful, and 0 is returned." In other words, we'd expect creating new files in an NFS export from a 2.4-or-greater client backed by a 2.4-or-greater server to fail with ENOENT. Is that correct? That's not what we observe: from reading the code it seems like that would happen, but we are able to do NFS exports of 2.4 servers from 2.4 clients. I may have missed something in what you're saying. (And I've definitely missed something in the code, since it works fine.) |
| Comment by Lai Siyao [ 17/Oct/13 ] |
|
John, could you check ll_intent_file_open -> mdc_intent_lock -> mdc_finish_intent_lock: if the file was unlinked and recreated by others (so it has a different FID), M_CHECK_STALE will return -ESTALE. |
| Comment by John Hammond [ 17/Oct/13 ] |
|
Lai, I've read the same code as you, but I don't think it happens. Can you give me an example where open(..., O_CREAT) returns -ESTALE or -ENOENT? |
| Comment by Lai Siyao [ 17/Oct/13 ] |
|
1. client1 revalidated (IT_OPEN|IT_CREATE) file1 and found an open lock cached locally, so it returned successfully. |
| Comment by John Hammond [ 17/Oct/13 ] |
|
Can you create this situation and demonstrate that open(..., O_CREAT) returns -ESTALE or -ENOENT, without some trick like deleting the parent directory? With current master and a RHEL 6.5 (2.6.32-358.18.1.el6.lustre.x86_64) kernel I can't make it happen. But maybe things are different on a newer kernel with atomic open support. |
| Comment by Lai Siyao [ 18/Oct/13 ] |
|
I'll do some tests to see the result. By the way, could you explain why the above scenario won't happen? |
| Comment by Lai Siyao [ 18/Oct/13 ] |
|
John, you're right, this won't happen with the current code, because once O_CREAT is set, revalidate ignores the open lock and opens on the MDS anyway, so the file open can always find the open request from the intent. But doesn't this mean that ll_intent_file_open() will never create a file? |
| Comment by Alexey Lyashkov [ 18/Dec/13 ] |
|
This bug is the result of a regression introduced by . Any object stays live on the FS after unlink until the last open is closed; the problem now is that the MDS sets the DEAD_OBJ flag for the unlinked object, but mdd_cd_sanity_check() sees that flag and returns an ENOENT error for the orphan object until it's closed. |
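For reference, a paraphrased sketch of the check Alexey describes; only mdd_cd_sanity_check and the DEAD_OBJ flag come from his comment, and the body (including mdd_is_dead_obj()) is my reconstruction:

    /* Reconstructed sketch: an open-unlinked object carries DEAD_OBJ,
     * so create-data sanity checking rejects it with -ENOENT until the
     * last close. */
    static int mdd_cd_sanity_check(const struct lu_env *env,
                                   struct mdd_object *obj)
    {
            if (obj == NULL || mdd_is_dead_obj(obj))
                    return -ENOENT;   /* DEAD_OBJ set at unlink time */
            return 0;
    }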
| Comment by John Hammond [ 19/Dec/13 ] |
|
Does open+create always fail on NFS exports of Lustre on SLES11SP2? 3.11 also uses '/' as the name of disconnected dentries from NFS file-handle conversion, and I cannot reproduce this issue using Lustre master plus patches for 3.11 on a 3.11 kernel. Alexey, I agree about the handling of open-unlinked objects (and I look forward to your patches). But I also think my "fix" for |
| Comment by Jodi Levi (Inactive) [ 08/May/14 ] |
|
We need the patches that were reverted here to land in 2.6, and we no longer care about 2.1 server compatibility. But when these land, we need to ensure they work with 2.4 and 2.5. |
| Comment by Lai Siyao [ 09/May/14 ] |
|
I'll update http://review.whamcloud.com/#/c/7476/ soon. |
| Comment by Patrick Farrell (Inactive) [ 09/Jun/14 ] |
|
From testing of patch set 20: during Cray testing of NFS export from a SLES11SP3 client with this patch (client+server), I hit the assertion below on MDS0 (which has MDTs 0 and 1 on it). We ran a modified racer against the NFS export and an unmodified racer on a Lustre 2.5.58 client. Dump of MDS0 is up at:

<6>Lustre: Skipped 1 previous similar message |
| Comment by Lai Siyao [ 12/Jun/14 ] |
|
Patrick, the racer issue is a DNE NFS re-export issue, which was not fully tested. Could you create a new ticket to track it (you can assign it to me)? |
| Comment by Lai Siyao [ 12/Jun/14 ] |
|
In parallel-scale-nfs testing two new issues were found, so now there are three patches in total: |
| Comment by Peter Jones [ 20/Jun/14 ] |
|
Landed for 2.6 |
| Comment by Lai Siyao [ 23/Jun/14 ] |
|
parallel-scale-nfsv3 iorssf still failed; it should be the same issue as LU-1639, which is a kernel NFS issue. |
| Comment by Lai Siyao [ 23/Jun/14 ] |
|
All required patches are landed. |
| Comment by Gerrit Updater [ 05/Dec/14 ] |
|
Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/12952 |
| Comment by Gerrit Updater [ 27/Jan/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12952/ |