[LU-3544] Writing to new files under NFS export from Lustre will result in ENOENT (SLES11SP2) Created: 01/Jul/13  Updated: 27/Jan/15  Resolved: 20/Jun/14

Status: Closed
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.6.0

Type: Bug Priority: Blocker
Reporter: Cheng Shao (Inactive) Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: patch

Issue Links:
Duplicate
is duplicated by LU-3765 2.5.0<->2.1.5 interop: sanity test 24... Resolved
Related
is related to LU-5109 Failure on test suite parallel-scale-... Resolved
is related to LU-5177 object leak in mdt_open_by_fid_lock() Resolved
Severity: 3
Rank (Obsolete): 8922

 Description   

After resolving the issues in LU-3484 and LU-3486, testing of NFS export of Lustre on SLES11SP2 continued, and we hit yet another issue.

Creating and reading files work fine, but attempting to write to a new file results in ENOENT.
It's coming from here:

00000080:00000001:1.0:1372108818.561839:0:12348:0:(file.c:407:ll_intent_file_open()) Process leaving via out (rc=18446744073709551614 : -2 : 0xfffffffffffffffe)

Here's the line of code:

        if (it_disposition(itp, DISP_LOOKUP_NEG))
                 GOTO(out, rc = -ENOENT);

Looking through the DK logs, we noticed something strange. This is the intent open log message for the attempted write:

00800000:00000002:0.0:1372108818.561125:0:12348:0:(lmv_intent.c:198:lmv_intent_open()) OPEN_INTENT with fid1=[0x2000056c1:0x2:0x0], fid2=[0x2000056c1:0x4:0x0], name='/' -> mds #0

Except this write is to a file named something like 'myfile'. Specifically, "echo 5 > myfile".
Comparing this to the normal case, without NFS export, we see the file's name here as I'd expect.



 Comments   
Comment by Cheng Shao (Inactive) [ 01/Jul/13 ]

This was caused by a patch introduced in SLES11SP2, the same root cause as in LU-3484 and LU-3486, wherein the anonymous dentry has its name field set to '/' instead of an empty string.

Patch has been verified to resolve the issue. Will submit for review shortly.
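
For context, the SLES11SP2 quirk can be recognized on the client side with a check along the following lines. This is only an illustrative sketch, not the code in the submitted patch, and the is_anon_dentry() helper name is hypothetical; the point is simply that a disconnected (NFS file handle) dentry carries the name '/' on SLES11SP2 instead of the empty name mainline kernels use, so it must be treated as having no usable name when packing the intent open:

        /* Illustrative sketch only; is_anon_dentry() is a hypothetical helper.
         * On SLES11SP2 a disconnected (NFS export) dentry is named "/", while
         * mainline kernels leave the name empty. Either way the dentry has no
         * usable name and the open must go purely by FID. */
        static bool is_anon_dentry(struct dentry *de)
        {
                return de->d_name.len == 0 ||
                       (de->d_name.len == 1 && de->d_name.name[0] == '/');
        }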

Comment by Cheng Shao (Inactive) [ 08/Jul/13 ]

Patch is up: http://review.whamcloud.com/#/c/6920/

Comment by Patrick Farrell (Inactive) [ 23/Jul/13 ]

This patch causes an interoperability issue between 2.1 servers and 2.4 clients.
In particular, the following assertion is triggered during some (fairly heavy) testing at Cray.

LustreError: 29927:0:(mdt_handler.c:226:mdt_lock_pdo_init()) ASSERTION( namelen > 0 ) failed.

Looking at the 2.4 and 2.1 code (using the WC/Intel b2_1 source) around this assertion, it seems clear why it works with 2.4.

Here's 2.1:

if (name != NULL) {
        LASSERT(namelen > 0);
        lh->mlh_pdo_hash = full_name_hash(name, namelen);
} else {
        LASSERT(namelen == 0);
        lh->mlh_pdo_hash = 0ull;
}



Here's 2.4. The important part is the change to the 'if' condition. Ignore the 'Workaround for LU-2856' comment - That's unrelated.


if (name != NULL && (name[0] != '\0')) {
        LASSERT(namelen > 0);
        lh->mlh_pdo_hash = full_name_hash(name, namelen);
        /* XXX Workaround for LU-2856
         * Zero is a valid return value of full_name_hash, but several
         * users of mlh_pdo_hash assume a non-zero hash value. We
         * therefore map zero onto an arbitrary, but consistent
         * value (1) to avoid problems further down the road. */
        if (unlikely(!lh->mlh_pdo_hash))
                lh->mlh_pdo_hash = 1;
} else {
        LASSERT(namelen == 0);
        lh->mlh_pdo_hash = 0ull;
}

It seems the 2.1 patch should be as simple as replacing 'if (name != NULL)' with 'if (name != NULL && (name[0] != '\0'))'. Apparently when the name is not provided by the client, name is not NULL but points to an empty string with namelen == 0, which the 2.4 condition handles and the 2.1 condition does not.

I will post a patch to b2_1 with this change.
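
For clarity, the proposed b2_1 hunk would then look roughly as follows; this is just the 2.1 code quoted above with the 2.4 condition substituted, not the reviewed patch itself:

        if (name != NULL && name[0] != '\0') {
                LASSERT(namelen > 0);
                lh->mlh_pdo_hash = full_name_hash(name, namelen);
        } else {
                /* No usable name (NULL or empty string): no pdo hash. */
                LASSERT(namelen == 0);
                lh->mlh_pdo_hash = 0ull;
        }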

Comment by Bob Glossman (Inactive) [ 23/Jul/13 ]

Patrick,
I can see how your proposed patch in b2_1 may fix the interop problem you describe, but are you sure it has no bad side effects when both client and server are 2.1?

Comment by Patrick Farrell (Inactive) [ 23/Jul/13 ]

Bob,

I'm not entirely certain it doesn't. Cray doesn't run 2.1 clients anywhere, so I'm not well placed to test. However, it doesn't seem like it would. If the first character of a name is null, then it makes sense that the hash would be zero, which is what the change does.

Separately, Cheng just pointed out that 2.1 does not support the MDS_OPEN_BY_FID flag.

It appears this flag is only used in ll_intent_file_open. That's normally used when exporting via NFS and occasionally otherwise.
However, I was able to export NFS from a 2.1 server, so I'm a bit puzzled as to why that worked at all. I'll be doing more testing this afternoon and will come back with more data on that.

Comment by Patrick Farrell (Inactive) [ 23/Jul/13 ]

I'll need to take a more thorough look at the code to see why, but I wanted to report the results of the testing today.

I've confirmed that despite not supporting the MDS_OPEN_BY_FID flag which is being set by the Lustre client, 2.1 servers are handling NFS exported Lustre just fine.

Comment by Lai Siyao [ 17/Aug/13 ]

LU-3765 tracks this interop issue, and I've provided a patch which will fix the issue on the 2.4 code.

Comment by Patrick Farrell (Inactive) [ 22/Aug/13 ]

This is related to the current discussion on LU-3765. Note this is currently only for 2.4 servers, but if we want to be interoperable with 2.1 servers by sending the name and length rather than removing the assertion server side, we need to understand why sending the name is negatively affecting 2.4 servers when MDS_OPEN_BY_FID is set. (Just to be clear for anyone not already familiar: With 2.4 servers, when MDS_OPEN_BY_FID is set and the name is sent, ENOENT can result incorrectly. When the name is not sent, this doesn't happen.)

If we can fix that, then we can go back to sending name and namelen.

I'm trying to understand precisely why sending the name to a server (i.e., 2.4) which supports MDS_OPEN_BY_FID leads to the ENOENT. I think it's coming from somewhere in mdt_reint_open (and/or the call to mdt_open_by_fid_lock in mdt_reint_open). (Specifically, the client will return ENOENT if DISP_LOOKUP_NEG is set by the server.)

Looking at this particular snippet of code, it seems like the only way DISP_LOOKUP_NEG would get set is if mdt_open_by_fid_lock returns ENOENT (in that case, it's set in open_by_fid_lock). But I don't see how that could happen when the dentry name is not null, and not happen when it is.

Thoughts?

} else if ((rr->rr_namelen == 0 && create_flags & MDS_OPEN_LOCK) ||
           (create_flags & MDS_OPEN_BY_FID)) {
        result = mdt_open_by_fid_lock(info, ldlm_rep, lhc);
        /* If result is 0 then open by FID has found the file
         * and there is nothing left for us to do here. More
         * generally if it is anything other than -ENOENT or
         * -EREMOTE then we return that now. If -ENOENT and
         * MDS_OPEN_CREAT is set then we must create the file
         * below. If -EREMOTE then we need to return a LOOKUP
         * lock to the client, which we do below. Hence this
         * odd looking condition. See LU-2523. */
        if (!(result == -ENOENT && (create_flags & MDS_OPEN_CREAT)) &&
            result != -EREMOTE)
                GOTO(out, result);
        if (unlikely(rr->rr_namelen == 0))
                GOTO(out, result = -EINVAL);
        CDEBUG(D_INFO, "No object(2), continue as regular open.\n");
}

Comment by Lai Siyao [ 23/Aug/13 ]

Open-by-fid was not clearly defined in the Lustre code when lu_fid was introduced, and before lu_fid there was no globally unique identifier for a file, so many operations still fall back to the name after the fid lookup fails; this is technical debt. I'm planning to clean up this code, after which a client request will pack either the name or the fid (if available), and the server will likewise use only the fid or the name. This change is fairly large and will take more time.

Comment by Oleg Drokin [ 01/Oct/13 ]

The patch in this ticket was reverted from master due to the interop problems described in LU-3765.

Comment by Patrick Farrell (Inactive) [ 09/Oct/13 ]

Lai, Peng,

I understand that the eventual goal here is to pay down the technical debt by cleaning up the MDS_OPEN_BY_FID code.

However:
If we add support for MDS_OPEN_BY_FID to 2.1 servers, is there any reason we couldn't go back to not sending the name in ll_intent_file_open (i.e., the LU-3544 patch that was reverted)? As far as I can see, that would work until the op_by_fid cleanup is completed.
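
For context, the reverted change boils down to something like the following in ll_intent_file_open(): when the open goes by FID, pack no name at all instead of the dentry name. This is a sketch of the idea only, not the exact diff of http://review.whamcloud.com/#/c/6920/, and it assumes the usual ll_prep_md_op_data() helper and its argument order:

        /* Sketch of the idea behind the reverted LU-3544 change (assumed
         * ll_prep_md_op_data() signature): for an open-by-FID, e.g. via an
         * anonymous NFS dentry, pack no name so the MDS never attempts a
         * name-based lookup that could yield a spurious -ENOENT. */
        op_data = ll_prep_md_op_data(NULL, de->d_parent->d_inode, de->d_inode,
                                     NULL, 0, O_RDWR, LUSTRE_OPC_ANY, NULL);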

Comment by Lai Siyao [ 10/Oct/13 ]

Patrick, yes, the reverted patch would partly work; it was reverted because the interop test was broken, and we want that test to keep passing so that a new commit cannot break it silently. I'll include the change http://review.whamcloud.com/#/c/6920/ in http://review.whamcloud.com/#/c/7476/ for LU-3765.

Comment by John Hammond [ 15/Oct/13 ]

In mdt_reint_open() we set DISP_LOOKUP_NEG even when MDS_OPEN_CREAT is used, the create is successful, and 0 is returned.

        if (result == -ENOENT || result == -ESTALE) {
                mdt_set_disposition(info, ldlm_rep, DISP_LOOKUP_NEG);
                if (result == -ESTALE) {
                        /*
                         * -ESTALE means the parent is a dead(unlinked) dir, so
                         * it should return -ENOENT to in accordance with the
                         * original mds implementaion.
                         */
                        GOTO(out_parent, result = -ENOENT);
                }
                if (!(create_flags & MDS_OPEN_CREAT))
                        GOTO(out_parent, result);
                if (exp_connect_flags(req->rq_export) & OBD_CONNECT_RDONLY)
                        GOTO(out_parent, result = -EROFS);
                *child_fid = *info->mti_rr.rr_fid2;
                LASSERTF(fid_is_sane(child_fid), "fid="DFID"\n",
                         PFID(child_fid));
                /* In the function below, .hs_keycmp resolves to
                 * lu_obj_hop_keycmp() */
                /* coverity[overrun-buffer-val] */
                child = mdt_object_new(info->mti_env, mdt, child_fid);

Perhaps this issue could be addressed by only returning DISP_LOOKUP_NEG on no create:

        if (result == -ENOENT || result == -ESTALE) {
                if (result == -ESTALE) {
                        /*
                         * -ESTALE means the parent is a dead(unlinked) dir, so
                         * it should return -ENOENT to in accordance with the
                         * original mds implementaion.
                         */
                        mdt_set_disposition(info, ldlm_rep, DISP_LOOKUP_NEG);
                        GOTO(out_parent, result = -ENOENT);
                }

                if (!(create_flags & MDS_OPEN_CREAT)) {
                        mdt_set_disposition(info, ldlm_rep, DISP_LOOKUP_NEG);
                        GOTO(out_parent, result);
                }

                if (exp_connect_flags(req->rq_export) & OBD_CONNECT_RDONLY)
                        GOTO(out_parent, result = -EROFS);

                *child_fid = *info->mti_rr.rr_fid2;
                LASSERTF(fid_is_sane(child_fid), "fid="DFID"\n",
                         PFID(child_fid));
                /* In the function below, .hs_keycmp resolves to
                 * lu_obj_hop_keycmp() */
                /* coverity[overrun-buffer-val] */
                child = mdt_object_new(info->mti_env, mdt, child_fid);

Comment by John Hammond [ 15/Oct/13 ]

But I think this would require the clients to change the way they check for stale FID/name pairs. In mdc_finish_intent_lock(), if M_CHECK_STALE is set then we compare the returned FID to both op_fid2 and op_fid3. If O_CREAT was set and op_fid3 was stale (because the old file was unlinked) then the logic is a bit off, since the old FID (op_fid3) is stale but the returned FID is equal to op_fid2. So the above proposal might fix ll_intent_file_open() at the expense of "trying to change FID" LBUGs in ll_update_inode() from ll_revalidate_it(). Bah!

Also if you're trying to trace through this then please be aware that mdc_finish_intent_lock() prints op_fid2 twice when it detects a stale FID:

                if ((!lu_fid_eq(&op_data->op_fid2, &mdt_body->fid1)) &&
                    (!lu_fid_eq(&op_data->op_fid3, &mdt_body->fid1))) {
                        CDEBUG(D_DENTRY, "Found stale data "DFID"("DFID")/"DFID
                               "\n", PFID(&op_data->op_fid2),
                               PFID(&op_data->op_fid2), PFID(&mdt_body->fid1));
                        RETURN(-ESTALE);
                }

(I found that a bit confusing at first.)

Comment by Lai Siyao [ 16/Oct/13 ]

Yes, it's tricky for open by fid when create is set, because of the "trying to change FID" LBUG: if the file with the specified fid doesn't exist, we could only create it with the old fid (exclusively), but that is also strange, because it doesn't conform to the fid design and may confuse user-space backup tools. So in http://review.whamcloud.com/#/c/7476/, if the open is by fid, it won't create upon -ENOENT, and per Fan's suggestion the client should return -ESTALE so as not to confuse user-space applications. This may not conform to POSIX, but it's the best we can do; the root cause is that revalidate and file open are not atomic.

Comment by John Hammond [ 16/Oct/13 ]

Returning -ENOENT or -ESTALE on open(..., O_CREAT) is not acceptable. We shouldn't add this breakage to fix NFS export.

Comment by Lai Siyao [ 16/Oct/13 ]

The current code can't handle this either, IMHO: either it will return an error in the M_CHECK_STALE check if the fid differs from the known ones, or it will LBUG in ll_update_inode().

Comment by John Hammond [ 16/Oct/13 ]

As long as the parent exists, the current code does not return -ENOENT or -ESTALE on opens with O_CREAT set.

Comment by Patrick Farrell (Inactive) [ 16/Oct/13 ]

John,

With what you're saying here: "In mdt_reint_open() we set DISP_LOOKUP_NEG even when MDS_OPEN_CREAT is used, the create is successful, and 0 is returned."
It sounds like we would expect to see ENOENT on successful O_CREAT requests (when MDS_OPEN_BY_FID is set).

In other words, we'd expect creating new files in an NFS export from a 2.4 or greater client backed by a 2.4 or greater server to fail with ENOENT.

Is that correct? That's not what we observe. From reading the code, it seems like that would happen, but we are able to do NFS exports of 2.4 servers from 2.4 clients. I may have missed something in what you're saying. (And I've definitely missed something in the code, since it works fine.)

Comment by Lai Siyao [ 17/Oct/13 ]

John, could you check ll_intent_file_open -> mdc_intent_lock -> mdc_finish_intent_lock: if the file was unlinked and recreated by others (with a different fid), M_CHECK_STALE will return -ESTALE.

Comment by John Hammond [ 17/Oct/13 ]

Lai, I've read the same code as you, but I don't think it happens. Can you give me an example where open(..., O_CREAT) returns -ESTALE or -ENOENT?

Comment by Lai Siyao [ 17/Oct/13 ]

1. client1 revalidated (IT_OPEN|IT_CREATE) file1 and found an open lock cached locally, so it returned successfully.
2. client2 unlinked file1, which also cancelled the open lock on client1.
3. client2 recreated file1 (with a different fid).
4. client1 opened file1; because it was not opened in revalidate, it called ll_intent_file_open().
5. The MDS found file1 (by name), opened it and replied.
6. client1 found that the fid in the reply differed from both the old and the newly allocated fid, so -ESTALE was returned.

Comment by John Hammond [ 17/Oct/13 ]

Can you create this situation and demonstrate that open(..., O_CREAT) returns -ESTALE or -ENOENT without some trick like deleting the parent directory? With the current master and a RHEL 6.5 (2.6.32-358.18.1.el6.lustre.x86_64) kernel I can't make it happen. But maybe things are different on a newer kernel with atomic open support.

Comment by Lai Siyao [ 18/Oct/13 ]

I'll do some testing to see the result. By the way, could you explain why the above scenario won't happen?

Comment by Lai Siyao [ 18/Oct/13 ]

John, you're right, this won't happen with the current code: once O_CREAT is set, revalidate ignores the open lock and opens on the MDS anyway, so at file open it can always find the open request in the intent. But doesn't this mean that ll_intent_file_open() will never create a file?

Comment by Alexey Lyashkov [ 18/Dec/13 ]

This bug is the result of a regression introduced by LU-220 and the "fix" from LU-2523.
I don't think we need to return ESTALE there (though it is good for preventing a panic from updating a different fid on the same object).

Any object stays live on the filesystem after unlink until the last open is closed. The problem now is that the MDS sets the DEAD_OBJ flag on an unlinked object, and mdd_cd_sanity_check sees that flag and returns ENOENT for the orphan object until it is closed.
I think we could revert the LU-2523 change and replace it with an additional check for ORPHAN_OBJ to allow operations on orphan objects; such an object can only be found by fid, not by name, so only a very restricted set of operations would be allowed.
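
A rough sketch of the shape of that check follows. The helper names are placeholders (mdd_object_is_dead()/mdd_object_is_orphan() stand in for whatever predicates test those flags), so this only illustrates the idea of permitting fid-based access to an unlinked-but-still-open object instead of returning ENOENT unconditionally:

        /* Illustrative sketch only; the predicates below are placeholders for
         * whatever checks the DEAD_OBJ/ORPHAN_OBJ flags. The idea is to let an
         * open-by-fid reach an orphan that is still held open. */
        static int cd_sanity_check_sketch(struct mdd_object *obj, bool by_fid)
        {
                if (mdd_object_is_dead(obj)) {
                        if (by_fid && mdd_object_is_orphan(obj))
                                return 0;       /* orphan reachable by fid */
                        return -ENOENT;
                }
                return 0;
        }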

Comment by John Hammond [ 19/Dec/13 ]

Does open with O_CREAT always fail on NFS exports of Lustre on SLES11SP2? 3.11 also uses '/' as the name of disconnected dentries from NFS file handle conversion, and I cannot reproduce this issue using Lustre master plus the patches for 3.11, on a 3.11 kernel.

Alexey, I agree about handling of open unlinked objects (and I look forward to your patches). But I also think my "fix" for LU-2523 is correct. Please explain what you dislike about it.

Comment by Jodi Levi (Inactive) [ 08/May/14 ]

We need the patches that were reverted here to land in 2.6. We no longer care about 2.1 server compatibility, but when these land we need to ensure they work with 2.4 and 2.5.

Comment by Lai Siyao [ 09/May/14 ]

I'll update http://review.whamcloud.com/#/c/7476/ soon.

Comment by Patrick Farrell (Inactive) [ 09/Jun/14 ]

From testing of patch set 20:

During Cray testing of NFS export from a SLES11SP3 client with this patch (client+server), I hit the assertion below on MDS0 (which has MDTs 0 and 1 on it). Ran a modified racer against the NFS export and an unmodified racer on a Lustre 2.5.58 client.

Dump of MDS0 is up at:
ftp.whamcloud.com/uploads/LU-3544/LU-3544_140609.tar.gz

<6>Lustre: Skipped 1 previous similar message
<3>LustreError: 7816:0:(mdt_reint.c:1519:mdt_reint_migrate_internal()) centssm2-MDT0000: parent [0x400000400:0x1:0x0] is still on the same MDT, which should be migrated first: rc = -1
<3>LustreError: 7816:0:(mdt_reint.c:1519:mdt_reint_migrate_internal()) Skipped 3 previous similar messages
<3>LustreError: 7298:0:(mdd_dir.c:3957:mdd_migrate()) centssm2-MDD0000: [0x400000401:0x330f:0x0]8 is already opened count 1: rc = -16
<0>LustreError: 7816:0:(service.c:193:ptlrpc_save_lock()) ASSERTION( rs->rs_nlocks < 8 ) failed:
<0>LustreError: 7816:0:(service.c:193:ptlrpc_save_lock()) LBUG
<4>Pid: 7816, comm: mdt00_007
<4>
<4>Call Trace:
<4> [<ffffffffa0b27895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
<4> [<ffffffffa0b27e97>] lbug_with_loc+0x47/0xb0 [libcfs]
<4> [<ffffffffa0ebd656>] ptlrpc_save_lock+0xb6/0xf0 [ptlrpc]
<4> [<ffffffffa15b074b>] mdt_save_lock+0x22b/0x320 [mdt]
<4> [<ffffffffa15b089c>] mdt_object_unlock+0x5c/0x160 [mdt]
<4> [<ffffffffa15b2187>] mdt_object_unlock_put+0x17/0x110 [mdt]
<4> [<ffffffffa15cf18d>] mdt_unlock_list+0x5d/0x1e0 [mdt]
<4> [<ffffffffa15d1e7c>] mdt_reint_migrate_internal+0x109c/0x1b50 [mdt]
<4> [<ffffffffa15d6113>] mdt_reint_rename_or_migrate+0x2a3/0x660 [mdt]
<4> [<ffffffffa0e8abc0>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
<4> [<ffffffffa0e8c230>] ? ldlm_completion_ast+0x0/0x930 [ptlrpc]
<4> [<ffffffffa15d64e3>] mdt_reint_migrate+0x13/0x20 [mdt]
<4> [<ffffffffa15cea81>] mdt_reint_rec+0x41/0xe0 [mdt]
<4> [<ffffffffa15b3e93>] mdt_reint_internal+0x4c3/0x7c0 [mdt]
<4> [<ffffffffa15b471b>] mdt_reint+0x6b/0x120 [mdt]
<4> [<ffffffffa0f182ac>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
<4> [<ffffffffa0ec7d1a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
<4> [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
<4> [<ffffffff81528090>] ? thread_return+0x4e/0x76e
<4> [<ffffffffa0ec7000>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
<4> [<ffffffff8109aee6>] kthread+0x96/0xa0
<4> [<ffffffff8100c20a>] child_rip+0xa/0x20
<4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
<4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
<4>
<0>Kernel panic - not syncing: LBUG
<4>Pid: 7816, comm: mdt00_007 Not tainted 2.6.32.431.5.1.el6_lustre #1

Comment by Lai Siyao [ 12/Jun/14 ]

Patrick, the racer issue is a DNE NFS re-export issue, which was not fully tested. Could you create a new ticket to track it (you can assign it to me)?

Comment by Lai Siyao [ 12/Jun/14 ]

In parallel-scale-nfs testing, two new issues were found, so there are now three patches in total:
http://review.whamcloud.com/#/c/7476/
http://review.whamcloud.com/#/c/10692/
http://review.whamcloud.com/#/c/10693/

Comment by Peter Jones [ 20/Jun/14 ]

Landed for 2.6

Comment by Lai Siyao [ 23/Jun/14 ]

The parallel-scale-nfsv3 iorssf test still failed; it should be the same issue as LU-1639, which is a kernel NFS issue.

Comment by Lai Siyao [ 23/Jun/14 ]

All required patches are landed.

Comment by Gerrit Updater [ 05/Dec/14 ]

Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/12952
Subject: LU-3544 xattr: xattr data may be gone with lock held
Project: fs/lustre-release
Branch: b2_5
Current Patch Set: 1
Commit: e057a284afb2020681868f538934a839b2649c99

Comment by Gerrit Updater [ 27/Jan/15 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12952/
Subject: LU-3544 xattr: xattr data may be gone with lock held
Project: fs/lustre-release
Branch: b2_5
Current Patch Set:
Commit: fe9ad627b6d83e29039c0c6c0b555aae5f23e9a7
