LU-3544: Writing to new files under NFS export from Lustre will result in ENOENT (SLES11SP2)

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Fix Version/s: Lustre 2.6.0
    • Affects Version/s: Lustre 2.4.0
    • 3
    • 8922

    Description

      After resolving the issues in LU-3484 and LU-3486, testing of the NFS export from Lustre on SLES11SP2 continued, and we hit yet another issue.

      Creating and reading files works fine, but attempting to write to a new file returns ENOENT.
      It's coming from here:

      00000080:00000001:1.0:1372108818.561839:0:12348:0:(file.c:407:ll_intent_file_open()) Process leaving via out (rc=18446744073709551614 : -2 : 0xfffffffffffffffe)

      Here's the line of code:

              if (it_disposition(itp, DISP_LOOKUP_NEG))
                       GOTO(out, rc = -ENOENT);
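
      For context, DISP_LOOKUP_NEG is the disposition bit the MDS sets in its intent reply when the server-side name lookup came back negative, so hitting this branch means the server told the client the name does not exist. Below is a small, self-contained sketch of that check; the names mirror the client code, but the flag value and struct layout here are assumptions for illustration only, not quotes from the tree:

      /* Illustrative sketch only -- DISP_LOOKUP_NEG's value and the struct
       * layout are assumed for this example, not copied from Lustre. */
      #include <stdio.h>

      #define DISP_LOOKUP_NEG 0x00000004   /* value assumed for illustration */

      struct lookup_intent {
              int it_disposition;          /* bits the MDS set in its reply */
      };

      static int it_disposition(const struct lookup_intent *it, int flag)
      {
              return it->it_disposition & flag;
      }

      int main(void)
      {
              /* Pretend the MDS answered "name not found" to the open intent. */
              struct lookup_intent it = { .it_disposition = DISP_LOOKUP_NEG };

              if (it_disposition(&it, DISP_LOOKUP_NEG))
                      printf("negative lookup -> open fails with -ENOENT\n");
              return 0;
      }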
      

      Looking through the DK logs, we noticed something strange. This is the intent open log message for the attempted write:

      00800000:00000002:0.0:1372108818.561125:0:12348:0:(lmv_intent.c:198:lmv_intent_open()) OPEN_INTENT with fid1=[0x2000056c1:0x2:0x0], fid2=[0x2000056c1:0x4:0x0], name='/' -> mds #0

      Except this write is to a file named something like 'myfile' (specifically, "echo 5 > myfile"), yet the intent carries name='/'.
      Comparing this to the normal case, without the NFS export, the file name shows up here as I'd expect.
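
      For reference, the failing operation boils down to an open with O_CREAT on the NFS mount. A minimal reproducer sketch of what "echo 5 > myfile" does, assuming it is run from a directory on the NFS client where the Lustre filesystem is re-exported (the setup is an assumption, not part of this ticket):

      /* Minimal reproducer sketch mirroring the shell redirection "> myfile". */
      #include <errno.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>

      int main(void)
      {
              /* The shell uses O_WRONLY|O_CREAT|O_TRUNC for "> myfile". */
              int fd = open("myfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);

              if (fd < 0) {
                      /* On the affected setup this fails with ENOENT (-2). */
                      fprintf(stderr, "open: %s\n", strerror(errno));
                      return 1;
              }
              if (write(fd, "5\n", 2) < 0)
                      fprintf(stderr, "write: %s\n", strerror(errno));
              close(fd);
              return 0;
      }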


          Activity


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12952/
            Subject: LU-3544 xattr: xattr data may be gone with lock held
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set:
            Commit: fe9ad627b6d83e29039c0c6c0b555aae5f23e9a7


            gerrit Gerrit Updater added a comment -

            Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/12952
            Subject: LU-3544 xattr: xattr data may be gone with lock held
            Project: fs/lustre-release
            Branch: b2_5
            Current Patch Set: 1
            Commit: e057a284afb2020681868f538934a839b2649c99

            laisiyao Lai Siyao added a comment -

            All required patches are landed.

            laisiyao Lai Siyao added a comment -

            parallel-scale-nfsv3 iorssf still failed; it should be the same issue as LU-1639, which is a kernel NFS issue.

            pjones Peter Jones added a comment -

            Landed for 2.6

            laisiyao Lai Siyao added a comment -

            In parallel-scale-nfs testing, two new issues were found, so there are now three patches in total:
            http://review.whamcloud.com/#/c/7476/
            http://review.whamcloud.com/#/c/10692/
            http://review.whamcloud.com/#/c/10693/

            laisiyao Lai Siyao added a comment -

            Patrick, the racer issue is a DNE NFS re-export issue, which was not fully tested. Could you create a new ticket to track it (you can assign it to me)?


            paf Patrick Farrell (Inactive) added a comment -

            From testing of patch set 20:

            During Cray testing of NFS export from a SLES11SP3 client with this patch (client+server), I hit the assertion below on MDS0 (which has MDTs 0 and 1 on it). Ran a modified racer against the NFS export and an unmodified racer on a Lustre 2.5.58 client.

            Dump of MDS0 is up at:
            ftp.whamcloud.com/uploads/LU-3544/LU-3544_140609.tar.gz

            <6>Lustre: Skipped 1 previous similar message
            <3>LustreError: 7816:0:(mdt_reint.c:1519:mdt_reint_migrate_internal()) centssm2-MDT0000: parent [0x400000400:0x1:0x0] is still on the same MDT, which should be migrated first: rc = -1
            <3>LustreError: 7816:0:(mdt_reint.c:1519:mdt_reint_migrate_internal()) Skipped 3 previous similar messages
            <3>LustreError: 7298:0:(mdd_dir.c:3957:mdd_migrate()) centssm2-MDD0000: [0x400000401:0x330f:0x0]8 is already opened count 1: rc = -16
            <0>LustreError: 7816:0:(service.c:193:ptlrpc_save_lock()) ASSERTION( rs->rs_nlocks < 8 ) failed:
            <0>LustreError: 7816:0:(service.c:193:ptlrpc_save_lock()) LBUG
            <4>Pid: 7816, comm: mdt00_007
            <4>
            <4>Call Trace:
            <4> [<ffffffffa0b27895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            <4> [<ffffffffa0b27e97>] lbug_with_loc+0x47/0xb0 [libcfs]
            <4> [<ffffffffa0ebd656>] ptlrpc_save_lock+0xb6/0xf0 [ptlrpc]
            <4> [<ffffffffa15b074b>] mdt_save_lock+0x22b/0x320 [mdt]
            <4> [<ffffffffa15b089c>] mdt_object_unlock+0x5c/0x160 [mdt]
            <4> [<ffffffffa15b2187>] mdt_object_unlock_put+0x17/0x110 [mdt]
            <4> [<ffffffffa15cf18d>] mdt_unlock_list+0x5d/0x1e0 [mdt]
            <4> [<ffffffffa15d1e7c>] mdt_reint_migrate_internal+0x109c/0x1b50 [mdt]
            <4> [<ffffffffa15d6113>] mdt_reint_rename_or_migrate+0x2a3/0x660 [mdt]
            <4> [<ffffffffa0e8abc0>] ? ldlm_blocking_ast+0x0/0x180 [ptlrpc]
            <4> [<ffffffffa0e8c230>] ? ldlm_completion_ast+0x0/0x930 [ptlrpc]
            <4> [<ffffffffa15d64e3>] mdt_reint_migrate+0x13/0x20 [mdt]
            <4> [<ffffffffa15cea81>] mdt_reint_rec+0x41/0xe0 [mdt]
            <4> [<ffffffffa15b3e93>] mdt_reint_internal+0x4c3/0x7c0 [mdt]
            <4> [<ffffffffa15b471b>] mdt_reint+0x6b/0x120 [mdt]
            <4> [<ffffffffa0f182ac>] tgt_request_handle+0x23c/0xac0 [ptlrpc]
            <4> [<ffffffffa0ec7d1a>] ptlrpc_main+0xd1a/0x1980 [ptlrpc]
            <4> [<ffffffff810096f0>] ? __switch_to+0xd0/0x320
            <4> [<ffffffff81528090>] ? thread_return+0x4e/0x76e
            <4> [<ffffffffa0ec7000>] ? ptlrpc_main+0x0/0x1980 [ptlrpc]
            <4> [<ffffffff8109aee6>] kthread+0x96/0xa0
            <4> [<ffffffff8100c20a>] child_rip+0xa/0x20
            <4> [<ffffffff8109ae50>] ? kthread+0x0/0xa0
            <4> [<ffffffff8100c200>] ? child_rip+0x0/0x20
            <4>
            <0>Kernel panic - not syncing: LBUG
            <4>Pid: 7816, comm: mdt00_007 Not tainted 2.6.32.431.5.1.el6_lustre #1

            laisiyao Lai Siyao added a comment -

            I'll update http://review.whamcloud.com/#/c/7476/ soon.

            jlevi Jodi Levi (Inactive) added a comment -

            We need the patches that were reverted here to land in 2.6. We no longer care about 2.1 server compatibility, but when these land we need to ensure they work with 2.4 and 2.5.


            People

              Assignee: laisiyao Lai Siyao
              Reporter: cheng_shao Cheng Shao (Inactive)
              Votes: 0
              Watchers: 14

              Dates

                Created:
                Updated:
                Resolved: