
Stale file handle on mount when mounting Lustre 2.4 via NFS

Details

    • Bug
    • Resolution: Fixed
    • Major
    • Lustre 2.5.0
    • Lustre 2.4.0
    • None
    • 3
    • 8928

    Description

      When attempting to mount NFS-exported Lustre, the mount operation reports 'stale file handle' and fails to complete. This happens with 2.4 servers and a 2.4 client. It does NOT happen with a 2.4 client and 2.2 servers.

      Investigation of the NFS traffic between the NFS client and NFS server (Lustre client) shows the NFS client requesting the file handle for the mount, then receiving a file handle back from the server. There is a bit more chatter, then the client sends back the same file handle as part of an info request. Then the server responds with a stale file handle error.

      This is happening on both CentOS 6.4 and SLES11SP2 clients.

      I'm attaching a series of logs of this issue.
      Here's a description of what's in those logs:
      Lustre MDS (2.4). (Full DK logs provided)
      Lustre Client(2.4)/NFS Server [The source of the NFS export] (Full DK logs & /var/log/messages with nfsd debug on full (0x7FFF))
      NFS Client (/var/log/messages with nfs debug set to 1, and a tcpdump of all traffic)

      For analyzing the tcpdump (if you need it - I suspect the NFS debug logs will make it irrelevant), the IP addresses:
      NFS Server: 172.29.53.155
      NFS Client: 172.29.53.160

      The /var/log/messages logs are not trimmed, sorry. Look for the last debug markers from Lustre in those files and you can line them up with the rest of the logs.

          Activity

            [LU-3550] Stale file handle on mount when mounting Lustre 2.4 via NFS
            dmiter Dmitry Eremin (Inactive) made changes -
            Link New: This issue is related to LU-4057 [ LU-4057 ]
            yong.fan nasf (Inactive) made changes -
            Fix Version/s New: Lustre 2.5.0 [ 10295 ]
            Resolution New: Fixed [ 1 ]
            Status Original: Open [ 1 ] New: Resolved [ 5 ]

            We have submitted a related patch to the kernel maintainer and hope the issue can be resolved at its root. From the Intel side, we cannot do more than wait for the response. If you have things working, we can close this ticket and reopen it in the future if needed.

            yong.fan nasf (Inactive) added a comment

            I'm not sure what the long-term plan is regarding this bug. The fundamental limitation isn't in Lustre, and we've got an acceptable workaround in setting the FSID manually.

            Is further work planned on the Intel side, or should this bug be closed? Cray is getting along fine with the workaround.

            paf Patrick Farrell (Inactive) added a comment

            Hi Patrick,

            I downloaded the nfs-utils-1.2.3 source, then patched, compiled, and tested it on RHEL6 (2.6.32-358.6.1.el6). The proc changes were not a concern.

            yong.fan nasf (Inactive) added a comment

            nasf,

            Has WC tested the latest nfs-utils with CentOS 6.4? I thought I saw a proc interface change between the CentOS 6.4 kernel and the kernels targeted by 1.2.8, but I could be wrong about that.

            • Patrick
            paf Patrick Farrell (Inactive) added a comment

            The patch above is for the latest nfs-utils. If you want to use nfs-utils-1.2.3, use the following one instead:

            343,344c343,344
            < 	uint64_t inode=0;
            < 	uint64_t inode64;
            ---
            > 	unsigned int inode=0;
            > 	unsigned long long inode64;
            
            yong.fan nasf (Inactive) added a comment - edited

            nasf,

            I've been trying to build nfs-utils 1.2.3 [default in CentOS 6.4] (without patches, just to verify I can) and I am stuck in dependency hell, with the build not finding various installed packages. A bit of searching shows that nfs-utils has since been patched to clean up a lot of unnecessary dependencies, including the ones I'm dealing with.
            (http://www.spinics.net/lists/linux-nfs/msg26388.html)

            However, as I understand it, the kernel nfsd /proc interface has changed since CentOS 6.4 and SLES11SP2, so I can't just go grab the latest nfs-utils and expect it to work.

            Do you have a particular version you recommend building, or any tips on this?

            I may be able to land that linking patch by itself without problem and will try that next, but I thought I'd ask you as well.

            • Patrick
            paf Patrick Farrell (Inactive) added a comment

            There are two issues for this topic:

            1) Originally, Lustre did not return the FSID via statfs() to nfs-utils. This issue has been resolved by the patch http://review.whamcloud.com/6493, which has already landed on master (Lustre 2.5).

            2) nfs-utils has a defect: it converts the 64-bit inode number to 32 bits, which loses information, so it cannot locate the right root. It can be resolved by the patch:

            diff --git a/utils/mountd/cache.c b/utils/mountd/cache.c
            index 517aa62..a7212e7 100644
            --- a/utils/mountd/cache.c
            +++ b/utils/mountd/cache.c
            @@ -388,7 +388,7 @@ struct parsed_fsid {
                    int fsidtype;
                    /* We could use a union for this, but it would be more
                     * complicated; why bother? */
            -       unsigned int inode;
            +       uint64_t inode;
                    unsigned int minor;
                    unsigned int major;
                    unsigned int fsidnum;
            -- 
            1.7.1
            

            If you have a chance, please test the two patches above together for verification. Thanks!

            yong.fan nasf (Inactive) added a comment
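
The two issues described in the comment above can be illustrated with a small C sketch. It is not taken from nfs-utils or Lustre: the helper names `read_fsid`, `narrow_ino`, and `ino_survives_narrowing` are hypothetical, and the sketch assumes Linux, where `statfs(2)` is declared in `<sys/vfs.h>` and `unsigned int` is 32 bits wide. The first helper reads the FSID that a filesystem reports through `statfs()`; the other two show how storing a 64-bit inode number in an `unsigned int` field discards the high bits, so a later comparison against the real inode number no longer matches.

```c
#include <stdint.h>
#include <string.h>
#include <sys/vfs.h>   /* statfs(2) on Linux */

/* Issue 1 (illustrative): read the filesystem ID that statfs() reports
 * for a path. Returns 0 on success and -1 on failure. A result of
 * {0, 0} means the filesystem did not supply an FSID. */
int read_fsid(const char *path, uint32_t fsid[2])
{
    struct statfs st;
    size_t n = sizeof(st.f_fsid) < 2 * sizeof(uint32_t)
                   ? sizeof(st.f_fsid)
                   : 2 * sizeof(uint32_t);

    if (statfs(path, &st) != 0)
        return -1;

    /* The f_fsid type is opaque, so copy its bytes instead of naming
     * the C library's private member. */
    memset(fsid, 0, 2 * sizeof(uint32_t));
    memcpy(fsid, &st.f_fsid, n);
    return 0;
}

/* Issue 2 (illustrative): what happens when a 64-bit inode number is
 * stored in an unsigned int field, as in the pre-patch struct. */
unsigned int narrow_ino(uint64_t ino)
{
    return (unsigned int)ino;   /* high 32 bits are discarded */
}

/* Returns 1 if the inode number is unchanged after narrowing, else 0. */
int ino_survives_narrowing(uint64_t ino)
{
    return (uint64_t)narrow_ino(ino) == ino;
}
```

Calling `ino_survives_narrowing()` with a small inode number returns 1, while any value at or above 2^32 (for example the arbitrary value `0x200000007ULL`) returns 0, which is the kind of mismatch the `uint64_t` change in the patch is meant to avoid.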

            I've discussed this internally at Cray with someone with NFS expertise.

            He agrees that this workaround (using the -o fsid= option to exportfs when exporting Lustre over NFS) is the appropriate solution, as the only other option is a fairly invasive patch to the NFS code in the Linux kernel. In light of that, WC may want to update the documentation for exporting Lustre over NFS, but no code changes are necessary.

            paf Patrick Farrell (Inactive) added a comment
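
For readers applying the workaround described in the comment above, a minimal sketch of an export with a manually assigned FSID might look like the fragment below. The mount point `/mnt/lustre`, the client network (a /24 guessed from the addresses in the description), and the value `101` are all assumptions for illustration; `fsid=` is the export option documented in exports(5), and the value only needs to be unique among the exports on that NFS server.

```
# /etc/exports entry (hypothetical path, client range, and fsid value)
/mnt/lustre  172.29.53.0/24(rw,fsid=101)

# Equivalent one-off export from the command line, same assumptions:
#   exportfs -o rw,fsid=101 172.29.53.0/24:/mnt/lustre
```

With the FSID fixed in the export options, the NFS server no longer has to derive a filesystem identifier from the device or inode information, which is where the failure described in this ticket arises.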

            People

              yong.fan nasf (Inactive)
              paf Patrick Farrell (Inactive)