
parallel-scale-nfsv3: FAIL: setup nfs failed!


    Description

      The parallel-scale-nfsv3 test failed as follows:

      Mounting NFS clients (version 3)...
      CMD: client-12vm1,client-12vm2 mkdir -p /mnt/lustre
      CMD: client-12vm1,client-12vm2 mount -t nfs -o nfsvers=3,async                 client-12vm3:/mnt/lustre /mnt/lustre
      client-12vm2: mount.nfs: Connection timed out
      client-12vm1: mount.nfs: Connection timed out
       parallel-scale-nfsv3 : @@@@@@ FAIL: setup nfs failed! 
      

      The syslog on the Lustre MDS / Lustre client / NFS server node client-12vm3 showed:

      Mar  4 17:34:15 client-12vm3 mrshd[4254]: root@client-12vm1.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre"  sh -c "exportfs -o rw,async,no_root_squash *:/mnt/lustre         && exportfs -v");echo XXRETCODE:$?'
      Mar  4 17:34:15 client-12vm3 xinetd[1640]: EXIT: mshell status=0 pid=4253 duration=0(sec)
      Mar  4 17:34:16 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:894 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:16 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:713 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:17 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:784 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:17 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:877 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:19 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:946 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:19 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:1013 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:23 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:797 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:23 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:701 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:31 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:719 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:31 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:941 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:41 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:943 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:41 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:810 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:51 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:849 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:51 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:740 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:35:01 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:846 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:35:01 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:667 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:35:11 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:955 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:35:11 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:1006 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:35:21 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:828 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:35:21 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:739 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:35:31 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:1011 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:35:31 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:994 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:35:41 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:847 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:35:41 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:756 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:35:51 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:892 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:35:51 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:749 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:36:01 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:1017 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:36:01 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:873 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:36:11 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:874 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:36:11 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:749 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:36:21 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:916 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:36:21 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:841 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:36:21 client-12vm3 xinetd[1640]: START: mshell pid=4286 from=::ffff:10.10.4.206
      Mar  4 17:36:21 client-12vm3 mrshd[4287]: root@client-12vm1.lab.whamcloud.com as root: cmd='/usr/sbin/lctl mark "/usr/sbin/lctl mark  parallel-scale-nfsv3 : @@@@@@ FAIL: setup nfs failed! ";echo XXRETCODE:$?'
      Mar  4 17:36:21 client-12vm3 kernel: Lustre: DEBUG MARKER: /usr/sbin/lctl mark  parallel-scale-nfsv3 : @@@@@@ FAIL: setup nfs failed!
      

      Maloo report: https://maloo.whamcloud.com/test_sets/5cbf6978-853e-11e2-bfd3-52540035b04c



          Activity

            [LU-2904] parallel-scale-nfsv3: FAIL: setup nfs failed!
            yujian Jian Yu added a comment - edited

            Patch for the Lustre b2_1 branch to add the "32bitapi" Lustre client mount option when exporting the Lustre client as an NFSv3 server: http://review.whamcloud.com/6457
            Patch for the Lustre b1_8 branch: http://review.whamcloud.com/6663
            Patch for the Lustre master branch: http://review.whamcloud.com/6649
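
            As an illustration (not taken from the patches themselves), the effect of these changes on the re-exporting node amounts to roughly the following; the MGS NID and fsname are placeholders:

            # Mount the Lustre client with 32-bit inode numbers so that
            # mountd/nfsd can encode the export root in an NFSv3 file handle,
            # then export it as the test does.
            mount -t lustre -o 32bitapi mgsnode@tcp:/lustre /mnt/lustre
            exportfs -o rw,async,no_root_squash '*:/mnt/lustre'
            exportfs -v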

            yujian Jian Yu added a comment -

            Lustre b2_1 client build: http://build.whamcloud.com/job/lustre-b2_1/204
            Lustre master server build: http://build.whamcloud.com/job/lustre-master/1508
            Distro/Arch: RHEL6.4/x86_64

            The issue still occurred: https://maloo.whamcloud.com/test_sets/b5a0c146-c624-11e2-9bf1-52540035b04c

            CMD: client-26vm3 exportfs -o rw,async,no_root_squash *:/mnt/lustre         && exportfs -v
            /mnt/lustre   	<world>(rw,async,wdelay,no_root_squash,no_subtree_check)
            
            Mounting NFS clients (version 3)...
            CMD: client-26vm5,client-26vm6.lab.whamcloud.com mkdir -p /mnt/lustre
            CMD: client-26vm5,client-26vm6.lab.whamcloud.com mount -t nfs -o nfsvers=3,async                 client-26vm3:/mnt/lustre /mnt/lustre
            client-26vm6: mount.nfs: Connection timed out
            client-26vm5: mount.nfs: Connection timed out
             parallel-scale-nfsv3 : @@@@@@ FAIL: setup nfs failed! 
            

            Somehow, this issue can be resolved by specifying "fsid=1" (without the "32bitapi" Lustre mount option) when re-exporting Lustre via NFS (v3 or v4). For example: "/mnt/lustre 10.211.55.*(rw,no_root_squash,fsid=1)". (Verified on 2.6.32-358.2.1.el6)

            We need a patch on the Lustre b2_1 branch to resolve the interoperability issue.
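
            For reference, a minimal sketch of the fsid=1 workaround described above, using the same hosts and export path (the exact exports entry is illustrative):

            # On the Lustre client that re-exports the filesystem (client-26vm3):
            # pin the export fsid so nfsd does not derive it from the Lustre
            # device number.
            echo '/mnt/lustre *(rw,async,no_root_squash,fsid=1)' >> /etc/exports
            exportfs -ra
            exportfs -v

            # On the NFS clients:
            mount -t nfs -o nfsvers=3,async client-26vm3:/mnt/lustre /mnt/lustre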


            adilger Andreas Dilger added a comment -

            Yes, we already do this in the client:

            static int client_common_fill_super(...)
            {
                    :
                    :
                    /* We set sb->s_dev equal on all lustre clients in order to support
                     * NFS export clustering.  NFSD requires that the FSID be the same
                     * on all clients. */
                    /* s_dev is also used in lt_compare() to compare two fs, but that is
                     * only a node-local comparison. */
                    uuid = obd_get_uuid(sbi->ll_md_exp);
                    if (uuid != NULL)
                            sb->s_dev = get_uuid2int(uuid->uuid, strlen(uuid->uuid));

            This could be improved to provide the full fsid for NFS instead of just the 32-bit device number.

            Note that I submitted the nfs-utils mount patch upstream, so the need for the "32bitapi" mount option on 64-bit clients will not be around for long.
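
            (A quick, illustrative way to see what a client currently reports as its filesystem ID and device number, which is what the NFS export clustering comparison above relies on; run on each Lustre client:)

            # GNU stat: in --file-system mode %i is the filesystem ID in hex,
            # in normal mode %D is the device number in hex.
            stat -f -c 'fsid:   %i' /mnt/lustre
            stat    -c 'st_dev: %D' /mnt/lustre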

            yong.fan nasf (Inactive) added a comment -

            You mean the name 'lustre-MDT0000' or similar? The uuid used for the nfs handle is two int values, which is returned via statfs(). If we want to use it, we need to do some conversion.

            adilger Andreas Dilger added a comment -

            Actually, I thought we were already using the Lustre MDT target name for the UUID? That is common across all clients and will not be broken by a backup and restore of the underlying MDT device.

            yong.fan nasf (Inactive) added a comment -

            Patch for 1.2)

            yong.fan nasf (Inactive) added a comment -

            Patch for 1.1)

            yong.fan nasf (Inactive) added a comment -

            There are two ways to resolve the issue:

            1) Patch Lustre to support the UUID. This means statfs64() on Lustre will return a valid UUID, and nfsd will generate the nfs handle from the 64-bit ino plus the UUID. Then we do NOT need to patch the kernel. The work to be done:
            1.1) Patch the user-space nfs-utils to use a 64-bit ino# instead of a 32-bit ino#.
            1.2) Patch Lustre to return a valid UUID for statfs64(). The client needs to fetch the UUID from MDT0 via an MDS_STATFS RPC. On the MDT side, we can return the backend FS UUID for that. ldiskfs already supports this; the zfs backend has NOT implemented it yet, so small changes are needed for the zfs backend.

            2) Patch the kernel to support a 64-bit ino# for the nfs handle. The work to be done:
            2.1) Patch the user-space nfs-utils to use a 64-bit ino# instead of a 32-bit ino#.
            2.2) Patch the kernel to support a 64-bit ino# for the nfs handle.

            The work for 1.1) and 2.1) is similar, but 1.2) and 2.2) are quite different. I prefer the first solution. What do you think?
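
            (As background for why the ino width matters in both options above, an illustrative check of the root inode number that mountd has to encode; on a 64-bit Lustre client mounted without "32bitapi" this value generally does not fit in 32 bits:)

            # Print the inode number of the export root; anything larger than
            # 2^32-1 cannot be stored in the 32-bit ino field nfs-utils uses today.
            stat -c '%i' /mnt/lustre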

            adilger Andreas Dilger added a comment -

            Nasf, it seems there is still a defect in the upstream kernel, where it cannot handle a 64-bit inode number for the NFS root? Could you please at minimum send a bug report to the linux-nfs@vger.kernel.org mailing list with the details (CC me also), so that this can eventually be fixed.

            yong.fan nasf (Inactive) added a comment -

            No further patch is needed, since there is another solution which can bypass the 32-bit ino issue.

            yong.fan nasf (Inactive) added a comment -

            Somehow, this issue can be resolved by specifying "fsid=1" (without the "32bitapi" Lustre mount option) when re-exporting Lustre via NFS (v3 or v4). For example: "/mnt/lustre 10.211.55.*(rw,no_root_squash,fsid=1)". (Verified on 2.6.32-358.2.1.el6)

            People

              yong.fan nasf (Inactive)
              yujian Jian Yu
              Votes: 0
              Watchers: 17
