[LU-2904] parallel-scale-nfsv3: FAIL: setup nfs failed!

Details


    Description

      The parallel-scale-nfsv3 test failed as follows:

      Mounting NFS clients (version 3)...
      CMD: client-12vm1,client-12vm2 mkdir -p /mnt/lustre
      CMD: client-12vm1,client-12vm2 mount -t nfs -o nfsvers=3,async                 client-12vm3:/mnt/lustre /mnt/lustre
      client-12vm2: mount.nfs: Connection timed out
      client-12vm1: mount.nfs: Connection timed out
       parallel-scale-nfsv3 : @@@@@@ FAIL: setup nfs failed! 
      

      The syslog on the Lustre MDS/Lustre client/NFS server node client-12vm3 showed:

      Mar  4 17:34:15 client-12vm3 mrshd[4254]: root@client-12vm1.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre"  sh -c "exportfs -o rw,async,no_root_squash *:/mnt/lustre         && exportfs -v");echo XXRETCODE:$?'
      Mar  4 17:34:15 client-12vm3 xinetd[1640]: EXIT: mshell status=0 pid=4253 duration=0(sec)
      Mar  4 17:34:16 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:894 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:16 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:713 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:17 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:784 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:17 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:877 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:19 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:946 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:19 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:1013 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:23 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:797 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:23 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:701 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:31 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:719 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:31 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:941 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:41 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:943 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:41 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:810 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:51 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:849 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:34:51 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:740 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:35:01 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:846 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:35:01 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:667 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:35:11 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:955 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:35:11 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:1006 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:35:21 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:828 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:35:21 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:739 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:35:31 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:1011 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:35:31 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:994 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:35:41 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:847 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:35:41 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:756 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:35:51 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:892 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:35:51 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:749 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:36:01 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:1017 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:36:01 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:873 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:36:11 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:874 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:36:11 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:749 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:36:21 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:916 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:36:21 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:841 for /mnt/lustre (/mnt/lustre)
      Mar  4 17:36:21 client-12vm3 xinetd[1640]: START: mshell pid=4286 from=::ffff:10.10.4.206
      Mar  4 17:36:21 client-12vm3 mrshd[4287]: root@client-12vm1.lab.whamcloud.com as root: cmd='/usr/sbin/lctl mark "/usr/sbin/lctl mark  parallel-scale-nfsv3 : @@@@@@ FAIL: setup nfs failed! ";echo XXRETCODE:$?'
      Mar  4 17:36:21 client-12vm3 kernel: Lustre: DEBUG MARKER: /usr/sbin/lctl mark  parallel-scale-nfsv3 : @@@@@@ FAIL: setup nfs failed!
      

      Maloo report: https://maloo.whamcloud.com/test_sets/5cbf6978-853e-11e2-bfd3-52540035b04c

Activity

            shadow Alexey Lyashkov added a comment -

            I strongly disagree with that patch; please revert it and commit a correct version that does not break clustered NFS interoperability.

            yujian Jian Yu added a comment -

            The patch http://review.whamcloud.com/6493 was landed on the Lustre b2_4 branch.

            yujian Jian Yu added a comment -

            Hi Oleg,

            Could you please cherry-pick the patch to the Lustre b2_4 branch? Thanks.

            The failure occurred regularly on the Lustre b2_4 branch:
            https://maloo.whamcloud.com/test_sets/0c61eedc-fdad-11e2-9fd5-52540035b04c
            https://maloo.whamcloud.com/test_sets/f1c60464-fd16-11e2-9fdb-52540035b04c
            https://maloo.whamcloud.com/test_sets/13499228-fcda-11e2-b90c-52540035b04c
            https://maloo.whamcloud.com/test_sets/3d64d0f8-fcc2-11e2-9fdb-52540035b04c
            https://maloo.whamcloud.com/test_sets/512fc62e-fcb8-11e2-9222-52540035b04c

            shadow Alexey Lyashkov added a comment -

            You have broken a clustered NFS or NFS failover configuration where two NFS servers are paired: the first with older nfs-utils tools, where the fsid is generated from s_dev, and the second with this patch.

            So you have broken interoperability with older versions.

            yujian Jian Yu added a comment -

            The patch http://review.whamcloud.com/6493 also needs to be backported to the Lustre b1_8 and b2_1 branches to pass testing on the following interop configurations:

            NFS clients + Lustre b1_8/b2_1 NFS server/Lustre client + Lustre b2_4/master Lustre servers

            yujian Jian Yu added a comment -

            Hi Oleg,

            Could you please cherry-pick http://review.whamcloud.com/6493 to the Lustre b2_4 branch? Thanks.

            The parallel-scale-nfsv3 test also failed on the current Lustre b2_4 branch:
            https://maloo.whamcloud.com/test_sets/9f30063c-ed8f-11e2-8e3a-52540035b04c


            yong.fan nasf (Inactive) added a comment -

            Currently, llite can export both the 32-bit "s_dev" and the 64-bit "FSID"; which one is used depends on the consumer (nfs-utils or other applications). Even if they differ, both can be used to identify/locate the same device (filesystem). Using "s_dev" ignores "FSID", and vice versa; they are not required to be the same.

            I am not sure I understand the point you are worried about, but please give me a detailed example of the patch breaking something.
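
            For illustration only (not code from Lustre or nfs-utils), here is a minimal sketch of how a userspace consumer can observe the two identifiers discussed here: the 32-bit device number from stat(2) and the 64-bit fsid from statfs(2). The mount point path is an assumed example, and glibc exposes __kernel_fsid_t's val[] as __val[]:

            #include <stdio.h>
            #include <sys/stat.h>
            #include <sys/vfs.h>    /* statfs(2); struct statfs carries f_fsid */

            int main(void)
            {
                struct stat st;
                struct statfs stf;
                const char *mnt = "/mnt/lustre";   /* example mount point */

                if (stat(mnt, &st) != 0 || statfs(mnt, &stf) != 0) {
                    perror("stat/statfs");
                    return 1;
                }

                /* The 32-bit identifier older nfs-utils derived the fsid from. */
                printf("st_dev: 0x%llx\n", (unsigned long long)st.st_dev);
                /* The 64-bit fsid; with the old 32-bit scheme one half is zero. */
                printf("f_fsid: 0x%08x%08x\n",
                       (unsigned)stf.f_fsid.__val[1], (unsigned)stf.f_fsid.__val[0]);
                return 0;
            }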

            shadow Alexey Lyashkov added a comment - edited

            Well, you can't control how the caller will use the FSID anyway, but the purpose of the FSID is just to separate one NFS handle in the hash from another. This matters when a single FS is exported via different paths and NFS may do round-robin access to the same files or load balancing (in NFSv4). In that case any number is enough (as seen in the ticket, where fsid=1 is set in the exports file), but it needs to be unique on a host and the same across the cluster. Since you don't know which versions the NFS servers in a load-balancing pair use, you need to present the same FSID in each case: whether it is generated from s_dev or from a stat() call.
            Also, this avoids using private kernel types in Lustre includes/structures.

            yong.fan nasf (Inactive) added a comment -

            1) In theory, we could fill the low 32 bits of the FSID with s_dev and fill the high 32 bits with anything, such as a Lustre magic value. But regardless of what is filled in, we cannot control how the caller uses the returned FSID. And there is no clear advantage over the current patch, since the 64-bit FSID is only generated at mount time.
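
            A minimal sketch of the alternative layout being discussed, assuming a hypothetical LUSTRE_FSID_MAGIC constant (not taken from the Lustre tree): the low 32 bits carry the existing s_dev and the high 32 bits carry a fixed magic, instead of generating a new 64-bit identifier at mount time:

            #include <stdint.h>
            #include <stdio.h>

            /* Hypothetical constant, used only for this sketch. */
            #define LUSTRE_FSID_MAGIC 0x0BD00BD0ULL

            /* Low 32 bits: the existing s_dev, so consumers that only look at the
             * low half see the same value the old scheme produced.
             * High 32 bits: a fixed magic value. */
            static inline uint64_t lustre_compose_fsid(uint32_t s_dev)
            {
                return (LUSTRE_FSID_MAGIC << 32) | s_dev;
            }

            int main(void)
            {
                /* Example s_dev value, made up for illustration. */
                printf("fsid = 0x%016llx\n",
                       (unsigned long long)lustre_compose_fsid(0x2c));
                return 0;
            }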

            shadow Alexey Lyashkov added a comment -

            1) I understand (and agree) about returning an fsid from statfs, but I think we could use s_dev for it; 32 bits is enough, and we could fill the high part of the fsid with a Lustre magic value (if needed).

            2) Give me until Monday to look into the mountd code carefully.

            yong.fan nasf (Inactive) added a comment -

            The root issue is in user space nfs-utils.

            1) The FSID returned via statfs() to nfs-utils is 64-bit, regardless of whether the newly generated 64-bit FSID is used or the old 32-bit FSID is reused. If the old 32-bit FSID is used, then __kernel_fsid_t::val[1] (or val[0]) is zero. Before this patch was applied, Lustre did not return an FSID via statfs().

            2) The root NFS handle contains the root inode number. The Lustre root inode number is 64-bit, but when nfs-utils parses the root handle it is converted to 32 bits, so it cannot locate the right "inode" (see the sketch after the diff below). I have made a patch for that and sent it to the related kernel maintainers, in the hope that it can be accepted/landed in the next nfs-utils release.

            diff --git a/utils/mountd/cache.c b/utils/mountd/cache.c
            index 517aa62..a7212e7 100644
            --- a/utils/mountd/cache.c
            +++ b/utils/mountd/cache.c
            @@ -388,7 +388,7 @@ struct parsed_fsid {
             	int fsidtype;
             	/* We could use a union for this, but it would be more
             	 * complicated; why bother? */
            -	unsigned int inode;
            +	uint64_t inode;
             	unsigned int minor;
             	unsigned int major;
             	unsigned int fsidnum;
            --
            1.7.1
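
            As a sketch of point 2 above (illustration only, not the actual nfs-utils parsing path): storing the 64-bit Lustre root inode number in a 32-bit field silently drops the high bits, so the handle no longer resolves to the right inode. The inode value below is made up:

            #include <stdint.h>
            #include <stdio.h>

            /* A 64-bit Lustre root inode number stored into a 32-bit field loses
             * its high bits, so the handle no longer identifies the right inode. */
            int main(void)
            {
                uint64_t lustre_root_ino = 0x200000007ULL;   /* example 64-bit inode# */
                unsigned int truncated   = (unsigned int)lustre_root_ino;

                printf("64-bit inode#: 0x%llx\n", (unsigned long long)lustre_root_ino);
                printf("after 32-bit truncation: 0x%x\n", truncated);  /* prints 0x7 */
                return 0;
            }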

            People

              Assignee: yong.fan nasf (Inactive)
              Reporter: yujian Jian Yu
              Votes: 0
              Watchers: 17