[LU-2904] parallel-scale-nfsv3: FAIL: setup nfs failed! Created: 04/Mar/13  Updated: 20/Nov/13  Resolved: 31/Aug/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0
Fix Version/s: Lustre 2.1.6, Lustre 2.4.1, Lustre 2.5.0

Type: Bug Priority: Blocker
Reporter: Jian Yu Assignee: nasf (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Lustre b2_1 client build: http://build.whamcloud.com/job/lustre-b2_1/181
Lustre master server build: http://build.whamcloud.com/job/lustre-master/1285
Distro/Arch: RHEL6.3/x86_64


Attachments: Text File 0001-LU-2904-nfs-support-64-bits-inode-number-in-nfs-hand.patch     Text File 0001-LU-2904-obdclass-return-valid-uuid-for-statfs.patch    
Issue Links:
Related
is related to LU-3318 mdc_set_lock_data() ASSERTION( old_in... Resolved
is related to LU-3550 Stale file handle on mount when mount... Resolved
is related to LU-4057 sub-directory NFS reexport issue Resolved
Severity: 3
Rank (Obsolete): 6993

 Description   

The parallel-scale-nfsv3 test failed as follows:

Mounting NFS clients (version 3)...
CMD: client-12vm1,client-12vm2 mkdir -p /mnt/lustre
CMD: client-12vm1,client-12vm2 mount -t nfs -o nfsvers=3,async                 client-12vm3:/mnt/lustre /mnt/lustre
client-12vm2: mount.nfs: Connection timed out
client-12vm1: mount.nfs: Connection timed out
 parallel-scale-nfsv3 : @@@@@@ FAIL: setup nfs failed! 

Syslog on the Lustre MDS/Lustre client/NFS server client-12vm3 showed:

Mar  4 17:34:15 client-12vm3 mrshd[4254]: root@client-12vm1.lab.whamcloud.com as root: cmd='(PATH=$PATH:/usr/lib64/lustre/utils:/usr/lib64/lustre/tests:/sbin:/usr/sbin; cd /usr/lib64/lustre/tests; LUSTRE="/usr/lib64/lustre"  sh -c "exportfs -o rw,async,no_root_squash *:/mnt/lustre         && exportfs -v");echo XXRETCODE:$?'
Mar  4 17:34:15 client-12vm3 xinetd[1640]: EXIT: mshell status=0 pid=4253 duration=0(sec)
Mar  4 17:34:16 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:894 for /mnt/lustre (/mnt/lustre)
Mar  4 17:34:16 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:713 for /mnt/lustre (/mnt/lustre)
Mar  4 17:34:17 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:784 for /mnt/lustre (/mnt/lustre)
Mar  4 17:34:17 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:877 for /mnt/lustre (/mnt/lustre)
Mar  4 17:34:19 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:946 for /mnt/lustre (/mnt/lustre)
Mar  4 17:34:19 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:1013 for /mnt/lustre (/mnt/lustre)
Mar  4 17:34:23 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:797 for /mnt/lustre (/mnt/lustre)
Mar  4 17:34:23 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:701 for /mnt/lustre (/mnt/lustre)
Mar  4 17:34:31 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:719 for /mnt/lustre (/mnt/lustre)
Mar  4 17:34:31 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:941 for /mnt/lustre (/mnt/lustre)
Mar  4 17:34:41 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:943 for /mnt/lustre (/mnt/lustre)
Mar  4 17:34:41 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:810 for /mnt/lustre (/mnt/lustre)
Mar  4 17:34:51 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:849 for /mnt/lustre (/mnt/lustre)
Mar  4 17:34:51 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:740 for /mnt/lustre (/mnt/lustre)
Mar  4 17:35:01 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:846 for /mnt/lustre (/mnt/lustre)
Mar  4 17:35:01 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:667 for /mnt/lustre (/mnt/lustre)
Mar  4 17:35:11 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:955 for /mnt/lustre (/mnt/lustre)
Mar  4 17:35:11 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:1006 for /mnt/lustre (/mnt/lustre)
Mar  4 17:35:21 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:828 for /mnt/lustre (/mnt/lustre)
Mar  4 17:35:21 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:739 for /mnt/lustre (/mnt/lustre)
Mar  4 17:35:31 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:1011 for /mnt/lustre (/mnt/lustre)
Mar  4 17:35:31 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:994 for /mnt/lustre (/mnt/lustre)
Mar  4 17:35:41 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:847 for /mnt/lustre (/mnt/lustre)
Mar  4 17:35:41 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:756 for /mnt/lustre (/mnt/lustre)
Mar  4 17:35:51 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:892 for /mnt/lustre (/mnt/lustre)
Mar  4 17:35:51 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:749 for /mnt/lustre (/mnt/lustre)
Mar  4 17:36:01 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:1017 for /mnt/lustre (/mnt/lustre)
Mar  4 17:36:01 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:873 for /mnt/lustre (/mnt/lustre)
Mar  4 17:36:11 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:874 for /mnt/lustre (/mnt/lustre)
Mar  4 17:36:11 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:749 for /mnt/lustre (/mnt/lustre)
Mar  4 17:36:21 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.207:916 for /mnt/lustre (/mnt/lustre)
Mar  4 17:36:21 client-12vm3 rpc.mountd[4165]: authenticated mount request from 10.10.4.206:841 for /mnt/lustre (/mnt/lustre)
Mar  4 17:36:21 client-12vm3 xinetd[1640]: START: mshell pid=4286 from=::ffff:10.10.4.206
Mar  4 17:36:21 client-12vm3 mrshd[4287]: root@client-12vm1.lab.whamcloud.com as root: cmd='/usr/sbin/lctl mark "/usr/sbin/lctl mark  parallel-scale-nfsv3 : @@@@@@ FAIL: setup nfs failed! ";echo XXRETCODE:$?'
Mar  4 17:36:21 client-12vm3 kernel: Lustre: DEBUG MARKER: /usr/sbin/lctl mark  parallel-scale-nfsv3 : @@@@@@ FAIL: setup nfs failed!

Maloo report: https://maloo.whamcloud.com/test_sets/5cbf6978-853e-11e2-bfd3-52540035b04c



 Comments   
Comment by Zhenyu Xu [ 05/Mar/13 ]

I think this is not a 2.1.4 client <-> 2.4.0 server interop test, since the parallel-scale-nfs test uses the MDS node to mount the Lustre filesystem and then uses another client node to NFS-mount the filesystem hosted on the MDS node. The Lustre client and server are both on the MDS node, which is a 2.4.0 system; no 2.1.4 Lustre client is involved.

Comment by Zhenyu Xu [ 05/Mar/13 ]

From the dmesg on the MDS:

Lustre: DEBUG MARKER: exportfs -o rw,async,no_root_squash *:/mnt/lustre && exportfs -v
Lustre: DEBUG MARKER: /usr/sbin/lctl mark parallel-scale-nfsv3 : @@@@@@ FAIL: setup nfs failed!

It looks like exportfs fails. Can you check manually to see why the exportfs command fails?

Comment by Jian Yu [ 05/Mar/13 ]

I think this is not a 2.1.4 client <-> 2.4.0 server interop test, since the parallel-scale-nfs test uses the MDS node to mount the Lustre filesystem and then uses another client node to NFS-mount the filesystem hosted on the MDS node. The Lustre client and server are both on the MDS node, which is a 2.4.0 system; no 2.1.4 Lustre client is involved.

You're right. I just searched the latest parallel-scale-nfsv3 test reports on the master branch and found that all of them failed to set up NFS:
https://maloo.whamcloud.com/test_sets/47bc2b0a-82d8-11e2-ba47-52540035b04c
https://maloo.whamcloud.com/test_sets/cc1be074-81fd-11e2-8564-52540035b04c
https://maloo.whamcloud.com/test_sets/ee103132-7e59-11e2-8f4f-52540035b04c
https://maloo.whamcloud.com/test_sets/7f885d40-7ba5-11e2-a4de-52540035b04c
https://maloo.whamcloud.com/test_sets/18a6689c-7ba5-11e2-8242-52540035b04c
https://maloo.whamcloud.com/test_sets/a205aef2-77ce-11e2-abae-52540035b04c

This is really a regression on the master branch.

Comment by Jian Yu [ 05/Mar/13 ]

It looks like exportfs fails. Can you check manually to see why the exportfs command fails?

The test output showed:

CMD: client-12vm3 exportfs -o rw,async,no_root_squash *:/mnt/lustre         && exportfs -v
/mnt/lustre   	<world>(rw,async,wdelay,no_root_squash,no_subtree_check)

Mounting NFS clients (version 3)...
CMD: client-12vm1,client-12vm2 mkdir -p /mnt/lustre
CMD: client-12vm1,client-12vm2 mount -t nfs -o nfsvers=3,async                 client-12vm3:/mnt/lustre /mnt/lustre
client-12vm2: mount.nfs: Connection timed out
client-12vm1: mount.nfs: Connection timed out
 parallel-scale-nfsv3 : @@@@@@ FAIL: setup nfs failed!

Running exportfs passed. The failure occurred while mounting the NFS clients.

Comment by Zhenyu Xu [ 12/Mar/13 ]

git commit 4a88dc8 (http://review.whamcloud.com/4904 LU-1866 osd: FID-in-LMA and OI files) caused this NFS mount-timeout issue. Before that commit, the Lustre filesystem could be NFS-mounted by an NFS client.

Comment by Jian Yu [ 13/Mar/13 ]

Lustre b2_1 client build: http://build.whamcloud.com/job/lustre-b2_1/186
Lustre master server build: http://build.whamcloud.com/job/lustre-master/1302
Distro/Arch: RHEL6.3/x86_64

The same issue occurred: https://maloo.whamcloud.com/test_sets/726a4448-8b59-11e2-965f-52540035b04c

Comment by nasf (Inactive) [ 14/Mar/13 ]

The causes of the failure:

1) An NFS defect: it does not work if "inode::i_ino" is larger than 2^32, which is a known issue.

2) The MDS returns a new "/ROOT" FID to the client, which was an IGIF before, but is now

{FID_SEQ_ROOT, 1, 0}

. When the client converts the new FID into a local inode::i_ino, the result is larger than 2^32.
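
A minimal sketch of the overflow (the packing below is simplified and illustrative, not the exact Lustre FID-to-ino mapping):

/* Illustrative only: build a 64-bit inode number from a FID's
 * sequence and object id. */
static inline __u64 fid_to_ino64(__u64 seq, __u32 oid)
{
        return (seq << 24) + oid;       /* assumed packing, for illustration */
}

With the old IGIF root FID the sequence was small enough that the result fit in 32 bits; with {FID_SEQ_ROOT, 1, 0} the sequence term alone pushes the result above 2^32, so NFS code that truncates i_ino to 32 bits can no longer identify the root inode.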

Possible solutions:

1) The MDS still returns an IGIF FID for /ROOT to the client as it did before. But this only works for re-exporting the Lustre "/ROOT"; if we want to re-export a subdirectory under "/ROOT", it still does NOT work. That issue has been there all along, with or without the patch (http://review.whamcloud.com/4904).

2) Use "-o 32bitapi" when mount the Lustre client which will re-export via VFS. It works for both "/ROOT" and its subdir re-exporting. (We still need some patch on client side, because of missing handle the "-o 32bitapi" at some corners). I prefer this one.

Andreas, what do you think?

Comment by Alex Zhuravlev [ 14/Mar/13 ]

There are no IGIFs with the ZFS backend, so I wouldn't consider (1) an option.

Comment by nasf (Inactive) [ 14/Mar/13 ]

Alex, I agree with you.

The patch for option 2) is here:

http://review.whamcloud.com/#change,5711

Comment by nasf (Inactive) [ 21/Mar/13 ]

Yu Jian, I have successfully mounted the NFS client with the above patch and "-o 32bitapi" on the Lustre client. So you can verify more NFS-related tests with it.

BTW, not only the ROOT but also any subdirectory can be re-exported.

Comment by Jian Yu [ 21/Mar/13 ]

Yu Jian, I have successfully mounted the NFS client with the above patch and "-o 32bitapi" on the Lustre client. So you can verify more NFS-related tests with it.

Could you please add the following test parameters to verify that? Thanks.

Test-Parameters: envdefinitions=SLOW=yes \
clientjob=lustre-b2_1 clientbuildno=191 \
testlist=parallel-scale-nfsv3,parallel-scale-nfsv4
Comment by nasf (Inactive) [ 25/Mar/13 ]

Another possible solution: only allow re-exporting Lustre at the "ROOT", not its sub-directories; then we can handle the "ROOT" FID on the client specially and map it to a 32-bit ino#, so "32bitapi" is no longer needed.

Andreas/Alex, what do you think?

This is the patch for this idea:

http://review.whamcloud.com/#change,5840

Comment by nasf (Inactive) [ 31/Mar/13 ]

Summary of the possible solutions:

1) Fix the NFS issues via a client kernel patch. (Does not work for patchless clients.)

2) Use 32bitapi for re-exporting Lustre via NFS. The advantage is that it works for re-exporting any directory. The shortcoming is that the client needs to be mounted with "-o 32bitapi", which increases the possibility of ino# mapping collisions. This is the patch:
http://review.whamcloud.com/#change,5711

3) Return a special IGIF FID

{FID_SEQ_IGIF, 1, 0}

for the ROOT object to the client, which is compatible with old Lustre 2.x/1.8 cases. It allows re-exporting the Lustre ROOT via NFS without client-side modification and does NOT require mounting the client with "-o 32bitapi". The shortcoming is that it only works for the ROOT; we cannot re-export non-ROOT directories via NFS. This is the patch:
http://review.whamcloud.com/#change,5840 set2

4) Map the current ROOT FID

{FID_SEQ_ROOT, 1, 0}

to a special 32-bit local ino# on the client, which allows re-exporting the Lustre ROOT via NFS without mounting the client with "-o 32bitapi". The shortcomings are that 4.1) it only works for the ROOT; we cannot re-export non-ROOT directories via NFS, and 4.2) it only works for new clients; old clients cannot be used for the re-export. (A sketch of this mapping appears after this list.) This is the patch:
http://review.whamcloud.com/#change,5840 set1

Until the NFS issues are fixed via upstream kernel patches, we need a temporary solution from 2)/3)/4). Any suggestions?
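
A rough sketch of the mapping idea in 4); fid_to_local_ino() and LUSTRE_ROOT_INO are illustrative names, not the actual code in the patch. The ROOT FID is special-cased so that its local inode number stays below 2^32 even without "-o 32bitapi", while every other FID keeps the usual 64-bit mapping:

static __u64 fid_to_local_ino(const struct lu_fid *fid)
{
        /* {FID_SEQ_ROOT, 1, 0} is the new root FID returned by the MDS */
        if (fid_seq(fid) == FID_SEQ_ROOT && fid_oid(fid) == 1)
                return LUSTRE_ROOT_INO;  /* assumed reserved value < 2^32 */
        return fid_flatten(fid);         /* usual 64-bit flattening */
}

Non-ROOT objects still get 64-bit inode numbers, which is why this approach only helps when the export point is the ROOT itself.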

Comment by Oleg Drokin [ 03/Apr/13 ]

I think NFSv4 allows a 64-bit fid, doesn't it?
Solution number 2 seems pretty reasonable to me. In fact, we even have it documented in the changelog this way.

Comment by Andreas Dilger [ 05/Apr/13 ]

Fan Yong, any progress on the #1 approach, a kernel patch to allow 64-bit inode numbers for root? That is a bug in the kernel anyway that should be fixed regardless of whether we have some other workaround, and we can keep the patch in our server kernel until it is included upstream. Then users can either use the server kernel on NFS-exporting clients, or whatever other workaround we have for patchless clients, but the problem will go away in the future.

Comment by Peter Jones [ 09/Apr/13 ]

The #2 patch has landed, so dropping this in priority. Any further work to push a fix upstream can be handled after 2.4.0 is GA.

Comment by nasf (Inactive) [ 22/Apr/13 ]

Somehow, this issue can be resolved by specifying "fsid=1" (without "32bitapi" in the Lustre mount options) when re-exporting Lustre via NFS (v3 or v4). For example: "/mnt/lustre 10.211.55.*(rw,no_root_squash,fsid=1)". (Verified on 2.6.32-358.2.1.el6.)

Comment by nasf (Inactive) [ 06/May/13 ]

No more patches are needed, since there is another solution that can bypass the 32-bit ino issue.

Comment by Andreas Dilger [ 16/May/13 ]

Nasf, it seems there is still a defect in the upstream kernel, where it cannot handle a 64-bit inode number for the NFS root? Could you please at minimum send a bug report to the linux-nfs@vger.kernel.org mailing list with details (CC me also), so that this can eventually be fixed.

Comment by nasf (Inactive) [ 22/May/13 ]

There are two ways to resolve the issue:

1) Patch Lustre to support a UUID. This means statfs64() on Lustre will return a valid UUID, and nfsd will generate the NFS handle with the 64-bit ino plus the UUID. Then we do NOT need to patch the kernel. The work to be done:
1.1) Patch the user-space nfs-utils to use a 64-bit ino# instead of a 32-bit ino#.
1.2) Patch Lustre to return a valid UUID for statfs64(). The client needs to fetch the UUID from MDT0 via an MDS_STATFS RPC. On the MDT side, we can return the backend FS UUID. ldiskfs supports that already; the ZFS backend has NOT implemented it yet, so small changes are needed for the ZFS backend.

2) Patch the kernel to support a 64-bit ino# in the NFS handle. The work to be done:
2.1) Patch the user-space nfs-utils to use a 64-bit ino# instead of a 32-bit ino#.
2.2) Patch the kernel to support a 64-bit ino# in the NFS handle.

The work for 1.1) and 2.1) is similar, but 1.2) and 2.2) are quite different. I prefer the first solution. What do you think?

Comment by nasf (Inactive) [ 22/May/13 ]

Patch for 1.1)

Comment by nasf (Inactive) [ 22/May/13 ]

Patch for 1.2)

Comment by Andreas Dilger [ 22/May/13 ]

Actually, I thought we were already using the Lustre MDT target name for the UUID? That is common across all clients already and will not be broken by backup and restore of the underlying MDT device.

Comment by nasf (Inactive) [ 22/May/13 ]

You mean the name 'lustre-MDT0000' or similar? The uuid used for the NFS handle is two int values, which are returned via statfs(). If we want to use it, we need to do some conversion.

Comment by Andreas Dilger [ 23/May/13 ]

Yes, we already do this in the client:

static int client_common_fill_super(...)
{
        :
        :
        /* We set sb->s_dev equal on all lustre clients in order to support
         * NFS export clustering.  NFSD requires that the FSID be the same
         * on all clients. */
        /* s_dev is also used in lt_compare() to compare two fs, but that is
         * only a node-local comparison. */
        uuid = obd_get_uuid(sbi->ll_md_exp);
        if (uuid != NULL)
                sb->s_dev = get_uuid2int(uuid->uuid, strlen(uuid->uuid));

This could be improved to provide the full fsid for NFS instead of just the 32-bit device number.
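
A minimal sketch of that improvement in the client statfs path, assuming sfs is the struct kstatfs being filled; get_uuid2fsid() is a hypothetical helper that hashes the UUID string into 64 bits, so this is not the landed patch:

        /* Sketch: derive a 64-bit f_fsid from the same MDT export UUID
         * that is already used to set sb->s_dev above. */
        uuid = obd_get_uuid(sbi->ll_md_exp);
        if (uuid != NULL) {
                __u64 fsid = get_uuid2fsid(uuid->uuid, strlen(uuid->uuid));

                sfs->f_fsid.val[0] = (__u32)fsid;
                sfs->f_fsid.val[1] = (__u32)(fsid >> 32);
        }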

Note that I submitted the nfs-utils mount patch upstream, so the need for the "32bitapi" mount option on 64-bit clients will not be around long.

Comment by Jian Yu [ 27/May/13 ]

Lustre b2_1 client build: http://build.whamcloud.com/job/lustre-b2_1/204
Lustre master server build: http://build.whamcloud.com/job/lustre-master/1508
Distro/Arch: RHEL6.4/x86_64

The issue still occurred: https://maloo.whamcloud.com/test_sets/b5a0c146-c624-11e2-9bf1-52540035b04c

CMD: client-26vm3 exportfs -o rw,async,no_root_squash *:/mnt/lustre         && exportfs -v
/mnt/lustre   	<world>(rw,async,wdelay,no_root_squash,no_subtree_check)

Mounting NFS clients (version 3)...
CMD: client-26vm5,client-26vm6.lab.whamcloud.com mkdir -p /mnt/lustre
CMD: client-26vm5,client-26vm6.lab.whamcloud.com mount -t nfs -o nfsvers=3,async                 client-26vm3:/mnt/lustre /mnt/lustre
client-26vm6: mount.nfs: Connection timed out
client-26vm5: mount.nfs: Connection timed out
 parallel-scale-nfsv3 : @@@@@@ FAIL: setup nfs failed! 

Somehow, this issue can be resolved by specifying "fsid=1" (without "32bitapi" in the Lustre mount options) when re-exporting Lustre via NFS (v3 or v4). For example: "/mnt/lustre 10.211.55.*(rw,no_root_squash,fsid=1)". (Verified on 2.6.32-358.2.1.el6.)

We need a patch on the Lustre b2_1 branch to resolve the interop issue.

Comment by Jian Yu [ 27/May/13 ]

Patch for the Lustre b2_1 branch to add the "32bitapi" Lustre client mount option when exporting the Lustre client as an NFSv3 server: http://review.whamcloud.com/6457
Patch for Lustre b1_8 branch: http://review.whamcloud.com/6663
Patch for Lustre master branch: http://review.whamcloud.com/6649

Comment by Alexey Lyashkov [ 10/Jul/13 ]

The last patch,
http://git.whamcloud.com/?p=fs/lustre-release.git;a=commitdiff;h=8c4f4a47e051b097358818f4d3777d02124abbe7

looks invalid - the Lustre client already had such code:

        /* We set sb->s_dev equal on all lustre clients in order to support
         * NFS export clustering.  NFSD requires that the FSID be the same
         * on all clients. */
        /* s_dev is also used in lt_compare() to compare two fs, but that is
         * only a node-local comparison. */
        uuid = obd_get_uuid(sbi->ll_md_exp);
        if (uuid != NULL)
                sb->s_dev = get_uuid2int(uuid->uuid, strlen(uuid->uuid));
        sbi->ll_mnt = mnt;

In that case, exporting the s_dev via statfs would be enough.

Comment by nasf (Inactive) [ 10/Jul/13 ]

A 32-bit uuid may work for this case, but since the POSIX API is 64-bit, and statfs() is not only used for re-exporting via NFS but also for other purposes, we prefer to generate and return a 64-bit uuid as expected.

Comment by Alexey Lyashkov [ 10/Jul/13 ]

fsid has a single requirement - it should be the same across the cluster and unique on a node.
I think a 32-bit uid is enough to encode the FS id in statfs.
But using the same FSID has interop benefits - different nodes (with older and newer nfsd) have the same fs id in the NFS handles, so they may be used in a failover pair.

I also have a question about the second patch - we have prepared our own NFS handle structure with the lu_fid inside, and it should not have a limitation above 32 bits; if we have missed one code path and an NFS handle is created with the wrong format, we need to investigate that.

Comment by nasf (Inactive) [ 10/Jul/13 ]

Honestly, I am not sure whether 32 bits are enough for all kinds of statfs() users. It is true that in a mixed environment a new client will export a 64-bit FSID and an old client will export a 32-bit FSID; such a difference may cause issues if users want to access Lustre via different clients with the same handle. But I do not know whether someone really wants to use Lustre that way. In the long view, we need to upgrade the FSID to 64 bits; otherwise, if 32 bits are always enough, the statfs() API should be shrunk...

As for the NFS handle with lu_fid, it works for objects under the export point, but the root NFS handle does not contain the lu_fid (and does not go down to Lustre). That is why we made this patch.

Comment by Alexey Lyashkov [ 10/Jul/13 ]

Do you really think 2^32 Lustre mounts will exist on a single node? The FSID is just a unique id for a mount point.
The reason it is 64-bit: 32 bits for the block device id and 32 bits for the slice number inside the block device, so it just needs to identify a mount point correctly. In the Lustre case, we don't have slices inside a device and don't need to fill that part, but the device id exactly identifies a mount point.

As for the root node of an export, let me look - but as I remember, nfs-utils also gets the NFS handle from the kernel.

Comment by Alexey Lyashkov [ 10/Jul/13 ]

Anyway, most kernel filesystems use:

        u64 id = huge_encode_dev(sb->s_bdev->bd_dev);
        buf->f_fsid.val[0] = (u32)id;
        buf->f_fsid.val[1] = (u32)(id >> 32);

I don't understand why we do not use the same approach.

Comment by nasf (Inactive) [ 10/Jul/13 ]

The root issue is in the user-space nfs-utils.

1) The FSID returned via statfs() to nfs-utils is 64 bits, whether we use the newly generated 64-bit FSID or reuse the old 32-bit FSID. If the old 32-bit FSID is used, then __kernel_fsid_t::val[1] (or val[0]) is zero. Before this patch was applied, Lustre did not return an FSID via statfs().

2) The root NFS handle contains the root inode#. The Lustre root inode# is 64 bits, but when nfs-utils parses the root handle, it is converted to 32 bits, so it cannot locate the right "inode". I have made a patch for that and sent it to the related maintainers, and I hope the patch can be accepted/landed in the next nfs-utils release.

diff --git a/utils/mountd/cache.c b/utils/mountd/cache.c
index 517aa62..a7212e7 100644
--- a/utils/mountd/cache.c
+++ b/utils/mountd/cache.c
@@ -388,7 +388,7 @@ struct parsed_fsid {
 	int fsidtype;
 	/* We could use a union for this, but it would be more
 	 * complicated; why bother? */
-	unsigned int inode;
+	uint64_t inode;
 	unsigned int minor;
 	unsigned int major;
 	unsigned int fsidnum;
--
1.7.1
Comment by Alexey Lyashkov [ 11/Jul/13 ]

1) I understand (and agree) about returning an fs id from statfs - but I think we may use s_dev for it, and 32 bits are enough; we may fill the high part of the fs id with a Lustre magic (if needed).

2) Give me until Monday to look into the mountd code carefully.

Comment by nasf (Inactive) [ 11/Jul/13 ]

1) In theory, we can fill the low 32 bits of the FSID with s_dev and fill the high 32 bits with anything, such as a Lustre magic. But regardless of what is filled, we cannot control how the caller uses the returned FSID. And there is no clear advantage in replacing the current patch, since the 64-bit FSID is only generated at mount time.

Comment by Alexey Lyashkov [ 12/Jul/13 ]

Well, you can't control how the caller will use the FSID anyway, but the FSID's purpose is just to separate one NFS handle in the hash from another. A single FS can be exported via different paths, and NFS may do round-robin access to the same files or load balancing (in NFSv4). In that case any number (as you see in this ticket, setting fsid=1 in the exports file) is enough, but it needs to be unique on the host and the same across the cluster. As you don't know which NFS server versions are used in load-balancing pairs, you need to present the same FSID in each case - whether it is generated from s_dev or from the stat() call.
Also, it avoids using private kernel types in Lustre includes/structures.

Comment by nasf (Inactive) [ 14/Jul/13 ]

Currently, llite can export both the 32-bit "s_dev" and the 64-bit "FSID"; which one is used depends on the user (nfs-utils or other applications). Even if they are different, they can both be used to indicate/locate the same device (FS). Using "s_dev" ignores "FSID", and the same holds for the reverse case. They are not required to be the same.

I am not sure I caught the point you are worried about, but you can give me a detailed example of the patch breaking something.

Comment by Jian Yu [ 17/Jul/13 ]

Hi Oleg,

Could you please cherry-pick http://review.whamcloud.com/6493 to the Lustre b2_4 branch? Thanks.

The parallel-scale-nfsv3 test also failed on the current Lustre b2_4 branch:
https://maloo.whamcloud.com/test_sets/9f30063c-ed8f-11e2-8e3a-52540035b04c

Comment by Jian Yu [ 17/Jul/13 ]

The patch http://review.whamcloud.com/6493 also needs to be backported to the Lustre b1_8 and b2_1 branches to pass testing on the following interop configuration:

NFS clients + Lustre b1_8/b2_1 NFS server/Lustre client + Lustre b2_4/master Lustre servers

Comment by Alexey Lyashkov [ 17/Jul/13 ]

You have broken clustered NFS and NFS failover configurations where 2 NFS servers are in a pair - the first with older nfs-utils tools where the fsid is generated from s_dev, the second with this patch.

So you have broken interoperability with older versions.

Comment by Jian Yu [ 09/Aug/13 ]

Hi Oleg,

Could you please cherry-pick the patch to the Lustre b2_4 branch? Thanks.

The failure occurred regularly on the Lustre b2_4 branch:
https://maloo.whamcloud.com/test_sets/0c61eedc-fdad-11e2-9fd5-52540035b04c
https://maloo.whamcloud.com/test_sets/f1c60464-fd16-11e2-9fdb-52540035b04c
https://maloo.whamcloud.com/test_sets/13499228-fcda-11e2-b90c-52540035b04c
https://maloo.whamcloud.com/test_sets/3d64d0f8-fcc2-11e2-9fdb-52540035b04c
https://maloo.whamcloud.com/test_sets/512fc62e-fcb8-11e2-9222-52540035b04c

Comment by Jian Yu [ 13/Aug/13 ]

The patch http://review.whamcloud.com/6493 has landed on the Lustre b2_4 branch.

Comment by Alexey Lyashkov [ 13/Aug/13 ]

I strongly disagree with that patch; please revert it and commit a correct version that does not break clustered NFS interoperability.

Comment by Patrick Farrell (Inactive) [ 14/Aug/13 ]

I think I agree with Alexey - what's the purpose of requiring a patched version of nfs-utils? Obviously we eventually want to fix the entire nfs-utils and kernel NFS/NFSD problem with 64-bit root inodes, but until complete fixes are available, shouldn't we avoid requiring a patch? (Also, having an nfs-utils patch adds another package that Lustre users must build themselves or that must be provided with Lustre [like e2fsprogs].)

It seems like the better solution is to document and require the -o fsid= option while pushing for fixes upstream. (This is Cray's plan going forward whether or not this specific patch remains in Lustre.)

Comment by nasf (Inactive) [ 23/Aug/13 ]

Generate the FSID from super_block::s_dev.

The patch for master:
http://review.whamcloud.com/#/c/7434/
The patch for b2_4:
http://review.whamcloud.com/#/c/7435/
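
This follows the common kernel idiom quoted earlier in this ticket, applied to the client superblock's s_dev. A sketch of the idea only, not necessarily identical to the patches above:

        /* Sketch: derive the 64-bit f_fsid from the anonymous s_dev that
         * all clients of the same filesystem already share (it is built
         * from the MDT export UUID in client_common_fill_super()). */
        u64 id = huge_encode_dev(sb->s_dev);

        buf->f_fsid.val[0] = (u32)id;
        buf->f_fsid.val[1] = (u32)(id >> 32);

Because s_dev is the same on all clients of a filesystem, the resulting FSID is consistent across nodes, which avoids the clustered-NFS/failover interoperability concern raised above.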

Comment by Peter Jones [ 31/Aug/13 ]

Landed for 2.4.1 and 2.5
