[LU-565] 1.8<->2.1 interop: Stale NFS file handle Created: 03/Aug/11  Updated: 11/Oct/11  Resolved: 11/Oct/11

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.1.0, Lustre 1.8.6
Fix Version/s: Lustre 2.1.0, Lustre 1.8.7

Type: Bug Priority: Major
Reporter: Jian Yu Assignee: Oleg Drokin
Resolution: Fixed Votes: 0
Labels: None
Environment:

NFS v3/v4 clients:
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32_131.2.1.el6)
Network: IB (inkernel OFED)
Nodes: client-12-ib, client-13-ib

Lustre Client and NFS v3/v4 Server:
Tag: 1.8.6-wc1
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32_131.2.1.el6)
Build: http://newbuild.whamcloud.com/job/lustre-b1_8/100/arch=x86_64,build_type=client,distro=el6,ib_stack=inkernel/
Network: IB (inkernel OFED)
Node: fat-amd-3-ib

Lustre Servers:
Tag: v2_0_66_0
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32-131.2.1.el6_lustre)
Build: http://newbuild.whamcloud.com/job/lustre-master/228/arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel/
Network: IB (inkernel OFED)
Nodes: fat-amd-1-ib (MDS), fat-amd-2-ib (OSS)


Attachments: File parallel-scale.test_compilebench.1313390770.log.tar.bz2    
Severity: 3
Rank (Obsolete): 5186

 Description   

While running compilebench on NFSv3 client (NFSv3 server was Lustre 1.8.6-wc1 client, and the Lustre server version was v2_0_66_0), it failed as follows:

NFSCLIENT mode: setup, cleanup, check config skipped
client-13-ib
fat-amd-3-ib:/mnt/lustre /mnt/lustre nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.4.134,mountvers=3,mountport=59661,mountproto=udp,local_lock=none,addr=192.168.4.134 0 0
client-12-ib
fat-amd-3-ib:/mnt/lustre /mnt/lustre nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.4.134,mountvers=3,mountport=59661,mountproto=udp,local_lock=none,addr=192.168.4.134 0 0
fat-amd-3-ib
Filesystem    Type   1K-blocks      Used Available Use% Mounted on
fat-amd-1-ib@o2ib:/lustre
            lustre    11811168    615432  10595352   6% /mnt/lustre

== test compilebench: compilebench == 23:10:50
OPTIONS:
cbench_DIR=/usr/bin
cbench_IDIRS=10
cbench_RUNS=10
client-12-ib
client-13-ib
./compilebench -D /mnt/lustre/d0.compilebench -i 10 -r 10 --makej
using working directory /mnt/lustre/d0.compilebench, 10 intial dirs 10 runs
Traceback (most recent call last):
  File "./compilebench", line 567, in <module>
    dset = dataset(options.sources, rnd)
  File "./compilebench", line 319, in __init__
    self.unpatched = native_order(self.unpatched, "unpatched")
  File "./compilebench", line 97, in native_order
    run_directory(tmplist, dirname, "native %s" % tag)
  File "./compilebench", line 225, in run_directory
    fp = file(fname, 'a+')
IOError: [Errno 116] Stale NFS file handle: '/mnt/lustre/d0.compilebench/native-0/COPYING'
 parallel-scale test_compilebench: @@@@@@ FAIL: compilebench failed: 1 
Dumping lctl log to /home/yujian/test_logs/2011-08-02/231046/parallel-scale.test_compilebench.*.1312351852.log

Syslog on the Lustre client (NFSv3 server) fat-amd-3-ib showed that:

Aug  2 23:10:50 fat-amd-3 kernel: LustreError: 6705:0:(llite_nfs.c:130:ll_iget_for_nfs()) Inode 144115205255725057, Bad count: 2 1 or version  0 1
Aug  2 23:10:50 fat-amd-3 kernel: LustreError: 6705:0:(llite_nfs.c:130:ll_iget_for_nfs()) Inode 144115205255725057, Bad count: 2 1 or version  0 1
Aug  2 23:10:52 fat-amd-3 kernel: LustreError: 6705:0:(llite_nfs.c:130:ll_iget_for_nfs()) Inode 144115205255725057, Bad count: 2 2 or version  0 1
Aug  2 23:10:52 fat-amd-3 kernel: LustreError: 6705:0:(llite_nfs.c:130:ll_iget_for_nfs()) Skipped 1 previous similar message
Aug  2 23:11:06 fat-amd-3 kernel: LustreError: 6705:0:(llite_nfs.c:130:ll_iget_for_nfs()) Inode 144115205255725057, Bad count: 2 2 or version  0 1
Aug  2 23:11:06 fat-amd-3 kernel: LustreError: 6705:0:(llite_nfs.c:130:ll_iget_for_nfs()) Skipped 5 previous similar messages

Debug log on the Lustre client (NFSv3 server) fat-amd-3-ib showed that:

00000080:00000001:2:1312351850.904050:0:6705:0:(llite_lib.c:2286:ll_prep_inode()) Process entered
00000002:00000001:2:1312351850.904052:0:6705:0:(mdc_request.c:513:mdc_req2lustre_md()) Process entered
00000002:00000001:2:1312351850.904055:0:6705:0:(mdc_request.c:581:mdc_req2lustre_md()) Process leaving (rc=0 : 0 : 0)
00000080:00000001:2:1312351850.904058:0:6705:0:(namei.c:88:fid_flatten()) Process leaving (rc=144115205255725057 : 144115205255725057 : 200000400000001)
00000080:00000001:2:1312351850.904061:0:6705:0:(namei.c:124:ll_fid_build_ino()) Process leaving (rc=144115205255725057 : 144115205255725057 : 200000400000001)
00000080:00000001:2:1312351850.904065:0:6705:0:(namei.c:165:ll_iget()) Process entered
00000080:00000001:2:1312351850.904089:0:6705:0:(llite_lib.c:1900:ll_update_inode()) Process entered
00000080:00000002:2:1312351850.904091:0:6705:0:(llite_lib.c:1902:ll_update_inode()) body->valid = 0x16fbf
00000080:00000001:2:1312351850.904093:0:6705:0:(namei.c:88:fid_flatten()) Process leaving (rc=144115205255725057 : 144115205255725057 : 200000400000001)
00000080:00000001:2:1312351850.904098:0:6705:0:(namei.c:124:ll_fid_build_ino()) Process leaving (rc=144115205255725057 : 144115205255725057 : 200000400000001)
00000080:00000001:2:1312351850.904101:0:6705:0:(namei.c:130:ll_fid_build_gen()) Process entered
00000080:00000001:2:1312351850.904103:0:6705:0:(namei.c:135:ll_fid_build_gen()) Process leaving (rc=0 : 0 : 0)
00000080:00000001:2:1312351850.904106:0:6705:0:(llite_lib.c:2028:ll_update_inode()) Process leaving
00000080:00200000:2:1312351850.904108:0:6705:0:(namei.c:180:ll_iget()) got inode: 144115205255725057/0(ffff8800377cc190) for [0x200000400:0x1:0x0]
00000080:00000001:2:1312351850.904112:0:6705:0:(namei.c:183:ll_iget()) Process leaving (rc=18446612133245141392 : -131940464410224 : ffff8800377cc190)
00000080:00000001:2:1312351850.904116:0:6705:0:(obd_class.h:539:obd_checkmd()) Process entered
00020000:00000001:2:1312351850.904118:0:6705:0:(lov_obd.c:3139:lov_checkmd()) Process entered
00020000:00000001:2:1312351850.904119:0:6705:0:(lov_obd.c:3142:lov_checkmd()) Process leaving (rc=0 : 0 : 0)
00000080:00000001:2:1312351850.904122:0:6705:0:(obd_class.h:545:obd_checkmd()) Process leaving (rc=0 : 0 : 0)
00000080:00000001:2:1312351850.904124:0:6705:0:(llite_lib.c:2314:ll_prep_inode()) Process leaving (rc=0 : 0 : 0)
00000100:00000001:2:1312351850.904127:0:6705:0:(client.c:1806:__ptlrpc_req_finished()) Process entered
00000100:00000040:2:1312351850.904129:0:6705:0:(client.c:1818:__ptlrpc_req_finished()) @@@ refcount now 0  req@ffff880118ebcc00 x1376100389224929/t0 o33->lustre-MDT0000_UUID@192.168.4.132@o2ib:12/10 lens 408/808 e 0 to 1 dl 1312351857 ref 1 fl Interpret:R/0/0 rc 0/0
00000100:00000001:2:1312351850.904141:0:6705:0:(client.c:1739:__ptlrpc_free_req()) Process entered
00000100:00000010:2:1312351850.904143:0:6705:0:(client.c:1770:__ptlrpc_free_req()) kfreed 'request->rq_repbuf': 808 at ffff880118f43400.
00000100:00000040:2:1312351850.904147:0:6705:0:(client.c:1775:__ptlrpc_free_req()) PUTting export ffff8801006dc200 : new refcount 4
00000020:00000001:2:1312351850.904150:0:6705:0:(genops.c:816:class_import_put()) Process entered
00000020:00000040:2:1312351850.904152:0:6705:0:(genops.c:824:class_import_put()) import ffff88021a3b5800 refcount=5 obd=lustre-MDT0000-mdc-ffff880106982400
00000020:00000001:2:1312351850.904155:0:6705:0:(genops.c:835:class_import_put()) Process leaving
00000100:00000010:2:1312351850.904156:0:6705:0:(client.c:1789:__ptlrpc_free_req()) kfreed 'request->rq_reqmsg': 408 at ffff88011853e200.
00000100:00000010:2:1312351850.904159:0:6705:0:(client.c:1792:__ptlrpc_free_req()) kfreed 'request': 584 at ffff880118ebcc00.
00000100:00000001:2:1312351850.904162:0:6705:0:(client.c:1794:__ptlrpc_free_req()) Process leaving
00000100:00000001:2:1312351850.904164:0:6705:0:(client.c:1822:__ptlrpc_req_finished()) Process leaving (rc=1 : 1 : 1)
00000080:00000001:2:1312351850.904167:0:6705:0:(llite_nfs.c:107:search_inode_for_lustre()) Process leaving (rc=18446612133245141392 : -131940464410224 : ffff8800377cc190)
00000080:00020000:2:1312351850.904170:0:6705:0:(llite_nfs.c:130:ll_iget_for_nfs()) Inode 144115205255725057, Bad count: 2 2 or version  0 1
00000080:00000001:2:1312351850.904174:0:6705:0:(llite_nfs.c:132:ll_iget_for_nfs()) Process leaving (rc=18446744073709551500 : -116 : ffffffffffffff8c)
00000080:00000001:2:1312351850.904178:0:6705:0:(llite_nfs.c:190:ll_fh_to_dentry()) Process leaving (rc=18446744073709551500 : -116 : ffffffffffffff8c)

Maloo report: https://maloo.whamcloud.com/test_sets/0ec3ada0-bd9f-11e0-8bdf-52540025f9af

The issue also existed on NFSv4 over Lustre client testing: https://maloo.whamcloud.com/test_sets/91dd2132-bda2-11e0-8bdf-52540025f9af

It could be easily reproduced by just touching a file under the NFS client directory:

[root@client-12-ib ~]# df -T /mnt/lustre
Filesystem    Type   1K-blocks      Used Available Use% Mounted on
fat-amd-3-ib:/
              nfs4    11811840    615424  10596352   6% /mnt/lustre
[root@client-12-ib ~]# touch /mnt/lustre/aaa
touch: cannot touch `/mnt/lustre/aaa': Stale NFS file handle


 Comments   
Comment by Peter Jones [ 09/Aug/11 ]

Oleg

Could you please look at this one?

Thanks

Peter

Comment by Oleg Drokin [ 10/Aug/11 ]

Ok, I can reproduce this one too, thanks for the instructions.

The problem appears to be due to llite_nfs code not being converted to fid API in 1.8, I am working on a patch

Comment by Oleg Drokin [ 10/Aug/11 ]

Please try http://review.whamcloud.com/1206 seems to be working for me.

Comment by Jian Yu [ 11/Aug/11 ]

Please try http://review.whamcloud.com/1206 seems to be working for me.

NFS Clients/NFS Server/Lustre Client:
Branch: b1_8
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32-131.2.1.el6.x86_64)
Build: http://build.whamcloud.com/job/lustre-reviews/1671/arch=x86_64,build_type=client,distro=el6,ib_stack=inkernel/

Lustre Servers:
Branch: master
Distro/Arch: RHEL6/x86_64 (kernel version: 2.6.32-131.6.1.el6_lustre.gadc0fa3.x86_64)
Build: http://build.whamcloud.com/job/lustre-master/256/arch=x86_64,build_type=server,distro=el6,ib_stack=inkernel/

With the above patch, the "Stale NFS file handle" issue was fixed. However, I hit another failure while running the compilebench on the NFS v3 client (client-12-ib):

IOError: [Errno 521] Unknown error 521: '/mnt/lustre/d0.compilebench/native-0/COPYING'

Dmesg on the NFS Server/Lustre client (fat-amd-3-ib) showed that:

LustreError: 11-0: an error occurred while communicating with 192.168.4.132@o2ib. The mds_getattr_lock operation failed with -22
LustreError: 10795:0:(llite_nfs.c:276:ll_get_parent()) failure -22 inode 144115205255725057 get parent

Dmesg on the MDS node (fat-amd-1-ib) showed that:

LustreError: 10471:0:(mdt_handler.c:2422:mdt_body_unpack()) Invalid fid: [0x200000400000001:0x0:0x4000]
LustreError: 10471:0:(mdt_handler.c:2422:mdt_body_unpack()) Skipped 2 previous similar messages
LustreError: 10471:0:(ldlm_lib.c:2128:target_send_reply_msg()) @@@ processing error (-22)  req@ffff88011793f400 x1377189910020547/t0(0) o-1->ca25eacd-1845-4f73-0ea2-b0a89fe93fb0@NET_0x50000c0a80486_UUID:0/0 lens 424/0 e 0 to 0 dl 1313390774 ref 1 fl Interpret:/ffffffff/ffffffff rc -22/-1

Please refer to the attached parallel-scale.test_compilebench.1313390770.log.tar.bz2 for more logs.

Comment by Build Master (Inactive) [ 23/Aug/11 ]

Integrated in lustre-b1_8 » x86_64,server,el5,inkernel #121
LU-565 Update llite_nfc.c to 2.x interop fid API

Oleg Drokin : 999530eeff5849183d8849688fd0f54dbd62c0f3
Files :

  • lustre/llite/llite_nfs.c
Comment by Build Master (Inactive) [ 23/Aug/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,inkernel #121
LU-565 Update llite_nfc.c to 2.x interop fid API

Oleg Drokin : 999530eeff5849183d8849688fd0f54dbd62c0f3
Files :

  • lustre/llite/llite_nfs.c
Comment by Build Master (Inactive) [ 23/Aug/11 ]

Integrated in lustre-b1_8 » x86_64,client,el5,ofa #121
LU-565 Update llite_nfc.c to 2.x interop fid API

Oleg Drokin : 999530eeff5849183d8849688fd0f54dbd62c0f3
Files :

  • lustre/llite/llite_nfs.c
Comment by Build Master (Inactive) [ 23/Aug/11 ]

Integrated in lustre-b1_8 » i686,client,el6,inkernel #121
LU-565 Update llite_nfc.c to 2.x interop fid API

Oleg Drokin : 999530eeff5849183d8849688fd0f54dbd62c0f3
Files :

  • lustre/llite/llite_nfs.c
Comment by Build Master (Inactive) [ 23/Aug/11 ]

Integrated in lustre-b1_8 » x86_64,client,el6,inkernel #121
LU-565 Update llite_nfc.c to 2.x interop fid API

Oleg Drokin : 999530eeff5849183d8849688fd0f54dbd62c0f3
Files :

  • lustre/llite/llite_nfs.c
Comment by Build Master (Inactive) [ 23/Aug/11 ]

Integrated in lustre-b1_8 » x86_64,client,ubuntu1004,inkernel #121
LU-565 Update llite_nfc.c to 2.x interop fid API

Oleg Drokin : 999530eeff5849183d8849688fd0f54dbd62c0f3
Files :

  • lustre/llite/llite_nfs.c
Comment by Build Master (Inactive) [ 23/Aug/11 ]

Integrated in lustre-b1_8 » x86_64,server,el5,ofa #121
LU-565 Update llite_nfc.c to 2.x interop fid API

Oleg Drokin : 999530eeff5849183d8849688fd0f54dbd62c0f3
Files :

  • lustre/llite/llite_nfs.c
Comment by Build Master (Inactive) [ 23/Aug/11 ]

Integrated in lustre-b1_8 » i686,client,el5,inkernel #121
LU-565 Update llite_nfc.c to 2.x interop fid API

Oleg Drokin : 999530eeff5849183d8849688fd0f54dbd62c0f3
Files :

  • lustre/llite/llite_nfs.c
Comment by Build Master (Inactive) [ 23/Aug/11 ]

Integrated in lustre-b1_8 » i686,client,el5,ofa #121
LU-565 Update llite_nfc.c to 2.x interop fid API

Oleg Drokin : 999530eeff5849183d8849688fd0f54dbd62c0f3
Files :

  • lustre/llite/llite_nfs.c
Comment by Build Master (Inactive) [ 23/Aug/11 ]

Integrated in lustre-b1_8 » i686,server,el5,inkernel #121
LU-565 Update llite_nfc.c to 2.x interop fid API

Oleg Drokin : 999530eeff5849183d8849688fd0f54dbd62c0f3
Files :

  • lustre/llite/llite_nfs.c
Comment by Build Master (Inactive) [ 23/Aug/11 ]

Integrated in lustre-b1_8 » i686,server,el5,ofa #121
LU-565 Update llite_nfc.c to 2.x interop fid API

Oleg Drokin : 999530eeff5849183d8849688fd0f54dbd62c0f3
Files :

  • lustre/llite/llite_nfs.c
Generated at Sat Feb 10 01:08:17 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.