[LU-565] 1.8<->2.1 interop: Stale NFS file handle Created: 03/Aug/11 Updated: 11/Oct/11 Resolved: 11/Oct/11 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.1.0, Lustre 1.8.6 |
| Fix Version/s: | Lustre 2.1.0, Lustre 1.8.7 |
| Type: | Bug | Priority: | Major |
| Reporter: | Jian Yu | Assignee: | Oleg Drokin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
NFS v3/v4 clients: Lustre Client and NFS v3/v4 Server: Lustre Servers: |
||
| Attachments: |
|
| Severity: | 3 |
| Rank (Obsolete): | 5186 |
| Description |
|
While running compilebench on NFSv3 client (NFSv3 server was Lustre 1.8.6-wc1 client, and the Lustre server version was v2_0_66_0), it failed as follows: NFSCLIENT mode: setup, cleanup, check config skipped
client-13-ib
fat-amd-3-ib:/mnt/lustre /mnt/lustre nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.4.134,mountvers=3,mountport=59661,mountproto=udp,local_lock=none,addr=192.168.4.134 0 0
client-12-ib
fat-amd-3-ib:/mnt/lustre /mnt/lustre nfs rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.4.134,mountvers=3,mountport=59661,mountproto=udp,local_lock=none,addr=192.168.4.134 0 0
fat-amd-3-ib
Filesystem Type 1K-blocks Used Available Use% Mounted on
fat-amd-1-ib@o2ib:/lustre
lustre 11811168 615432 10595352 6% /mnt/lustre
== test compilebench: compilebench == 23:10:50
OPTIONS:
cbench_DIR=/usr/bin
cbench_IDIRS=10
cbench_RUNS=10
client-12-ib
client-13-ib
./compilebench -D /mnt/lustre/d0.compilebench -i 10 -r 10 --makej
using working directory /mnt/lustre/d0.compilebench, 10 intial dirs 10 runs
Traceback (most recent call last):
File "./compilebench", line 567, in <module>
dset = dataset(options.sources, rnd)
File "./compilebench", line 319, in __init__
self.unpatched = native_order(self.unpatched, "unpatched")
File "./compilebench", line 97, in native_order
run_directory(tmplist, dirname, "native %s" % tag)
File "./compilebench", line 225, in run_directory
fp = file(fname, 'a+')
IOError: [Errno 116] Stale NFS file handle: '/mnt/lustre/d0.compilebench/native-0/COPYING'
parallel-scale test_compilebench: @@@@@@ FAIL: compilebench failed: 1
Dumping lctl log to /home/yujian/test_logs/2011-08-02/231046/parallel-scale.test_compilebench.*.1312351852.log
Syslog on the Lustre client (NFSv3 server) fat-amd-3-ib showed that: Aug 2 23:10:50 fat-amd-3 kernel: LustreError: 6705:0:(llite_nfs.c:130:ll_iget_for_nfs()) Inode 144115205255725057, Bad count: 2 1 or version 0 1 Aug 2 23:10:50 fat-amd-3 kernel: LustreError: 6705:0:(llite_nfs.c:130:ll_iget_for_nfs()) Inode 144115205255725057, Bad count: 2 1 or version 0 1 Aug 2 23:10:52 fat-amd-3 kernel: LustreError: 6705:0:(llite_nfs.c:130:ll_iget_for_nfs()) Inode 144115205255725057, Bad count: 2 2 or version 0 1 Aug 2 23:10:52 fat-amd-3 kernel: LustreError: 6705:0:(llite_nfs.c:130:ll_iget_for_nfs()) Skipped 1 previous similar message Aug 2 23:11:06 fat-amd-3 kernel: LustreError: 6705:0:(llite_nfs.c:130:ll_iget_for_nfs()) Inode 144115205255725057, Bad count: 2 2 or version 0 1 Aug 2 23:11:06 fat-amd-3 kernel: LustreError: 6705:0:(llite_nfs.c:130:ll_iget_for_nfs()) Skipped 5 previous similar messages Debug log on the Lustre client (NFSv3 server) fat-amd-3-ib showed that: 00000080:00000001:2:1312351850.904050:0:6705:0:(llite_lib.c:2286:ll_prep_inode()) Process entered 00000002:00000001:2:1312351850.904052:0:6705:0:(mdc_request.c:513:mdc_req2lustre_md()) Process entered 00000002:00000001:2:1312351850.904055:0:6705:0:(mdc_request.c:581:mdc_req2lustre_md()) Process leaving (rc=0 : 0 : 0) 00000080:00000001:2:1312351850.904058:0:6705:0:(namei.c:88:fid_flatten()) Process leaving (rc=144115205255725057 : 144115205255725057 : 200000400000001) 00000080:00000001:2:1312351850.904061:0:6705:0:(namei.c:124:ll_fid_build_ino()) Process leaving (rc=144115205255725057 : 144115205255725057 : 200000400000001) 00000080:00000001:2:1312351850.904065:0:6705:0:(namei.c:165:ll_iget()) Process entered 00000080:00000001:2:1312351850.904089:0:6705:0:(llite_lib.c:1900:ll_update_inode()) Process entered 00000080:00000002:2:1312351850.904091:0:6705:0:(llite_lib.c:1902:ll_update_inode()) body->valid = 0x16fbf 00000080:00000001:2:1312351850.904093:0:6705:0:(namei.c:88:fid_flatten()) Process leaving (rc=144115205255725057 : 144115205255725057 : 200000400000001) 00000080:00000001:2:1312351850.904098:0:6705:0:(namei.c:124:ll_fid_build_ino()) Process leaving (rc=144115205255725057 : 144115205255725057 : 200000400000001) 00000080:00000001:2:1312351850.904101:0:6705:0:(namei.c:130:ll_fid_build_gen()) Process entered 00000080:00000001:2:1312351850.904103:0:6705:0:(namei.c:135:ll_fid_build_gen()) Process leaving (rc=0 : 0 : 0) 00000080:00000001:2:1312351850.904106:0:6705:0:(llite_lib.c:2028:ll_update_inode()) Process leaving 00000080:00200000:2:1312351850.904108:0:6705:0:(namei.c:180:ll_iget()) got inode: 144115205255725057/0(ffff8800377cc190) for [0x200000400:0x1:0x0] 00000080:00000001:2:1312351850.904112:0:6705:0:(namei.c:183:ll_iget()) Process leaving (rc=18446612133245141392 : -131940464410224 : ffff8800377cc190) 00000080:00000001:2:1312351850.904116:0:6705:0:(obd_class.h:539:obd_checkmd()) Process entered 00020000:00000001:2:1312351850.904118:0:6705:0:(lov_obd.c:3139:lov_checkmd()) Process entered 00020000:00000001:2:1312351850.904119:0:6705:0:(lov_obd.c:3142:lov_checkmd()) Process leaving (rc=0 : 0 : 0) 00000080:00000001:2:1312351850.904122:0:6705:0:(obd_class.h:545:obd_checkmd()) Process leaving (rc=0 : 0 : 0) 00000080:00000001:2:1312351850.904124:0:6705:0:(llite_lib.c:2314:ll_prep_inode()) Process leaving (rc=0 : 0 : 0) 00000100:00000001:2:1312351850.904127:0:6705:0:(client.c:1806:__ptlrpc_req_finished()) Process entered 00000100:00000040:2:1312351850.904129:0:6705:0:(client.c:1818:__ptlrpc_req_finished()) @@@ refcount now 0 req@ffff880118ebcc00 x1376100389224929/t0 o33->lustre-MDT0000_UUID@192.168.4.132@o2ib:12/10 lens 408/808 e 0 to 1 dl 1312351857 ref 1 fl Interpret:R/0/0 rc 0/0 00000100:00000001:2:1312351850.904141:0:6705:0:(client.c:1739:__ptlrpc_free_req()) Process entered 00000100:00000010:2:1312351850.904143:0:6705:0:(client.c:1770:__ptlrpc_free_req()) kfreed 'request->rq_repbuf': 808 at ffff880118f43400. 00000100:00000040:2:1312351850.904147:0:6705:0:(client.c:1775:__ptlrpc_free_req()) PUTting export ffff8801006dc200 : new refcount 4 00000020:00000001:2:1312351850.904150:0:6705:0:(genops.c:816:class_import_put()) Process entered 00000020:00000040:2:1312351850.904152:0:6705:0:(genops.c:824:class_import_put()) import ffff88021a3b5800 refcount=5 obd=lustre-MDT0000-mdc-ffff880106982400 00000020:00000001:2:1312351850.904155:0:6705:0:(genops.c:835:class_import_put()) Process leaving 00000100:00000010:2:1312351850.904156:0:6705:0:(client.c:1789:__ptlrpc_free_req()) kfreed 'request->rq_reqmsg': 408 at ffff88011853e200. 00000100:00000010:2:1312351850.904159:0:6705:0:(client.c:1792:__ptlrpc_free_req()) kfreed 'request': 584 at ffff880118ebcc00. 00000100:00000001:2:1312351850.904162:0:6705:0:(client.c:1794:__ptlrpc_free_req()) Process leaving 00000100:00000001:2:1312351850.904164:0:6705:0:(client.c:1822:__ptlrpc_req_finished()) Process leaving (rc=1 : 1 : 1) 00000080:00000001:2:1312351850.904167:0:6705:0:(llite_nfs.c:107:search_inode_for_lustre()) Process leaving (rc=18446612133245141392 : -131940464410224 : ffff8800377cc190) 00000080:00020000:2:1312351850.904170:0:6705:0:(llite_nfs.c:130:ll_iget_for_nfs()) Inode 144115205255725057, Bad count: 2 2 or version 0 1 00000080:00000001:2:1312351850.904174:0:6705:0:(llite_nfs.c:132:ll_iget_for_nfs()) Process leaving (rc=18446744073709551500 : -116 : ffffffffffffff8c) 00000080:00000001:2:1312351850.904178:0:6705:0:(llite_nfs.c:190:ll_fh_to_dentry()) Process leaving (rc=18446744073709551500 : -116 : ffffffffffffff8c) Maloo report: https://maloo.whamcloud.com/test_sets/0ec3ada0-bd9f-11e0-8bdf-52540025f9af The issue also existed on NFSv4 over Lustre client testing: https://maloo.whamcloud.com/test_sets/91dd2132-bda2-11e0-8bdf-52540025f9af It could be easily reproduced by just touching a file under the NFS client directory: [root@client-12-ib ~]# df -T /mnt/lustre
Filesystem Type 1K-blocks Used Available Use% Mounted on
fat-amd-3-ib:/
nfs4 11811840 615424 10596352 6% /mnt/lustre
[root@client-12-ib ~]# touch /mnt/lustre/aaa
touch: cannot touch `/mnt/lustre/aaa': Stale NFS file handle
|
| Comments |
| Comment by Peter Jones [ 09/Aug/11 ] |
|
Oleg Could you please look at this one? Thanks Peter |
| Comment by Oleg Drokin [ 10/Aug/11 ] |
|
Ok, I can reproduce this one too, thanks for the instructions. The problem appears to be due to llite_nfs code not being converted to fid API in 1.8, I am working on a patch |
| Comment by Oleg Drokin [ 10/Aug/11 ] |
|
Please try http://review.whamcloud.com/1206 seems to be working for me. |
| Comment by Jian Yu [ 11/Aug/11 ] |
NFS Clients/NFS Server/Lustre Client: Lustre Servers: With the above patch, the "Stale NFS file handle" issue was fixed. However, I hit another failure while running the compilebench on the NFS v3 client (client-12-ib): IOError: [Errno 521] Unknown error 521: '/mnt/lustre/d0.compilebench/native-0/COPYING' Dmesg on the NFS Server/Lustre client (fat-amd-3-ib) showed that: LustreError: 11-0: an error occurred while communicating with 192.168.4.132@o2ib. The mds_getattr_lock operation failed with -22 LustreError: 10795:0:(llite_nfs.c:276:ll_get_parent()) failure -22 inode 144115205255725057 get parent Dmesg on the MDS node (fat-amd-1-ib) showed that: LustreError: 10471:0:(mdt_handler.c:2422:mdt_body_unpack()) Invalid fid: [0x200000400000001:0x0:0x4000] LustreError: 10471:0:(mdt_handler.c:2422:mdt_body_unpack()) Skipped 2 previous similar messages LustreError: 10471:0:(ldlm_lib.c:2128:target_send_reply_msg()) @@@ processing error (-22) req@ffff88011793f400 x1377189910020547/t0(0) o-1->ca25eacd-1845-4f73-0ea2-b0a89fe93fb0@NET_0x50000c0a80486_UUID:0/0 lens 424/0 e 0 to 0 dl 1313390774 ref 1 fl Interpret:/ffffffff/ffffffff rc -22/-1 Please refer to the attached parallel-scale.test_compilebench.1313390770.log.tar.bz2 for more logs. |
| Comment by Build Master (Inactive) [ 23/Aug/11 ] |
|
Integrated in Oleg Drokin : 999530eeff5849183d8849688fd0f54dbd62c0f3
|
| Comment by Build Master (Inactive) [ 23/Aug/11 ] |
|
Integrated in Oleg Drokin : 999530eeff5849183d8849688fd0f54dbd62c0f3
|
| Comment by Build Master (Inactive) [ 23/Aug/11 ] |
|
Integrated in Oleg Drokin : 999530eeff5849183d8849688fd0f54dbd62c0f3
|
| Comment by Build Master (Inactive) [ 23/Aug/11 ] |
|
Integrated in Oleg Drokin : 999530eeff5849183d8849688fd0f54dbd62c0f3
|
| Comment by Build Master (Inactive) [ 23/Aug/11 ] |
|
Integrated in Oleg Drokin : 999530eeff5849183d8849688fd0f54dbd62c0f3
|
| Comment by Build Master (Inactive) [ 23/Aug/11 ] |
|
Integrated in Oleg Drokin : 999530eeff5849183d8849688fd0f54dbd62c0f3
|
| Comment by Build Master (Inactive) [ 23/Aug/11 ] |
|
Integrated in Oleg Drokin : 999530eeff5849183d8849688fd0f54dbd62c0f3
|
| Comment by Build Master (Inactive) [ 23/Aug/11 ] |
|
Integrated in Oleg Drokin : 999530eeff5849183d8849688fd0f54dbd62c0f3
|
| Comment by Build Master (Inactive) [ 23/Aug/11 ] |
|
Integrated in Oleg Drokin : 999530eeff5849183d8849688fd0f54dbd62c0f3
|
| Comment by Build Master (Inactive) [ 23/Aug/11 ] |
|
Integrated in Oleg Drokin : 999530eeff5849183d8849688fd0f54dbd62c0f3
|
| Comment by Build Master (Inactive) [ 23/Aug/11 ] |
|
Integrated in Oleg Drokin : 999530eeff5849183d8849688fd0f54dbd62c0f3
|