<!-- 
RSS generated by JIRA (9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c) at Sat Feb 10 01:54:02 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>Whamcloud Community JIRA</title>
    <link>https://jira.whamcloud.com</link>
    <description>This file is an XML representation of an issue</description>
<language>en-us</language>
    <build-info>
        <version>9.4.14</version>
        <build-number>940014</build-number>
        <build-date>05-12-2023</build-date>
    </build-info>


<item>
            <title>[LU-5730] intermittent I/O errors for some directories</title>
                <link>https://jira.whamcloud.com/browse/LU-5730</link>
                <project id="10000" key="LU">Lustre</project>
                    <description>&lt;p&gt;Our users have reported an issue where they suddenly have problems editing a file in a directory; they also get I/O errors, for example when trying to read the ACLs for that directory. In at least one instance the problem resolved itself overnight after we decided to investigate in more detail later; in another case today the problem went away when we renamed the problematic directory.&lt;/p&gt;

&lt;p&gt;In the most recent instance today we had some time to try to understand the issue, and this is what we have found so far: while the problem persists, some clients see an I/O error when calling getfacl, while other clients running the same commands have no problems and get the expected results. Some machines access this directory over NFS, exported from one of our clients that was showing problems in this instance; they had the same issues. Attempting to edit a file in the problematic directory with vim produced the message that the .swp file already exists, even for new files. Creating new files in the directory, for example with touch, worked without problems.&lt;/p&gt;

&lt;p&gt;There are no error messages recorded by syslog on any of the machines involved.&lt;/p&gt;

&lt;p&gt;We&apos;ve mostly run out of ideas about what to look for next to resolve this if it happens again.&lt;/p&gt;</description>
                <environment>Lustre 2.5.2 on RHEL6 servers and clients, NFS exported, ACLs</environment>
        <key id="26977">LU-5730</key>
            <summary>intermittent I/O errors for some directories</summary>
                <type id="1" iconUrl="https://jira.whamcloud.com/secure/viewavatar?size=xsmall&amp;avatarId=11303&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.whamcloud.com/images/icons/priorities/minor.svg">Minor</priority>
                        <status id="5" iconUrl="https://jira.whamcloud.com/images/icons/statuses/resolved.png" description="A resolution has been taken, and it is awaiting verification by reporter. From here issues are either reopened, or are closed.">Resolved</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="1">Fixed</resolution>
                                        <assignee username="laisiyao">Lai Siyao</assignee>
                                    <reporter username="ferner">Frederik Ferner</reporter>
                        <labels>
                    </labels>
                <created>Mon, 13 Oct 2014 13:13:44 +0000</created>
                <updated>Thu, 10 Sep 2015 18:18:40 +0000</updated>
                            <resolved>Thu, 10 Sep 2015 18:18:39 +0000</resolved>
                                    <version>Lustre 2.5.2</version>
                                                        <due></due>
                            <votes>0</votes>
                                    <watches>6</watches>
                                                                            <comments>
                            <comment id="96268" author="pjones" created="Mon, 13 Oct 2014 21:05:58 +0000"  >&lt;p&gt;Lai&lt;/p&gt;

&lt;p&gt;What do you advise here? Could this be covered by the ACLs issues fixed in the upcoming 2.5.4 release (&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5150&quot; title=&quot;NULL pointer dereference in posix_acl_valid() under mdc_get_lustre_md()&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5150&quot;&gt;&lt;del&gt;LU-5150&lt;/del&gt;&lt;/a&gt;/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3660&quot; title=&quot;Can&amp;#39;t disable ACL support with ZFS MDT&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3660&quot;&gt;&lt;del&gt;LU-3660&lt;/del&gt;&lt;/a&gt;/&lt;a href=&quot;https://jira.whamcloud.com/browse/LU-5434&quot; title=&quot;Invalid system.posix_acl_access breaks permissions enforcement&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-5434&quot;&gt;&lt;del&gt;LU-5434&lt;/del&gt;&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="96313" author="adilger" created="Tue, 14 Oct 2014 17:25:21 +0000"  >&lt;p&gt;How big is the ACL in this case? There was a bug in the past where, if the ACL was too large (30-ish entries), accessing it returned an error. Are the working and failing clients running the same Lustre version?&lt;/p&gt;</comment>
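As an illustrative aside (not from the ticket; the sample listing below is canned, not from the affected directory): access and default entries in a getfacl listing can be counted separately by filtering on the default: prefix, which is a quick way to check a directory against the historical limit of around 30 entries mentioned above.

```shell
# Count access vs. default ACL entries in a getfacl-style listing.
# The listing here is a canned sample, not the real directory's ACL.
acl_listing='user::rwx
user:i03user:rwx
group::rwx
mask::r-x
other::---
default:user::rwx
default:group::rwx
default:mask::rwx
default:other::---'
access=$(printf '%s\n' "$acl_listing" | grep -cv '^default:')
default=$(printf '%s\n' "$acl_listing" | grep -c '^default:')
echo "access entries: $access, default entries: $default"
# prints: access entries: 5, default entries: 4
```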
                            <comment id="96464" author="ferner" created="Thu, 16 Oct 2014 10:38:28 +0000"  >&lt;p&gt;The ACL in this case isn&apos;t that big: ~13 ACL entries and 15 default ACL entries on the directory.&lt;/p&gt;

&lt;p&gt;We are aware of the issues with ACLs in the past; the magic number was 32 ACLs and 32 default ACLs (at least on 1.8), and there was a bug that allowed setting 33 default ACLs on a directory, causing all files created in that directory to be inaccessible. This was a permanent issue; we still have some of these files hidden somewhere on the file system. Not in the directories causing problems now, though.&lt;/p&gt;</comment>
                            <comment id="97717" author="ferner" created="Tue, 28 Oct 2014 17:24:19 +0000"  >&lt;p&gt;Today I looked into this a bit further; I can reproduce this, or a similar issue, in a directory without any ACLs (other than Unix mode bits):&lt;/p&gt;

&lt;p&gt;On an NFS client, running &apos;cp -r&apos; on a directory tree with ~300 files and 45 directories usually fails to copy a number of files, and we see errors about &quot;Stale file handle&quot; and &quot;File exists&quot;:&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[bnh65367@ws104 noacl]$ rm -fr t  ; cp -r s t
cp: cannot create regular file `t/fast_ep/P6322/5/nan/FEP_dmacps.sh.pe4643338&apos;: Stale file handle
cp: cannot create directory `t/fast_ep/P622/3&apos;: File exists
cp: cannot create regular file `t/fast_ep/P622/11/nan/sad_fa.lst&apos;: Stale file handle
cp: cannot create regular file `t/fast_ep/P622/11/nan/FEP_lwdype.sh.o4643261&apos;: Stale file handle
cp: cannot create regular file `t/fast_ep/P622/11/nan/FEP_lwdype.sh.pe4643261&apos;: Stale file handle
cp: cannot create regular file `t/fast_ep/P6222/21/nan/FEP_sjtixa.sh.o4643283&apos;: Stale file handle
cp: cannot create regular file `t/fast_ep/P6222/11/nan/FEP_qaepuy.sh.po4643281&apos;: Stale file handle
cp: cannot create regular file `t/fast_ep/P6222/1/nan/FEP_nlauzb.sh.po4643231&apos;: Stale file handle
cp: cannot create regular file `t/fast_ep/P6222/1/nan/FEP_soevgk.sh.pe4643243&apos;: Stale file handle
cp: cannot create regular file `t/fast_ep/P6222/1/nan/FEP_soevgk.sh&apos;: Stale file handle
[bnh65367@ws104 noacl]$ find s -type f -print | wc -l
339
[bnh65367@ws104 noacl]$ find s -type d -print | wc -l
45
[bnh65367@ws104 noacl]$
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Repeating the same test directly on the NFS server runs the cp without any errors.&lt;/p&gt;</comment>
                            <comment id="97794" author="laisiyao" created="Wed, 29 Oct 2014 02:29:34 +0000"  >&lt;p&gt;Could you upload the messages and debug logs? In the meantime I&apos;ll try to reproduce it locally.&lt;/p&gt;</comment>
                            <comment id="97825" author="ferner" created="Wed, 29 Oct 2014 14:08:31 +0000"  >&lt;p&gt;syslog doesn&apos;t contain anything Lustre related on the NFS server, and on the MDS I can see only a few reconnects from other clients, mainly due to other work; nothing during the time I ran the tests. Nothing at all in the syslog on the NFS client.&lt;/p&gt;

&lt;p&gt;I&apos;ve just repeated the test (at Wed Oct 29 13:58:30 GMT 2014).&lt;/p&gt;

&lt;p&gt;I&apos;m also not quite sure which debug log you wanted, so I&apos;ve collected the output of &quot;lctl debug_kernel&quot; on the NFS server and the MDS just after running my latest test and will attach it. Unfortunately the file system is our production file system and likely to be quite busy at the moment, so there might be other stuff in there...&lt;/p&gt;

&lt;p&gt;Also, running the same test in a different directory, using a different NFS server, works without problems (using the same source directory).&lt;/p&gt;</comment>
                            <comment id="97906" author="laisiyao" created="Thu, 30 Oct 2014 03:01:52 +0000"  >&lt;p&gt;It seems this debug log doesn&apos;t contain the -ESTALE/-EEXIST errors; maybe these error codes come from the NFS code. Are you using NFSv3? And can you check whether this is ACL related (mount NFS with noacl and test again)?&lt;/p&gt;

&lt;p&gt;It&apos;s fine to use `lctl debug_kernel` to collect debug logs. Could you run `lctl set_param debug=+trace` on the NFS server and MDS to enable more debugging and, once the problem reproduces, collect the logs and upload them?&lt;/p&gt;</comment>
                            <comment id="97920" author="ferner" created="Thu, 30 Oct 2014 11:14:54 +0000"  >&lt;p&gt;Yes, we are using NFSv3 with the kernel NFS server. I&apos;ve repeated the test with the file system mounted with noacl on the NFS client; same result, a couple of stale file handles.&lt;/p&gt;

&lt;p&gt;I&apos;ve also captured a new set of debug logs with trace turned on, capturing with lctl debug_kernel as soon as possible after completing the tests. As the file system is in use, I also started a debug_daemon on the NFS server and MDS, with 2GB and 8GB file sizes, to try to capture the event; obviously the files are too big to upload here. Grepping for stale (using &quot;grep -i&quot; to ignore case) did not return anything; I&apos;m not sure if you were expecting or hoping for this string to be there? I&apos;ve still got the files, so if you want them we&apos;ll need to get them to you somehow.&lt;/p&gt;

&lt;p&gt;This new test produced only stale file handle errors.&lt;/p&gt;</comment>
                            <comment id="98023" author="laisiyao" created="Fri, 31 Oct 2014 02:40:08 +0000"  >&lt;p&gt;Yes, please upload the logs to the ftp site, and could you also post the test result here? I want to know which files return -ESTALE.&lt;/p&gt;</comment>
                            <comment id="98287" author="ferner" created="Tue, 4 Nov 2014 17:01:25 +0000"  >&lt;p&gt;Apologies for the delay. After an unplanned power outage in our computer room I lost the debug files, I&apos;ve now recreated them. As I can&apos;t find the details of the ftp site at the moment, I&apos;ve made them available at &lt;a href=&quot;ftp://ftpanon.diamond.ac.uk/LU-5730/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;ftp://ftpanon.diamond.ac.uk/LU-5730/&lt;/a&gt; where they should be available for the next 7 days at least.&lt;/p&gt;

&lt;p&gt;The output of the reproducer run in the time covered by the debug output is below:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[bnh65367@cs03r-sc-serv-16 ~]$ cd /dls/i03/data/2014/cm4950-4/tmp/frederik/
[bnh65367@cs03r-sc-serv-16 frederik]$ ls
[bnh65367@cs03r-sc-serv-16 frederik]$ cd ../
[bnh65367@cs03r-sc-serv-16 tmp]$ ls
frederik  s  t
[bnh65367@cs03r-sc-serv-16 tmp]$ mv s frederik/
[bnh65367@cs03r-sc-serv-16 tmp]$ cd frederik/
[bnh65367@cs03r-sc-serv-16 frederik]$ rm -fr t; cp -r s t
cp: cannot create directory `t/P6322/3/nan&apos;: File exists
cp: cannot create directory `t/P6322/1&apos;: File exists
cp: cannot create regular file `t/P6322/5/nan/FEP_dmacps.sh.o4643338&apos;: Stale file handle
cp: cannot create regular file `t/P6322/5/nan/sad_fa.hkl&apos;: Stale file handle
cp: cannot create regular file `t/P6322/5/nan/sad_fa.lst&apos;: Stale file handle
cp: cannot create regular file `t/P6322/5/nan/FEP_zwpscl.sh.pe4643339&apos;: Stale file handle
cp: cannot create directory `t/P622/3/nan&apos;: File exists
cp: cannot create regular file `t/P622/21/nan/FEP_tfizjg.sh.po4643251&apos;: Stale file handle
cp: cannot create directory `t/P622/11/nan&apos;: File exists
cp: cannot create regular file `t/P6222/21/nan/FEP_rhvwgn.sh.o4643282&apos;: Stale file handle
cp: cannot create regular file `t/P6222/11/nan/FEP_lcbxae.sh.pe4643280&apos;: Stale file handle
cp: cannot create regular file `t/P6222/1/nan/FEP_nlauzb.sh.po4643231&apos;: Stale file handle
cp: cannot create regular file `t/P6222/5/nan/FEP_xicsnr.sh.o4643233&apos;: Stale file handle
[bnh65367@cs03r-sc-serv-16 frederik]$ 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
                            <comment id="98868" author="laisiyao" created="Tue, 11 Nov 2014 09:29:32 +0000"  >&lt;p&gt;This looks to be the same issue as &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3952&quot; title=&quot;llite_nfs.c:349:ll_get_parent()) ASSERTION( body-&amp;gt;valid &amp;amp; (0x00000001ULL) ) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3952&quot;&gt;&lt;del&gt;LU-3952&lt;/del&gt;&lt;/a&gt;: the MDS doesn&apos;t pack the parent fid in its reply, and ll_get_parent() returns -ESTALE.&lt;/p&gt;

&lt;p&gt;Unfortunately, my download of the MDS debug log failed with error -36 (I was on vacation last week, so I didn&apos;t download it in time). Could you check it? I need it to understand why the MDS doesn&apos;t pack the parent fid.&lt;/p&gt;</comment>
                            <comment id="98870" author="ferner" created="Tue, 11 Nov 2014 10:32:15 +0000"  >&lt;p&gt;I hope you had a good vacation.&lt;/p&gt;

&lt;p&gt;Anyway, I&apos;ve just checked and the files are still on the ftp server, let me know if you still really can&apos;t download them (or if they don&apos;t contain what you expected.)&lt;/p&gt;
</comment>
                            <comment id="98942" author="laisiyao" created="Wed, 12 Nov 2014 03:53:57 +0000"  >&lt;p&gt;I&apos;ve downloaded it successfully, thanks.&lt;/p&gt;</comment>
                            <comment id="98946" author="laisiyao" created="Wed, 12 Nov 2014 05:34:24 +0000"  >&lt;p&gt;in server log I found this:&lt;/p&gt;
&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;00000004:00000001:25.0:1415114896.229956:0:15519:0:(mdt_handler.c:1573:mdt_getattr_name()) Process entered
00000004:00000001:25.0:1415114896.229962:0:15519:0:(mdt_handler.c:1277:mdt_getattr_name_lock()) Process entered
00000004:00000001:25.0:1415114896.229963:0:15519:0:(mdt_handler.c:1220:mdt_raw_lookup()) Process entered
00000004:00000001:25.0:1415114896.229963:0:15519:0:(mdd_dir.c:113:mdd_lookup()) Process entered
00000004:00000001:25.0:1415114896.229964:0:15519:0:(mdd_dir.c:71:__mdd_lookup()) Process entered
00000004:00000001:25.0:1415114896.229964:0:15519:0:(mdd_permission.c:249:__mdd_permission_internal()) Process entered
00000004:00000001:25.0:1415114896.229966:0:15519:0:(mdd_permission.c:220:mdd_check_acl()) Process entered
00000004:00000001:25.0:1415114896.229967:0:15519:0:(lod_object.c:370:lod_xattr_get()) Process entered
00000004:00000001:25.0:1415114896.229968:0:15519:0:(lod_object.c:374:lod_xattr_get()) Process leaving (rc=108 : 108 : 6c)
00000004:00000001:25.0:1415114896.229970:0:15519:0:(mdd_permission.c:236:mdd_check_acl()) Process leaving (rc=18446744073709551603 : -13 : fffffffffffffff3)
00000004:00000001:25.0:1415114896.229971:0:15519:0:(mdd_permission.c:309:__mdd_permission_internal()) Process leaving (rc=18446744073709551603 : -13 : fffffffffffffff3)
00000004:00000001:25.0:1415114896.229972:0:15519:0:(mdd_dir.c:90:__mdd_lookup()) Process leaving (rc=18446744073709551603 : -13 : fffffffffffffff3)
00000004:00000001:25.0:1415114896.229973:0:15519:0:(mdd_dir.c:115:mdd_lookup()) Process leaving (rc=18446744073709551603 : -13 : fffffffffffffff3)
00000004:00000001:25.0:1415114896.229973:0:15519:0:(mdt_handler.c:1247:mdt_raw_lookup()) Process leaving (rc=1 : 1 : 1)
00000004:00000001:25.0:1415114896.229974:0:15519:0:(mdt_handler.c:1335:mdt_getattr_name_lock()) Process leaving (rc=0 : 0 : 0)
00000004:00000001:25.0:1415114896.229976:0:15519:0:(mdt_handler.c:1594:mdt_getattr_name()) Process leaving
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;When the Lustre client calls ll_get_parent() to connect a disconnected dentry to its parent, it calls md_getattr_name() to fetch the parent fid. But the log shows that when the MDS looked up &quot;..&quot; for the parent, it failed in mdd_check_acl(), so the parent fid was not piggybacked and finally -ESTALE was returned. Could you check the ACLs on the parent directories of the -ESTALE files, e.g. &quot;t/P6322/5/nan&quot; and &quot;t/P622/21/nan&quot;?&lt;/p&gt;

&lt;p&gt;BTW, earlier I asked you to disable ACLs for NFS, but this doesn&apos;t really disable ACLs for Lustre, so ACLs are still checked. If possible, could you disable ACLs for Lustre (this should be done on the MDS mount with the option &quot;noacl&quot;) and test again?&lt;/p&gt;</comment>
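An editorial aside on reading the trace above (not part of the original comment): the rc values in the "Process leaving" lines are printed as unsigned 64-bit integers, so rc=18446744073709551603 is the two's-complement encoding of -13, i.e. -EACCES, which is why the lookup fails with a permission error. A minimal sketch, assuming bash's 64-bit signed arithmetic:

```shell
# Sketch: reinterpret the trace's unsigned 64-bit rc as a signed errno.
# Bash arithmetic is 64-bit signed, so the out-of-range literal wraps around.
rc=18446744073709551603   # rc from the mdd_check_acl() "Process leaving" line
signed=$(( rc ))
echo "$signed"            # -13, i.e. -EACCES
```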
                            <comment id="101846" author="ferner" created="Wed, 17 Dec 2014 18:54:12 +0000"  >&lt;p&gt;Apologies for the delay in getting back to you. The test directory we used previously had been automatically cleaned before you asked for the ACLs, and then I lost the ability to reproduce the issue for a while (i.e. the test never triggered any problem). We have recently started to see the problem again.&lt;/p&gt;

&lt;p&gt;Unfortunately disabling ACLs completely on the production file system is not at all easy and I can&apos;t reproduce the problem on any of our test file systems even though they are running the same versions.&lt;/p&gt;

&lt;p&gt;On our production file system, where we see the problem, it doesn&apos;t always happen, but I&apos;ve just managed to reproduce it (without Lustre debugging). I&apos;ve collected the ACLs for the parent directory of the most recent files reported as &quot;stale file handle&quot;; see below. I&apos;m hoping some of this might be useful.&lt;/p&gt;

&lt;div class=&quot;preformatted panel&quot; style=&quot;border-width: 1px;&quot;&gt;&lt;div class=&quot;preformattedContent panelContent&quot;&gt;
&lt;pre&gt;[bnh65367@ws104 ff]$ rm -fr t; cp -r s t 
cp: cannot create regular file `t/P6322/21/nan/FEP_wgeoyi.sh.pe4643346&apos;: Stale file handle
cp: cannot create regular file `t/P6322/21/nan/FEP_wgeoyi.sh.o4643346&apos;: Stale file handle
cp: cannot create regular file `t/P6322/21/nan/FEP_ghvwzi.sh.po4643349&apos;: Stale file handle
cp: cannot create regular file `t/P6322/21/nan/FEP_ghvwzi.sh.pe4643349&apos;: Stale file handle
cp: cannot create directory `t/P6322/11&apos;: File exists
cp: cannot create directory `t/P6322/1&apos;: File exists
[bnh65367@ws104 ff]$ getfacl t/P6322/21/nan/
# file: t/P6322/21/nan/
# owner: bnh65367
# group: cm4950_5
user::rwx
user:i03user:rwx		#effective:r-x
user:gda2:rwx			#effective:r-x
user:i03detector:rwx		#effective:r-x
group::rwx			#effective:r-x
group:dcs:rwx			#effective:r-x
group:dls_dasc:rwx		#effective:r-x
group:dls_sysadmin:rwx		#effective:r-x
group:dls-detectors:rwx		#effective:r-x
group:i03_data:r-x
group:i03_staff:rwx		#effective:r-x
group:i03detector:rwx		#effective:r-x
group:cm4950_5:rwx		#effective:r-x
mask::r-x
other::---
default:user::rwx
default:user:i03user:rwx
default:user:gda2:rwx
default:user:i03detector:rwx
default:group::rwx
default:group:dcs:rwx
default:group:dls_dasc:rwx
default:group:dls_sysadmin:rwx
default:group:dls-detectors:rwx
default:group:i03_data:r-x
default:group:i03_staff:rwx
default:group:i03detector:rwx
default:group:cm4950_5:rwx
default:mask::rwx
default:other::---

[bnh65367@ws104 ff]$ ls -ld t/P6322/21/nan/
drwxrwx---+ 2 bnh65367 cm4950_5 4096 Dec 17 18:49 t/P6322/21/nan/
[bnh65367@ws104 ff]$ 
&lt;/pre&gt;
&lt;/div&gt;&lt;/div&gt;</comment>
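A note on the #effective annotations in the listing above (an illustrative sketch, not from the ticket): POSIX ACLs compute each named entry's effective permissions by masking the entry's own permissions with the mask:: entry, which is why mask::r-x downgrades the rwx entries to r-x. The masking can be sketched character-wise over the r, w, x positions:

```shell
# Sketch: a named ACL entry's effective permissions are its own permissions
# masked by the mask:: entry, computed here character-wise over r, w, x.
effective_perms() {
  local entry=$1 mask=$2 out="" i c m
  for i in 0 1 2; do
    c=${entry:$i:1}
    m=${mask:$i:1}
    if [ "$c" = "$m" ]; then   # both sides grant (or both deny) this bit
      out+=$c
    else                       # one side denies, so the bit is masked off
      out+=-
    fi
  done
  echo "$out"
}
effective_perms rwx r-x   # the group:dcs:rwx entry under mask::r-x prints r-x
```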
                            <comment id="102164" author="laisiyao" created="Mon, 22 Dec 2014 03:12:28 +0000"  >&lt;p&gt;I suspect this is a duplicate of &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3727&quot; title=&quot;LBUG (llite_nfs.c:281:ll_get_parent()) ASSERTION(body-&amp;gt;valid &amp;amp; OBD_MD_FLID) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3727&quot;&gt;&lt;del&gt;LU-3727&lt;/del&gt;&lt;/a&gt;, though the 2.5 code already has a workaround to avoid panicking on the client side. The patch for &lt;a href=&quot;https://jira.whamcloud.com/browse/LU-3727&quot; title=&quot;LBUG (llite_nfs.c:281:ll_get_parent()) ASSERTION(body-&amp;gt;valid &amp;amp; OBD_MD_FLID) failed&quot; class=&quot;issue-link&quot; data-issue-key=&quot;LU-3727&quot;&gt;&lt;del&gt;LU-3727&lt;/del&gt;&lt;/a&gt;, &lt;a href=&quot;http://review.whamcloud.com/#/c/7327/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/7327/&lt;/a&gt;, is in the review phase, and you may give it a try; it works, but I don&apos;t think it&apos;s the correct fix.&lt;/p&gt;</comment>
                            <comment id="102237" author="laisiyao" created="Tue, 23 Dec 2014 09:13:38 +0000"  >&lt;p&gt;Patch is ready, will you apply it and test again?&lt;/p&gt;</comment>
                            <comment id="102632" author="ferner" created="Tue, 6 Jan 2015 16:24:03 +0000"  >&lt;p&gt;Thanks for the updates!&lt;/p&gt;

&lt;p&gt;I&apos;ll certainly be interested in testing a version with the patch. Can you confirm whether we need to run the patched version on both the client and the MDS, or whether only the client/NFS server needs to be updated? The second option is certainly much easier, which is why I ask.&lt;/p&gt;</comment>
                            <comment id="102704" author="laisiyao" created="Wed, 7 Jan 2015 01:35:48 +0000"  >&lt;p&gt;It&apos;s a change on the MDS only, so you only need to update the MDS.&lt;/p&gt;</comment>
                            <comment id="102737" author="ferner" created="Wed, 7 Jan 2015 13:57:35 +0000"  >&lt;p&gt;I did try to cherry-pick that patch onto our b2_5 branch, which we already run on the MDS. Unfortunately the patch doesn&apos;t apply cleanly to b2_5. Is there a version of the patch for Lustre 2.5? (I don&apos;t really want to move to 2.6 on our servers just yet.)&lt;/p&gt;</comment>
                            <comment id="102744" author="laisiyao" created="Wed, 7 Jan 2015 15:44:27 +0000"  >&lt;p&gt;The patch for 2.5 is at &lt;a href=&quot;http://review.whamcloud.com/#/c/13270/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;http://review.whamcloud.com/#/c/13270/&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="108770" author="ferner" created="Wed, 4 Mar 2015 18:47:55 +0000"  >&lt;p&gt;Sorry for the long silence.&lt;/p&gt;

&lt;p&gt;So far we&apos;ve never been able to reproduce this on our test file systems. However, we have a maintenance window coming up on our production file system, where users are seeing the problem with increasing frequency, and looking at the code review for the patch (for 2.5) it seems relatively safe to apply to our production file system. I have, however, noticed that it doesn&apos;t seem to have landed on 2.5 yet? Could someone confirm whether it should be safe enough to try on a production file system (after running it on our test file system for a while...)?&lt;/p&gt;</comment>
                            <comment id="108782" author="pjones" created="Wed, 4 Mar 2015 19:30:11 +0000"  >&lt;p&gt;Frederik&lt;/p&gt;

&lt;p&gt;We do know of other sites running with this code&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="110152" author="ferner" created="Thu, 19 Mar 2015 19:58:25 +0000"  >&lt;p&gt;Peter,&lt;/p&gt;

&lt;p&gt;thanks for the confirmation. We have today upgraded our servers to 2.5.3 plus the suggested patch. Time will tell; so far we&apos;ve only been able to reproduce this after the file system has been up for a few weeks.&lt;/p&gt;

&lt;p&gt;Frederik&lt;/p&gt;</comment>
                            <comment id="110153" author="pjones" created="Thu, 19 Mar 2015 20:05:48 +0000"  >&lt;p&gt;Thanks for the update Frederik. Keep us posted.&lt;/p&gt;</comment>
                            <comment id="113246" author="pjones" created="Thu, 23 Apr 2015 17:36:13 +0000"  >&lt;p&gt;Hi Frederik&lt;/p&gt;

&lt;p&gt;I&apos;m just checking in to see whether you are now comfortable considering this issue resolved by the patch, or whether you want to see a longer stretch without a recurrence.&lt;/p&gt;

&lt;p&gt;Peter&lt;/p&gt;</comment>
                            <comment id="113299" author="ferner" created="Fri, 24 Apr 2015 14:46:59 +0000"  >&lt;p&gt;Peter,&lt;/p&gt;

&lt;p&gt;thanks for checking. Judging by the time it previously took our users to experience the issue, I might prefer to leave the ticket open a while longer. On the other hand, we can always re-open it if we see the same problem again.&lt;/p&gt;

&lt;p&gt;However, in the meantime we are now seeing a (potentially) different problem with NFS. This time we don&apos;t get stale NFS file handle errors; we get permission denied instead when trying to create a file in a newly created directory. This is intermittent as well, but reproducible. I&apos;m currently unsure whether I should continue using this ticket or open a new one; I&apos;m leaning towards opening a new ticket to avoid confusion. (I&apos;m also currently still gathering information...)&lt;/p&gt;

&lt;p&gt;Cheers,&lt;br/&gt;
Frederik&lt;/p&gt;</comment>
                            <comment id="126960" author="ferner" created="Thu, 10 Sep 2015 18:10:42 +0000"  >&lt;p&gt;A quick update: we have not had any reports from our users that they are still seeing this, neither with the patched version nor with 2.7, which we are running now. So I guess this can be closed.&lt;/p&gt;</comment>
                            <comment id="126964" author="pjones" created="Thu, 10 Sep 2015 18:18:40 +0000"  >&lt;p&gt;Great - thanks for confirming&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10011">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="29719">LU-6528</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="20245">LU-3727</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="16268" name="mds_server_debug.xz" size="231" author="ferner" created="Wed, 29 Oct 2014 14:10:23 +0000"/>
                            <attachment id="16267" name="nfs_server_debug.xz" size="11856" author="ferner" created="Wed, 29 Oct 2014 14:10:23 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                                                                                                                                                            <customfield id="customfield_10890" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10390" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzwyd3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10090" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>16090</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10060" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Severity</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10022"><![CDATA[3]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        </customfields>
    </item>
</channel>
</rss>