[LU-5730] intermittent I/O errors for some directories Created: 13/Oct/14  Updated: 10/Sep/15  Resolved: 10/Sep/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.5.2
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Frederik Ferner (Inactive) Assignee: Lai Siyao
Resolution: Fixed Votes: 0
Labels: None
Environment:

Lustre 2.5.2 on RHEL6 servers and clients, NFS exported, ACLs


Attachments: Text File mds_server_debug.xz     File nfs_server_debug.xz    
Issue Links:
Related
is related to LU-6528 permission denied over NFS (intermittent) Resolved
is related to LU-3727 LBUG (llite_nfs.c:281:ll_get_parent()... Resolved
Severity: 3
Rank (Obsolete): 16090

 Description   

Our users have reported an issue where they suddenly have problems editing a file in a directory; they also get I/O errors, for example when trying to get the ACLs for that directory. In at least one instance the problem resolved itself overnight after we had decided to investigate in more detail later; in another case today the problem went away when we renamed the problematic directory.

In the most recent instance today we had some time to try to understand the issue, and this is what we have found so far: while the problem persists, some clients see an I/O error when calling getfacl, while other clients run the same commands without any problems and get the expected results. Some machines access this directory over NFS, exported from one of our clients that was showing problems in this instance; they had the same issues. Attempting to edit a file in the problematic directory with vim produced the message that the .swp file already exists, even for new files. Creating new files in the directory, for example with touch, worked without problems.

There are no error messages recorded by syslog on any of the machines involved.

We've mostly run out of ideas for what to look for next to resolve this if it happens again.



 Comments   
Comment by Peter Jones [ 13/Oct/14 ]

Lai

What do you advise here? Could this be covered by the ACL issues fixed in the upcoming 2.5.4 release (LU-5150/LU-3660/LU-5434)?

Peter

Comment by Andreas Dilger [ 14/Oct/14 ]

How big is the ACL in this case? There was a bug in the past where, if the ACL was too large (30-ish entries), accessing it returned an error. Are the working and failing clients running the same Lustre version?

Comment by Frederik Ferner (Inactive) [ 16/Oct/14 ]

The ACL in this case isn't that big: ~13 ACL entries and 15 default ACL entries on the directory.

We are aware of the issues with ACLs in the past; the magic number was 32 ACL entries and 32 default ACL entries (at least on 1.8), and there was a bug that allowed setting 33 default ACL entries on a directory, which made all files created in that directory inaccessible. That was a permanent issue; we still have some of these files hidden somewhere on the file system, though not in the directories causing problems now.
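
For reference, a quick way to count how many access and default ACL entries a directory carries, to compare against the limits mentioned above (the path below is just an example):

getfacl --omit-header /path/to/dir | grep -Ec '^(user|group|mask|other):'   # access ACL entries
getfacl --omit-header /path/to/dir | grep -c '^default:'                    # default ACL entries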

Comment by Frederik Ferner (Inactive) [ 28/Oct/14 ]

Today I have looked into this a bit further; I can reproduce this, or a similar issue, in a directory without any ACLs (other than Unix mode bits):

On an NFS client, running 'cp -r' on a directory tree with ~300 files and 45 directories, a number of files usually fail to be copied to the destination and we see errors about "Stale file handle" and "File exists":

[bnh65367@ws104 noacl]$ rm -fr t  ; cp -r s t
cp: cannot create regular file `t/fast_ep/P6322/5/nan/FEP_dmacps.sh.pe4643338': Stale file handle
cp: cannot create directory `t/fast_ep/P622/3': File exists
cp: cannot create regular file `t/fast_ep/P622/11/nan/sad_fa.lst': Stale file handle
cp: cannot create regular file `t/fast_ep/P622/11/nan/FEP_lwdype.sh.o4643261': Stale file handle
cp: cannot create regular file `t/fast_ep/P622/11/nan/FEP_lwdype.sh.pe4643261': Stale file handle
cp: cannot create regular file `t/fast_ep/P6222/21/nan/FEP_sjtixa.sh.o4643283': Stale file handle
cp: cannot create regular file `t/fast_ep/P6222/11/nan/FEP_qaepuy.sh.po4643281': Stale file handle
cp: cannot create regular file `t/fast_ep/P6222/1/nan/FEP_nlauzb.sh.po4643231': Stale file handle
cp: cannot create regular file `t/fast_ep/P6222/1/nan/FEP_soevgk.sh.pe4643243': Stale file handle
cp: cannot create regular file `t/fast_ep/P6222/1/nan/FEP_soevgk.sh': Stale file handle
[bnh65367@ws104 noacl]$ find s -type f -print | wc -l
339
[bnh65367@ws104 noacl]$ find s -type d -print | wc -l
45
[bnh65367@ws104 noacl]$

And repeating the same tests directly on the NFS server runs the cp without any errors.
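
Since the failure is intermittent, a loop along these lines may help catch a failing run on the NFS client (paths are examples; 's' is the source tree from the transcript above):

cd /mnt/nfs/noacl || exit 1
for i in $(seq 1 20); do
    rm -rf t
    cp -r s t 2>cp.err || { echo "run $i failed:"; cat cp.err; }   # keep the errors from the failing run
done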

Comment by Lai Siyao [ 29/Oct/14 ]

Could you upload messages and debug logs? In the meantime I'll try to reproduce it locally.

Comment by Frederik Ferner (Inactive) [ 29/Oct/14 ]

syslog doesn't contain anything Lustre related on the NFS server, and on the MDS I can see only a few reconnects for other clients, mainly due to other work; nothing during the time I ran the tests. Nothing at all in syslog on the NFS client.

I've just repeated the test (at Wed Oct 29 13:58:30 GMT 2014).

I'm also not quite sure which debug log you wanted, so I've collected the output of "lctl debug_kernel" for the NFS server and the MDS just after running my latest test and will attach them. Unfortunately the file system is our production file system and is likely to be quite busy at the moment, so there might be other stuff in there...

Also, running the same test in a different directory, using a different NFS server, works without problems (using the same source directory).

Comment by Lai Siyao [ 30/Oct/14 ]

It seems this debug log doesn't contain the -ESTALE/-EEXIST errors; maybe these error codes come from the NFS code. Are you using NFSv3? And can you make sure this is ACL related (mount NFS with noacl and test again)?

It's fine to use `lctl debug_kernel` to collect debug logs. Could you run `lctl set_param debug=+trace` on the NFS server and the MDS to enable more debugging, and when the problem is reproduced, collect the logs and upload them?
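
A minimal sketch of that collection sequence, assuming it is run on both the NFS-exporting client and the MDS (the output path is an example):

lctl set_param debug=+trace               # add 'trace' to the current debug mask
lctl clear                                # empty the debug buffer before the test
# ... reproduce the problem ...
lctl debug_kernel /tmp/lustre-debug.txt   # dump the buffer to a text file for upload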

Comment by Frederik Ferner (Inactive) [ 30/Oct/14 ]

Yes, we are using NFSv3 with the kernel NFS server. I've repeated the test with the file system mounted with noacl on the NFS client; same result, a couple of stale file handles.

I've also captured a new set of debug logs, with trace turned on, collected with lctl debug_kernel as soon as possible after completing the tests. As the file system is in use, I also started a debug_daemon on the NFS server and the MDS, with 2GB and 8GB file sizes, to try to capture the event; obviously those files are too big to upload here. Grepping for "stale" (using "grep -i" to ignore case) did not return anything; I'm not sure if you were expecting or hoping for that string to be there. I've still got the files, so if you want them we'll need to get them to you somehow.

This new test produced only stale file handle errors.
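
For completeness, the debug_daemon capture mentioned above looks roughly like this; the file path is an example and the size argument is in MB:

lctl debug_daemon start /tmp/lustre-debug.bin 2048            # e.g. a 2GB capture file on the NFS server
# ... reproduce the problem ...
lctl debug_daemon stop
lctl debug_file /tmp/lustre-debug.bin /tmp/lustre-debug.txt   # convert the binary dump to text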

Comment by Lai Siyao [ 31/Oct/14 ]

Yes, please upload the logs to the ftp site, and could you also post the test output here? I want to know which files return -ESTALE.

Comment by Frederik Ferner (Inactive) [ 04/Nov/14 ]

Apologies for the delay. After an unplanned power outage in our computer room I lost the debug files; I've now recreated them. As I can't find the details of the ftp site at the moment, I've made them available at ftp://ftpanon.diamond.ac.uk/LU-5730/ where they should be available for at least the next 7 days.

The output of the reproducer run in the time covered by the debug output is below:

[bnh65367@cs03r-sc-serv-16 ~]$ cd /dls/i03/data/2014/cm4950-4/tmp/frederik/
[bnh65367@cs03r-sc-serv-16 frederik]$ ls
[bnh65367@cs03r-sc-serv-16 frederik]$ cd ../
[bnh65367@cs03r-sc-serv-16 tmp]$ ls
frederik  s  t
[bnh65367@cs03r-sc-serv-16 tmp]$ mv s frederik/
[bnh65367@cs03r-sc-serv-16 tmp]$ cd frederik/
[bnh65367@cs03r-sc-serv-16 frederik]$ rm -fr t; cp -r s t
cp: cannot create directory `t/P6322/3/nan': File exists
cp: cannot create directory `t/P6322/1': File exists
cp: cannot create regular file `t/P6322/5/nan/FEP_dmacps.sh.o4643338': Stale file handle
cp: cannot create regular file `t/P6322/5/nan/sad_fa.hkl': Stale file handle
cp: cannot create regular file `t/P6322/5/nan/sad_fa.lst': Stale file handle
cp: cannot create regular file `t/P6322/5/nan/FEP_zwpscl.sh.pe4643339': Stale file handle
cp: cannot create directory `t/P622/3/nan': File exists
cp: cannot create regular file `t/P622/21/nan/FEP_tfizjg.sh.po4643251': Stale file handle
cp: cannot create directory `t/P622/11/nan': File exists
cp: cannot create regular file `t/P6222/21/nan/FEP_rhvwgn.sh.o4643282': Stale file handle
cp: cannot create regular file `t/P6222/11/nan/FEP_lcbxae.sh.pe4643280': Stale file handle
cp: cannot create regular file `t/P6222/1/nan/FEP_nlauzb.sh.po4643231': Stale file handle
cp: cannot create regular file `t/P6222/5/nan/FEP_xicsnr.sh.o4643233': Stale file handle
[bnh65367@cs03r-sc-serv-16 frederik]$ 
Comment by Lai Siyao [ 11/Nov/14 ]

This looks to be the same issue as LU-3952: the MDS doesn't pack the parent fid in the reply, and ll_get_parent() returns -ESTALE.

But unfortunately I failed to download the MDS debug log; the download failed with -36 (I was on vacation last week, so I didn't download it in time). Could you check it? I need it to understand why the MDS doesn't pack the parent fid.

Comment by Frederik Ferner (Inactive) [ 11/Nov/14 ]

I hope you had a good vacation.

Anyway, I've just checked and the files are still on the ftp server; let me know if you still can't download them (or if they don't contain what you expected).

Comment by Lai Siyao [ 12/Nov/14 ]

I've downloaded it successfully, thanks.

Comment by Lai Siyao [ 12/Nov/14 ]

In the server log I found this:

00000004:00000001:25.0:1415114896.229956:0:15519:0:(mdt_handler.c:1573:mdt_getattr_name()) Process entered
00000004:00000001:25.0:1415114896.229962:0:15519:0:(mdt_handler.c:1277:mdt_getattr_name_lock()) Process entered
00000004:00000001:25.0:1415114896.229963:0:15519:0:(mdt_handler.c:1220:mdt_raw_lookup()) Process entered
00000004:00000001:25.0:1415114896.229963:0:15519:0:(mdd_dir.c:113:mdd_lookup()) Process entered
00000004:00000001:25.0:1415114896.229964:0:15519:0:(mdd_dir.c:71:__mdd_lookup()) Process entered
00000004:00000001:25.0:1415114896.229964:0:15519:0:(mdd_permission.c:249:__mdd_permission_internal()) Process entered
00000004:00000001:25.0:1415114896.229966:0:15519:0:(mdd_permission.c:220:mdd_check_acl()) Process entered
00000004:00000001:25.0:1415114896.229967:0:15519:0:(lod_object.c:370:lod_xattr_get()) Process entered
00000004:00000001:25.0:1415114896.229968:0:15519:0:(lod_object.c:374:lod_xattr_get()) Process leaving (rc=108 : 108 : 6c)
00000004:00000001:25.0:1415114896.229970:0:15519:0:(mdd_permission.c:236:mdd_check_acl()) Process leaving (rc=18446744073709551603 : -13 : fffffffffffffff3)
00000004:00000001:25.0:1415114896.229971:0:15519:0:(mdd_permission.c:309:__mdd_permission_internal()) Process leaving (rc=18446744073709551603 : -13 : fffffffffffffff3)
00000004:00000001:25.0:1415114896.229972:0:15519:0:(mdd_dir.c:90:__mdd_lookup()) Process leaving (rc=18446744073709551603 : -13 : fffffffffffffff3)
00000004:00000001:25.0:1415114896.229973:0:15519:0:(mdd_dir.c:115:mdd_lookup()) Process leaving (rc=18446744073709551603 : -13 : fffffffffffffff3)
00000004:00000001:25.0:1415114896.229973:0:15519:0:(mdt_handler.c:1247:mdt_raw_lookup()) Process leaving (rc=1 : 1 : 1)
00000004:00000001:25.0:1415114896.229974:0:15519:0:(mdt_handler.c:1335:mdt_getattr_name_lock()) Process leaving (rc=0 : 0 : 0)
00000004:00000001:25.0:1415114896.229976:0:15519:0:(mdt_handler.c:1594:mdt_getattr_name()) Process leaving

When the Lustre client calls ll_get_parent() to connect a disconnected dentry to its parent, it calls md_getattr_name() to fetch the parent fid. But as the log shows, when the MDS looks up ".." for the parent it fails in mdd_check_acl(), so the parent fid is not piggybacked in the reply and -ESTALE is eventually returned. Could you check the ACL on the parent directory of the -ESTALE files, e.g. "t/P6322/5/nan", "t/P622/21/nan"?

BTW, I earlier asked you to disable ACLs for NFS, but that doesn't really disable ACLs for Lustre, so the ACL is still checked. If possible, could you disable ACLs for Lustre (this should be done on the MDS by mounting with the option "noacl") and test again?
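
A rough sketch of what disabling ACLs on the MDS would look like; the MDT device and mount point below are placeholders for the real ones:

umount /mnt/mdt
mount -t lustre -o noacl /dev/mdt_device /mnt/mdt   # remount the MDT with ACL support disabled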

Comment by Frederik Ferner (Inactive) [ 17/Dec/14 ]

Apologies for the delay in getting back to you. The test directory we used previously had been automatically cleaned before you asked for the ACLs, and then I lost the ability to reproduce the problem for a while (i.e. the test never triggered it). We have recently started to see the problem again.

Unfortunately, disabling ACLs completely on the production file system is not at all easy, and I can't reproduce the problem on any of our test file systems even though they are running the same versions.

On our production file system, where we see the problem, it doesn't always happen, but I've just managed to reproduce it (without Lustre debugging). I've collected the ACLs for the parent directory of the most recent files reported as "stale file handle", see below. I'm hoping some of this might be useful.

[bnh65367@ws104 ff]$ rm -fr t; cp -r s t 
cp: cannot create regular file `t/P6322/21/nan/FEP_wgeoyi.sh.pe4643346': Stale file handle
cp: cannot create regular file `t/P6322/21/nan/FEP_wgeoyi.sh.o4643346': Stale file handle
cp: cannot create regular file `t/P6322/21/nan/FEP_ghvwzi.sh.po4643349': Stale file handle
cp: cannot create regular file `t/P6322/21/nan/FEP_ghvwzi.sh.pe4643349': Stale file handle
cp: cannot create directory `t/P6322/11': File exists
cp: cannot create directory `t/P6322/1': File exists
[bnh65367@ws104 ff]$ getfacl t/P6322/21/nan/
# file: t/P6322/21/nan/
# owner: bnh65367
# group: cm4950_5
user::rwx
user:i03user:rwx		#effective:r-x
user:gda2:rwx			#effective:r-x
user:i03detector:rwx		#effective:r-x
group::rwx			#effective:r-x
group:dcs:rwx			#effective:r-x
group:dls_dasc:rwx		#effective:r-x
group:dls_sysadmin:rwx		#effective:r-x
group:dls-detectors:rwx		#effective:r-x
group:i03_data:r-x
group:i03_staff:rwx		#effective:r-x
group:i03detector:rwx		#effective:r-x
group:cm4950_5:rwx		#effective:r-x
mask::r-x
other::---
default:user::rwx
default:user:i03user:rwx
default:user:gda2:rwx
default:user:i03detector:rwx
default:group::rwx
default:group:dcs:rwx
default:group:dls_dasc:rwx
default:group:dls_sysadmin:rwx
default:group:dls-detectors:rwx
default:group:i03_data:r-x
default:group:i03_staff:rwx
default:group:i03detector:rwx
default:group:cm4950_5:rwx
default:mask::rwx
default:other::---

[bnh65367@ws104 ff]$ ls -ld t/P6322/21/nan/
drwxrwx---+ 2 bnh65367 cm4950_5 4096 Dec 17 18:49 t/P6322/21/nan/
[bnh65367@ws104 ff]$ 
Comment by Lai Siyao [ 22/Dec/14 ]

I suspect this is a duplicate of LU-3727, though the 2.5 code already has a workaround to avoid the panic on the client side. The patch for LU-3727, http://review.whamcloud.com/#/c/7327/, is in the review phase and you may give it a try; it works, but I don't think it's correct.

Comment by Lai Siyao [ 23/Dec/14 ]

The patch is ready; will you apply it and test again?

Comment by Frederik Ferner (Inactive) [ 06/Jan/15 ]

Thanks for the updates!

I'll certainly be interested in testing a version with the patch. Can you confirm whether we need to run the patched version on both the client and the MDS, or whether only the client/NFS server needs to be updated? The second option is certainly much easier, which is why I ask.

Comment by Lai Siyao [ 07/Jan/15 ]

It's a change on the MDS only, so you only need to update the MDS.

Comment by Frederik Ferner (Inactive) [ 07/Jan/15 ]

I did try to cherry-pick that patch to our b2_5 branch, which we already run on the MDS. Unfortunately it seems the patch doesn't apply cleanly to b2_5. Is there a version of the patch for Lustre 2.5? (I don't really want to move to 2.6 on our servers just yet.)

Comment by Lai Siyao [ 07/Jan/15 ]

The patch for 2.5 is at http://review.whamcloud.com/#/c/13270/
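
One way to pull that change from Gerrit and apply it to a local b2_5 tree; the project path and the patchset number (/1) are assumptions, so check the download links on the change page for the exact ref:

git fetch http://review.whamcloud.com/fs/lustre-release refs/changes/70/13270/1   # fetch the change
git cherry-pick FETCH_HEAD                                                         # apply it to the local branch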

Comment by Frederik Ferner (Inactive) [ 04/Mar/15 ]

Sorry for the long silence.

So far we've never been able to reproduce this on our test file systems. However, we have a maintenance window coming up on our production file system, where users are seeing the problem with increasing frequency. Looking at the code review for the (2.5) patch, it seems relatively safe to apply to our production file system. I have, however, noticed that it doesn't seem to have landed on 2.5 yet? Could someone confirm whether it should be safe enough to try on a production file system (after running it on our test file system for a while)?

Comment by Peter Jones [ 04/Mar/15 ]

Frederik

We do know of other sites running with this code

Peter

Comment by Frederik Ferner (Inactive) [ 19/Mar/15 ]

Peter,

thanks for the confirmation. We have today upgraded our servers to 2.5.3 plus the suggested patch. Time will tell; so far we've only been able to reproduce this after the file system has been up for a few weeks.

Frederik

Comment by Peter Jones [ 19/Mar/15 ]

Thanks for the update Frederik. Keep us posted.

Comment by Peter Jones [ 23/Apr/15 ]

Hi Frederik

I'm just checking in to see whether you are now comfortable considering this issue resolved by the patch, or whether you want to see a longer stretch without a recurrence.

Peter

Comment by Frederik Ferner (Inactive) [ 24/Apr/15 ]

Peter,

thanks for checking. Judging by how long it previously took our users to experience the issue, I might prefer to leave the ticket open a while longer. On the other hand, we can always re-open it if we see the same problem again.

However, in the meantime we are now seeing a (potentially) different problem with NFS. This time we don't get stale NFS file handle errors; we get permission denied instead when trying to create a file in a newly created directory. This is intermittent as well but reproducible. I'm currently unsure whether I should continue using this ticket or open a new one; I'm leaning towards a new ticket to avoid confusion. (I'm also still gathering information...)

Cheers,
Frederik

Comment by Frederik Ferner (Inactive) [ 10/Sep/15 ]

A quick update. We have not had any reports from our users that they are still seeing this, neither with the patched version nor with 2.7, which we are running now. So I guess this can be closed.

Comment by Peter Jones [ 10/Sep/15 ]

Great - thanks for confirming
