[LU-8012] lustre 2.8.0 getattr error rc = -2 Created: 12/Apr/16  Updated: 19/Apr/17  Resolved: 19/Apr/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: Lustre 2.9.0

Type: Bug Priority: Minor
Reporter: christopher coffey Assignee: Emoly Liu
Resolution: Duplicate Votes: 0
Labels: easy, llnl
Environment:

Lustre 2.8.0 (Intel) client and server; EL6.7 clients and servers; kernel 2.6.32-573.12.1.el6.x86_64 on client and server; Red Hat OFED on client and server; Mellanox FDR HCA on client and server; 7 combined OSS/OST


Issue Links:
Duplicate
duplicates LU-7122 Document -n switch for lctl changelog... Resolved
Severity: 2
Rank (Obsolete): 9223372036854775807

 Description   

The combined MDS/MGS server reports the following in /var/log/messages:

Apr 11 09:33:39 mds1 kernel: LustreError: 43754:0:(mdt_handler.c:893:mdt_getattr_internal()) blizzard-MDT0000: getattr error for [0x2000403f7:0x1e0f4:0x0]: rc = -2

The error (rc = -2, i.e. -ENOENT) implies that the file does not exist, and lfs fid2path reports the same:

sudo lfs fid2path /scratch 0x2000403f7:0x1e0f4:0x0
fid2path: error on FID 0x2000403f7:0x1e0f4:0x0: No such file or directory

Thanks,
Chris
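
For readers unfamiliar with kernel return codes: the rc = -2 in the log line is a negated errno value. A minimal sketch of decoding it, assuming a Linux host (the helper name describe_rc is hypothetical and not part of any Lustre tool):

```python
import errno
import os


def describe_rc(rc):
    """Map a negative kernel-style return code to (errno name, message)."""
    code = -rc  # kernel code returns -errno, so flip the sign
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


name, message = describe_rc(-2)
print(name, "-", message)
```

On Linux this should print the ENOENT name together with the same "No such file or directory" text that lfs fid2path reported.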



 Comments   
Comment by Alex Zhuravlev [ 12/Apr/16 ]

This is actually a valid situation under load, where a few threads are working on the same set of files. What kind of load were you running?

Comment by christopher coffey [ 12/Apr/16 ]

Hi Alex,

I haven't isolated which app/workload is creating these messages. The workloads are very diverse, ranging from MPI to simple single-threaded apps.

Comment by Olaf Faaland [ 28/Apr/16 ]

Hi Alex,

We are seeing this as well. I haven't yet identified the sequence of events. We see it with a set of test scripts that race to mkdir/rmdir/create/unlink/read/write within a common directory, in a randomized manner. I'll try to narrow the set of operations down.

Am I correct that the object existed at the very beginning of mdt_getattr (I see an assert there), but no longer exists by the time mdt_getattr->mdt_getattr_internal->mdt_attr_get_complex runs?

Is this supposed to be protected entirely by an LDLM lock held by the client?

thanks,
Olaf
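
The kind of race described above can be illustrated with a small local sketch. This is not the test suite mentioned in the comment and it does not exercise Lustre or the MDT code path; it only shows, on an ordinary POSIX filesystem, how threads racing on one file see ENOENT. The function name race_once and the file name "victim" are hypothetical.

```python
import os
import tempfile
import threading


def race_once(directory):
    """Two threads race to unlink one file while a third stats it."""
    path = os.path.join(directory, "victim")
    open(path, "w").close()  # the file exists before any thread starts

    outcomes = []
    lock = threading.Lock()
    start = threading.Barrier(3)  # release all three threads together

    def attempt(op):
        start.wait()
        try:
            op(path)
            outcome = "ok"
        except FileNotFoundError:  # errno 2, ENOENT
            outcome = "ENOENT"
        with lock:
            outcomes.append((op.__name__, outcome))

    threads = [threading.Thread(target=attempt, args=(op,))
               for op in (os.unlink, os.unlink, os.stat)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(outcomes)


with tempfile.TemporaryDirectory() as scratch:
    for name, outcome in race_once(scratch):
        print(name, outcome)
```

Exactly one unlink wins and the other gets ENOENT; the stat result depends on timing. On a local filesystem the loser simply sees an error return, whereas on the affected Lustre versions the equivalent lookup on the server side was also logged as a LustreError.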

Comment by Christopher Morrone [ 30/Aug/16 ]

Still seeing this in testing. We need to get this message resolved. A flood of console messages that sysadmins have to ignore leads to sysadmins who stop looking at the logs, and then we miss important things.
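
Until a fixed release is deployed, one way to gauge how noisy the message is would be to count it per target and return code. A rough sketch, assuming the log line format matches the one quoted in the description; the regular expression and the summarize helper are illustrative assumptions, not an existing Lustre utility:

```python
import re
from collections import Counter

# Pattern modeled on the single log line quoted in the description.
GETATTR_RE = re.compile(
    r"LustreError: .*mdt_getattr_internal\(\)\) (?P<target>\S+): "
    r"getattr error for \[(?P<fid>[0-9a-fx:]+)\]: rc = (?P<rc>-?\d+)"
)


def summarize(lines):
    """Count getattr error messages by (target, rc)."""
    counts = Counter()
    for line in lines:
        match = GETATTR_RE.search(line)
        if match:
            counts[(match.group("target"), int(match.group("rc")))] += 1
    return counts


sample = (
    "Apr 11 09:33:39 mds1 kernel: LustreError: "
    "43754:0:(mdt_handler.c:893:mdt_getattr_internal()) blizzard-MDT0000: "
    "getattr error for [0x2000403f7:0x1e0f4:0x0]: rc = -2"
)
print(summarize([sample]))
```

In practice the lines would come from /var/log/messages rather than a hard-coded sample; anything with an rc other than -2 would still deserve a closer look.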

Comment by Andreas Dilger [ 19/Apr/17 ]

This is a normal situation if there are multiple threads racing to unlink a single file. The patch http://review.whamcloud.com/18145 "LU-7712 mdd: migration is too noisy" turned off this error for the common -ENOENT case.

Comment by Andreas Dilger [ 19/Apr/17 ]

Patch landed as v2_8_55_0-133-gc3e03f3 so it is included in 2.9.0.
