[LU-11970] Using changelog reader causes fid2path process to lockup in kernel space Created: 14/Feb/19  Updated: 23/Mar/19  Resolved: 23/Mar/19

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.8.0
Fix Version/s: None

Type: Question/Request Priority: Minor
Reporter: Joe Mervini Assignee: Mikhail Pershin
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Dell servers running TOSS (RHEL 7.5) IB connected to DDN SFA hardware


Issue Links:
Duplicate
duplicates LU-8821 double find in mdt_path_current() Resolved
Related
is related to LU-11501 use the dcache properly with .lustre/fid In Progress
Epic/Theme: Lustre-2.8.0
Rank (Obsolete): 9223372036854775807

 Description   

We are evaluating Starfish for file system usage detail and have successfully scanned numerous Lustre and NFS file systems. However, when we start ingesting data via changelogs, we run into a hang whose only resolution is to power cycle the client we are testing with.

In trying to identify the problem, we found that when we tried to unmount the file system, the unmount process would also hang and could not be aborted.

lsof on the file system showed that the hung process was stuck on /<file system>/.lustre/fid. Up until this point, we did not know that this hidden directory existed, nor its purpose. From scanning Jira, it appears to be involved in Lustre rsync and lfsck operations, but there is not a lot of information regarding other roles it plays.

One thing is certain: Starfish uses FIDs in their monitoring tools, and we can see that .lustre/fid is being identified by the Starfish process.

We're hoping we can get some additional information on what's going on with changelogs/.lustre.



 Comments   
Comment by Peter Jones [ 14/Feb/19 ]

Mike

Can you advise please?

Thanks

Peter

Comment by Peter Jones [ 14/Feb/19 ]

Joe

Is this the Astra system?

Peter

Comment by Joe Mervini [ 14/Feb/19 ]

Peter

No - this is on our regular production clusters. We're seeing this behavior on all three of the file systems. 

One thing we're curious about is why Lustre is spitting out FIDs to a directory that is essentially unreadable. One clue: the system locked up on a FID that was not in the .lustre directory (although lsof still showed a process holding the .lustre directory), but once the system was rebooted, running fid2path on that FID produced a "no such file or directory" message.

When a file is deleted, does it have any interaction with the .lustre directory? If so, could this be a race condition? Another question: is there any client-side read operation that would generate a changelog record?

Comment by Joe Mervini [ 19/Feb/19 ]

Peter,

Has there been any activity on this ticket?

Regards,

Joe

Comment by Mikhail Pershin [ 20/Feb/19 ]

Joe, I am looking into this. Since this is about Lustre 2.8.0, I am checking tickets that may be related to this problem; it has probably been addressed already.

Comment by Mikhail Pershin [ 22/Feb/19 ]

This can be resolved by LU-8821; there was a potential deadlock case in the code. The patch in that ticket can be updated for 2.8 if needed.

As for the other questions above: the /.lustre/fid directory provides access to a file by its FID and is often used to get the paths to that file from its LinkEA attribute, which is exactly what fid2path does. If you know the FID of an object, you can access and modify it but cannot delete it.
The only interaction with deleted files I can think of is that unlinked files which are still open cannot be found by FID; you will get a 'no such file ..' message while the file is still being used by some process. See LU-11638 for details.
As for read operations logged in the changelog, I assume you meant 'non-modification' operations. Yes, we have at least CL_GETXATTR and also CL_DN_OPEN, which are non-modification operations but can be recorded in the changelog. CL_OPEN can also be enabled to track OPENs.
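
For readers unfamiliar with the .lustre/fid access described above, the following is a minimal Python sketch of how a changelog consumer might address a file by FID. It is illustrative only and is not taken from Starfish or from the Lustre sources. The mount point /mnt/lustre and all function names are hypothetical; the only external pieces assumed are the .lustre/fid directory under the client mount point and the `lfs fid2path` command. The sketch treats "no such file" as an expected outcome, matching the note above about unlinked files. It does not work around the deadlock discussed in this ticket: on an affected client the underlying call can still block in the kernel.

```python
import os
import re

# Text form of a FID: [sequence:object id:version], each field in hex.
# The brackets are treated as optional on input and always added on output.
FID_RE = re.compile(r"^\[?(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]?$")


def normalize_fid(fid):
    """Return the FID in bracketed form, or raise ValueError if malformed."""
    match = FID_RE.match(fid.strip())
    if match is None:
        raise ValueError("not a FID: %r" % (fid,))
    return "[{}:{}:{}]".format(*match.groups())


def fid_path(mount_point, fid):
    """Build the path that addresses an object by FID under .lustre/fid."""
    return os.path.join(mount_point, ".lustre", "fid", normalize_fid(fid))


def stat_by_fid(mount_point, fid):
    """Stat an object by FID; return None if it cannot be found.

    A missing object is expected for deleted files, so it is not an error.
    """
    try:
        return os.stat(fid_path(mount_point, fid))
    except FileNotFoundError:
        return None


def fid2path_command(mount_point, fid):
    """Return the argv for resolving a FID to a path with the lfs tool."""
    return ["lfs", "fid2path", mount_point, normalize_fid(fid)]


if __name__ == "__main__":
    # Hypothetical mount point and FID, for illustration only.
    print(fid_path("/mnt/lustre", "0x200000400:0x1:0x0"))
    print(" ".join(fid2path_command("/mnt/lustre", "[0x200000400:0x1:0x0]")))
```

A consumer built along these lines would skip changelog records whose FID no longer resolves rather than retrying, since a deleted target is a normal condition.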

Comment by Peter Jones [ 23/Mar/19 ]

Believed to be a duplicate of LU-8821
