[LU-11970] Using changelog reader causes fid2path process to lockup in kernel space Created: 14/Feb/19 Updated: 23/Mar/19 Resolved: 23/Mar/19 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.8.0 |
| Fix Version/s: | None |
| Type: | Question/Request | Priority: | Minor |
| Reporter: | Joe Mervini | Assignee: | Mikhail Pershin |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Dell servers running TOSS (RHEL 7.5) IB connected to DDN SFA hardware |
||
| Issue Links: |
|
| Epic/Theme: | Lustre-2.8.0 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
We are evaluating Starfish for file system usage detail and have successfully scanned numerous Lustre and NFS file systems. However, when we start ingesting data via changelogs, we run into a condition where it hangs, and the only resolution is to power cycle the client we are testing with. In trying to identify the problem, we found that when we tried to unmount the file system, that process would also hang and could not be aborted. lsof of the file system showed the hung process was stuck on /<file system>/.lustre/fid. Up until this point, we didn't know that the hidden directory existed, nor its purpose. In scanning Jira, we see it is involved in lustre rsync and lfsck operations, but there is not a lot of information regarding other roles it plays. One thing is certain: Starfish uses FIDs in their monitoring tools, and we can see that .lustre/fid is being identified by the Starfish process. We're hoping we can get some additional information on what's going on with changelogs/.lustre. |
| Comments |
| Comment by Peter Jones [ 14/Feb/19 ] |
|
Mike Can you advise please? Thanks Peter |
| Comment by Peter Jones [ 14/Feb/19 ] |
|
Joe Is this the Astra system? Peter |
| Comment by Joe Mervini [ 14/Feb/19 ] |
|
Peter No - this is on our regular production clusters. We're seeing this behavior on all three of the file systems. One thing we're curious about is why Lustre is spitting out FIDs to a directory that is essentially unreadable. One clue is that we had the system locked up on a FID that was not in the .lustre directory (although the .lustre directory was still being held by a process identified with lsof), but once the system was rebooted, using fid2path on that FID produced a "no such file or directory" message. When a file is deleted, does it have any interaction with the .lustre directory? If so, could it be a race condition? Another question: is there any client-side read operation that would cause a changelog change? |
| Comment by Joe Mervini [ 19/Feb/19 ] |
|
Peter, Has there been any activity on this ticket? Regards, Joe |
| Comment by Mikhail Pershin [ 20/Feb/19 ] |
|
Joe, I am looking into this. Since this is about Lustre 2.8.0, I am checking tickets that may be related to this problem; it has probably been addressed already. |
| Comment by Mikhail Pershin [ 22/Feb/19 ] |
|
This can be resolved by As for the other questions above: the /.lustre/fid directory provides access to a file by its FID, and it is often used to get the paths to that file from its LinkEA attribute, which is exactly what fid2path does. If you know the FID of an object, you may access it and modify it, but you cannot delete it. |
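
To illustrate the FID-based access described in the comment above, here is a minimal sketch in Python. It only builds the strings involved and does not touch a file system. The helper names (`parse_fid`, `fid_access_path`, `fid2path_argv`) and the mount point `/mnt/lustre` are hypothetical, chosen for illustration; the FID text form `[sequence:object id:version]` and the `lfs fid2path` command are the only parts taken from Lustre itself, and whether the bracketed form is required under `.lustre/fid` is an assumption here.

```python
import re
from pathlib import PurePosixPath

# Assumed FID text form: [sequence:object id:version], each field in hex.
FID_RE = re.compile(
    r"^\[?(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+):(0x[0-9a-fA-F]+)\]?$"
)


def parse_fid(text):
    """Parse a FID string into a (sequence, object id, version) tuple."""
    match = FID_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a FID: {text!r}")
    return tuple(int(part, 16) for part in match.groups())


def format_fid(fid):
    """Normalize a FID string to the bracketed lowercase-hex form."""
    seq, oid, ver = parse_fid(fid)
    return f"[{seq:#x}:{oid:#x}:{ver:#x}]"


def fid_access_path(mount_point, fid):
    """Path under the hidden .lustre/fid directory for opening by FID."""
    return str(PurePosixPath(mount_point) / ".lustre" / "fid" / format_fid(fid))


def fid2path_argv(mount_point, fid):
    """Argument vector for resolving a FID to its path(s) with lfs."""
    return ["lfs", "fid2path", mount_point, format_fid(fid)]


if __name__ == "__main__":
    # Hypothetical mount point and FID, for illustration only.
    print(fid_access_path("/mnt/lustre", "0x200000400:0x1:0x0"))
    print(" ".join(fid2path_argv("/mnt/lustre", "[0x200000400:0x1:0x0]")))
```

A changelog consumer would typically pass the argument vector to a subprocess call. Note that if the FID belongs to a file that has since been deleted, the lookup is expected to fail with "no such file or directory", as reported earlier in this ticket.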
| Comment by Peter Jones [ 23/Mar/19 ] |
|
Believed to be a duplicate of |