[LU-13758] corrupt directory entry: FID is invalid Created: 07/Jul/20  Updated: 02/Dec/21  Resolved: 02/Dec/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Olaf Faaland Assignee: Andreas Dilger
Resolution: Incomplete Votes: 0
Labels: llnl
Environment:

server
toss 3.6-3 (RH 7.8)
lustre-2.10.8_10.chaos-1.ch6.x86_64
zfs-0.7.11-9.4llnl.ch6.x86_64
3.10.0-1127.8.2.1chaos.ch6.x86_64


Issue Links:
Related
is related to LU-13823 Two hard links to the same directory Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   
[root@surface86:worker_73_2_file_system]# pwd
/p/lustre3/vaccines/runs_nCoV_lustre3/run_nCoV_80R_20200315_0045/worker_73_2_file_system
[root@surface86:worker_73_2_file_system]# ls -l
ls: cannot access workspace: No such file or directory
total 0
d????????? ? ? ? ?            ? workspace
[root@surface86:worker_73_2_file_system]# lfs getstripe workspace 
error opening workspace: Input/output error (5)
llapi_semantic_traverse: Failed to open 'workspace': Input/output error (5)
error: getstripe failed for workspace.

Server log shows FID ... is invalid

Jul  6 16:47:48 porter81 kernel: LustreError: 31368:0:(mdt_open.c:1515:mdt_reint_open()) lustre3-MDT0000: name 'workspace' present, but FID [0x200014cf7:0x177c6:0x0] is invalid


 Comments   
Comment by Olaf Faaland [ 07/Jul/20 ]

We're working on figuring out when this occurred so we know what logs are relevant. Also running lfsck --dry-run to see what it reports.
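
(For reference, a dry-run namespace lfsck on the MDT looks roughly like the following; option names are from memory for 2.10, so check "lctl lfsck_start --help" on the server. Run as root on the MDS.)

# start a namespace lfsck in dry-run mode on MDT0000 (no repairs are made)
lctl lfsck_start -M lustre3-MDT0000 -t namespace --dryrun on

# watch progress and results
lctl get_param -n mdd.lustre3-MDT0000.lfsck_namespace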

Comment by Olaf Faaland [ 07/Jul/20 ]

This file system has 2 MDTs.

All the directory components are on MDT 0 (not striped, and not remote)

 /p/lustre3/vaccines
lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
/p/lustre3/vaccines/runs_nCoV_lustre3
lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
/p/lustre3/vaccines/runs_nCoV_lustre3/run_nCoV_80R_20200315_0045
lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
/p/lustre3/vaccines/runs_nCoV_lustre3/run_nCoV_80R_20200315_0045/worker_73_2_file_system
lmv_stripe_count: 0 lmv_stripe_offset: 0 lmv_hash_type: none
Comment by Andreas Dilger [ 07/Jul/20 ]

Does "lfs fid2path" work on this "invalid FID"? The FID is itself valid looking, and the error message is misleading, in that it should be "FID not found" rather than "invalid".

That said, does the FID of that directory match the problematic directory?

Comment by Olaf Faaland [ 07/Jul/20 ]

No joy.

[root@oslic12:worker_73_2_file_system]# lfs fid2path /p/lustre3/ 0x200014cf8:0x177c6:0x0
fid2path: error on FID 0x200014cf8:0x177c6:0x0: No such file or directory
Comment by Olaf Faaland [ 07/Jul/20 ]

More FIDs

[root@oslic12:run_nCoV_80R_20200315_0045]# lfs path2fid /p/lustre3/vaccines
[0x2000138f7:0x13a51:0x0]
[root@oslic12:run_nCoV_80R_20200315_0045]# lfs path2fid /p/lustre3/vaccines/runs_nCoV_lustre3
[0x200013c86:0x13:0x0]
[root@oslic12:run_nCoV_80R_20200315_0045]# lfs path2fid /p/lustre3/vaccines/runs_nCoV_lustre3/run_nCoV_80R_20200315_0045
[0x200014cf2:0x1a690:0x0]
[root@oslic12:worker_73_2_file_system]# lfs path2fid /p/lustre3/vaccines/runs_nCoV_lustre3/run_nCoV_80R_20200315_0045/worker_73_2_file_system
[0x200014cf7:0x177be:0x0]
Comment by Andreas Dilger [ 07/Jul/20 ]

Based on the parent directory FID [0x200014cf7:0x177be:0x0], it seems probable that this is the bad FID, since each client is initially given two consecutive FID SEQuence numbers (the first part), and the OID (the middle part) wraps at 0x20000. As a result, the creation time of the parent directory is likely somewhat before that of the broken directory; the directory was created about 128k files earlier, and that gap is highly application dependent (it may be seconds or hours), but it should get you into the right ballpark. You could try (as root) "stat /p/lustre3/.lustre/fid/0x200014cf8:0x177c5:0x0" or other nearby FIDs to see if any of those still exist and have a timestamp.
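
(A minimal sketch of that probe, run as root on a client; the SEQ and OID range below are placeholders to adjust around the suspect FID, and deleted objects simply return ENOENT.)

# probe a window of OIDs around the suspect FID via the .lustre/fid namespace
SEQ=0x200014cf8
for oid in $(seq $((0x177c0)) $((0x177d0))); do
    fid="${SEQ}:$(printf '0x%x' "$oid"):0x0"
    stat -c '%n %y' "/p/lustre3/.lustre/fid/${fid}" 2>/dev/null
done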

Side note: in 2.14 we are exporting the actual creation time from ZFS and ldiskfs to clients via statx(), which would make this debugging easier. It is on the MDT today, just not exposed, so you could also try zdb on nearby dnodes to get the actual creation time of the parent if ctime is not useful.

The FID SEQ is for MDT0000, so this couldn't be a DNE remote directory issue, AFAIK, unless there was a failed "lfs migrate -m" or something involved?

You could look into /p/lustre3/.lustre/lost+found/MDT0000/ to see if there are any orphaned files/directories found? It seems unlikely at this point, because the FID is not mapped to anything.

Comment by Olaf Faaland [ 07/Jul/20 ]

Thanks Andreas, I'll look at those avenues. I've seen nothing so far that makes me think that DNE plays into this. We don't allow ordinary users to create remote/striped directories. The sysadmins typically create remote dirs only under specific circumstances that wouldn't apply here, and the sysadmins have been involved in triaging this so would have mentioned it if they'd done anything special for this user.

Comment by Olaf Faaland [ 07/Jul/20 ]

There are no other valid FIDs with the same SEQ portion; that is, I performed

stat /p/lustre3/.lustre/fid/0x200014cf8:${XX}:0x0

where XX ranged from 0x1 to 0x197c6, and I got a struct stat back only for the parent directory.  It looks like everything else has already been deleted.

I learned a bit more from the user - this was detected when he was deleting an old dataset, by running
rm -fr /p/lustre3/vaccines/runs_nCoV_lustre3/run_nCoV_80R_20200315_0045

So I think that explains why there aren't any nearby FIDs to get timestamps from.

Lost+Found is empty:

# ls /p/lustre3/.lustre/lost+found/*   
/p/lustre3/.lustre/lost+found/MDT0000:

/p/lustre3/.lustre/lost+found/MDT0001:
Comment by Andreas Dilger [ 08/Jul/20 ]

Since the user was trying to delete the entry anyway, it should be possible to use "lfs rmentry /p/lustre3/vaccines/runs_nCoV_lustre3/run_nCoV_80R_20200315_0045/workspace" to remove it.
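
(Roughly, on a client as root; purely illustrative of the suggested cleanup.)

# remove just the dangling name; this does not touch any backing object
lfs rmentry /p/lustre3/vaccines/runs_nCoV_lustre3/run_nCoV_80R_20200315_0045/workspace

# confirm the "d?????????" entry is gone
ls -l /p/lustre3/vaccines/runs_nCoV_lustre3/run_nCoV_80R_20200315_0045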

It would be interesting to know if there was some sort of filesystem event that caused this directory entry to survive after the inode was deleted. This used to happen occasionally in the past with regular files: the MDS was deleting the file, the OST object(s) were deleted, but the MDS crashed before the MDT part of the change was committed. When the MDS rebooted, the partial transaction with the name unlink was discarded, but the OSS (which didn't crash) had committed the removal of the objects, resulting in a "???" file when accessed.

However, this is a directory, which is (AFAICS) stored in the local MDT filesystem, so it is hard to see how it could become inconsistent, since the name unlink and dnode removal should be in the same local transaction. This could be understandable with DNE, since the name and dnode are on different file systems, but I'm at a loss to explain this.

Comment by Olaf Faaland [ 08/Jul/20 ]

During the window of time when the user was deleting this subtree and noticed the error (on Monday 7/6), there was nothing along the lines of a loss of power or kernel panic. The console log just has messages like these:

1) MGS messages about connections and evictions
2) Lustre: lustre3-MDT0000: Client 07ec80b3-a5bf-d5e7-f5ae-ba7c239bcdd5 (at 192.168.128.60@o2ib35) reconnecting 
3) Lustre: lustre3-MDT0000: Connection restored to  (at 172.16.70.105@tcp) 
4) Lustre: lustre3-MDT0000: Connection restored to 07ec80b3-a5bf-d5e7-f5ae-ba7c239bcdd5 (at 192.168.128.60@o2ib35)
5) Lustre: lustre3-MDT0000: haven't heard from client 241cac03-88f2-b14c-cea9-6d065a5981b3 (at 192.168.128.89@o2ib36) in 227 s
6) LustreError: 81968:0:(ldlm_lib.c:3239:target_bulk_io()) @@@ timeout on bulk READ after 100+0s  req@ffff959f9c560c50 x166869

I'm in the process now of combing through log messages since May 10, in case the problem occurred earlier and wasn't noticed. May 10 is when he last created a tarball of the directory successfully.

Comment by Olaf Faaland [ 08/Jul/20 ]

Looking back in logs, I found mdt_reint_open() has reported this issue for other dirents, on other file systems. Three of the instances are on our test file system, and in all three cases the dirent's name indicates it was created by simul, a test application you may remember. I'm looking into reproducing the issue there. I'll let you know what I find out.

Comment by Olaf Faaland [ 14/Jul/20 ]

I haven't reproduced successfully so far. Still working on it.

Comment by Olaf Faaland [ 22/Jul/20 ]

I still haven't reproduced it, but I may have found another clue. Here are the log entries from one instance.  Note two dirents appear to have the same FID:

[faaland1@jeti ~]$zgrep simul_link $(cat /tmp/list.1)
2020-04-20 17:15:31 [14765.954378] LustreError: 17368:0:(mdt_open.c:1515:mdt_reint_open()) lquake-MDT0008: name 'simul_link.1072' present, but FID [0x2c0000414:0x76e9:0x0] is invalid
2020-04-20 17:15:32 [14766.480021] LustreError: 14114:0:(mdt_open.c:1515:mdt_reint_open()) lquake-MDT0008: name 'simul_link.1357' present, but FID [0x2c0000414:0x76e9:0x0] is invalid
2020-04-20 17:15:33 [14767.488724] LustreError: 20028:0:(mdt_open.c:1515:mdt_reint_open()) lquake-MDT0008: name 'simul_link.484' present, but FID [0x2c0000be0:0x8529:0x0] is invalid

So perhaps the problem is that during creat or mkdir, two or more files/dirs are assigned the same FID. Then when one of them is deleted, the object the FID refers to is deleted, and the FID is removed from the object index, and so the other dirent now has an invalid FID.

Might that explain this?
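
(For what it's worth, one illustrative way to check whether two live entries in a directory share a FID; a dangling entry simply fails path2fid and is skipped. The directory path is a placeholder.)

dir=/path/to/suspect/dir    # placeholder for the directory to check
cd "$dir" || exit 1
# print any FID that appears under more than one name; feed hits back to "lfs fid2path"
for name in *; do lfs path2fid "$name" 2>/dev/null; done | sort | uniq -d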

Comment by Andreas Dilger [ 22/Jul/20 ]

Hmm, if this is the hard-link phase of simul, then the FID should be the same for all the filenames linking to the same target? Since the timestamps are within half a second of each other, it is likely they are from the same test run. However, since this is resulting in invalid entries in the directory, there is definitely something going wrong there (e.g. the dirent is not being removed properly and is left referencing a dangling FID, or the link count on the inode is not correct and the inode is deleted before all dirents have been removed, etc.).

The FID numbers are generated on the client (the FID SEQuence number, e.g. 0x2c0000414, is assigned by the MDT to the client at mount; the client increments the OID, e.g. 0x76e9, within the SEQ for each new create until the SEQ is consumed after 128k creates, or until the client unmounts), and are sent to the MDS with each file/directory create RPC (the MDS will ignore the client FID if the filename already exists).
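
(Just to make the arithmetic concrete, a throwaway helper that splits a FID string into SEQ/OID/VER and shows how far the OID is from the 0x20000 (128k) wrap point; the function name is made up for illustration.)

# fid_room: split "[seq:oid:ver]" and report creates left before the SEQ is exhausted
fid_room() {
    local fid=${1//[\[\]]/}      # strip optional brackets
    local seq=${fid%%:*}
    local rest=${fid#*:}
    local oid=${rest%%:*}
    local ver=${rest##*:}
    printf 'SEQ=%s OID=%s VER=%s remaining=%d\n' \
        "$seq" "$oid" "$ver" $(( 0x20000 - oid ))
}

fid_room '[0x2c0000414:0x76e9:0x0]'
# SEQ=0x2c0000414 OID=0x76e9 VER=0x0 remaining=100631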

Conceivably, there could somehow be a race on the client using the same FID for two different files. However, I don't think that is a likely cause, since the client is not even assigning new FIDs when creating a hard link. In this case, the client does the lookup for the target name first, gets the FID of the target file, and then sends the MDS RPC to link the new name to the target FID. It seems hard to get that part wrong, unless it relates to updating the link count during create? That should be done atomically under a lock, as well as within a transaction on disk. The other alternative source of inconsistency is on the unlink side, where the name entry might not be removed from the directory properly during a delete, the inode link count decremented incorrectly, etc. If there was a crash in the middle of that simul run it would open a bigger window for things to go wrong.

This inconsistency might be easier to explain for a DNE-striped directory (e.g. name and target are on different MDTs), but you've said that DNE was not involved in that directory. The fact that it is on MDT0008 should be irrelevant, since the local directory operations are exactly the same as if the directory was on MDT0000 (the selection of the MDT is done from the parent directory before the link RPC is even sent).

Comment by Olaf Faaland [ 22/Jul/20 ]

> if this is the hard-link phase of simul, then the FID should be the same for all the filenames linking to the same target?
You're right.

I found there wasn't a simul job being run at the time of the error, so the error message must have been triggered by cleanup for a previous simul job which was canceled, crashed, etc. Looking to see if I can find any more information about that particular instance.

Comment by Andreas Dilger [ 22/Jul/20 ]

The other possibility (not easy to prove either way) is that two unlink("simul_link.1072") calls are running at the same time on different clients (e.g. "readdir(); delete all entries"), or maybe "node1: open("simul_link.1072"); node2: unlink("simul_link.1072")"; in that case there may be spurious error messages printed in the logs when the file is missing. I thought we caught all of those kinds of messages over time, but I'm not sure.

Comment by Olaf Faaland [ 22/Jul/20 ]

I understand what you mean about spurious error messages due to a delete racing with another operation (maybe also a delete).

But in several cases, I have actual directories with bad dirents, and not even all on the same file system.

[root@surface86:worker_73_2_file_system]# ls -l
ls: cannot access workspace: No such file or directory
total 0
d????????? ? ? ? ?            ? workspace

and

[root@oslic9:yang30]# ls -l */dock
scratch002/dock:
ls: cannot access scratch002/dock/s2_hras1371: No such file or directory
total 0
d????????? ? ? ? ?            ? s2_hras1371

scratch010/dock:
ls: cannot access scratch010/dock/s2_hras1371: No such file or directory
total 0
d????????? ? ? ? ?            ? s2_hras1371

scratch025/dock:
ls: cannot access scratch025/dock/s2_hras1371: No such file or directory
total 0
d????????? ? ? ? ?            ? s2_hras1371

scratch026/dock:
ls: cannot access scratch026/dock/s1_hras1371: No such file or directory
total 0
d????????? ? ? ? ?            ? s1_hras1371

scratch028/dock:
ls: cannot access scratch028/dock/s2_hras1371: No such file or directory
total 0
d????????? ? ? ? ?            ? s2_hras1371

scratch033/dock:
ls: cannot access scratch033/dock/s1_hras1371: No such file or directory
total 0
d????????? ? ? ? ?            ? s1_hras1371
Comment by Olaf Faaland [ 22/Jul/20 ]

I tried several operations on one of these entries, just to make sure I'm not mixing up two different issues. rm, cp, and ln don't produce an error message on the MDS console. "touch" does result in the "FID ... is invalid" error message there.

Comment by Andreas Dilger [ 23/Jul/20 ]

> I understand what you mean about spurious error messages due to a delete racing with another operation (maybe also a delete).
>
> But in several cases, I have actual directories with bad dirents, and not even all on the same file system.

What I meant is that there could be a race where the user/script is trying to delete the file, but for some reason the dirent is left in the directory pointing to an old FID. So I think it isn't so much a problem of an inode mysteriously "disappearing" from underneath the directory entry as of "an old directory entry left behind while deleting files" (for whatever reason). While both of these are definitely consistency issues, the former represents unexpected loss of data, while the latter is more of an inconvenience resulting in a directory that cannot be deleted.

In all of the above cases, it appears that the directory is otherwise empty ("total 0" implies no other entries), and in the original report about "workspace" you also mentioned that the user had previously tried to delete the files in this directory when they hit the problem.

It is quite interesting (almost too much to be a coincidence?) that all of the filenames in the above example are "s1_hras1371" or "s2_hras1371" even though they are each in a different directory tree, which leads me to wonder if this is either something to do with the name itself (e.g. somehow it is hashed/processed incorrectly and then doesn't get removed from the directory), or something in particular in the way the application is creating or deleting this name?

Is the ctime of the "dock" directories the same in all cases (e.g. maybe there was a problem at time X that caused the unlink to fail) or are they spread over a longer time than would be expected by "rm -r scratch*" (e.g. hours/days)? Unfortunately, given that these entries are pointing at a missing inode, there isn't much on the MDS that could help debug this.
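
(A hypothetical one-liner to collect those ctimes with GNU stat; the file of paths is a placeholder.)

# /tmp/bad_dirent_parents is a placeholder list of parent directories, one per line
while read -r d; do
    stat -c "File: '%n'  Change: %z" "$d"
done < /tmp/bad_dirent_parents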

Comment by Olaf Faaland [ 23/Jul/20 ]

Ah, OK.

Most of the instances of this issue were found while someone was deleting a directory tree with drm, one of the utilities from mpifileutils.  So far I've been unable to reproduce the issue by filling up trees with fake data and then running drm on them.  That made me question whether the problem was happening during the delete, or had happened earlier and was discovered by the delete.  I still haven't been able to prove that either way.
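
(For the record, the reproduction attempts look roughly like the following; the tree size, node/task counts, and paths are made up, and drm is launched in the usual way for an MPI program from mpifileutils.)

# build a throwaway tree of small directories/files (sizes are arbitrary)
top=/p/lustre1/$USER/repro_tree
for i in $(seq 1 200); do
    mkdir -p "$top/scratch$i/dock/s2_test$i"
    touch "$top/scratch$i/dock/s2_test$i/file"
done

# delete it in parallel with drm, then check the MDS console for
# "name ... present, but FID ... is invalid" messages
srun -N 4 -n 64 drm "$top"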

Below are listed parents with bad dirents.  ".../4k3g_A_4782_minimized/gbsa" has two bad dirents.  The ctimes of the parent directories are:

[root@oslic8:faaland1]# cat /tmp/t11.change
File:  '/p/lustre1/faaland1/CORRUPT-TOSS-4441/instance1/dank/3jxt_A_7784_minimized'       Change:  2019-02-20  14:08:44.000000000  -0800
File:  '/p/lustre1/faaland1/CORRUPT-TOSS-4441/instance1/dank/4k3g_A_4782_minimized/gbsa'  Change:  2019-01-14  17:33:27.000000000  -0800
File:  '/p/lustre1/faaland1/CORRUPT-TOSS-4441/instance1/dank/4k3g_A_4782_minimized/gbsa'  Change:  2019-01-14  17:33:27.000000000  -0800
File:  '/p/lustre1/faaland1/CORRUPT-TOSS-4441/instance1/yang30/a'                         Change:  2019-02-20  14:05:56.000000000  -0800

File:  '/p/lustre1/faaland1/CORRUPT-TOSS-4441/instance2/dank/3byi_A_5862_minimized'       Change:  2020-03-18  17:42:47.000000000  -0700
File:  '/p/lustre1/faaland1/CORRUPT-TOSS-4441/instance2/yang30/dock1'                     Change:  2020-03-18  17:44:22.000000000  -0700

File:  '/p/lustre1/faaland1/CORRUPT-TOSS-4441/instance2/yang30/scratch002/dock'           Change:  2019-11-23  16:07:42.000000000  -0800
File:  '/p/lustre1/faaland1/CORRUPT-TOSS-4441/instance2/yang30/scratch010/dock'           Change:  2019-11-23  16:25:11.000000000  -0800
File:  '/p/lustre1/faaland1/CORRUPT-TOSS-4441/instance2/yang30/scratch025/dock'           Change:  2019-11-23  16:50:00.000000000  -0800
File:  '/p/lustre1/faaland1/CORRUPT-TOSS-4441/instance2/yang30/scratch026/dock'           Change:  2019-11-23  16:52:43.000000000  -0800
File:  '/p/lustre1/faaland1/CORRUPT-TOSS-4441/instance2/yang30/scratch028/dock'           Change:  2019-11-23  16:54:40.000000000  -0800
File:  '/p/lustre1/faaland1/CORRUPT-TOSS-4441/instance2/yang30/scratch033/dock'           Change:  2019-11-23  16:57:53.000000000  -0800

File:  '/p/lustre2/faaland1/CORRUPT-TOSS-4441/yang30/incident1/4dst'                      Change:  2020-06-08  11:36:25.000000000  -0700
File:  '/p/lustre2/faaland1/CORRUPT-TOSS-4441/yang30/incident1/doc017'                    Change:  2020-06-08  11:36:11.000000000  -0700
File:  '/p/lustre2/faaland1/CORRUPT-TOSS-4441/yang30/incident1/dock'                      Change:  2020-06-08  11:35:50.000000000  -0700
File:  '/p/lustre2/faaland1/CORRUPT-TOSS-4441/yang30/scratch/com/lv02'                    Change:  2019-03-01  16:39:25.000000000  -0800

Comment by Olaf Faaland [ 28/Jan/21 ]

I wonder if this is a duplicate of LU-14379 / LU-13437

Comment by Peter Jones [ 28/Jan/21 ]

laisiyao what do you think?

Comment by Lai Siyao [ 02/Feb/21 ]

The issue LU-13437 fixed is that rename may not take the LOOKUP lock on the correct MDT for a remote object, which can only happen on a DNE system.

Comment by Olaf Faaland [ 05/Feb/21 ]

The instances observed before last week involved cases where there were almost certainly no remote objects - that is, the parent directory and subdirectory should have been on the same MDT based on our general practice and usage patterns, and no "lfs migrate" was involved. 

So I think the original issue is probably different than LU-13437, and I won't close the issue for now.

However, since we seem to have no new information to point to an area of code, removing label "topllnl" from this issue.  We'll ignore this until we find new evidence.
