[LU-5475] readdir missing a directory Created: 11/Aug/14  Updated: 11/May/15  Resolved: 19/Feb/15

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.2
Fix Version/s: Lustre 2.7.0, Lustre 2.5.4

Type: Bug
Priority: Major
Reporter: Christopher Morrone
Assignee: Oleg Drokin
Resolution: Duplicate
Votes: 0
Labels: llnl

Issue Links:
Related
is related to LU-5924 conf-sanity test_32b: list verificati... Resolved
is related to LU-3573 lustre-rsync-test test_8: @@@@@@ FAIL... Resolved
is related to LU-5254 readdir missing a directory Resolved
Severity: 3
Rank (Obsolete): 15252

 Description   

A directory tree was copied into a subdirectory of Lustre. At least one of the subdirectories in the newly created tree in Lustre does not appear in its parent's directory listing. The directory does exist, and it is possible to cd into that unlisted directory.

The same behavior is exhibited from multiple clients, so it is not a problem of just one client's cache being corrupt.
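
For anyone trying to confirm the same symptom, the check can be scripted. The following is a minimal sketch, not something taken from the affected system: the helper name `find_unlisted` and its `lister` parameter are hypothetical, and it assumes the expected entry names are known (for example, from the source tree that was copied). It reports names that resolve by direct lookup but are absent from the parent's listing.

```python
import os


def find_unlisted(parent, expected_names, lister=os.listdir):
    """Return expected names that exist under `parent` but are not listed.

    `lister` is injectable only so the check can be exercised without a
    misbehaving filesystem; by default it is os.listdir (readdir-based).
    """
    listed = set(lister(parent))
    unlisted = []
    for name in expected_names:
        path = os.path.join(parent, name)
        # lexists() looks the entry up by name, so it can succeed even
        # when the entry is absent from the directory listing.
        if name not in listed and os.path.lexists(path):
            unlisted.append(name)
    return unlisted
```

Running it as `find_unlisted("/mnt/lustre/dest/parent", os.listdir("/source/parent"))` on several clients (both paths are placeholders) would show whether every client misses the same entries.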

We are seeing this with the LLNL branch of Lustre 2.4.2 (github.com/chaos/lustre).

We could not identify any console messages associated with the problem.

The problem was seen on the secure network, so we cannot directly provide any logs.

The symptoms of this problem are suspiciously similar to those of LU-5254.



 Comments   
Comment by Oleg Drokin [ 12/Aug/14 ]

Hm, not a lot of data in here, unfortunately.
Do you have any idea if the clients that don't see the directory had all accessed the parent dir before the problem ensued (meaning they all might have a stale cache problem), or if a client not previously exposed to this directory also does not see it (a server-side problem of some kind)?
I think in LU-5254 new clients saw the directory, so there it looked like a client-side cache problem to me.

It is odd that ZFS is used on the server side in both cases. Lustre itself does not really cache anything server-side, so if there were some odd interaction within ZFS (or between Lustre and ZFS, of course) that hid a directory from readdir output, this is exactly what would be seen.

Is there anything else known? E.g. was there a lot of stuff in the parent dir (possibly an issue of skipping an entry between pages or something)? If you create the same parent dir with the same names inside in a different lustre place - does the problem reappear by any chance?
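
The re-creation experiment suggested above could be sketched as follows. This is only an illustration under stated assumptions: the helper name `recreate_and_check` is hypothetical, and it re-creates only the entry names as empty directories in a fresh location, then reports any that the new parent's listing omits.

```python
import os


def recreate_and_check(names, new_parent):
    """Create `names` as subdirectories of `new_parent`, then return
    the sorted list of names missing from the directory listing."""
    os.makedirs(new_parent, exist_ok=True)
    for name in names:
        os.makedirs(os.path.join(new_parent, name), exist_ok=True)
    # Compare what readdir reports against what was just created.
    listed = set(os.listdir(new_parent))
    return sorted(n for n in names if n not in listed)
```

An empty result means the listing is complete for that attempt; it does not rule out a problem that depends on timing or on-disk layout.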

Comment by Christopher Morrone [ 19/Aug/14 ]

Do you have any idea if the clients that don't see the directory had all accessed the parent dir before the problem ensued (meaning they all might have a stale cache problem), or if a client not previously exposed to this directory also does not see it (a server-side problem of some kind)?

It is not a stale cache problem. Nodes that have never seen the directory before (or have had their cache cleared) do not see the directory.

Is there anything else known?

Not really.

If you create the same parent dir with the same names inside in a different lustre place - does the problem reappear by any chance?

No, it is not that easily reproduced.

Comment by Nathaniel Clark [ 06/Oct/14 ]

I have reproduced a very similar issue where readdir is missing a file. I can reproduce this on a ZFS-backed MDT with high regularity.

Comment by Peter Jones [ 20/Oct/14 ]

Just to be clear - updates are on LU-3573

Comment by Peter Jones [ 10/Nov/14 ]

Chris

The patch for LU-3573 has landed to master. Do you have any known cases of affected files to verify the fix?

Peter

Comment by Peter Jones [ 24/Nov/14 ]

Heads up that LU-5924 seems related to the LU-3573 fix

Comment by Peter Jones [ 02/Feb/15 ]

Just to capture that the latest version of the LU-3573 patch has been landed on master for over a month and has also been backported to b2_5. I think that this should be safe to try out.

Comment by Nathaniel Clark [ 19/Feb/15 ]

This was fixed by LU-3573
