Details
Bug
Resolution: Fixed
Major
Description
This bug is opened mainly for information, to make the community aware, just in case ...
So, here is the story: running our (Bull) Lustre 2.1.2 version, a customer started to report file corruptions where a Client can successfully create, write and re-read files until another Client tries to access the same file. At that point the file content becomes corrupted for all Clients (missing blocks of data, zero file size, ...).
On the other hand, the corruptions have been identified to occur only on OSTs/OSCs where the file-creator Client had no grant left ("/proc/fs/lustre/osc/<OST-import>/cur_{dirty|grant|lost_grant}_bytes" all read zero). They also seem never to recover automatically, but only when we run a small program doing O_DIRECT writes to these OSTs ...
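For anyone who wants to check whether a Client is in this state, here is a minimal sketch that lists the OSC imports whose three counters all read zero. It assumes each counter file holds a single integer; the function names and the `root` parameter are only illustrative and not part of any Lustre tool.

```python
import glob
import os

# Counter files named in the report, under /proc/fs/lustre/osc/<OST-import>/
GRANT_FILES = ("cur_dirty_bytes", "cur_grant_bytes", "cur_lost_grant_bytes")


def read_counter(path):
    """Return the integer stored in a counter file, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None


def zero_grant_oscs(root="/proc/fs/lustre/osc"):
    """Return the names of OSC imports whose three counters are all zero."""
    starved = []
    for osc_dir in sorted(glob.glob(os.path.join(root, "*"))):
        if not os.path.isdir(osc_dir):
            continue
        values = [read_counter(os.path.join(osc_dir, name)) for name in GRANT_FILES]
        # Unreadable counters (None) never count as zero.
        if all(v == 0 for v in values):
            starved.append(os.path.basename(osc_dir))
    return starved


if __name__ == "__main__":
    for name in zero_grant_oscs():
        print(name)
```

Run on a Client, it prints one OSC import name per line; an empty output means no import was found with all three counters at zero.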
Finally, doing a full Lustre trace of a program/command on a Client writing to these "zero-grant" OSTs, we found that -EDQUOT was returned in the standard cached-write path, and that the direct-IO path was then attempted but ended with -EALREADY, which was finally replaced with a successful 0 return value, with no page written/flushed to the Server at all and no new grants recovered.
Having a look at the source code concerned, this can only occur if the written page(s) was not set Dirty ...
Finally we found that this behavior/bug (missing "set_page_dirty()" in vvp_io_commit_write() in case of an -EDQUOT return from cl_page_add_cache() and the implicit switch to the direct-IO path/vvp_page_sync_io()) was not in the Lustre v2.1.2 base but was introduced by the patch from LU-1442, which our R&D included because it was rated highly critical ...
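To make the sequence easier to follow, here is a toy model of the control flow as described above. It is not the Lustre source: the class, the function names and the `dirty_before_sync` switch are hypothetical, and the logic is only what can be inferred from the trace (an -EDQUOT from the cache path, a sync write that returns -EALREADY for a page that is not Dirty, and -EALREADY turned into 0).

```python
import errno


class Page:
    """Stand-in for a written page; both flags start cleared."""

    def __init__(self):
        self.dirty = False
        self.written_to_server = False


def add_to_cache(page, has_grant):
    """Cached-write step: refused with -EDQUOT when the Client has no grant."""
    return 0 if has_grant else -errno.EDQUOT


def sync_io(page):
    """Direct write: only a Dirty page is sent, a clean one yields -EALREADY."""
    if not page.dirty:
        return -errno.EALREADY
    page.dirty = False
    page.written_to_server = True
    return 0


def commit_write(page, has_grant, dirty_before_sync):
    """Model of the commit-write step with and without the dirtying call."""
    rc = add_to_cache(page, has_grant)
    if rc == 0:
        page.dirty = True  # page stays in cache for later writeback
        return 0
    if rc == -errno.EDQUOT:
        if dirty_before_sync:  # the step the report identifies as missing
            page.dirty = True
        rc = sync_io(page)
        if rc == -errno.EALREADY:
            rc = 0  # reported as success although nothing was sent
    return rc


if __name__ == "__main__":
    for label, flag in (("without dirtying", False), ("with dirtying", True)):
        page = Page()
        rc = commit_write(page, has_grant=False, dirty_before_sync=flag)
        print(label, "rc =", rc, "written =", page.written_to_server)
```

In this model both variants return 0 to the caller when there is no grant, but only the variant that marks the page Dirty before the sync write actually sends it, which matches the silent data loss we observed.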
This bug has since been fixed by LU-1703, which we (I mean Bull R&D) need to integrate asap.
Also, I think an explicit link/comment should be added in LU-1442 to detail its runtime dependency on the LU-1703 patch.